
Matrix Variate Distribution-Induced Sparse Representation for Robust Image Classification

Jinhui Chen, Jian Yang, Lei Luo, Jianjun Qian, and Wei Xu

Abstract— Sparse representation learning has been successfully applied to image classification, where a given image is represented as a linear combination of the atoms of an over-complete dictionary and the classification result depends on the reconstruction residuals. Normally, the images are stretched into vectors for convenience, and the representation residuals are characterized by the l2-norm or l1-norm, which implicitly assumes that the elements of the residual are independent and identically distributed variables. However, this hypothesis is hard to satisfy for structural errors, such as illuminations, occlusions, and so on. In this paper, we represent the image data in their intrinsic matrix form rather than as concatenated vectors. The representation residual is considered as a matrix variate following the matrix elliptically contoured distribution, which is robust to dependent errors and has long tail regions to fit outliers. We then seek the maximum a posteriori probability estimation solution of the matrix-based optimization problem under sparse regularization. An alternating direction method of multipliers (ADMM) is derived to solve the resulting optimization problem, and the convergence of the ADMM is proven theoretically. Experimental results demonstrate that the proposed method is more effective than state-of-the-art methods when dealing with structural errors.

Index Terms— Alternating direction method of multipliers (ADMM), elliptically contoured distribution, matrix distribution, sparse representation.

I. INTRODUCTION

RECENTLY, sparse representation has attracted broad interest in computer vision and pattern recognition. Since it has been demonstrated that natural images can generally be represented by structural primitives [1], sparse representation aims to reconstruct a natural image using a small set of elements parsimoniously chosen out of an over-complete dictionary. Many new techniques have been developed that take advantage of parsimony to achieve state-of-the-art results in compressed sensing, signal processing, and image classification.

Manuscript received February 12, 2014; revised August 11, 2014; accepted November 27, 2014. Date of publication February 12, 2015; date of current version September 16, 2015. This work was supported in part by the National Science Fund for Distinguished Young Scholars under Grant 61125305, Grant 61472187, Grant 61233011, and Grant 61373063, in part by the Key Project through the Ministry of Education, China, under Grant 313030, in part by the 973 Program under Grant 2014CB349303, in part by the Fundamental Research Funds for the Central Universities under Grant 30920140121005, and in part by the Program for Changjiang Scholars and Innovative Research Team in University under Grant IRT13072. The authors are with the Department of Computer Science and Engineering, Nanjing University of Science and Technology, Nanjing 210094, China (e-mail: [email protected]; [email protected]; [email protected]; [email protected]; [email protected]). Color versions of one or more of the figures in this paper are available online at http://ieeexplore.ieee.org. Digital Object Identifier 10.1109/TNNLS.2014.2377477

After the exciting breakthroughs in the applications of compressed sensing and signal processing, Wright et al. [2] exploited the intrinsic discrimination of sparse representation to perform classification. Instead of using structural elementary atoms, sparse representation classification (SRC) represents a given probe over an over-complete dictionary whose elements are the training data themselves. If the training samples from each class are sufficient, pursuing the sparsest representation automatically discriminates between the various classes. For a given query image, which is normally stretched into a vector as an input signal, sparse representation codes the input image using a small number of the dictionary atoms. The sparsity of the coding coefficient is measured by the l1-norm. In general, the sparse representation classifier can be formulated as

$$\min_x \ \|y - Ax\|_2 + \|x\|_1 \qquad (1)$$

where y is a given signal, A is the dictionary, and x is the coding coefficient of y over the dictionary A. The classification result depends on the reconstruction from the nonzero coefficients in each class: SRC assigns the signal y to the class that yields the minimal reconstruction error. Although SRC has achieved great success in image classification, two main issues deserve closer study. One is whether the l1-norm is an appropriate strategy to characterize sparsity. The other is whether the Frobenius norm is effective enough to describe the structural errors in image representation. In fact, many works aim to find a more accurate strategy to characterize sparsity. For instance, Liu et al. [3] brought a nonnegative constraint into the sparse representation x. Wang et al. [4] introduced a weighted l2-norm to improve the sparsity constraint. Bayesian methods have also been used for designing the sparsity regularization terms [5]. Meanwhile, sparse representation has been widely applied to pattern recognition, such as feature extraction [6], coarse data recognition [7], and large-scale face recognition [8]. Recently, some researchers have been sceptical about whether the sparsity constraint is necessary in face recognition. Zhang et al. [9] argued that the success of SRC is attributed to collaborative representation rather than sparsity. Consequently, they proposed a collaborative representation classifier, which is similar to ridge regression. However, it can be seen from the experiments in [9] that sparsity is important when the dimension of the data is much smaller than the number of training samples. The aforementioned methods employ the Frobenius norm [10], [11] to characterize the reconstruction residual. Considering that the Frobenius norm is vulnerable to large outliers, Lu et al. [12] adopted the l1-norm to make the residual estimation more reliable.



Fig. 1. Illustration of different multivariate distributions. (a) 2-D independent Gaussian distribution $f(x, y) = C_g e^{-(x^2 + y^2)}$. (b) 2-D independent Laplace distribution $f(x, y) = C_l e^{-(|x| + |y|)}$. (c) 2-D dependent Laplace distribution $f(x, y) = C_e e^{-(x^2 + y^2)^{1/2}}$. $C_g$, $C_l$, and $C_e$ are positive proportionality constants.
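To make the tail comparison in Fig. 1 concrete, the short sketch below (illustrative only; the normalizing constants $C_g$, $C_l$, $C_e$ are dropped, and the function names are ours) evaluates the three log-densities along the diagonal direction x = y = t:

```python
import numpy as np

# Unnormalized log-densities of the three 2-D models in Fig. 1
# (proportionality constants C_g, C_l, C_e are omitted).
def log_gauss(x, y):            # (a) independent Gaussian
    return -(x**2 + y**2)

def log_indep_laplace(x, y):    # (b) independent (squarely contoured) Laplace
    return -(np.abs(x) + np.abs(y))

def log_ellip_laplace(x, y):    # (c) dependent (elliptically contoured) Laplace
    return -np.sqrt(x**2 + y**2)

# Compare the decay along the diagonal direction x = y = t.
for t in [1.0, 3.0, 6.0]:
    print(t, log_gauss(t, t), log_indep_laplace(t, t), log_ellip_laplace(t, t))
# The elliptical Laplace exponent decays like -sqrt(2)*|t|, slower than the
# independent Laplace (-2*|t|) and far slower than the Gaussian (-2*t**2),
# which is why it has the heaviest tails and the most robustness to outliers.
```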

From the viewpoint of maximum likelihood estimation, these two norms are based on the hypothesis that the noises are independent and identically distributed (i.i.d.) with a Gaussian or Laplace distribution. Yang et al. [13] modeled sparse coding as a robust regression problem and proposed an adaptive distribution to characterize the noises. However, those traditional methods usually stretch the image into a vector before classification, which ignores the structural information in the original data. Consequently, the noises in the image data are assumed to be i.i.d. This i.i.d. hypothesis may not be accurate enough in real scenarios. In fact, noises such as illuminations, occlusions, and expressions in human faces are not i.i.d. Inspired by the ideas of [14] and [15], it is natural to consider the noise in an image as a matrix variate that follows some matrix distribution. It is critical to find a proper matrix variate distribution to characterize the error matrix. We hope the new distribution preserves the heavy tail that makes the Laplace distribution robust to multiple errors [16]. As each element of the error matrix is assumed to be i.i.d. in traditional methods, the distribution of the errors is simply the product of the distributions of the individual elements, as shown in Fig. 1(a) and (b). Both the independent and the dependent 2-D Laplace distributions have heavier tails than the 2-D Gaussian distribution. As to the Laplace distribution, we prefer Fig. 1(c) to Fig. 1(b): compared with the product of univariate Laplace distributions, which is a squarely contoured distribution, the elliptically contoured distribution is more robust in real scenarios [17]. The distribution shown in Fig. 1(c) is known as the multivariate generalized Laplace distribution [18] or Kotz-type distribution [19], which has fatter tail regions and hence can be useful in providing robustness against outliers. Obviously, the multivariate Laplace distribution belongs to the family of elliptically contoured distributions [17], [19], which contains multivariate versions of the Gaussian, Laplace, and uniform distributions. In this paper, we consider the image representation problem from the viewpoint of Bayesian inference by assuming that the error matrix in the image data is a random matrix following a special case of the elliptically contoured distribution, which can be considered as a matrix version of the Laplace distribution, and that the coding coefficient follows a Laplace distribution. We then seek the maximum a posteriori solution of the sparse coding problem. The objective function turns out to be a nuclear-norm-characterized matrix regression under a sparse penalty. Our approach to solve this optimization

problem is motivated by the alternating direction method of multipliers (ADMM). As the objective function involves the nuclear norm and the l1-norm, and ADMM has been successfully applied to optimize these two convex problems, we adopt the ADMM algorithm to solve the presented optimization problem.

A. Contributions

In summary, the contributions of this paper include the following three aspects.

1) We introduce the matrix variate elliptically contoured distribution to image representation by assuming that the error matrix in the image data is distributed according to a special case of the matrix elliptically contoured distribution. This hypothesis leads to characterizing the error matrix by the nuclear norm, which confirms that the nuclear norm is a better technique for describing dependent errors.

2) A matrix-based framework for image representation is presented, adopting the nuclear norm to describe the reconstruction residual. It is more robust to degenerated, dependent noises and preserves more spatial information. In this setting, high-dimensional data can be represented in their intrinsic form rather than being concatenated into a single vector.

3) We employ ADMM to solve the matrix-based image representation problem. The ADMM algorithm decomposes the original problem into two subproblems, which are updated alternately toward the optimal solution. The convergence of the proposed algorithm is proven theoretically in the Appendix.

B. Paper Organization

The remainder of this paper is organized as follows. We introduce the matrix variate elliptically contoured distribution in Section II, where the matrix-based framework for image representation is also presented. In Section III, we design an ADMM algorithm to solve the proposed optimization problem. The computational complexity and convergence analysis are presented in Section IV. In Section V, we report the experimental results on different applications. Finally, the conclusion is drawn in Section VI.

II. MATRIX-BASED IMAGE REPRESENTATION

A. Matrix Elliptically Contoured Distribution

The classical univariate Laplace distribution is often used for modeling errors with heavier than normal tails. As it has


been shown in Fig. 1, when dealing with some structural errors, it is not robust to assume that each element of the error matrix is i.i.d. with a Laplace distribution. Therefore, we prefer modeling the error matrix using the matrix variate elliptically contoured distribution, a class of probability distributions allowing heavy tails and dependence between elements, which makes it attractive for modeling structural errors.

In probability theory, the characteristic function is an alternative way of describing random variables. For a random vector $x \in \mathbb{R}^n$, the characteristic function is $\varphi_x(p) = E[\exp(i p^T x)]$, where $E[\cdot]$ is the expectation operator. Now, let us introduce the multivariate elliptically contoured distribution.

Definition 2.1 (Multivariate Elliptical Distribution): Let $x \in \mathbb{R}^n$ be a random vector whose distribution is absolutely continuous. Then, we say $x$ follows the multivariate elliptically contoured distribution if and only if the characteristic function of $x$ has the form

$$\varphi_x(p) = \exp(i p^T \mu)\, \psi(p^T \Sigma p) \qquad (2)$$

where $p \in \mathbb{R}^n$, $\mu \in \mathbb{R}^n$, $\psi: \mathbb{R} \to \mathbb{R}$, and $\Sigma \in \mathbb{R}^{n \times n}$ is a symmetric positive matrix.

It has been shown that the elliptical distribution can also be defined for random matrices [17], as a generalization of the multivariate elliptically contoured distribution.

Definition 2.2 (Matrix Elliptical Distribution [17]): Let $X$ be an $m \times n$ random matrix whose distribution is absolutely continuous. Then, we say $X$ follows the elliptically contoured distribution if and only if the characteristic function of $X$ has the form

$$\varphi_X(P) = \mathrm{etr}(i P^T M)\, \psi(P^T \Sigma P) \qquad (3)$$

where $P \in \mathbb{R}^{m \times n}$, $M \in \mathbb{R}^{m \times n}$, $\psi: \mathbb{R}^{n \times n} \to \mathbb{R}$, $\Sigma \in \mathbb{R}^{m \times m}$ is a symmetric positive matrix, and $\mathrm{etr}(\cdot)$ is short for $\exp(\mathrm{tr}(\cdot))$. This distribution is denoted by $X \sim E_{m,n}(M, \Sigma, \psi)$.

In fact, the matrix variate elliptically contoured distributions can also be defined in terms of their probability density functions

$$f(X) = C\, h\big((X - M)^T \Sigma^{-1} (X - M)\big) \qquad (4)$$

where $h: \mathbb{R}^{n \times n} \to \mathbb{R}$, $C$ is a positive proportionality constant, and $h$ and $\psi$ determine each other once $m$ and $n$ are fixed.

The matrix variate elliptically contoured distribution is widely used to describe a random matrix whose elements are not independent [20]. They do not even have to be uncorrelated; a more general covariance structure is allowed under this distribution. In addition, it is a broad class of distributions that generalizes the matrix variate Gaussian, Laplace, and uniform distributions. How to select a proper $h(\cdot)$ to characterize the error matrix is therefore critical to our model. In order to preserve the robustness to outliers of the Laplace distribution, we hope the new distribution also has a long tail. Considering that $h(\cdot) = \exp(-(\cdot)^{1/2})$ is chosen in the multivariate generalized distribution [19], we select $h(\cdot) = \mathrm{etr}(-(\cdot)^{1/2})$ to specify the matrix variate distribution. Then, the density function can be written as

$$f(X) = C\, \mathrm{etr}\big(-((X - M)^T \Sigma^{-1} (X - M))^{1/2}\big) \qquad (5)$$
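Numerically, the exponent in (5) is the nuclear norm of the whitened residual $\Sigma^{-1/2}(X - M)$, i.e., the sum of its singular values. A minimal sketch (our own helper function, with the constant $C$ omitted; it assumes $\Sigma$ is symmetric positive definite):

```python
import numpy as np
from scipy.linalg import sqrtm

def matrix_laplace_log_density(X, M, Sigma):
    """Unnormalized log-density of (5):
    log f(X) = -tr(((X-M)^T Sigma^{-1} (X-M))^{1/2}) + const.

    The trace term equals the nuclear norm of Sigma^{-1/2} (X - M),
    i.e., the sum of its singular values.
    """
    # Whiten the residual by Sigma^{-1/2} (Sigma assumed symmetric positive definite).
    Sigma_inv_sqrt = np.linalg.inv(sqrtm(Sigma)).real
    R = Sigma_inv_sqrt @ (X - M)
    return -np.linalg.svd(R, compute_uv=False).sum()

# Quick check against the direct form tr(B^{1/2}) with B = (X-M)^T Sigma^{-1} (X-M).
rng = np.random.default_rng(0)
X, M = rng.normal(size=(5, 4)), np.zeros((5, 4))
Sigma = np.eye(5)
B = (X - M).T @ np.linalg.inv(Sigma) @ (X - M)
direct = -np.trace(sqrtm(B)).real
assert np.isclose(matrix_laplace_log_density(X, M, Sigma), direct)
```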


TABLE I. Zero-Mean Laplace Distributions

When $X$ is a vector, the distribution degenerates to the multivariate Laplace distribution, and when $X$ is univariate it becomes the ordinary Laplace distribution. So, the distribution (5) can be viewed as a matrix variate Laplace distribution. The resemblance can be seen in Table I.

B. Model of Image Representation

Since there are plenty of dependent noises in image data, such as illumination, occlusions, and expressions, the matrix variate elliptically contoured distribution seems a good option for image representation. As in SRC, we represent an image $Y \in \mathbb{R}^{m \times n}$ using a given dictionary, whose elements are matrices $A = \{A_1, A_2, \ldots, A_k\}$, $A_i \in \mathbb{R}^{m \times n}$:

$$Y = A_1 x_1 + A_2 x_2 + \cdots + A_k x_k + E. \qquad (6)$$

We denote $A(x) = A_1 x_1 + A_2 x_2 + \cdots + A_k x_k$. Without loss of generality, we assume the representation residual $E = Y - A(x)$ follows the distribution $E_{m,n}(O, I_n, \psi)$. According to sparse representation theory, we assume the coding coefficient over the matrix-based dictionary $A$ follows a Laplace distribution. Based on maximum a posteriori probability estimation, we have

$$x = \arg\max_x \ln P(x \mid Y) = \arg\max_x \big(\ln P(Y \mid x) + \ln P(x)\big). \qquad (7)$$

Since $E = Y - A(x) \sim E_{m,n}(O, I_n, \psi)$ and each element $x_i$ of the coding coefficient $x$ is i.i.d. with a Laplace distribution, we have $P(Y \mid x) = C\,\mathrm{etr}\big(-((Y - A(x))^T (Y - A(x)))^{1/2}\big)$ and $P(x) = \prod_{i=1}^{k} \exp(-|x_i|/\alpha)/(2\alpha)$, which make the objective function equal to

$$\min_x \ \mathrm{tr}\Big(\big((Y - A(x))^T (Y - A(x))\big)^{1/2}\Big) + \frac{1}{\alpha} \sum_{i=1}^{k} |x_i|. \qquad (8)$$

Writing $\lambda = 1/\alpha$ and noting that the nuclear norm satisfies $\|X\|_* = \mathrm{tr}\big((X^T X)^{1/2}\big)$, the above problem can be written as

$$\min_x \ \|Y - A(x)\|_* + \lambda \|x\|_1. \qquad (9)$$

Obviously, this optimization problem is convex and can be solved by various methods; we discuss how to solve it in Section III.
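As a small illustration of how the objective (9) is evaluated in practice (this is our own sketch, not the authors' code; the dictionary and sizes are made up), the nuclear-norm residual is simply the sum of the singular values of $Y - A(x)$:

```python
import numpy as np

def A_of_x(A_list, x):
    """A(x) = sum_i A_i * x_i for a matrix dictionary A_list = [A_1, ..., A_k]."""
    return sum(Ai * xi for Ai, xi in zip(A_list, x))

def map_objective(Y, A_list, x, lam):
    """Objective (9): nuclear norm of the residual plus l1 penalty on the code."""
    residual = Y - A_of_x(A_list, x)
    nuclear = np.linalg.svd(residual, compute_uv=False).sum()   # ||Y - A(x)||_*
    return nuclear + lam * np.abs(x).sum()                      # + lambda * ||x||_1

# Toy usage with a random matrix dictionary (illustrative sizes only).
rng = np.random.default_rng(0)
A_list = [rng.normal(size=(8, 6)) for _ in range(5)]
Y = A_list[0] * 0.7 + A_list[2] * (-0.3) + 0.01 * rng.normal(size=(8, 6))
x = np.zeros(5)
print(map_objective(Y, A_list, x, lam=0.1))   # objective value at x = 0
```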

III. ALGORITHM OF THE PROPOSED MODEL

In this section, we exploit ADMM to solve the optimization problem (9). First, the objective function can be reformulated as

$$\min_{E, x} \ \|E\|_* + \lambda \|x\|_1 \quad \text{s.t.} \quad E = Y - A(x) \qquad (10)$$

where the variables $E$ and $x$ are separable in the objective function and coupled only in the constraint. This problem


can be solved by ADMM, which is equivalent to minimizing the following augmented Lagrange function:

$$L(E, x, Z) = \|E\|_* + \lambda\|x\|_1 + \mathrm{tr}\big(Z^T (A(x) + E - Y)\big) + \frac{\mu}{2}\|A(x) + E - Y\|_F^2$$

where $\mu$ is a penalty parameter and $Z$ is the matrix of Lagrange multipliers. ADMM exploits the separability of (10) and replaces the joint minimization over $E$ and $x$ with two simpler subproblems. As the augmented Lagrange function is unconstrained, it can be minimized with respect to $E$, $x$, and $Z$ alternately. In the $(i+1)$th ADMM iteration, $(E_{i+1}, x_{i+1}, Z_{i+1})$ are updated as follows:

$$E_{i+1} = \arg\min_E \ L(E, x_i, Z_i) \qquad (11)$$

$$x_{i+1} = \arg\min_x \ L(E_{i+1}, x, Z_i) \qquad (12)$$

$$Z_{i+1} = Z_i + \mu\big(A(x_{i+1}) + E_{i+1} - Y\big). \qquad (13)$$

Noting that (13) is a proximal minimization step for the Lagrange multiplier $Z$, (11) and (12) are the critical optimizations. Fortunately, they are convex and can be solved iteratively.

Optimization of E: Fixing $x$ and $Z$, the objective function of (11) can be reformulated as

$$\min_E \ \|E\|_* + \frac{\mu}{2}\Big\|E - \Big(Y - A(x) - \frac{Z}{\mu}\Big)\Big\|_F^2. \qquad (14)$$

This problem is normally solved via the singular value thresholding (SVT) operator [21], which is elaborated in Definition 3.1 and Theorem 3.1.

Definition 3.1 (SVT Operator [21]): Consider the singular value decomposition of a matrix $X \in \mathbb{R}^{m \times n}$ of rank $r$,

$$X = U \Sigma V^T, \quad \Sigma = \mathrm{diag}(\{\sigma_i\}_{1 \le i \le r})$$

where $U$ and $V$ are, respectively, $m \times r$ and $n \times r$ matrices with orthonormal columns, and the singular values satisfy $\sigma_i > 0$, $\forall 1 \le i \le r$. For each $\delta \ge 0$, the SVT operator $S_\delta$ is defined as

$$S_\delta(X) = U S_\delta(\Sigma) V^T, \quad S_\delta(\Sigma) = \mathrm{diag}\big((\sigma_i - \delta)_+\big)$$

where $(\sigma_i - \delta)_+ = \max(0, \sigma_i - \delta)$.

Theorem 3.1 [21]: For each $\delta \ge 0$ and $Y \in \mathbb{R}^{m \times n}$, the SVT operator obeys

$$S_\delta(Y) = \arg\min_X \ \frac{1}{2}\|X - Y\|_F^2 + \delta\|X\|_*.$$

Optimization of x: Given $(E_i, x_i, Z_i)$, we obtain $x_{i+1}$ by the proximal gradient method [22], [23]. The objective function of (12) can be written as

$$\min_x \ \lambda\|x\|_1 + \frac{\mu}{2}\Big\|A(x) + E_i - Y + \frac{Z_i}{\mu}\Big\|_F^2. \qquad (15)$$

In fact, $A(x)$ can be written as $\mathbf{A}x$ in the context of the Frobenius norm, where $\mathbf{A} = [\mathrm{Vec}(A_1), \ldots, \mathrm{Vec}(A_k)]$ and $\mathrm{Vec}(\cdot)$ is the operator converting a matrix into a vector. Denoting $h = \mathrm{Vec}(Y - E_i - Z_i/\mu)$, (15) is equivalent to

$$\min_x \ \lambda\|x\|_1 + \frac{\mu}{2}\|\mathbf{A}x - h\|_F^2. \qquad (16)$$

The term $\|\mathbf{A}x - h\|_F^2$ in (16) can be approximated by its first-order Taylor expansion at $x_i$ [24]. Instead of solving the above optimization problem exactly, we solve the following approximation [22]:

$$\min_x \ \lambda\|x\|_1 + \mu\Big(\langle g_i, x - x_i\rangle + \frac{1}{2\tau}\|x - x_i\|^2\Big) \qquad (17)$$

where $\tau > 0$ is a proximal parameter, and $g_i = \mathbf{A}^T(\mathbf{A}x_i - h)$ is the gradient of the quadratic term in (16). The proximal gradient method is normally used for a composite function with a smooth part and a nonsmooth part, where the gradient is computed only on the smooth part. The solution is given by the following theorem.

Theorem 3.2 [22]: The optimal solution of (17) is attained by

$$x_{i+1} = \frac{x_i - \tau g_i}{|x_i - \tau g_i|} \max\Big(|x_i - \tau g_i| - \frac{\lambda\tau}{\mu},\ 0\Big)$$

where the operations are taken elementwise.

In summary, the ADMM for the matrix-based representation problem involves two subproblems: 1) the SVT and 2) the proximal gradient method. The algorithm is detailed in Algorithm 1.

Algorithm 1 Matrix-Based Representation
Input: data $Y \in \mathbb{R}^{m \times n}$, parameter $\lambda > 0$, dictionary $A_1, \ldots, A_k \in \mathbb{R}^{m \times n}$.
Initialization: $Z = 0$, $\mu = 10^{-6}$, $\tau = 0.25$, $\rho = 1.1$, $\mu_{\max} = 10^{6}$, and $\varepsilon_E = \varepsilon_x = 10^{-6}$.
Output: representation $x$.
1: while not converged do
2:   Fix the others and update $E$:
        $E_{i+1} = S_{1/\mu}\big(Y - A(x_i) - Z_i/\mu\big)$    (18)
3:   Fix the others and update $x$. Let $h = \mathrm{Vec}(Y - E_{i+1} - Z_i/\mu)$ and $\mathbf{A} = [\mathrm{Vec}(A_1), \ldots, \mathrm{Vec}(A_k)]$:
        $x_{i+1} = \dfrac{x_i - \tau g_i}{|x_i - \tau g_i|} \max\big\{|x_i - \tau g_i| - \lambda\tau/\mu,\ 0\big\}$    (19)
     where $g_i = \mathbf{A}^T(\mathbf{A}x_i - h)$.
4:   Update the multiplier: $Z_{i+1} = Z_i + \mu\big(A(x_{i+1}) + E_{i+1} - Y\big)$.
5:   Update the parameter $\mu$ by $\mu = \min(\rho\mu, \mu_{\max})$.
6:   Check the convergence conditions:
        $\|A(x_{i+1}) + E_{i+1} - Y\|_\infty < \varepsilon_E$    (20)
        $\|x_{i+1} - x_i\|_F < \varepsilon_x$    (21)
7: end while
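The following Python sketch mirrors Algorithm 1. It is our own illustrative implementation under the assumptions above (the function and variable names are ours, and the stopping logic simply combines (20) and (21) with an iteration cap), not the authors' released code:

```python
import numpy as np

def svt(M, delta):
    """Singular value thresholding operator S_delta(M) (Definition 3.1)."""
    U, s, Vt = np.linalg.svd(M, full_matrices=False)
    return U @ np.diag(np.maximum(s - delta, 0.0)) @ Vt

def matrix_sparse_representation(Y, A_list, lam, tau=0.25, rho=1.1,
                                 mu=1e-6, mu_max=1e6, eps=1e-6, max_iter=500):
    """ADMM sketch of Algorithm 1 for min ||Y - A(x)||_* + lam * ||x||_1."""
    k = len(A_list)
    m, n = Y.shape
    A = np.column_stack([Ai.reshape(-1) for Ai in A_list])   # A = [Vec(A_1),...,Vec(A_k)]
    x = np.zeros(k)
    Z = np.zeros((m, n))
    for _ in range(max_iter):
        Ax = (A @ x).reshape(m, n)                            # A(x) as a matrix
        # E-step (18): singular value thresholding.
        E = svt(Y - Ax - Z / mu, 1.0 / mu)
        # x-step (19): one proximal-gradient (soft-thresholding) update.
        h = (Y - E - Z / mu).reshape(-1)
        g = A.T @ (A @ x - h)
        x_new = np.sign(x - tau * g) * np.maximum(np.abs(x - tau * g) - lam * tau / mu, 0.0)
        # Multiplier and penalty updates.
        Ax_new = (A @ x_new).reshape(m, n)
        Z = Z + mu * (Ax_new + E - Y)
        mu = min(rho * mu, mu_max)
        # Convergence checks (20)-(21).
        if (np.max(np.abs(Ax_new + E - Y)) < eps and
                np.linalg.norm(x_new - x) < eps):
            x = x_new
            break
        x = x_new
    return x, E
```

Note that, as in Algorithm 1, the penalty parameter μ starts small and grows by the factor ρ at every iteration, so the equality constraint E = Y − A(x) is only enforced gradually.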


Given sufficient training samples $\{A_1, A_2, \ldots, A_k\}$ from different classes, any new testing sample $Y \in \mathbb{R}^{m \times n}$ can be approximated by the linear span $Y = A_1 x_1 + A_2 x_2 + \cdots + A_k x_k$, where $x_i$ is equal to zero when $Y$ is not in the same class as $A_i$. Under this strategy, we can construct a classifier as follows.

For each class $i$, let $\theta_i: \mathbb{R}^n \to \mathbb{R}^n$ be the characteristic function that selects the coefficients associated with the $i$th class. For any $x \in \mathbb{R}^n$, $\theta_i(x)$ is a new vector whose nonzero entries are the entries of $x$ that are associated with class $i$. Let $(x^*, E^*)$ be the optimal solution of (9). Using only the coefficients associated with the $i$th class, one can approximate the testing sample $Y$ by $A(\theta_i(x^*))$. We then classify the testing sample $Y$ by assigning it to the class that minimizes the nuclear norm of the reconstruction residual

$$\min_i \ \|Y - A(\theta_i(x^*))\|_*. \qquad (22)$$
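A minimal sketch of the classification rule (22) (our own code; it assumes a solver such as the ADMM sketch above and a hypothetical labels array giving the class of each dictionary atom):

```python
import numpy as np

def classify(Y, A_list, labels, x_star):
    """Assign Y to the class minimizing the nuclear-norm residual (22).

    labels[i] is the class of dictionary atom A_list[i]; x_star is the code
    obtained by solving (9), e.g., with matrix_sparse_representation above.
    """
    labels = np.asarray(labels)
    best_class, best_residual = None, np.inf
    for c in np.unique(labels):
        x_c = np.where(labels == c, x_star, 0.0)                      # theta_c(x*)
        residual = Y - sum(Ai * xi for Ai, xi in zip(A_list, x_c))    # Y - A(theta_c(x*))
        res_norm = np.linalg.svd(residual, compute_uv=False).sum()    # nuclear norm
        if res_norm < best_residual:
            best_class, best_residual = c, res_norm
    return best_class
```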

IV. COMPLEXITY AND CONVERGENCE

In this section, we analyze the computational complexity and the convergence property of the proposed algorithm.

A. Complexity

As stated before, $Y \in \mathbb{R}^{m \times n}$ and $A_i \in \mathbb{R}^{m \times n}$, $i = 1, 2, \ldots, k$. For convenience, we assume $m > n$. The main computation of Algorithm 1 is Step 2, which requires a singular value decomposition of an $m \times n$ matrix, so its complexity is $O(mn^2)$. In Step 3, the main computation lies in obtaining the gradient, which costs $O(kmn)$. Considering the number of iterations needed to converge, the complexity of Algorithm 1 is $O(d(mn^2 + kmn))$, where $d$ is the number of iterations.

B. Convergence

The convergence of Algorithm 1 is guaranteed by the following theorem.

Theorem 4.1: Let $\tau > 0$ satisfy $\tau\sigma_{\max} < 1$, where $\sigma_{\max}$ denotes the maximum singular value of $\mathbf{A}$. For any fixed $\mu > 0$, the sequence $(E_i, x_i, Z_i)$ generated by (11)–(13) from any starting point $(E_0, x_0)$ converges to $(E^*, x^*, Z^*)$, where $(E^*, x^*)$ is a solution of (10).

The proof of the theorem is given in the Appendix. As proven in Theorem 4.1, $\tau\sigma_{\max} < 1$ is a sufficient condition for convergence. If $\sigma_{\max} = 1$, the sufficient condition becomes $\tau < 1$. Therefore, we normalize the dictionary $A = \{A_1/\|A_1\|_2, A_2/\|A_2\|_2, \ldots, A_k/\|A_k\|_2\}$ in advance to ensure $\sigma_{\max} = 1$.

V. EXPERIMENTS

In this section, we apply the proposed method to different databases to validate its effectiveness. For comparison, we also implement the Linear Regression Classifier (LRC) [25], SRC [2], the Collaborative Representation Classifier (CRC) [9], RSC [13], and the Correntropy-based Sparse Representation (CESR) [26] in our experiments.

A. Experiments on the AR Database

The Aleix Martinez and Robert Benavente (AR) face database [27] contains over 4000 images featuring frontal-view faces with different facial expressions, illumination conditions, and occlusions (sunglasses and scarf).

Fig. 2. Samples from the AR human database.

TABLE II. Recognition on the AR Database With Different Methods (%)

The images correspond to 126 people's faces (70 men and 56 women). The pictures were taken under strictly controlled conditions; no restrictions on wear (clothes, glasses, and so on), make-up, or hair style were imposed on the participants. Each person participated in two sessions, separated by two weeks (14 days), and the same pictures were taken in both sessions. 120 individuals (65 men and 55 women) participated in both sessions, and the images of these 120 individuals are used in our experiment. As shown in Fig. 2, we divide the images of each person into four sets: 1) images with different expressions; 2) images under different illuminations; 3) images with sunglasses; and 4) images with scarves. In our experiment, we use the eight images in the first set for training, and the other three sets are used as testing sets to verify the robustness of the proposed method to different noises.

As shown in Table II, compared with the other algorithms, the proposed method has little advantage when dealing with slightly contaminated images. The images in the AR database are not heavily polluted by illumination; therefore, the proposed method achieves a 99.4% recognition rate on the set contaminated by illumination, which is the same as SRC. When the faces are disguised with sunglasses, which block the area around the eyes, our method shows some improvement: the recognition rate is 97.1%, which outperforms the other methods. The recognition rate of RSC is 96.7%, which ranks second, so the improvement is 0.4%. When the faces are disguised with scarves, which block half of the face, the advantage of our method is more pronounced: it achieves a 79.6% recognition rate, an improvement of 13.4% over SRC, which ranks second. The result shows that modeling the error as a matrix variate can capture more information about the error distribution.


Fig. 3. Recognition rate under different values of the parameter λ.

Fig. 4. Samples from the extended Yale B database.

TABLE III. Recognition on the Extended Yale B Database With Different Methods (%)

This strategy makes our method more effective in recognition. Fig. 3 shows that the proposed method is not very sensitive to the parameter: on the different contaminated testing sets, the recognition rates all tend to be stable when the parameter exceeds 30, and they all achieve their best performance at λ = 60. So, we set λ = 60 in the following experiments for convenience.

B. Experiments on the Extended Yale B Database

The extended Yale B database [28] consists of 2432 human face images from 38 classes. Each class contains 64 images taken under different illuminations, and half of the images are corrupted by shadows or reflections. Some examples of one person are shown in Fig. 4. In this section, we design three different experiments.

In the first experiment, we select the first 16 images per person for training and the remaining 48 images for testing. Each image is resized to 48 × 42. The experimental results are shown in Table III.

Fig. 5. Samples of blocked images from the extended Yale B database.

Fig. 6. Recognition versus different occlusion levels.

Table III shows that the proposed method outperforms the other methods. RSC assumes the noises are i.i.d. with some unknown distribution, which ignores the dependence between the noises. Our method yields an improvement of 3.3% over RSC, which means it is reasonable to take the dependence between noises into account. Considering the noises as a matrix variate following the elliptically contoured distribution is an effective way to measure this dependence.

In the second experiment, we use repeated random-sampling validation. The database is randomly split into two equal subsets: 32 images of each subject are used for training, while the remaining 32 images are used for testing, so both the training set and the testing set contain 1216 samples. In this way, we run the system 10 times and obtain 10 different pairs of training and testing sample sets. As in the first experiment, each image is resized to 48 × 42. We then average the 10 results over the 10 pairs of training and testing sets to obtain the final performance and the standard deviation. The experimental results are shown in Table IV, which shows that the proposed method outperforms the other methods on average, with a slight improvement of 0.35% over RSC, which ranks second. Compared with the result in the first experiment, this improvement is less significant. This is because the proposed method is robust to dependent errors, which are mainly caused by illumination in the extended Yale B database: when the training set and testing set are randomly selected, there are plenty of contaminated images in the training set, so the dependent errors become weak when a testing sample is represented over the training set.

In the third experiment, we investigate the robustness of the proposed method to different levels of contiguous noise. Similar to the settings in [2] and [16], images in subsets 1 and 2 of the extended Yale B database are used for training and those in subset 3 for testing, and we replace 10%–50% of the pixels of each testing image with an unrelated square image.


TABLE IV. Mean Recognition and Standard Deviation on the Extended Yale B Database With Different Methods (%)

Fig. 7. Reconstruction under different occlusions via our method, SRC, RSC, and CESR. The first row is a testing image from the extended Yale B database occluded by 10%, the second row is occluded by 30%, and the third row by 50%. The first column (left) shows the occluded testing image, and the second column shows the coefficient distribution of the proposed method. The third column (right) shows the reconstructed image and error image of the proposed method, SRC, RSC, and CESR.

TABLE V. Recognition on the Randomly Blocked Extended Yale B Database (%)

The location of the unrelated image is randomly selected. Fig. 5 shows samples of blocked images from the extended Yale B database with different levels of occlusion. The images are resized to 96 × 84. As LRC and CRC are not robust to occlusions, we only implement SRC, RSC, and CESR in this experiment; these methods are all based on sparse representation, as is the proposed method. As shown in Table V and Fig. 6, the proposed method is more robust to large occlusions than the other methods. When the occluded pixels are

APPENDIX

Let $\gamma \triangleq 1/(\mu + \mu\delta) > 0$. From the Cauchy–Schwarz inequality $2a^T b \ge -(\gamma\|a\|^2 + \|b\|^2/\gamma)$, (30) implies


$$\|u_i - u^*\|_G^2 - \|u_{i+1} - u^*\|_G^2 \ \ge\ \frac{\mu}{\tau}\|x_i - x_{i+1}\|^2 + \frac{1}{\mu}\|z_i - z_{i+1}\|^2 - \gamma\|z_i - z_{i+1}\|^2 - \frac{1}{\gamma}\|A(x_i - x_{i+1})\|^2$$

$$\ge\ \Big(\frac{\mu}{\tau} - \frac{\sigma_{\max}^2}{\gamma}\Big)\|x_i - x_{i+1}\|^2 + \Big(\frac{1}{\mu} - \gamma\Big)\|z_i - z_{i+1}\|^2$$

$$\ge\ \frac{\mu\delta^2}{\tau}\|x_i - x_{i+1}\|^2 + \frac{1}{\mu}\frac{\delta}{1+\delta}\|z_i - z_{i+1}\|^2 \ \ge\ \eta\|u_i - u_{i+1}\|_G^2 \qquad (32)$$

where $\eta \triangleq \min(\delta^2, \delta/(1+\delta)) > 0$. It follows that: 1) $\|u_i - u_{i+1}\|_G^2 \to 0$; 2) $\{u_i\}$ lies in a compact region; and 3) $\|u_i - u^*\|_G^2$ is monotonically nonincreasing and thus converges.

From 1), we have $x_i - x_{i+1} \to 0$ and $z_i - z_{i+1} \to 0$, which means $Z_i - Z_{i+1} \to 0$. Considering the equation $Z_i = Z_{i-1} + \mu(A(x_i) + E_i - Y)$ and $Z_i - Z_{i-1} \to 0$, it follows that $A(x_i) + E_i - Y \to 0$. From 2), $\{u_i\}$ has a subsequence $\{u_{i_j}\}$ that converges to $\hat{u} = (\hat{x}, \hat{z})$, i.e., $x_{i_j} \to \hat{x}$ and $Z_{i_j} \to \hat{Z}$. Since $A(x_i) + E_i - Y \to 0$ and $x_{i_j} \to \hat{x}$, $\{E_i\}$ has a subsequence $\{E_{i_j}\}$ that approaches a feasible solution $\hat{E}$. Therefore, $(\hat{E}, \hat{x}, \hat{Z})$ is a limit point of $\{(E_i, x_i, Z_i)\}$.

Now we show that $(\hat{E}, \hat{x}, \hat{Z})$ satisfies the KKT conditions of problem (10). First, by the definition of $\hat{E}$, it holds that

$$\lim_{i_j \to \infty} A(x_{i_j}) + E_{i_j} - Y = A(\hat{x}) + \hat{E} - Y = 0 \qquad (33)$$

which satisfies (25). Second, since $x_i - x_{i+1} \to 0$ and $Z_{i_j} \to \hat{Z}$, by taking the limit of (26) over $i_j$, we get $-\hat{Z} \in \partial\|\hat{E}\|_*$, which satisfies (23). Third, we know

$$\mu \mathbf{A}^T \mathbf{A}(x_{i+1} - x_i) - \mathbf{A}^T z_{i+1} + \frac{\mu}{\tau}(x_i - x_{i+1}) \in \partial\lambda\|x_{i+1}\|_1;$$

by taking the limit of this relation over $i_j$, we get $-\mathbf{A}^T \hat{z} \in \partial\lambda\|\hat{x}\|_1$, which satisfies (24). Therefore, we have shown that any limit point of $\{(E_i, x_i, Z_i)\}$ is an optimal solution of (10). Since (32) holds for any optimal solution of (10), by letting $u^* = (x^*, z^*) = (\hat{x}, \hat{z})$ and considering the monotonicity in 3), we obtain the convergence of $\{(E_i, x_i, Z_i)\}$.

ACKNOWLEDGMENT

The authors would like to thank the editor and the anonymous reviewers for their critical and constructive comments and suggestions.

REFERENCES

[1] B. A. Olshausen and D. J. Field, "Sparse coding with an overcomplete basis set: A strategy employed by V1?" Vis. Res., vol. 37, no. 23, pp. 3311–3325, Dec. 1997.
[2] J. Wright, A. Y. Yang, A. Ganesh, S. S. Sastry, and Y. Ma, "Robust face recognition via sparse representation," IEEE Trans. Pattern Anal. Mach. Intell., vol. 31, no. 2, pp. 210–227, Feb. 2009.


[3] Y. Liu, F. Wu, Z. Zhang, Y. Zhuang, and S. Yan, "Sparse representation using nonnegative curds and whey," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2010, pp. 3578–3585.
[4] J. Wang, J. Yang, K. Yu, F. Lv, T. Huang, and Y. Gong, "Locality-constrained linear coding for image classification," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2010, pp. 3360–3367.
[5] X. Lu, Y. Wang, and Y. Yuan, "Sparse coding from a Bayesian perspective," IEEE Trans. Neural Netw. Learn. Syst., vol. 24, no. 6, pp. 929–939, Jun. 2013.
[6] M. Yang, L. Zhang, S. C.-K. Shiu, and D. Zhang, "Robust kernel representation with statistical local features for face recognition," IEEE Trans. Neural Netw. Learn. Syst., vol. 24, no. 6, pp. 900–912, Jun. 2013.
[7] Y. Xu, Q. Zhu, Z. Fan, D. Zhang, J. Mi, and Z. Lai, "Using the idea of the sparse representation to perform coarse-to-fine face recognition," Inf. Sci., vol. 238, pp. 138–148, Jul. 2013.
[8] R. He, W.-S. Zheng, B.-G. Hu, and X.-W. Kong, "Two-stage nonnegative sparse representation for large-scale face recognition," IEEE Trans. Neural Netw. Learn. Syst., vol. 24, no. 1, pp. 35–46, Jan. 2013.
[9] L. Zhang, M. Yang, and X. Feng, "Sparse representation or collaborative representation: Which helps face recognition?" in Proc. IEEE Int. Conf. Comput. Vis. (ICCV), Nov. 2011, pp. 471–478.
[10] J. Mairal, F. Bach, J. Ponce, and G. Sapiro, "Online learning for matrix factorization and sparse coding," J. Mach. Learn. Res., vol. 11, pp. 19–60, Jan. 2010.
[11] G. Duan, H. Wang, Z. Liu, J. Deng, and Y.-W. Chen, "K-CPD: Learning of overcomplete dictionaries for tensor sparse coding," in Proc. Int. Conf. Pattern Recognit., Nov. 2012, pp. 493–496.
[12] C. Lu, J. Shi, and J. Jia, "Online robust dictionary learning," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2013, pp. 415–422.
[13] M. Yang, L. Zhang, J. Yang, and D. Zhang, "Robust sparse coding for face recognition," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2011, pp. 625–632.
[14] Z. Lai, W. K. Wong, Z. Jin, J. Yang, and Y. Xu, "Sparse approximation to the eigensubspace for discrimination," IEEE Trans. Neural Netw. Learn. Syst., vol. 23, no. 12, pp. 1948–1960, Dec. 2012.
[15] S.-J. Wang, J. Yang, M.-F. Sun, X.-J. Peng, M.-M. Sun, and C.-G. Zhou, "Sparse tensor discriminant color space for face verification," IEEE Trans. Neural Netw. Learn. Syst., vol. 23, no. 6, pp. 876–888, Jun. 2012.
[16] M. Yang, L. Zhang, J. Yang, and D. Zhang, "Regularized robust coding for face recognition," IEEE Trans. Image Process., vol. 22, no. 5, pp. 1753–1766, May 2013.
[17] K.-T. Fang and T. W. Anderson, Statistical Inference in Elliptically Contoured and Related Distributions. New York, NY, USA: Allerton Press, 1990. [Online]. Available: http://books.google.com/books?id=ow3vAAAAMAAJ
[18] M. D. Ernst, "A multivariate generalized Laplace distribution," Comput. Statist., vol. 13, pp. 227–232, 1998.
[19] D. N. Naik and K. Plungpongpun, "A Kotz-type distribution for multivariate statistical inference," in Advances in Distribution Theory, Order Statistics, and Inference. Berlin, Germany: Springer-Verlag, 2006, pp. 111–124.
[20] A. K. Gupta and T. Varga, "A new class of matrix variate elliptically contoured distributions," J. Italian Statist. Soc., vol. 3, no. 2, pp. 255–270, 1994.
[21] J.-F. Cai, E. Candès, and Z. Shen, "A singular value thresholding algorithm for matrix completion," SIAM J. Optim., vol. 20, no. 4, pp. 1956–1982, 2010.
[22] J. Yang and Y. Zhang, "Alternating direction algorithms for ℓ1-problems in compressive sensing," SIAM J. Sci. Comput., vol. 33, no. 1, pp. 250–278, 2011.
[23] E. T. Hale, W. Yin, and Y. Zhang, "Fixed-point continuation for ℓ1-minimization: Methodology and convergence," SIAM J. Optim., vol. 19, no. 3, pp. 1107–1130, 2008.
[24] Z. Lin, R. Liu, and Z. Su, "Linearized alternating direction method with adaptive penalty for low-rank representation," 2011. [Online]. Available: arXiv:1109.0367
[25] I. Naseem, R. Togneri, and M. Bennamoun, "Linear regression for face recognition," IEEE Trans. Pattern Anal. Mach. Intell., vol. 32, no. 11, pp. 2106–2112, Nov. 2010.
[26] R. He, W.-S. Zheng, and B.-G. Hu, "Maximum correntropy criterion for robust face recognition," IEEE Trans. Pattern Anal. Mach. Intell., vol. 33, no. 8, pp. 1561–1576, Aug. 2011.
[27] A. Martinez and R. Benavente, "The AR face database," Computer Vision Center, Bellaterra, Spain, Tech. Rep. 24, Jun. 1998. [Online]. Available: http://www.cat.uab.cat/Public/Publications/1998/MaB1998


[28] K.-C. Lee, J. Ho, and D. Kriegman, "Acquiring linear subspaces for face recognition under variable lighting," IEEE Trans. Pattern Anal. Mach. Intell., vol. 27, no. 5, pp. 684–698, May 2005.
[29] P. J. Phillips et al., "Overview of the face recognition grand challenge," in Proc. IEEE Comput. Soc. Conf. Comput. Vis. Pattern Recognit. (CVPR), vol. 1, Jun. 2005, pp. 947–954.

Lei Luo received the B.S. degree from Xinyang Normal University, Xinyang, China, in 2008, and the M.S. degree from Nanchang University, Nanchang, China, in 2011. He is currently pursuing the Ph.D. degree in pattern recognition and intelligence systems with the School of Computer Science and Engineering, Nanjing University of Science and Technology, Nanjing, China. His current research interests include pattern recognition and optimization algorithm.

Jinhui Chen received the B.S. and M.S. degrees in mathematics from the Nanjing University of Science and Technology, Nanjing, China, in 2007 and 2009, respectively, where she is currently pursuing the Ph.D. degree in pattern recognition and image processing.

Jian Yang received the B.S. degree in mathematics from Xuzhou Normal University, Xuzhou, China, in 1995, the M.S. degree in applied mathematics from Changsha Railway University, Changsha, China, in 1998, and the Ph.D. degree in pattern recognition and intelligence systems from the Nanjing University of Science and Technology (NUST), Nanjing, China, in 2002. He was a Post-Doctoral Researcher with the University of Zaragoza, Zaragoza, Spain, in 2003. He was a Post-Doctoral Fellow with the Biometrics Research Centre, Hong Kong Polytechnic University, Hong Kong, from 2004 to 2006, and with the Department of Computer Science, New Jersey Institute of Technology, Newark, NJ, USA, from 2006 to 2007. He is currently a Professor with the School of Computer Science and Technology, NUST. He has authored over 80 scientific papers in pattern recognition and computer vision. His journal papers have been cited more than 3000 times in the ISI Web of Science and 7000 times in Google Scholar. His current research interests include pattern recognition, computer vision, and machine learning. Prof. Yang is also an Associate Editor of Pattern Recognition Letters and the IEEE TRANSACTIONS ON NEURAL NETWORKS AND LEARNING SYSTEMS.

Jianjun Qian received the B.S. degree in computer science from Zhengzhou University, Zhengzhou, China, in 2007, and the M.S. degree in computer software and theory from Northwest University for Nationalities, Lanzhou, China, in 2010. He is currently an Assistant Professor with the School of Computer Science and Engineering, NUST. His current research interests include pattern recognition, computer vision, and face recognition.

Wei Xu received the B.E. degree in computer science and technology from the Nanjing University of Science and Technology, Nanjing, China, in 2009, where he is currently pursuing the Ph.D. degree in pattern recognition and artificial intelligence with the School of Computer Science and Engineering. His current research interests include image processing, computer vision, and saliency detection.
