IEEE TRANSACTIONS ON IMAGE PROCESSING, VOL. 23, NO. 4, APRIL 2014

1737

LGE-KSVD: Robust Sparse Representation Classification Raymond Ptucha, Member, IEEE, and Andreas E. Savakis, Senior Member, IEEE

Abstract— The parsimonious nature of sparse representations has been successfully exploited for the development of highly accurate classifiers for various scientific applications. Despite the successes of Sparse Representation techniques, a large number of dictionary atoms as well as the high dimensionality of the data can make these classifiers computationally demanding. Furthermore, sparse classifiers are subject to the adverse effects of a phenomenon known as coefficient contamination, where, for example, variations in pose may affect identity and expression recognition. We analyze the interaction between dimensionality reduction and sparse representations, and propose a technique, called Linear extension of Graph Embedding K-means-based Singular Value Decomposition (LGE-KSVD) to address both issues of computational intensity and coefficient contamination. In particular, the LGE-KSVD utilizes variants of the LGE to optimize the K-SVD, an iterative technique for small yet over complete dictionary learning. The dimensionality reduction matrix, sparse representation dictionary, sparse coefficients, and sparsity-based classifier are jointly learned through the LGE-KSVD. The atom optimization process is redefined to allow variable support using graph embedding techniques and produce a more flexible and elegant dictionary learning algorithm. Results are presented on a wide variety of facial and activity recognition problems that demonstrate the robustness of the proposed method. Index Terms— Dimensionality reduction, manifold learning, sparse representation, facial analysis, activity recognition.

I. I NTRODUCTION

I

NSPIRED by studies of selective firing of neurons in V1, the first layer of the primate visual cortex [1], [2], the notion of Sparse Representations (SRs) has found applications in a variety of scientific fields including facial understanding and activity recognition. A vectorized image x ∈ R D is efficiently represented by sparse linear coefficients from a dictionary  of m overcomplete basis functions, where  ∈ R Dxm . SR solves for coefficients a ∈ Rm that satisfy the 1 minimization of |a|1 s.t. xˆ = a. It has been shown that under typical conditions, the minimal solution is the sparsest one [3], [4]. There have been several studies optimizing both the 1 minimization [5], [6] as well as the selection of dictionary elements [7]–[9]. Manuscript received April 10, 2013; revised September 11, 2013 and January 2, 2014; accepted January 8, 2014. Date of publication January 29, 2014; date of current version March 4, 2014. This work was supported in part by Cisco and the National Science Foundation through a Graduate Research Fellowship under Grant 2010101003. The associate editor coordinating the review of this manuscript and approving it for publication was Prof. Carlo S. Regazzoni. The authors are with the Department of Computer Engineering, Rochester Institute of Technology, Rochester, NY 14623 USA (e-mail: [email protected]; [email protected]). Color versions of one or more of the figures in this paper are available online at http://ieeexplore.ieee.org. Digital Object Identifier 10.1109/TIP.2014.2303648

Although the SR framework was designed for reconstruction purposes, it has been adapted successfully for classification problems. Wright et al. [10], [11] populated a dictionary with all n training samples, then passed the a coefficients into a minimum reconstruction error classifier. In this framework, the dominant signal always prevails, but it could produce some unintended effects. For example, when trying to extract facial identity, pose variation may contaminate or even dominate the sparse coefficients. Coefficient contamination is unfortunate yet important, as it has been shown that images of a single person under multiple poses exhibit greater variation than images of different people at a single pose [12]. To make the coefficient learning computationally tractable Wright et al. [10], [11] used random projections to reduce the dimensionality of the data. Tzimiropoulos et al. [13] and Zafeiriou and Petrou [14] used Linear Discriminant Analysis (LDA) and Principal Component Analysis (PCA) respectively, along with SR techniques based on [11] to demonstrate further computational and accuracy improvements. The work in [14] experienced the adverse effects of coefficient contamination, and noted that applying Wright’s framework is not a straightforward process because the facial identity of the person is often confused with facial expression. Ptucha et al. [15], [16] addressed the coefficient contamination problem by preprocessing the data with supervised manifold learning. Similar to subspace clustering [17], supervision in manifold learning encourages clustering of sample images in accordance with their classification labels. Given n data samples, x 1 , x 2 , …, x n , each sample x i ∈ R D , stored in matrix X, X ∈ R Dxn and D < n, PCA and LDA are effective techniques for obtaining a lower dimensional representation of X. During PCA or LDA, the top d eigenvectors are used in projection matrix U such that the low dimensional representation of X is Y T = X T U , Y ∈ Rd xn . Although these linear dimensionality reduction techniques produce meaningful results, we seek an alternate representation of low dimension d, such that d  D. Further, the underlying linearity assumption of PCA and LDA may be limiting when modeling the behavior of complex imagery such as face representations. Manifold learning techniques reduce the dimensionality of input data by identifying a lower dimensional space where the data resides [18], [19]. In order to support the extension of the manifold model to new examples, linearized techniques called Linear extension of Graph Embedding (LGE) [20], solve a linear approximation of the non-linear object. Research in manifold learning has influenced the SR community and vice-versa. The Sparsity Preserving Projections

1057-7149 © 2014 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission. See http://www.ieee.org/publications_standards/publications/rights/index.html for more information.

1738

IEEE TRANSACTIONS ON IMAGE PROCESSING, VOL. 23, NO. 4, APRIL 2014

approach [21] replaces the adjacency matrix used in LGE techniques with SR sparse coefficients. Mairal et al. [22] injects a multiclass logistic regression term to the sparse energy function to make dictionary learning have both reconstructive and discriminative properties. Discriminative Sparse Coding [23] uses sparse coefficients in an LDA framework. Graph Regularized Sparse Coding [24] adds the LGE objective function on sparse coefficients to the traditional 1 sparse objective function as it jointly learns the sparse coefficients and dictionary terms. Sparse Embedding [40] performs dictionary learning in a low dimensional space as it jointly learns the dimensionality reduction matrix and dictionary. What has not been explored is a single method that optimizes manifold learning, SRs, and classification learning concepts into a single framework. This paper furthers the field of sparse representation classification by strengthening the relationship between manifold learning and SRs. In particular: • We propose a method to jointly optimize dimensionality reduction, sparse dictionary learning, sparse coefficients, and a sparsity-based classifier in an elegant framework that is broadly applicable to a wide variety of computer vision problems. • We employ the SR concept in a dimensionality-reduced space obtained by a semi-supervised variant of LGE. This dimensionality-reduced space minimizes coefficient contamination, is suitable for posed and natural datasets, static and temporal datasets, problems with small and large number of classes, and is computationally efficient. • The inclusion of all training samples as dictionary elements incorporates redundant information and noisy samples in the training set. As such, we co-optimize K-SVD dictionary learning to learn a small, yet over complete dictionary in dimensionality-reduced space. • The usage of K-SVD limits the applicability of simple classifiers [10], [11]. We therefore jointly optimize a sparse coefficient transformation matrix and class label identifier. • The atom optimization process in K-SVD is redefined to allow variable support using graph embedding techniques. We compare our framework, which we term LGE-KSVD, with other recently introduced methodologies and demonstrate superior performance across a wide variety of facial and activity classification problems. The rest of this paper is organized as follows. Sections II and III introduce the necessary principles of manifold learning and sparse signal representation. Section IV describes how to jointly optimize manifold learning and sparse representations. Section V presents experimental results. Section VI summarizes with conclusions. II. D IMENSIONALITY R EDUCTION A. Manifold Learning High dimensional feature spaces are not only inefficient and computationally intensive, but the sheer number of dimensions often masks the discriminative signal embedded in the data. For samples x i ∈ R D we seek a low dimensional representation

yielding yi ∈ Rd , where d < D. PCA is optimal in the sense of reconstruction error, LDA is optimal in the sense of linear classification, and both assume linear combinations of normal distributions. For LDA, d = 1 − k, where k is the number of classes. For PCA, we select the top d eigenvectors that encode most of the data variance. For example, to model the top T % variance of the training samples, d is solved such that λj /λk ≥ T /100; where j = 1, 2, . . . d; k = 1, 2, . . . D. Alternatively, the high dimensional feature space can be parameterized by a lower dimensional embedded manifold discovered using manifold learning [18], [19]. A manifold is a topological space that is homeomorphic in Rd . This representation is not only discriminant, but often finds d  D. B. Linear Extension of Graph Embedding During manifold learning a fully connected graph of the input space is constructed, where each of the n input samples or nodes is connected to all other (n − 1) input samples with a weight, 0 ≤ wi j ≤ 1, i, j = 1 . . . n. The resulting n×n connection matrix W is called the adjacency matrix and the connections or weights wi j can be solved several ways. For example, wi j is set to 1 if x i is amongst the z nearest neighbors of x j , 0 otherwise. Alternatively, wi j is set to 1 if ||x i − x j || < ε, and 0 otherwise. The goal of graph embedding is to preserve the similarities amongst neighbors from high and low dimensional spaces. The optimal Y is found by minimizing:  2 yi − y j wi j (1) i, j

As such, if neighbors yi and y j have a strong connection wi j , their geodesic distance should be minimal. W is defined similarly for X and Y , such that if neighbors x i and x j are close, yi and y j are also close. LGE seeks a linear approximation to this nonlinear concept of the form yiT = x iT U or Y T = X T U . Let D be a diagonal matrix of the column sums of W , Dii = j wi j , and L be the Laplacian matrix; then L = D − W . After simplification, this problem reduces to: UT X LXT U Uˆ = arg min T U X DXT U U

(2)

The optimal U is given by the minimum eigenvalue of the generalized eigenvector problem: X L X T U = λX D X T U

(3)

where U is the resulting projection matrix. Different choices of W yield a multitude of dimensionality reduction techniques such as LDA, Locality Preserving Projections (LPP) [25], and Neighborhood Preserving Embedding (NPE) [26]. For each approach, W is initialized to all zeros, and then connected wi j entries are determined by similarity. For LDA, nodes i and j are connected if they are from the same class. Connected wi j entries are set to 1/kn , where kn is the number of samples per their shared class: wi j = 1/kn

(4)

PTUCHA AND SAVAKIS: LGE-KSVD: ROBUST SR CLASSIFICATION

1739

For LPP, if nodes i and j are connected, then: wi j = e

xi −x j 2 τ

(5)

For NPE, if nodes i and j are connected, we solve the following objective function for element i in local reconstruction matrix M∈ Rnxn as a function of z nearest neighbors of x i , Nz (x i ):  2        x mi n  − M x Mi j = 1 (6) ij j  ,  i   j ∈Nz x j ∈Nz x ( i)

( i)

Then let: W = M + MT − MT M

(7)

Both LPP and NPE can be used in supervised mode by defining connected neighbors as the neighbors that share similar class labels. III. S PARSE S IGNAL R EPRESENTATION A. Sparse Representations A natural representation of a low dimensional sample y ∈ Rd from a training dictionary of m samples, ∈ Rd xm , is obtained by solving yˆ = a, where a∈ Rm is the weight of each training exemplar in the dictionary , and where ˆ· denotes estimate. However, in most practical cases, the system has either no solution or multiple solutions. For sparse signals, the objective of SRs is to identify the smallest number of nonzero coefficients a, such that yˆ = a. A convex relaxation approach was introduced by Donoho et al. [3] and Candes et al. [4] where it was shown that under certain constraints, such as the sparsity of the representation, the minimal solution is equivalent to the solution of the following regression problem in statistics: (8) â = arg min ||a||1 s.t.ˆy = a  where a1 = |a|. The benefit of using 1 minimization is that the problem can be efficiently solved using convex optimization algorithms. When noise is present in the signal, a perfect reconstruction is typically not feasible. Therefore, we require that the reconstruction be within an error tolerance. This optimization, called Basis Pursuit Denoising (BPDN), reformulates (8) as:   (9) aˆ = arg min a1 s.t.  yˆ − a 2 ≤  Often (9) is approximated by loosening the error constraints and reconfigured to specifically include a regularization term, λ which encourages sparseness by incurring a penalty on the resulting coefficients:   2 (10) aˆ = arg min  yˆ − a 2 + λ a1 Perhaps the most widely used method to solve the 1 minimization of (9) or (10) is Orthogonal Matching Pursuit (OMP) [27]. Given the SR coefficients â of a test image using the dictionary , a reconstruction error method estimates the class

k ∗ of a query sample y. Given k classes, the reconstructed sample using sparse coefficients â from all classes is compared to the reconstructed sample using coefficients â i from each respective class:   (11) k ∗ = min  y − aˆ i 2 i=1:k

Equation (11) assigns the test sample classification to the class most similar to the fully reconstructed signal. When constructing  the goal is to generate an over-complete dictionary with m > d. This allows the necessary degrees of freedom for choosing the sparsest solution and produces smooth and graceful coefficient activity across diverse test samples [28]. Wright et al. [10], [11] used all n training samples in . For efficiency, it is desirable to have a dictionary of small size, necessitating linearly independent or decorrelated samples. K-SVD [29] was introduced as a means to learn an overcomplete but small dictionary. K-SVD is an iterative technique, where at each iteration, training samples are first sparsely coded using the current dictionary estimate, and then dictionary elements are updated one at a time while keeping others fixed. Each new dictionary element is a linear combination of training samples. Rubinstein [30] implemented an efficient implementation of K-SVD using Batch Orthogonal Matching Pursuit. The works of [22], [31], and [32] jointly optimize dictionary learning and classifier training to select exemplars that minimize both reconstructive and discriminative errors. Jiang et al. [7] devised efficient methods for choosing  from a set of training exemplars by simultaneously minimizing both reconstruction and classification errors. The work in [7] encourages input samples from the same class to have similar sparse codes. B. Dimensionality Reduction and Sparse Representations Although methods for populating the adjacency matrix W vary, sparseness is one common characteristic across all techniques. Sparsity Preserving Projections (SPP) [21] is similar to NPE, but uses sparse coefficients from all other training samples, instead of local topology when solving for W . As such, similar neighbors are described by training samples with similar dictionary elements, replacing the y terms in (1) with coefficients â, claiming that nearby samples should have similar coefficients. Global Sparse Representation Projections (GSRP) [33] modifies the dimensionality reduction function in SPP to simultaneously maximize supervised class separability and minimize sparse representation error. Class separability is maximized by creating an adjacency matrix of all 0’s, then 1’s added to ij pairs where the k nearest neighbors of i are of a different class. The laplacian of this matrix is maximized via generalized eigenvector analysis similar to LGE methods. Similar to Sparsity Preserving Projections and Global Sparse Representation Projections, Discriminative Sparse Coding (DSC) [23] uses the sparse coefficients for each sample x using all other samples to populate matrix W . DSC defines similarity matrix W S , and difference matrix W D , where W S has 1’s where wi j = 0 have the same label, and W D has 1’s where

1740

IEEE TRANSACTIONS ON IMAGE PROCESSING, VOL. 23, NO. 4, APRIL 2014

wi j = 0 have differing labels. W S and W D add supervised similarity and dissimilarity matrices akin to LDA. Using W S as a dictionary, the sparse coefficients are recalculated for all input samples, then placed into β such that we can minimize S = X − W S β. Like LDA, DSC creates a term V that minimizes the difference of each sample with its class mean, but only for samples found in difference matrix W D . DSC solves for the dimensionality matrix by minimizing a convex combination of S and V during the generalized eigenvector analysis. Graph Regularized Sparse Coding [24] adds an unsupervised laplacian to (10). Similar to K-SVD, GRSC iteratively learns sparse codes â with fixed dictionary , then learns dictionary  with fixed coefficients â. After each iteration, the laplacian is minimized via gradient descent. Sparse Embedding (SE) [40] reduces the dimension of the input while maintaining sparse properties. SE minimizes (10) in dimensionality reduced space, both for computational efficiency and more accurate dictionary learning. By adding the dimensionality reduction term X −U T UX2F to (10), with  · F the Frobenius norm, UX being the low dimensional representation of the data, and U T UX being the corresponding high dimensional representation, SE minimizes reconstruction error in U , similar to PCA, to jointly learn , â, and U . Each of the above methods introduces a new dimensionality reduction technique or a new SR technique. What lacks is a unified approach that optimizes dimensionality reduction projection matrix U with dictionary , and coefficients â. In the next section we present such a method, which we call LGE-KSVD [41], for the optimization and infusion of Linear extension of Graph Embedding with K-SVD dictionary learning.

IV. T HE LGE-KSVD M ETHOD Classification frameworks based on SR concepts have been found to suffer from: 1. Coefficient contamination that compromises classification accuracy, and 2. Computational inefficiencies due to high dimensional features and large dictionaries. The usage of LGE-KSVD aims to address both limitations via semi-supervised dimensionality reduction and K-SVD dictionary learning. We have considered many ways to combine LGE concepts with K-SVD, and the proposed LGE-KSVD approach presents a framework that provides robust and accurate classification across a diverse field of computer vision problems. We begin by combining the dimensionality reduction matrix U from (2) with a method to learn a dictionary  and sparse coefficients â. K-SVD solves: {,ˆ a} ˆ = arg min x − a22 s.t. a0 ≤ δ

(12)

Coefficients â from each training example are stored into matrix A, A ∈ Rmxn . Combining (2) with (12), dictionary 

is now solved in a low dimensional space  ∈ Rd xm giving: T T ˆ = arg min (X T U )T − A22 + U X L X U ˆ A} {Uˆ , , T U X D X T U   (13) st. A(:,i) 0 ≤ δ

The first term performs K-SVD optimization in low dimensional space, and the second term is the LGE dimensionality reduction objective function. This formulation is similar to SE [40], but does not require the dimensionality reduction term to be bound to a minimum reconstruction error. As will be shown shortly, it is beneficial to use a convex combination of supervised and unsupervised methods in the dimensionality reduction step to minimize coefficient contamination. If we use PCA for the LGE dimensionality reduction function, LGEKSVD can be made to match SE. Since we are solving a classification problem, we need to convert test sample coefficients â into a classification label estimate. Dictionary elements from K-SVD are a linear combination of input samples, preventing the direct application of the minimum reconstruction error in (11). Latent Sparse Embedding Residual Classification [40] creates k dictionaries and k dimensionality reduction matrices, where k is the number of classes, to apply (11), but is not practical for large class problems. Alternatively, we can pass the coefficients â into SVM or any machine learning based classifier. Since a linear combination of dictionary elements (represented by â) can be used to represent a new test sample, it is reasonable to assume that a linear weighting of these same coefficients can be used to represent the class of the new test sample. Empirical studies show good classification with coefficient transformation matrix C, C ∈ Rmxk , where k is the number of classes and m is the number of dictionary elements. C will convert our coefficients â into a class estimate. A linear transformation is fast but sufficient as the semi-supervised usage of LGE minimizes coefficient contamination and strikes a good balance between exercising class separation and maintaining input topology. To solve for C, we define H as a sparse ground truth matrix, H ∈ Rkxn . Each column of H corresponds to a training sample, where the k th element is set to 1 if yi belongs to class k, 0 otherwise. This problem is formulated as: Cˆ = arg min H − C T A22

(14)

The above can be solved directly via ridge regression: C = (A A T )−1 AH T .

(15)

A. Training Procedure for LGE-KSVD Equation (13) is neither directly solvable nor convex. Using K-SVD with n training samples, we learn a dictionary  of m atoms, where m ≤ n via an implicit transformation T , T ∈ Rmxn , resulting in  = TYT = TXT U. As such, the dictionary transformation function T and the dimensionality reduction transformation function U will oscillate if we indiscriminately iterate one after the other. For smooth and reliable results, we desire an overcomplete dictionary in which the number of samples, m is greater than

PTUCHA AND SAVAKIS: LGE-KSVD: ROBUST SR CLASSIFICATION

1741

Fig. 2.

Fig. 1.

Training procedure for LGE-KSVD.

the number of dimensions d. T has rank ≤ m, U has rank ≤ d. Since m ≥ d, T in general has more degrees of freedom and it is preferable to iterate on T more often than U . As we minimize reconstruction errors in (13), coefficients A develop into more accurate representations of X. Although it is not guaranteed, experimental results will show that more accurate representations of X result in lower classification errors. In [15] it was shown that supervised dimensionality reduction minimizes SR coefficient contamination by enforcing class separation. A discriminative dictionary was utilized in [7], [21]–[23], and [33]. Similar to [23] and [33], we find better results if the SR energy function minimizes reconstruction errors and the LGE energy function encourages class discrimination. Not only does this offer superior classification results, but because we are operating in a low dimensional space, the resulting framework minimizes compute intensity. After an initial dimensionality reduction matrix U is obtained via semi-supervised LGE, we propose a double nested iterative training procedure. The outer loop updates U based upon the current best estimates of  and a, and the inner loop uses K-SVD to iteratively update a, then , based upon the current best estimate of U . To get an updated estimate of U , the update problem is formulated as: Uˆ = arg min X T U − A T T 22

(16)

and is solved directly: T −1

U = (X X )

XA  T

T

(17)

Fig. 1 summarizes the training procedure. The choice of LGE in Step 1a of the training procedure should be a discriminative embedding that maintains input topology. The best approach we have found [15] uses a convex combination of supervised and unsupervised adjacency matrices W L D A and WGaussian corresponding to (4) and (5) respectively. The two are combined into a single W : W = αW L D A + (1 − α)WGaussian

(18)

The K-SVD algorithm.

For posed datasets which are linearly separated, W L D A is weighted higher. For natural datasets or problems where the number of classes is small, we emphasize WGaussian . Classification problems with small number of classes reduce the rank of W L D A to (k − 1). For example, to determine gender from facial images, W L D A would restrict U to a D × 1 projection matrix, where D is the number of image pixels in each face exemplar. The initialization Step 3 in the training procedure is done such that each of the k classes is represented in proportion to the number of samples that it occupies in the training set. B. Modified K-SVD While learning the dictionary for a given training set, K-SVD (see Fig. 2) alternates between solving for A given the current estimate of , then solving for  given the current estimate of A. While updating , the algorithm does not allow the participating training samples that combine to form each dictionary element to change. LGE-KSVD allows the support of each dictionary element to grow, shrink, or stay the same by incorporating LGE concepts into the K-SVD algorithm. This change in support promotes faster convergence or more accurate dictionaries if iterations are fixed. These modifications to K-SVD are described next. The K-SVD penalty term of (12) can be rewritten as:  2 m     z 2 z A T  X − A F =  X −   z=1 F 2 ⎛ ⎞   m   j z⎠  ⎝X − =  A A −  z T j T  (19)    z=1, z = j F

=

j E j −  j A T 2F

where A is decomposed into the sum of m rank-1 matrices ∈ Rd xn , each representing one dictionary element, with  ∈ Rd xm and A ∈ Rmxn . The error, E j is the total error for all n training samples with the j th dictionary element removed. The step of updating dictionary elements sequentially updates one element at a time (see Fig. 2). While updating element j , K-SVD assumes that dictionary matrix  and sparse coefficient matrix A are fixed except for (column) element j of , j ∈ Rd x1, and the corresponding coefficients for j , which comprise row j of coefficient matrix A, j AT ∈ R1xn .

1742

IEEE TRANSACTIONS ON IMAGE PROCESSING, VOL. 23, NO. 4, APRIL 2014

Singular Value Decomposition (SVD) solves the closest rank-1 matrix that approximates E j in (19), yielding the optij mal j and AT directly; the new best estimates for dictionary element j and its corresponding coefficients. Unfortunately j AT would tend to use many, if not all training samples, j resulting in a non-sparse AT . K-SVD enforces sparsity by fixing the support of j to only those training entries with non-zero coefficients of element j ; as such, at initialization, each dictionary element is paired up with a list of training samples that can never change. To ensure sparsity, a vector η j is initialized with zeros, then corresponding entries with nonj zero AT (indicating which training samples can be used for dictionary element j ) are used as indices to x i that use dictionary element j :

 j η j = i |1 ≤ i ≤ J, A T (i ) = 0 (20)

Fig. 3.

The modified K-SVD Algorithm.

Rnx|ηj | .

By multiVector η j is used to build matrix j ∈ j plying AT and E j by j , we remove zero entries, using only the subset of dictionary samples that use element j , forming j restricted row vector AR and error matrix E Rj :  2      j j 2 E j j −  j A T j  = E Rj −  j A R  F

F

(21)

By minimizing (21), K-SVD forces the solution of j and j to have identical support to the original AT . SVD directly decomposes the rank-1 matrix E Rj = UVT . j is the first j AR

j AR

is the first column of V ×  (1, 1). column of U, and K-SVD can be modified by incorporating supervision into the updating of dictionary element j . Each dictionary element is assigned a class using a minimum reconstruction error via (11), and then the element’s support, controlled via η j , is restricted to the training samples of its own class. K-SVD without or with supervision fixes the support of each element at the time of dictionary initialization. The former makes K-SVD dependent upon a good initialization guess; the latter may seem good from a classification perspective, but is poor from a reconstruction perspective, as test samples often necessitate elements from more than one training class. An improvement over both approaches is to let the training samples that contribute to each dictionary element be governed by sample-to-sample similarity and class labels. Further, as long as sparsity is maintained, it is desired for the support of each element, η j , to change at each iteration. We propose to use semi-supervised LGE adjacency matrix W as per (18) to regulate the support of each dictionary element. In particular, the support of dictionary element j may: a) Expand: Modify the support of element j by adding (union) all training entries similar to element j . b) Contract: Modify the support of element j by removing (intersection) training entries not similar to element j . c) Redefine: Set the support of element j to be only training samples similar to element j . d) Fixed: Maintain the support of element j , as in the K-SVD algorithm. In LGE-KSVD, similar is defined in terms of the LGE adjacency matrix W , i.e., training samples respect all samples

of same class or nearest neighbors, as defined by W . During the updating of dictionary element j , LGE-KSVD modifies the support of E j using the expand, contract, redefine, and fixed operations, creating a different η j and E Rj for each condition. j

Given four sets of η j , E Rj , j , and AR we choose the j and j

AR that minimize the penalty term (19) as:    j 2 J } = arg min  R J , A { −  A E ji ji Ri  R i=1:4

F

(22)

Equation (22) ensures that this LGE modification to K-SVD converges at a rate faster than or equal to K-SVD. Rather than assigning a single class to each element j , j LGE-KSVD uses the top coefficients from AR , i.e.: the most influential training samples for each dictionary element. Each of those top coefficients is used as a look-up into adjacency matrix W . All training samples similar to each of those top coefficients (as defined by W ) are used to expand, contract, or redefine E Rj before the SVD decomposition. The top coefficients are solved by keeping the top percentile of energy. Deciding which neighbors are similar, based on the top coefficients, is dependent upon the technique used to build W . By using a sparse W matrix, any non-zero entry is a similar neighbor to j . For example, by setting τ = 1.0 in (5), similar neighbors are those with either the same ground truth class, or neighbors that have a small 2 norm. It is often desirable to decrease α in (18) or increase τ in (5) for element neighbor similarity determination. Both modifications make W less sparse, and perhaps less discriminate, but simultaneously make W more open to finding relationships between diverse training samples. Fig. 3 summarizes the modified K-SVD algorithm. C. Testing Procedure for LGE-KSVD The training procedure solves for dimensionality reduction matrix U , dictionary, , and coefficient transformation matrix C. Given a test sample x, along with U , , and C, Fig. 4 summarizes the testing procedure. The coefficient transformation matrix C in Step 3 of the testing procedure converts the sparse coefficients â to class label estimates.

PTUCHA AND SAVAKIS: LGE-KSVD: ROBUST SR CLASSIFICATION

Fig. 4.

1743

Testing procedure for LGE-KSVD.

Fig. 6. Sample subjects exhibiting (from top to bottom) static facial expressions, facial identity, temporal emotion, and multi-view activity from the CK+, YaleB, GEMP-FERA, and i3DPost datasets respectively.

Fig. 5.

Full LGE-KSVD training procedure.

Ideally, l would contain most of its energy in one class. The normalized energy of li and other l values provide confidence indicators as to which class x belongs. D. Method Overview and Parameter Selection Because there are many aspects to LGE-KSVD, Fig. 5 summarizes the full training procedure. The choices and intuition for each step along with the optimization of the necessary tuning parameters are briefly summarized in this section. We begin with dimensionality reduction on the input data to minimize compute resources. We gravitate to the LGE

family of methods of which LPP generally outperforms other methods. The introduction of supervision to LGE is necessary to minimize coefficient contamination, but, without some unsupervised component it is difficult to capture class to class similarities. The first iteration uses semi-supervised LGE dimensionality reduction using (18), subsequent iterations minimize reconstruction error given  and A. Given LGE dimensionality reduction, we need to specify what flavor of SR to use, a value for τ in LPP, and the value of α to generate semi-supervised dimensionality reduction. Regarding the selection of a SR technique, several pursuit algorithms are quite good at solving (9) or (10). We found that it is optimal to use (10) with 0.1 ≤ λ ≤ 0.3. Furthermore, we use non-negative coefficients for SR classification purposes. Regarding the value of τ in (5), we generally require a small amount of supervision, and thus we select τ = 1 for all experiments. Regarding the selection of α in (18), as long as α = 0 and α = 1, results are reasonable. We recommend α values in the range 0.1 ≤ α ≤ 0.9, and use α = 0.5 for all experiments. Experiments will show that the introduction of K-SVD not only shrinks the size of the dictionary, but also results in higher classification accuracy. Regarding the selection of m, the size of the dictionary, we start by setting m = n/2, where n is the dataset size. Motivated to remove the support restrictions imposed by K-SVD, we adopt an LGE approach into the K-SVD algorithm. We find that increasing τ in the creation of the adjacency matrix used by K-SVD is necessary to find similar neighbors. A value of τ = 100 is a good rule of thumb which we use in all experimental results. The selection of top coefficients is done by picking any training sample that contains 50% or more of the total coefficient energy. Similar neighbors are those elements whose Euclidean distance in the W matrix includes 99% of all related training samples. This accounts for 100% of samples of the

1744

IEEE TRANSACTIONS ON IMAGE PROCESSING, VOL. 23, NO. 4, APRIL 2014

same class, and other samples that are deemed to be similar by the unsupervised LPP adjacency matrix. V. E XPERIMENT We evaluate the LGE-KSVD approach on four widely used databases: the extended Cohn-Kanade (CK+) facial expression dataset [34], the extended Yale B facial recognition database [35], the Facial Expression Recognition and Analysis Challenge (FERA2011) GEMEP-FERA [36] dataset, and the i3DPost multi-view activity recognition dataset [37], as shown in Fig. 6. We test each dataset across three categories of (i) dimensionality reduction; (ii) sparse representation; and (iii) combined techniques. The dimensionality reduction techniques include PCA, LDA, LPP [25], NPE [26], and Sparsity Preserving Projections (SPP) [21]. The sparse representation methods include K-SVD [29], LC-KSVD1 and LC-KSVD2 [7]. The combined methods include Sparse Representation-based Classification (SRC) [11], Manifold based Sparse Representation (MSR) [15], and our proposed LGE-KSVD method. A. Testing Datasets The CK+ [34] expression dataset contains 118 subjects in 327 sequences exhibiting the expressions of anger, disgust, fear, happiness, sadness, surprise, and contempt. An Active Appearance Model (AAM) automatically localizes 68 points on the face. The AAM eye and mouth corner points are used to define an affine warp to a canonical face of 60 × 51 pixels. As such, from this dataset we compare two variants: D = 68 × 2 = 136 (AAM point based), and D = 60 × 51 = 3060 (pixel based). Each has 164 training and 163 testing faces (chosen randomly), and the K-SVD methods use a dictionary size of m = 63 elements. The Extended YaleB facial recognition dataset contains 2,414 frontal images of 38 people under varying illumination and facial expression. Each face is 192 × 168 pixels which are reduced to D = 504 via random projections following [11]. The test set contains 1216 training faces and 1198 testing faces. The K-SVD methods use a dictionary size of m = 570 elements. The GEMEP-FERA temporal expression dataset contains 155 training and 134 testing videos. Each video sequence varies from 20–150 frames of 10 actors exhibiting the five emotions of anger, fear, joy, relief, and sadness. Automatically localized eye and mouth corner points define an affine warp to a canonical face of 60 × 51 pixels per each frame. A sequence of 16 frames at the 1/3rd and 2/3rd mark of each video is fed into Motion History Image (MHI) [38] analysis adapted for facial expression analysis [39]. As per [39], MHI yields a 24 × 20 dense optical flow per sequence. The X and Y coordinates at each 24 × 20 grid point for each of the two sequences formed the D = 1920 input dimensions per sample. The K-SVD methods use a dictionary size of m = 75 elements. The i3DPost multi-view [37] activity recognition dataset contains 768 videos of 8 people performing 12 actions from 8 views. The 12 activities are walk, run, jump, bend, handwave, jump in place, sit-stand, run-fall, walk-sit, run-jumpwalk, handshake, and pull. Each video is processed to extract

MHI features, giving 125 MHI sequences where each sequence contains 1500 motion vector points. PCA yielded 767 dimensions per video. The dataset contains 512 training videos and 256 testing videos. The K-SVD methods use a dictionary size of m = 450 elements. B. Testing Methodologies We compare LGE-KSVD with three types of classification approaches: dimensionality reduction followed by SVM, K-SVD approaches, and sparse representation classification with dimensionality reduction. Dimensionality reduction techniques capture 99.9% of the data variance and are followed by multi-class linear SVM classifiers. LDA uses equation (4), LPP uses (5), and NPE uses (6) and (7). SPP is similar to NPE, but modifies equation (6) to use sparse coefficients. The sparse representation techniques all use K-SVD to define a training dictionary of size m, where m < n. Coefficient transformation matrix C is generated from the training set as per (15). Test samples use the m element dictionary to generate sparse coefficients using (10), setting λ = 0.25. These sparse coefficients are converted to a class estimate using (23). LC-KSVD1 modifies the K-SVD objective function to favor clustering of coefficients by class and LC-KSVD2 further modifies the K-SVD objective function to include the solution of coefficient transformation matrix C. The SRC method uses random projection matrices for dimensionality reduction. The low dimensional projection of all training samples forms the training dictionary. The corresponding sparse coefficients of test samples use (11) to make a final classification estimate. The MSR method is identical to SRC, except the random projection dimensionality reduction is replaced with supervised LPP. All LPP methods use α = 0.5 in creation of W using (18). The LGE-KSVD method uses a τ = 1 for dimensionality reduction W and τ = 100 for element neighbor similarity W . The LGE-KSVD method keeps the top coefficients which j make up 50% of the total energy from AR . C. Experimental Results Table I demonstrates the performance of the five dimensionality reduction methods, the three sparsity based methods, and the two combined methods against LGE-KSVD on the 7-class CK+ dataset using the 68 AAM points. Because the data has only 136 dimensions, no dimensionality reduction is used for K-SVD, LC-KSVD1, LC-KSVD2, or SRC. This is a posed dataset, and as such LDA performs the best out of the dimensionality reduction techniques. The LGE-KSVD method has two numbers in the accuracy entry for Tables I–V. The first is with outer while loop (Fig. 5) iterative convergence turned off (1 iteration), and the second is the accuracy after convergence. The value in (·) after the second accuracy entry is the number of outer loop iterations required for convergence. Table II uses the same CK+ dataset from Table I, but uses 60 × 51 images as input. This higher dimensional space is not as discriminative as the 68 AAM points, but all methods do well because of the large class to class separation.

PTUCHA AND SAVAKIS: LGE-KSVD: ROBUST SR CLASSIFICATION

1745

TABLE I C LASSIFICATION R ESULTS ON THE 7-C LASS CK+ E XPRESSION D ATASET,

TABLE IV C LASSIFICATION R ESULTS ON THE 5-C LASS GEMEP-FERA E MOTION

U SING 68 AAM P OINTS . 164 T RAINING AND 163 T ESTING S AMPLES

D ATASET. MHI M OTION V ECTORS . 155 T RAINING AND

134 T ESTING V IDEOS

TABLE II C LASSIFICATION R ESULTS ON THE 7-C LASS CK+ E XPRESSION D ATASET,

TABLE V

U SING 60 × 51 I MAGES . 164 T RAINING AND 163 T ESTING S AMPLES

C LASSIFICATION R ESULTS ON THE 12-C LASS I 3DP OST M ULTI -V IEW A CTIVITY R ECOGNITION D ATASET. 512 T RAINING V IDEOS , 256 T ESTING V IDEOS

TABLE III C LASSIFICATION R ESULTS ON THE 38-C LASS YALE B R ECOGNITION D ATASET. 192 × 168 P IXEL I MAGES R EDUCED TO 504 D IMENSIONS VIA

R ANDOM P ROJECTIONS . 1216 T RAINING I MAGES , 1198 T ESTING I MAGES

Table III uses the 38-class YaleB facial recognition dataset. The 504 random projection input for all methods was further reduced in dimensionality as indicated by the d column, where d is the dimension where classification is performed. The SR methods are advantaged over the dimensionality reduction methods, while the MSR and LGE-KSVD methods perform the best.

Table IV uses the 5-class GEMEP-FERA emotion dataset. Two MHI optical flow sequences per video were used as input. The dimensionality reduction methods are advantaged over the SR methods, and the combined methods perform better than the dimensionality reduction methods. Table V uses the 12-class i3DPost multi-view activity recognition dataset. The 767 PCA projection input for all methods was further reduced in dimensionality as indicated by the d column. While there is no clear winner on this dataset, the semi-supervised LPP methods performed the best. The results in Tables I–V show impressive accuracy performance of LGE-KSVD across a wide variety of problem sets. We attribute this to the discriminative strengths of dimensionality reduction, the classification power of SR methods, along with the integration of LGE into the K-SVD dictionary learning architecture. When SR methods have insufficient training exemplars in , their performance lags behind SVM classification methods. When datasets are posed, LDA dimensionality reduction is preferred; when datasets are natural, semi-supervised LPP or NPE methods are preferred. The LGE-KSVD representation offers the discriminative properties of LDA while maintaining the local topology of complex data representations in the low

1746

Fig. 7. Sample angry face (shown in top left of each sub-figure) with SR coefficents in top two plots, and location of coefficients in lowest three dimensions of dimensionality reduced space on bottom. Left two plots are for PCA. Right two plots are for LPP. (Best if viewed at 200%.)

IEEE TRANSACTIONS ON IMAGE PROCESSING, VOL. 23, NO. 4, APRIL 2014

Fig. 8. Sample happy face (shown in top left of each sub-figure) with SR coefficents in top two plots, and location of coefficients in lowest three dimensions of dimensionality reduced space on bottom. Left two plots are for PCA. Right two plots are for LPP. (Best if viewed at 200%.)

dimensional manifold representations. As such, LGE-KSVD has been shown to be robust over datasets with few vs. many classes, high vs. low dimensionality, posed vs. spontaneous faces, static vs. temporal features, and across the classification problems of facial expression, facial recognition, and human activity recognition. D. Analysis of LGE-KSVD By introducing LGE before SR, we are able to achieve high classification accuracy and minimize compute resources. The utilization of K-SVD not only makes the framework more efficient, but also increases accuracy and robustness. Furthermore, the addition of supervision to dimensionality reduction minimizes coefficient contamination associated with SR classification. To demonstrate this effect, 1072 frames were extracted from the CK+ expression dataset comprised of 99 subjects and the five expressions of angry, happy, neutral, sad, and surprised. Every fourth sample was extracted for testing, forming a test set of 357 images. The remaining 715 images formed a training set for which the dimensionality reduction matrix U was derived and the low dimensional versions of each training sample formed a training dictionary. Fig. 7 shows a sample angry face (upper left of each plot) that has undergone dimensionality reduction with PCA and LPP using (4), before SR coefficient extraction. The top row [Fig. 7(a) for PCA and Fig. 7(b) for LPP] shows the SR coefficient magnitudes grouped by expression class. The small thumbnails correspond to the top 20 coefficients centered at each coefficient peak. The bottom row [Fig. 7(c) and (d)] shows the point distribution of the entire dataset color-coded by expression, with small thumbnails inserted at the location of the same top 20 coefficients. Fig. 7(a) illustrates that the top 20 coefficients from PCA exhibit contaminations and are strongly biased towards the

Fig. 9. Accuracy of LGE-KSVD as a function of adjacency matrix W parameter α in (18).

same test subject under various expressions. Fig. 7(b) illustrates that the top 20 coefficients from LPP are not prone to coefficient contamination and are obtained from multiple subjects exhibiting the same angry expression. A comparison of Fig. 7(c) with Fig. 7(d) illustrates the clustering ability LPP supervision adds to the dimensionality reduction process. Fig. 8 demonstrates the same characteristics as Fig. 7, this time using a sample happy face. PCA is now shown to include both happy faces from other subjects, as well as the same subject exhibiting different expressions. PCA clusters the samples by expression and identity, while LPP clusters by expression offering better class separation. By utilizing a semi-supervised variant of LGE in (18), we get excellent results across posed and natural datasets. Fig. 9 shows the effect of the α blend parameter used in (18). A value of α = 1.0 is fully supervised, and a value of α = 0 is fully unsupervised. We have consistently found that LGE-KSVD using (18) is quite robust for all 0.1 ≤ α ≤ 0.9. The LGE-KSVD framework was designed to minimize the number of iterations- ideally the goal is to have one

PTUCHA AND SAVAKIS: LGE-KSVD: ROBUST SR CLASSIFICATION

1747

TABLE VI A CCURACY I MPROVEMENTS BY LGE-KSVD I TERATIONS

Fig. 11. RMSE Error across 160 sub iterations of LGE-KSVD on the CK+ dataset using image pixels. Each iteration of LGE_KSVD has 20 K-SVD iterations. Each vertical line denotes the end of an iteration of the LGE-KSVD procdure with the corresponding test set classification accuracy denoted in parenthesis.

Fig. 10. Performance of the four K-SVD methods as a function of dictionary size on the i3DPost dataset.

iteration. Since (13) is not directly solvable, we need to solve iteratively for the SR and LGE parameters. Table VI shows the percent improvement from the first iteration of LGE-KSVD to the stopping condition for each of the five datasets. The GEMEP_FERA dataset has the lowest starting accuracy, and thus the potential for largest improvement. The number of LGE-KSVD iterations is often small, as the LGEKSVD method converges quickly. Fig. 10 shows the effect of the dictionary size m on the i3Dpost multi-view dataset. While the performance of other techniques decreases noticeably with smaller dictionary sizes, LGE-KSVD remains robust to dictionaries as small as m = 50. We attribute this to the clustering abilities of LGE before performing SR classification. Equation (13) minimizes the reconstruction error in a dimensionality-reduced space, while the end goal is to minimize the classification error. One benefit of successive reductions in reconstruction errors is that the input data is more faithfully represented and generally yields lower classification errors. Fig. 11 demonstrates this effect over eight iterations of the LGE-KSVD algorithm on the CK+ dataset using image pixels. Each iteration is separated by gray vertical bars and contains 20 modified K-SVD iterations. The blue line shows the RMSE reconstruction error. The numbers in (·) show the resulting classification accuracy at the end of each LGE-KSVD iteration. While these numbers are not monotonically decreasing, there is a strong trend towards achieving better classification accuracy with each iteration. The restriction of dictionary element support in the K-SVD algorithm was devised by [29] to ensure the solutions of

each element remained sparse; this is desirable in order to achieve an optimal solution with an overcomplete dictionary. The modified K-SVD algorithm enables the support of each dictionary element to grow or shrink with each call to SVD. While this enables LGE-KSVD to converge faster to minimal reconstruction errors, we may ask whether this is done at the risk of sacrificing sparsity. Experimental evidence suggests sparsity is indeed maintained across varied datasets. Further, because we use (10) to calculate sparse coefficients, given the current dictionary estimate at the beginning of each modified K-SVD iteration, sparsity is kept in check. To control sparsity using the modified K-SVD algorithm, the τ value in (5) should be kept high. This enables the W matrix to more easily find and discriminate between similar neighbors, and only the most similar are kept in each iteration. During dictionary initialization, each element is initialized with training samples from a single class. With each successive iteration, only training neighbors that are of the same class or have small distances are considered similar, and only similar training samples contribute toward each dictionary element. Figs. 12 and 13 demonstrate the change in support over twenty modified K-SVD iterations on the CK+ dataset. The restrict mode is in red, add is in green, subtract is in blue, and the fixed K-SVD default is in black. The mean support for the restrict, add, and subtract method per each iteration is shown in Fig. 13 and overlaid in thicker solid colored lines in Fig. 12. The support can go as high as the number of training samples, which for the CK+ dataset is 164 samples, meaning that a dictionary element can, in theory, be a linear combination of all 164 samples. With 67 dictionary elements, there are 67 updates per each modified K-SVD iteration. Fig. 13 shows 20 modified K-SVD iterations (1340 steps in Fig. 12). After an initial increase in support (or decrease in sparsity), each of the modified K-SVD methods level off after just a few iterations. Similar results are found on the other datasets. To provide better insight on the algorithm operation during each K-SVD iteration, Fig. 14 shows the first 4 modified

1748

Fig. 12. The change in dictionary element support for the CK+ dataset using image pixels across 20 modified K-SVD iterations in the restrict, add, subtract, and default modes. The dictionary size is 67, yielding 67 steps in each of the 20 iterations.

IEEE TRANSACTIONS ON IMAGE PROCESSING, VOL. 23, NO. 4, APRIL 2014

set. Fig. 14 shows these classes sorted in the first iteration with the numbers in (·) above/below the red curve indicating the breakdown of the distribution of elements for each class. Class one was given 9 out of 67 dictionary elements, class two was giving 4 out of 67, etc. During the first iteration, the restrict mode constrains elements to belong to the initial class with the height of each bar reflecting the number of training samples for each class. There are 167 training samples to pick from, the sum of the seven bar heights in the first iteration. In between each iteration, we re-solve for the coefficients A given the most current dictionary estimate. In the second iteration, one of the nine dictionary elements for class one had significant energy (defined by magnitude of coefficient) from a training sample from class seven. As such, this particular dictionary element was expanded to include this new training sample, thus expanding its support to samples from class one and seven. With each successive iteration, elements that are similar to one another are added or removed to make the most efficient use of the dictionary. We use τ = 100 for element neighbor similarity W to ensure the support of each dictionary element remains small. After five or six iterations, the activity settles down and only minor changes occur. After 20 modified K-SVD iterations, the clear delineation between classes in the first iteration is all but replaced with a more efficient representation that more clearly represents the topology of the training set. E. Time Complexity of LGE-KSVD

Fig. 13. The mean change in dictionary element support for the CK+ dataset using image pixels across 20 modified K-SVD iterations in the restrict, add, subtract, and default modes.

Fig. 14. The change in dictionary element support for the CK+ dataset using image pixels across 4 modified K-SVD iterations in the restrict, add, subtract, and default modes.

K-SVD iterations from Fig. 12. Fig. 12 used a semi-supervised W matrix to initialize the dictionary by setting α = 0.5 in (18). To clarify this point, Fig. 14 uses only the LDA component, by setting α = 1.0 in (18). The CK+ dataset has seven classes. The 67 element dictionary is initially divided amongst those classes to reflect the distribution in the training

Despite using batch orthogonal matching pursuit for the calculation of sparse coefficients and an SVD decomposition approximation for dictionary learning [30], the SR step used by LGE-KSVD is computationally demanding. During training, LGE-KSVD calls K-SVD 20 times, or computes dictionaries and sparse coefficients 20 times per inner iteration. Each outer iteration also includes the task of updating the dimensionality reduction matrix and coefficient transformation matrix, each requiring a matrix inverse. Referring to Fig. 5, we have the following analysis during training: Step 1: Calculation of U → O(D 3 ). Step 2: Calculation of low dimensional samples, Y T = X T U , X∈ R D×n , U ∈ R D×d → O(Ddn). Step 4: As per [30], we have: A) Generating sparse coefficients: O(2dm + 2Im + 3Im + 3I), where I is # iterations, which is fixed to 20, giving: O(2dm + 40m + 60m + 60) ≈ O(dm + 100m) B) The dictionary step: O(n(I 2 m+ 2dm)), where I is # iterations, which is fixed to 20, giving: O(n(400m + 2dm)) Because we have steps 4a…4d, we have O(4n(400m + 2dm)) Step 5: Calculation of C →≈ O(n 3 ) Summing up and substituting m = d/2, we get O(D 3 + Ddn + 2d + d 2 /2 + 100m + 1600nm + 4d2 n) per iteration.

PTUCHA AND SAVAKIS: LGE-KSVD: ROBUST SR CLASSIFICATION

The significance of the first term is dramatic as D d, however optimizations of this term are controlled by max(D, n) due to rank limitations of the matrix. In practice, the steps of generating sparse coefficients and dictionary learning using K-SVD are 2-3 times slower than the calculation of U using PCA, LDA, LPP, or NPE. Because SRC and MSR do not use dictionary learning, the dictionary update step is omitted. Because LGE-KSVD has steps 4a…4d, it is slower than K-SVD techniques by O(4n(400m + 2dm)). Multiple iterations of LGE-KSVD are costly, as the above analysis assumes one iteration. During testing, U, , and C are pre-computed. Referring to Fig. 4, we have: Step 1: y = x T U , x ∈ R D×1 , U ∈ R D×d → O(Dd) Step 2: as per [30] O(2dm + 2Im + 3Im + 3I), where I is # iterations, which is 1 for testing, giving: O(2dm + 4m + 1) Step 3: l = C T a, C ∈ Rm×k , a ∈ Rm×1 → O(mk) Once again, we let m = d/2, then summing up: O(Dd + d 2 + 2d + 1 + (0.5)dk) ≈ O(Dd + d 2 + d + dk). Because D d, the importance of Step 1 needs to be emphasized, without such, the d 2 term would become D 2 and dominate the analysis. Because D d, the testing or runtime difference between LGE-KSVD is quite similar to the dimensionality reduction methods with O(Dd). In summary, the training phase for LGE-KSVD is slower than other techniques, but the testing or run-time phase is comparable to other techniques.

VI. C ONCLUSION This paper presents LGE-KSVD, a new method that integrates manifold-based dimensionality reduction and sparse representations within a single framework. We use LGE dimensionality reduction techniques along with K-SVD dictionary learning to co-optimize dimensionality reduction matrix U , dictionary , training coefficients A, and coefficient transformation matrix C. We further leverage LGE dimensionality reduction concepts to modify the K-SVD dictionary learning algorithm such that the support of dictionary elements remains sparse, but is no longer fixed. Additionally, using dimensionality reduction as a preprocessing step to dictionary learning reduces compute resources. Iterating further improves both reconstruction and classification accuracy. Semi-supervised dimensionality reduction produces sufficient discrimination between input classes to minimize coefficient contamination, while preserving enough geometric detail from the high dimensional space for real world imagery. The modification of the dictionary element update step in K-SVD improves reconstruction and classification accuracy while making K-SVD more generally applicable to a broader spectrum of problems and less reliant on a good initialization. Results demonstrate that the proposed LGE-KSVD framework provides significant advantages over other state-of-the-art techniques across a wide variety of facial and activity classification problems.

1749

R EFERENCES [1] B. A. Olshausen and D. J. Field, “Sparse coding with an overcomplete basis set: A strategy employed by V1?” Vis. Res., vol. 37, no. 23, pp. 3311–3325, 1997. [2] B. A. Olshausen and D. J. Field, “Emergence of simple-cell receptive field properties by learning a sparse code for natural images,” Nature, vol. 381, no. 6583, pp. 607–609, 1996. [3] D. L. Donoho, M. Elad, and V. N. Temlyakov, “Stable recovery of sparse overcomplete representations in the presence of noise,” IEEE Trans. Inf. Theory, vol. 52, no. 1, pp. 6–18, Jan. 2006. [4] E. J. Candes, J. Romberg, and T. Tao, “Robust uncertainty principles: Exact signal reconstruction from highly incomplete frequency information,” IEEE Trans. Inf. Theory, vol. 52, no. 2, pp. 489–509, Feb. 2006. [5] H. Lee, A. Battle, R. Raina, and A. Ng, “Efficient sparse coding algorithms,” in Proc. Adv. NIPS, 2006, pp. 1–8. [6] B. Efron, T. Hastie, I. Johnstone, and R. Tibshirani, “Least angle regression,” Ann. Statist., vol. 32, no. 2, pp. 407–499, 2004. [7] Z. Jiang, Z. Lin, and L. S. Davis, “Learning a discriminative dictionary for sparse coding via label consistent K-SVD,” in Proc. IEEE Conf. CVPR, Jun. 2011, pp. 1697–1704. [8] M. Yang, L. Zhang, X. Feng, and D. Zhang, “Fisher discrimination dictionary learning for sparse representation,” in Proc. IEEE Int. Conf. Comput. Vis., Nov. 2011, pp. 543–550. [9] M. Julien, B. Francis, P. Jean, and S. Guillermo, “Online learning for matrix factorization and sparse coding,” J. Mach. Learn. Res., vol. 11, pp. 19–60, Mar. 2010. [10] A. Yang, J. Wright, Y. Ma, and S. Sastry, “Feature selection in face recognition: A sparse representation perspective,” Univ. California, Berkely, CA, USA, Tech. Rep. UCB/EECS-2007-99, 2007. [11] J. Wright, A. Y. Yang, A. Ganesh, S. S. Sastry, and M. Yi, “Robust face recognition via sparse representation,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 31, no. 2, pp. 210–227, Feb. 2009. [12] J. Sherrah, S. Gong, and E. J. Ong, “Face distributions in similarity space under varying head pose,” Image Vis. Comput., vol. 19, no. 12, pp. 807–819, 2001. [13] G. Tzimiropoulos, S. Zafeiriou, and M. Pantic, “Sparse representations of image gradient orientations for visual recognition and tracking,” in Proc. IEEE CVPRW, Jun. 2011, pp. 26–33. [14] S. Zafeiriou and M. Petrou, “Sparse representations for facial expressions recognition via l1 optimization,” in Proc. IEEE CVPRW, Jun. 2010, pp. 32–39. [15] R. Ptucha, G. Tsagkatakis, and A. Savakis, “Manifold based sparse representation for robust expression recognition without neutral subtraction,” in Proc. IEEE ICCV, Nov. 2011, pp. 2136–2143. [16] R. Ptucha and A. Savakis, “Joint optimization of manifold learning and sparse representations,” in Proc. 10th IEEE Autom. Face Gesture Recognit., Apr. 2013, pp. 1–7. [17] E. Elhamifar and R. Vidal, “Sparse subspace clustering,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., Jun. 2009, pp. 2790–2797. [18] A. Ghodsi, “ Dimensionality reduction a short tutorial,” Ph.D. dissertation, Dept. Statist. Actuarial Sci., Univ. Waterloo, ON, CA, 2006. [19] L. Cayton, “Algorithms for manifold learning,” Univ. California, San Diego, CA, USA, Tech. Rep. CS2008-0923, 2005. [20] C. Deng, H. Xiaofei, H. Yuxiao, H. Jiawei, and T. Huang, “Learning a spatially smooth subspace for face recognition,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., Jun. 2007, pp. 1–7. [21] L. Qiao, S. Chen, and X. Tan, “Sparsity preserving projections with applications to face recognition,” Pattern Recognit., vol. 43, no. 1, pp. 331–341, 2010. [22] J. Mairal, F. Bach, J. Ponce, G. Sapiro, and A. Zisserman, “Discriminative learned dictionaries for local image analysis,” in Proc. IEEE Comput. Vis. Pattern Recognit., Jun. 2008, pp. 1–8. [23] F. Zang and J. Zhang, “Discriminative learning by sparse representation for classification,” Neurocomputing, vol. 74, nos. 12–13, pp. 2176–2183, 2011. [24] M. Zheng, J. Bu, C. Chen, C. Wang, L. Zhang, G. Qiu, et al., “Graph regularized sparse coding for image representation,” IEEE Trans. Image Process., vol. 20, no. 5, pp. 1327–1336, May 2011. [25] X. He and P. Niyogi, “Locality Preserving Projections,” in Proc. Adv. NIPS, 2003, pp. 1–8. [26] X. He, D. Cai, S. Yan, and H.-J. Zhang, “Neighborhood preserving embedding,” in Proc. IEEE ICCV, Oct. 2005, pp. 1208–1213. [27] Y. C. Pati, R. Rezaiifar, and P. S. Krishnaprasad, “Orthogonal matching pursuit: Recursive function approximation with applications to wavelet decomposition,” in Proc. 27th Asilomar Conf. Signals, Syst. Comput., Nov. 1993, pp. 40–44.

1750

[28] E. P. Simoncelli, W. T. Freeman, E. H. Adelson, and D. J. Heeger, “Shiftable multiscale transforms,” IEEE Trans. Inf. Theory, vol. 38, no. 2, pp. 587–607, Mar. 1992. [29] M. Aharon, M. Elad, and A. Bruckstein, “K-SVD: An algorithm for designing overcomplete dictionaries for sparse representation,” IEEE Trans. Signal Process., vol. 54, no. 11, pp. 4311–4322, Nov. 2006. [30] R. Rubinstein, M. Zibulevsky, and M. Elad, “Efficient implementationof the K-SVD algorithm using batch orthogonal matching pursuit,” Dept. Comput. Sci., Israel Inst. Technol., Haifa, Israel, Tech. Rep. CS-2008-08, 2008. [31] Q. Zhang and B. Li, “Discriminative K-SVD for dictionary learning in face recognition,” in Proc. IEEE Conf. CVPR, Jun. 2010, pp. 2691–2698. [32] J. Wang, J. Yang, K. Yu, F. Lv, T. Huang, and Y. Gong, “Localityconstrained linear coding for image classification,” in Proc. IEEE Conf. CVPR, Jun. 2010, pp. 3360–3367. [33] Z. Lai, Z. Jin, and J. Yang, “Global sparse representation projections for feature extraction and classification,” in Proc. Chin. Conf. Pattern Recognit., Nov. 2009, pp. 1–5. [34] P. Lucey, J. F. Cohn, T. Kanade, J. Saragih, Z. Ambadar, and I. Matthews, “The extended Cohn-Kanade dataset (CK+): A complete dataset for action unit and emotion-specified expression,” in Proc. IEEE Comput. Soc. Conf. Comput. Vis. Pattern Recognit. Workshops, Jun. 2010, pp. 94–101. [35] A. S. Georghiades, P. N. Belhumeur, and D. J. Kriegman, “From few to many: Illumination cone models for face recognition under variable lighting and pose,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 23, no. 6, pp. 643–660, Jun. 2001. [36] M. Valstar, B. Jiang, M. Mehu, M. Pantic, and K. R. Scherer, “The first facial expression recognition and analysis challenge,” in Proc. IEEE Int. Conf. Autom. Face Gesture Recognit. Workshops, Mar. 2011, pp. 921–926. [37] N. Gkalelis, H. Kim, A. Hilton, N. Nikolaidis, and I. Pitas, “The i3DPost multi-view and 3D human action/interaction,” in Proc. CVMP, 2009, pp. 159–168. [38] A. F. Bobick and J. W. Davis, “The recognition of human movement using temporal templates,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 23, no. 3, pp. 257–267, Mar. 2001. [39] R. Ptucha and A. Savakis, “Fusion of static and temporal predictors for unconstrained facial expression recognition,” in Proc. 19th IEEE ICIP, Oct. 2012, pp. 2597–2600. [40] H. V. Nguyen, V. M. Patel, N. M. Nasrabadi, and R. Chellappa, “Sparse embedding: A framework for sparsity promoting dimensionality reduction,” in Proc. 12th Eur. Conf. Comput. Vis., 2012, pp. 414–427. [41] R. Ptucha and A. Savakis, “LGE-KSVD: Flexible dictionary learning for optimized sparse representation classification,” in Proc. IEEE Conf. CVPRW, Jun. 2013, pp. 854–861.

IEEE TRANSACTIONS ON IMAGE PROCESSING, VOL. 23, NO. 4, APRIL 2014

Raymond Ptucha received the B.S. degree in computer science and electrical engineering from the State University of New York at Buffalo, and the M.S. and Ph.D. degrees in imaging science and computer science from the Rochester Institute of Technology. He was a Research Scientist with Eastman Kodak Company, where he worked on computational imaging algorithms. He holds 16 U.S. patents and has an additional 32 patent applications on file. He is currently an Assistant Professor with the Computer Engineering Department, Rochester Institute of Technology. He is a Secretary of the IEEE Rochester Section, serves as reviewer for the Journal of Electronic Imaging and the Journal of Image Science and Technology, and is on the Technical Committee for the 2014 IEEE International Conference on Multimedia and Expo. His research interests include machine learning, computer vision, and robotics.

Andreas E. Savakis is a Professor of computer engineering with the Rochester Institute of Technology, Rochester, NY, USA, and an affiliated Professor with the Center for Imaging Science and the Doctoral program in computing and information sciences. He received the B.S. (Hons.) and M.S. degrees in electrical engineering from Old Dominion University, Norfolk, VA, USA, and the Ph.D. degree in electrical engineering from North Carolina State University, Raleigh, NC, USA. He held positions with the University of Rochester Medical Center and the Eastman Kodak Research Laboratories before joining Rochester Institute of Technology, where he served as the Department Head of computer engineering from 2000 to 2011. He was a fellow of the American Council on Education from 2011 to 2012. He has co-authored over 100 publications and holds 11 U.S. patents. He received the IEEE Third Millennium Medal in 2000, the NYSTAR Technology Transfer Award for Economic Impact in 2006, the IEEE Region 1 Award for Outstanding Teaching for contributions to Education in Computer Engineering and Multimedia in 2011, and the Best Paper Award at the International Symposium on Visual Computing in 2013. He currently serves as an Associate Editor for the Journal of Electronic Imaging. His research interests include object detection and tracking, expression and activity recognition, manifold learning, sparse representations, and computer vision applications.

LGE-KSVD: robust sparse representation classification.

The parsimonious nature of sparse representations has been successfully exploited for the development of highly accurate classifiers for various scien...
3MB Sizes 5 Downloads 3 Views