
Robust Exemplar Extraction Using Structured Sparse Coding

Huaping Liu, Yunhui Liu, and Fuchun Sun

Abstract—Robust exemplar extraction from a noisy sample set is one of the most important problems in pattern recognition. In this brief, we propose a novel approach for exemplar extraction through structured sparse learning. The new model accounts not only for the reconstruction capability and the sparsity, but also for the diversity and robustness. To solve the optimization problem, we design an iterative algorithm based on the alternating direction method of multipliers (ADMM). Finally, the effectiveness of the approach is demonstrated by experiments on various data sets, including traffic sign sequences.

Index Terms—Alternating direction method of multipliers (ADMM), robust exemplar extraction, structured sparse coding, traffic sign recognition.

I. INTRODUCTION

Automatic exemplar extraction aims to select a small set of representatives from a given sample set. It is closely related to clustering, but the difference is clear: clustering merely groups the data samples, and the cluster centers, which are not required to be samples themselves, cannot directly serve as exemplars. Exemplar extraction has become important because classification algorithms that work on the exemplars alone save computational time and memory. The problem also has extensive applications in video abstraction [3], image collection summarization [19], and other fields.

Although exemplar extraction has been studied extensively over the past years, how to select samples that preserve the essential characteristics of the original sample set while isolating the outliers remains a challenging problem. Recently, sparse coding [9] was successfully applied to exemplar extraction in [3] and [5], which exploit the self-expressiveness property of the samples [2]. An important merit of such methods is that they can produce a desired number of exemplars and supply a ranked list of exemplars, which is highly desirable in practice; as a result, changing configurations such as the number of exemplars incurs no additional complexity. In addition, [10], [12], and [16] proposed different convex formulations, but only [12] investigated the exemplar extraction problem. However, in many real-world problems, the collected data contain sample outliers.

A method that robustly finds exemplars for such a data set is of particular importance, as it reduces the redundancy of the data and removes points that do not really belong to the data set; this issue is not considered in [3] and [12]. Elhamifar et al. [5] discussed how their sparse modeling method can deal with outliers and robustly find exemplars. Their method is based on the fact that outliers are often incoherent with respect to the collection of true data. They defined a row-sparsity index (RSI) for each candidate exemplar to detect outliers, with a prescribed threshold. However, this index is very sensitive to noise, and even outlier samples may be selected as exemplars. In addition, the outliers are not isolated during the sparse coding itself, so the obtained solution may be seriously influenced by them, which further deteriorates both the extraction of exemplars and the isolation of outliers. Although this problem was noticed early on in [11], [14], and [18], those works concern clustering rather than exemplar extraction.

On the other hand, exemplar diversity is not incorporated into the sparse coding process of [3] and [5], so too-similar samples may be extracted as exemplars. To tackle this problem, an extra postprocessing module is usually required to diversify the exemplars. Such a procedure may reduce the quality of the extracted exemplars because the sparse coding is performed before the diversification. In fact, the sparse coding models in [3] and [5] do not encode the requirement that the extracted exemplars should not be too similar and should cover the full data set as much as possible; instead, they treat the data samples separately and neglect the intrinsic correlation between samples.

In this brief, we address the problem of robust exemplar extraction from a noisy sample set. The main contributions are: 1) we propose a novel structured sparse coding model that simultaneously characterizes the reconstruction capability, the sample diversity, and the robustness; 2) we design an iterative optimization algorithm to solve the resulting optimization problem; and 3) we design various structure constraints for different data sets and incorporate them into our unified optimization model to obtain promising classification results.

The rest of this brief is organized as follows. In Section II, we present the problem formulation and the proposed model. In Section III, we introduce the optimization algorithm. In Section IV, we describe the experiments and evaluation results, comparing our method with the state of the art. Finally, conclusions are drawn in Section V.

Notations: Let $M \in R^{r \times c}$. We use superscripts for the rows of $M$, i.e., $M^{(i)}$ denotes the $i$th row, and subscripts for the columns, i.e., $M_{(j)}$ denotes the $j$th column. We use several matrix norms: $\|M\|_F$ is the Frobenius norm, equal to $(\mathrm{Tr}(M^T M))^{1/2}$; $\|M\|_{2,1}$ is the sum of the $L_2$ norms of the rows of $M$, $\|M\|_{2,1} = \sum_{i=1}^{r} \|M^{(i)}\|_2$; and $\|M\|_{1,2}$ is the sum of the $L_2$ norms of the columns of $M$, $\|M\|_{1,2} = \sum_{j=1}^{c} \|M_{(j)}\|_2$.

II. PROBLEM FORMULATION

The exemplars can be regarded as a collection of representatives extracted from the underlying data set. Exemplar extraction is therefore equivalent to selecting an optimal subset from the entire data set under certain constraints.



Consider a matrix $D = [d_1, d_2, \ldots, d_N] \in R^{d \times N}$, where each column is a sample represented as a feature vector. The task is to find an optimal subset $\bar{D} = [d_{i_1}, d_{i_2}, \ldots, d_{i_n}] \in R^{d \times n}$, where $i_1, i_2, \ldots, i_n \in \{1, 2, \ldots, N\}$, such that the original set can be accurately reconstructed and the size $n$ is as small as possible.

In practice, the extracted exemplars should satisfy several desirable properties: representativeness, sparsity, diversity, and robustness. Representativeness means that the cost of reconstructing the whole data set from the exemplars should be small. Sparsity means that the number of extracted exemplars should be as small as possible. Diversity is a measure of nonredundancy: a good exemplar set should not contain similar information. Robustness means that the exemplar extraction should be insensitive to outliers, which often occur in practical scenarios.

A straightforward approach is to minimize the following objective function:

$$\min_{X} \; \|X\|_{2,1} + \lambda \|D - DX\|_F^2 \qquad (1)$$

where $X \in R^{N \times N}$ is the pursued coefficient matrix, the term $\|D - DX\|_F^2$ evaluates the reconstruction error, and the parameter $\lambda$ balances the two penalty terms. Adding the $L_{2,1}$-norm term to the objective avoids the trivial solution and makes the obtained solution row sparse, i.e., most of its rows are zero vectors. The above optimization problem can be solved efficiently [3], [5]. After obtaining $X$, the $L_2$ norm of its $i$th row, denoted $\|X^{(i)}\|_2$, evaluates how likely the $i$th sample is to be an exemplar.

In the original model (1), sparsity is imposed on the rows of the matrix $X$, while the reconstruction error is evaluated with the Frobenius norm. It is well known that such a reconstruction error is easily influenced by outliers. Elhamifar et al. [5] proposed the RSI to detect outliers after the sparse coding stage. That approach has two major shortcomings. First, since the outlier detection is performed after the sparse coding stage, the coding vectors themselves are influenced by the outliers and may therefore be inaccurate. Second, the RSI is suitable for the noise-free case but is very sensitive to noise (as we shall see in the experimental section).

To isolate the outliers, we exploit the fact that outliers should not be well reconstructed by the exemplars, so the corresponding reconstruction errors should be large. That is, if the $i$th sample is an outlier, the $i$th column of the reconstruction error matrix $D - DX$ will be nonzero and may take large values; if the $i$th sample is an inlier, the $i$th column of $D - DX$ will be close to zero. On the other hand, the outliers are usually a minority of the sample set, i.e., their number is not too large. We exploit this property and formalize it as the column-sparsity regularization term $\|D - DX\|_{1,2}$. We therefore modify the original model (1) by replacing the Frobenius norm with the $L_{1,2}$ norm:

$$\min_{X, E} \; \|X\|_{2,1} + \lambda \|E\|_{1,2}, \quad \text{s.t. } D - DX = E. \qquad (2)$$

In this model, row sparsity is imposed on $X$ to detect the exemplars and column sparsity is imposed on the error matrix $E$ to isolate the outliers. In this way, the extracted exemplars focus on reconstructing the inliers only. The idea is illustrated in Fig. 1.

Fig. 1. Illustration of the proposed method. $D$ is composed of $N$ samples. The textured squares represent the data elements in $D$; blank squares represent zero elements and colored squares represent nonzero elements. The row sparsity of $X$ helps to extract the exemplars and the column sparsity of $E$ helps to isolate the outliers. In this illustrative example, the samples $d_2$ and $d_{N-1}$ are extracted as exemplars and the sample $d_3$ is isolated as an outlier. In our method, an extra diversity term is imposed on $X$ to obtain more diverse exemplars (see the text for details).

However, the above model still neglects diversity. Since very similar sample points may lead to near-duplicate exemplars, the exemplar set must otherwise be pruned of too-close sample points [5], and such a postprocessing module complicates the design procedure. In practice, diversity can be characterized by requiring that similar samples have little chance of being selected as exemplars simultaneously. Concretely, if the $i$th and $j$th samples $\{d_i, d_j\}$ are very similar, they should not both be selected as exemplars. Denoting by $S \in R^{N \times N}$ a matrix whose $(i, j)$th element represents the similarity between the $i$th and $j$th samples, we construct the following new optimization problem:

$$\min_{X, E} \; \sum_{i,j=1}^{N} \|X^{(i)}\|_2 \, S_{ij} \, \|X^{(j)}\|_2 + \lambda \|E\|_{1,2}, \quad \text{s.t. } D - DX = E. \qquad (3)$$

In this brief, $S_{ij}$ is very important because it controls the mutual inhibition between similar samples. The design of $S$ is flexible and can incorporate domain knowledge; the general guideline is that $S_{ij}$ should be large if $d_i$ and $d_j$ are similar. In Section IV, we present various examples of different forms of $S$.

Strictly speaking, the term $\sum_{i,j=1}^{N} \|X^{(i)}\|_2 S_{ij} \|X^{(j)}\|_2$ does not directly encourage the row sparsity of $X$. To encourage row sparsity, one could add an extra penalty term $\|X\|_{2,1}$ to the objective in (3), but this would introduce more tuning parameters. In fact, even if row sparsity is not achieved, exemplar extraction can still be realized easily, since a ranking strategy based on the weights $\|X^{(i)}\|_2$ can be used to select the top several exemplars. In this sense, the structured penalty term $\sum_{i,j=1}^{N} \|X^{(i)}\|_2 S_{ij} \|X^{(j)}\|_2$ is more important than the sparsity-encouraging term $\|X\|_{2,1}$. Further, (3) reduces to (2) if we set $S_{ij} = 1/(N \|X^{(j)}\|_2)$. The proposed model is therefore an extension of the model in (2), which is in turn a robust version of the model in (1).

Before closing this section, we point out that a similar lateral inhibition model was proposed very recently in [8]; its development was based on the $L_1$ norm rather than the $L_{2,1}$ norm, and the work in [8] did not address exemplar extraction or outlier isolation.
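To make the model concrete, the following short NumPy sketch (our illustration, not the authors' code; the function name and the random placeholders are assumptions) evaluates the objective of (3) for a given coefficient matrix $X$, error matrix $E$, and similarity matrix $S$.

```python
import numpy as np

def structured_objective(X, E, S, lam):
    """Value of sum_{i,j} ||X^(i)||_2 * S_ij * ||X^(j)||_2 + lam * ||E||_{1,2}."""
    row_norms = np.linalg.norm(X, axis=1)           # ||X^(i)||_2 for each row i
    diversity = row_norms @ S @ row_norms           # structured (inhibition) term
    col_sparsity = np.linalg.norm(E, axis=0).sum()  # ||E||_{1,2}: sum of column norms
    return diversity + lam * col_sparsity

# Example usage with random placeholders:
# N, d = 50, 20
# D = np.random.randn(d, N)
# X = np.random.randn(N, N); E = D - D @ X
# S = np.abs(D.T @ D); np.fill_diagonal(S, 0.0)
# print(structured_objective(X, E, S, lam=10.0))
```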


III. OPTIMIZATION ALGORITHM

The optimization problem in (3) is nonconvex, and therefore we resort to a specially designed algorithm to solve it. First, the problem in (3) can be equivalently rewritten as

$$\min_{X, E} \; \|WX\|_{2,1} + \lambda \|E\|_{1,2}, \quad \text{s.t. } D - DX = E \qquad (4)$$

where $W$ is a diagonal matrix whose $i$th diagonal element is

$$W_{ii} = \sum_{j \in \{j \,|\, S_{ij} \neq 0\}} S_{ij} \, \|X^{(j)}\|_2. \qquad (5)$$
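As a small illustration (our sketch, not code from the brief), the reweighting step (5) amounts to a matrix-vector product followed by building a diagonal matrix:

```python
import numpy as np

def compute_W(X, S):
    """Diagonal weights W_ii = sum_j S_ij * ||X^(j)||_2, as in (5)."""
    row_norms = np.linalg.norm(X, axis=1)   # ||X^(j)||_2 for every row j
    return np.diag(S @ row_norms)           # diagonal matrix with W_ii on the diagonal
```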


If we fix $W$, the problem above can be regarded as a weighted $L_{2,1}$ optimization problem. Here, we adopt the ADMM [1] to solve it. To this end, we rewrite the problem as

$$\min_{G, X, E} \; \|WG\|_{2,1} + \lambda \|E\|_{1,2}, \quad \text{s.t. } D - DX = E, \; X = G. \qquad (6)$$

The augmented Lagrangian associated with (6) is

$$L(G, X, E, Y_1, Y_2) = \|WG\|_{2,1} + \lambda \|E\|_{1,2} + \mathrm{Tr}\!\left(Y_1^{T}(D - DX - E)\right) + \frac{\mu}{2}\|D - DX - E\|_F^2 + \mathrm{Tr}\!\left(Y_2^{T}(X - G)\right) + \frac{\mu}{2}\|X - G\|_F^2 \qquad (7)$$

where $Y_1$ and $Y_2$ are the dual variables (i.e., the Lagrange multipliers) and $\mu$ is a positive scalar. To find a minimizer of the constrained problem (6), the ADMM uses the sequence of iterations

$$\begin{cases} G^{(k+1)} = \arg\min_G L(G, X^{(k)}, E^{(k)}, Y_1^{(k)}, Y_2^{(k)}) \\ X^{(k+1)} = \arg\min_X L(G^{(k+1)}, X, E^{(k)}, Y_1^{(k)}, Y_2^{(k)}) \\ E^{(k+1)} = \arg\min_E L(G^{(k+1)}, X^{(k+1)}, E, Y_1^{(k)}, Y_2^{(k)}) \\ Y_1^{(k+1)} = Y_1^{(k)} + \mu (D - DX^{(k+1)} - E^{(k+1)}) \\ Y_2^{(k+1)} = Y_2^{(k)} + \mu (X^{(k+1)} - G^{(k+1)}) \end{cases} \qquad (8)$$

until $\|D - DX^{(k+1)} - E^{(k+1)}\|_F \le \epsilon$ and $\|X^{(k+1)} - G^{(k+1)}\|_F \le \epsilon$, where $\epsilon$ is the tolerance. In the following, we explain how to solve the subproblems in (8).

First, the optimization over $G$ is equivalent to

$$\min_G L(G) = \frac{2}{\mu}\|WG\|_{2,1} + \|G - V\|_F^2 \qquad (9)$$

where $V = \frac{1}{\mu} Y_2^{(k)} + X^{(k)}$. According to [13], the $i$th row of the optimal solution $G$ can be obtained analytically as

$$G^{(i)} = \begin{cases} \left(1 - \dfrac{W_{ii}}{\mu \|V^{(i)}\|_2}\right) V^{(i)}, & \|V^{(i)}\|_2 > \dfrac{1}{\mu} W_{ii} \\ 0, & \text{otherwise.} \end{cases} \qquad (10)$$

Second, the optimization over $X$ is equivalent to

$$\min_X L(X) = \mathrm{Tr}\!\left(Y_1^{(k)T}(D - DX - E^{(k)})\right) + \frac{\mu}{2}\|D - DX - E^{(k)}\|_F^2 + \mathrm{Tr}\!\left(Y_2^{(k)T}(X - G^{(k+1)})\right) + \frac{\mu}{2}\|X - G^{(k+1)}\|_F^2$$

which is equivalent to

$$\min_X L(X) = \left\|D - DX - E^{(k)} + \frac{1}{\mu} Y_1^{(k)}\right\|_F^2 + \left\|X - G^{(k+1)} + \frac{1}{\mu} Y_2^{(k)}\right\|_F^2.$$

The solution is

$$X = (I + D^T D)^{-1}\left(D^T D - D^T E^{(k)} + G^{(k+1)} + \frac{1}{\mu} D^T Y_1^{(k)} - \frac{1}{\mu} Y_2^{(k)}\right). \qquad (11)$$

Finally, the optimization over $E$ is equivalent to

$$\min_E L(E) = \frac{2\lambda}{\mu}\|E\|_{1,2} + \|E - U\|_F^2 \qquad (12)$$

where $U = \frac{1}{\mu} Y_1^{(k)} + D - DX^{(k+1)}$. Similar to the solution of $G$, the $j$th column of the optimal solution $E$ can be obtained analytically as

$$E_{(j)} = \begin{cases} \left(1 - \dfrac{\lambda}{\mu \|U_{(j)}\|_2}\right) U_{(j)}, & \|U_{(j)}\|_2 > \dfrac{\lambda}{\mu} \\ 0, & \text{otherwise.} \end{cases} \qquad (13)$$

The complete procedure, which includes the above optimization steps, is summarized in Algorithm 1.

Algorithm 1 Optimization Algorithm
Input: data set $D \in R^{d \times N}$
Output: solutions $X \in R^{N \times N}$ and $E \in R^{d \times N}$
1: Initialize $G$, $X$, $E$, $Y_1$, and $Y_2$ with appropriate dimensions.
2: while not convergent do
3:   Update $G$, $X$, $E$, $Y_1$, $Y_2$ according to (8).
4:   Update $W$ according to (5).
5:   Update $\mu$ as $\min(1.1\mu, 10^{10})$.
6: end while
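For concreteness, the following NumPy sketch is one possible realization of Algorithm 1 using the updates (5), (10), (11), and (13). It is our illustration under assumed defaults (e.g., the initial $\mu = 1$, the tolerance, and the function name are not specified in the brief); it is not the authors' released code.

```python
import numpy as np

def robust_exemplar_extraction(D, S, lam=10.0, mu=1.0, tol=1e-4, max_iter=100):
    """ADMM-style sketch of Algorithm 1. D: d x N data, S: N x N similarity."""
    d, N = D.shape
    G = np.zeros((N, N)); X = np.zeros((N, N)); E = np.zeros((d, N))
    Y1 = np.zeros((d, N)); Y2 = np.zeros((N, N))
    DtD = D.T @ D
    for _ in range(max_iter):
        # Reweighting step (5): W_ii = sum_j S_ij ||X^(j)||_2
        w = S @ np.linalg.norm(X, axis=1)
        # G update (10): row-wise shrinkage of V = Y2/mu + X
        V = Y2 / mu + X
        v_norms = np.maximum(np.linalg.norm(V, axis=1), 1e-12)
        G = np.maximum(1.0 - w / (mu * v_norms), 0.0)[:, None] * V
        # X update (11): closed-form least-squares solution
        rhs = DtD - D.T @ E + G + (D.T @ Y1 - Y2) / mu
        X = np.linalg.solve(np.eye(N) + DtD, rhs)
        # E update (13): column-wise shrinkage of U = Y1/mu + D - D X
        U = Y1 / mu + D - D @ X
        u_norms = np.maximum(np.linalg.norm(U, axis=0), 1e-12)
        E = np.maximum(1.0 - lam / (mu * u_norms), 0.0) * U
        # Dual updates in (8) and the penalty update (step 5 of Algorithm 1)
        R1 = D - D @ X - E
        R2 = X - G
        Y1 += mu * R1
        Y2 += mu * R2
        mu = min(1.1 * mu, 1e10)
        if np.linalg.norm(R1) <= tol and np.linalg.norm(R2) <= tol:
            break
    return X, E
```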

Although our problem is not convex and we do not have the usual guarantees, empirically the proposed iterative algorithm works very well and converges after about 30-40 iterations, regardless of the initialization. After deriving $X$, we use the weight $\|X^{(i)}\|_2$ to rank the samples: the larger $\|X^{(i)}\|_2$ is, the more important the sample is. We can either extract a fixed number of the most important samples or set a threshold and keep the samples whose $\|X^{(i)}\|_2$ exceeds it. Note that no extra postprocessing module such as the one used in [5] is needed in our approach. In addition, if one is interested in the outliers, the same strategy can be applied to the column norms of the obtained solution $E$ to extract the outliers for further analysis.

Remark 3.1: Note that the objective function in (3) is nonconvex. Nonconvex optimization is currently a hot topic, but its theoretical analysis is very challenging [4]. In this brief, a multistage optimization method is proposed to solve the problem. Note that if our method terminates within one iteration, it is equivalent to (2) and does not directly encourage diversity. The obtained solution is therefore a refinement of the global solution of the initial convex relaxation, and intuitively one expects it to be better than the standard one-stage convex relaxation, which is equivalent to (2). The intuition is to refine the convex relaxation iteratively using solutions obtained from earlier stages; this leads to better and better relaxation formulations, and thus better and better solutions. Multistage methods have been investigated in [7] and [20], but how to analyze the convergence under the proposed ADMM formulation is still an open problem. To improve the local convergence, we update the parameter $\mu$ at each iteration. Empirically, the developed algorithm always converges within 100 iterations. A theoretical analysis of the local convergence is left for future work.

Remark 3.2: In the initialization stage of Algorithm 1, we usually initialize the matrices to zero. Such a trivial initialization works well in our experiments. However, since the proposed model (3) is nonconvex, one can also try multiple random initializations and pick the one that yields the minimum objective value.
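As a usage note, exemplars and outliers can be read off from the returned solution as sketched below; the function name, the fixed number of exemplars, and the thresholding rule are illustrative choices, not prescriptions from the brief.

```python
import numpy as np

def select_exemplars_and_outliers(X, E, n_exemplars, outlier_thresh):
    """Rank samples by row norms of X; flag outliers by column norms of E."""
    exemplar_scores = np.linalg.norm(X, axis=1)         # ||X^(i)||_2 per sample
    exemplar_idx = np.argsort(-exemplar_scores)[:n_exemplars]
    outlier_scores = np.linalg.norm(E, axis=0)          # ||E_(j)||_2 per sample
    outlier_idx = np.where(outlier_scores > outlier_thresh)[0]
    return exemplar_idx, outlier_idx
```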

IV. EXPERIMENTAL RESULTS

In this section, we first present a numerical example to validate the outlier rejection capability. Then, we design various structured constraint matrices for different applications to show the effectiveness of the structured sparse coding.

A. Numerical Example

The numerical example, which is designed to fairly compare the proposed method with the highly related SRMS method of [5], is generated as follows. First, we randomly generate ten 50-D column vectors $d_1, d_2, \ldots, d_{10}$ and use the equation

$$d_j = \sum_{i=1}^{10} \alpha_{ij} d_i, \quad j = 11, 12, \ldots, 250$$

where each $\alpha_{ij} \in (0, 1)$ is a randomly generated scalar, to produce the other 240 vectors. Then, we construct a matrix $D \in R^{50 \times 250}$ as $D = [d_1, d_2, \ldots, d_{250}]$. Under this setting it is reasonable to expect that the first ten atoms $d_1$-$d_{10}$ should be detected as exemplars. Finally, we replace $d_j$ for $j \in \{20, 50, 100, 120, 130, 140, 150, 180, 190, 200\}$ with newly generated random vectors, so that ten outliers are incorporated into the sample set. We then normalize all columns of $D$ to unit $L_2$ norm.

In the top-left panel of Fig. 2, we show the $L_2$ norm of each row of the solution $X$ obtained by the SRMS method [5]. For a clear illustration, we add red dots on the x-axis to mark the ground-truth exemplars and green dots to mark the ground-truth outliers. Both exemplars and outliers receive large weights. To further discriminate the exemplars from the outliers, we calculate the RSI according to [5] and plot it in the bottom-left panel of Fig. 2, where the red bars denote the RSI values of the ten exemplars and the green bars denote those of the ten outliers. Indeed, the difference is rather obvious, and this index can therefore be used effectively to detect the outliers.

To perturb $D$, we first sample a random $\delta D \in R^{50 \times 250}$ whose elements are drawn from the distribution $N(0, 0.1)$. We then create $\bar{D} = D + \delta D$ and regard $\bar{D}$ as the noisy data sample set; the $L_2$ normalization of each column is again performed before further processing. For convenience, we call this case the noisy case and the former one the ideal case. The corresponding results for the noisy case are shown in the top-right and bottom-right panels of Fig. 2. In this case, some outliers obtain larger weights than the true exemplars. Even worse, the RSI values of the exemplars and the outliers are very close, and it is therefore difficult to isolate the outliers. This example also shows that the RSI is very sensitive to data noise.

For the same two data sets (ideal case and noisy case), we run the proposed algorithm. Since no specific structure is exploited in this numerical example and we only want to compare robustness with the method in [5], we solve the optimization problem in (2) with $\lambda = 10$. The $L_2$ norm of each row of $X$ and of each column of $E$ are shown in Fig. 3, where the top-left and bottom-left panels correspond to the ideal case, and the top-right and bottom-right panels correspond to the noisy case. From these results, we can see that even in the noisy case our method discriminates the exemplars and outliers accurately.

Fig. 2. Numerical results—SRMS method.

Fig. 3. Numerical results—proposed method.

Fig. 4. Classification results on the face data set. Left: comparison between various exemplar extraction methods; the error bars represent standard deviations over 50 runs. Right: classification accuracy of the proposed method with different parameters $\lambda \in [10^{-5}, 10^{5}]$.
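The synthetic protocol above can be reproduced along the following lines. This is our reconstruction of the described setup: the sampling distributions for the atoms, the interpretation of 0.1 as a standard deviation, the 1-based outlier indices, and the random seed are all assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)                    # seed is an arbitrary choice
atoms = rng.standard_normal((50, 10))             # d_1, ..., d_10 (distribution assumed)
alphas = rng.uniform(0.0, 1.0, size=(10, 240))    # alpha_ij in (0, 1)
D = np.hstack([atoms, atoms @ alphas])            # D in R^{50 x 250}
outlier_idx = np.array([20, 50, 100, 120, 130, 140, 150, 180, 190, 200]) - 1
D[:, outlier_idx] = rng.standard_normal((50, len(outlier_idx)))   # inject 10 outliers
D /= np.linalg.norm(D, axis=0, keepdims=True)     # unit L2 norm per column

# Noisy case: D_bar = D + delta_D with N(0, 0.1) entries, then renormalize.
D_bar = D + rng.normal(0.0, 0.1, size=D.shape)    # 0.1 taken as the std here
D_bar /= np.linalg.norm(D_bar, axis=0, keepdims=True)
```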

B. Face Recognition

Following [5], we evaluate the classification performance on the extended YaleB face database, which contains 38 classes. For each class, we randomly choose 51 samples for training, extract the exemplars from them, and use the remaining samples of each class for testing. We compare the proposed method with several baseline and state-of-the-art methods: K-medoids, SRMS [5], DIS [6], convex clustering [12], and the similarity-based robust clustering algorithm (SCA) [18]. For all methods, 14 exemplars are selected per class. All images are resized to 24 × 21 pixels with 256 gray levels per pixel, so each face image is represented as a 504-D vector with unit $L_2$ norm. In the proposed exemplar extraction approach, we fix $\lambda = 10$ for all cases, unless explicitly stated otherwise. To encourage diversification and prevent similar samples from being selected simultaneously, we define $S_{ij}$ in the optimization model (3) as the inner product between the feature vectors of the $i$th and $j$th samples, with $S_{ii} = 0$ for all $i$. This setting characterizes the similarity between different samples well.

To evaluate robustness, we randomly select a fraction $\rho$ of the 51 training samples in each class for corruption. The corruption simply changes 20% of the pixels of the corresponding image to salt-and-pepper noise. For values of $\rho$ in the interval $[0, 0.5]$ with step 0.1, we run the above exemplar extraction methods and use a linear support vector machine (SVM) as the classifier. Besides the above-mentioned methods, we include SRMS-RSI, in which an RSI-based postprocessing is used to reject the outliers (see [5] for details). The classification accuracy is reported in the left panel of Fig. 4, where the result using all training samples is shown with a dashed line. Note that we repeat K-medoids and the proposed method 50 times and report the average results and the standard deviations over those runs. When $\rho = 0$, i.e., there are no outliers, our results are very close to those obtained using all training samples. The introduction of the RSI does not seem to improve the classification performance noticeably; the reason is that the RSI is sensitive to noise (as validated in the previous section) and does not reliably discriminate between inliers and outliers. These results show that the performance gain of the proposed exemplar extraction method is obvious, whereas K-medoids, DIS, and convex clustering do not perform well since they are not equipped with an outlier rejection mechanism. SCA, which is designed for robust clustering, also performs worse than the proposed method when the noise is strong.

Finally, we present a result on parameter sensitivity. We change the regularization parameter $\lambda$ in (3) from $10^{-5}$ to $10^{5}$ and run the proposed algorithm (with zero initialization in this case) on the data set with different noise levels. The classification accuracies are shown in the right panel of Fig. 4, from which we find that the classification performance does not increase monotonically with $\lambda$. The best $\lambda$ values all lie in the range $[1, 10]$, which means that both the reconstruction error term and the diversification term play important roles in finding a good solution. A sketch of the similarity design used in this experiment is given below.
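A minimal sketch of the inner-product similarity described above (our illustration; the function name is hypothetical):

```python
import numpy as np

def inner_product_similarity(D):
    """S_ij = <d_i, d_j> for unit-norm columns of D, with S_ii = 0."""
    S = D.T @ D                  # D has one 504-D unit-norm face vector per column
    np.fill_diagonal(S, 0.0)     # S_ii = 0 so a sample does not inhibit itself
    return S
```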

Fig. 5. Classification results on the ISOLET data set. Left: comparison between various exemplar extraction methods; the error bars represent standard deviations over 50 runs. Right: classification accuracy of the proposed method with different parameters $\lambda \in [10^{-5}, 10^{5}]$ and $\sigma \in [10^{-5}, 10^{5}]$.

C. ISOLET Classification

We evaluate the above-mentioned methods on the ISOLET classification data set (26 classes). The original training/testing split is used for evaluation, i.e., we have about 240 training samples of dimension 617 for each class. To add noise, we randomly select a fraction $\rho$ of the samples in each class for corruption; the corruption simply replaces the feature vector with a random vector. We use the above-mentioned methods to extract 48 exemplars per class. For our optimization model (3), we impose a structure constraint by setting $S_{ij} = \exp(-\|d_i - d_j\|_2^2 / \sigma)$ for $i \neq j$ and $S_{ii} = 0$, where $\sigma = 10$, to prevent similar samples from being extracted simultaneously. The left panel of Fig. 5 shows the classification accuracy of the SVM classifier for different noise levels, which again validates the effectiveness of the proposed method.

To investigate the parameter sensitivity, we vary the parameter $\lambda$ in (3) from $10^{-5}$ to $10^{5}$ and the parameter $\sigma$ from $10^{-5}$ to $10^{5}$, and run the proposed algorithm on the data set with noise level 0.5. The classification accuracy is shown in the right panel of Fig. 5, from which we see that the best $\lambda$ and $\sigma$ both lie in the range $[1, 10]$. Indeed, when $\sigma$ is extremely large, $S_{ij}$ is close to 1 for most pairs $(i, j)$, so the first term in (3) becomes overwhelming and may deteriorate the reconstruction performance; when $\sigma$ is small, $S_{ij}$ is close to zero for most pairs and the diversification term becomes weak. From this sensitivity analysis, we conclude that a properly designed diversification term indeed plays an important role in extracting exemplars.
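The Gaussian similarity used for ISOLET can be computed as in the following sketch (our illustration, not released code):

```python
import numpy as np

def gaussian_similarity(D, sigma=10.0):
    """S_ij = exp(-||d_i - d_j||^2 / sigma) for i != j, S_ii = 0 (columns of D are samples)."""
    sq_norms = np.sum(D * D, axis=0)
    sq_dists = sq_norms[:, None] + sq_norms[None, :] - 2.0 * (D.T @ D)
    S = np.exp(-np.maximum(sq_dists, 0.0) / sigma)
    np.fill_diagonal(S, 0.0)
    return S
```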

Fig. 7. Classification accuracy on the GTSRB data set.

Fig. 6. Example of a track in the GTSRB data set (all images are normalized to the same size). The 1st-15th images are shown in the first row and the 16th-30th images in the second row. The consecutive images in a single track are indeed similar. In addition, the content changes dramatically at the second image because the camera is bumped strongly and then recovers at once.

D. Traffic Sign Recognition

We evaluate the methods on the recent German Traffic Sign Recognition Benchmark (GTSRB) [17]. This data set consists of 43 classes with 39 209 training images and 12 630 testing images in total. It is strongly unbalanced, since the number of training samples per class varies between 210 (for the first class) and 2250 (for the third class). The training set comprises 1307 tracks, each containing multiple (about 30) images. Consecutive images in one track are similar, which introduces strong data redundancy; here, we investigate how exemplar extraction can reduce this redundancy. For each class, we use the above-mentioned methods to extract $r$ representative training samples, where $r$ ranges over the interval $[10, 100]$ with step 5. For our method, we exploit a specific temporal structure: since consecutive images of a traffic sign in one track are similar (see Fig. 6) and do not contribute much to the diversity of the data set, we do not wish to extract too many exemplars from a single track. Therefore, we define $S_{ij} = 1$ if $i \neq j$ and the $i$th and $j$th samples come from the same track, and $S_{ij} = 0$ otherwise (a sketch follows below). In all experiments, we use the HOG 2 feature provided with the GTSRB data set as the feature vector.

In Fig. 7, we report the classification accuracy of the SVM classifier. For reference, we also include the result using all of the training samples (96.05%). The figure shows that when the number of exemplars is small, our method significantly outperforms the others; the main reason is that our model explicitly incorporates the diversity term. The role of the diversity term weakens as the number of exemplars increases, and the performance gap between the proposed method and SRMS becomes small when $r > 80$. Finally, to show that the proposed method can isolate the outliers, one of which is visible in Fig. 6, we annotated 121 outliers in this data set. The proposed method successfully detects 106 of them, while SRMS-RSI detects only 34.
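The track-based similarity referenced above can be built as in the following sketch (our illustration; `track_ids` is an assumed array giving the track label of every training sample):

```python
import numpy as np

def track_similarity(track_ids):
    """S_ij = 1 if samples i and j share a track (i != j), else 0."""
    track_ids = np.asarray(track_ids)
    S = (track_ids[:, None] == track_ids[None, :]).astype(float)
    np.fill_diagonal(S, 0.0)     # exclude i == j
    return S
```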


V. CONCLUSION

In this brief, a structured sparse coding model is proposed to simultaneously characterize the representativeness, diversity, and robustness that an exemplar extraction task should satisfy. The $L_{1,2}$ norm is introduced to isolate the outliers and improve the robustness, and the inhibition term is introduced to prevent similar samples from being selected simultaneously. Validations on various data sets show that the proposed model obtains promising results.

REFERENCES

[1] S. Boyd, N. Parikh, E. Chu, B. Peleato, and J. Eckstein, "Distributed optimization and statistical learning via the alternating direction method of multipliers," Found. Trends Mach. Learn., vol. 3, no. 1, pp. 1-122, 2011.
[2] H. Cheng, Z. Liu, L. Hou, and J. Yang, "Sparsity induced similarity measure and its applications," IEEE Trans. Circuits Syst. Video Technol., to be published.
[3] Y. Cong, J. Yuan, and J. Luo, "Towards scalable summarization of consumer videos via sparse dictionary selection," IEEE Trans. Multimedia, vol. 14, no. 1, pp. 66-75, Feb. 2012.
[4] Y. Deng, Q. Dai, R. Liu, Z. Zhang, and S. Hu, "Low-rank structure learning via nonconvex heuristic recovery," IEEE Trans. Neural Netw. Learn. Syst., vol. 24, no. 3, pp. 383-396, Mar. 2013.
[5] E. Elhamifar, G. Sapiro, and R. Vidal, "See all by looking at a few: Sparse modeling for finding representative objects," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2012, pp. 1600-1607.
[6] E. Elhamifar, G. Sapiro, and R. Vidal, "Finding exemplars from pairwise dissimilarities via simultaneous sparse recovery," in Proc. Adv. Neural Inf. Process. Syst. (NIPS), 2012, pp. 19-27.
[7] P. Gong, J. Ye, and C. Zhang, "Multi-stage multi-task feature learning," in Proc. Adv. Neural Inf. Process. Syst. (NIPS), 2012, pp. 1-8.
[8] K. Gregor, A. D. Szlam, and Y. LeCun, "Structured sparse coding via lateral inhibition," in Proc. Adv. Neural Inf. Process. Syst. (NIPS), 2011, pp. 1-9.


[9] R. He, W.-S. Zheng, B.-G. Hu, and X.-W. Kong, "Two-stage nonnegative sparse representation for large-scale face recognition," IEEE Trans. Neural Netw. Learn. Syst., vol. 24, no. 1, pp. 35-46, Jan. 2013.
[10] T. D. Hocking, A. Joulin, F. Bach, and J.-P. Vert, "Clusterpath: An algorithm for clustering using convex fusion penalties," in Proc. Int. Conf. Mach. Learn. (ICML), Jun. 2011, pp. 1-8.
[11] M. F. Jiang, S. S. Tseng, and C. M. Su, "Two-phase clustering process for outliers detection," Pattern Recognit. Lett., vol. 22, nos. 6-7, pp. 691-700, May 2001.
[12] D. Lashkari and P. Golland, "Convex clustering with exemplar-based models," in Proc. Adv. Neural Inf. Process. Syst. (NIPS), 2007, pp. 1-8.
[13] G. Liu, Z. Lin, and Y. Yu, "Robust subspace segmentation by low-rank representation," in Proc. 27th Int. Conf. Mach. Learn. (ICML), 2010, pp. 663-670.
[14] G. J. McLachlan and D. Peel, "Robust cluster analysis via mixtures of multivariate t-distributions," in Advances in Pattern Recognition. Berlin, Germany: Springer-Verlag, 1998, pp. 658-666.
[15] F. Nie, H. Huang, X. Cai, and C. H. Ding, "Efficient and robust feature selection via joint ℓ2,1-norms minimization," in Proc. Adv. Neural Inf. Process. Syst. (NIPS), 2011, pp. 1-8.
[16] K. Pelckmans, J. De Brabanter, J. Suykens, and B. De Moor, "Convex clustering shrinkage," in Proc. Workshop Statist. Optim. Clustering (PASCAL), Jul. 2005, pp. 1-6.
[17] J. Stallkamp, M. Schlipsing, J. Salmen, and C. Igel, "Man vs. computer: Benchmarking machine learning algorithms for traffic sign recognition," Neural Netw., vol. 32, pp. 323-332, Aug. 2012.
[18] M.-S. Yang and K.-L. Wu, "A similarity-based robust clustering method," IEEE Trans. Pattern Anal. Mach. Intell., vol. 26, no. 4, pp. 434-448, Apr. 2004.
[19] C. Yang, J. Peng, and J. Fan, "Image collection summarization via dictionary learning for sparse representation," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2012, pp. 1122-1129.
[20] T. Zhang, "Analysis of multi-stage convex relaxation for sparse regularization," J. Mach. Learn. Res., vol. 11, pp. 1081-1107, Mar. 2010.
