Sparse Coding from a Bayesian Perspective

Xiaoqiang Lu, Yulong Wang, and Yuan Yuan, Senior Member, IEEE

Abstract— Sparse coding is a promising theme in computer vision. Most of the existing sparse coding methods are based on either the $\ell_0$ or the $\ell_1$ penalty, which often leads to unstable solutions or biased estimation. This is because of the nonconvexity and discontinuity of the $\ell_0$ penalty and the over-penalization of true large coefficients by the $\ell_1$ penalty. In this paper, sparse coding is interpreted from a novel Bayesian perspective, which results in a new objective function through maximum a posteriori (MAP) estimation. The obtained solution of the objective function can generate more stable results than the $\ell_0$ penalty and smaller reconstruction errors than the $\ell_1$ penalty. In addition, the convergence property of the proposed algorithm for sparse coding is also established. Experiments on single-image super-resolution and visual tracking demonstrate that the proposed method is more effective than other state-of-the-art methods.

Index Terms— Bayesian, compressive sensing (CS), computer vision, maximum a posteriori (MAP), sparse coding.

I. INTRODUCTION

Sparse coding has been widely used as a promising tool in neural networks and image processing and has achieved great success [1]–[6]. The main goal of sparse coding is to find a sparse representation over an over-complete basis set. An appropriate dictionary can be interpreted as a set of features that succinctly and accurately represent the object. Specifically, it is known that most images can be represented adaptively and sparsely by an appropriately chosen over-complete dictionary. This has motivated many efficient sparse coding algorithms [7]–[10], which provide high performance, especially in applications such as super-resolution [11] and visual tracking [12], [13]. Sparse coding has attracted an increasing amount of interest in recent years [14]–[17]. Yang et al. [11] applied sparse representation to super-resolution by jointly learning two dictionaries, for low resolution and high resolution, respectively. Reference [18] generalized K-means by iteratively updating the coefficient matrix and the dictionary matrix, and applied the proposed K-SVD algorithm to image denoising. Mairal et al. [19] devised a discriminative approach to build a dictionary, rather than constructing a dictionary adapted to training data in the classical way. All these methods are based on the $\ell_0$ penalty or the $\ell_1$ penalty [20].

A conventional way to formulate the sparse coding problem is to use the $\ell_0$ penalty, since it directly penalizes the number of nonzero coefficients and is thus desirable for sparse coding. However, this penalty often results in NP-hard problems and unstable solutions because of its nonconvexity and discontinuity [21]. Therefore, appropriate relaxations of the $\ell_0$ penalty are required to reduce the problem complexity effectively. One of the most popular approximations to the $\ell_0$ penalty is the $\ell_1$ penalty, which leads to the classic LASSO method [22]. In practice, LASSO often generates sparse solutions and enjoys attractive statistical properties. Moreover, the corresponding minimization problem can be implemented efficiently. Although the $\ell_1$ penalty performs well in sparse coding, it may cause biased estimation, since it penalizes true large coefficients more heavily and thus may produce over-penalization [23]. Consequently, it is necessary to find a way to overcome the disadvantages of the $\ell_0$ and $\ell_1$ penalties and achieve better performance for sparse coding.

From the viewpoint of statistics, when the dictionary D is fixed, sparse coding shares a number of similarities with variable selection. In variable selection problems, the Bayesian framework has been successfully applied to select variables by enforcing appropriate priors. For example, George and McCulloch [24] considered a Gaussian-based spike-and-slab prior for variable selection with Markov chain Monte Carlo. Yen [25] discussed the same prior but adopted a majorization–minimization technique to tackle the resulting objective function. In addition, Laplace priors were used to avoid overfitting and enforce sparsity in a Bayesian logistic regression approach for text data [26]. Motivated by the success of Bayesian models in variable selection, a novel method is developed here by carrying out maximum a posteriori (MAP) [27], [28] estimation in a Bayesian model for sparse coding. In fact, MAP estimation in the sparse linear model has been previously studied by Seeger et al. [29] using Laplace priors. However, the resulting penalty in the objective function is essentially the $\ell_1$ penalty, which also leads to over-penalization. The difference between our model and theirs is that each coefficient is additionally assigned a Bernoulli variable to further enforce sparsity and reduce over-penalization.

The MAP estimation of the proposed model leads to a new penalty, which turns out to be a mixture of the $\ell_0$ and $\ell_1$ penalties. Since the new penalty still involves the $\ell_0$ norm, which is difficult to handle during optimization, a tight approximation of the $\ell_0$ penalty is used [21].

The main contributions of this paper are as follows.
1) A novel sparse coding algorithm is developed within the Bayesian framework. To the best of our knowledge, such a framework for sparse coding has not been proposed before. Furthermore, the obtained new penalty can generate more stable solutions than the $\ell_0$ penalty and smaller reconstruction errors than the $\ell_1$ penalty.
2) To obtain robust and sparse representations, a novel soft-thresholding scheme is derived within a greedy coordinate descent algorithm for sparse coding. In addition, the convergence property of the proposed algorithm is established.
3) The proposed sparse coding method is applied to two different applications: single-image super-resolution and visual tracking. Comparisons are made between the proposed method and existing sparse coding methods that use either the $\ell_0$ or the $\ell_1$ penalty. The experimental results further demonstrate that our method produces robust and accurate representations.

The rest of this paper is organized as follows. Section II briefly discusses the general framework of the sparse coding problem. Section III introduces the probability model for the coefficients under the Bayesian framework. Based on this model, the new sparse coding method is presented in Section IV. To verify the effectiveness and robustness of the proposed method, experimental results on single-image super-resolution and visual tracking are reported in Section V. Section VI concludes.

II. $\ell_0$ AND $\ell_1$ CONSTRAINED SPARSE CODING

Given the data set $Y = \{y_1, y_2, \dots, y_M\} \subset \mathbb{R}^K$, sparse coding aims at finding basis vectors $d_1, d_2, \dots, d_N \in \mathbb{R}^K$ and sparse coefficient vectors $x_i \in \mathbb{R}^N$ $(i = 1, 2, \dots, M)$ that approximately represent each input vector $y_i \in \mathbb{R}^K$, such that
$$y_i \approx \sum_{l=1}^{N} d_l x_{li}, \quad i = 1, 2, \dots, M. \tag{1}$$
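To make the setup in Eq. (1) concrete, the following NumPy sketch builds a random over-complete dictionary and column-sparse coefficients and checks the column-wise representation; the dimensions and sparsity level are illustrative placeholders, not values used in the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
K, N, M = 64, 256, 10                      # signal dimension, atoms, signals (illustrative only)

D = rng.standard_normal((K, N))            # over-complete dictionary, one atom per column
D /= np.linalg.norm(D, axis=0)             # unit-norm atoms
X = np.zeros((N, M))
for i in range(M):                         # each coefficient vector x_i has only a few nonzeros
    support = rng.choice(N, size=5, replace=False)
    X[support, i] = rng.standard_normal(5)

Y = D @ X                                  # y_i = sum_l d_l x_{li}, i.e., Eq. (1) without noise
print(np.allclose(Y[:, 0], D @ X[:, 0]))   # True
```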

To tackle the problem in Eq. (1), a general approach is to alternately optimize an appropriate objective function with respect to the dictionary matrix $D \triangleq [d_1, d_2, \dots, d_N]$ and the coefficient matrix $X \triangleq [x_1, x_2, \dots, x_M]$. When holding X fixed, the dictionary D can be updated by solving the Lagrange dual of the objective function [7]. Various penalties within the regularized least-squares framework have been adopted for sparse coding in the literature [11], [18]. A natural way to formulate the sparse coding model is to use the $\ell_0$ penalty, and the corresponding optimization problem can be described as
$$\min_{D,X}\ \frac{1}{M}\sum_{i=1}^{M}\Big(\frac{1}{2}\|y_i - Dx_i\|_2^2 + \lambda\|x_i\|_0\Big) \quad \text{s.t. } \|d_l\|_2^2 \le c,\ l = 1, 2, \dots, N. \tag{2}$$
It directly penalizes the number of nonzero coefficients and is thus able to produce accurate and sparse solutions. However, the corresponding optimization problem is usually very difficult to solve since it is NP-hard. Moreover, the $\ell_0$ penalty often leads to unstable solutions because of its nonconvexity and discontinuity. Another widely used penalty is the $\ell_1$ penalty, which leads to the classic LASSO method [22]. It addresses the sparse coding problem by solving the following optimization problem:
$$\min_{D,X}\ \frac{1}{M}\sum_{i=1}^{M}\Big(\frac{1}{2}\|y_i - Dx_i\|_2^2 + \lambda\|x_i\|_1\Big) \quad \text{s.t. } \|d_l\|_2^2 \le c,\ l = 1, 2, \dots, N. \tag{3}$$
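For a fixed dictionary, the $\ell_1$ subproblem in Eq. (3) can be handled by many solvers; the sketch below uses plain iterative soft-thresholding (ISTA), which is not one of the algorithms adopted in this paper but illustrates how the $\ell_1$ penalty is minimized in practice.

```python
import numpy as np

def soft_threshold(u, t):
    """Proximal operator of t*||.||_1 (elementwise soft-thresholding)."""
    return np.sign(u) * np.maximum(np.abs(u) - t, 0.0)

def ista_l1(y, D, lam, n_iter=500):
    """Minimize 0.5*||y - D x||_2^2 + lam*||x||_1 for a fixed dictionary D (generic ISTA sketch)."""
    step = 1.0 / np.linalg.norm(D, 2) ** 2     # 1 / Lipschitz constant of the smooth part
    x = np.zeros(D.shape[1])
    for _ in range(n_iter):
        grad = D.T @ (D @ x - y)               # gradient of the data-fitting term
        x = soft_threshold(x - step * grad, lam * step)
    return x
```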

The problem in Eq. (3) can be solved efficiently by many existing algorithms, such as the feature-sign search algorithm [7], LARS [9], and the greedy coordinate descent algorithm [30]. However, LASSO may lead to biased estimation during sparse coding, since it penalizes true large coefficients more heavily and thus may produce over-penalization. Moreover, a Bayesian framework with a Laplace prior has also been studied for sparse coding [29]. The Bayesian framework is capable of encoding prior knowledge and making valid estimates of uncertainty [29], [31], [32]. In addition, the probabilistic model in [29] can induce sparsity and perform sparse coding effectively. However, the resulting penalty in the objective function is essentially the $\ell_1$ penalty, which again may lead to biased estimation for large coefficients. To make full use of the advantages of the Bayesian framework and overcome the aforementioned drawbacks, a novel Bayesian framework is proposed for sparse coding in this paper.

III. BAYESIAN FRAMEWORK

In this section, a novel Bayesian framework is proposed for sparse coding. If $y_i$ is viewed as the $i$th vectorized patch of a noisy image Y, then the patch-based image model can be formulated as
$$y_i = Dx_i + \epsilon_i \tag{4}$$
where $\epsilon_i$ stands for the noise vector. Each coefficient $x_{li}$ $(l = 1, 2, \dots, N,\ i = 1, 2, \dots, M)$ is assigned an index variable $r_{li} = I(x_{li} \neq 0)$. By Bayes' theorem, for $i = 1, 2, \dots, M$, the joint posterior density of $x_i$, $r_i$, and $\sigma_i^2$ can be written as
$$f(x_i, r_i, \sigma_i^2 \mid D, y_i, \mu, \tau_1, \tau_2, \kappa) \propto f(y_i \mid D, \sigma_i^2, x_i, r_i)\, f(x_i \mid \sigma_i^2, \mu)\, f(\sigma_i^2 \mid \tau_1, \tau_2)\, f(r_i \mid \kappa) \tag{5}$$
where $f(y_i \mid D, \sigma_i^2, x_i, r_i)$, $f(x_i \mid \sigma_i^2, \mu)$, $f(\sigma_i^2 \mid \tau_1, \tau_2)$, and $f(r_i \mid \kappa)$, respectively, denote the priors on the noisy vectorized image patch, the coefficient vector, the noise level, and the index vector $r_i = [r_{1i}, r_{2i}, \dots, r_{Ni}]$. In Eq. (5), the parameters $\mu$, $\tau_1$, $\tau_2$, and $\kappa$ are the parameters of the assumed prior distributions, which will be introduced in detail later. The priors are presented as follows.

Prior $f(y_i \mid D, \sigma_i^2, x_i, r_i)$. With the definition of the index variable $r_{li}$, Eq. (4) can be rewritten as
$$y_{ij} = \sum_{l=1}^{N} d_{jl} r_{li} x_{li} + \epsilon_j, \quad j = 1, 2, \dots, K. \tag{6}$$
It is common to assume that the noise $\epsilon_j$ follows a Gaussian distribution, i.e., $\epsilon_j \sim \mathrm{Normal}(0, \sigma_i^2)$ [25]. This implies
$$y_{ij} \mid d_j, x_i, r_i, \sigma_i^2 \sim \mathrm{Normal}\Big(\sum_{l=1}^{N} d_{jl} r_{li} x_{li},\ \sigma_i^2\Big). \tag{7}$$

Prior $f(x_i \mid \sigma_i^2, \mu)$. To enforce sparsity, the coefficients are assigned Laplace priors [29]. For $l = 1, 2, \dots, N$, $i = 1, 2, \dots, M$,
$$x_{li} \mid \sigma_i^2, \mu \sim \mathrm{Laplace}\big(0, \sigma_i^2 \mu^{-1}\big). \tag{8}$$

Prior $f(\sigma_i^2 \mid \tau_1, \tau_2)$. The noise variances are assumed to follow an Inverse-Gamma distribution [25]. For $i = 1, 2, \dots, M$,
$$\sigma_i^2 \mid \tau_1, \tau_2 \sim \mathrm{Inverse\text{-}Gamma}(\tau_1, \tau_2). \tag{9}$$
In practice, all the $\sigma_i^2$ are assumed to share the same fixed value for the sake of simplicity. This assumption is standard and in accordance with the sparse coding literature [7], [14], [17]. For example, in [17], the standard generative model is formulated as $y = Dx + \epsilon$, where $y$ stands for the noisy observation vector, D is the dictionary, x is the coefficient vector, and $\epsilon$ represents the noise vector. In this model, the noise vector $\epsilon$ is assumed to follow a zero-mean Gaussian distribution with variance $\sigma^2$. Moreover, all samples $y_i$ are assumed to satisfy the same generative model, which means that all noise vectors $\epsilon_i$ have the same standard deviation $\sigma$ [7].

Prior $f(r_i \mid \kappa)$. Although the aforementioned Laplace priors can induce sparsity and enjoy good statistical properties in sparse coding, the resulting $\ell_1$ penalty may over-penalize the large coefficients and generate biased estimates. To further enforce sparsity and reduce the over-penalization caused by the Laplace priors, the index variable $r_{li}$ of each coefficient $x_{li}$ is assumed to be a Bernoulli variable [25]. For $l = 1, 2, \dots, N$, $i = 1, 2, \dots, M$,
$$r_{li} \mid \kappa \sim \mathrm{Bernoulli}(\kappa) \tag{10}$$
where $\kappa \le 1/2$. Here, the Bernoulli prior on $r_{li}$ means that, given the prior information, $r_{li}$ has probability $\kappa$ of being 1 and $1 - \kappa$ of being 0. The MAP estimator of $x_i$, $r_i$, $\sigma_i^2$ can then be obtained as
$$(\hat{x}_i, \hat{r}_i, \hat{\sigma}_i^2) = \arg\min_{x_i, r_i, \sigma_i^2} \big\{-2 \log f(x_i, r_i, \sigma_i^2 \mid D, y_i, \mu, \tau_1, \tau_2, \kappa)\big\}.$$

Combining the above priors and Eq. (5), we have
$$-2 \log f(x_i, r_i, \sigma_i^2 \mid D, y_i, \mu, \tau_1, \tau_2, \kappa) = \frac{1}{\sigma_i^2}\sum_{j=1}^{K}\Big(y_{ij} - \sum_{l=1}^{N} d_{jl} r_{li} x_{li}\Big)^2 + \frac{2\mu}{\sigma_i^2}\sum_{l=1}^{N}|x_{li}| + (2N + K + 2\tau_1 + 2)\log\sigma_i^2 + \frac{2\tau_2}{\sigma_i^2} + \sum_{l=1}^{N} r_{li}\log\frac{(1-\kappa)^2}{\kappa^2} + \text{const}. \tag{11}$$
Based on the proposed Bayesian model, the objective function is detailed in the next section.

IV. METHODOLOGY

Fixing $\sigma_i^2 = 1$, Eq. (11) can be rewritten as
$$\|y_i - Dx_i\|_2^2 + 2\mu\|x_i\|_1 + \rho_\kappa \|x_i\|_0 + \text{const}, \quad i = 1, 2, \dots, M \tag{12}$$
where $\rho_\kappa = \log\big((1-\kappa)^2/\kappa^2\big)$. For simplicity, we consider the equivalent form of Eq. (12)
$$\frac{1}{2}\|y_i - Dx_i\|_2^2 + \lambda\gamma\|x_i\|_1 + \lambda(1-\gamma)\|x_i\|_0 + \text{const} \tag{13}$$
where $\gamma \in [0, 1]$, $\lambda = \mu + \frac{1}{2}\rho_\kappa$, and $\gamma = 2\mu/(2\mu + \rho_\kappa)$. By inspecting the objective function in Eq. (13), it can be seen that the essential penalty in Eq. (13) is a combination of the $\ell_0$ and the $\ell_1$ penalties.
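Under this reparameterization (obtained by scaling Eq. (12) by one half), the Bayesian hyperparameters map to the penalty weights as $\lambda = \mu + \rho_\kappa/2$ and $\gamma = 2\mu/(2\mu + \rho_\kappa)$. A small helper, with purely illustrative input values, makes the mapping from $(\mu, \kappa)$ to $(\lambda, \gamma)$ explicit.

```python
import numpy as np

def penalty_weights(mu, kappa):
    """Map the Laplace scale mu and Bernoulli probability kappa to (lambda, gamma) of Eq. (13)."""
    rho = np.log((1.0 - kappa) ** 2 / kappa ** 2)   # rho_kappa, nonnegative since kappa <= 1/2
    lam = mu + 0.5 * rho                            # lambda = mu + rho_kappa / 2
    gamma = 2.0 * mu / (2.0 * mu + rho)             # balance between the l1 and l0 parts
    return lam, gamma

lam, gamma = penalty_weights(mu=0.5, kappa=0.1)     # example values only, not from the paper
```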

The sparse coding problem can then be transformed into
$$\min_{D,X}\Big\{L(X, D; \lambda, \gamma) := \frac{1}{M}\sum_{i=1}^{M} L_i(X, D; \lambda, \gamma)\Big\} \tag{14}$$
where
$$L_i(X, D; \lambda, \gamma) = \frac{1}{2}\|y_i - Dx_i\|_2^2 + \lambda\gamma\|x_i\|_1 + \lambda(1-\gamma)\|x_i\|_0.$$

With a fixed appropriate dictionary D, the optimization problem in Eq. (14) can be decomposed into M subproblems, each of which is
$$\min_{x_i \in \mathbb{R}^N}\ \frac{1}{2}\|y_i - Dx_i\|_2^2 + \lambda\gamma\|x_i\|_1 + \lambda(1-\gamma)\|x_i\|_0. \tag{15}$$

Since Eq. (15) still involves the $\ell_0$ norm, which is not continuous and thus difficult to handle, the optimization problem in Eq. (15) is relaxed to
$$\min_{x_i \in \mathbb{R}^N}\ \frac{1}{2}\|y_i - Dx_i\|_2^2 + \lambda\gamma\|x_i\|_1 + \lambda(1-\gamma)L_0^\epsilon(x_i) \tag{16}$$
where $L_0^\epsilon(u) = \sum_{j=1}^{N} L_0^\epsilon(u_j) = \sum_{j=1}^{N}\big[I(|u_j| \ge \epsilon) + (|u_j|/\epsilon)\,I(|u_j| < \epsilon)\big]$ [21]. Note that $L_0^\epsilon(u)$ tends to $\|u\|_0$ as $\epsilon \to 0$.
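A direct transcription of the smoothed penalty $L_0^\epsilon$ and the relaxed per-sample objective of Eq. (16); as the comment notes, the surrogate tends to the exact $\ell_0$ count as $\epsilon \to 0$ (a sketch, with $\epsilon$ chosen arbitrarily).

```python
import numpy as np

def l0_eps(u, eps):
    """Smoothed l0 surrogate from [21]: |u_j|/eps inside (-eps, eps), and 1 outside."""
    a = np.abs(u)
    return float(np.sum(np.where(a >= eps, 1.0, a / eps)))   # -> ||u||_0 as eps -> 0

def relaxed_objective(y, D, x, lam, gamma, eps=1e-3):
    """Per-sample objective of Eq. (16)."""
    r = y - D @ x
    return (0.5 * float(r @ r)
            + lam * gamma * float(np.sum(np.abs(x)))
            + lam * (1.0 - gamma) * l0_eps(x, eps))
```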

A. Greedy Coordinate Descent

It is common to solve the optimization problem in Eq. (16) with the greedy coordinate descent method [30]. However, it is worth pointing out that convexity is a necessary condition in the original coordinate descent and greedy coordinate descent algorithms. The objective function in Eq. (16) is not convex with respect to the coefficient vector $x_i$, which makes sparse coding difficult. For this reason, a novel soft-thresholding scheme is derived for the objective function in Eq. (16), and the convergence property of the proposed algorithm is also established.

The basic idea of the greedy coordinate descent algorithm is to adopt adaptive and relatively greedy sweeps while searching for the solution. In particular, the greedy selection picks the coordinate that produces the largest decrease in the objective function at each iteration, that is,
$$h = \arg\max_{l} H_l \tag{17}$$
where $H_l$ is the decrease of the objective function in Eq. (16) obtained by changing the $l$th coordinate of $x_i$. Therefore, the greedy coordinate descent algorithm is superior to the general pathwise coordinate descent algorithm in both accuracy and speed. The proposed method then finds the solutions of D and X by alternately optimizing the objective function with respect to the dictionary matrix D and the coefficient matrix X. Accordingly, the proposed method can be divided into two stages. The first stage updates X with D fixed by utilizing the greedy coordinate descent algorithm. The second stage updates D with X fixed by using a Lagrange dual method. Recall that the cost function in Eq. (16) is not convex in the coefficient vector $x_i$, which makes it hard to directly apply the original greedy coordinate descent algorithm [30]. Consequently, a generalized soft-thresholding scheme is derived for the proposed sparse coding algorithm and its convergence is proved.

B. Sparse Coding

In this section, the main goal is to find the sparse coefficients of each $y_i$ with the dictionary D fixed. The complete procedure is given in Algorithm 1. Specifically, given the current iterate $x_i^t$, the next step of greedy coordinate descent for sparse coding is to change its $h$th coordinate
$$x_{hi}^{t+1} = S_\gamma\Big(\sum_{j=1}^{K} d_{jh}\big(y_{ij} - \tilde{y}_{ij}^h\big),\ \lambda\Big) \tag{18}$$
where $\tilde{y}_{ij}^h = \sum_{l \neq h} d_{jl} x_{li}$ and $S_\gamma(\cdot, \cdot)$ is the generalized thresholding operator defined as
$$S_\gamma(\theta, \lambda) = \begin{cases} 0, & |\theta| \le \lambda\gamma + \sqrt{2\lambda(1-\gamma)} \\ \mathrm{sgn}(\theta)(|\theta| - \lambda\gamma), & |\theta| > \lambda\gamma + \sqrt{2\lambda(1-\gamma)} \end{cases}$$
where $\theta \in \mathbb{R}$ is a scalar.

Algorithm 1 Sparse Coding Using the Combined Penalty
Task: Find the sparse coefficients of the given signal $y_i$ with the fixed dictionary D.
Input: Input signal $y_i$, the dictionary D, regularization parameters $\lambda$ and $\gamma$.
1) Initialization: Set $x_i = 0$, $\beta = D^T y_i$, $w_l = \|d_l\|_2^2$ for $l = 1, 2, \dots, N$, and $w = [w_1, w_2, \dots, w_N]$. Denote by $e_l$ the vector with a 1 in the $l$th coordinate and 0's elsewhere.
2) Repeat until convergence:
   a) Calculate $\tilde{x}_i = S_\gamma(\beta, \lambda) ./ w$;
   b) Greedy selection: $h = \arg\max_l H_l$;
   c) Update $x_i$ and $\beta$:
      $x_{li}^{t+1} = x_{li}^t$ for $l \neq h$, $\quad x_{hi}^{t+1} = \tilde{x}_{hi}$,
      $\beta^{t+1} = \beta^t - (\tilde{x}_{hi} - x_{hi}^t)(D^T D)e_h$, $\quad \beta_h^{t+1} = \beta_h^t$.
Output: Sparse coefficient vector $x_i$.
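The sketch below implements the generalized thresholding operator $S_\gamma$ and a greedy coordinate descent loop in the spirit of Algorithm 1. For readability it measures the decrease $H_l$ by direct objective evaluation (with the exact $\ell_0$ count rather than the $\epsilon$-surrogate) instead of the $\beta$ bookkeeping in the algorithm box, so it should be read as an illustrative rendering rather than the authors' implementation.

```python
import numpy as np

def S_gamma(theta, lam, gamma):
    """Generalized soft-thresholding: zero up to lam*gamma + sqrt(2*lam*(1-gamma)), shrink by lam*gamma beyond."""
    thresh = lam * gamma + np.sqrt(2.0 * lam * (1.0 - gamma))
    return np.where(np.abs(theta) <= thresh, 0.0, np.sign(theta) * (np.abs(theta) - lam * gamma))

def H(y, D, x, lam, gamma):
    """Objective of Eq. (15) with the exact l0 count (simplification of Eq. (16))."""
    r = y - D @ x
    return 0.5 * r @ r + lam * gamma * np.sum(np.abs(x)) + lam * (1.0 - gamma) * np.count_nonzero(x)

def greedy_coordinate_descent(y, D, lam, gamma, n_steps=200):
    N = D.shape[1]
    w = np.sum(D ** 2, axis=0)                     # w_l = ||d_l||_2^2
    x = np.zeros(N)
    for _ in range(n_steps):
        beta = D.T @ (y - D @ x) + w * x           # beta_l = sum_j d_jl (y_j - y~^l_j), cf. Eq. (18)
        cand = S_gamma(beta, lam, gamma) / w       # step (a): candidate value for every coordinate
        base = H(y, D, x, lam, gamma)
        decrease = np.empty(N)
        for l in range(N):                         # decrease H_l from changing coordinate l only
            x_try = x.copy()
            x_try[l] = cand[l]
            decrease[l] = base - H(y, D, x_try, lam, gamma)
        h = int(np.argmax(decrease))               # step (b): greedy selection, Eq. (17)
        if decrease[h] <= 1e-12:                   # no single-coordinate move improves the objective
            break
        x[h] = cand[h]                             # step (c): update the h-th coordinate
    return x
```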

In Algorithm 1, $H_l$ denotes the decrease of the objective function obtained by changing the $l$th coordinate of $x_i$:
$$H_l = H(x_{1i}, x_{2i}, \dots, x_{li}, \dots, x_{Ni}) - H(x_{1i}, x_{2i}, \dots, \tilde{x}_{li}, \dots, x_{Ni}).$$
In the following, we establish the convergence property of Algorithm 1.

Lemma 1: Consider the optimization problem
$$\min_{x \in \mathbb{R}} F(x) = \frac{1}{2}(x - \theta)^2 + \lambda\gamma|x| + \lambda(1-\gamma)L_0^\epsilon(x). \tag{19}$$
Assume $\epsilon < \sqrt{\lambda(1-\gamma)/2}$. Then the solution is $\tilde{x} = S_\gamma(\theta, \lambda)$, where
$$S_\gamma(\theta, \lambda) = \begin{cases} 0, & |\theta| \le \lambda\gamma + \sqrt{2\lambda(1-\gamma)} \\ \mathrm{sgn}(\theta)(|\theta| - \lambda\gamma), & |\theta| > \lambda\gamma + \sqrt{2\lambda(1-\gamma)}. \end{cases}$$

Remark 1: Recall that $L_0^\epsilon(x)$ approximates $\|x\|_0$ as $\epsilon \to 0$, so the value of $\epsilon$ is quite small in our experiments. Therefore, the assumption $\epsilon < \sqrt{\lambda(1-\gamma)/2}$ in Lemma 1 is reasonable and easily satisfied.

The following lemma from [30] is also needed to establish the convergence of the proposed sparse coding algorithm.

Lemma 2: Let $v = (x_1, x_2, \dots, x_{j-1}, x_{j+1}, \dots, x_N)$ and $g(v) = \arg\min_{x_j} H(x_1, x_2, \dots, x_j, \dots, x_N)$. Then $g$ is continuous for any $j$.

Based on Lemma 2, Proposition 1 states the convergence property.

Proposition 1: Let $\{x^1, x^2, x^3, \dots\}$ be the sequence of iterates produced by the greedy coordinate descent Algorithm 1. Then, for all $t = 1, 2, \dots$, $0 \le H(x^{t+1}) \le H(x^t)$. In addition, the sequence $\{H(x^1), H(x^2), H(x^3), \dots\}$ is guaranteed to converge to some $H(\hat{x})$, where every component $\hat{x}_l$ of $\hat{x}$ is an optimal solution of H with respect to the variable $x_l$ for any $l$.
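A quick numerical sanity check of Lemma 1 (not part of the paper): for scalar inputs satisfying the assumption $\epsilon < \sqrt{\lambda(1-\gamma)/2}$, a dense grid search over the objective F in Eq. (19) should agree with the closed-form rule $S_\gamma$.

```python
import numpy as np

def F(x, theta, lam, gamma, eps):
    """Scalar objective of Eq. (19) with the smoothed penalty L_0^eps."""
    pen = np.where(np.abs(x) >= eps, 1.0, np.abs(x) / eps)
    return 0.5 * (x - theta) ** 2 + lam * gamma * np.abs(x) + lam * (1.0 - gamma) * pen

def S_gamma(theta, lam, gamma):
    thresh = lam * gamma + np.sqrt(2.0 * lam * (1.0 - gamma))
    return 0.0 if abs(theta) <= thresh else np.sign(theta) * (abs(theta) - lam * gamma)

lam, gamma, eps = 0.3, 0.9, 1e-3          # eps < sqrt(lam*(1-gamma)/2) ~= 0.12, so Lemma 1 applies
grid = np.linspace(-3.0, 3.0, 60001)      # grid spacing 1e-4
for theta in np.linspace(-2.0, 2.0, 41):
    x_grid = grid[np.argmin(F(grid, theta, lam, gamma, eps))]
    assert abs(x_grid - S_gamma(theta, lam, gamma)) < 1e-3
```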

C. Dictionary Learning

This section discusses how to learn the dictionary D while holding the sparse coefficient matrix X fixed. The corresponding optimization problem can be formulated as
$$\min_{D}\ \|Y - DX\|_F^2 \quad \text{s.t. } \|d_l\|_2^2 \le c,\ l = 1, 2, \dots, N. \tag{20}$$
Recently, a number of efficient methods [7], [18], [33]–[35] have been developed to tackle the dictionary learning problem in Eq. (20). Here, the Lagrange dual-based dictionary learning method of [7] is adopted for its efficiency and simplicity.

V. EXPERIMENTS

In this section, the proposed method is applied to single-image super-resolution and visual tracking. To further demonstrate the effectiveness and robustness of the proposed method, comparisons are also made on these applications between the proposed method and $\ell_1$ penalty based methods.

A. Experiments With Simulated Data

1) Parameter Settings: In this section, we discuss and analyze the effect of the parameters λ and γ on the final sparse coding performance with simulated data. This helps to gain a better understanding of the proposed method and to adopt appropriate parameter settings. The simulated data are generated as
$$y = Dx + \epsilon. \tag{21}$$
In Eq. (21), $y \in \mathbb{R}^{100}$ is the observation data. $D \in \mathbb{R}^{100 \times 400}$ is a random dictionary matrix generated with i.i.d. entries from the standard normal distribution N(0, 1). For simplicity, the columns of D are normalized to unit $\ell_2$ norm. $x \in \mathbb{R}^{400}$ is the randomly generated ground-truth sparse coefficient vector, in which the locations of the nonzero coordinates are random and independent and the values of the nonzero coordinates are also drawn from the standard normal distribution N(0, 1). $\epsilon \in \mathbb{R}^{100}$ denotes randomly generated zero-mean Gaussian noise with standard deviation σ = 0.01. While performing sparse coding, only the observation data y and the dictionary D are known.

First, we consider the effect of the regularization parameter λ on the recovery performance by fixing the balance parameter γ = 0.9. It is worth pointing out that this fixed value of γ may be suboptimal; we chose it because we experimentally observed that this setting produces decent performance. Fig. 1 shows the recovery error as a function of the regularization parameter λ, averaged over 100 random runs. The recovery error is calculated as the $\ell_2$ norm of the difference between the estimated sparse coefficient vector $\tilde{x}$ and the ground truth x. It can be seen that the recovery error is smallest when λ = 0.05.

Fig. 1. Recovery error results as a function of the regularization parameter λ for a problem with γ = 0.9.

Then we discuss the effect of the balance parameter γ on the recovery performance by fixing λ = 0.05. Fig. 2 shows the recovery error as a function of the balance parameter γ, averaged over 100 random runs. It can be observed that γ = 0.9 leads to the best recovery performance. Note that the two special cases γ = 0 and γ = 1 correspond to the $\ell_0$ penalty and the $\ell_1$ penalty, respectively. Thus, Fig. 2 indicates that the penalty resulting from the proposed Bayesian framework is, in a sense, indeed better than the original $\ell_0$ and $\ell_1$ penalties. To make this more convincing and further demonstrate the superiority of the proposed method, comparison experiments are conducted with the synthetic data in the next section.

Fig. 2. Recovery error results as a function of the regularization parameter γ for a problem with λ = 0.05.
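The simulated setup described above is easy to reproduce; the sketch below generates y = Dx + ε with the stated dimensions, unit-norm dictionary columns, and noise level, and computes the recovery error $\|\tilde{x} - x\|_2$ for any estimate $\tilde{x}$ (the estimator itself is left abstract).

```python
import numpy as np

def make_simulated_problem(K=100, N=400, n_nonzero=10, sigma=0.01, seed=0):
    """Simulated data of Section V-A: y = D x + eps, D with unit l2-norm columns."""
    rng = np.random.default_rng(seed)
    D = rng.standard_normal((K, N))
    D /= np.linalg.norm(D, axis=0)                    # normalize columns to unit l2 norm
    x = np.zeros(N)
    support = rng.choice(N, size=n_nonzero, replace=False)
    x[support] = rng.standard_normal(n_nonzero)       # random nonzero locations and values
    y = D @ x + sigma * rng.standard_normal(K)        # zero-mean Gaussian noise, sigma = 0.01
    return y, D, x

def recovery_error(x_hat, x_true):
    """l2 norm of the difference between the estimate and the ground truth."""
    return float(np.linalg.norm(x_hat - x_true))
```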

2) Comparison With State-of-the-Art Methods: In this section, the proposed method is compared with classical and state-of-the-art sparse coding methods in terms of recovery performance and computation time, namely orthogonal matching pursuit (OMP) [36], LASSO [9], feature sign [7], coordinate descent [30], and Olshausen [14]. To make the comparisons fair, all considered methods are run on the same randomly generated synthetic data in each run. The synthetic data are generated by the same process described in Section V-A1. Before presenting the comparison results, we first clarify the parameter settings of each included sparse coding method. For OMP, since the number of nonzero coordinates in the sparse coefficient vector is unknown, we adopt the error-constrained version
$$\min_{x}\ \|x\|_0 \quad \text{s.t. } \|y - Dx\|_2^2 \le e \tag{22}$$
where e represents the error tolerance, which has been tuned to 0.01 to obtain decent performance. The parameters of the remaining four methods have also been tuned to yield optimal performance.

Fig. 3 shows the recovery performance of the six sparse coding methods when the number of nonzero coordinates in the ground-truth sparse coefficient vector is 10 and 20, averaged over 100 random runs. As expected, the proposed method achieves better recovery performance, with lower recovery error, than the other methods.

Fig. 3. Comparison of several sparse coding methods in terms of the recovery error with different numbers of nonzero coordinates.

Fig. 4 presents the computation time in seconds when the number of nonzero coordinates in the ground-truth sparse coefficient vector is 10 and 20, averaged over 100 random runs. It can be seen that the proposed method always has a lower computation cost than the other methods, which may be attributed to its fast convergence to a stable solution. Consequently, we can conclude that the proposed method achieves better performance than state-of-the-art sparse coding methods in most cases while allowing fast computation.

Fig. 4. Comparison of several sparse coding methods in terms of the computation time in seconds with different numbers of nonzero coordinates.
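For reference, a minimal textbook-style orthogonal matching pursuit for the error-constrained formulation in Eq. (22); this is a generic sketch, not the exact implementation behind the reported OMP results.

```python
import numpy as np

def omp_error_constrained(y, D, tol=0.01, max_atoms=None):
    """Greedy OMP: add atoms until ||y - D x||_2^2 <= tol, as in the stopping rule of Eq. (22)."""
    K, N = D.shape
    max_atoms = max_atoms if max_atoms is not None else K
    support = []
    x = np.zeros(N)
    residual = y.copy()
    while residual @ residual > tol and len(support) < max_atoms:
        l = int(np.argmax(np.abs(D.T @ residual)))    # atom most correlated with the residual
        if l in support:                              # no further progress possible; stop
            break
        support.append(l)
        coef, *_ = np.linalg.lstsq(D[:, support], y, rcond=None)   # least squares on the support
        residual = y - D[:, support] @ coef
        x[:] = 0.0
        x[support] = coef
    return x
```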

B. Degree of Sparsity and Reconstruction Error

In this part, to illustrate the effectiveness of the combined penalty, the proposed method is compared with the $\ell_1$ penalty based sparse coding method in terms of the degree of sparsity and the reconstruction error. In our experiments, the degree of sparsity refers to the number of zero components of the coefficient vector $x_j$ of an image patch, while the reconstruction error is the mean square error. In the experiment, 3 × 3 patches are chosen with an overlap of one pixel between adjacent patches. The horizontal axis refers to the indices of 100 random patches sampled from test images. For fairness, we respectively train dictionaries with 256 atoms and perform sparse coding using the proposed method and the $\ell_1$ penalty based method. The $\ell_1$ penalty based method is implemented using the SLEP package [37]. Recall that the penalty in the proposed objective function is essentially a convex combination of the $\ell_0$ and $\ell_1$ penalties; λ is the regularization parameter and γ is a weighting parameter introduced to balance the effect of the $\ell_0$ penalty against that of the $\ell_1$ penalty. The tuning parameter λ is selected by a discrete search over 40 candidates in the interval [0.01, 0.4], while γ is selected from 11 candidates in the interval [0, 1]. When comparing the degree of sparsity, the regularization parameter λ1 for the $\ell_1$ penalty based method is set as λ1 = 10^-4; for the proposed method, the parameters are chosen as λ = 0.3 and γ = 0.9. When comparing the reconstruction error, the parameters are selected as λ1 = 0.1, λ = 0.3, and γ = 0.9.

Fig. 5(a) shows the degrees of sparsity when the reconstruction errors are almost the same for both methods, while Fig. 5(b) presents the reconstruction errors when the degrees of sparsity are almost the same. Obviously, compared with the $\ell_1$ penalty based approach, the proposed method generates a higher degree of sparsity when the reconstruction errors are almost the same, or a much smaller reconstruction error when the degrees of sparsity are almost the same. The results show that the proposed method tends to generate more stable and sparse solutions.

Fig. 5. Comparison of the $\ell_1$-based dictionary learning method and ours. (a) The degrees of sparsity when the reconstruction errors are almost the same for both methods. (b) The reconstruction errors when the degrees of sparsity are almost the same for both methods.
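The two quantities plotted in Fig. 5 follow directly from the definitions above; a small sketch of the metrics (degree of sparsity = number of zero coefficients of a patch's code, reconstruction error = mean square error of the patch reconstruction).

```python
import numpy as np

def degree_of_sparsity(x, tol=1e-8):
    """Number of (numerically) zero components in the coefficient vector of a patch."""
    return int(np.sum(np.abs(x) <= tol))

def reconstruction_mse(patch_vec, D, x):
    """Mean square error between a vectorized patch and its sparse reconstruction D x."""
    diff = patch_vec - D @ x
    return float(np.mean(diff ** 2))
```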

C. Single-Image Super-Resolution

In this section, the performance of the proposed method is verified on super-resolution (SR). Sparsity with an appropriate dictionary is a strong prior for achieving good SR performance [11]. Since the proposed sparse coding method obtains better results in both sparsity and reconstruction error, it can perform SR better. In the experiments, the training images are chosen from [11] and contain different types of content, such as flowers, human faces, architecture, animals, and cars. The magnification factor is set to 3, which is common in the single-image super-resolution literature. For the low-resolution (LR) images, we use 3 × 3 patches with an overlap of one pixel between adjacent patches, corresponding to 9 × 9 patches with an overlap of three pixels for the high-resolution (HR) patches. The dictionary size is fixed at 1024, which is a balance between image quality and computation [11]. Following [11], we first jointly learn two dictionaries, for low resolution and high resolution, respectively, with the proposed sparse coding method. The sparse representations of the patches of a given LR image are then obtained by Algorithm 1, and the coefficients of these representations are used to generate the HR image. The parameters are selected as λ = 0.3 and γ = 0.9.

Figs. 6 and 7 show the super-resolution results of our method and those of some state-of-the-art methods, including the $\ell_1$ penalty based method and the locally linear embedding (LLE) based method. It can be seen that the proposed method generates sharper and visually more appealing reconstructions, which also yield higher peak signal-to-noise ratio (PSNR) and structural similarity (SSIM). Specifically, the proposed method reconstructs the girl's nose and the parrot's neck better. For fairness, the same 100 000 patch pairs are used for the different methods.

Fig. 6. SR results of the parrot image by a factor of 3 and the corresponding PSNRs. Local magnification of the red rectangle is shown in the upper-left corner of each test image. Left to right: input, the original, Bicubic method (PSNR: 29.1532, SSIM: 0.8326), LLE method (PSNR: 28.7872, SSIM: 0.8132), Yang's method (PSNR: 29.3207, SSIM: 0.8409), and our method (PSNR: 29.7584, SSIM: 0.8532).

Fig. 7. SR results of the girl image by a factor of 3 and the corresponding PSNRs. Local magnification of the red rectangle is shown in the upper-left corner of each test image. Left to right: input, the original, Bicubic method (PSNR: 32.6996, SSIM: 0.7993), LLE method (PSNR: 32.1590, SSIM: 0.7772), Yang's method (PSNR: 32.9696, SSIM: 0.8134), and our method (PSNR: 33.2663, SSIM: 0.8202).

To further evaluate the performance of the proposed method, SR experiments are conducted on a large dataset of 200 natural images of plants, animals, humans, and buildings in roughly equal proportion, downloaded from the internet. It can be seen from Fig. 8 that the proposed method performs consistently better than the others, which also implies the robustness of our method.

Fig. 8. PSNR and SSIM gain distributions of three different methods over the Bicubic method on the 200-image dataset.
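A schematic outline of the patch-based SR reconstruction described above, assuming a pair of already learned dictionaries D_l (for vectorized 3 × 3 LR patches) and D_h (for the corresponding 9 × 9 HR patches) and a sparse_code routine such as Algorithm 1; overlap handling is reduced to plain averaging, so this is a sketch of the idea rather than the full pipeline of [11].

```python
import numpy as np

def super_resolve(lr_image, D_l, D_h, sparse_code, scale=3, lr_patch=3, step=2):
    """Code each LR patch with D_l, synthesize the HR patch with D_h, and average overlapping pixels."""
    H, W = lr_image.shape
    hr = np.zeros((H * scale, W * scale))
    weight = np.zeros_like(hr)
    hp = lr_patch * scale                                   # 9x9 HR patch for a 3x3 LR patch
    for r in range(0, H - lr_patch + 1, step):              # step=2 gives a one-pixel LR overlap
        for c in range(0, W - lr_patch + 1, step):
            p = lr_image[r:r + lr_patch, c:c + lr_patch].ravel()
            x = sparse_code(p, D_l)                         # shared sparse coefficients
            q = (D_h @ x).reshape(hp, hp)                   # HR patch from the same coefficients
            hr[r * scale:r * scale + hp, c * scale:c * scale + hp] += q
            weight[r * scale:r * scale + hp, c * scale:c * scale + hp] += 1.0
    return hr / np.maximum(weight, 1.0)                     # uncovered border pixels stay zero
```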

D. Visual Tracking

In this section, the proposed dictionary learning algorithm is utilized for another application, visual tracking. This application is mainly based on the assumption that the appearance of a tracked object can be sparsely represented by its appearances in previous frames [13]. Hence, the proposed method is very appropriate for the tracking problem because of its superiority in sparse representation. The test sequences are from [13]. In the experiments, the template size is chosen as 12 × 15 according to the width-to-height ratio of the target in the first frame. The number of target templates is 10, the same as in [13], for fair comparison. It should be pointed out that the initial target position is manually selected in our experiments. The parameters are selected as λ = 0.03 and γ = 0.9.

Fig. 10 shows samples of the tracking results on the sequence birch_seq_mb by our method, the $\ell_1$ penalty based method [13], the IVT method [38], and the OMIST method [39]. It can be seen that our tracker tracks better, even when occlusions occur in the image.

Fig. 10. Tracking results of the sequence birch_seq_mb by our method, the $\ell_1$ penalty based method [13], the IVT method [38], and the OMIST method [39], respectively (from top to bottom). Left to right: frame 102, frame 243, frame 310, frame 318, frame 331, and frame 440.

Table I summarizes the mean and standard deviation of the localization errors of the different methods. As shown in Table I, the proposed method tracks the object more robustly and more reliably, achieving the smallest mean and standard deviation. To further illustrate the superiority of the proposed method in tracking, Fig. 9 shows the quantitative comparison of the trackers in terms of position errors.

TABLE I
ACCURACIES OF TRACKING RESULTS (MEAN ERROR AND STANDARD DEVIATION) BY DIFFERENT METHODS

                                     Our      L1       IVT       OMIST
Mean (birch_seq_mb)                  4.4989   5.6459   43.1728   39.6318
Standard deviation (birch_seq_mb)    2.8894   3.1156   12.9553   11.8774

Fig. 9. Quantitative comparison of the trackers in terms of position errors (in pixels).

VI. CONCLUSION

In this paper, a novel Bayesian framework was proposed for sparse coding. The MAP estimation of the framework produced a new penalty, which leads to a higher degree of sparsity and a lower reconstruction error in sparse coding in comparison with the $\ell_0$ and $\ell_1$ penalties. This underlies the success of the proposed method in single-image super-resolution and visual tracking. Moreover, the convergence property of the proposed sparse coding algorithm was established based on the proposed framework. The experimental results further demonstrated the effectiveness of our method.

APPENDIX

This appendix presents the proofs of Lemma 1 and Proposition 1.

A. Proof of Lemma 1

Proof: To obtain the solution of Eq. (19), we consider the following cases.

1) Case 1, $x \in [\epsilon, \infty)$:
$$F(x) = \frac{1}{2}(x - \theta)^2 + \lambda\gamma x + \lambda(1-\gamma)$$
which implies $F'(x) = x - \theta + \lambda\gamma$. If $\theta \ge \lambda\gamma + \epsilon$, then setting $F'(x)$ to zero gives $x = \theta - \lambda\gamma$. If $\theta < \lambda\gamma + \epsilon$, then $F'(x) > 0$ for all $x \in [\epsilon, \infty)$. Therefore
$$\arg\min_{x \in [\epsilon, \infty)} F(x) = \begin{cases} \theta - \lambda\gamma, & \theta \ge \lambda\gamma + \epsilon \\ \epsilon, & \theta < \lambda\gamma + \epsilon. \end{cases}$$

2) Case 2, $x \in [0, \epsilon]$:
$$F(x) = \frac{1}{2}(x - \theta)^2 + \lambda\gamma x + \frac{\lambda(1-\gamma)}{\epsilon}x = \frac{1}{2}(x - \theta)^2 + \lambda\varphi x$$
where $\varphi = \gamma + (1-\gamma)/\epsilon$. This implies $F'(x) = x - \theta + \lambda\varphi$. If $\lambda\varphi \le \theta \le \lambda\varphi + \epsilon$, then setting $F'(x)$ to zero gives $x = \theta - \lambda\varphi$. If $\theta < \lambda\varphi$, then $F'(x) > 0$ for all $x \in [0, \epsilon]$. If $\theta > \lambda\varphi + \epsilon$, then $F'(x) < 0$ for all $x \in [0, \epsilon]$. Therefore
$$\arg\min_{x \in [0, \epsilon]} F(x) = \begin{cases} 0, & \theta < \lambda\varphi \\ \theta - \lambda\varphi, & \lambda\varphi \le \theta \le \lambda\varphi + \epsilon \\ \epsilon, & \theta > \lambda\varphi + \epsilon. \end{cases}$$

Summarizing the above two cases, if $\theta \le \lambda\gamma + \epsilon$,
$$\min_{x \in [0, \infty)} F(x) = \min\Big\{\min_{x \in [0, \epsilon]} F(x),\ \min_{x \in [\epsilon, \infty)} F(x)\Big\} = \min\{F(0), F(\epsilon)\} = F(0).$$
If $\lambda\gamma + \epsilon < \theta \le \lambda\varphi$,
$$\min_{x \in [0, \infty)} F(x) = \min\Big\{\min_{x \in [0, \epsilon]} F(x),\ \min_{x \in [\epsilon, \infty)} F(x)\Big\} = \min\{F(0), F(\theta - \lambda\gamma)\}.$$
Note that $F(\theta - \lambda\gamma) - F(0) = -\frac{1}{2}(\lambda\gamma - \theta)^2 + \lambda(1-\gamma)$, which implies that when $\lambda\gamma + \epsilon < \theta \le \lambda\gamma + \sqrt{2\lambda(1-\gamma)}$, $F(0) \le F(\theta - \lambda\gamma)$, and when $\lambda\gamma + \sqrt{2\lambda(1-\gamma)} < \theta \le \lambda\varphi$, $F(0) \ge F(\theta - \lambda\gamma)$. Therefore, if $\lambda\gamma + \epsilon < \theta \le \lambda\gamma + \sqrt{2\lambda(1-\gamma)}$,
$$\min_{x \in [0, \infty)} F(x) = F(0);$$
and if $\lambda\gamma + \sqrt{2\lambda(1-\gamma)} < \theta \le \lambda\varphi$,
$$\min_{x \in [0, \infty)} F(x) = F(\theta - \lambda\gamma).$$
If $\lambda\varphi < \theta \le \lambda\varphi + \epsilon$,
$$\min_{x \in [0, \infty)} F(x) = \min\{F(\theta - \lambda\gamma), F(\theta - \lambda\varphi)\} = F(\theta - \lambda\gamma)$$
because
$$F(\theta - \lambda\gamma) - F(\theta - \lambda\varphi) = \Big(\lambda\gamma\theta - \frac{1}{2}\lambda^2\gamma^2 + \lambda(1-\gamma)\Big) - \Big(\lambda\varphi\theta - \frac{1}{2}\lambda^2\varphi^2\Big) = \lambda(\varphi - \gamma)\Big(\frac{1}{2}\lambda(\varphi + \gamma) - \theta\Big) + \lambda(1-\gamma) \le \lambda(\varphi - \gamma)\Big(\frac{1}{2}\lambda(\varphi + \gamma) - \lambda\varphi\Big) + \lambda(1-\gamma) = -\frac{\lambda^2(1-\gamma)^2}{2\epsilon^2} + \lambda(1-\gamma) = \lambda(1-\gamma)\Big(1 - \frac{\lambda(1-\gamma)}{2\epsilon^2}\Big) < 0$$
by the assumption $\epsilon < \sqrt{\lambda(1-\gamma)/2}$. If $\theta > \lambda\varphi + \epsilon$,
$$\min_{x \in [0, \infty)} F(x) = \min\{F(\theta - \lambda\gamma), F(\epsilon)\} = F(\theta - \lambda\gamma).$$
Therefore
$$\arg\min_{x \in [0, \infty)} F(x) = \begin{cases} 0, & \theta \le \lambda\gamma + \sqrt{2\lambda(1-\gamma)} \\ \theta - \lambda\gamma, & \theta > \lambda\gamma + \sqrt{2\lambda(1-\gamma)}. \end{cases}$$
Similarly,
$$\arg\min_{x \in (-\infty, 0]} F(x) = \begin{cases} 0, & \theta \ge -\lambda\gamma - \sqrt{2\lambda(1-\gamma)} \\ \theta + \lambda\gamma, & \theta < -\lambda\gamma - \sqrt{2\lambda(1-\gamma)}. \end{cases}$$
Summarizing, if $|\theta| \le \lambda\gamma + \sqrt{2\lambda(1-\gamma)}$, then $\arg\min_{x \in \mathbb{R}} F(x) = 0$; if $|\theta| > \lambda\gamma + \sqrt{2\lambda(1-\gamma)}$, then $\arg\min_{x \in \mathbb{R}} F(x) = \mathrm{sgn}(\theta)(|\theta| - \lambda\gamma)$. This completes the proof.


B. Proof of Proposition 1

Proof: Suppose $x^t = (x_1, x_2, \dots, x_h, \dots, x_N)$, $h = \arg\max_l H_l$, and the input signal is $y = [y_1, y_2, \dots, y_K]$. Then, by Algorithm 1, we have $x^{t+1} = (x_1, x_2, \dots, \tilde{x}_h, \dots, x_N)$ and $\tilde{x}_h = S_\gamma(\beta_h, \lambda)/w_h$, where $\beta_h = \sum_{j=1}^{K} d_{jh}\big(y_j - \sum_{l \neq h} d_{jl} x_l\big)$. We also define $w_l = \|d_l\|_2^2$, $l = 1, 2, \dots, N$, and $w = [w_1, w_2, \dots, w_N]$. Observe that
$$\frac{1}{2}\|y - Dx^t\|_2^2 = \frac{1}{2}\sum_{j=1}^{K}\Big(y_j - \sum_{l=1}^{N} d_{jl} x_l\Big)^2 = \frac{1}{2}\sum_{j=1}^{K}\Big(y_j - \sum_{l \neq h} d_{jl} x_l - d_{jh} x_h\Big)^2 = \frac{1}{2}\Big[\sum_{j=1}^{K} d_{jh}^2 x_h^2 - 2\sum_{j=1}^{K} d_{jh}\Big(y_j - \sum_{l \neq h} d_{jl} x_l\Big)x_h + \sum_{j=1}^{K}\Big(y_j - \sum_{l \neq h} d_{jl} x_l\Big)^2\Big] = \frac{1}{2}w_h\big(x_h - \beta_h/w_h\big)^2 + C_1$$
where $C_1 = \frac{1}{2}\sum_{j=1}^{K}\big(y_j - \sum_{l \neq h} d_{jl} x_l\big)^2 - \frac{\beta_h^2}{2 w_h}$ is independent of $x_h$; the last equality uses $w_h = \|d_h\|_2^2 = \sum_{j=1}^{K} d_{jh}^2$. Hence
$$H(x^t) = w_h F(x_h) + C_2$$
where
$$F(x_h) = \frac{1}{2}\big(x_h - \beta_h/w_h\big)^2 + \frac{\lambda\gamma}{w_h}|x_h| + \frac{\lambda(1-\gamma)}{w_h}L_0^\epsilon(x_h)$$
and $C_2 = C_1 + \lambda\gamma\sum_{l \neq h}|x_l| + \lambda(1-\gamma)\sum_{l \neq h}L_0^\epsilon(x_l)$ is also independent of $x_h$. By Lemma 1, it is not difficult to see that $\tilde{x}_h = S_\gamma(\beta_h, \lambda)/w_h = S_\gamma(\beta_h/w_h, \lambda/w_h)$ minimizes $F(x_h)$. So
$$H(x^{t+1}) - H(x^t) = F(\tilde{x}_h) - F(x_h) \le 0$$
which implies $H(x^{t+1}) \le H(x^t)$ for all $t = 1, 2, \dots$. Hence, $H(x^t)$ is decreasing and bounded from below ($H(x^t)$ is nonnegative). This implies that $H(x^t)$ converges to some $\hat{H}$, $\lim_{t \to \infty} H(x^t) = \hat{H}$. It is easy to see that $\lim_{x_l \to \infty} H(x_1, x_2, \dots, x_N) = \infty$ for any $1 \le l \le N$ by the definition of H. However, $H(x^t) \le H(x^1)$, $t = 1, 2, \dots$, so the sequence $\{x^t\}$ must be bounded, which implies that there exists a cluster point $\hat{x}$ and a subsequence $\{x^{t(k)}\}$ satisfying $\lim_{k \to \infty} x^{t(k)} = \hat{x}$. Since $H(x)$ is continuous, $H(\hat{x}) = \hat{H}$. Recall that
$$H_l(x^t) = H(x_1, x_2, \dots, x_l, \dots, x_N) - H(x_1, x_2, \dots, \tilde{x}_l, \dots, x_N).$$
By Lemma 2, $\tilde{x}_l$ is a continuous function of $v = (x_1, x_2, \dots, x_{l-1}, x_{l+1}, \dots, x_N)$, and thus $H_l$ is continuous. Therefore
$$H_l(\hat{x}) = H_l\Big(\lim_{k \to \infty} x^{t(k)}\Big) = \lim_{k \to \infty} H_l(x^{t(k)}).$$
Recall that in Algorithm 1 we choose $h = \arg\max_l H_l$, which implies that for any $l$, $H_l(x^{t(k)}) \le H_h(x^{t(k)}) = H(x^{t(k)}) - H(x^{t(k)+1})$, and thus
$$0 \le H_l(\hat{x}) = \lim_{k \to \infty} H_l(x^{t(k)}) \le \lim_{k \to \infty}\big(H(x^{t(k)}) - H(x^{t(k)+1})\big) = 0.$$
Therefore $H(\hat{x}_1, \hat{x}_2, \dots, \hat{x}_l, \dots, \hat{x}_N) = H(\hat{x}_1, \hat{x}_2, \dots, \tilde{\hat{x}}_l, \dots, \hat{x}_N)$ for any $l$. However, by Lemma 1, $\tilde{\hat{x}}_l$ is the unique optimal solution of H with respect to the variable $x_l$, so $\hat{x}_l = \tilde{\hat{x}}_l$, $l = 1, 2, \dots, N$. This means that every component $\hat{x}_l$ of $\hat{x}$ is an optimal solution of H with respect to the variable $x_l$ for any $l$. This completes the proof.

REFERENCES

[1] S. Abdallah and M. Plumbley, "Unsupervised analysis of polyphonic music by sparse coding," IEEE Trans. Neural Netw., vol. 17, no. 1, pp. 179–196, Jan. 2006.
[2] K. Labusch, E. Barth, and T. Martinetz, "Simple method for high-performance digit recognition based on sparse coding," IEEE Trans. Neural Netw., vol. 19, no. 11, pp. 1985–1989, Nov. 2008.
[3] Y. Li, A. Cichocki, S. Amari, S. Xie, and C. Guan, "Equivalence probability and sparsity of two sparse solutions in sparse representation," IEEE Trans. Neural Netw., vol. 19, no. 12, pp. 2009–2021, Dec. 2008.
[4] E. Elhamifar and R. Vidal, "Robust classification using structured sparse representation," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., Jun. 2011, pp. 1873–1879.
[5] J. Mairal, M. Elad, and G. Sapiro, "Sparse representation for color image restoration," IEEE Trans. Image Process., vol. 17, no. 1, pp. 53–69, Jan. 2008.
[6] J. Wright, A. Yang, A. Ganesh, S. Sastry, and Y. Ma, "Robust face recognition via sparse representation," IEEE Trans. Pattern Anal. Mach. Intell., vol. 31, no. 2, pp. 210–227, Feb. 2009.
[7] H. Lee, A. Battle, R. Raina, and A. Ng, "Efficient sparse coding algorithms," in Advances in Neural Information Processing Systems. Cambridge, MA, USA: MIT Press, 2007.
[8] B. Olshausen and D. Field, "Emergence of simple-cell receptive field properties by learning a sparse code for natural images," Nature, vol. 381, no. 6583, pp. 607–609, 1996.
[9] B. Efron, T. Hastie, I. Johnstone, and R. Tibshirani, "Least angle regression," Ann. Stat., vol. 32, no. 2, pp. 407–499, 2004.
[10] T. Wu and K. Lange, "Coordinate descent algorithms for LASSO penalized regression," Ann. Appl. Stat., vol. 2, no. 1, pp. 224–244, 2008.
[11] J. Yang, J. Wright, T. Huang, and Y. Ma, "Image super-resolution via sparse representation," IEEE Trans. Image Process., vol. 19, no. 11, pp. 2861–2873, Nov. 2010.
[12] B. Liu, J. Huang, L. Yang, and C. Kulikowski, "Robust tracking using local sparse appearance model and k-selection," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., Jun. 2011, pp. 1313–1320.
[13] X. Mei and H. Ling, "Robust visual tracking and vehicle classification via sparse representation," IEEE Trans. Pattern Anal. Mach. Intell., vol. 33, no. 11, pp. 2259–2272, Nov. 2011.
[14] B. Olshausen and D. Field, "Sparse coding with an over-complete basis set: A strategy employed by V1?," Vis. Res., vol. 37, no. 23, pp. 3311–3325, 1997.
[15] S. Bohte, H. Poutre, and J. Kok, "Unsupervised clustering with spiking neurons by sparse temporal coding and multilayer RBF networks," IEEE Trans. Neural Netw., vol. 13, no. 2, pp. 426–435, Mar. 2002.
[16] M. Zheng, J. Bu, C. Chen, C. Wang, L. Zhang, G. Qiu, and D. Cai, "Graph regularized sparse coding for image representation," IEEE Trans. Image Process., vol. 20, no. 5, pp. 1327–1336, May 2011.
[17] M. Elad and M. Aharon, "Image denoising via sparse and redundant representations over learned dictionaries," IEEE Trans. Image Process., vol. 15, no. 12, pp. 3736–3745, Dec. 2006.
[18] M. Aharon, M. Elad, and A. Bruckstein, "K-SVD: An algorithm for designing overcomplete dictionaries for sparse representation," IEEE Trans. Signal Process., vol. 54, no. 11, pp. 4311–4322, Nov. 2006.
[19] J. Mairal, F. Bach, J. Ponce, G. Sapiro, and A. Zisserman, "Supervised dictionary learning," in Advances in Neural Information Processing Systems. Cambridge, MA, USA: MIT Press, 2009.
[20] Z. Xu, X. Chang, F. Xu, and H. Zhang, "L1/2 regularization: A thresholding representation theory and a fast solver," IEEE Trans. Neural Netw. Learn. Syst., vol. 23, no. 7, pp. 1013–1027, Jul. 2012.
[21] Y. Liu and Y. Wu, "Variable selection via a combination of the L0 and L1 penalties," J. Comput. Graph. Stat., vol. 16, no. 4, pp. 782–798, 2007.
[22] R. Tibshirani, "Regression shrinkage and selection via the LASSO," J. Royal Stat. Soc. Ser. B, vol. 58, no. 1, pp. 267–288, 1996.
[23] C. H. Zhang, "Nearly unbiased variable selection under minimax concave penalty," Ann. Stat., vol. 38, no. 2, pp. 894–942, 2010.
[24] E. George and R. McCulloch, "Approaches for Bayesian variable selection," Stat. Sinica, vol. 7, pp. 339–374, Jan. 1997.
[25] T. J. Yen, "A majorization–minimization approach to variable selection using spike and slab priors," Ann. Stat., vol. 39, no. 3, pp. 1748–1775, 2011.
[26] A. Genkin, D. Lewis, and D. Madigan, "Large-scale Bayesian logistic regression for text categorization," Technometrics, vol. 49, no. 3, pp. 291–304, 2007.
[27] R. Ilin, "Unsupervised learning of categorical data with competing models," IEEE Trans. Neural Netw. Learn. Syst., vol. 23, no. 11, pp. 1726–1737, Nov. 2012.
[28] B. Gao, W. Woo, and S. Dlay, "Variational regularized 2-D nonnegative matrix factorization," IEEE Trans. Neural Netw. Learn. Syst., vol. 23, no. 5, pp. 703–716, May 2012.
[29] M. Seeger, F. Steinke, and K. Tsuda, "Bayesian inference and optimal design in the sparse linear model," J. Mach. Learn. Res., vol. 9, pp. 759–813, 2008.
[30] Y. Li and S. Osher, "Coordinate descent optimization for l1 minimization with application to compressed sensing: A greedy algorithm," CAM Rep., Univ. of California, Los Angeles, CA, USA, 2009.
[31] T. Park and G. Casella, "The Bayesian LASSO," J. Amer. Stat. Assoc., vol. 103, no. 482, pp. 681–686, 2008.
[32] M. Tipping, "Sparse Bayesian learning and the relevance vector machine," J. Mach. Learn. Res., vol. 1, pp. 211–244, Jan. 2001.
[33] M. Lewicki and T. Sejnowski, "Learning overcomplete representations," Neural Comput., vol. 12, no. 2, pp. 337–365, Feb. 2000.
[34] J. Mairal, F. Bach, J. Ponce, and G. Sapiro, "Online dictionary learning for sparse coding," in Proc. 26th Annu. Int. Conf. Mach. Learn., 2009, pp. 689–696.
[35] K. Labusch, E. Barth, and T. Martinetz, "Robust and fast learning of sparse codes with stochastic gradient descent," IEEE J. Sel. Topics Signal Process., vol. 5, no. 5, pp. 1048–1060, Sep. 2011.
[36] J. Tropp, "Greed is good: Algorithmic results for sparse approximation," IEEE Trans. Inf. Theory, vol. 50, no. 10, pp. 2231–2242, Oct. 2004.
[37] J. Liu, S. Ji, and J. Ye, SLEP: Sparse Learning With Efficient Projections. Glendale, AZ, USA: Arizona Univ. Press, 2009.
[38] D. Ross, J. Lim, R. Lin, and M. Yang, "Incremental learning for robust visual tracking," Int. J. Comput. Vis., vol. 77, nos. 1–3, pp. 125–141, May 2008.
[39] Q. Zhou, H. Lu, and M. Yang, "Online multiple support instance tracking," in Proc. IEEE Conf. Autom. Face Gesture Recognit., Mar. 2011, pp. 545–552.

Xiaoqiang Lu received the Ph.D. degree from the Dalian University of Technology, Dalian, China. He is a Researcher with the Center for Optical Imagery Analysis and Learning, State Key Laboratory of Transient Optics and Photonics, Xi’an Institute of Optics and Precision Mechanics, Chinese Academy of Sciences, Xi’an, China. His current research interests include pattern recognition, machine learning, hyperspectral image analysis, cellular automata, and medical imaging.

Yulong Wang is currently pursuing the M.S. degree with the Faculty of Mathematics and Computer Science, Hubei University, Wuhan, China. His current research interests include hyperspectral image analysis, pattern recognition, and machine learning.

Yuan Yuan (M’05–SM’09) is a Researcher and Full Professor with the Chinese Academy of Sciences, China. Her current research interests include visual information processing and image/video content analysis.
