

Visual Tracking via Discriminative Sparse Similarity Map

Bohan Zhuang, Huchuan Lu, Senior Member, IEEE, Ziyang Xiao, and Dong Wang

Abstract— In this paper, we cast the tracking problem as finding the candidate that scores highest in an evaluation model built upon a matrix called the discriminative sparse similarity map (DSS map). This map captures the relationship between all the candidates and the templates, and it is constructed from the solution to a novel optimization formulation named the multi-task reverse sparse representation formulation, which searches multiple subsets of the whole candidate set to simultaneously reconstruct multiple templates with minimum error. A customized accelerated proximal gradient (APG) method is derived to obtain the optimal solution (in matrix form) within a few iterations. This formulation allows the candidates to be evaluated accurately in parallel, rather than one by one as most sparsity-based trackers do, while also considering the relationships between candidates; it is therefore superior in terms of cost-performance ratio. The discriminative information contained in this map comes from a large template set with multiple positive target templates and hundreds of negative templates. A Laplacian term is introduced to keep the similarity of the coefficients in accordance with the similarity of the candidates, thereby making our tracker more robust. A pooling approach is proposed to extract the discriminative information in the DSS map, easily yet effectively separating good candidates from bad ones and finally yielding the optimal tracking result. Extensive experimental evaluations on challenging image sequences demonstrate that the proposed tracking algorithm performs favorably against state-of-the-art methods.

Index Terms— Visual tracking, sparse representation, object appearance model.

I. INTRODUCTION

VISUAL tracking, one of the fundamental topics in computer vision, has long played a critical role in numerous applications such as surveillance, military reconnaissance, motion recognition and traffic monitoring. While many breakthroughs have been made over the last decades (like [7]–[16], etc.), it still remains challenging in many respects, including pose variation, illumination change, partial

Manuscript received August 20, 2013; revised December 18, 2013 and February 17, 2014; accepted February 17, 2014. Date of publication February 26, 2014; date of current version March 14, 2014. This work was supported in part by the Natural Science Foundation of China under Grants 61071209 and 61272372, and in part by the Joint Foundation of China Education Ministry and China Mobile Communication Corporation under Grant MCM20122071. The associate editor coordinating the review of this manuscript and approving it for publication was Prof. Richard J. Radke. The authors are with the School of Information and Communication Engineering, Faculty of Electronic Information and Electrical Engineering, Dalian University of Technology, Dalian 116024, China (e-mail: [email protected]; [email protected]; [email protected]; [email protected]). Color versions of one or more of the figures in this paper are available online at http://ieeexplore.ieee.org. Digital Object Identifier 10.1109/TIP.2014.2308414

Fig. 1. Challenges during tracking in real-world environments, including heavy occlusions (Woman), abrupt motion (Face), illumination change (Singer1), pose variation (Girl) and complex background (Cliffbar). We use blue, green, black, yellow, magenta, cyan and red rectangles to represent the tracking results of the OSPT [1], APGL1 [2], LSAT [3], ASLAS [4], MTT [5], SCM [6] and the proposed method, respectively.

occlusion, camera motion and background clutter, as we demonstrate in Fig. 1. A general way to construct a robust tracking system involves two key components: a motion model, e.g., a particle filter [17] or Kalman filter [18], which forecasts the likely movements of the target over time to supply the tracker with a number of candidate states; and an observation model (or appearance model), which evaluates the likelihood of each candidate state being the true target state and selects the best candidate as the tracking result for the current frame. The latter is the core of a tracking system. Since sparse representation was first introduced into visual tracking by Mei and Ling [19], it has been employed to build various efficient trackers (we refer to them as sparse trackers in the remainder of this paper) with favorable experimental performance against other state-of-the-art trackers. In [19], the feature vector of a candidate state is reconstructed by both the target templates and the trivial templates (accounting for



noisy pixels) under sparsity and nonnegativity constraints on the reconstruction coefficients. The likelihood of a candidate being the true target is then measured by its error when reconstructed by the target templates alone. This method requires solving the ℓ1 minimization problem as many times as there are candidates, making it quite computationally expensive. To explore more efficient solutions within the same framework, an approximate solution is developed in [20] to reduce the number of particles that need to be sparsely decomposed, and an efficient gradient descent approach is introduced in [2] to accelerate the solution of the ℓ1 minimization problem. Sharing the candidate evaluation scheme of [19], some other sparsity-based tracking algorithms build new formulations with customized sparse constraint terms. In [3], Liu et al. select a sparse set of features for representing target objects and extend the sparsity constraint to a dynamic group sparsity constraint that accounts for the contiguous distribution of noisy pixels. Zhang et al. [5] formulate the tracking problem using sparse representation within the multi-task learning framework, in which the similarities between candidates are exploited by enforcing joint group sparsity with mixed-norm constraints. An algorithm that also considers the relevance among candidates is presented in [21], where tracking is posed as a low-rank matrix learning problem. Although these new formulations are effective in modeling the object, the reconstruction-error-based candidate evaluation scheme that they share is neither efficient nor robust. Therefore, several sparse trackers not only propose new sparsity-involved models but also improve the candidate evaluation scheme. Liu and Sun [22] propose to use a dictionary composed of all candidates and trivial templates to represent a static object template, and view the decomposition coefficients as the similarities between all candidates and the template. Wang et al. [1] replace the target templates with online-updated PCA basis vectors, which better express the target object subspace. Meanwhile, they use an occlusion mask to explicitly account for the effect of occluded pixels when evaluating a candidate. Jia et al. [4] propose a structural local sparse appearance model that integrates local and global information of an observed image through an alignment pooling method, and the coefficients after pooling are summed to rank the candidates. Zhong et al. [6] develop two independent sparsity-based models and evaluate the candidates by integrating the information from both models. All the aforementioned sparsity-based methods yield impressive tracking performance; however, most of them focus on measuring how much a candidate resembles the foreground object while ignoring the background information, which makes them prone to drift when background objects are similar to the target or when the target appearance bears some similarity to background objects due to partial occlusion. Although Zhong et al. [6] employ a discriminative model, it is more of an assistant to the generative model, and it makes the tracker redundant since hundreds of candidates are all evaluated twice, which doubles the number of ℓ1 minimization problems involved and greatly aggravates the computational complexity.


Motivated by the above analysis, we propose a reverse multi-task sparse tracking framework that projects the template matrix (both positive and negative templates) into the candidate space. By selecting and weighting the discriminative sparse coefficients, the DSS map and the pooling method lead to the best candidate. Our contributions can be summed up in the following three aspects:

• First, we propose a novel optimization formulation named multi-task reverse sparse representation. In our work, a single task means reconstructing a template with a few candidates that bear more similarity to the template than the others, which is the reverse of traditional sparsity-based formulations (like those in [1]–[6], [19], [20]), and multi-task means that we seek to simultaneously reconstruct multiple templates. A customized APG method is derived to obtain the optimal solution (in matrix form) within a few iterations. A Laplacian term is also included to keep the similarity of the coefficients in accordance with the similarity of the candidates, which makes our tracker more robust, as the experimental observations show. This formulation provides the tracker with the similarity relationships between all candidates and templates by solving only one optimization problem, without loss of accuracy. Therefore, this formulation is superior in terms of cost-performance ratio.

• Second, we construct a discriminative sparse similarity map (DSS map) based upon those similarity relationships. The discriminative information contained in this map comes from a large template set composed of multiple positive target templates and hundreds of negative templates. Both the target templates and the background templates are updated online to accommodate appearance change in and near the target area. With this DSS map, candidates are evaluated in both directions: not only how similar a candidate is to the target object but also how different it is from the background. This is also one of the key differences from most previous sparse trackers like [1]–[5], [19]–[21], [23], [24], making our tracker more robust when similar objects appear near the target or when the target appearance bears some similarity to the background due to partial occlusion.

• Third, we propose a simple yet useful additive pooling method to make the best use of the information in the DSS map; before this step, the DSS map is refined with adaptive weights to remove potential instability. Through this pooling scheme, the information for each candidate is integrated into a single score, and the candidate with the highest score is taken as the tracking result.

II. BAYESIAN INFERENCE FRAMEWORK

We carry out object tracking in a Bayesian inference framework, a technique for estimating the posterior distribution of the state variables that characterize a dynamic system, to form a robust tracking algorithm. We define the observation set of the target as Z_t = [z_1, z_2, ..., z_t], and let x_t be the state variable of the object at time t. In our tracking framework, we use the



affine transformation to model the object motion between two consecutive frames. Then the optimal state x̂_t can be computed by maximum a posteriori (MAP) estimation,

$$\hat{x}_t = \arg\max_{x_t^i} \, p(x_t^i \mid Z_t) \tag{1}$$

where x_t^i indicates the state of the i-th sample. The posterior probability can be inferred recursively within the Bayesian framework,

$$p(x_t \mid Z_t) \propto p(z_t \mid x_t) \int p(x_t \mid x_{t-1})\, p(x_{t-1} \mid Z_{t-1})\, dx_{t-1} \tag{2}$$

where p(x_t | x_{t-1}) is the dynamic model and p(z_t | x_t) denotes the observation likelihood. The state variable x_t is composed of six independent affine parameters {α_1, α_2, α_3, α_4, t_1, t_2}, in which {α_1, α_2, α_3, α_4} are the deformation parameters and {t_1, t_2} contain the 2D translation information. The dynamic model is modeled by a Gaussian distribution,

$$p(x_t \mid x_{t-1}) = \mathcal{N}(x_t;\, x_{t-1}, \sigma) \tag{3}$$

where σ is a diagonal covariance matrix whose elements are the variances of the affine parameters. Through this method, we obtain the candidate set Y = [y_1, y_2, ..., y_m] ∈ R^{d×m}, in which d is the feature dimension and m is the number of candidates.
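For concreteness, the following minimal Python sketch implements the Gaussian dynamic model of Eq. 3: each candidate state is the previous state perturbed by independent Gaussian noise. The function name and the numeric values of `x_prev` and `sigma` are illustrative assumptions, not values from the paper.

```python
import numpy as np

def sample_candidates(x_prev, sigma, m, rng=None):
    """Draw m candidate affine states around the previous state (Eq. 3).

    x_prev : (6,) previous affine parameters (4 deformation + 2 translation).
    sigma  : (6,) per-parameter standard deviations (the diagonal of the
             covariance matrix in Eq. 3).
    Returns an (m, 6) array holding the candidate states x_t^i.
    """
    rng = np.random.default_rng() if rng is None else rng
    # p(x_t | x_{t-1}) = N(x_t; x_{t-1}, sigma): independent Gaussian noise
    return x_prev[None, :] + rng.standard_normal((m, 6)) * sigma[None, :]

# Illustrative usage: 600 candidates spread around the last estimated state.
x_prev = np.array([1.0, 0.0, 0.0, 1.0, 120.0, 80.0])    # hypothetical state
sigma = np.array([0.01, 0.005, 0.005, 0.01, 4.0, 4.0])  # hypothetical spread
X = sample_candidates(x_prev, sigma, m=600)
```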

The observation model p(z_t | x_t) essentially reflects the likelihood of observing z_t at state x_t. In this paper, p(z_t | x_t) is proportional to the discriminative score obtained by applying the additive pooling scheme to the DSS map.

III. PROBLEM FORMULATION

A. Discriminative Reverse Sparse Representation

To construct a robust tracker, the numbers of templates and candidates often amount to hundreds or even thousands, and high-dimensional features must be used to retain rich target information. However, traditional sparse coding based trackers perform computationally expensive ℓ1 regularizations at each frame for each candidate. Hundreds of ℓ1 regularizations per frame make the computational load so high that such a tracker is unsuitable for processing high-dimensional image features in fast and robust tracking applications under dynamic environments. Reversing the conventional sparse representation, where a candidate (an observed image patch associated with a state) is reconstructed mainly by several target templates, we construct the dictionary from the candidate set Y to represent each target template, as in Eq. 4, with sparsity and nonnegativity constraints,

$$\arg\min_{c} \|t - Yc\|_2^2 + \lambda \|c\|_1, \quad \text{s.t. } c \succeq 0 \tag{4}$$

where t denotes a representative template, λ is the parameter that adjusts the sparsity penalty term and c represents the coefficient vector. With the sparsity constraint and the goal of minimizing the reconstruction error term, only a few candidates that bear more similarity to the template will be involved in representing the template. Their associated elements in c are positive, and the magnitudes of these elements are assumed to indicate the similarity levels. Thus, we add the constraint c ⪰ 0, which means all the elements of c are nonnegative, because each element represents the similarity between the corresponding template and candidate, and negative elements are meaningless. Beyond that, although the ℓ1 minimization makes the tracker efficient and adaptive to appearance change, the lack of negative templates leaves its discriminative power poor, since the background information around the target is ignored, which may cause the tracker to gradually drift away from the target. Therefore, in this work, multiple positive target templates are exploited to make the tracker more responsive to a variety of appearance changes. Meanwhile, to better capitalize on the distinction between the foreground and the background for locating the target, we use plenty of negative templates, which are capable of fully sketching out the periphery of the target area. The positive and negative template sets are respectively defined as T_pos = [t_1, t_2, ..., t_p] and T_neg = [t_{p+1}, t_{p+2}, ..., t_{p+n}], where p and n denote the numbers of positive and negative templates. With these assumptions, our problem formulation is equivalent to an ensemble of sparse decomposition problems in which the templates are expressed by finding combinations of the particles and the corresponding coefficients, as follows:

$$\begin{cases} \arg\min_{c_1} \|t_1 - Yc_1\|_2^2 + \lambda \|c_1\|_1 \\ \quad\vdots \\ \arg\min_{c_p} \|t_p - Yc_p\|_2^2 + \lambda \|c_p\|_1 \\ \arg\min_{c_{p+1}} \|t_{p+1} - Yc_{p+1}\|_2^2 + \lambda \|c_{p+1}\|_1 \\ \quad\vdots \\ \arg\min_{c_{p+n}} \|t_{p+n} - Yc_{p+n}\|_2^2 + \lambda \|c_{p+n}\|_1 \end{cases} \tag{5}$$

where c_i = [c_i^1, c_i^2, ..., c_i^m]^⊤ expresses the sparse coefficients of the i-th template, and c_i ⪰ 0, i = 1, 2, ..., (p+n), means that all the elements in c_i are nonnegative. In this formulation, one template is decomposed in each sparse representation procedure through ℓ1 optimization, and the whole process continues until all the positive and negative templates have been represented. We give an illustration of the basic idea of this formulation in Fig. 2. The matrix formed by the reconstruction coefficient vectors of all templates is defined as the sparse map C = [c_1, ..., c_{p+n}], which fundamentally reflects a mapping relationship between the reference templates and the candidates, i.e., the value of a map element c_i^j can be understood as an indicator of similarity between the i-th template and the j-th candidate. The candidates that contribute more to reconstructing a template should correspond to large map elements, while those carrying little information about the template should correspond



Fig. 2. Problem Formulation. This figure illustrates the basic idea of the multi-task reverse sparse representation scheme. (a) The positive and the negative template sets. (b) The sampled candidates. (c) The discriminative sparse similarity map (DSS map).

to smaller ones, in most cases zero. Meanwhile, we obtain two sub-maps, i.e., C_pos = [c_1, ..., c_p] for the positive template set and C_neg = [c_{p+1}, ..., c_{p+n}] for the negative one. The rightmost part of Fig. 2 is the sparse map C, each column vector of which contains the sparse coefficients of all the candidates representing one template.
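As a baseline illustration of Eqs. 4–5 (before the joint Laplacian formulation of Section III-B), the sketch below solves each template's reverse sparse coding problem independently with a nonnegative lasso. The rescaling of `alpha` converts scikit-learn's averaged objective into the paper's ||t − Yc||² + λ||c||₁ form; the use of scikit-learn and the default λ = 0.04 (the value reported in Section VI) are assumptions of this sketch.

```python
import numpy as np
from sklearn.linear_model import Lasso

def reverse_sparse_map(T, Y, lam=0.04):
    """Independently solve the p+n problems of Eq. 5.

    T : (d, p+n) template matrix [t_1, ..., t_{p+n}]
    Y : (d, m)   candidate dictionary
    Returns C : (m, p+n); column i is the nonnegative sparse code c_i
    of template t_i over the m candidates.
    """
    d, m = Y.shape
    C = np.zeros((m, T.shape[1]))
    for i in range(T.shape[1]):
        # sklearn's Lasso minimizes (1/(2d))||t - Yc||^2 + alpha*||c||_1,
        # so alpha = lam/(2d) matches ||t - Yc||^2 + lam*||c||_1;
        # positive=True enforces the constraint c >= 0 of Eq. 4.
        solver = Lasso(alpha=lam / (2 * d), positive=True, max_iter=5000)
        solver.fit(Y, T[:, i])
        C[:, i] = solver.coef_
    return C
```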

B. Laplacian Multi-Task Reverse Sparse Representation

Overall, the formulation presented in Eq. 5 suffers from two principal problems. First, it still requires solving multiple ℓ1 minimization problems per frame, which is computationally expensive, especially when a large template set is maintained. Second, the dependence among the features of particles is ignored, so even similar features may show unreasonable differences in their sparse representation responses, which manifests as disparities in the corresponding coefficients. To alleviate these defects, we reformulate the calculation of the decomposition coefficients for multiple templates as a single optimization procedure, in which the optimal similarity map C can be computed as a whole. Intuitively, we propose a multi-task concept here, in which a single task means that one template can be represented as a linear combination of a few similar candidates, and, further, multi-task refers to reconstructing multiple templates simultaneously. We name this procedure the multi-task reverse sparse representation problem, as in Eq. 6:

$$\arg\min_{C} \|T - YC\|_2^2 + \lambda \sum_i \|c_i\|_1 \quad \text{s.t. } c_i \succeq 0,\; i = 1, 2, \ldots, (p+n). \tag{6}$$

In addition, to preserve the similarity of the sparse codes of similar candidate features, we introduce a customized Laplacian regularization term inspired by the success of a similar scheme for image classification [25]. To begin with, we have the following formulation:

$$\arg\min_{C} \|T - YC\|_2^2 + \lambda \sum_i \|c_i\|_1 + \frac{\delta}{2} \sum_{ij} \|c_i - c_j\|_2^2 B_{ij} \quad \text{s.t. } c_i \succeq 0,\; i = 1, 2, \ldots, (p+n), \tag{7}$$

where δ is the parameter that adjusts the new regularization term and B is a binary matrix indicating the relationship between any two coefficient vectors, with B_{ij} = 1 if c_i is among the K nearest neighbors of c_j and B_{ij} = 0 otherwise. The last term of this formula can be transformed as

$$\frac{1}{2}\sum_{ij} \|c_i - c_j\|_2^2 B_{ij} = \sum_i \|c_i\|^2 D_i + \sum_j \|c_j\|^2 D_j - 2\sum_{ij} c_i^{\top} c_j B_{ij} = 2\,\mathrm{tr}(C L C^{\top}) \tag{8}$$

where L = D − B is the Laplacian matrix, the degree of c_i is defined as D_i = Σ_{j=1}^{p+n} B_{ij}, and D = diag(D_1, D_2, ..., D_{p+n}). The Laplacian multi-task reverse optimization problem is therefore reformulated as

$$\arg\min_{C} \|T - YC\|_2^2 + \lambda \sum_i \|c_i\|_1 + \delta\, \mathrm{tr}(C L C^{\top}) \quad \text{s.t. } c_i \succeq 0,\; i = 1, 2, \ldots, (p+n). \tag{9}$$
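The graph quantities in Eqs. 8–9 can be assembled directly from their definitions. The sketch below builds the binary kNN matrix B, the degree matrix D, and L = D − B for a set of column vectors; note that a kNN graph is generally asymmetric, which is why L + L^⊤ appears in the gradient of Eq. 14. Operating on the template-side items so that L is (p+n)×(p+n), as Eq. 9 requires, and K = 5 are assumptions of this sketch.

```python
import numpy as np

def knn_laplacian(V, K=5):
    """Binary kNN adjacency B and Laplacian L = D - B over the columns of V.

    V : (d, q) matrix whose q columns are the compared items; with q = p+n
        this yields the (p+n) x (p+n) Laplacian used in Eq. 9.
    """
    q = V.shape[1]
    # pairwise Euclidean distances between columns
    dist = np.linalg.norm(V[:, :, None] - V[:, None, :], axis=0)  # (q, q)
    np.fill_diagonal(dist, np.inf)          # a column is not its own neighbor
    B = np.zeros((q, q))
    for j in range(q):
        nn = np.argsort(dist[:, j])[:K]     # K nearest neighbors of column j
        B[nn, j] = 1.0                      # B_ij = 1 if i is a kNN of j
    D = np.diag(B.sum(axis=1))              # degrees D_i = sum_j B_ij
    return D - B                            # L = D - B (Eq. 8)
```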

Let 1 ∈ R^m (m is the number of candidates) denote the column vector whose entries are all ones, and define ψ(c_i) as

$$\psi(c_i) = \begin{cases} 0 & c_i \succeq 0 \\ +\infty & \text{otherwise.} \end{cases} \tag{10}$$

With this non-negative constraint, Eq. 9 can be optimized as

$$\arg\min_{C} \|T - YC\|_2^2 + \lambda\, \mathbf{1}^{\top} C\, \mathbf{1}_* + \delta\, \mathrm{tr}(C L C^{\top}) + \psi(C) \tag{11}$$

where 1_* is the all-ones vector over the templates, defined below. Then we apply the accelerated proximal gradient (APG) approach [2] to solve this minimization problem, with

$$F(C) = \|T - YC\|_2^2 + \lambda\, \mathbf{1}^{\top} C\, \mathbf{1}_* + \delta\, \mathrm{tr}(C L C^{\top}), \qquad G(C) = \psi(C) \tag{12}$$

where F(C) is a differentiable convex function and G(C) is a non-smooth convex function. Following the APG method, we need to solve the optimization problem

$$\mu_{k+1} = \arg\min_{C} \frac{\xi}{2}\left\| C - \varepsilon_{k+1} + \frac{\nabla F(\varepsilon_{k+1})}{\xi} \right\|_2^2 + G(C) \tag{13}$$

where ξ is the Lipschitz constant (the function F has a Lipschitz-continuous gradient) and the variable k denotes the current iteration index.



Fig. 3. This figure illustrates how the discriminative sparse similarity map indicates whether a candidate is good or not. (a) The original discriminative similarity map. A typical good candidate and a bad one are picked as examples. (b) The process of obtaining the refined discriminative feature for the good candidate. The notation ⊙ denotes the Hadamard product (element-wise product). (c) The process of obtaining the refined discriminative feature for the bad candidate. The sub-features related to the positive/negative templates are shown in red/green. Note that the positive part of the refined discriminative feature for the bad candidate is weakened by the adaptive weights.

Then we define g_{k+1} = ε_{k+1} − ∇F(ε_{k+1})/ξ. Since

$$\nabla F(\varepsilon_{k+1}) = -Y^{\top}(T - Y\varepsilon_{k+1}) + \delta\, \varepsilon_{k+1}(L + L^{\top}) + \lambda\, \mathbf{1}\mathbf{1}_*^{\top}, \tag{14}$$

we can easily obtain

$$g_{k+1} = \varepsilon_{k+1} + \frac{1}{\xi}\left[\, Y^{\top}(T - Y\varepsilon_{k+1}) - \delta\, \varepsilon_{k+1}(L + L^{\top}) - \lambda\, \mathbf{1}\mathbf{1}_*^{\top} \right] \tag{15}$$

where 1_* ∈ R^{p+n} (p+n is the number of templates) denotes the vector whose entries are all ones. Based on the above, Eq. 13 is equivalent to

$$\mu_{k+1} = \max(0, g_{k+1}). \tag{16}$$

Algorithm 1 Algorithm for Optimizing the Laplacian Multi-Task Reverse Sparse Representation.
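A compact sketch of these APG iterations follows. The gradient line implements Eq. 14, the step implements Eq. 15, and the projection implements Eq. 16; the momentum schedule is the standard APG recipe, which may differ in detail from the paper's Algorithm 1. The default λ, δ, ξ and iteration count are the values reported in Section VI.

```python
import numpy as np

def apg_solve(T, Y, L, lam=0.04, delta=0.8, xi=1.0 / 0.00018, n_iter=5):
    """APG iterations for the Laplacian multi-task problem of Eq. 11.

    T : (d, p+n) templates, Y : (d, m) candidates, L : (p+n, p+n) Laplacian.
    Returns C : (m, p+n), the nonnegative coefficient map.
    """
    m, q = Y.shape[1], T.shape[1]
    C_prev = C = np.zeros((m, q))
    t_prev = t = 1.0
    ones = np.ones((m, q))                       # the 1 1_*^T term
    for _ in range(n_iter):
        # extrapolation point eps_{k+1} (standard APG momentum)
        eps = C + ((t_prev - 1.0) / t) * (C - C_prev)
        # nabla F(eps), Eq. 14
        grad = -Y.T @ (T - Y @ eps) + delta * eps @ (L + L.T) + lam * ones
        g = eps - grad / xi                      # Eq. 15
        C_prev, C = C, np.maximum(0.0, g)        # Eq. 16: project onto C >= 0
        t_prev, t = t, (1.0 + np.sqrt(1.0 + 4.0 * t * t)) / 2.0
    return C
```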

The algorithm for solving our Laplacian multi-task reverse sparse representation problem is summarized in Algorithm 1. The computational complexity of each iteration of Algorithm 1 is dominated by step 3. Thus, the per-frame complexity is O(kd(p+n)), where k is the iteration number, d is the feature dimension and p+n is the total number of templates.

IV. OBJECT TRACKING VIA THE REFINED DSS MAP

A. Weighted Discriminative Sparse Similarity Map

1) Discriminative Sparse Similarity Map: In this subsection, we further interpret the discriminative sparse similarity map. As introduced in the sections above and demonstrated in Fig. 2, each column of C denotes the coefficients of a certain template decomposed over all candidates. It is worth noting, however, that each row of C corresponds to the responses of one candidate on all templates, which can be viewed as a discriminative feature of this candidate. For the i-th candidate, we have

$$f_i = [C_{i1}, \ldots, C_{ip}, C_{i(p+1)}, \ldots, C_{i(p+n)}]^{\top} \tag{17}$$

where C_{ij} is the element in the i-th row and j-th column of C. From this perspective, the similarity map can be represented as

$$F = [f_1, \ldots, f_m] = C^{\top}, \tag{18}$$

where each column is the discriminative feature of a candidate, indicating its similarity levels to the p positive templates and the n negative templates. The discriminative nature of this feature is reflected in the distribution of its larger elements, as shown in Fig. 3. For a good candidate, the indices of the larger elements in f must lie in the range [1, p], corresponding to several positive templates. Likewise, a bad candidate should be more similar to some negative templates, which results in larger coefficients with indices in [p+1, p+n] and small or even zero coefficients on the positive templates. For the subsequent implementation, we define two sub similarity maps as F_pos = C_pos^⊤ and F_neg = C_neg^⊤.
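In code, Eqs. 17–18 amount to a transpose and a split of the coefficient map; a minimal sketch, with shapes assumed as in the solver above:

```python
import numpy as np

def split_features(C, p):
    """Rows of C are candidate features (Eq. 17); F = C^T (Eq. 18).

    C : (m, p+n) coefficient map from the solver above.
    Returns F_pos (p, m) and F_neg (n, m), the two sub similarity maps.
    """
    F = C.T                 # column i of F is the feature f_i of candidate i
    return F[:p, :], F[p:, :]
```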



2) Refined Discriminative Sparse Similarity Map: To remove potential instability and achieve better robustness, we refine the DSS map with adaptive weights. The weight W_{ij} for an element F_{ij} in the similarity map is constructed based on the difference between the j-th candidate y_j and the i-th template t_i:

$$W_{ij} \propto \exp(-\|t_i - y_j\|_2^2). \tag{19}$$

A candidate with a smaller difference from a foreground template shares higher similarity with it, indicating that the candidate is more likely to be the target object, and vice versa. For the following employment, we separate the weight map into two submaps,

$$W_{pos} = [w_1, \ldots, w_p]^{\top}, \qquad W_{neg} = [w_{p+1}, \ldots, w_{p+n}]^{\top}, \tag{20}$$

where w_i = [W_{i1}, ..., W_{im}]^⊤ for i = 1, 2, ..., (p+n). Then we obtain two weighted DSS maps through

$$\tilde{F}_{pos} = W_{pos} \odot F_{pos}, \tag{21}$$

$$\tilde{F}_{neg} = W_{neg} \odot F_{neg}, \tag{22}$$

where ⊙ is the Hadamard product (element-wise product). In the weighted DSS map, an element F̃_{ij} = W_{ij} · C_{ij} is large only when the j-th candidate has a small difference from the i-th template and plays a significant role in decomposing the i-th template together with other candidates. Otherwise, F̃_{ij} will have a small or even zero value, indicating that the j-th candidate bears little similarity to the i-th template. An example is shown in Fig. 3(c) to illustrate the benefit of this refinement process. For the bad candidate, the sub-feature related to the positive templates (in red) is non-zero, since the positive templates might account for some minor parts of the bad candidate in the ℓ1 minimization process. Although these values are small, they might cause unexpected tracking results. However, by applying the adaptive weights, the refined sub-feature related to the positive templates (in red) is suppressed to be close to zero, which means the bad candidate bears similarity only to some negative templates rather than to any positive template. In this way, we obtain the most accurate feature for each candidate and hence a convincing final candidate score.

B. Additive Pooling

For the i-th candidate, we view the i-th column of the refined similarity map F̃ as a refined discriminative feature,

$$\tilde{f}_i = [\tilde{F}_{1i}, \ldots, \tilde{F}_{pi}, \tilde{F}_{(p+1)i}, \ldots, \tilde{F}_{(n+p)i}]^{\top}, \tag{23}$$

and we have two sub-features, each representing the candidate's resemblance to the positive and the negative templates respectively:

$$\tilde{f}_{i\text{-}pos} = [\tilde{F}_{1i}, \ldots, \tilde{F}_{pi}]^{\top}, \qquad \tilde{f}_{i\text{-}neg} = [\tilde{F}_{(p+1)i}, \ldots, \tilde{F}_{(n+p)i}]^{\top} \tag{24}$$

from which we calculate the candidate's confidence s_i of being the true target object through an intuitive additive pooling method consisting of two steps. First, we separately sum the largest l coefficients in f̃_{i-pos} and f̃_{i-neg} to get the scores s_{i-pos} and s_{i-neg}, indicating the extent to which the i-th candidate is related to the positive and the negative template sets. This process can be expressed as

$$s_{i\text{-}pos} = \mathcal{L}(\tilde{f}_{i\text{-}pos}, 1) + \cdots + \mathcal{L}(\tilde{f}_{i\text{-}pos}, l), \qquad s_{i\text{-}neg} = \mathcal{L}(\tilde{f}_{i\text{-}neg}, 1) + \cdots + \mathcal{L}(\tilde{f}_{i\text{-}neg}, l) \tag{25}$$

where L(f̃, k) denotes the k-th largest element of f̃; in this work we set l to half the number of positive templates. Discarding the small map values that may come from uncertain interference ensures that we get more robust scores. Second, the discriminative score for the i-th candidate is formulated as

$$s_i = s_{i\text{-}pos} - s_{i\text{-}neg} \tag{26}$$

and the score set for all candidates is denoted as S = {s_i}_{i=1,...,m}. This formulation is based on the assumption that a candidate with a larger foreground score and a smaller background score is more likely to be the target object, and vice versa. Namely, a target observation should have a large discriminative score, while a bad candidate has a relatively small one. Thus the additive pooling process is completed after the two steps defined by Eqs. 25 and 26. The likelihood of the observation y_i being the target at state x_t can be constructed within the Bayesian framework by

$$p(y_i \mid x_t) \propto s_i. \tag{27}$$

Finally, the target observation y_t is located by maximizing

$$p(y_t \mid x_t) = \max_i\, p(y_i \mid x_t). \tag{28}$$
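Putting Eqs. 19–26 together, the sketch below weights the sub similarity maps and pools them into one discriminative score per candidate; realizing the proportionality in Eq. 19 as a plain exponential without normalization is an assumption of this sketch.

```python
import numpy as np

def candidate_scores(T, Y, F_pos, F_neg, p, l=None):
    """Weighted DSS map (Eqs. 19-22) and additive pooling (Eqs. 25-26).

    T : (d, p+n) templates, Y : (d, m) candidates,
    F_pos : (p, m) and F_neg : (n, m) sub similarity maps.
    Returns s : (m,) discriminative scores, one per candidate.
    """
    # Eq. 19: W_ij ~ exp(-||t_i - y_j||^2) for all template/candidate pairs
    d2 = ((T[:, :, None] - Y[:, None, :]) ** 2).sum(axis=0)     # (p+n, m)
    W = np.exp(-d2)
    Fp = W[:p, :] * F_pos                                        # Eq. 21
    Fn = W[p:, :] * F_neg                                        # Eq. 22
    l = p // 2 if l is None else l          # the paper sets l to half of p
    # Eq. 25: sum the l largest entries of each refined sub-feature (column)
    s_pos = np.sort(Fp, axis=0)[-l:, :].sum(axis=0)
    s_neg = np.sort(Fn, axis=0)[-l:, :].sum(axis=0)
    return s_pos - s_neg                                         # Eq. 26

# Eqs. 27-28: the tracking result is the candidate with the highest score,
# e.g. best = int(np.argmax(s)).
```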

We give a summary of this additive pooling scheme in Fig. 4. Here, we notice that some discriminative scores are similar to each other. This result is reasonable because we sample numerous candidates and, inevitably, some of them share similar features, which leads to similar responses to the additive pooling scheme.

V. IMPORTANT IMPLEMENTATION SCHEMES

To make this work clear and complete, we briefly introduce some less novel but rather important implementation schemes in our work.

A. Locally Normalized Features

In this work we adopt locally normalized features to withstand partial occlusion and moderate appearance variation. An observed image patch A is partitioned into E local patches, each of which is independently expressed in gray-scale values, vectorized and normalized to a vector with unit ℓ2 norm. Then we concatenate these local feature vectors so that the global structural information is maintained. The candidates and templates in this work are all represented with these locally normalized features.
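A minimal sketch of the locally normalized feature, assuming the 32×32 patch size and a 4×4 grid of local blocks mentioned in Section VI; the exact partitioning in the paper may differ.

```python
import numpy as np

def locally_normalized_feature(patch, grid=4):
    """Partition a square patch into grid x grid blocks, l2-normalize each
    block independently, and concatenate them (Section V-A).

    patch : (32, 32) gray-scale array (assumed size).
    """
    h = patch.shape[0] // grid
    w = patch.shape[1] // grid
    blocks = []
    for r in range(grid):
        for c in range(grid):
            v = patch[r * h:(r + 1) * h, c * w:(c + 1) * w].ravel().astype(float)
            n = np.linalg.norm(v)
            blocks.append(v / n if n > 0 else v)   # unit l2 norm per block
    # concatenation preserves the block layout, keeping global structure
    return np.concatenate(blocks)
```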



Fig. 4. This figure intuitively illustrates how to get the discriminative scores for all candidates and choose the best candidate state based on it. (a) The weighted DSS map. (b) Two score vectors after the first step of additive pooling, and they respectively indicate the degree of resemblance to the positive (upper one) and the negative (bottom one) template set for all candidates. (c) The final discriminative score vector after the second step of additive pooling. (d) The optimal state corresponding to the candidate that scores the highest.

B. Initial Discriminative Template Sets

The first tracking result is a manually chosen rectangular area. Let point Q(h, v) be the center of the rectangular region; we sample p patches as the initial positive templates around Q(h, v) within a circular area satisfying ||Q_i − Q(h, v)|| < ϕ, where Q_i is the center of the i-th sampled patch. Similarly, the initial negative template set, which is updated dynamically, is sampled from the annular region ϕ < ||Q_j − Q(h, v)|| < γ, a few pixels away from Q, where Q_j is the center of the j-th sampled image, and ϕ and γ are the inner and outer radii of the annular region, respectively.

C. Update Scheme

For the positive template set, as the target in the first frame is always the ground truth, we keep the first template in the positive template set unchanged to alleviate the drifting problem. We denote η = [η_1, η_2, ..., η_p] as the similarity vector and set a threshold θ to describe the degree of similarity. In each frame, we measure the similarity η_i between the current tracking result and the i-th positive template based on the Euclidean distance. Then we compare the maximum similarity value η_max = max η_i, i = 1, 2, ..., p, with the threshold θ. If η_max > θ, we use the tracking result to replace the positive template that has the largest similarity with the new target appearance. Otherwise, there is either a drastic appearance change between adjacent frames or a significant part of the target object is occluded, so we discard this bad sample without updating. On the other hand, for the negative templates, although the background information varies a lot along the tracking process, we only sample negative templates around the tracking result of the last frame. Since the backgrounds of two successive frames are quite similar, the negative templates can be well represented by the current candidates that contain much background information. In this way, these bad candidates achieve lower scores in the subsequent pooling step and are not considered as possible tracking results, because they take part in representing the negative templates.
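A sketch of the positive-template update rule in Section V-C, under the assumption that the Euclidean distance is mapped to a similarity via exp(−distance); the paper does not spell out this mapping, so it and the in-place update are illustrative.

```python
import numpy as np

def update_positive_templates(T_pos, result, theta=0.4):
    """Replace the most similar positive template with the new tracking
    result when the maximum similarity exceeds theta (Section V-C).

    T_pos : (d, p) positive templates; column 0 is the first-frame template
            and is never replaced, to anchor against drift.
    result : (d,) feature of the current tracking result.
    """
    # hypothetical similarity: exp(-Euclidean distance), in (0, 1]
    eta = np.exp(-np.linalg.norm(T_pos - result[:, None], axis=0))
    eta[0] = -np.inf                    # keep the first template fixed
    i = int(np.argmax(eta))
    if eta[i] > theta:                  # confident, non-occluded result
        T_pos[:, i] = result            # replace the closest template
    # otherwise: drastic appearance change or occlusion -- skip the update
    return T_pos
```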

VI. EXPERIMENTS

The proposed algorithm is implemented in MATLAB and runs at 2 frames per second on a 2.5 GHz Intel Core i5-2450M PC with 4 GB memory. The parameters, which are fixed for all sequences, are summarized as follows. In Eq. 9, the sparse regularization constant λ is set to 0.04 and the Laplacian constraint weight δ is 0.8. The iteration number is 5 and the Lipschitz constant ξ is 1/0.00018. The variables p and n (the numbers of positive and negative templates) are set to 10 and 150, respectively. The update threshold θ is 0.4. We resize the target image patch to 32×32 pixels and extract 4×4 local patches within the target region. We update both positive and negative templates in each frame. The MATLAB source code and datasets will be made available on our website (http://ice.dlut.edu.cn/lu/publications.html).

A. Key Component Validation

In this section, we qualitatively discuss the effect of the Laplacian constraint term and the negative templates. It is worth noticing that OWN (our algorithm without the negative template set) and OWL (our algorithm without the Laplacian constraint) perform relatively well too, from which we can conclude that the overall framework is effective. But without the negative templates or the Laplacian constraint, the robustness of our tracker indeed decreases to some extent. As shown in Tables I and II, all the results of the proposed algorithm are better than those of OWN and OWL. Compared with OWL, we can conclude that the Laplacian constraint serves to increase the stability of the proposed algorithm. The OWN tracker performs relatively poorly compared with OWL and the proposed algorithm on sequences undergoing heavy occlusion or severe background clutter, which demonstrates the significant role of the negative templates in handling occlusion and segregating the foreground target from the background.

B. Quantitative Evaluation

We use fifteen challenging videos in the experiments to evaluate the performance of the proposed algorithm. The challenging factors of these videos include heavy occlusion, motion blur, pose variation, background clutter and illumination change. The proposed approach is compared with eleven state-of-the-art algorithms, including the IVT [7], APGL1 [2], PN [8], VTD [9], MILTrack [10], FragTrack [11], MTT [5], OSPT [1], ASLAS [4], LSAT [3] and SCM [6] methods. Two extra algorithms, OWL and OWN, are also introduced for



TABLE I
COMPARISON RESULTS IN TERMS OF AVERAGE CENTER ERROR (IN PIXELS). THE BEST THREE RESULTS ARE SHOWN IN RED, BLUE, AND GREEN FONTS. (The last two columns are for self-comparison and do not participate in ranking.)

TABLE II
COMPARISON RESULTS IN TERMS OF AVERAGE OVERLAP RATE. THE BEST THREE RESULTS ARE SHOWN IN RED, BLUE, AND GREEN FONTS. THE LAST ROW SHOWS COMPARISON RESULTS ABOUT COMPUTATIONAL LOADS IN TERMS OF FPS. (The last two columns are for self-comparison and do not participate in ranking.)

self-comparison. For fair evaluation, we use the source code provided by the authors and run these codes with the same initial position of the target. To assess the performance of the proposed tracker, two criteria, the center location error and the overlap rate, are employed in our paper. Note that a smaller average center error or a bigger overlap rate means a more accurate result. Given the tracking result R_T of each frame and the corresponding ground truth R_G, the overlap rate follows the PASCAL VOC [26] criterion, score = area(R_T ∩ R_G)/area(R_T ∪ R_G). Tables I and II report the quantitative comparison results in terms of the average center location errors and the average overlap rates, respectively. As shown in the tables, the proposed tracker yields favorable performance against other state-of-the-art methods. Regarding the computational load, the last row of Table II reports the comparison in terms of fps, obtained by running all the algorithms on computers with the same configuration and on the same dataset for fair comparison. From the results we can tell that although the sparse representation based trackers (the APGL1, MTT, OSPT, ASLAS, LSAT, SCM trackers and the proposed tracker) are slower than some classic trackers such as the IVT and MIL trackers, they generally yield superior performance. Among the sparsity based trackers, the proposed tracker is best in accuracy and second in speed, striking a good balance between performance and computational load.

C. Qualitative Evaluation

As the sparse trackers generally perform better than the other state-of-the-art methods and are more related to our work, we only demonstrate comparisons with them in Fig. 5, including the OSPT [1], APGL1 [2], LSAT [3], ASLAS [4], MTT [5], SCM [6] and the proposed method.



Fig. 5. Sample tracking results on fifteen challenging sequences. (a) Occlusion1 and Woman with heavy occlusion and in-plane rotation. (b) Caviar1 and Caviar2 with heavy occlusion and in-plane rotation. (c) Face, Jumping and Deer with abrupt motion. (d) DavidIndoor, Singer1 and Car4 with illumination change. (e) Sylvester2008b, Girl and Dudek with pose variation. (f) Cliffbar and Car11 with background clutter.

Heavy Occlusion: We test four sequences (Occlusion1, Woman, Caviar1, Caviar2) characterized by either severe occlusion or long-term partial occlusion. Fig. 5(a) and (b) confirms the robustness of the proposed algorithm in dealing with rotation and scale change when the target undergoes heavy occlusion. Since two sub discriminative features are formulated to evaluate a candidate's similarity to the positive and negative template sets respectively, even though a good candidate under occlusion bears some similarity to the background, other misaligned candidates bear more, which makes their final scores lower than that of the good candidate. Moreover, as we use the particles to reconstruct the templates, the influence of the occluded parts of the particles is effectively suppressed, because they contribute little to the reconstruction process.

Motion Blur: Fig. 5(c) presents the tracking results on the sequences Face, Jumping and Deer. As the target object undergoes abrupt motion, it is rather difficult to locate its position accurately and to account for the blur, which reduces the discriminative information in the feature vectors. It is worth noticing that the proposed method performs better than the other algorithms. Thanks to the discriminative template set and the update scheme, our tracker can maximally capture the appearance change information in and near the target area and accurately select the target from the background even with

limited discriminative information in the feature vectors when blur occurs.

Illumination Change: Fig. 5(d) demonstrates the tracking results on the sequences DavidIndoor, Singer1 and Car4 with drastic illumination change. Our tracker successfully follows the target throughout the entire sequences, which can be attributed to the locally normalized features that are highly effective in resisting lighting changes. We also observe that, thanks to a template update strategy with incremental subspace learning that enables the tracker to capture lighting change, the ASLAS algorithm achieves good performance on these sequences as well.

Rotation: The sequences Sylvester2008b, Girl and Dudek, involving both in-plane and out-of-plane rotations, are reported in Fig. 5(e). As we use affine transformation parameters that include the rotation angle, we can capture rotating candidates for further selection. We also observe that some trackers do not adapt to scale change or in-plane rotation (e.g., LSAT, APGL1 and MTT).

Background Clutter: Fig. 5(f) shows the tracking results on Cliffbar and Car11 with complex backgrounds. By introducing both the positive template set and the negative template set to model the foreground and the background information respectively, we can obtain sufficient discriminative information


and store it in the DSS map. Meanwhile, the additive pooling method effectively extracts the discriminative information in the DSS map and enables our method to accurately calculate the discriminative scores and find the optimal candidate.

VII. CONCLUSION

In this paper, we propose an efficient tracking algorithm based on a discriminative sparse similarity map, which is obtained via a multi-task reverse sparse coding approach with a Laplacian constraint. The proposed formulation enjoys advantages including a light computational load, achieved through a customized APG method, and good stability, achieved by incorporating a Laplacian term. The dynamically updated positive and negative template sets supply our tracker with sufficient discriminative information, which is stored in the DSS map and accurately integrated via an additive pooling scheme. Both quantitative and qualitative evaluations against several state-of-the-art algorithms on challenging image sequences demonstrate the accuracy and robustness of the proposed tracker.


[21] T. Zhang, B. Ghanem, S. Liu, and N. Ahuja, “Low-rank sparse learning for robust visual tracking,” in Proc. ECCV, 2012, pp. 470–484. [22] H. Liu and F. Sun, “Visual tracking using sparsity induced similarity,” in Proc. 20th ICPR, 2010, pp. 1702–1705. [23] D. Wang and H. Lu, “On-line learning parts-based representation via incremental orthogonal projective non-negative matrix factorization,” Signal Process., vol. 93, no. 6, pp. 1608–1623, 2013. [24] D. Wang, H. Lu, and M.-H. Yang, “Least soft-threshold squares tracking,” in Proc. CVPR, 2013, pp. 2371–2378. [25] S. Gao, I. W.-H. Tsang, L.-T. Chia, and P. Zhao, “Local features are not lonely–Laplacian sparse coding for image classification,” in Proc. CVPR, 2010, pp. 3555–3561. [26] M. Everingham, L. Van Gool, C. K. Williams, J. Winn, and A. Zisserman, “The pascal visual object classes (VOC) challenge,” Int. J. Comput. Vis., vol. 88, no. 2, pp. 303–338, 2010.

Bohan Zhuang is currently pursuing the B.E. degree with the School of Information and Communication Engineering, Dalian University of Technology, Dalian, China.

REFERENCES

[1] D. Wang, H. Lu, and M.-H. Yang, “Online object tracking with sparse prototypes,” IEEE Trans. Image Process., vol. 22, no. 1, pp. 314–325, Jan. 2013.
[2] C. Bao, Y. Wu, H. Ling, and H. Ji, “Real time robust L1 tracker using accelerated proximal gradient approach,” in Proc. CVPR, 2012, pp. 1830–1837.
[3] B. Liu, J. Huang, L. Yang, and C. Kulikowsk, “Robust tracking using local sparse appearance model and k-selection,” in Proc. CVPR, 2011, pp. 1313–1320.
[4] X. Jia, H. Lu, and M.-H. Yang, “Visual tracking via adaptive structural local sparse appearance model,” in Proc. CVPR, 2012, pp. 1822–1829.
[5] T. Zhang, B. Ghanem, S. Liu, and N. Ahuja, “Robust visual tracking via multi-task sparse learning,” in Proc. CVPR, 2012, pp. 2042–2049.
[6] W. Zhong, H. Lu, and M.-H. Yang, “Robust object tracking via sparsity-based collaborative model,” in Proc. CVPR, 2012, pp. 1838–1845.
[7] D. A. Ross, J. Lim, R.-S. Lin, and M.-H. Yang, “Incremental learning for robust visual tracking,” Int. J. Comput. Vis., vol. 77, nos. 1–3, pp. 125–141, 2008.
[8] Z. Kalal, J. Matas, and K. Mikolajczyk, “P-N learning: Bootstrapping binary classifiers by structural constraints,” in Proc. CVPR, 2010, pp. 49–56.
[9] J. Kwon and K. M. Lee, “Visual tracking decomposition,” in Proc. CVPR, 2010, pp. 1269–1276.
[10] B. Babenko, M.-H. Yang, and S. Belongie, “Visual tracking with online multiple instance learning,” in Proc. CVPR, 2009, pp. 983–990.
[11] A. Adam, E. Rivlin, and I. Shimshoni, “Robust fragments-based tracking using the integral histogram,” in Proc. CVPR, 2006, pp. 798–805.
[12] S. Hare, A. Saffari, and P. H. Torr, “Struck: Structured output tracking with kernels,” in Proc. ICCV, 2011, pp. 263–270.
[13] M. Godec, P. Roth, and H. Bischof, “Hough-based tracking of non-rigid objects,” in Proc. ICCV, 2011, pp. 81–88.
[14] E. G. Learned-Miller and L. S. Lara, “Distribution fields for tracking,” in Proc. CVPR, 2012, pp. 25–33.
[15] F. Yang, H. Lu, and M.-H. Yang, “Robust visual tracking via multiple kernel boosting with affinity constraints,” IEEE Trans. Circuits Syst. Video Technol., vol. 24, no. 2, pp. 242–254, Jul. 2013.
[16] F. Yang, H. Lu, and M.-H. Yang, “Learning structured visual dictionary for object tracking,” Image Vis. Comput., vol. 31, no. 12, pp. 992–999, 2013.
[17] P. Pérez, C. Hue, J. Vermaak, and M. Gangnet, “Color-based probabilistic tracking,” in Proc. ECCV, 2002, pp. 661–675.
[18] D. Comaniciu, V. Ramesh, and P. Meer, “Kernel-based object tracking,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 25, no. 5, pp. 564–577, May 2003.
[19] X. Mei and H. Ling, “Robust visual tracking using ℓ1 minimization,” in Proc. ICCV, 2009, pp. 1–10.
[20] X. Mei, H. Ling, Y. Wu, E. Blasch, and L. Bai, “Minimum error bounded efficient ℓ1 tracker with occlusion detection,” in Proc. CVPR, 2011, pp. 1257–1264.

Huchuan Lu (SM’12) received the Ph.D. degree in system engineering and the M.Sc. degree in signal and information processing from the Dalian University of Technology (DUT), Dalian, China, in 2008 and 1998, respectively, where he joined the faculty in 1998 and is currently a Full Professor with the School of Information and Communication Engineering. His current research interests include the areas of computer vision and pattern recognition with focus on visual tracking, saliency detection, and segmentation. He is a member of the ACM and an Associate Editor of the IEEE Transactions on Systems, Man, and Cybernetics—Part B.

Ziyang Xiao received the B.E. degree in electronic engineering from the Dalian University of Technology, Dalian, China, in 2011, where she is currently pursuing the master’s degree with the School of Information and Communication Engineering. Her research interest is in object tracking.

Dong Wang received the B.E. degree in electronic information engineering and the Ph.D. degree in signal and information processing from the Dalian University of Technology (DUT), Dalian, China, in 2008 and 2013, respectively, where he is currently a faculty member with the School of Information and Communication Engineering. His research interests include face recognition, interactive image segmentation, and object tracking.
