
IEEE TRANSACTIONS ON CYBERNETICS, VOL. 45, NO. 9, SEPTEMBER 2015

Visual Tracking via Weighted Local Cosine Similarity

Dong Wang, Huchuan Lu, Senior Member, IEEE, and Chunjuan Bo

Abstract—In this paper, we propose a novel weighted local cosine similarity (WLCS) and apply it to visual tracking. First, we present the local cosine similarity to measure the similarities between the target template and candidates, and provide some theoretical insights on it. Second, we develop an objective function to model the discriminative ability of local components, and use a quadratic programming method to solve the objective function and to obtain the discriminative weights. Finally, we design an effective and efficient tracker based on the WLCS method and a simple update manner within the particle filter framework. Experimental results on several challenging image sequences show that the proposed tracker achieves better performance than other competing methods.

Index Terms—Cosine similarity, discriminative weights, local similarity, object tracking.

Manuscript received January 9, 2014; revised July 25, 2014 and September 26, 2014; accepted September 26, 2014. Date of publication November 21, 2014; date of current version August 14, 2015. This work was supported in part by the China Post-Doctoral Science Foundation under Grant 2014M551085, in part by the Fundamental Research Funds for the Central Universities under Grant DUT14YQ101 and Grant DUT13RC(3)105, in part by the Natural Science Foundation of China under Grant 61472060, in part by the Joint Foundation of China Education Ministry, and in part by the China Mobile Communication Corporation under Grant MCM20122071. This paper was recommended by Associate Editor D. Goldgof. D. Wang and H. Lu are with the School of Information and Communication Engineering, Faculty of Electronic Information and Electrical Engineering, Dalian University of Technology, Dalian 116024, China (e-mail: [email protected]; [email protected]). C. Bo is with the School of Information and Communication Engineering, Faculty of Electronic Information and Electrical Engineering, Dalian University of Technology, Dalian 116024, China, and also with the College of Electromechanical and Information Engineering, Dalian Nationalities University, Dalian 116600, China (e-mail: [email protected]). Color versions of one or more of the figures in this paper are available online at http://ieeexplore.ieee.org. Digital Object Identifier 10.1109/TCYB.2014.2360924

I. INTRODUCTION

AS ONE of the most active research topics in computer vision, online visual tracking plays a key role in a wide range of applications, such as motion analysis, activity recognition, video surveillance, traffic control, driver assistance systems, human-computer interaction, and so on [1]. The main challenging factors for robust tracking include partial occlusion, illumination variation, pose change, motion blur, background clutter, and many more. Thus, it is a very tough task to design a robust tracking algorithm. Generally speaking, an online tracking algorithm is composed of three components.
1) A motion model (or dynamic model), which models the temporal consistency of the states of an object and supplies the tracker with a number of candidate states (e.g., the random walk model [2], [3]).
2) An appearance model (or visual model), which represents the tracked object and evaluates the likelihood of each candidate state in the current frame.
3) An optimization strategy that combines the visual model with the motion prior obtained in the previous frames (e.g., the Kalman filter [4] and the particle filter [2], [3]).

In this paper, we use the random walk model as the motion prior and adopt the particle filter framework to combine the motion and appearance models; we therefore focus on designing an effective and efficient appearance model, as it usually plays the most crucial role in any tracking algorithm. Existing appearance models can usually be categorized into generative (see [5], [6]), discriminative (see [7], [8]), or hybrid generative-discriminative (see [9], [10]) methods.

Generative tracking methods aim to learn a visual model that represents the appearance of the tracked object and search for the image regions that are most similar to it, including template-based (see [5], [11], [12]), subspace-based (see [6], [13]), and sparse-representation-based (see [14]-[17]) models. In template-based algorithms, the tracked object is described by a single template [4] or multiple templates [11]; tracking can then be considered as searching for the regions with the highest matching scores (smallest matching distances). The incremental visual tracking (IVT) method [6] is one of the most popular subspace-based trackers. It represents the tracked object by a low-dimensional principal component analysis (PCA) [18] subspace and updates the PCA subspace online to capture appearance changes of the target. Although the IVT method is robust to illumination and pose changes, it is very sensitive to partial occlusion and background clutter. Inspired by the idea of sparse representation [19], Mei and Ling [14] developed a novel ℓ1 tracker, which uses a series of object and trivial templates to represent the tracked object with sparsity constraints. Furthermore, several methods improve the original ℓ1 tracker in terms of both speed and accuracy by using accelerated proximal gradient algorithms [16], replacing raw pixel templates with orthogonal basis vectors [20], and modeling the similarities between different candidates [21], [22], to name a few. Although the sparse-representation-based trackers achieve very accurate results, their tracking speeds are usually not satisfying due to the complicated sparse optimization involved.

Discriminative tracking methods focus on distinguishing the tracked target from its surrounding background by considering both positive and negative samples (e.g., casting tracking as a binary classification problem with local search). Both classic and recent classification techniques have promoted the progress of tracking algorithms.


These techniques include boosting [23], support vector machines [9], naive Bayes [24], random forests [25], multiple instance learning (MIL) [7], metric learning [26], structured learning [27], and so on. It has been shown that generative models achieve higher generalization when limited data are available [28], while discriminative models perform better if the training set is large [29]. Thus, several hybrid generative-discriminative tracking algorithms [9], [10] have also been proposed to exploit the advantages of both generative and discriminative models.

In this paper, we present a novel tracking algorithm based on a weighted local cosine similarity (WLCS), which belongs to the hybrid generative-discriminative methods. For one thing, we propose the local cosine similarity to match the target template and candidates, and treat tracking as a template matching problem. For another, we present an effective method to improve the discriminative ability of the WLCS method by learning discriminative weights. Based on the WLCS method and a simple update scheme, we develop an effective and efficient tracking method within the particle filter framework. Numerous experiments on challenging image sequences are conducted to compare the proposed tracking method with other competing trackers and to discuss the effects of critical parameters.

The rest of this paper is structured as follows. In Section II, we briefly review some background information and related work. Section III presents the WLCS, including its motivation, derivation, solution, and discussion. Section IV introduces the tracking framework based on the proposed WLCS method. The experimental results are presented in Section V. Finally, Section VI concludes this paper.

II. BACKGROUND AND RELATED WORK

A. Particle Filter

The particle filter [30] is a Bayesian sequential importance sampling technique that aims to estimate the posterior distribution of the state variables of a given dynamic system. It uses a set of weighted particles to approximate the probability distribution of the state regardless of the underlying distribution, which makes it very effective for dealing with nonlinear and non-Gaussian systems. As a typical dynamic state inference problem, online visual tracking can be modeled by a particle filter [6]. There exist two fundamental steps in the particle filter method: 1) prediction and 2) update. Let x_t denote the state variable describing the affine motion parameters of an object and y_t denote its corresponding observation vector (the subscript t indicates the frame index). The two steps recursively estimate the posterior probability based on the following two rules:

p(x_t | y_{1:t-1}) = ∫ p(x_t | x_{t-1}) p(x_{t-1} | y_{1:t-1}) dx_{t-1}    (1)

p(x_t | y_{1:t}) = p(y_t | x_t) p(x_t | y_{1:t-1}) / p(y_t | y_{1:t-1})    (2)

where x_{1:t} = {x_1, x_2, ..., x_t} stands for all available state vectors up to time t and y_{1:t} = {y_1, y_2, ..., y_t} denotes their corresponding observations. p(x_t | x_{t-1}) is called the motion model, which describes the state transition between consecutive frames, and p(y_t | x_t) denotes the observation model, which evaluates the likelihood that an observed image patch belongs to the object class. In the particle filter framework, the posterior p(x_t | y_{1:t}) is approximated by N weighted particles {x_t^i, w_t^i}_{i=1,...,N}, which are drawn from an importance distribution q(x_t | x_{1:t-1}, y_{1:t}), and the weights of the particles are updated as

w_t^i = w_{t-1}^i · p(y_t | x_t^i) p(x_t^i | x_{t-1}^i) / q(x_t | x_{1:t-1}, y_{1:t}).    (3)

In this paper, we adopt q(x_t | x_{1:t-1}, y_{1:t}) = p(x_t | x_{t-1}), which is assumed to be a Gaussian distribution similar to [6]. In detail, six parameters of the affine transform are used (i.e., x_t = {x_t, y_t, θ_t, s_t, α_t, φ_t}, where x_t, y_t, θ_t, s_t, α_t, φ_t denote the x, y translations, rotation angle, scale, aspect ratio, and skew, respectively). The state transition is formulated as a random walk, i.e., p(x_t | x_{t-1}) = N(x_t; x_{t-1}, Ψ), where Ψ is a diagonal covariance matrix. Finally, the state x_t is estimated as x_t = Σ_{i=1}^N w_t^i x_t^i. We note that the key to designing a practical tracking algorithm is to develop an effective and efficient observation likelihood p(y_t | x_t).
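To make the recursion (1)-(3) concrete, the sketch below implements one prediction-update step of a bootstrap particle filter with the random-walk motion model and q(x_t | x_{1:t-1}, y_{1:t}) = p(x_t | x_{t-1}). It is a minimal Python/NumPy illustration, not the paper's MATLAB implementation; resampling is added as is standard in bootstrap filters, and `likelihood` stands for the observation model developed in Section IV.

```python
import numpy as np

def particle_filter_step(particles, weights, sigma, likelihood):
    """One prediction-update step of a bootstrap particle filter.

    particles:  (N, D) array of affine-state samples x_{t-1}^i
    weights:    (N,) normalized importance weights w_{t-1}^i
    sigma:      (D,) standard deviations of the diagonal covariance of
                the random-walk motion model p(x_t | x_{t-1})
    likelihood: callable mapping an (N, D) state array to the
                observation likelihoods p(y_t | x_t^i)
    """
    N = len(weights)
    # Resample proportionally to the previous weights.
    idx = np.random.choice(N, size=N, p=weights)
    particles = particles[idx]
    # Prediction: random walk, p(x_t | x_{t-1}) = N(x_t; x_{t-1}, Psi).
    particles = particles + np.random.randn(*particles.shape) * sigma
    # Update: with q = p(x_t | x_{t-1}), Eq. (3) reduces to reweighting
    # by the observation likelihood.
    weights = likelihood(particles)
    weights = weights / weights.sum()
    # Point estimate: x_t = sum_i w_t^i x_t^i.
    x_hat = weights @ particles
    return particles, weights, x_hat
```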

B. Object Tracking Based on Local Model

Classic tracking algorithms are usually developed based on a global model [4], [6]. In practice, these methods are able to handle many challenging factors (e.g., illumination variation, scale change, and motion blur) effectively and efficiently. However, they often cannot deal with partial occlusion and background clutter. To address these issues, several local models have been employed for designing robust tracking algorithms [5], [15], [31]-[33]. Adam et al. [5] propose a fragment-based tracking method using histograms, in which the tracked object is divided into a set of nonoverlapping blocks and the integral histogram technique is adopted to speed up the matching of different histograms. Recently, He et al. [33] extend the fragment idea and present the concept of the locality sensitive histogram for dense matching. On the other hand, there also exist many tracking algorithms based on overlapping blocks (see [15], [32]). The tracking methods presented in [15] and [32] represent the tracked target with a single histogram, which is generated from a series of sparse codes of local image patches via pooling techniques.

In this paper, we propose a WLCS to measure the similarities between the target and candidates, which likewise builds a tracking method on a local model. Compared with previous works, the proposed method provides some theoretical insights into the local cosine similarity and, furthermore, learns discriminative weights to improve the discriminative ability, which may facilitate the tracking problem in real applications. In addition, we note that the proposed WLCS method is different from the least soft-threshold squares (LSS) method [34]. First, the LSS method is a holistic model, which aims to define a robust distance between noisy observation samples and the template subspace, while this paper proposes a local cosine similarity to measure the similarities between the target template and candidate samples, which is designed based on a local model. Second, the LSS method is a generative model, which merely considers the information of positive samples and completely ignores the background information, while the proposed WLCS method enhances the discriminative ability of the local model by learning discriminative weights, thereby exploiting both foreground and background information.


III. WLCS

A. Local Cosine Similarity

Cosine similarity is a measure of similarity between two vectors of an inner product space that measures the cosine of the angle between them. The cosine similarity [35], [36] is very commonly used in many fields, such as information retrieval, pattern matching, and data mining. For the tracking problem, we denote t ∈ R^{d×1} as the template of the tracked object. Given an observed image vector of a candidate y ∈ R^{d×1}, the cosine similarity is defined as

s(y, t) = ⟨y, t⟩ / (‖y‖ ‖t‖)    (4)

where ‖·‖ denotes the ℓ2-norm. We note that the cosine similarity is a global matching method; thus, it is usually sensitive to impulse noise (e.g., partial occlusion). Motivated by the idea of local models (Section II-B), we present a local cosine similarity to achieve a good matching. First, we reorganize the candidate vector y as the concatenation of M local feature vectors, y = [y_1^T, y_2^T, ..., y_M^T]^T, where y_i ∈ R^{l×1} is a column vector denoting the ith local block of the image and M = d/l. Likewise, the template t can be represented the same way as a column vector of its blocks. Then the local cosine similarity can be defined as

s_L(y, t) = (1/M) Σ_{i=1}^M s(y_i, t_i) = (1/M) Σ_{i=1}^M ⟨y_i, t_i⟩ / (‖y_i‖ ‖t_i‖)    (5)

where s(y_i, t_i) measures the local similarity between t_i and y_i.

Fig. 1 demonstrates a toy example of good and bad candidates for template matching with partial occlusion. A single template is shown in Fig. 1(b). Fig. 1(c) and (d) shows good and bad candidates (shown in red and blue boxes, respectively). We denote the template as t, the good candidate as y_G, and the bad candidate as y_B. We can see that s(y_G, t) < s(y_B, t) in this example, which means the bad candidate is chosen if the cosine similarity is adopted for template matching. On the other hand, the good candidate is selected (s_L(y_G, t) > s_L(y_B, t)) when the local cosine similarity is used.¹ We note that the local cosine similarity is more effective than the cosine similarity in handling partial occlusion.

Fig. 1. Toy example of good and bad candidates for template matching. (a) Representative image. (b) Template. (c) Good candidate. (d) Bad candidate.

¹We use 4 × 4 (i.e., M = 4 × 4 = 16) blocks for calculating the local cosine similarity.

For general purposes, we introduce the WLCS, which is defined as

s_WL(y, t) = Σ_{i=1}^M w_i s(y_i, t_i) = Σ_{i=1}^M w_i ⟨y_i, t_i⟩ / (‖y_i‖ ‖t_i‖)    (6)

where w_i is a positive weight for the ith component that satisfies Σ_{i=1}^M w_i = 1. In the following remark, we show that both the cosine similarity and the local cosine similarity can be viewed as special cases of the WLCS obtained by choosing different weights.

Remark 1: Both the cosine similarity (4) and the local cosine similarity (5) can be viewed as special cases of the WLCS (6) by choosing different weights.

1) The cosine similarity can be viewed as a special case of the WLCS by adopting local brightness ratios as weights, as the following derivation shows:

s(y, t) = ⟨y, t⟩ / (‖y‖ ‖t‖)
        = Σ_{i=1}^M [(‖y_i‖ ‖t_i‖) / (‖y‖ ‖t‖)] · [⟨y_i, t_i⟩ / (‖y_i‖ ‖t_i‖)]
        = Σ_{i=1}^M w_i ⟨y_i, t_i⟩ / (‖y_i‖ ‖t_i‖)
        = Σ_{i=1}^M w_i s(y_i, t_i)    (7)

where w_i = (‖y_i‖ ‖t_i‖) / (‖y‖ ‖t‖) = r_i^y r_i^t, with r_i^y = ‖y_i‖/‖y‖ and r_i^t = ‖t_i‖/‖t‖. It can be seen that the cosine similarity intrinsically assigns larger weights to local components with higher brightness ratios, the imperfections of which are twofold. First, a brighter component is not necessarily more significant or discriminative for template matching. Second, if a candidate y suffers some impulse noise (e.g., partial occlusion or local illumination change), all local brightness ratios {r_i^y}_{i=1}^M may change significantly. As these changes are random and unpredictable, the cosine similarity is usually not effective in handling impulse noise.
2) By choosing uniform weights (i.e., w_i = 1/M, i = 1, 2, ..., M), the WLCS reduces to the local cosine similarity. When some impulse noise occurs, the changes of some local components do not affect the components that are free of impulse noise. Thus, the local cosine similarity restricts the negative influence to the erroneous local components rather than all components.

3) The WLCS can be further generalized by learning the weights when it is used for some special purpose. The next subsection introduces how to learn discriminative weights for the tracking problem.
4) We note that the proposed local cosine similarity is closely related to the local normalized squared ℓ2 distance, as the following derivation illustrates:

Σ_{i=1}^M ‖ y_i/‖y_i‖ − t_i/‖t_i‖ ‖²
  = Σ_{i=1}^M [ ‖y_i‖²/‖y_i‖² + ‖t_i‖²/‖t_i‖² − 2 ⟨y_i, t_i⟩/(‖y_i‖ ‖t_i‖) ]
  = Σ_{i=1}^M [ 2 − 2 ⟨y_i, t_i⟩/(‖y_i‖ ‖t_i‖) ]
  = 2M (1 − s_L(y, t)).

Fig. 2. Manner of sampling positive and negative samples.

Fig. 3. Flowchart of the proposed tracking method.

TABLE I. Evaluated image sequences.

Fig. 4. Representative results when the tracked objects experience severe occlusion. (a) Occlusion1. (b) Occlusion2. (c) Caviar1. (d) Caviar2.
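To make the preceding definitions concrete, here is a minimal NumPy sketch of the block partition and of the similarities (4)-(6). The 4 × 4 grid matches the footnote above; the function names are our own, and this is an illustration rather than the authors' code.

```python
import numpy as np

def blocks(img, grid=4):
    """Split an n x n image into grid x grid non-overlapping blocks,
    returning an (M, l) array of flattened local vectors (M = grid**2)."""
    n = img.shape[0]
    b = n // grid
    return np.array([img[r*b:(r+1)*b, c*b:(c+1)*b].ravel()
                     for r in range(grid) for c in range(grid)])

def local_cos(y_blocks, t_blocks, eps=1e-12):
    """Per-block cosine similarities s(y_i, t_i), an (M,) array; Eq. (5)
    is their mean."""
    num = (y_blocks * t_blocks).sum(axis=1)
    den = np.linalg.norm(y_blocks, axis=1) * np.linalg.norm(t_blocks, axis=1)
    return num / (den + eps)

def wlcs(y_blocks, t_blocks, w):
    """Weighted local cosine similarity, Eq. (6); w is nonnegative and
    sums to one."""
    return float(w @ local_cos(y_blocks, t_blocks))
```

With w = np.full(16, 1/16), the `wlcs` call reproduces the local cosine similarity (5); with w_i = ‖y_i‖‖t_i‖/(‖y‖‖t‖), it reproduces the global cosine similarity (4), as shown in (7).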


Fig. 5. Representative results when the tracked objects experience illumination variation and pose change. (a) Singer1. (b) Car4. (c) DavidIndoor. (d) Sylvester. (e) Dudek. (f) Girl.

B. Learning Discriminative Weights via Quadratic Programming

Based on the discussions above, it is very important to determine (or learn) effective weights for the WLCS method. Thus, we develop an optimization method to calculate discriminative weights for the tracking problem. Once we obtain the optimal state in the current frame, we can collect some positive and negative samples; the sampling manner is illustrated in Fig. 2. The positive samples are densely cropped from the set Ω⁺ = {y | ‖c(y) − c(y*)‖ ≤ α} within a search radius α centered at the location of the tracked object, where c(·) denotes the center location of a sample. The negative samples are then randomly sampled from the set Ω⁻ = {y | β ≤ ‖c(y) − c(y*)‖ ≤ γ}, where α < β < γ.

Now, we consider exploiting these positive and negative samples to learn discriminative weights for the WLCS method. For this purpose, we define the following objective function:

J(w) = (1/|Ω⁺|) Σ_{j∈Ω⁺} s_WL(t, y^j) − (1/|Ω⁻|) Σ_{j∈Ω⁻} s_WL(t, y^j) − (μ/2) Σ_{i=1}^M (w_i − w_i°)²    (8)

where the first term (1/|Ω⁺|) Σ_{j∈Ω⁺} s_WL(t, y^j) denotes the average WLCS of the positive samples and the second term (1/|Ω⁻|) Σ_{j∈Ω⁻} s_WL(t, y^j) stands for the average WLCS of the negative samples. The last term is a regularization term in which w° = [w_1°, w_2°, ..., w_M°]^T acts as a reference vector. The physical meaning of the first two terms in (8) is to make the WLCS values of positive samples large and, at the same time, the WLCS values of negative samples small. The regularization term aims to avoid model degradation and to introduce some prior information (such as temporal consistency). Thus, we can obtain the discriminative weights by solving the following optimization problem:

max_w J(w)
s.t. Σ_{i=1}^M w_i = 1,  w_i ≥ 0, i = 1, ..., M.    (9)

It is not difficult to prove that the maximization problem (9) is equivalent to the following minimization problem:

min_w G(w)
s.t. Σ_{i=1}^M w_i = 1,  w_i ≥ 0, i = 1, ..., M.    (10)


The objective function G(w) is defined as

G(w) = Σ_{i=1}^M w_i (S_i⁻ − S_i⁺ − μ w_i°) + (μ/2) Σ_{i=1}^M w_i²    (11)

where S_i⁺ = (1/|Ω⁺|) Σ_{j∈Ω⁺} s(t_i, y_i^j) and S_i⁻ = (1/|Ω⁻|) Σ_{j∈Ω⁻} s(t_i, y_i^j); the detailed derivation can be found in the Appendix. For the optimization problem (10), the objective function G(w) is quadratic and the constraint functions are linear. This optimization problem is therefore a quadratic program, which can be solved by many existing software packages, such as "quadprog.m" in the MATLAB optimization toolbox.

We note that the regularization parameter μ is very important for the proposed optimization problem and should be set to an appropriate value. If the value of μ is very large, the solution w will be very close to the reference weight vector w° (in particular, μ = +∞ leads to w = w°). On the other hand, if the value of μ is very small (i.e., very close to 0), the solution w will tend to select only a single local component, which may make the tracker less robust. When μ = 0, the objective function reduces to G(w) = Σ_{i=1}^M w_i (S_i⁻ − S_i⁺), and the solution of problem (10) is simply

w_i = 1, if (S_i⁺ − S_i⁻) > (S_j⁺ − S_j⁻) ∀ j ≠ i;  w_i = 0, otherwise.    (12)

IV. WLCS-BASED TRACKING FRAMEWORK

In this paper, visual tracking is treated as particle filtering with a hidden Markov model [6]. Here, we present the basic components of the proposed tracker, the flowchart of which is shown in Fig. 3.

A. Observation Model

By assuming that the variation of the weight vectors between two consecutive frames is very small, we build our likelihood function based on the WLCS method with the weight vector w_{t−1} = [w_{t−1}^1, w_{t−1}^2, ..., w_{t−1}^M]^T obtained in the last frame. For each observed image vector corresponding to a predicted state, the observation likelihood can be measured by

p(y^j | x^j) = Σ_{i=1}^M w_{t−1}^i s(t_i, y_i^j)    (13)

where j denotes the jth sample of the state x (the frame index t is dropped without loss of generality).

Fig. 6. Representative results when the tracked objects experience fast motion and background clutter. (a) Deer. (b) Leno. (c) Car11. (d) Stone.

B. Online Update

In this paper, the online update scheme includes two aspects.
1) Online Update of the Template: After obtaining the best candidate state x* of the tracked target in the current frame (we drop the frame index t for clarity), we extract its corresponding image blocks y* = [(y_1*)^T, (y_2*)^T, ..., (y_M*)^T]^T to update the target template t = [t_1^T, t_2^T, ..., t_M^T]^T as

t_i ← η t_i + (1 − η) y_i*, if s(t_i, y_i*) ≥ ε;  t_i ← t_i, otherwise    (14)

where s(·, ·) denotes the cosine similarity, i.e., s(a, b) = ⟨a, b⟩/(‖a‖ ‖b‖), ε = 0.85 is a predefined threshold, and η is an update rate that is set to 0.95 in this paper.
2) Online Update of the Discriminative Weights: After updating the object template t, we collect the positive and negative samples in the manner of Section III-B and then solve the optimization problem (10) to obtain the discriminative weight vector w_t in the current frame. It should be noted that we choose the discriminative weight vector from the last frame as the reference weight, i.e., w° = w_{t−1}. This manner takes into account the temporal consistency of the discriminative weights between consecutive frames.
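The two update steps can be sketched in code. The snippet below is a minimal Python/NumPy and SciPy illustration rather than the authors' MATLAB implementation (which uses "quadprog.m"); `local_cos` is the per-block cosine similarity from the earlier sketch, the other names are our own, and the objective follows the sign convention of (11) as reconstructed above.

```python
import numpy as np
from scipy.optimize import minimize

def update_template(t_blocks, y_blocks, eta=0.95, eps_thr=0.85):
    """Blockwise template update, Eq. (14): blend only well-matched blocks."""
    s = local_cos(y_blocks, t_blocks)   # per-block cosine similarities
    keep = s >= eps_thr
    t_blocks[keep] = eta * t_blocks[keep] + (1.0 - eta) * y_blocks[keep]
    return t_blocks

def learn_weights(S_plus, S_minus, w_ref, mu=0.1):
    """Solve the QP of (10)/(11): minimize
    G(w) = sum_i w_i (S_i^- - S_i^+ - mu w_ref_i) + (mu/2) ||w||^2
    subject to w_i >= 0 and sum_i w_i = 1."""
    M = len(S_plus)
    c = S_minus - S_plus - mu * w_ref
    res = minimize(lambda w: c @ w + 0.5 * mu * (w @ w),
                   x0=w_ref,
                   jac=lambda w: c + mu * w,
                   bounds=[(0.0, 1.0)] * M,
                   constraints=[{"type": "eq", "fun": lambda w: w.sum() - 1.0}],
                   method="SLSQP")
    return res.x
```

SciPy's general-purpose SLSQP solver is used here in place of a dedicated QP solver; since G(w) is convex (its Hessian is μI), any QP solver over the simplex yields the same optimum.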


Fig. 7. Center error plots of different tracking algorithms on 14 challenging image sequences, where the proposed algorithm is compared with six state-of-the-art tracking methods, including the FragT, IVT, MIL, VTD, OAB, and LSHT trackers. (a) Occlusion1. (b) Occlusion2. (c) Caviar1. (d) Caviar2. (e) Singer1. (f) Car4. (g) DavidIndoor. (h) Sylvester. (i) Dudek. (j) Girl. (k) Deer. (l) Leno. (m) Car11. (n) Stone.

We also note that the discriminative weights in the first frame are initialized to equal values (w_i = 1/M, i = 1, 2, ..., M).
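For orientation, one frame of the overall procedure in Fig. 3 can be composed from the sketches given earlier (`particle_filter_step`, `blocks`, `wlcs`, `update_template`, `learn_weights`). This is our pseudocode-level composition, not the released implementation; `crop(frame, state)` is an assumed helper that extracts a state's image region and resizes it to 32 × 32.

```python
import numpy as np

def track_frame(frame, particles, pf_weights, w, t_blocks, sigma, crop):
    """One frame of a WLCS-style tracker (composition of earlier sketches)."""
    def likelihood(states):
        # Observation model (13): p(y^j | x^j) = sum_i w_i s(t_i, y_i^j),
        # clipped away from zero so the particle weights stay valid.
        return np.array([max(wlcs(blocks(crop(frame, x)), t_blocks, w), 1e-6)
                         for x in states])

    particles, pf_weights, x_hat = particle_filter_step(
        particles, pf_weights, sigma, likelihood)

    # Update the template from the estimated state, Eq. (14).
    t_blocks = update_template(t_blocks, blocks(crop(frame, x_hat)))

    # Collect positive/negative samples around x_hat (omitted here), compute
    # the per-block averages S_plus, S_minus of Section III-B, and re-learn
    # the weights with the previous w as reference:
    #   w = learn_weights(S_plus, S_minus, w_ref=w, mu=0.1)
    return particles, pf_weights, x_hat, t_blocks, w
```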

V. EXPERIMENTS

Fig. 8. Center error plots of different tracking algorithms on 14 challenging image sequences, where the proposed algorithm is compared with seven state-of-the-art tracking methods, including the POLSL, TLD, APGL1, MTT, LSAT, ASLSA, and LSST trackers. (a) Occlusion1. (b) Occlusion2. (c) Caviar1. (d) Caviar2. (e) Singer1. (f) Car4. (g) DavidIndoor. (h) Sylvester. (i) Dudek. (j) Girl. (k) Deer. (l) Leno. (m) Car11. (n) Stone.

The proposed tracker is implemented in MATLAB 2009B on a PC with an Intel i7-3770 CPU (3.4 GHz) and 32 GB of memory, and it runs at 29 frames/s on this platform.

We resize each observation sample to 32 × 32 pixels and then divide it into 4 × 4 blocks to obtain the local representation. The number of particles is chosen to be 600 as a trade-off between effectiveness and speed. The numbers of positive and negative samples are set to 10 and 50, respectively, with the sampling radii set to α = 1, β = 5, and γ = 10.


TABLE II. AOR and ASR (%, in brackets). The best three results are shown in red, blue, and green fonts.

The regularization parameter μ is set to μ = 0.1 empirically. In this paper, we adopt 14 challenging image sequences from [6], [7], and [11] and the CAVIAR data set [37]. The challenging factors of these sequences include partial occlusion, illumination variation, pose change, background clutter, and motion blur (see Table I for more details). We evaluate the proposed tracker against thirteen state-of-the-art algorithms: the fragment-based tracking (FragT) [5], IVT [6], MIL [7], visual tracking decomposition (VTD) [11], online AdaBoost (OAB) [23], locality sensitive histograms tracking (LSHT) [33], part-based online latent structural learning (POLSL) [38], tracking-learning-detection (TLD) [8], accelerated proximal gradient L1 (APGL1) [16], multitask tracking (MTT) [21], local sparse appearance tracking (LSAT) [15], adaptive structural local sparse appearance (ASLSA) [32], and least soft-threshold squares tracking (LSST) [34] algorithms. For a fair evaluation, we use the source codes provided by the authors and run them with adjusted parameters. For the trackers that adopt random sampling to generate candidate samples (including the MIL, VTD, OAB, TLD, APGL1, and MTT methods), we run them five times and report the best results in this paper. Although the IVT, ASLSA, LSST, and proposed WLCS algorithms adopt the particle filter technique as the tracking framework, they do not produce random results due to their implementations [i.e., they use the "rand('state',0); randn('state',0);" MATLAB commands].
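For reference, the settings just listed can be gathered into a single configuration; this snippet merely restates the values above, and the key names are our own.

```python
# Default configuration of the proposed tracker (values from Section V).
CONFIG = {
    "patch_size": (32, 32),    # each candidate is resized to 32 x 32 pixels
    "grid": 4,                 # 4 x 4 = 16 local blocks
    "n_particles": 600,        # particle filter samples per frame
    "n_pos": 10, "n_neg": 50,  # training samples for weight learning
    "alpha": 1, "beta": 5, "gamma": 10,  # sampling radii (pixels)
    "mu": 0.1,                 # regularization weight, Eq. (8)
    "eta": 0.95,               # template update rate, Eq. (14)
    "eps": 0.85,               # template update threshold, Eq. (14)
}
```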

A. Qualitative Evaluation

1) Severe Occlusion: Fig. 4 demonstrates the tracking results on the Occlusion1, Occlusion2, Caviar1, and Caviar2 sequences, which feature heavy partial occlusion, scale change, and in-plane rotation. We can see from this figure that the proposed method achieves very good performance in terms of position, rotation, and scale when the tracked object undergoes severe occlusion. This can be attributed to three main reasons. 1) The local similarity measurement deals with impulse noise (e.g., partial occlusion) effectively. 2) The discriminative weights are able to stress discriminative local components, which facilitates distinguishing the tracked object from its local surroundings. 3) The update scheme avoids degrading the observation model by rejecting inappropriate updates. In addition, the ASLSA method also performs well in most cases as it includes part-based representations with overlapping patches. The IVT method is very sensitive to partial occlusion (Occlusion2 and Caviar1) as it assumes the observation noise is Gaussian distributed with small variance; this assumption does not hold when occlusion occurs. The IVT method also simply uses new observations for learning new basis vectors without detecting partial occlusion and processing such samples accordingly, which may gradually degrade the observation model. The FragT method performs poorly in some challenging occlusion cases (e.g., Occlusion2 and Caviar2) since it does not handle appearance change caused by pose, scale, and occlusion. Although the APGL1 method explicitly considers partial occlusion by using a set of trivial templates, its performance is limited by the low-resolution image patches it uses. The MIL and TLD methods do not perform well when the tracked object is occluded by a similar object (e.g., Caviar1 and Caviar2) as the rectangle features they use are less effective in differentiating similar objects.

2) Illumination Variation and Pose Change: Fig. 5 shows the tracking results on the Singer1, Car4, DavidIndoor, Sylvester, Dudek, and Girl sequences, which feature significant illumination variation, scale change, and pose change. We can see that the proposed method works well in these cases. The FragT and MIL methods are not effective in handling scale change (e.g., Singer1 and Car4) due to their implementations. The IVT method performs well in most cases (e.g., Singer1, Car4, and DavidIndoor), which can be attributed to its use of the incremental PCA algorithm. The TLD method is also able to achieve accurate results as it is equipped with a reinitialization mechanism.

3) Fast Motion and Background Clutter: Fig. 6 illustrates the tracking results on the Deer, Leno, Car11, and Stone sequences, highlighting fast motion and background clutter. It can be seen that the proposed tracker performs better than the other popular methods. In the Leno and Car11 sequences, the IVT method is also able to achieve accurate results in terms of both location and scale, but it cannot deal with sequences with motion blur (e.g., Deer and Stone). The MIL and TLD methods easily drift when the background is cluttered. This can be explained by the fact that the rectangle features they use are less effective when the appearances of the foreground and background are similar.

TABLE III. AOR and ASR (%, in brackets) of different similarity functions. The best two results are shown in red and blue fonts.

Fig. 9. Effects of different block numbers.

Fig. 10. Effects of the regularization parameter μ.

B. Quantitative Evaluation

In this paper, we evaluate the above-mentioned algorithms based on three criteria: 1) the center location error; 2) the overlap rate; and 3) the success rate. The center location error is usually defined as the Euclidean distance between the center locations of the tracked objects and their corresponding labeled ground truth. Figs. 7 and 8 demonstrate the center error plots, where a smaller center error means a more accurate result in each frame. Although the center location error is very intuitive, it cannot take scale and rotation variations of the tracked objects into consideration. Thus, the overlap rate and success rate criteria are also adopted to compare the state-of-the-art trackers. The overlap rate criterion was proposed in the PASCAL VOC challenge² to evaluate segmentation algorithms. Given the tracking result (bounding box) R_T of each frame and the corresponding ground-truth bounding box R_G, the overlap score is defined as score = area(R_T ∩ R_G)/area(R_T ∪ R_G). However, this criterion is oversubtle for the tracking problem, as the manually labeled ground truth may include noise to some extent. Thus, several recent works (see [24], [33]) adopt the success rate to measure the accuracy of a tracking algorithm. The success rate can be calculated by thresholding the overlap rate (i.e., if the score is larger than 0.5 in one frame, the tracking result is considered a success in that frame). The average overlap rates (AOR) and average success rates (ASR) are reported in Table II, where a larger value means a more accurate result. From these figures and tables, we can conclude that the proposed tracker achieves more favorable performance than the state-of-the-art methods.

²http://pascallin.ecs.soton.ac.uk/challenges/VOC/

C. Discussion and Analysis

In this subsection, we investigate the effects of two key parameters (i.e., the block number and the regularization parameter μ) by using all video sequences mentioned in Table III, and report the average scores [including the average center error (ACE), average overlap rate (AOR), and average success rate (ASR)] in Figs. 9 and 10.

1) Effects of Different Block Numbers: For the proposed WLCS method, the number of blocks is an important parameter. When the number of blocks is too small, the WLCS method is not robust to partial impulse noise (such as partial occlusion and local illumination variation). On the other hand, when the number of blocks is too large, the WLCS method cannot capture enough context information. Fig. 9 illustrates the tracking performance with different block numbers, including ACE, AOR, and ASR plots. We note that the center error values have been normalized to the range from 0 to 1 by ce_i ← ce_i / max_i(ce_i). Empirical results show that the proposed algorithm performs best with 4 × 4 blocks.

2) Regularization Parameter μ: As mentioned in Section III-B, the regularization parameter μ is very important for the WLCS method. If μ is too large, the weight vector w may not change after optimization and is not able to capture the discriminative information between foreground and background. On the other hand, if μ is too small, the solution w may degrade to selecting one single local component, which makes the tracker unstable. Fig. 10 illustrates the tracking performance with different μ values, including ACE, AOR, and ASR plots; the center error values are again normalized to the range from 0 to 1 by ce_i ← ce_i / max_i(ce_i). We can see from this figure that the proposed tracker achieves good performance when the value of μ is around 0.1. Thus, we choose μ = 0.1 as the default parameter of our tracker.

3) Tracking Results of "Particle Filter"-Based Trackers With Multiple Trials: In addition, we note that the particle filter technique is a probabilistic method and the tracking results differ across trials. Thus, it is better to evaluate the stability of the particle-filter-based trackers over multiple trials. In Table IV, we report the tracking results of the trackers based on the particle filter (including the MTT [21], APGL1 [16], IVT [6], LSST [34], ASLSA [32], and the proposed WLCS trackers), in which the average values and standard deviations obtained over five trials are reported. It can be seen from this table that the proposed WLCS method also achieves better performance than its competing algorithms in terms of stability.

TABLE IV. AOR and ASR (%, in brackets) of different "particle filter"-based tracking methods. The best two results are shown in red and blue fonts.
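The three criteria used here are easy to script for axis-aligned boxes given as (x, y, width, height); the following sketch (with helper names of our choosing) computes the overlap score, the success rate at the 0.5 threshold, and the mean center location error.

```python
import numpy as np

def overlap_score(rt, rg):
    """PASCAL VOC overlap: area(RT ∩ RG) / area(RT ∪ RG); boxes are (x, y, w, h)."""
    x1, y1 = max(rt[0], rg[0]), max(rt[1], rg[1])
    x2 = min(rt[0] + rt[2], rg[0] + rg[2])
    y2 = min(rt[1] + rt[3], rg[1] + rg[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    union = rt[2] * rt[3] + rg[2] * rg[3] - inter
    return inter / union if union > 0 else 0.0

def evaluate(tracked, truth):
    """Return (AOR, ASR at threshold 0.5, mean center location error)."""
    scores = np.array([overlap_score(rt, rg) for rt, rg in zip(tracked, truth)])
    centers = lambda boxes: np.array([[b[0] + b[2] / 2.0, b[1] + b[3] / 2.0]
                                      for b in boxes])
    ce = np.linalg.norm(centers(tracked) - centers(truth), axis=1)
    return scores.mean(), (scores > 0.5).mean(), ce.mean()
```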


D. Comparisons of Different Similarity Functions

First, we implement a weighted local normalized cross-correlation (WLNCC) method to achieve template matching and develop the corresponding tracker, which uses the same partition as the WLCS method and adopts the optimization model (8) to learn the discriminative weights. The WLNCC similarity function is defined as

wlncc(y, t) = Σ_{i=1}^M w_i · ncc(y_i, t_i)    (15)

where the basic definition of the normalized cross-correlation is

ncc(a, b) = ⟨a − ā, b − b̄⟩ / (‖a − ā‖ ‖b − b̄‖)    (16)

in which ā and b̄ denote the mean values of a and b, respectively. The local normalized cross-correlation (LNCC) method is the special version of the WLNCC method obtained by choosing the weights w_i = 1/M. In addition, we try the median local cosine similarity (MLCS)

mlcs(y, t) = median_{i=1,2,...,M} {s(y_i, t_i)}    (17)

as a likelihood function for visual tracking and report the results in Table III (the second column). In principle, the median of the local similarity scores is robust to fewer than 50% outliers; thus, it can handle occlusion effectively. It can be seen from Table III that the MLCS criterion achieves good performance in dealing with occlusion (e.g., Occlusion1, Occlusion2, Caviar1, and Caviar2). However, it does not work well in handling other challenging factors (e.g., DavidIndoor, Sylvester, and Deer) since it is difficult to introduce discriminative information into the MLCS criterion.
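In the same style as the earlier similarity sketch (and reusing its `local_cos` helper), the two alternatives compared in this subsection might be coded as follows; the function names are ours.

```python
import numpy as np

def ncc(a, b, eps=1e-12):
    """Normalized cross-correlation, Eq. (16): cosine of mean-centered vectors."""
    a0, b0 = a - a.mean(), b - b.mean()
    return float(a0 @ b0 / (np.linalg.norm(a0) * np.linalg.norm(b0) + eps))

def wlncc(y_blocks, t_blocks, w):
    """Weighted local NCC, Eq. (15)."""
    return float(sum(wi * ncc(yi, ti)
                     for wi, yi, ti in zip(w, y_blocks, t_blocks)))

def mlcs(y_blocks, t_blocks):
    """Median local cosine similarity, Eq. (17): robust to fewer than
    50% outlying blocks."""
    return float(np.median(local_cos(y_blocks, t_blocks)))
```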

VI. CONCLUSION

This paper proposes a novel tracking algorithm based on the proposed WLCS method. First, we present the local cosine similarity to measure the similarities between the target template and candidates, and demonstrate its effectiveness through some theoretical analysis. Second, we propose an objective function to improve the discriminative ability of the WLCS method and solve it by using quadratic programming. In addition, we design a tracking algorithm based on the WLCS method and a simple update scheme within the particle filter framework. Finally, we conduct numerous experiments on challenging image sequences to compare the proposed method with other state-of-the-art trackers and to discuss the effects of key parameters. The experimental results demonstrate the effectiveness of our tracker and its parameter setting.

APPENDIX

Here, we present a detailed derivation from (8) to (11):

J(w) = (1/|Ω⁺|) Σ_{j∈Ω⁺} s_WL(t, y^j) − (1/|Ω⁻|) Σ_{j∈Ω⁻} s_WL(t, y^j) − (μ/2) Σ_{i=1}^M (w_i − w_i°)²
     = (1/|Ω⁺|) Σ_{j∈Ω⁺} Σ_{i=1}^M w_i s(t_i, y_i^j) − (1/|Ω⁻|) Σ_{j∈Ω⁻} Σ_{i=1}^M w_i s(t_i, y_i^j) − (μ/2) Σ_{i=1}^M (w_i − w_i°)²
     = Σ_{i=1}^M w_i [(1/|Ω⁺|) Σ_{j∈Ω⁺} s(t_i, y_i^j)] − Σ_{i=1}^M w_i [(1/|Ω⁻|) Σ_{j∈Ω⁻} s(t_i, y_i^j)]
       − (μ/2) Σ_{i=1}^M w_i² + μ Σ_{i=1}^M w_i w_i° − (μ/2) Σ_{i=1}^M (w_i°)².

Let S_i⁺ = (1/|Ω⁺|) Σ_{j∈Ω⁺} s(t_i, y_i^j) and S_i⁻ = (1/|Ω⁻|) Σ_{j∈Ω⁻} s(t_i, y_i^j); the objective function J(w) can then be rewritten as

J(w) = Σ_{i=1}^M w_i S_i⁺ − Σ_{i=1}^M w_i S_i⁻ + μ Σ_{i=1}^M w_i w_i° − (μ/2) Σ_{i=1}^M w_i² − (μ/2) Σ_{i=1}^M (w_i°)²
     = −[ Σ_{i=1}^M w_i (S_i⁻ − S_i⁺ − μ w_i°) + (μ/2) Σ_{i=1}^M w_i² ] − (μ/2) Σ_{i=1}^M (w_i°)².

By introducing G(w) = Σ_{i=1}^M w_i (S_i⁻ − S_i⁺ − μ w_i°) + (μ/2) Σ_{i=1}^M w_i², we obtain max J(w) ⇔ min G(w), as the term (μ/2) Σ_{i=1}^M (w_i°)² is a constant independent of w.
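The equivalence can also be checked numerically; under the sign conventions reconstructed above, J(w) + G(w) should equal the constant −(μ/2) Σ_i (w_i°)² for every feasible w. A small self-contained check (our addition, not part of the original paper):

```python
import numpy as np

rng = np.random.default_rng(0)
M, mu = 16, 0.1
Sp, Sm = rng.random(M), rng.random(M)      # random S_i^+, S_i^-
w0 = rng.dirichlet(np.ones(M))             # random reference weights w°

J = lambda w: w @ Sp - w @ Sm - 0.5 * mu * ((w - w0) ** 2).sum()
G = lambda w: w @ (Sm - Sp - mu * w0) + 0.5 * mu * (w ** 2).sum()

for _ in range(3):
    w = rng.dirichlet(np.ones(M))          # random point on the simplex
    # J(w) + G(w) must equal the w-independent constant -(mu/2) * sum(w0^2).
    assert np.isclose(J(w) + G(w), -0.5 * mu * (w0 ** 2).sum())
```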

REFERENCES

[1] A. Yilmaz, O. Javed, and M. Shah, "Object tracking: A survey," ACM Comput. Surv., vol. 38, no. 4, pp. 1-45, 2006.
[2] Y. Li, H. Ai, T. Yamashita, S. Lao, and M. Kawade, "Tracking in low frame rate video: A cascade particle filter with discriminative observers of different life spans," IEEE Trans. Pattern Anal. Mach. Intell., vol. 30, no. 10, pp. 1728-1740, Oct. 2008.
[3] N. Widynski, S. Dubuisson, and I. Bloch, "Integration of fuzzy spatial information in tracking based on particle filtering," IEEE Trans. Syst., Man, Cybern. B, Cybern., vol. 41, no. 3, pp. 635-649, Jun. 2011.
[4] D. Comaniciu, V. Ramesh, and P. Meer, "Kernel-based object tracking," IEEE Trans. Pattern Anal. Mach. Intell., vol. 25, no. 5, pp. 564-575, May 2003.
[5] A. Adam, E. Rivlin, and I. Shimshoni, "Robust fragments-based tracking using the integral histogram," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., Washington, DC, USA, 2006, pp. 798-805.


[6] D. Ross, J. Lim, R.-S. Lin, and M.-H. Yang, "Incremental learning for robust visual tracking," Int. J. Comput. Vis., vol. 77, nos. 1-3, pp. 125-141, 2008.
[7] B. Babenko, M.-H. Yang, and S. Belongie, "Visual tracking with online multiple instance learning," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., Kyoto, Japan, 2009, pp. 983-990.
[8] Z. Kalal, K. Mikolajczyk, and J. Matas, "Tracking-learning-detection," IEEE Trans. Pattern Anal. Mach. Intell., vol. 34, no. 7, pp. 1409-1422, Jul. 2012.
[9] F. Tang, S. Brennan, Q. Zhao, and H. Tao, "Co-tracking using semi-supervised support vector machines," in Proc. IEEE Int. Conf. Comput. Vis., Rio de Janeiro, Brazil, 2007, pp. 1-8.
[10] W. Zhong, H. Lu, and M.-H. Yang, "Robust object tracking via sparsity-based collaborative model," IEEE Trans. Image Process., vol. 23, no. 5, pp. 2356-2368, May 2014.
[11] J. Kwon and K. M. Lee, "Visual tracking decomposition," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., San Francisco, CA, USA, 2010, pp. 1269-1276.
[12] Q. Wang, F. Chen, and W. Xu, "Tracking by third-order tensor representation," IEEE Trans. Syst., Man, Cybern. B, Cybern., vol. 41, no. 2, pp. 385-396, Apr. 2011.
[13] X. Li, W. Hu, Z. Zhang, X. Zhang, and G. Luo, "Robust visual tracking based on incremental tensor subspace learning," in Proc. IEEE Int. Conf. Comput. Vis., Rio de Janeiro, Brazil, 2007, pp. 1-8.
[14] X. Mei and H. Ling, "Robust visual tracking using ℓ1 minimization," in Proc. IEEE Int. Conf. Comput. Vis., Kyoto, Japan, 2009, pp. 1436-1443.
[15] B. Liu, J. Huang, L. Yang, and C. A. Kulikowski, "Robust tracking using local sparse appearance model and K-selection," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., Providence, RI, USA, 2011, pp. 1313-1320.
[16] C. Bao, Y. Wu, H. Ling, and H. Ji, "Real time robust ℓ1 tracker using accelerated proximal gradient approach," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., Providence, RI, USA, 2012, pp. 1830-1837.
[17] D. Wang and H. Lu, "On-line learning parts-based representation via incremental orthogonal projective non-negative matrix factorization," Signal Process., vol. 93, no. 6, pp. 1608-1623, 2013.
[18] M. A. Turk and A. P. Pentland, "Eigenfaces for recognition," J. Cogn. Neurosci., vol. 3, no. 1, pp. 71-86, 1991.
[19] J. Wright, A. Y. Yang, A. Ganesh, S. S. Sastry, and Y. Ma, "Robust face recognition via sparse representation," IEEE Trans. Pattern Anal. Mach. Intell., vol. 31, no. 2, pp. 210-227, Feb. 2009.
[20] D. Wang, H. Lu, and M.-H. Yang, "Online object tracking with sparse prototypes," IEEE Trans. Image Process., vol. 22, no. 1, pp. 314-325, Jan. 2013.
[21] T. Zhang, B. Ghanem, S. Liu, and N. Ahuja, "Robust visual tracking via multi-task sparse learning," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., Providence, RI, USA, 2012, pp. 2042-2049.
[22] B. Zhuang, H. Lu, Z. Xiao, and D. Wang, "Visual tracking via discriminative sparse similarity map," IEEE Trans. Image Process., vol. 23, no. 4, pp. 1872-1881, Apr. 2014.
[23] H. Grabner and H. Bischof, "On-line boosting and vision," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., Washington, DC, USA, 2006, pp. 260-267.
[24] K. Zhang, L. Zhang, and M.-H. Yang, "Real-time compressive tracking," in Proc. Eur. Conf. Comput. Vis., 2012, pp. 864-877.
[25] A. Saffari, C. Leistner, J. Santner, M. Godec, and H. Bischof, "On-line random forests," in Proc. IEEE Int. Conf. Comput. Vis. Workshops, Kyoto, Japan, 2009, pp. 1393-1400.
[26] N. Jiang, W. Liu, and Y. Wu, "Learning adaptive metric for robust visual tracking," IEEE Trans. Image Process., vol. 20, no. 8, pp. 2288-2300, Aug. 2011.
[27] S. Hare, A. Saffari, and P. H. S. Torr, "Struck: Structured output tracking with kernels," in Proc. IEEE Int. Conf. Comput. Vis., Barcelona, Spain, 2011, pp. 263-270.
[28] A. Y. Ng and M. I. Jordan, "On discriminative vs. generative classifiers: A comparison of logistic regression and naive Bayes," in Proc. Adv. Neural Inf. Process. Syst., 2001, pp. 438-451.
[29] J. A. Lasserre, C. M. Bishop, and T. P. Minka, "Principled hybrids of generative and discriminative models," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., Washington, DC, USA, 2006, pp. 87-94.
[30] M. Isard and A. Blake, "Condensation—Conditional density propagation for visual tracking," Int. J. Comput. Vis., vol. 29, no. 1, pp. 5-28, 1998.
[31] D. Wang, H. Lu, and Y.-W. Chen, "Object tracking by multi-cues spatial pyramid matching," in Proc. IEEE Int. Conf. Image Process., Hong Kong, 2010, pp. 3957-3960.

[32] X. Jia, H. Lu, and M.-H. Yang, "Visual tracking via adaptive structural local sparse appearance model," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., Providence, RI, USA, 2012, pp. 1822-1829.
[33] S. He, Q. Yang, R. W. Lau, J. Wang, and M.-H. Yang, "Visual tracking via locality sensitive histograms," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., Portland, OR, USA, 2013, pp. 2427-2434.
[34] D. Wang, H. Lu, and M.-H. Yang, "Least soft-threshold squares tracking," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., Portland, OR, USA, 2013, pp. 2371-2378.
[35] A. Singhal, "Modern information retrieval: A brief overview," Bull. IEEE Comput. Soc. Tech. Committee Data Eng., vol. 24, no. 4, pp. 35-43, Dec. 2001.
[36] P.-N. Tan, M. Steinbach, and V. Kumar, Introduction to Data Mining. Boston, MA, USA: Pearson Addison-Wesley, 2005.
[37] CAVIAR. (2005, Mar.). [Online]. Available: http://groups.inf.ed.ac.uk/vision/CAVIAR/CAVIARDATA1/
[38] R. Yao, Q. Shi, C. Shen, Y. Zhang, and A. van den Hengel, "Part-based visual tracking with online latent structural learning," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., Portland, OR, USA, 2013, pp. 2363-2370.

Dong Wang received the B.E. degree in electronic information engineering and the Ph.D. degree in signal and information processing, both from the Dalian University of Technology (DUT), Dalian, China, in 2008 and 2013, respectively. He is currently a Faculty Member with the School of Information and Communication Engineering, DUT. His current research interests include face recognition, interactive image segmentation, and object tracking.

Huchuan Lu (SM'12) received the M.Sc. degree in signal and information processing and the Ph.D. degree in system engineering, both from the Dalian University of Technology (DUT), Dalian, China, in 1998 and 2008, respectively. He has been a Faculty Member since 1998, and a Professor since 2012, with the School of Information and Communication Engineering, DUT. His current research interests include computer vision, pattern recognition, visual tracking, and segmentation. Prof. Lu currently serves as an Associate Editor of the IEEE Transactions on Systems, Man, and Cybernetics, Part B (Cybernetics).

Chunjuan Bo received the B.E. degree in electronic information engineering from Dalian Nationalities University (DLNU), Dalian, China, in 2008 and the M.S. degree in communication and information system from the Dalian University of Technology, Dalian, in 2013, where she is currently pursuing the Ph.D. degree from the School of Information and Communication Engineering. She is currently a Faculty Member with the College of Electromechanical and Information Engineering, DLNU. Her current research interests include image classification and object tracking.
