
Learning to Track Multiple Targets

Xiao Liu, Dacheng Tao, Senior Member, IEEE, Mingli Song, Senior Member, IEEE, Luming Zhang, Jiajun Bu, and Chun Chen

Abstract— Monocular multiple-object tracking is a fundamental yet under-addressed computer vision problem. In this paper, we propose a novel learning framework for tracking multiple objects by detection. First, instead of heuristically defining a tracking algorithm, we learn from labeled video data a discriminative structure prediction model that captures the interdependence of multiple influence factors. Given the joint target state from the last time step and the observation at the current frame, the joint target state at the current time step can then be inferred by maximizing the joint probability score. Second, our detection results benefit from tracking cues. Traditional detection algorithms need a nonmaximal suppression postprocessing step to select a subset of the total detection responses as the final output, and this step induces a large number of selection mistakes, especially in congested scenes. Our method integrates both detection and tracking cues. This integration decreases the risk of postprocessing mistakes and improves tracking performance. Finally, we formulate the entire model training as a convex optimization problem and estimate its parameters using cutting plane optimization. Experiments show that our method performs effectively in a large variety of scenarios, including pedestrian tracking in crowded scenes and vehicle tracking in congested traffic.

Index Terms— Cutting plane, discriminative model, interdependence, learning to track, multiple-object tracking, structure prediction, tracking-by-detection.

I. INTRODUCTION

VIDEO analysis plays a critical role in computer vision research. With the advances in sensor manufacturing and the desire to build smart cities, a fully automatic and real-time tracking system is of particular importance. In this paper, we address the problem of automatically tracking a variable number of targets in congested scenes from a single, potentially moving, uncalibrated camera. This capability is crucial in many applications; for example, traffic surveillance, occupancy sensing, and security will benefit from it. However, complex real-world conditions, i.e., clutter and moving

Manuscript received April 11, 2013; revised January 15, 2014 and June 7, 2014; accepted June 24, 2014. Date of publication July 14, 2014; date of current version April 15, 2015. This work was supported in part by the National Natural Science Foundation of China under Grant 61170142, in part by the Program of International Science and Technology Cooperation under Grant 2013DFG12840, and in part by the Australian Research Council under Project FT-130101457 and Project DP-120103730. (Corresponding author: Mingli Song.) X. Liu, M. Song, L. Zhang, J. Bu, and C. Chen are with the College of Computer Science, Zhejiang University, Hangzhou 310027, China (e-mail: [email protected]). D. Tao is with the Centre for Quantum Computation and Intelligent Systems, Faculty of Engineering and Information Technology, University of Technology, Sydney, Ultimo, NSW 2007, Australia (e-mail: [email protected]). Color versions of one or more of the figures in this paper are available online at http://ieeexplore.ieee.org. Digital Object Identifier 10.1109/TNNLS.2014.2333751

background, inter- and intra-occlusions, and appearance variance continue to make it challenging. Tracking by detection [1]–[7] is an attempt to overcome the aforementioned difficulties. Tracking by detection techniques carry out object detection and then associate the detection results into integral trajectories. Compared with motion segmentation-based tracking [8]–[17], tracking by detection uses more discriminative and flexible appearance information, and is more convenient for handling the initialization and termination of a trajectory. Because of these advantages, tracking by detection achieves state-of-the-art performance, and has become one of the most popular tracking methods. An important problem with tracking by detection, however, is that its assumption of reliable detection does not always hold. On one hand, the detector cannot avoid misses (measured by the missing rate (MR)) and false positives (FPs), which usually result in tracking failure. Although some previous methods [18]–[23] deal with this problem by learning and refining the detectors during tracking, most of them are designed for single- rather than multiple-object tracking because of the difficulty of initialization and their time complexity. On the other hand, most previous tracking by detection methods take the final detection result after rough postprocessing, i.e., nonmax suppression (NMS). NMS is a heuristically defined algorithm that independently selects the higher scoring detection candidates as the final result [24]. NMS is successful in a static image; when used for tracking, however, this simple strategy totally neglects the interdependence of spatial interaction, temporal continuity, appearance similarity, and scene structure cues in video. This induces a large number of selection mistakes, especially when newcomers appear and during occlusions [4], [7]. To address this instability, global association of detection responses is proposed in [5] to reconstruct trajectories. Rough detection cues are captured from both future and past frames, and long trajectories are then built through global estimation. The detection responses from future frames help adjust local inaccuracies, but this kind of approach cannot be applied to time-critical scenarios because it relies on predicting the future, which is impossible in an online system. In contrast, the particle filter [6], [25] is widely used to represent the tracking uncertainty in a causal manner. A sequential Monte Carlo (SMC) process is used to keep tracking objects: tracking states from past frames are considered, and together with the current observation, the current tracking state can be inferred by importance sampling. Compared with the global association-based methods, such approaches are more suitable for real-time scenarios. The main challenge for the particle



filter-based algorithms is how to predict the correct data association between trajectories and unreliable detection outputs. Several recent algorithms combine multiple matching cues for robust tracking performance [1], [7]. They utilize influence factors such as positions, sizes, similarities, and scene structures as matching evidence, and set the weight coefficients of these affecting factors through a heuristic approach, i.e., trial and error. The resulting algorithms are then sensitive to variations in detection accuracy, crowd density, and scene structure across different tracking scenes.

In this paper, instead of pursuing a perfect detector, we explicitly consider the detection unreliability in a tracking by detection framework by formulating tracking by detection as a structure prediction problem. Combining different influence factors, we train our tracking by detection model with labeled video data, so that the approach learns the most appropriate parameters for specific scenes. We utilize the original detection collection as the detection observation. Compared with the final detection output after NMS, it contains richer information and more candidates, so that the possibility of finding the best matching is preserved. We then integrate spatial, temporal, and scene structure cues from the video to decrease the risk of labeling errors. In other words, given the current detection observations and the previous tracking state, our model infers the current tracking state by maximizing the joint probabilistic score.

II. RELATED WORK

A. Online Detector Learning

The imperfection of detectors, i.e., unreliable detections, false detections, and misses, is an issue one has to deal with in realistic surveillance situations, and many researchers have worked on it over the last decades. A straightforward countermeasure is to learn or refine classifiers during tracking. For example, Grabner and Bischof [18] proposed an online boosting-based feature selection method for tracking, which was picked up and extended in several ways [19], [20]. Babenko et al. [21] proposed an online multiple instance learning algorithm for object tracking that achieves superior results with real-time performance. In the same spirit, Kalal et al. [22] proposed the tracking-learning-detection framework that explicitly decomposes the long-term tracking task into tracking, learning, and detection. An output-feedback adaptive control method for tracking in linear time-invariant plants is proposed in [26] and extended in [27]. By using the dynamic surface control technique, it is shown that the explosion-of-complexity problem in multivariable backstepping design can be eliminated. However, although the above methods work well for tracking a single target, they are not suitable for multiple-object tracking because of the difficulty of initialization and their time complexity. On one hand, the online classifiers are very sensitive to the accuracy of initialization, so some methods even initialize the location of the target by hand [22]. However, when the number of targets increases, false detections and misses are more likely to happen, so obtaining an accurate initialization becomes a difficult job. On the other hand, training online classifiers for each target during tracking raises heavy computational burdens, and it may be hard for these algorithms to achieve real-time performance.

B. Tracking by Detection

A considerable amount of previous work has addressed the problem of multiple-object tracking. Traditional work relies on feature-based motion segmentation [9]–[13], [15], [17], [28]–[30]. However, motion segmentation is not reliable when the camera is movable, and these methods cannot distinguish the object category of particular interest from other movers. Motivated by the impressive advances in object detection, tracking by detection has recently become popular in this area. It uses a discriminatively trained object detector to overcome the aforementioned limitation. Tracking by detection also makes it easy to handle the initialization and termination of trajectories. There are two main streams of tracking by detection: 1) global data association and 2) SMC. The global data association algorithms collect visual evidence over a wide time window, from both future and past frames. The detection responses are then associated into tracklets through global estimation. For example, Wu and Nevatia [7] combined an edgelet-based human detector with greedy data matching in a Bayesian framework. Huang et al. [5] extended this approach into a hierarchical version using a three-level model. At the low level, consecutive detection responses are linked. At the middle level, the Hungarian algorithm is used to obtain the optimal association. At the high level, a scene structure model is estimated based on the tracklets, which helps to construct long-range trajectories. Li et al. [31] then extended Huang's method. Instead of heuristically selecting parametric models, they selected features and corresponding nonparametric models by maximizing the discriminative power on training data. Leibe et al. [2] formulated data association in a minimum description length hypothesis selection framework. Their method relies on 3-D scene geometric estimation. Zhang et al. [32] mapped the data association problem into a cost-flow network with a nonoverlap constraint on trajectories. The optimal data association is found by a min-cost flow algorithm in the network. These global association methods improve their tracking results by using global analysis, but at the price of losing the ability to do online processing. Compared with global association-based tracking, SMC only uses states from previous frames and is thus more suitable for real-time applications. To obtain correct associations robustly with noisy detection responses, SMC-based methods keep generating multiple hypotheses from previous states. As more evidence is collected, the hypotheses are supposed to converge. Okuma et al. [6] used a boosted [33] particle filter to track multiple targets. Zhao and Nevatia [11] combined an MRF motion prior with a particle filter to deal with human interaction. To handle the unreliability of detection, attempts were made to combine multiple calibrated cameras [34], 3-D depth estimation [35], and motion behavior models [36] with the particle filter. Andriluka et al. [4] designed very complex human detection models to track pedestrians. Kuo et al. [3]


learned discriminative appearance models for each target to distinguish between interacting targets. To achieve robust tracking with an unreliable detection source, a number of particle filter-based methods avoid directly using the final sparse detection output for computing the weights of the particles. Breitenstein et al. [1] explored using the continuous detection confidence as a graded observation model to track multiple persons. Their approach is based on a combination of a class-specific pedestrian detector to localize people and a particle filter to predict the target locations, incorporating a motion model.

C. Detection Postprocessing

Existing detectors give repeated responses for a single object, so postprocessing is needed to reconcile the multiple responses, which is usually implemented by NMS. Desai et al. [37] proposed a discriminative model to learn the spatial interaction between different classes of objects. The model learns statistics that capture the spatial arrangements of various object classes. Alternatively, Sadeghi and Farhadi [38] decoded detector outputs to produce final results by designing a context-aware feature. They argued that their feature representation resulted in a fast and exact inference method. These methods used only static information to construct their models. To the best of our knowledge, there is no existing method that systematically and schematically integrates temporal continuity and scene structure in videos to conduct the reconciliation postprocessing.

III. TRACKING FRAMEWORK

Our algorithm consists of two phases at each time step: 1) prediction and 2) update. At the prediction phase, the algorithm predicts the labels of the observation based on the previous tracking state. We use the original detection results as the observation. Some of these results correspond to existing and newly tracked objects, but there are also repeated detections and FPs. The prediction operation is therefore required to distinguish the observation data. Since the output labels are interdependent, this is a structured prediction problem. To achieve the necessary robustness, we learn a discriminative max-margin model from training data instead of heuristically defining a labeling strategy. Given the observation and the previous state as input, the model predicts the observation labels by maximizing a potential function. At the update phase, the algorithm uses the labeled observation to update the tracking state. We model the update phase as a first-order Markov chain, i.e., the current tracking state only relies on the state of the last frame and the current observation. Fig. 1 shows the pipeline of our tracking framework.

A. Problem Setting and Notation

We use the original detection results as the observation. Instead of proposing a new domain-specific detection model, we simply utilize the code of the commonly used part-based model [24].

Fig. 1. Our tracking pipeline. It contains a prediction phase and an update phase at each time step. At the prediction phase, the algorithm predicts the labels of the observation. At the update phase, the algorithm updates the tracking states.

Fig. 2. (a) Threshold 0 is used and only 4/10 objects are successfully detected. (b) Threshold −0.5 is used and 9/10 objects are successfully detected.

It should be mentioned that we only need the root bounding boxes as observation for the following steps. Any other detector that can return bounding boxes as detection results is suitable to take its place. This is a significant advantage over other methods [4], which need a limb detector or body part detector. To detect objects in an image, the local detector evaluates the grid at each position and each scale with a filter-like classifier and returns a detection score. A grid cell with a score higher than a threshold is regarded as a positive response. Fig. 2(a) and (b) shows the detection responses for different thresholds. We collect the detection results above a relatively low threshold so that all the targets that appear can be observed most of the time. Each detection response at time t is then represented as a 5-tuple vector

X_{i,t} = (x_{i,t}, y_{i,t}, s_{i,t}^{(X)}, F_{i,t}, r_{i,t})    (1)

where (x_{i,t}, y_{i,t}) indicates the center position of the detected bounding box, s_{i,t}^{(X)} indicates the area of the bounding box, F_{i,t} is the appearance descriptor, and r_{i,t} is the detection score. Suppose there are M detection results in total at time t. The observation is represented by

X_t = {X_{i,t} : i = 1 ... M}.    (2)

Each tracked object is independently represented as a 6-tuple vector

T_{i,t} = (u_{i,t}, v_{i,t}, \dot{u}_{i,t}, \dot{v}_{i,t}, s_{i,t}^{(T)}, G_{i,t})    (3)

where (u_{i,t}, v_{i,t}) indicates its current center position, (\dot{u}_{i,t}, \dot{v}_{i,t}) indicates its velocity, s_{i,t}^{(T)} is the area of the bounding box, and G_{i,t} is its appearance descriptor. Suppose there are K active trajectories at time t. The joint state is represented as

T_t = {T_{i,t} : i = 1 ... K}.    (4)

T_0 is initialized to be an empty set.
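For concreteness, the observation and state containers of (1)–(4) can be sketched as plain data structures. This Python sketch is illustrative only; the class and field names (Detection, Track, and so on) are our own and are not part of the paper.

```python
from dataclasses import dataclass
from typing import List

import numpy as np

@dataclass
class Detection:
    """One detection response X_{i,t} = (x, y, s, F, r) of Eq. (1)."""
    x: float                # center x of the bounding box
    y: float                # center y of the bounding box
    area: float             # s^(X): area of the bounding box
    appearance: np.ndarray  # F: appearance descriptor (e.g., a color histogram)
    score: float            # r: raw detector confidence

@dataclass
class Track:
    """One tracked object T_{i,t} = (u, v, u_dot, v_dot, s, G) of Eq. (3)."""
    u: float                # current center x
    v: float                # current center y
    du: float               # velocity in x
    dv: float               # velocity in y
    area: float             # s^(T): area of the bounding box
    appearance: np.ndarray  # G: appearance descriptor

# The observation at time t is X_t = {X_{i,t}} (Eq. (2)); the joint state is
# T_t = {T_{i,t}} (Eq. (4)), with T_0 initialized empty.
observations: List[Detection] = []
tracks: List[Track] = []
```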


TABLE I
TABLE OF NOTATION

Given the previous tracking state T_{t-1} and the current observation X_t, we want to assign a label to each detection response X_{i,t}. Assume we have K existing trajectories in the (t - 1)th frame; we write the label z_{i,t} ∈ {0 ... K + 1}. For 1 ≤ z_{i,t} ≤ K, the ith detection response corresponds to an existing trajectory T_{z_i,t-1}. For z_{i,t} = K + 1, the ith detection response initializes a new trajectory. For z_{i,t} = 0, the ith detection response is not related to any trajectory. If z_{i,t} ≠ j for all i, the jth trajectory is unseen at the tth frame. Let Z_t = {z_{i,t} : i = 1 ... M} represent the set of labels assigned at time t. For ease of notation, in the rest of this section, X, Z, T are used to denote X_t, Z_t, and T_{t-1}, respectively. We use bold uppercase letters to indicate matrices and sets of vectors, normal uppercase letters to indicate vectors and counting constants, and lowercase letters for other scalars. The notation is listed in Table I.

B. Prediction Phase

At the prediction phase of the tth frame, considering the spatial interaction between detection candidates and the temporal continuity between existing trajectories and observations, we want to obtain the optimal label assignment. A score function is defined to evaluate the assignment quality. The assignment score is defined as the sum of the spatial and temporal components

S(X, Z, T) = S_s(X, Z) + S_t(X, Z, T)    (5)
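The label semantics can be made concrete with a small helper. This sketch is hypothetical; the paper defines only the label ranges, not any code.

```python
def interpret_label(z_i: int, K: int) -> str:
    """Decode one label z_i in {0, ..., K+1}, given K live trajectories."""
    if z_i == 0:
        return "discarded (repeated response or false positive)"
    if 1 <= z_i <= K:
        return f"associated with existing trajectory {z_i}"
    if z_i == K + 1:
        return "initializes a new trajectory"
    raise ValueError("label out of range")
```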

where S_s(X, Z) captures the spatial interaction between observations and S_t(X, Z, T) captures the temporal continuity between consecutive frames. A high score means a good assignment quality. Since there are repeated detection responses for the same object, the spatial interaction component evaluates the assignment quality based on the interactional cost between non-eliminated detection responses, so as to suppress these repeated candidates. For a specific scene, a newly tracked object tends to appear in some fixed areas; existing tracked objects tend to reappear along their trajectories and to disappear in some fixed areas. The temporal continuity component evaluates the assignment quality based on these factors.

1) Spatial Interaction: Because of the risk of repeated detection, non-eliminated detections have spatial interactional suppression on each other. We define a pairwise cost to describe the interaction between two detection responses, and assume that the cost is an arbitrary separable quadratic function of the overlapping area ratios

C(X_i, X_j) = W_s \cdot \phi_s(X_i, X_j)    (6)

where

\phi_s(X_i, X_j) = [d_{i,j}^2, d_{i,j}]    (7)

is the pairwise interactional relationship between X_i and X_j, and d_{i,j} is the overlapping area ratio of X_i and X_j. The entire interactional score is the overall cost of all the non-eliminated detection pairs

S_s(X, Z) = \sum_{z_i > 0, z_j > 0, z_i \neq z_j} W_s \cdot \phi_s(X_i, X_j).    (8)

Since W_s is independent of the labeling choice, we have

S_s(X, Z) = W_s \cdot \varphi_s(X, Z)    (9)

where

\varphi_s(X, Z) = \sum_{z_i > 0, z_j > 0, z_i \neq z_j} \phi_s(X_i, X_j).    (10)
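A direct reading of (6)–(10) can be sketched as follows. The helper names and the corner-format boxes are our assumptions; the paper leaves the overlap-ratio computation unspecified, so intersection-over-union is used here as one plausible choice.

```python
import numpy as np

def overlap_ratio(box_a, box_b) -> float:
    """Intersection-over-union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter)

def spatial_feature(boxes, labels) -> np.ndarray:
    """phi_s(X, Z) of Eq. (10): sum of [d_ij^2, d_ij] over pairs of
    non-eliminated detections with z_i > 0, z_j > 0, and z_i != z_j."""
    phi = np.zeros(2)
    n = len(boxes)
    for i in range(n):
        for j in range(n):
            if labels[i] > 0 and labels[j] > 0 and labels[i] != labels[j]:
                d = overlap_ratio(boxes[i], boxes[j])
                phi += np.array([d * d, d])
    return phi

def spatial_score(W_s: np.ndarray, boxes, labels) -> float:
    """S_s(X, Z) = W_s . phi_s(X, Z), as in Eq. (9)."""
    return float(W_s @ spatial_feature(boxes, labels))
```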


2) Temporal Continuity: Existing tracking methods benefit from using the detection technique. We argue that detection, especially the postprocessing step, can also benefit from tracking. This is because videos contain much more information than static images. By learning the tracking state of previous frames, it is possible to predict the content in the current frame. Since the labeling assignment of the observation is closely related to the initialization, growth, and termination of trajectories, we discuss temporal continuity in each of these three cases.

a) Initialization and termination: Most previous methods [3]–[5], [39] simply treat a detection response with high detection confidence as a newly appeared target, unless it has been associated with an existing trajectory. However, in practice, many detection responses with high confidence are FPs, so detection confidence alone is not enough to make an accurate decision. In this paper, we judge whether a detection candidate is a new target based on both detection confidence and scene structure. We assume that in a specific scene, the positions of newcomers follow a specific distribution. In other words, general entrances exist in the camera view. This assumption holds in many tracking scenes, e.g., surveillance cameras. In fact, there is no need to acquire the entrance positions. We want to measure the probability of the appearance of new objects based on their positions. A Gaussian mixture model (GMM) is suitable for estimating the scene structure influence, since it can utilize small-scale data to quickly and accurately approximate the unknown distribution of newcomers over positions while avoiding pixel-level overfitting

p(X_i) = \sum_{j=1...L} \pi_j^{(I)} N(x_i, y_i | \mu_j, \sigma_j^2)    (11)

where \pi_j^{(I)}, \mu_j, and \sigma_j are the weight, mean position, and standard deviation of the jth Gaussian component, respectively. To simplify model learning, we uniformly sample the Gaussian components over the camera view, and set the standard deviation to the distance between adjacent Gaussian components. \pi_j^{(I)} is then the only unknown model parameter to be estimated. Fig. 3 illustrates the probability of the appearance of new objects at different positions, which is consistent with the facts.

Fig. 3. Estimation of the position effect using the GMM on intensity. The left column shows the choice of fixed points. The middle column shows the reconstruction of the initialization position effect. The right column shows the reconstruction of the termination position effect.

Our method evaluates a function S_{ti}(X_i) for the quality of labeling X_i as a new object. In addition to the scene structure prior, it is natural that a candidate with a higher detection score has more confidence to initialize a new trajectory. Thus, S_{ti}(X_i) considers the influence of the scene structure prior \sum_{j=1...L} \pi_j^{(I)} N(x_i, y_i | \mu_j, \sigma_j) and the detection confidence prior r_i. In general, the detection score is not directly proportional to the detection confidence, so we use an offset form

S_{ti}(X_i) = \sum_{j=1...L} \pi_j^{(I)} N(x_i, y_i | \mu_j, \sigma_j) + \delta r_i.    (12)

Similarly to initialization, we use a GMM to estimate the termination probabilities based on the positions of the trajectories

S_{tt}(T_i) = \sum_{j=1...L} \pi_j^{(T)} N(u_i, v_i | \mu_j, \sigma_j).    (13)
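Equations (11)–(13) can be sketched with fixed, uniformly placed components and learned weights, as the text describes. The function names and the isotropic-Gaussian choice are our assumptions.

```python
import numpy as np

def gaussian_2d(pos, mu, sigma):
    """Isotropic 2-D Gaussian density N(pos | mu, sigma^2 I)."""
    diff = np.asarray(pos, dtype=float) - np.asarray(mu, dtype=float)
    return np.exp(-0.5 * (diff @ diff) / sigma**2) / (2.0 * np.pi * sigma**2)

def scene_prior(pos, weights, means, sigma):
    """GMM scene prior of Eq. (11): sum_j pi_j N(pos | mu_j, sigma^2).
    Components are uniformly sampled over the camera view; sigma is the
    spacing between adjacent components, so only the weights are learned."""
    return sum(w * gaussian_2d(pos, mu, sigma) for w, mu in zip(weights, means))

def initialization_score(pos, r, weights_init, means, sigma, delta):
    """S_ti(X_i) of Eq. (12): scene-structure prior plus detection-score offset."""
    return scene_prior(pos, weights_init, means, sigma) + delta * r

def termination_score(pos, weights_term, means, sigma):
    """S_tt(T_i) of Eq. (13): termination prior at the trajectory position."""
    return scene_prior(pos, weights_term, means, sigma)
```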

Fig. 3 also illustrates the probability of trajectory termination at different positions.

b) Trajectory growth: Trajectory growth involves the association between an existing tracked object and a newly tracked object. Similarly to [7], a link association score is defined as the product of three affinities based on position, size, and appearance

S_{tg}(X_i, T_j) = \beta A_{pos}(X_i, T_j) A_{size}(X_i, T_j) A_{app}(X_i, T_j).    (14)

A_{pos}, A_{size}, and A_{app} are normalized affinities

A_{pos}(X_i, T_j) = \exp\left( -\left[ \frac{(u_j + \dot{u}_j - x_i)^2}{\sigma_{px}^2} + \frac{(v_j + \dot{v}_j - y_i)^2}{\sigma_{py}^2} \right] \right)    (15)

A_{size}(X_i, T_j) = \exp\left( -\frac{(s_j^{(T)} - s_i^{(X)})^2}{\sigma_s^2} \right)    (16)

A_{app}(X_i, T_j) = BC(F_i, G_j)    (17)

where \sigma_{px}, \sigma_{py}, and \sigma_s are scale parameters and

BC(F_i, G_j) = \sum_{l=1}^{Q} \sqrt{F_{i,(l)} G_{j,(l)}}    (18)

is the Bhattacharyya coefficient between two histograms, and Q is the length of the appearance descriptor. This definition suggests that the position, size, and appearance of an object in consecutive frames should be similar.

c) Joint temporal continuity score: We follow the nonoverlap assumption [5]: a detection response can only belong to one trajectory. In addition, we assume that a detection response can only belong to one of three states: 1) eliminated detection; 2) new target; or 3) existing trajectory. We then have the joint temporal continuity score

S_t(X, Z, T) = \sum_{z_i = K+1} S_{ti}(X_i) + \sum_{0 < z_i \le K} S_{tg}(X_i, T_{z_i}) + \sum_{j : \forall i, z_i \neq j} S_{tt}(T_j).    (19)
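The three affinities and the growth score of (14)–(18) translate directly into code. The sketch below assumes detections and tracks expose the fields of (1) and (3) (as in the Detection/Track containers sketched earlier) and that the appearance descriptors are nonnegative histograms of equal length.

```python
import numpy as np

def position_affinity(det, trk, sigma_px, sigma_py):
    """A_pos of Eq. (15): Gaussian affinity between the detection center and
    the constant-velocity prediction of the track."""
    dx = trk.u + trk.du - det.x
    dy = trk.v + trk.dv - det.y
    return np.exp(-(dx**2 / sigma_px**2 + dy**2 / sigma_py**2))

def size_affinity(det, trk, sigma_s):
    """A_size of Eq. (16): penalizes differences in bounding-box area."""
    return np.exp(-((trk.area - det.area) ** 2) / sigma_s**2)

def appearance_affinity(det, trk):
    """A_app of Eq. (17): Bhattacharyya coefficient (Eq. (18)) between the
    two appearance histograms F and G."""
    return float(np.sum(np.sqrt(det.appearance * trk.appearance)))

def growth_score(det, trk, beta, sigma_px, sigma_py, sigma_s):
    """S_tg of Eq. (14): beta times the product of the three affinities."""
    return (beta
            * position_affinity(det, trk, sigma_px, sigma_py)
            * size_affinity(det, trk, sigma_s)
            * appearance_affinity(det, trk))
```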


Note that S_t(X, Z, T) can be expressed as the dot product of two vectors

S_t(X, Z, T) = W_t \cdot \varphi_t(X, Z, T)    (20)

where W_t = [\pi_1^{(I)} ... \pi_L^{(I)}, \delta, \beta, \pi_1^{(T)} ... \pi_L^{(T)}] contains the model parameters and \varphi_t(X, Z, T) = [\varphi_{ti}(X, Z), \varphi_{tg}(X, Z, T), \varphi_{tt}(T, Z)] is the data-specific feature vector. \varphi_{ti}(X, Z), \varphi_{tg}(X, Z, T), and \varphi_{tt}(T, Z) are defined as follows:

\varphi_{ti}(X, Z) = \sum_{z_i = K+1} \phi_{ti}(X_i)    (21)

\varphi_{tg}(X, Z, T) = \sum_{0 < z_i \le K} \phi_{tg}(X_i, T_{z_i}).    (22)

A straightforward overlap-based loss counts a prediction as correct only when the overlap exceeds 0.5 and the labels match exactly

\Delta_{ov}(z_i, h_i) = \begin{cases} 0 & : ov(i, i) > 0.5 \wedge z_i = h_i \\ 1 & : \text{otherwise} \end{cases}    (33)

the first condition is for mis-detection, and an inexact label gives a penalty. The second condition is for FPs: if there is no true positive near enough, it gives a penalty. However, we argue that for the first condition, a small spatial displacement should not be regarded as a mis-detection. For the second condition, repeated detections are not handled by \Delta_{ov}(z_i, h_i). See Fig. 4: \Delta_{ov} gives (c)–(e) the same penalty, while (d) is obviously better. To handle these problems, we consider a more appropriate loss function

\Delta_{rp}(Z, H) = \sum_{z_i \neq 0} \Delta_{rp}(z_i, H) + \sum_{h_i \neq 0} \Delta_{rp}(Z, h_i)    (34)

where

\Delta_{rp}(z_i, H) = \begin{cases} 1 & : \nexists j \;\text{s.t.}\; [ov(i, j) > 0.5 \wedge z_i = h_j] \\ 0 & : \text{otherwise} \end{cases}    (35)
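Under the reading of (34)–(36) reconstructed here, the margin-rescaling loss can be sketched as follows; the argument layout (label lists plus a precomputed overlap matrix) is our assumption.

```python
def loss_rp(Z, H, ov):
    """Label loss Delta_rp of Eqs. (34)-(36).

    Z, H : predicted and ground-truth label lists over the M detections.
    ov   : M x M matrix of pairwise overlap ratios between detections.
    """
    M = len(Z)
    loss = 0.0
    # Eq. (35): a predicted association is a miss only if no sufficiently
    # overlapping detection carries that label in the ground truth.
    for i in range(M):
        if Z[i] == 0:
            continue
        if not any(ov[i][j] > 0.5 and Z[i] == H[j] for j in range(M)):
            loss += 1.0
    # Eq. (36): exact match -> no penalty; soft match -> small penalty
    # (repeated response); otherwise a false positive -> full penalty.
    for i in range(M):
        if H[i] == 0 or Z[i] == H[i]:
            continue
        if any(ov[i][j] > 0.5 and Z[j] == H[i] for j in range(M)):
            loss += 0.1
        else:
            loss += 1.0
    return loss
```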


and

\Delta_{rp}(Z, h_i) = \begin{cases} 0 & : z_i = h_i \\ 0.1 & : z_i \neq h_i \wedge \exists j \;\text{s.t.}\; [ov(i, j) > 0.5 \wedge z_j = h_i] \\ 1 & : \text{otherwise.} \end{cases}    (36)

In \Delta_{rp}(z_i, H), we use soft matching to evaluate mis-detection, and the penalty is given only when there is no nearby true positive. In \Delta_{rp}(Z, h_i), there are three conditions. If there is an exact match, then there is no penalty. If the response can be soft matched to a true positive, then we treat it as a repeated response and give a small loss penalty. Otherwise, it is a FP and is given a large loss score. The comparison of these loss functions is shown in Fig. 4.

B. Cutting Plane Convex Training

It is easy to show that the optimization of (29) is equivalent to optimizing

\min_W J(W) = \frac{1}{2} \|W\|^2 + \lambda R(W)    (37)

where

R(W) = \sum_{n=1}^{N} \max_H \left( 0, \; \Delta(Z_n, H) - W \left( \varphi(X_n, Z_n, T_{n-1}) - \varphi(X_n, H, T_{n-1}) \right) \right).    (38)

R(W) is convex because the maximum of two linear functions is convex and the sum of several convex functions is also convex. Thus, we have the first-order Taylor lower bound approximation for R(W)

R(W) \ge R(W_0) + \partial_W R(W_0)(W - W_0) \quad \forall W, W_0    (39)

where \partial_W R(W_0) is the subgradient of R(W) at W_0, which is

\partial_W R(W_0) = -\sum_{n=1}^{N} \tau_n \left( \varphi(X_n, Z_n, T_{n-1}) - \varphi(X_n, H_n^*, T_{n-1}) \right)    (40)

and

\tau_n = \begin{cases} 1 & : \Delta(Z_n, H_n^*) > W_0 \left( \varphi(X_n, Z_n, T_{n-1}) - \varphi(X_n, H_n^*, T_{n-1}) \right) \\ 0 & : \text{otherwise} \end{cases}    (41)

where H_n^* is the most violated constraint

H_n^* = \arg\max_H \left( \Delta(Z_n, H) - W_0 \left( \varphi(X_n, Z_n, T_{n-1}) - \varphi(X_n, H, T_{n-1}) \right) \right).    (42)

The above equation means that R(W) can be lower bounded by a piecewise linear function at any location (say W_0), called a cutting plane [41]. In addition, given a set of locations {W_1, ..., W_n}, R(W) can be well approximated by a set of cutting planes. Denote by W_i the ith approximation point, and let a_i = \partial_W R(W_i) and b_i = R(W_i) - \partial_W R(W_i) W_i; since R(W) is always larger than 0, R(W) can be lower bounded by

R_i(W) = \max_{j=0...i} (a_j W + b_j)    (43)

where a_0 and b_0 are both set to 0. Instead of minimizing J(W) directly, cutting plane methods minimize it approximately by iteratively solving a quadratic program arising from its lower bound [40]

W_{i+1} = \arg\min_W J_i(W)    (44)

where

J_i(W) = \frac{1}{2} \|W\|^2 + \lambda R_i(W).    (45)

This is a quadratic programming problem, which can be rewritten as a constrained optimization problem and further into its dual form

\min_\alpha D(\alpha) = \frac{\lambda}{2} \alpha^T A^T A \alpha - \alpha^T B \quad \text{s.t.} \; \alpha \ge 0 \;\text{and}\; 1^T \alpha \le 1    (46)

where A denotes the matrix [a_1 ... a_i] and B the vector [b_1 ... b_i]^T; 1^T denotes a row vector of all ones and W = -\lambda A \alpha. Equation (46) can be solved with the publicly available simplex solver from [42].

V. EXPERIMENTS

A. Detection Evaluation

To quantitatively evaluate the impact of utilizing video cues on detection, we compare the detection results of our method with several other detection postprocessing results, such as NMS and Desai et al.'s method [37], on the CAVIAR data set [43]. The CAVIAR shopping center corridor view consists of 26 sequences with more than 35 000 frames. One of the sequences is used for training the human detector and the others for testing. All methods use the same original detection candidates as the input of the postprocessing steps. We hand-labeled the ground-truth bounding boxes for each video. A predicted bounding box is considered correct if it has more than 50% overlap with a ground-truth box; otherwise, it is a FP. By varying the detection threshold, we obtain a precision-recall curve for each method.

Fig. 5. Precision/recall curves for pedestrian detection on the CAVIAR data set. We show results for the full version model and the no-scene-prior model. Our method clearly outperforms NMS [24] and Desai et al.'s method [37].
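The training loop of (37)–(46) can be sketched as follows. This is a minimal illustration, not the authors' implementation: the callables are placeholders for the model's feature map, loss, and loss-augmented inference, and the inner dual solve uses a few projected-gradient steps as a stand-in for the simplex solver of [42].

```python
import numpy as np

def cutting_plane_train(feat, loss, most_violated, data, lam, outer_iters=50):
    """Minimal sketch of cutting-plane training, Eqs. (37)-(46).

    feat(x, z)             -> joint feature vector phi(x, z)
    loss(z, h)             -> label loss Delta(z, h)
    most_violated(w, x, z) -> argmax_h [Delta(z,h) - w.(phi(x,z) - phi(x,h))]
    data                   -> list of (x, z) training pairs

    Minimizes J(W) = 0.5 ||W||^2 + lam * R(W) through piecewise-linear
    lower bounds on R(W).
    """
    dim = len(feat(*data[0]))
    W = np.zeros(dim)
    planes = [(np.zeros(dim), 0.0)]                  # (a_0, b_0) = (0, 0)
    for _ in range(outer_iters):
        # New cutting plane at the current W: a_i is the subgradient of R
        # (Eqs. (40)-(42)); b_i = R(W) - a_i . W.
        a, R = np.zeros(dim), 0.0
        for x, z in data:
            h = most_violated(W, x, z)               # Eq. (42)
            margin = loss(z, h) - W @ (feat(x, z) - feat(x, h))
            if margin > 0:                           # tau_n = 1, Eq. (41)
                a -= feat(x, z) - feat(x, h)
                R += margin
        planes.append((a, R - a @ W))
        # Approximately solve the dual of Eq. (46) over the collected planes
        # with projected-gradient steps: min (lam/2) a'A'Aa - a'B,
        # s.t. alpha >= 0 and 1'alpha <= 1.
        A = np.stack([p[0] for p in planes], axis=1)  # dim x (i+1)
        B = np.array([p[1] for p in planes])
        alpha = np.full(len(planes), 1.0 / len(planes))
        for _ in range(200):
            grad = lam * (A.T @ (A @ alpha)) - B
            alpha = np.maximum(alpha - 1e-3 * grad, 0.0)
            if alpha.sum() > 1.0:
                alpha /= alpha.sum()
        W = -lam * (A @ alpha)                       # primal from dual
    return W
```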


Fig. 6. Detection samples of NMS (top row) and our method (bottom row) on CAVIAR. Although both use the same detector, the proposed method utilizes temporal continuity to obtain accurate detection results, while NMS is troubled by inter-occlusion.

Fig. 5 shows the precision/recall curves. The proposed method clearly outperforms NMS and Desai et al.'s method [37], largely due to the spatial temporal cues from the video, which help to raise detection robustness. Fig. 6 shows a typical situation in which NMS and Desai's method fail because of serious inter-occlusion, while our method utilizes temporal continuity to predict the correct bounding boxes. To investigate the impact of the scene structure prior, we also evaluate our method using only detection confidence to decide whether new targets are detected. In contrast to the complete version, it introduces more FPs and causes a precision drop due to background clutter.

B. Tracking Evaluation

We evaluate our tracking framework on four sequences: 1) CAVIAR; 2) i-LIDS AVSS AB [44]; 3) i-LIDS AVSS PV [44]; and 4) our congested traffic data set. The former two are used to test pedestrian tracking and the latter two are used to test vehicle tracking. We divide each data set into two disjoint parts, one for training and the other for testing. None of the trajectories or targets that appear in the testing part is used for training. In all sequences, we only use 2-D video information and do not assume any calibration or entry/exit zones. All latent parameters are set experimentally. We compare our method (structure prediction) with representative state-of-the-art methods [1], [3], [5], [7] on the four data sets. To quantitatively analyze the evaluation results of the different methods, we use the CLEAR MOT metrics [45], which return a precision score (MOTP) and an accuracy score (MOTA). We also report the MR, the ratio of FPs, and the number of identity switches (IDS). We first find a continuous mapping between the hypotheses and the ground truths in each frame based on [45], and then estimate each of the evaluation metrics as follows. Considering an object-hypothesis pair, the precision score is defined as the intersection over union (the intersection area over the union area) of their bounding boxes, and the multiple-object tracking precision (MOTP) is this score averaged over the total number of matches. A higher MOTP score means better precision. The multiple-object tracking accuracy (MOTA) is the ratio of correct tracking, where misses, FPs, and IDS are all considered wrong tracking.

A tracking miss occurs when an object is annotated but the intersection over union of the object-hypothesis pair is smaller than 0.5. A FP occurs when a hypothesis is not matched to any ground-truth object. An IDS occurs when the labels of two consecutive hypotheses are not consistent. Smaller numbers of tracking misses, FPs, and IDS mean better tracking performance.

1) Results on CAVIAR: The CAVIAR data set consists of sequences from two different views. Similar to [7], we use 26 sequences of the shopping center corridor view, including 36 292 frames. The sequences are 24 frames/s, and each frame is of size 384 × 288. The average length of a sequence is about 100 s and the entire data set is about 210 MB in the MPEG-4 format. We use two sequences for training and the others for testing. This data set is challenging because there are many inter-occlusions between humans, which cause failures in the detection step. Our method is particularly effective for this problem. Another challenge arises from the fact that some targets may enter a door in the scene and reappear after a long time. By keeping a hung-up trajectory state (alive but unseen), this problem can be partially resolved. As can be seen in Table II, our method outperforms the others even though our algorithm does not use global optimization. Some samples and results are shown in Fig. 7.

2) Results on i-LIDS AVSS AB: The i-LIDS AVSS AB data set contains three videos captured at the same subway station, and has 135 ground-truth trajectories. The videos are categorized into three levels of difficulty based on the density of pedestrian flow. All the sequences are 24 frames/s, and each frame is of size 720 × 576. The average length of a sequence is about 200 s and the entire data set is about 300 MB in the AVI format. Similar to CAVIAR, inter-human occlusion and scene occlusion make the data set challenging for detector and tracker. Compared with CAVIAR, the occlusion is much more serious in this sequence. When the train arrives, many people walk toward the subway and they are very close to one another. The detector cannot avoid mis-detecting some of them. In our model, the trajectory growth factor tends to retain the spatial continuity of the trajectory. Although detection responses are close, the algorithm can link the optimal responses to the existing trajectories. We use one of the videos, i-LIDS AVSS AB Medium, for training and the other two videos for testing. Our method achieves a comparable MOTP score and


Fig. 7. Sample frames of the proposed tracking method. More results are given in the video demo.

the best MOTA score among the state-of-the-art methods. Samples and results are shown in Fig. 7.

3) Results on i-LIDS AVSS PV: We report our car tracking results on i-LIDS AVSS PV. The data set contains three videos captured from the same traffic surveillance camera. Vehicles move along a two-lane road. There are several entrances and exits, and vehicles sometimes park along the road. The videos are categorized into three levels of difficulty based on the density of traffic flow. All the sequences are 24 frames/s, and each frame is of size 720 × 576. The average length of a sequence is about 200 s and the entire data set is about 300 MB in the AVI format. We use i-LIDS AVSS PV Medium for training and the other two videos for testing. As Table II shows, our method achieves very high MOTP and MOTA scores. However, misses and FPs still exist because the detector is confused when the vehicles are small. Samples and results are shown in Fig. 7.

4) Results on Congested Traffic: In general, pedestrian tracking is more challenging than vehicle tracking: the detection of pedestrians is more difficult and the occlusion between pedestrians is more serious. However, tracking multiple vehicles in a crowded street view is still under-addressed. We collected a data set called congested traffic, which contains five clips of video captured from traffic surveillance cameras on a congested street. This data set has very high traffic density, moving noise (pedestrians and bicycles), and serious occlusions. All of these make it a challenge to track vehicles in the sequences. The length of each video is more than one hour. All the sequences are 24 frames/s, and each frame is of size 255 × 288. For each video, we pick a 5-min segment for training and a 5-min segment for testing. The interval between the training and test segments is more than 20 min. We report the tracking results in Table II. Compared with the state-of-the-art methods, our algorithm achieves the best result. Samples and results are shown in Fig. 7.

5) Partial Order Comparisons: To confirm the superior performance of the proposed method with respect to the other comparison algorithms, a partial order is defined on the set of comparison algorithms for each evaluation metric, where A1 ≻ A2 means that the performance of algorithm A1 is statistically better than that of A2 on the specific metric based on a two-tailed paired t-test [46] at the 20% significance level. The partial order comparisons are summarized in Table III. Since the partial order only measures the relative performance between two algorithms on a specific metric, an algorithm A1 may perform better than A2 on one metric but worse on another. To view the overall performance of an algorithm, we assign an overall score to each algorithm that takes all the evaluation metrics into account. Specifically, for each evaluation metric, if A1 ≻ A2, then A1 gains a positive score +1 and A2 gains a negative score −1. Accumulating the scores over all the evaluation metrics, we can compare the relative performance of any pair of algorithms.

C. Time Complexity Analysis

Multiple target tracking is time sensitive, so it is necessary to consider the computational burden. An important expenditure of time in a tracking-by-detection system is the detection stage. A naive MATLAB implementation of the part-based model takes about 1 s per frame for detection. However, by combining the cascade acceleration [47] and the multiscale gradient histogram approximation scheme [48], the detection time per frame can be reduced to 2 ms [49]. We therefore focus on the time complexity analysis of


TABLE II
COMPARISON RESULTS MEASURED BY THE CLEAR METRIC, INCLUDING PRECISION (MOTP), ACCURACY (MOTA), MR, RATIO OF FPS, AND IDS

TABLE III
RELATIVE PERFORMANCE BETWEEN EACH TRACKING ALGORITHM

the proposed tracking framework, which contains a prediction phase and an update phase at each time step. The algorithm of the prediction phase is listed in Algorithm 1. Suppose there are K active trajectories in T_{t-1} and M detection candidates at time t. Line 1 counts for M(K + 2) units of time. Lines 3 and 4 count for three units per execution (one addition, one subtraction, one assignment) and are executed MK times. Line 2 has a hidden cost of (M + 1)(K + 1) units. Hence, Lines 2–5 cost 4MK + 7M + K + 1 units in total. Similarly, Lines 6–9 cost 3MK + 4M + K + 2 units. Line 11 counts for MK units. Lines 12–14 count for three units. Line 15 counts for M units. Lines 17–20 count for 3MK + 4M + K + 2 units. Line 22 counts for 1 unit. In the worst case, the repeat procedure runs M times, so the total cost of Lines 10–22 is 4M²K + 5M² + MK + 6 time units. The entire cost of the prediction phase is 4M²K + 5M² + 9MK + 13M + 2K + 9. Thus, the running time complexity is O(M²K). In the update phase, the old trajectories are updated to the new states, and it is easy to show that the time complexity is O(M + K). Since K and M are usually not too large, the whole tracking framework runs very fast, taking about 5 ms per frame. Hence, our algorithm can achieve real-time tracking on videos at 12 frames/s.

The particle filter-based tracking algorithm [1] spends its running time mainly on particle updating. In each time step, the time complexity of particle updating is O(M²K), where K is the number of detection candidates and M is the number of particles; this setting is the same as ours. Expectation maximization (EM) is employed in [3] and [5] for high-level trajectory association. It is known that EM usually takes a large number of iterations to reach an acceptable approximation. Hence, [3] and [5] are not capable of handling real-time applications in practice. The running time complexity of [7] is also O(M²K), in which the update and prediction phases are similar to the proposed scheme.

D. Robustness Analysis

Since the proposed tracking framework is an incremental system, in which the current tracking results depend on the tracking accuracy of the last time step, it is necessary to consider tracking failure and system robustness. To this end, we randomly initialize some trajectories (most of them wrong), and run our method to observe its performance. Fig. 8 shows tracking results on frames of the congested traffic data set. In Fig. 8(a), six trajectories, shown as colored bounding boxes, are randomly initialized; four of them are far from any

vehicles, and two of them (the black and yellow ones) overlap vehicles, but their positions are not accurate. In Fig. 8(b), our algorithm successfully tracks 11 vehicles, and the positions of the black and yellow bounding boxes are adjusted. The other four wrongly initialized bounding boxes are kept in consideration of temporary occlusion. In Fig. 8(c) and (d), almost all the vehicles are successfully tracked and all the wrongly initialized trajectories have been cleared.

Fig. 8. Tracking results of the proposed framework on the congested traffic data set when given a bad initialization. (a)–(d) First, fifth, tenth, and fifteenth frames of the video sequence.

The MOTA score of our method over the 15 frames following a bad initialization is shown in Fig. 9. It can be seen that our method quickly adjusts itself to the correct tracking states; hence, it is robust to tracking failure.

Fig. 9. MOTA score of our method over 15 sequential frames after a bad initialization.

VI. CONCLUSION

In this paper, we present a novel online tracking-by-detection framework that continually tracks multiple objects by alternately predicting the labels of the detection observation and updating the tracking state. The main challenge for this algorithm is how to predict correct matchings from unreliable detection results. We make three key contributions toward solving this problem: 1) we utilize the original detection collection, which contains more candidates than the final detection output, so that the possibility of finding the best matching is preserved; 2) cues from spatial interaction, temporal continuity, detection observation, appearance similarity, and scene structure are integrated to achieve tracking robustness; and 3) we formulate the task as a max-margin structure prediction problem. The parameters of the prediction model are learned from labeled trajectory data, and a well-suited margin-rescaling measure is designed to assess the labeling result. Experiments on several data sets show that our method achieves good performance in a variety of surveillance scenarios. The proposed method can deal with situations where the camera moves in a limited region, e.g., traffic surveillance and occupancy sensing, because the mixture structure of a small set of different scenes can be jointly learned by our GMM. However, we also note that it cannot deal with situations where the position of the camera changes over a large area, e.g., autonomous driving. In the future, we would like to explore updating the prediction parameters with an online optimizer, so that we can adjust the weights of the different affecting factors during tracking. Another extension would be to utilize stronger object classifiers for measuring the appearance similarity between detection candidates.

REFERENCES

[1] M. D. Breitenstein, F. Reichlin, B. Leibe, E. Koller-Meier, and L. Van Gool, “Online multiperson tracking-by-detection from a single, uncalibrated camera,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 33, no. 9, pp. 1820–1833, Sep. 2011. [2] B. Leibe, K. Schindler, N. Cornelis, and L. Van Gool, “Coupled object detection and tracking from static cameras and moving vehicles,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 30, no. 10, pp. 1683–1698, Oct. 2008. [3] C.-H. Kuo, C. Huang, and R. Nevatia, “Multi-target tracking by online learned discriminative appearance models,” in Proc. 23th IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), San Francisco, CA, USA, Jun. 2010, pp. 685–692. [4] M. Andriluka, S. Roth, and B. Schiele, “People-tracking-by-detection and people-detection-by-tracking,” in Proc. 21st IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), San Francisco, CA, USA, Jun. 2008, pp. 1–8. [5] C. Huang, B. Wu, and R. Nevatia, “Robust object tracking by hierarchical association of detection responses,” in Proc. 10th Eur. Conf. Comput. Vis. (ECCV), Marseille, France, Oct. 2008, pp. 788–801. [6] K. Okuma, A. Taleghani, N. D. Freitas, J. J. Little, and D. G. Lowe, “A boosted particle filter: Multitarget detection and tracking,” in Proc. 8th Eur. Conf. Comput. Vis. (ECCV), May 2004, pp. 28–39. [7] B. Wu and R. Nevatia, “Detection and tracking of multiple, partially occluded humans by Bayesian combination of edgelet based part detectors,” Int. J. Comput. Vis., vol. 75, no. 2, pp. 247–266, Jan. 2007. [8] L. I. Perlovsky and R. W. Deming, “Neural networks for improved tracking,” IEEE Trans. Neural Netw., vol. 18, no. 6, pp. 1854–1857, Nov. 2007. [9] W. Qu and D. Schonfeld, “Real-time decentralized articulated motion analysis and object tracking from videos,” IEEE Trans. Image Process., vol. 16, no. 8, pp. 2129–2137, Aug. 2007. [10] A. S. Poznyak, W. Yu, E. N. Sänchez, and J. P. Përez, “Nonlinear adaptive trajectory tracking using dynamic neural networks,” IEEE Trans. Neural Netw., vol. 10, no. 6, pp. 1402–1411, Nov. 1999. [11] T. Zhao and R. Nevatia, “Tracking multiple humans in crowded environment,” in Proc. 17th IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), vol. 2. Washington, DC, USA, Jun./Jul. 2004, pp. 406–413. [12] X. Song, J. Cui, H. Zha, and H. Zhao, “Vision-based multiple interacting targets tracking via on-line supervised learning,” in Proc. 10th Eur. Conf. Comput. Vis. (ECCV), Marseille, France, Oct. 2008, pp. 642–655. [13] J. Kwon and K. M. Lee, “Tracking by sampling trackers,” in Proc. 13th IEEE Int. Conf. Comput. Vis. (ICCV), Barcelona, Spain, Nov. 2011, pp. 1195–1202. [14] C. Gentile, O. Camps, and M. Sznaier, “Segmentation for robust tracking in the presence of severe occlusion,” IEEE Trans. Image Process., vol. 13, no. 2, pp. 166–178, Feb. 2004.

1072

IEEE TRANSACTIONS ON NEURAL NETWORKS AND LEARNING SYSTEMS, VOL. 26, NO. 5, MAY 2015

[15] M. Isard and J. MacCormick, “BraMBLe: A Bayesian multiple-blob tracker,” in Proc. 8th IEEE Int. Conf. Comput. Vis. (ICCV), vol. 2. Vancouver, BC, Canada, Jul. 2001, pp. 34–41. [16] S. Das, A. Kale, and N. Vaswani, “Particle filter with a mode tracker for visual tracking across illumination changes,” IEEE Trans. Image Process., vol. 21, no. 4, pp. 2340–2346, Apr. 2012. [17] W. Hu, X. Li, X. Zhang, X. Shi, S. Maybank, and Z. Zhang, “Incremental tensor subspace learning and its applications to foreground segmentation and tracking,” Int. J. Comput. Vis., vol. 91, no. 3, pp. 303–327, Feb. 2011. [18] H. Grabner and H. Bischof, “On-line boosting and vision,” in Proc. 19th IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), New York, NY, USA, Jun. 2006, pp. 260–267. [19] H. Grabner, M. Grabner, and H. Bischof, “Real-time tracking via online boosting,” in Proc. 17th Brit. Mach. Vis. Conf. (BMVC), Edinburgh, Scotland, Sep. 2006, pp. 47–56. [20] H. Grabner, C. Leistner, and H. Bischof, “Semi-supervised on-line boosting for robust tracking,” in Proc. 10th Eur. Conf. Comput. Vis. (ECCV), Marseille, France, Oct. 2008, pp. 234–247. [21] B. Babenko, M.-H. Yang, and S. Belongie, “Robust object tracking with online multiple instance learning,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 33, no. 8, pp. 1619–1632, Aug. 2011. [22] Z. Kalal, K. Mikolajczyk, and J. Matas, “Tracking-learning-detection,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 6, no. 1, pp. 1–14, Jan. 2010. [23] D. Nguyen-Tuong and J. Peters, “Online kernel-based learning for taskspace tracking robot control,” IEEE Trans. Neural Netw. Learn. Syst., vol. 23, no. 9, pp. 1417–1425, Sep. 2012. [24] P. F. Felzenszwalb, R. B. Girshick, D. McAllester, and D. Ramanan, “Object detection with discriminatively trained part-based models,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 32, no. 9, pp. 1627–1645, Sep. 2010. [25] X. Mei and H. Ling, “Robust visual tracking using L1 minimization,” in Proc. 12th IEEE Int. Conf. Comput. Vis. (ICCV), Kyoto, Japan, Sep./Oct. 2009, pp. 1436–1443. [26] C. Wang and Y. Lin, “Adaptive dynamic surface control for linear multivariable systems,” Automatica, vol. 46, pp. 1703–1711, Oct. 2010. [27] C. Wang and Y. Lin, “Multivariable adaptive backstepping control: A norm estimation approach,” IEEE Trans. Autom. Control, vol. 57, no. 4, pp. 989–995, Sep. 2012. [28] F. Yang and M. Paindavoine, “Implementation of an RBF neural network on embedded systems: Real-time face tracking and identity verification,” IEEE Trans. Neural Netw., vol. 14, no. 5, pp. 1162–1175, Sep. 2003. [29] M. Yang, G. Hua, and Y. Wu, “Context-aware visual tracking,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 31, no. 7, pp. 1195–1209, Jul. 2009. [30] H. Wang, D. Suter, K. Schindler, and C. Shen, “Adaptive object tracking based on an effective appearance filter,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 29, no. 9, pp. 1661–1667, Sep. 2007. [31] Y. Li, C. Huang, and R. Nevatia, “Learning to associate: HybridBoosted multi-target tracker for crowded scene,” in Proc. 23rd IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), Miami, FL, USA, Jun. 2009, pp. 2953–2960. [32] L. Zhang, Y. Li, and R. Nevatia, “Global data association for multiobject tracking using network flows,” in Proc. 21st IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), San Francisco, CA, USA, Jun. 2008, pp. 9–16. [33] M. Liu and B. C. Vemuri, “Robust and efficient regularized boosting using total Bregman divergence,” in Proc. 24th IEEE Conf. Comput. Vis. Pattern Recognit. 
(CVPR), Providence, RI, USA, Jun. 2011, pp. 2897–2902. [34] K. Kim and L. S. Davis, “Multi-camera tracking and segmentation of occluded people on ground plane using search-guided particle filtering,” in Proc. 9th Eur. Conf. Comput. Vis. (ECCV), Graz, Austria, May 2006, pp. 98–109. [35] A. Ess, B. Leibe, K. Schindler, and L. Van Gool, “A mobile vision system for robust multi-person tracking,” in Proc. 21st IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), Anchorage, AK, USA, Jun. 2008, pp. 17–24. [36] M. Andriluka, S. Roth, and B. Schiele, “Monocular 3D pose estimation and tracking by detection,” in Proc. 23th IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), San Francisco, CA, USA, Jun. 2010, pp. 623–630. [37] C. Desai, D. Ramanan, and C. Fowlkes, “Discriminative models for multi-class object layout,” in Proc. 12th IEEE Int. Conf. Comput. Vis. (ICCV), Kyoto, Japan, Sep./Oct. 2009, pp. 229–236.

[38] M. A. Sadeghi and A. Farhadi, “Recognition using visual phrases,” in Proc. 24th IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), Providence, RI, USA, Jun. 2011, pp. 1745–1752. [39] S. I. Yu, Y. Yang, and A. Hauptmann, “Harry Potter’s Marauder’s Map: Localizing and tracking multiple persons-of-interest by nonnegative discretization,” in Proc. 26th IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), Portland, OR, USA, Jun. 2013, pp. 3714–3720. [40] C. H. Teo, A. Smola, S. V. N. Vishwanathan, and Q. V. Le, “A scalable modular convex solver for regularized risk minimization,” in Proc. 13th ACM SIGKDD Int. Conf. Knowl. Discovery Data Mining (SIGKDD), New York, NY, USA, Aug. 2007, pp. 727–736. [41] T. Joachims, T. Finley, and C.-N. J. Yu, “Cutting-plane training of structural SVMs,” Mach. Learn., vol. 77, no. 1, pp. 27–59, Oct. 2009. [42] (2007). Library for Quadratic Programming [Online]. Available: http://cmp.felk.cvut.cz/∼xfrancv/libqp/html/ [43] (2004). CAVIAR Dataset [Online]. Available: http://homepages.inf. ed.ac.uk/rbf/CAVIAR/ [44] (2007). i-Lids Dataset [Online]. Available: http://www.eecs.qmul.ac. uk/∼andrea/avss2007_d.html [45] K. Bernardin and R. Stiefelhagen, “Evaluating multiple object tracking performance: The CLEAR MOT metrics,” J. Image Video Process., vol. 2008, pp. 1–10, Feb. 2008. [46] M.-L. Zhang and Z.-H. Zhou, “ML-KNN: A lazy learning approach to multi-label learning,” Pattern Recognit., vol. 40, no. 7, pp. 2038–2048, Jul. 2007. [47] P. F. Felzenszwalb, R. B. Girshick, and D. McAllester, “Cascade object detection with deformable part models,” in Proc. 23th IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), San Francisco, CA, USA, Jun. 2010, pp. 2241–2248. [48] P. Dollár, S. Belongie, and P. Perona, “The fastest pedestrian detector in the west,” in Proc. 21th Brit. Mach. Vis. Conf. (BMVC), Aug. 2010, pp. 68.1–68.11. [49] R. Benenson, M. Mathias, R. Timofte, and L. Van Gool, “Pedestrian detection at 100 frames per second,” in Proc. 25th IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), Providence, RI, USA, Jun. 2012, pp. 2903–2910.

Xiao Liu is currently pursuing the Ph.D. degree with the Microsoft Visual Perception Laboratory, Zhejiang University, Hangzhou, China. His current research interests include surveillance and learning system.

Dacheng Tao (M’07–SM’12) is a Professor of Computer Science with the Centre for Quantum Computation and Intelligent Systems and the Faculty of Engineering and Information Technology at the University of Technology at Sydney, Ultimo, NSW, Australia. He mainly applies statistics and mathematics for data analysis problems in data mining, computer vision, machine learning, multimedia, and video surveillance. He has authored and co-authored more than 100 scientific articles at top venues, including the IEEE T RANSACTIONS ON PATTERN A NALYSIS AND M ACHINE I NTELLIGENCE, the IEEE T RANSACTIONS ON N EURAL N ETWORKS AND L EARNING S YSTEMS , the IEEE T RANSACTIONS ON I MAGE P ROCESSING , the Conference on Neural Information Processing Systems, the International Conference on Machine Learning, the International Conference on Artificial Intelligence and Statistics, the IEEE International Conference on Data Mining series, the Computer Vision and Pattern Recognition Conference, the International Conference on Computer Vision, the European Conference on Computer Vision, the ACM Transactions on Knowledge Discovery from Data, the ACM Multimedia Conference, and the ACM Conference on Knowledge Discovery and Data Mining. Dr. Tao was a recipient of the Best Theory/Algorithm Paper Runner Up Award in IEEE ICDM’07 and the Best Student Paper Award in IEEE ICDM’13.

LIU et al.: LEARNING TO TRACK MULTIPLE TARGETS

1073

Mingli Song (M’06–SM’13) received the Ph.D. degree in computer science from Zhejiang University, Hangzhou, China, in 2006. He is an Associate Professor with the Microsoft Visual Perception Laboratory, Zhejiang University. His current research interests include face modeling and facial expression analysis. Dr. Song was a recipient of the Microsoft Research Fellowship in 2004.

Jiajun Bu is a Professor with the College of Computer Science, Zhejiang University, Hangzhou, China. His current research interests include information retrieval, computer vision, and embedded system.

Luming Zhang is currently pursuing the Ph.D. degree in computer science from Zhejiang University, Hangzhou, China. His current research interests include visual perception analysis, image enhancement, and pattern recognition.

Chun Chen is a Professor with the College of Computer Science, Zhejiang University, Hangzhou, China. His current research interests include computer vision, computer graphics, and embedded technology.
