
An Unsupervised Feature Selection Dynamic Mixture Model for Motion Segmentation

Thanh Minh Nguyen and Qingming Jonathan Wu, Senior Member, IEEE

Abstract—The automatic clustering of time-varying characteristics and phenomena in natural scenes has recently received great attention. While many algorithms exist for motion segmentation, an important issue arising from these studies is which attributes of the data should be used to cluster phenomena that exhibit a certain repetitiveness in both space and time. The problem is difficult because no knowledge of the phenomena's labels is available to guide the search. In this paper, we present a feature selection dynamic mixture model for motion segmentation. The advantage of our method is that it is intuitively appealing, avoids any combinatorial search, and allows us to prune the feature set. Numerical experiments on various phenomena are conducted, and the performance of the proposed model is compared with that of other motion segmentation algorithms, demonstrating the robustness and accuracy of our method.

Index Terms—Feature selection dynamic mixture model (FSDTM), linear dynamical system (LDS), Kalman filter, dynamic texture segmentation.

I. INTRODUCTION

DYNAMIC texture is an extension of texture to the temporal domain and is very common in the real world; examples include smoke, waves, clouds, various movements, and even shadows. The study of dynamic texture [1]–[3] has attracted growing attention and has become an active topic in almost every area of video processing. The objective of dynamic texture segmentation is to cluster time-varying characteristics and phenomena that exhibit a certain repetitiveness in both space and time into several groups and to assign a unique label to each group. A correct segmentation result provides more information for subsequent analysis; however, the complexity of dynamic textures, especially in outdoor scenes, may lead to inaccurate segmentation.

Up to now, many algorithms have been developed for dynamic texture segmentation, in particular methods based on optical flow [4], [5]. The main advantage of these models is their capability to use prior knowledge, namely that the optical flow field can be used for motion segmentation. However, their major disadvantage is that they do not provide

Manuscript received February 9, 2013; revised May 15, 2013, August 14, 2013, and November 6, 2013; accepted January 3, 2014. Date of publication January 16, 2014; date of current version February 4, 2014. This work was supported by the Natural Sciences and Engineering Research Council of Canada. The associate editor coordinating the review of this manuscript and approving it for publication was Prof. Jung C. Ye. The authors are with the Department of Electrical and Computer Engineering, University of Windsor, Windsor, ON N9B 3P4, Canada (e-mail: [email protected]; [email protected]). Color versions of one or more of the figures in this paper are available online at http://ieeexplore.ieee.org. Digital Object Identifier 10.1109/TIP.2014.2300811

any information about the objects that compose the scene. In order to improve the robustness of segmentation, global cues with local representations are used in [6] and [7], which model a video as a superposition of layers undergoing homogeneous motion. Their primary disadvantage, however, lies in the additional affine transforms, which assume a piecewise planar world that rarely holds in practice.

During the last few decades, a number of algorithms based on probability theory have been proposed for segmentation. In this group, the Gaussian mixture model (GMM) [8], [9] is a well-known method that has been widely used due to its simplicity and ease of implementation. Its success is attributed to the fact that the model parameters can be efficiently estimated with the expectation-maximization (EM) algorithm [10], and many researchers have used it to study key problems in segmentation [11]–[13]. The major disadvantage, however, is that these methods are founded mostly upon algorithms for segmenting static images, whereas the real world contains many time-varying phenomena with a certain repetitiveness in both space and time.

To overcome this problem, generalized principal component analysis (GPCA) was proposed in [14] to segment a video by clustering pixels with similar trajectories in time. One contribution of this line of work is the view of video sequences as dynamic textures, that is, as samples from stochastic processes defined over space and time. In [15], an approach was proposed for clustering phenomena by viewing video sequences as dynamic textures, and it has shown great potential for motion segmentation; however, it does not support a perceptual decomposition into more than two regions. To eliminate this problem, a dynamic texture model (DTM) was proposed in [16], [17], in which a collection of sequences is modeled as samples from a set of underlying dynamic textures: a video is converted to grayscale and represented as a collection of localized spatiotemporal observations, and the main objective is to segment a video consisting of N patches into K labels. The main advantage of this model is that it provides a natural way to cluster video sequences according to the mixture components that generated them, and it has been extended to key problems in phenomena clustering such as music segmentation [18] and motion segmentation [19].

One drawback of the above dynamic mixture model is that there is only one set of dynamics for the whole observation vector. To overcome this problem, dynamic fuzzy clustering (DFC) with Kullback-Leibler divergence information has recently been



introduced [20] by adding separate models for each feature. Its success is attributed to the fact that observations in a video, such as lightness, color, and other texture descriptors, can all be used for motion segmentation. The major disadvantage of this method, however, is that it considers all available features with equal weight. In practice this is not always appropriate: some features may be more useful and can improve the results learned from limited amounts of data, while others may be irrelevant for modeling.

Motivated by the aforementioned observations, we present a feature selection dynamic mixture model (FSDTM) for motion segmentation. The advantage of our method is that it is intuitively appealing, avoids any combinatorial search, and allows us to prune the feature set. The proposed method considers all available features with different weights; after parameter learning, the saliencies (weights) of the irrelevant features go to zero, allowing us to prune them, while the important features are retained. For this reason, it helps to improve the segmentation result. The proposed model is applied to segmenting both simulated and real time-varying characteristics, and we demonstrate through extensive simulations that it is superior to other motion segmentation algorithms.

The remainder of this paper is organized as follows: Section II describes the related problems; Section III describes the proposed method in detail; Section IV presents the parameter estimation; Section V sets out the experimental results; and Section VI presents our conclusions.

II. RELATED PROBLEMS

A video is represented as a collection of localized spatiotemporal observations y, and x is the state variable corresponding to y. Here m is the observation space dimension, τ is the length of an observed sequence, and D is the number of available features. The main objective is to segment a video consisting of N observations (patches) y_i into K labels, denoted (Ω_1, Ω_2, ..., Ω_K). The parameter π_j is the prior probability of an observation y_i belonging to the label Ω_j, j = (1, 2, ..., K).

Let us consider the problem of estimating the posterior probability of an observation y_i, i = (1, 2, ..., N), belonging to label Ω_j. The DTM [16], [17] assumes that each observation y_i follows a mixture of dynamic textures. In this method, the conditional distribution of the observations y and the states x given the hidden variables z is

$$p(\mathbf{y}, \mathbf{x} \mid \mathbf{z}, \Theta) = \prod_{i=1}^{N} \prod_{j=1}^{K} \bigl[ p(\mathbf{y}_i, \mathbf{x}_i \mid \Theta_j) \bigr]^{z_{ij}} \qquad (1)$$

In (1), Θ denotes the model parameters; z_ij = 1 indicates that y_i is assigned to label Ω_j, and z_ij = 0 otherwise. The joint likelihood p(y_i, x_i | Θ_j) in (1) is written in the form

$$p(\mathbf{y}_i, \mathbf{x}_i \mid \Theta_j) = \mathrm{DT}(\mathbf{y}_i, \mathbf{x}_i \mid \Theta_j) = p(\mathbf{x}_1^{i} \mid \Theta_j) \prod_{t=2}^{\tau} p(\mathbf{x}_t^{i} \mid \mathbf{x}_{t-1}^{i}, \Theta_j) \prod_{t=1}^{\tau} p(\mathbf{y}_t^{i} \mid \mathbf{x}_t^{i}, \Theta_j) \qquad (2)$$


where y_i = (y_1^i, y_2^i, ..., y_τ^i), the observed variable at time t is y_t^i ∈ R^m, t = (1, 2, ..., τ), and x_t^i ∈ R^n is the state variable corresponding to y_i. As we can see from (2), in order to model the observed dynamic texture data, the DTM treats the whole feature vector as a single entity; note that any time-series observation could be used in this model. It is well known that observations in a video can be described by different visual properties, such as lightness, color, and other texture descriptors, and cannot be classified consistently on the basis of a single visual property alone.

In order to overcome this problem, DFC with Kullback-Leibler divergence information was introduced in [20] by adding separate models for each feature. In this model, the prior distribution is different for each observation and depends on the label, and a hidden Markov random field is adopted to incorporate the spatial relationships among neighboring observations. To estimate the model parameters, a gradient method is applied to minimize the fuzzy objective function with the Kullback-Leibler divergence information. This method considers all available features with equal weight. It can be said that DFC is a generalization of the DTM model: in DFC, π_ij is the prior probability of the observation y_i belonging to label Ω_j, and λ is the degree of fuzziness of the fuzzy membership. Comparing the mathematical expressions of DFC with those of the DTM in [16], [17], we see that if the number of visual properties is set to one (D = 1) and the degree of fuzziness is set to one (λ = 1), the objective function of DFC is similar to the Q-function of the EM algorithm for the DTM.

As mentioned above, in DFC the prior distribution is different for each observation, and the spatial relationships between neighboring observations are considered in the decision process. If we ignore these spatial constraints and use a common prior distribution for all observations, the joint likelihood p(y_i, x_i | Θ_j) is written in the form

$$p(\mathbf{y}_i, \mathbf{x}_i \mid \Theta_j) = \prod_{k=1}^{D} \Bigl\{ p(\mathbf{x}_1^{ik} \mid \Theta_{jk}) \prod_{t=2}^{\tau} p(\mathbf{x}_t^{ik} \mid \mathbf{x}_{t-1}^{ik}, \Theta_{jk}) \prod_{t=1}^{\tau} p(\mathbf{y}_t^{ik} \mid \mathbf{x}_t^{ik}, \Theta_{jk}) \Bigr\} \qquad (3)$$

In (3), y_i = {y^{i1}, y^{i2}, ..., y^{iD}} ∈ R^{m×τ×D}, where y^{ik} = (y_1^{ik}, y_2^{ik}, ..., y_τ^{ik}), y_t^{ik} ∈ R^m, t = (1, 2, ..., τ), and x_t^{ik} ∈ R^n is the state variable corresponding to y_t^{ik}. The state x_t^{ik} and the observed variables y_t^{ik} are related through the LDS [21], [22], and p(x_1^{ik} | Θ_jk), p(x_t^{ik} | x_{t-1}^{ik}, Θ_jk), and p(y_t^{ik} | x_t^{ik}, Θ_jk) are multivariate Gaussian distributions [20]. As usual in the EM literature, the distribution of the hidden variables z given the prior probabilities π is

$$p(\mathbf{z} \mid \pi) = \prod_{i=1}^{N} \prod_{j=1}^{K} \pi_j^{z_{ij}} \qquad (4)$$

where the prior probabilities satisfy the constraints π_j ≥ 0 and

$$\sum_{j=1}^{K} \pi_j = 1 \qquad (5)$$
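To make the per-feature observation model concrete, the following Python sketch samples one spatiotemporal sequence from a linear dynamical system of the form used above. It is a minimal illustration under assumed toy parameters (the matrices A, C, Q, R and the initial mean/covariance below are made up for the example), not the authors' implementation.

```python
import numpy as np

def sample_lds(A, C, Q, R, u, S, tau, rng=None):
    """Sample (x_1..tau, y_1..tau) from an LDS:
       x_1 ~ N(u, S), x_t = A x_{t-1} + N(0, Q), y_t = C x_t + N(0, R)."""
    rng = np.random.default_rng(rng)
    n, m = A.shape[0], C.shape[0]
    x = np.zeros((tau, n))
    y = np.zeros((tau, m))
    x[0] = rng.multivariate_normal(u, S)
    y[0] = C @ x[0] + rng.multivariate_normal(np.zeros(m), R)
    for t in range(1, tau):
        x[t] = A @ x[t - 1] + rng.multivariate_normal(np.zeros(n), Q)
        y[t] = C @ x[t] + rng.multivariate_normal(np.zeros(m), R)
    return x, y

# Toy usage with arbitrary parameters for a 3x3 patch (m = 9) and a 2-D state (n = 2).
rng = np.random.default_rng(0)
n, m, tau = 2, 9, 60
A = 0.95 * np.eye(n)
C = rng.standard_normal((m, n))
Q, R = 0.01 * np.eye(n), 0.05 * np.eye(m)
x, y = sample_lds(A, C, Q, R, np.zeros(n), np.eye(n), tau, rng=1)
print(y.shape)  # (60, 9): one observed patch trajectory
```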


TABLE I: NOTATIONS USED IN THE PROPOSED METHOD

Given the conditional distribution p(y, x | z, Θ) in (1) and the distribution p(z | π) in (4), the overall joint distribution is given by

$$p(\mathbf{y}, \mathbf{x}, \mathbf{z} \mid \pi, \Theta) = p(\mathbf{z} \mid \pi)\, p(\mathbf{y}, \mathbf{x} \mid \mathbf{z}, \Theta) \qquad (6)$$

The parameters of the model are learned with the EM algorithm [10], [21], [22] by maximizing the data likelihood p(y, x, z | π, Θ) with respect to the parameters; please refer to [16], [17], and [20] for further details on parameter estimation.

III. PROPOSED METHOD

For convenience, the notations used in the proposed method are summarized in Table I. As shown in Section II, the main goal of a model is to describe the statistical properties of the underlying phenomena as well as possible. The existing models rely on p(y_i, x_i | Θ_j) to model the underlying distributions, and the methods mentioned above consider all available features with equal weight. In practice this is not always appropriate: some features may be more useful and can improve the results, while others may be irrelevant for modeling. In order to overcome this problem, we adopt the concept of feature saliency from [23]–[26]: feature k is irrelevant if its distribution is independent of the class labels. Motivated by this observation, we express the saliency of each feature through the hidden variables r_ijk. Each hidden variable r_ijk is Boolean (0 or 1); the value r_ijk = 1 indicates

Fig. 1. Graphical model for the generation of the proposed model. Symbols in circles denote random variables; otherwise, they denote model parameters. Arrows show the relationship between variables. Plates denote repetitions. In a plate, the number of repetitions for the variables is depicted in the bottom-left corner.

that the observations of feature k with label Ω_j have been generated from the useful subcomponent (with parameters Θ); otherwise (r_ijk = 0), they have been generated from the noisy subcomponent (with parameters Υ). We assume that the observation y has been generated from the graphical model illustrated in Fig. 1. In our method, the distribution of the hidden variables r given the feature saliencies η is

$$p(\mathbf{r} \mid \eta) = \prod_{i=1}^{N} \prod_{j=1}^{K} \prod_{k=1}^{D} \eta_k^{r_{ijk}} (1 - \eta_k)^{1 - r_{ijk}} \qquad (7)$$


In (7), η_k is the probability that the k-th feature is relevant. The feature saliency η_k takes a value in the interval [0, 1]; a feature is effectively removed from consideration when its saliency approaches zero. This formulation is attractive because it does not require a combinatorial search over the possible subsets of features, which is generally infeasible. Next, we define a new conditional distribution of y and x given particular values of z and r as follows:

$$p(\mathbf{y}, \mathbf{x} \mid \mathbf{z}, \mathbf{r}, \Theta, \Upsilon) = p(\mathbf{x} \mid \mathbf{z}, \mathbf{r}, \Theta, \Upsilon)\, p(\mathbf{y} \mid \mathbf{z}, \mathbf{r}, \mathbf{x}, \Theta, \Upsilon) = \prod_{i=1}^{N} \prod_{j=1}^{K} \Bigl[ \prod_{k=1}^{D} p(\mathbf{y}^{ik}, \mathbf{x}^{ik} \mid \Theta_{jk})^{r_{ijk}}\, p(\mathbf{y}^{ik}, \mathbf{x}^{ik} \mid \Upsilon_k)^{1 - r_{ijk}} \Bigr]^{z_{ij}} \qquad (8)$$

In (8), we adopt the definition of irrelevance suggested in [23], [25], and [26] for unsupervised learning: the k-th feature is irrelevant if its distribution is independent of the class labels. The first subcomponent, with parameters Θ, generates the useful data (features), which differ for each label Ω_j. The second subcomponent, with parameters Υ, is common to all labels Ω_j and generates the noisy data. The useful data are described by the joint likelihood p(y^{ik}, x^{ik} | Θ_jk), defined as

$$p(\mathbf{y}^{ik}, \mathbf{x}^{ik} \mid \Theta_{jk}) = \mathrm{DT}(\mathbf{y}^{ik}, \mathbf{x}^{ik} \mid \Theta_{jk}) = p(\mathbf{x}_1^{ik} \mid \Theta_{jk}) \prod_{t=2}^{\tau} p(\mathbf{x}_t^{ik} \mid \mathbf{x}_{t-1}^{ik}, \Theta_{jk}) \prod_{t=1}^{\tau} p(\mathbf{y}_t^{ik} \mid \mathbf{x}_t^{ik}, \Theta_{jk}) \qquad (9)$$

and the joint likelihood p(y^{ik}, x^{ik} | Υ_k) is used to describe the noisy data:

$$p(\mathbf{y}^{ik}, \mathbf{x}^{ik} \mid \Upsilon_k) = \mathrm{DT}(\mathbf{y}^{ik}, \mathbf{x}^{ik} \mid \Upsilon_k) = p(\mathbf{x}_1^{ik} \mid \Upsilon_k) \prod_{t=2}^{\tau} p(\mathbf{x}_t^{ik} \mid \mathbf{x}_{t-1}^{ik}, \Upsilon_k) \prod_{t=1}^{\tau} p(\mathbf{y}_t^{ik} \mid \mathbf{x}_t^{ik}, \Upsilon_k) \qquad (10)$$

In (9), y_i = {y^{i1}, y^{i2}, ..., y^{iD}} ∈ R^{m×τ×D}, where y^{ik} = (y_1^{ik}, y_2^{ik}, ..., y_τ^{ik}), y_t^{ik} ∈ R^m, t = (1, 2, ..., τ), and x_t^{ik} ∈ R^n is the state variable corresponding to y_t^{ik}. The state x_t^{ik} and the observed variables y_t^{ik} are related through the LDS [21], [22] defined by

$$\begin{cases} \mathbf{x}_{t+1}^{ik} = A_{jk}\,\mathbf{x}_t^{ik} + \mathcal{N}(0, Q_{jk}) \\ \mathbf{y}_t^{ik} = C_{jk}\,\mathbf{x}_t^{ik} + \mathcal{N}(0, R_{jk}) \end{cases} \qquad (11)$$

where N(0, Q_jk) is normally distributed with zero mean and covariance matrix Q_jk ∈ R^{n×n}, and N(0, R_jk) is normally distributed with zero mean and covariance matrix R_jk ∈ R^{m×m}. A_jk ∈ R^{n×n} is the state transition matrix, C_jk ∈ R^{m×n} is the observation matrix, u_jk ∈ R^n is the mean vector of the initial state, and S_jk ∈ R^{n×n} is its covariance matrix. The distributions p(x_1^{ik} | Θ_jk), p(x_t^{ik} | x_{t-1}^{ik}, Θ_jk), and p(y_t^{ik} | x_t^{ik}, Θ_jk) in (9) are multivariate Gaussian distributions, defined as

$$p(\mathbf{x}_1^{ik} \mid \Theta_{jk}) = \mathcal{N}(\mathbf{x}_1^{ik};\, u_{jk}, S_{jk}) = \frac{1}{(2\pi)^{n/2} |S_{jk}|^{1/2}} \exp\Bigl\{ -\tfrac{1}{2} (\mathbf{x}_1^{ik} - u_{jk})^T S_{jk}^{-1} (\mathbf{x}_1^{ik} - u_{jk}) \Bigr\},$$
$$p(\mathbf{x}_t^{ik} \mid \mathbf{x}_{t-1}^{ik}, \Theta_{jk}) = \mathcal{N}(\mathbf{x}_t^{ik};\, A_{jk}\mathbf{x}_{t-1}^{ik}, Q_{jk}), \qquad p(\mathbf{y}_t^{ik} \mid \mathbf{x}_t^{ik}, \Theta_{jk}) = \mathcal{N}(\mathbf{y}_t^{ik};\, C_{jk}\mathbf{x}_t^{ik}, R_{jk}) \qquad (12)$$

Similarly, in (10), x_t^{ik} ∈ R^n is the state variable corresponding to y_t^{ik}, and the relationship between the state x_t^{ik} and the observed variables y_t^{ik} is defined by

$$\begin{cases} \mathbf{x}_{t+1}^{ik} = E_k\,\mathbf{x}_t^{ik} + \mathcal{N}(0, H_k) \\ \mathbf{y}_t^{ik} = B_k\,\mathbf{x}_t^{ik} + \mathcal{N}(0, G_k) \end{cases} \qquad (13)$$

In (13), N(0, H_k) and N(0, G_k) are normally distributed with zero mean and covariance matrices H_k ∈ R^{n×n} and G_k ∈ R^{m×m}, respectively. E_k ∈ R^{n×n} is the state transition matrix, B_k ∈ R^{m×n} is the observation matrix, o_k ∈ R^n is the mean vector of the initial state, and F_k ∈ R^{n×n} is its covariance matrix. The multivariate distributions p(x_1^{ik} | Υ_k), p(x_t^{ik} | x_{t-1}^{ik}, Υ_k), and p(y_t^{ik} | x_t^{ik}, Υ_k) in (10) are defined as

$$p(\mathbf{x}_1^{ik} \mid \Upsilon_k) = \mathcal{N}(\mathbf{x}_1^{ik};\, o_k, F_k), \qquad p(\mathbf{x}_t^{ik} \mid \mathbf{x}_{t-1}^{ik}, \Upsilon_k) = \mathcal{N}(\mathbf{x}_t^{ik};\, E_k\mathbf{x}_{t-1}^{ik}, H_k), \qquad p(\mathbf{y}_t^{ik} \mid \mathbf{x}_t^{ik}, \Upsilon_k) = \mathcal{N}(\mathbf{y}_t^{ik};\, B_k\mathbf{x}_t^{ik}, G_k) \qquad (14)$$

Note that the hidden variables z are Boolean. From the function p(r | η) in (7), p(y, x | z, r, Θ, Υ) in (8), and p(z | π) in (4), the overall joint distribution is given in the form

$$p(\mathbf{y}, \mathbf{x}, \mathbf{z}, \mathbf{r} \mid \pi, \eta, \Theta, \Upsilon) = p(\mathbf{z} \mid \pi)\, p(\mathbf{r} \mid \eta)\, p(\mathbf{y}, \mathbf{x} \mid \mathbf{z}, \mathbf{r}, \Theta, \Upsilon) = \Bigl( \prod_{i=1}^{N} \prod_{j=1}^{K} \pi_j^{z_{ij}} \Bigr) \Bigl( \prod_{i=1}^{N} \prod_{j=1}^{K} \Bigl[ \prod_{k=1}^{D} \eta_k^{r_{ijk}} (1 - \eta_k)^{1 - r_{ijk}} \Bigr]^{z_{ij}} \Bigr) \Bigl( \prod_{i=1}^{N} \prod_{j=1}^{K} \Bigl[ \prod_{k=1}^{D} p(\mathbf{y}^{ik}, \mathbf{x}^{ik} \mid \Theta_{jk})^{r_{ijk}}\, p(\mathbf{y}^{ik}, \mathbf{x}^{ik} \mid \Upsilon_k)^{1 - r_{ijk}} \Bigr]^{z_{ij}} \Bigr) \qquad (15)$$

Given the overall joint distribution in (15), the proposed distribution p(y^{ik}, x^{ik} | η_k, Θ_jk, Υ_k) in our method is given in (A.4). In order to estimate the parameters Ψ = {π, η, Θ, Υ} = {π_j, η_k, u_jk, S_jk, A_jk, Q_jk, C_jk, R_jk, o_k, F_k, E_k, H_k, B_k, G_k}, j = (1, 2, ..., K), k = (1, 2, ..., D), we need to maximize the data likelihood p(y | π, η, Θ, Υ). The maximization of p(y | π, η, Θ, Υ) with respect to the parameters is discussed in the next section.
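As a hedged illustration of the generative process in Fig. 1, the sketch below draws a label for one observation from π, a per-feature relevance indicator from Bernoulli(η_k), and then generates each feature sequence from either the label-specific ("useful") LDS or the shared ("noisy") LDS. It reuses the `sample_lds` helper sketched after (5); the parameter dictionaries are placeholders, not values from the paper.

```python
import numpy as np

def sample_fsdtm_observation(pi, eta, useful, noisy, tau, rng=None):
    """Generate one observation y^i = {y^{i1}, ..., y^{iD}} under the FSDTM model.
       useful[j][k] and noisy[k] are dicts with keys A, C, Q, R, u, S giving the
       label-specific and shared LDS parameters of feature k (hypothetical inputs).
       Requires the sample_lds() helper defined earlier in this paper's sketch."""
    rng = np.random.default_rng(rng)
    K, D = len(pi), len(eta)
    j = rng.choice(K, p=pi)                      # hidden label z_i (one-of-K)
    y_i, r_i = [], []
    for k in range(D):
        r = rng.random() < eta[k]                # relevance indicator r_ik ~ Bernoulli(eta_k)
        p = useful[j][k] if r else noisy[k]      # useful vs. noisy subcomponent
        _, y = sample_lds(p["A"], p["C"], p["Q"], p["R"], p["u"], p["S"], tau, rng)
        y_i.append(y)
        r_i.append(bool(r))
    return j, r_i, y_i
```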

IV. PARAMETER ESTIMATION

Now consider the problem of maximizing the likelihood of the complete dataset {y, x, z, r}. From (A.2), taking the logarithm, the log-likelihood function takes the form

$$\log p(\mathbf{y}, \mathbf{x}, \mathbf{z}, \mathbf{r} \mid \Psi) = \sum_{i=1}^{N} \sum_{j=1}^{K} z_{ij} \Bigl\{ \log \pi_j + \sum_{k=1}^{D} r_{ijk} \bigl[ \log \eta_k + \log p(\mathbf{y}^{ik}, \mathbf{x}^{ik} \mid \Theta_{jk}) \bigr] + \sum_{k=1}^{D} (1 - r_{ijk}) \bigl[ \log(1 - \eta_k) + \log p(\mathbf{y}^{ik}, \mathbf{x}^{ik} \mid \Upsilon_k) \bigr] \Bigr\} \qquad (16)$$

Then, maximizing the function in (15) is equivalent to maximizing the log-likelihood log p(y, x, z, r | Ψ) in (16). To maximize log p(y, x, z, r | Ψ), the EM algorithm [10], [21], [22] is adopted. Each iteration of the EM algorithm consists of two steps. In the E-step, we estimate the missing information using the current parameters; the M-step then computes the new parameters given the estimate of the missing information:

$$\text{E-step:}\quad \mathcal{Q}(\Psi, \Psi^{\text{old}}) = E_{\mathbf{x},\mathbf{z},\mathbf{r} \mid \mathbf{y}, \Psi^{\text{old}}}\bigl[ \log p(\mathbf{y}, \mathbf{x}, \mathbf{z}, \mathbf{r} \mid \Psi) \bigr], \qquad \text{M-step:}\quad \Psi^{\text{new}} = \arg\max_{\Psi} \mathcal{Q}(\Psi, \Psi^{\text{old}}) \qquad (17)$$

The expectation E_{x,z,r|y} factorizes as a set of nested conditional expectations E_{z|y}[E_{r|z,y}[E_{x|y,z,r}[·]]]. Applying this to (16) and following (36)–(39) of [16], we obtain the objective function Q(Ψ, Ψ^old) in (18), where ẑ_ij = p(z_ij = 1 | y) and r̂_ijk = p(r_ijk = 1 | z_i = j, y):

$$\mathcal{Q}(\Psi, \Psi^{\text{old}}) = \sum_{i=1}^{N} \sum_{j=1}^{K} \hat{z}_{ij} \Bigl\{ \log \pi_j + \sum_{k=1}^{D} \hat{r}_{ijk} \Bigl[ \log \eta_k + E_{\mathbf{x}|\mathbf{y},z=j,r=1}\bigl[\log p(\mathbf{x}_1^{ik} \mid \Theta_{jk})\bigr] + \sum_{t=2}^{\tau} E_{\mathbf{x}|\mathbf{y},z=j,r=1}\bigl[\log p(\mathbf{x}_t^{ik} \mid \mathbf{x}_{t-1}^{ik}, \Theta_{jk})\bigr] + \sum_{t=1}^{\tau} E_{\mathbf{x}|\mathbf{y},z=j,r=1}\bigl[\log p(\mathbf{y}_t^{ik} \mid \mathbf{x}_t^{ik}, \Theta_{jk})\bigr] \Bigr] + \sum_{k=1}^{D} (1 - \hat{r}_{ijk}) \Bigl[ \log(1 - \eta_k) + E_{\mathbf{x}|\mathbf{y},z=j,r=0}\bigl[\log p(\mathbf{x}_1^{ik} \mid \Upsilon_k)\bigr] + \sum_{t=2}^{\tau} E_{\mathbf{x}|\mathbf{y},z=j,r=0}\bigl[\log p(\mathbf{x}_t^{ik} \mid \mathbf{x}_{t-1}^{ik}, \Upsilon_k)\bigr] + \sum_{t=1}^{\tau} E_{\mathbf{x}|\mathbf{y},z=j,r=0}\bigl[\log p(\mathbf{y}_t^{ik} \mid \mathbf{x}_t^{ik}, \Upsilon_k)\bigr] \Bigr] \Bigr\} \qquad (18)$$

where E_{x|y,z=j,r=1}[·] denotes the conditional expectation over x given y and z = j under the useful subcomponent, and E_{x|y,z=j,r=0}[·] denotes the corresponding expectation under the noisy subcomponent. In (18), the value of the variable ẑ_ij is obtained using Bayes' theorem, which takes the form (B.1):

$$\hat{z}_{ij} = \frac{\pi_j \prod_{k=1}^{D} \bigl[ \eta_k\, p(\mathbf{y}^{ik} \mid \Theta_{jk}) + (1 - \eta_k)\, p(\mathbf{y}^{ik} \mid \Upsilon_k) \bigr]}{\sum_{l=1}^{K} \pi_l \prod_{k=1}^{D} \bigl[ \eta_k\, p(\mathbf{y}^{ik} \mid \Theta_{lk}) + (1 - \eta_k)\, p(\mathbf{y}^{ik} \mid \Upsilon_k) \bigr]} \qquad (19)$$

And the variable r̂_ijk takes the form (B.2):

$$\hat{r}_{ijk} = \frac{\eta_k\, p(\mathbf{y}^{ik} \mid \Theta_{jk})}{\eta_k\, p(\mathbf{y}^{ik} \mid \Theta_{jk}) + (1 - \eta_k)\, p(\mathbf{y}^{ik} \mid \Upsilon_k)} \qquad (20)$$
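A minimal sketch of the E-step in (19)–(20), assuming the per-feature marginal log-likelihoods log p(y^{ik} | Θ_jk) and log p(y^{ik} | Υ_k) have already been computed (for example by a Kalman filter). The array layout and the log-sum-exp handling are our own choices, not the authors' implementation.

```python
import numpy as np
from scipy.special import logsumexp

def e_step(log_pi, log_eta, ll_useful, ll_noisy):
    """Responsibilities of (19)-(20).
       ll_useful: (N, K, D) array of log p(y^{ik} | Theta_jk)
       ll_noisy : (N, D)    array of log p(y^{ik} | Upsilon_k)
       Returns z_hat with shape (N, K) and r_hat with shape (N, K, D)."""
    log1m_eta = np.log1p(-np.exp(log_eta))                # log(1 - eta_k)
    a = log_eta[None, None, :] + ll_useful                # log(eta_k * p(y | Theta_jk))
    b = log1m_eta[None, None, :] + ll_noisy[:, None, :]   # log((1 - eta_k) * p(y | Upsilon_k))
    log_mix = np.logaddexp(a, b)                          # log of the bracket in (19)
    log_num = log_pi[None, :] + log_mix.sum(axis=2)       # log( pi_j * prod_k [...] )
    z_hat = np.exp(log_num - logsumexp(log_num, axis=1, keepdims=True))
    r_hat = np.exp(a - log_mix)                           # (20), evaluated per (i, j, k)
    return z_hat, r_hat
```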

From (12) and (14), the objective function Q(Ψ, Ψ^old) in (18) is given in (C.1). Following [21] and [22], we define the smoothed conditional expectations

$$\hat{\mathbf{x}}_{t|j}^{ik} = E_{\mathbf{x}|\mathbf{y},z=j,r=1}[\mathbf{x}_t^{ik}], \quad \hat{V}_{t,t|j}^{ik} = E_{\mathbf{x}|\mathbf{y},z=j,r=1}\bigl[(\mathbf{x}_t^{ik} - \hat{\mathbf{x}}_{t|j}^{ik})(\mathbf{x}_t^{ik} - \hat{\mathbf{x}}_{t|j}^{ik})^T\bigr], \quad \hat{V}_{t,t-1|j}^{ik} = E_{\mathbf{x}|\mathbf{y},z=j,r=1}\bigl[(\mathbf{x}_t^{ik} - \hat{\mathbf{x}}_{t|j}^{ik})(\mathbf{x}_{t-1}^{ik} - \hat{\mathbf{x}}_{t-1|j}^{ik})^T\bigr],$$
$$\hat{\mathbf{s}}_t^{ik} = E_{\mathbf{x}|\mathbf{y},z=j,r=0}[\mathbf{x}_t^{ik}], \quad \hat{U}_{t,t}^{ik} = E_{\mathbf{x}|\mathbf{y},z=j,r=0}\bigl[(\mathbf{x}_t^{ik} - \hat{\mathbf{s}}_t^{ik})(\mathbf{x}_t^{ik} - \hat{\mathbf{s}}_t^{ik})^T\bigr], \quad \hat{U}_{t,t-1}^{ik} = E_{\mathbf{x}|\mathbf{y},z=j,r=0}\bigl[(\mathbf{x}_t^{ik} - \hat{\mathbf{s}}_t^{ik})(\mathbf{x}_{t-1}^{ik} - \hat{\mathbf{s}}_{t-1}^{ik})^T\bigr],$$

where x̂_{t|j}^{ik}, V̂_{t,t|j}^{ik}, V̂_{t,t-1|j}^{ik}, ŝ_t^{ik}, Û_{t,t}^{ik}, and Û_{t,t-1}^{ik} are estimated with the Kalman smoothing filter [27]. The required second moments then follow as

$$E_{\mathbf{x}|\mathbf{y},z=j,r=1}[\mathbf{x}_t^{ik}(\mathbf{x}_t^{ik})^T] = \hat{P}_{t,t|j}^{ik} = \hat{V}_{t,t|j}^{ik} + \hat{\mathbf{x}}_{t|j}^{ik}(\hat{\mathbf{x}}_{t|j}^{ik})^T, \qquad E_{\mathbf{x}|\mathbf{y},z=j,r=1}[\mathbf{x}_t^{ik}(\mathbf{x}_{t-1}^{ik})^T] = \hat{P}_{t,t-1|j}^{ik} = \hat{V}_{t,t-1|j}^{ik} + \hat{\mathbf{x}}_{t|j}^{ik}(\hat{\mathbf{x}}_{t-1|j}^{ik})^T,$$
$$E_{\mathbf{x}|\mathbf{y},z=j,r=0}[\mathbf{x}_t^{ik}(\mathbf{x}_t^{ik})^T] = \hat{L}_{t,t}^{ik} = \hat{U}_{t,t}^{ik} + \hat{\mathbf{s}}_t^{ik}(\hat{\mathbf{s}}_t^{ik})^T, \qquad E_{\mathbf{x}|\mathbf{y},z=j,r=0}[\mathbf{x}_t^{ik}(\mathbf{x}_{t-1}^{ik})^T] = \hat{L}_{t,t-1}^{ik} = \hat{U}_{t,t-1}^{ik} + \hat{\mathbf{s}}_t^{ik}(\hat{\mathbf{s}}_{t-1}^{ik})^T \qquad (21)$$

Following [16] and [20], in order to optimize the parameters Ψ = {π_j, η_k, u_jk, S_jk, A_jk, Q_jk, C_jk, R_jk, o_k, F_k, E_k, H_k, B_k, G_k}, we first define the intermediate quantities

$$\hat{\varepsilon}_{ijk} = \hat{z}_{ij}\,\hat{r}_{ijk}, \qquad \hat{\kappa}_{ijk} = \hat{z}_{ij}\,(1 - \hat{r}_{ijk}) \qquad (22)$$

$$\hat{\mathbf{y}}^{ik} = \sum_{t=1}^{\tau} \mathbf{y}_t^{ik}(\mathbf{y}_t^{ik})^T, \quad \hat{e}_{ijk} = \sum_{t=1}^{\tau} \mathbf{y}_t^{ik}(\hat{\mathbf{x}}_{t|j}^{ik})^T, \quad \hat{b}^{ik} = \sum_{t=1}^{\tau} \mathbf{y}_t^{ik}(\hat{\mathbf{s}}_t^{ik})^T,$$
$$P_{ijk} = \sum_{t=2}^{\tau} \hat{P}_{t-1,t-1|j}^{ik}, \quad \bar{P}_{ijk} = \sum_{t=2}^{\tau} \hat{P}_{t,t-1|j}^{ik}, \quad \tilde{P}_{ijk} = \sum_{t=1}^{\tau} \hat{P}_{t,t|j}^{ik},$$
$$L^{ik} = \sum_{t=2}^{\tau} \hat{L}_{t-1,t-1}^{ik}, \quad \bar{L}^{ik} = \sum_{t=2}^{\tau} \hat{L}_{t,t-1}^{ik}, \quad \tilde{L}^{ik} = \sum_{t=1}^{\tau} \hat{L}_{t,t}^{ik} \qquad (23)$$

From (21)–(23), the objective function Q(Ψ, Ψ^old) in (C.1) can be rewritten as (C.2) in Appendix C.
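The smoothed means and covariances above are the standard outputs of a Kalman filter followed by a Rauch–Tung–Striebel (RTS) backward pass [27]. The sketch below is a generic textbook implementation for the LDS parameterization of (11); it is not the authors' code, and numerical refinements (for example Joseph-form covariance updates) are omitted.

```python
import numpy as np

def rts_smoother(y, A, C, Q, R, u, S):
    """Kalman filter + RTS smoother for an LDS given observations y of shape (tau, m).
       Returns smoothed means x_hat (tau, n) and covariances V (tau, n, n)."""
    tau, n = y.shape[0], A.shape[0]
    xp = np.zeros((tau, n)); Pp = np.zeros((tau, n, n))   # predicted moments
    xf = np.zeros((tau, n)); Pf = np.zeros((tau, n, n))   # filtered moments
    for t in range(tau):
        xp[t] = u if t == 0 else A @ xf[t - 1]
        Pp[t] = S if t == 0 else A @ Pf[t - 1] @ A.T + Q
        K = Pp[t] @ C.T @ np.linalg.inv(C @ Pp[t] @ C.T + R)   # Kalman gain
        xf[t] = xp[t] + K @ (y[t] - C @ xp[t])
        Pf[t] = (np.eye(n) - K @ C) @ Pp[t]
    x_hat = xf.copy(); V = Pf.copy()
    for t in range(tau - 2, -1, -1):                       # backward RTS pass
        J = Pf[t] @ A.T @ np.linalg.inv(Pp[t + 1])
        x_hat[t] = xf[t] + J @ (x_hat[t + 1] - xp[t + 1])
        V[t] = Pf[t] + J @ (V[t + 1] - Pp[t + 1]) @ J.T
    return x_hat, V
```

The lag-one covariances V̂_{t,t-1|j}^{ik} needed in (21) can be obtained from the same backward pass (see [27]); they are omitted from this sketch for brevity.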

Setting ∂Q(Ψ, Ψ^old)/∂Ψ = 0 and solving, after some manipulation, yields the estimates of Ψ at the (t + 1) step as follows:

$$u_{jk}^{*} = \frac{\sum_{i=1}^{N} \hat{\varepsilon}_{ijk}\, \hat{\mathbf{x}}_{1|j}^{ik}}{\sum_{i=1}^{N} \hat{\varepsilon}_{ijk}} \qquad (24)$$

$$S_{jk}^{*} = \frac{\sum_{i=1}^{N} \hat{\varepsilon}_{ijk}\, \hat{P}_{1,1|j}^{ik}}{\sum_{i=1}^{N} \hat{\varepsilon}_{ijk}} - u_{jk}^{*}(u_{jk}^{*})^T \qquad (25)$$

$$A_{jk}^{*} = \Bigl( \sum_{i=1}^{N} \hat{\varepsilon}_{ijk}\, \bar{P}_{ijk} \Bigr) \Bigl( \sum_{i=1}^{N} \hat{\varepsilon}_{ijk}\, P_{ijk} \Bigr)^{-1} \qquad (26)$$

$$Q_{jk}^{*} = \frac{\sum_{i=1}^{N} \hat{\varepsilon}_{ijk} \bigl( P_{ijk} - A_{jk}^{*} \bar{P}_{ijk}^{\,T} \bigr)}{(\tau - 1) \sum_{i=1}^{N} \hat{\varepsilon}_{ijk}} \qquad (27)$$

$$C_{jk}^{*} = \Bigl( \sum_{i=1}^{N} \hat{\varepsilon}_{ijk}\, \hat{e}_{ijk} \Bigr) \Bigl( \sum_{i=1}^{N} \hat{\varepsilon}_{ijk}\, \tilde{P}_{ijk} \Bigr)^{-1} \qquad (28)$$

$$R_{jk}^{*} = \frac{\sum_{i=1}^{N} \hat{\varepsilon}_{ijk} \bigl( \hat{\mathbf{y}}^{ik} - C_{jk}^{*}\, \hat{e}_{ijk}^{\,T} \bigr)}{\tau \sum_{i=1}^{N} \hat{\varepsilon}_{ijk}} \qquad (29)$$

And, similarly, the estimates of o_k, F_k, E_k, H_k, B_k, and G_k at the (t + 1) step are given by

$$o_k^{*} = \frac{\sum_{i=1}^{N} \sum_{j=1}^{K} \hat{\kappa}_{ijk}\, \hat{\mathbf{s}}_1^{ik}}{\sum_{i=1}^{N} \sum_{j=1}^{K} \hat{\kappa}_{ijk}} \qquad (30)$$

$$F_k^{*} = \frac{\sum_{i=1}^{N} \sum_{j=1}^{K} \hat{\kappa}_{ijk}\, \hat{L}_{1,1}^{ik}}{\sum_{i=1}^{N} \sum_{j=1}^{K} \hat{\kappa}_{ijk}} - o_k^{*}(o_k^{*})^T \qquad (31)$$

$$E_k^{*} = \Bigl( \sum_{i=1}^{N} \sum_{j=1}^{K} \hat{\kappa}_{ijk}\, \bar{L}^{ik} \Bigr) \Bigl( \sum_{i=1}^{N} \sum_{j=1}^{K} \hat{\kappa}_{ijk}\, L^{ik} \Bigr)^{-1} \qquad (32)$$

$$H_k^{*} = \frac{\sum_{i=1}^{N} \sum_{j=1}^{K} \hat{\kappa}_{ijk} \bigl( L^{ik} - E_k^{*}\, \bar{L}^{ik\,T} \bigr)}{(\tau - 1) \sum_{i=1}^{N} \sum_{j=1}^{K} \hat{\kappa}_{ijk}} \qquad (33)$$

$$B_k^{*} = \Bigl( \sum_{i=1}^{N} \sum_{j=1}^{K} \hat{\kappa}_{ijk}\, \hat{b}^{ik} \Bigr) \Bigl( \sum_{i=1}^{N} \sum_{j=1}^{K} \hat{\kappa}_{ijk}\, \tilde{L}^{ik} \Bigr)^{-1} \qquad (34)$$

$$G_k^{*} = \frac{\sum_{i=1}^{N} \sum_{j=1}^{K} \hat{\kappa}_{ijk} \bigl( \hat{\mathbf{y}}^{ik} - B_k^{*}\, \hat{b}^{ik\,T} \bigr)}{\tau \sum_{i=1}^{N} \sum_{j=1}^{K} \hat{\kappa}_{ijk}} \qquad (35)$$
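For the useful subcomponent, (24)–(29) are weighted versions of the standard LDS M-step, with each observation's sufficient statistics weighted by ε̂_ijk. The sketch below shows the matrix arithmetic for one (j, k) pair, assuming the aggregated statistics of (23) have already been computed per observation; the variable names follow the text, but the code itself is our own illustration rather than the authors' implementation.

```python
import numpy as np

def m_step_useful(eps, x1, P11, P, Pbar, Ptilde, e, yy, tau):
    """Weighted M-step updates (24)-(29) for one label j and one feature k.
       eps    : (N,)      weights eps_hat_ijk = z_hat_ij * r_hat_ijk
       x1     : (N, n)    smoothed initial states x_hat_{1|j}^{ik}
       P11    : (N, n, n) second moments P_hat_{1,1|j}^{ik}
       P, Pbar, Ptilde : (N, n, n) per-observation aggregates from (23)
       e      : (N, m, n) aggregates e_hat_ijk;  yy : (N, m, m) aggregates y_hat^{ik}"""
    w = eps.sum()
    u = (eps[:, None] * x1).sum(0) / w                                   # (24)
    S = (eps[:, None, None] * P11).sum(0) / w - np.outer(u, u)           # (25)
    A = (eps[:, None, None] * Pbar).sum(0) @ np.linalg.inv(
        (eps[:, None, None] * P).sum(0))                                 # (26)
    APbT = np.einsum('ab,icb->iac', A, Pbar)                             # A @ Pbar_i^T per i
    Q = (eps[:, None, None] * (P - APbT)).sum(0) / ((tau - 1) * w)       # (27)
    C = (eps[:, None, None] * e).sum(0) @ np.linalg.inv(
        (eps[:, None, None] * Ptilde).sum(0))                            # (28)
    CeT = np.einsum('ab,icb->iac', C, e)                                 # C @ e_i^T per i
    R = (eps[:, None, None] * (yy - CeT)).sum(0) / (tau * w)             # (29)
    return u, S, A, Q, C, R
```

The noisy-subcomponent updates (30)–(35) have exactly the same structure, with κ̂_ijk as the weight and the sums taken over both i and j.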

The next step is to update the estimate of the prior probability π_j. Note that π_j should satisfy the constraints in (5); the constraint Σ_{j=1}^{K} π_j = 1 yields

$$\pi_j = \frac{1}{N} \sum_{i=1}^{N} \hat{z}_{ij} \qquad (36)$$

Similarly, the estimate of η_k is given by

$$\eta_k = \frac{1}{N} \sum_{i=1}^{N} \sum_{j=1}^{K} \hat{z}_{ij}\, \hat{r}_{ijk} \qquad (37)$$

So far, the discussion has focused on estimating Ψ of the proposed model in order to assign a label Ω_j to each observation y_i. The various steps of our model can be summarized as follows.

Step 1: Initialize the parameters Ψ.
Step 2 (E-step): Evaluate the values ẑ_ij in (19) and r̂_ijk in (20) using the current parameter values.
Step 3 (M-step): Re-estimate the parameters using (24)–(37).
Step 4: Evaluate the log-likelihood in (16) and check the convergence of either the log-likelihood function or the parameter values. If the convergence criterion is not satisfied, return to Step 2.

Once the parameter learning phase is complete, we obtain the value of ẑ_ij for all labels. Each observation is then assigned to the label with the largest ẑ_ij:

$$\mathbf{y}_i \in \Omega_j \quad \text{if} \quad \hat{z}_{ij} \geq \hat{z}_{ic}, \qquad c = (1, 2, ..., K) \qquad (38)$$

In the next section, we demonstrate the robustness, accuracy, and effectiveness of the proposed model compared with other approaches.
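The four steps above can be organized as the following high-level loop. The helper names (`init_params`, `e_step`, `m_step`, `log_likelihood`) are placeholders for routines sketched elsewhere in this paper or left to the implementer, so this is a structural outline rather than the authors' code.

```python
import numpy as np

def fit_fsdtm(patches, K, D, max_iter=100, tol=1e-4):
    """Skeleton of Steps 1-4 (EM for the FSDTM) followed by label assignment (38)."""
    params = init_params(patches, K, D)          # Step 1: initialization (see Section V)
    prev_ll = -np.inf
    for _ in range(max_iter):
        z_hat, r_hat = e_step(patches, params)   # Step 2: responsibilities (19)-(20)
        params = m_step(patches, z_hat, r_hat)   # Step 3: parameter updates (24)-(37)
        ll = log_likelihood(patches, params)     # Step 4: convergence check on (16)
        if abs(ll - prev_ll) < tol * abs(prev_ll):
            break
        prev_ll = ll
    labels = np.argmax(z_hat, axis=1)            # (38): assign each patch to arg max_j z_hat_ij
    return params, labels
```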

V. EXPERIMENTS

In this section, the performance of the proposed method (FSDTM) is compared with GPCA [14], the DTM in [16], [17], and DFC [20]. For the GPCA method, we used the software implementation developed by the authors of [14], publicly available at http://www.vision.jhu.edu/code. Unless otherwise specified, in the DFC method we ignore the spatial relationships between neighboring observations and set π_ij = π_mj, ∀ i, m = (1, 2, ..., N), and λ = 1. All methods are initialized with the same initial condition, and the GPCA, DTM, DFC, and FSDTM methods are implemented in Matlab.

For synthetic videos, each clip contains 60 frames and the video patch size is 3 × 3 × 60 (m = 9, τ = 60). For natural real-world videos, we compare the performance of these algorithms on a real dataset from DynTex [29]; each real clip contains 100 frames (frames 1 to 200 in steps of 2), and the video patch used for natural real-video segmentation is 3 × 3 × 100 (m = 9, τ = 100). The dimension of the state space of DTM, DFC, and FSDTM is n = 9. All experiments in this paper are performed on grayscale videos.

Unless otherwise specified, for GPCA, DFC, and FSDTM, four features are used (D = 4). The 1st feature is the original grayscale video. The 2nd, 3rd, and 4th features are created by adjusting the intensity values in each frame of the original grayscale video with the imadjust, histeq, and adapthisteq functions from Matlab. The imadjust function maps the intensity values of a grayscale image to new values such that 1% of the data is saturated at the low and high intensities of the original image; the histeq function enhances contrast by transforming the intensity values so that the histogram of the output image approximately matches a specified histogram; and the adapthisteq function enhances contrast using contrast-limited adaptive histogram equalization. In this study, the imadjust, histeq, and adapthisteq functions were used with their default settings, and each feature is computed on every frame independently. All video patches are normalized to have zero mean and unit variance.

In this paper, the component-splitting strategy is used to initialize the DTM method, and the DFC methods are initialized by the DTM. For FSDTM, the initialization of the parameters {π_j, u_jk, S_jk, A_jk, Q_jk, C_jk, R_jk} is the same as that of DFC; the parameters {o_k, F_k, E_k, H_k, B_k, G_k} are initialized from a DTM with K = 1, and η_k is initialized to 0.5. All compared methods are run until convergence. The original videos and the segmentation results are provided in the supplementary material.

For synthetic videos, in order to compare the results obtained, the probabilistic Rand (PR) index [28] is used:

$$\mathrm{PR}(\Gamma, \Gamma_{\mathrm{eval}}) = \frac{2}{M(M-1)} \sum_{i} \sum_{j > i} \bigl[ c_{ij}\, p_{ij} + (1 - c_{ij})(1 - p_{ij}) \bigr] \qquad (39)$$

where Γ is the ground truth and Γ_eval is the segmentation map under evaluation. M is the number of patches in a video, p_ij is the ground-truth probability that patches i and j belong to the same segment, and c_ij = 1 if patches i and j belong to the same segment in Γ_eval and c_ij = 0 otherwise. The PR index takes a value in the interval [0, 1]: a score of zero indicates a bad segmentation in which every patch pair in the test video has the opposite relationship to the corresponding pair in the ground truth, whereas a score of one indicates that every patch pair in the test video has the same relationship as the corresponding pair in the ground truth.
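With a hard ground truth, p_ij in (39) is 1 when patches i and j share a ground-truth segment and 0 otherwise, so the PR index reduces to the fraction of patch pairs on which the two segmentations agree. The following sketch implements that special case; it is our reading of (39), not the authors' evaluation code.

```python
import numpy as np
from itertools import combinations

def probabilistic_rand(gt_labels, eval_labels):
    """PR index of (39) for hard label maps gt_labels, eval_labels of length M."""
    gt = np.asarray(gt_labels)
    ev = np.asarray(eval_labels)
    M = gt.size
    agree = 0.0
    for i, j in combinations(range(M), 2):
        p = float(gt[i] == gt[j])      # ground-truth co-membership p_ij
        c = float(ev[i] == ev[j])      # co-membership in the evaluated map c_ij
        agree += c * p + (1.0 - c) * (1.0 - p)
    return 2.0 * agree / (M * (M - 1))

print(probabilistic_rand([0, 0, 1, 1], [0, 0, 1, 0]))  # 0.5
```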

In the first experiment, we show a very simple baseline: the model (with spatial constraints) in [13] is compared with the DTM. The original video is shown in Fig. 3(a), and Fig. 3(b) and (c) present the segmentation results obtained by the method in [13] and by DTM, respectively. As we can see, the accuracy of the method in [13] is poor compared with that of DTM. This is reasonable, because the method in [13] is designed for segmenting static images only. This experiment demonstrates the importance of a model with dynamics for motion segmentation.

In the second experiment, six synthetic videos, shown in the 1st column of Fig. 2, are used to compare the performance of the proposed algorithm with the others. The original videos were downloaded from http://visal.cs.cityu.edu.hk/research/layerdytex. These dynamic textures result from processes such as grass, water, and bubbles, and the objective is to segment the video clips into different numbers of labels (K = 2, 3, 4). From the 2nd to the 5th column of Fig. 2, we present the segmentation results obtained by GPCA, DTM, DFC, and FSDTM, respectively. It can be seen that DTM, which uses only one feature (the grayscale video), performs better than GPCA, which uses four features; however, the accuracy of DTM is poor compared with that of DFC in the 4th column. DFC reduces the segmentation error significantly and can segment the motion well. As shown in the 5th column, the proposed method segments the dynamic textures better than the other methods.

To provide more detail about the comparison in terms of implementation, in the experiment of Fig. 4 we used one synthetic video (ID = 2) with two labels. The original video is shown in Fig. 4(a), and Fig. 4(b)–(d) show the 2nd, 3rd, and 4th features, created by adjusting the intensity values in each frame of the original grayscale video with the imadjust, histeq, and adapthisteq functions. The segmentation results obtained by DTM for each feature are shown in Fig. 4(f)–(i); these results vary in PR value (PR = 0.896, 0.893, 0.718, and 0.903). The DFC result in Fig. 4(e) improves the segmentation only slightly. As can easily be seen in this experiment, the proposed method performs very well in segmenting this video and obtains the highest PR value.

In the next experiment, a dataset containing 299 grayscale videos (99 videos with 2 labels, 100 videos with 3 labels, and 100 videos with 4 labels) is used. Fig. 5 shows the original videos and the FSDTM results for five videos of this dataset. In Table II, we report the scores over this dataset for GPCA, DTM, DTM with all features (both variants: concatenating all features, and learning a separate DTM for each feature), LDT [19], DFC (without spatial constraints), and DFC (with spatial constraints), together with the proposed method for various feature combinations. Table II includes both FSDTM (without spatial constraints) and FSDTM (with spatial constraints, as described in Appendix D) for various feature combinations. For FSDTM (with spatial constraints), we include the spatial constraints in a manner similar to the DFC (with spatial constraints) method in [20]; the difference is that, instead of treating the features with equal weights as DFC does, the proposed method uses feature selection (features with different weights). As shown, the proposed method obtains very good PR values.

In order to show the values of the feature selection variables η_k, the synthetic video of Fig. 6 is used. In this experiment, four features (original video, imadjust, histeq, adapthisteq) are used, as shown in Fig. 6(a)–(d), and the goal is to segment the video into three labels. The performance of the proposed method is shown in Fig. 6(e): the values of the feature selection variables for features #1, #2, #3, and #4 are 0.249, 0.921, 0.920, and 0.768, respectively. The higher the value of the feature selection variable, the more important the feature; based on this, we can say that feature #1 is not an important feature in this experiment. To clarify this point, Fig. 6(f) shows the performance of the proposed method when we use feature #1 only (D = 1), and in Fig. 6(g) we replace the feature set by four copies of feature #1 (D = 4). As shown, they obtain the same result with the same PR value, and the segmentation is poor when only feature #1 is used. Fig. 6(h) and (i) use the same conditions as Fig. 6(f) and (g), except that feature #4 is used instead of feature #1. Compared with Fig. 6(f) and (g), the results in Fig. 6(h) and (i) are improved, with PR = 0.931. This is reasonable, because the value of η_k for feature #4 is 0.768, which is much higher than that of feature #1.
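As a hedged illustration of how the feature channels and patches could be produced outside Matlab, the sketch below uses scikit-image approximations of the functions named above (rescale_intensity with 1%/99% percentiles for imadjust, equalize_hist for histeq, equalize_adapthist for adapthisteq). These are stand-ins for the Matlab functions used in the paper, not the authors' pipeline.

```python
import numpy as np
from skimage import exposure

def build_feature_channels(video):
    """video: (T, H, W) grayscale array in [0, 1]. Returns a list of D = 4 videos:
       original, imadjust-like, histeq-like, adapthisteq-like (frame by frame)."""
    adj, heq, aheq = [], [], []
    for frame in video:
        p1, p99 = np.percentile(frame, (1, 99))
        adj.append(exposure.rescale_intensity(frame, in_range=(p1, p99)))  # ~imadjust
        heq.append(exposure.equalize_hist(frame))                          # ~histeq
        aheq.append(exposure.equalize_adapthist(frame))                    # ~adapthisteq (CLAHE)
    return [video, np.stack(adj), np.stack(heq), np.stack(aheq)]

def extract_patches(video, size=3):
    """Cut a (T, H, W) video into non-overlapping size x size x T patches, each
       flattened to shape (T, size*size) and normalized to zero mean, unit variance."""
    T, H, W = video.shape
    patches = []
    for r in range(0, H - size + 1, size):
        for c in range(0, W - size + 1, size):
            p = video[:, r:r + size, c:c + size].reshape(T, -1).astype(float)
            patches.append((p - p.mean()) / (p.std() + 1e-8))
    return patches
```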


Fig. 2. The first synthetic video segmentation, (1st column): Original video, (2nd column): GPCA, (3rd column): DTM, (4th column): DFC, (5th column): FSDTM.

Fig. 3. The first synthetic video segmentation, (a): Original video, (b): method in [13], (c): DTM.

Next, in Fig. 6(j), we show the performance of the proposed method when we use a combination of two copies of the original video (feature #1) and two copies of feature #3 (histeq). As shown, the result is much improved, with PR = 0.964; however, compared with Fig. 6(e), where all features are used, the result in Fig. 6(j) is not better.

To show the performance of the proposed method under different initial conditions, we show one experiment in Fig. 7.

In this experiment, the original video and the FSDTM result from Fig. 6(g) are shown again in Fig. 7(a) and (b). In Fig. 7(c), we show the result of FSDTM when we use initial seeding for the 1st and 2nd features and component splitting [16] for the 3rd and 4th features. As we can see, the results in Fig. 7(b) and (c) are different. This is reasonable: the proposed method, which is based on the EM algorithm, performs only local optimization and therefore depends on the initial starting point.

To give an intuitive explanation of why a particular feature (e.g., imadjust) can be better than the raw pixel values, we show one experiment in Fig. 8. Fig. 8(a)–(d) show the 24th frame of the video used in Fig. 6. As we can see, feature #1 is quite dark compared with the other features. In Fig. 8(e), for each feature, we show the mean square error between consecutive frames for label #1 (Ω_1: the area inside the black circle). For features #2 and #3 (imadjust and histeq), the change between frames is similar and significant.


Fig. 4. The third synthetic video segmentation with 2 labels, (a): feature #1 [original video, ID = 2], (b): feature #2 [imadjust], (c): feature #3 [histeq], (d): feature #4 [adapthisteq], (f)–(i): DTM for each feature, (e): DFC, (j): FSDTM.

Fig. 5. The third synthetic video segmentation, (1st row): Original video, (2nd row): FSDTM (without spatial constraints).

TABLE II: COMPARISON OF THE PROPOSED METHOD WITH OTHER METHODS (PR INDEX), FOR THE THIRD EXPERIMENT (m = 9, n = 9, D = 4)

This is reasonable, because the values of η_k for features #2 and #3 in Fig. 6 are 0.921 and 0.920, respectively. Compared with the other features, the change between frames for feature #1 is smaller.

In Fig. 9, other features (edge and texture) are used. In this experiment, one video with three features, as shown in Fig. 9(a)–(c), is used. The 1st feature is the original grayscale video; the 2nd and 3rd features are created from each frame of the original grayscale video with the Canny edge detector [30] and a texture filter. For the edge detector and the texture filter, we use the Matlab functions edge and rangefilt with their default settings. Fig. 9(e)–(g) present the segmentation results obtained by DTM for each feature. The DFC result in Fig. 9(d), which uses the three features with equal weights, improves the segmentation only slightly. Again, the accuracy of the proposed method, shown in Fig. 9(h), is higher than that of the other methods, with the highest PR = 0.963.

To show that feature selection improves the DTM, an experiment with four video clips, shown in the 1st column of Fig. 10, is used. In this experiment, four features (original video, imadjust, histeq, adapthisteq) are used. As mentioned in Section II, only a single feature is taken into consideration by the DTM method; in this experiment, we extend DTM to use multiple features. There are two ways to implement this: in the first, we concatenate all four feature vectors into one vector and learn a single DTM; in the second, we learn a separate DTM for each feature and combine the likelihoods when doing the patch assignment.


Fig. 6. The fourth synthetic video segmentation with 3 labels, (a): feature #1 [original video], (b): feature #2 [imadjust], (c): feature #3 [histeq], (d): feature #4 [adapthisteq], (e): FSDTM [ηk : 0.249, 0.921, 0.920, 0.768], (f): FSDTM [D = 1: original video], (g): FSDTM [D = 4: original video, original video, original video, original video], (h): FSDTM [D = 1: adapthisteq], (i): FSDTM [D = 4: adapthisteq, adapthisteq, adapthisteq, adapthisteq], (j): FSDTM [D = 4: original video, original video, histeq, histeq].

Fig. 7. Different initializations [FSDTM, D = 4: original video, original video, original video, original video], (a): original video, (b): initial seeding for all features [Fig. 6(g)], (c): initial seeding for 1st and 2nd features, component splitting for 3rd and 4th features.

The 2nd and 3rd columns of Fig. 10 show the segmentation results of DTM for these two cases, and the 4th column of Fig. 10 clearly indicates that our proposed method achieves better segmentation accuracy.

To give some implementation details about the parameter n, an experiment with two labels, shown in Fig. 11, is used. The proposed method is tested on a PC (Core i7, running at 2.78 GHz with 6 GB of RAM). In this experiment, we show the results obtained by the proposed method when the parameter n is varied over 3, 5, 7, and 9. Taking a closer look at Fig. 11, it can be seen that the segmentation result improves as n increases; however, the computational cost also increases. The performance (PR) of the proposed method when changing n over the 99 videos with two labels from Table II is shown in Fig. 12.

In the next experiment, one real-world video clip from the DynTex dataset, shown in Fig. 13, is used. The objective is to segment these dynamic textures into two labels: vapor and other. The boundaries between the two labels are difficult to detect in this experiment. As shown in the figure, the DTM and DFC methods perform better than the GPCA method; however, looking closely at the marked boxes, it can be seen that a small portion of pixels has been misclassified. The proposed method in Fig. 13(e) segments this real-world video better than the others.

Fig. 8. The sixth synthetic video segmentation: an intuitive explanation for the experiment in Fig. 6, (a): feature #1 [24th frame, original video], (b): feature #2 [24th frame, imadjust], (c): feature #3 [24th frame, histeq], (d): feature #4 [24th frame, adapthisteq], (e): The mean square error between two frames.

Fig. 9. The fifth synthetic video segmentation, (a): feature #1 [original video], (b): feature #2 [Canny edge], (c): feature #3 [texture: rangefilt], (e)–(g): DTM for each feature, (d): DFC, (h): FSDTM.

Another real-world grayscale video, shown in Fig. 14(a), is used. The objective is to segment this video into two labels: water moving horizontally and water moving vertically. As seen from the results, DTM and DFC perform much better than GPCA; however, on the bottom left-hand side of the results, looking at the marked box, a small region of pixels has been misclassified. In comparison, the accuracy of the final segmented video obtained by the FSDTM method is very high.

In Fig. 15, one real-world video, similar to the one in [20], is used to compare the performance of the proposed algorithm with the other methods. Note that color features (Lab space) are used in this experiment. The goal is to segment this video into two labels.


Fig. 10. The sixth synthetic video segmentation, (1st column): Original video, (2nd column): DTM [learn a separate model for each feature and combine the likelihoods], (3rd column): DTM [concatenate all four feature vectors into one big vector], (4th column): FSDTM.

Fig. 11. The seventh synthetic video segmentation, (1st column): original video, (2nd column): FSDTM [n = 3], (3rd column): FSDTM [n = 5], (4th column): FSDTM [n = 7], (5th column): FSDTM [n = 9].

Fig. 12. The performance of FSDTM (PR) when changing n over the 99 videos with two labels.

In this experiment (and in Fig. 16), for the case of DTM with a single feature, we follow the original DTM [16] and use the grayscale channel. We also include the spatial constraints in the FSDTM, in a manner similar to the DFC method in [20]; the difference is that, instead of treating the features with equal weights as DFC does, the proposed method uses feature selection (features with different weights). Fig. 15(b)–(e) present the segmentation results obtained by DTM, DTM [concatenated features], DFC (with spatial constraints), and FSDTM (with spatial constraints). From visual inspection of the results, for the DTM in Fig. 15(b), if we look closely at the marked yellow box, we can see a small portion of misclassified pixels; the performances of DFC and FSDTM are very similar.

In the next experiment, feature #1 (the L channel), as shown in Fig. 15(f), and feature #3 (the b channel) are corrupted with Gaussian noise.


Fig. 13. The first real video segmentation, (a): Original video, (b): GPCA, (c): DTM, (d): DFC, (e): FSDTM.

Fig. 14. The second real video segmentation, (a): Original video, (b): GPCA, (c): DTM, (d): DFC, (e): FSDTM.

Fig. 15. The third real video segmentation, (a): Original video, (b): DTM, (c): DTM [concatenated features], (d): DFC (with spatial constraints), (e): FSDTM (with spatial constraints), (f): Noisy Video (L channel), (g): DTM, (h): DTM [concatenated features], (i): DFC (with spatial constraints), (j): FSDTM (with spatial constraints) [ηk : 0.088, 0.667, 0.067].

Fig. 16. The fourth real video segmentation, (a): Original video, (b): Noisy Video (L channel), (c): DTM, (d): DTM [concatenated features], (e): DFC (with spatial constraints), (f): FSDTM (with spatial constraints) [ηk : 0.271, 0.694, 0.123].

The segmentation results obtained by the DTM, DFC, and FSDTM methods are shown in Fig. 15(g)–(j). In this case, the effect of noise on the final segmentation of DTM is very strong, and its segmentation accuracy is quite poor. In comparison, the effect of noise on the final segmentations obtained by DTM [concatenated features] and DFC is small. Looking at the marked red box, as is evident from the results, the proposed method outperforms the DTM and DFC methods. To provide a quantitative view, we report the values of the feature selection variables η_k for FSDTM: as can be seen in Fig. 15(j), the values of η_k for features #1, #2, and #3 are 0.088, 0.667, and 0.067, respectively. This is reasonable, because feature #1 and feature #3 are corrupted with Gaussian noise.

Another real-world video from [20], shown in Fig. 16(a), is used. The objective is to segment this video into three labels. As in the experiment of Fig. 15, feature #1 (the L channel), as shown in Fig. 16(b), and feature #3 (the b channel) are corrupted with Gaussian noise. Fig. 16 clearly indicates that the proposed method with different feature weights [η_k: 0.271, 0.694, 0.123] improves the segmentation result.

In the last experiment, we show some other real-world videos segmented by GPCA, DTM, DFC, and the proposed method, respectively.


Fig. 17. The third real video segmentation, (1st row): Original video, (2nd row): GPCA, (3rd row): DTM, (4th row): DFC, (5th row): FSDTM.

The first row of Fig. 17 shows the original videos, followed by the corresponding segmentation results in the 2nd, 3rd, 4th, and last rows. Fig. 17 clearly indicates that our proposed method achieves better segmentation accuracy.

VI. CONCLUSION

We have presented a new unsupervised feature selection dynamic mixture model for clustering time-varying characteristics and phenomena. The advantage of our method is that it is simple and intuitively appealing; it avoids any combinatorial search and does not require knowledge of any class labels. The EM algorithm is adopted to maximize the data log-likelihood and to optimize the parameters. The proposed method has been tested on various data types, and we have demonstrated through extensive simulations that it is superior to other methods for motion segmentation.

One limitation of the work presented in this paper is the remaining parameter K: the number of model components (labels) K is currently set by the user based on prior knowledge. One possible extension of this work is to adopt variational Bayesian (VB) learning, the Bayesian information criterion (BIC), or the minimum description length (MDL) principle to automatically optimize K. This is an open question and remains the subject of current research. Another possible extension of this work is to study dynamic textures when the camera also moves.
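As a rough illustration of the BIC-based extension mentioned above, one could fit the model for several candidate values of K and keep the value minimizing BIC. Here `fit_fsdtm` is the skeleton sketched in Section IV and `log_likelihood` and `count_free_parameters` are hypothetical helpers; this is a sketch of the idea, not part of the proposed method.

```python
import numpy as np

def select_K_by_bic(patches, candidate_K, D):
    """Pick the number of labels K by the Bayesian information criterion:
       BIC(K) = -2 log L + (number of free parameters) * log(N)."""
    best_K, best_bic = None, np.inf
    N = len(patches)
    for K in candidate_K:
        params, _ = fit_fsdtm(patches, K, D)          # hypothetical EM fit (Section IV sketch)
        ll = log_likelihood(patches, params)          # data log-likelihood under the fitted model
        bic = -2.0 * ll + count_free_parameters(params) * np.log(N)
        if bic < best_bic:
            best_K, best_bic = K, bic
    return best_K
```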

APPENDIX A
THE LIKELIHOOD FUNCTION p(y, x | π, η, Θ, Υ) AND THE DISTRIBUTION p(y^{ik}, x^{ik} | η_k, Θ_jk, Υ_k)

Motivated by [23] and [26], from the function p(r | η) in (7), p(y, x | z, r, Θ, Υ) in (8), and p(z | π) in (4), the overall joint distribution p(y, x, z, r | π, η, Θ, Υ) is given in (15). Moreover, because the hidden variables z are Boolean, the distribution p(r | η) in (7) can be rewritten [see (31) and (32) of [26] for details] as

$$p(\mathbf{r} \mid \eta) = \prod_{i=1}^{N} \prod_{j=1}^{K} \prod_{k=1}^{D} \bigl[ \eta_k^{r_{ijk}} (1 - \eta_k)^{1 - r_{ijk}} \bigr]^{z_{ij}} \qquad (A.1)$$

From (A.1), the joint distribution p(y, x, z, r | π, η, Θ, Υ) in (15) is rewritten as

$$p(\mathbf{y}, \mathbf{x}, \mathbf{z}, \mathbf{r} \mid \pi, \eta, \Theta, \Upsilon) = p(\mathbf{z} \mid \pi)\, p(\mathbf{r} \mid \eta)\, p(\mathbf{y}, \mathbf{x} \mid \mathbf{z}, \mathbf{r}, \Theta, \Upsilon) = \prod_{i=1}^{N} \prod_{j=1}^{K} \Bigl[ \pi_j \prod_{k=1}^{D} \bigl( \eta_k\, p(\mathbf{y}^{ik}, \mathbf{x}^{ik} \mid \Theta_{jk}) \bigr)^{r_{ijk}} \bigl( (1 - \eta_k)\, p(\mathbf{y}^{ik}, \mathbf{x}^{ik} \mid \Upsilon_k) \bigr)^{1 - r_{ijk}} \Bigr]^{z_{ij}} \qquad (A.2)$$

From (A.2), the likelihood p(y, x | π, η, Θ, Υ) is obtained by marginalizing the hidden variables z and r. Noting that z_ij ∈ {0, 1} and r_ijk ∈ {0, 1}, we have

$$p(\mathbf{y}, \mathbf{x} \mid \pi, \eta, \Theta, \Upsilon) = \prod_{i=1}^{N} \sum_{j=1}^{K} \pi_j \prod_{k=1}^{D} \bigl[ \eta_k\, p(\mathbf{y}^{ik}, \mathbf{x}^{ik} \mid \Theta_{jk}) + (1 - \eta_k)\, p(\mathbf{y}^{ik}, \mathbf{x}^{ik} \mid \Upsilon_k) \bigr] = \prod_{i=1}^{N} \sum_{j=1}^{K} \pi_j \prod_{k=1}^{D} p(\mathbf{y}^{ik}, \mathbf{x}^{ik} \mid \eta_k, \Theta_{jk}, \Upsilon_k) \qquad (A.3)$$

Equation (A.3) has a generative interpretation. From it, we can see that the proposed distribution p(y^{ik}, x^{ik} | η_k, Θ_jk, Υ_k) [when z_ij ∈ {0, 1} and r_ijk ∈ {0, 1}] in our method has the form

$$p(\mathbf{y}^{ik}, \mathbf{x}^{ik} \mid \eta_k, \Theta_{jk}, \Upsilon_k) = \eta_k\, p(\mathbf{y}^{ik}, \mathbf{x}^{ik} \mid \Theta_{jk}) + (1 - \eta_k)\, p(\mathbf{y}^{ik}, \mathbf{x}^{ik} \mid \Upsilon_k) \qquad (A.4)$$

APPENDIX B
THE VARIABLES ẑ_ij IN (19) AND r̂_ijk IN (20)

Following (40)–(41) of [16], we have ẑ_ij = p(z_ij = 1 | y) = π_j p(y_i | z_i = j) / Σ_{l=1}^{K} π_l p(y_i | z_i = l). Substituting (A.4) gives

$$\hat{z}_{ij} = \frac{\pi_j \prod_{k=1}^{D} \bigl[ \eta_k\, p(\mathbf{y}^{ik} \mid \Theta_{jk}) + (1 - \eta_k)\, p(\mathbf{y}^{ik} \mid \Upsilon_k) \bigr]}{\sum_{l=1}^{K} \pi_l \prod_{k=1}^{D} \bigl[ \eta_k\, p(\mathbf{y}^{ik} \mid \Theta_{lk}) + (1 - \eta_k)\, p(\mathbf{y}^{ik} \mid \Upsilon_k) \bigr]} \qquad (B.1)$$

Likewise, r̂_ijk = p(r_ijk = 1 | z_i = j, y) = p(y | r_ijk = 1) p(r_ijk = 1) / [ p(y | r_ijk = 1) p(r_ijk = 1) + p(y | r_ijk = 0) p(r_ijk = 0) ]. Substituting the useful and noisy subcomponents, the value of r̂_ijk is given by

$$\hat{r}_{ijk} = \frac{\eta_k\, p(\mathbf{y}^{ik} \mid \Theta_{jk})}{\eta_k\, p(\mathbf{y}^{ik} \mid \Theta_{jk}) + (1 - \eta_k)\, p(\mathbf{y}^{ik} \mid \Upsilon_k)} \qquad (B.2)$$

APPENDIX C
THE OBJECTIVE FUNCTION

From (12) and (14), the objective function Q(Ψ, Ψ^old) in (18) is given by (writing, for brevity, E_1[·] for E_{x|y,z=j,r=1}[·] and E_0[·] for E_{x|y,z=j,r=0}[·])

$$\mathcal{Q}(\Psi, \Psi^{\text{old}}) = \sum_{i=1}^{N} \sum_{j=1}^{K} \hat{z}_{ij} \Biggl\{ \log \pi_j + \sum_{k=1}^{D} \hat{r}_{ijk} \Bigl\{ \log \eta_k - \tfrac{n}{2}\log(2\pi) - \tfrac{1}{2}\log|S_{jk}| - \tfrac{1}{2}\mathrm{tr}\bigl\{ S_{jk}^{-1}\bigl( E_1[\mathbf{x}_1^{ik}(\mathbf{x}_1^{ik})^T] - 2 E_1[\mathbf{x}_1^{ik}] u_{jk}^T + u_{jk}u_{jk}^T \bigr) \bigr\}$$
$$+ \sum_{t=2}^{\tau} \Bigl[ -\tfrac{n}{2}\log(2\pi) - \tfrac{1}{2}\log|Q_{jk}| - \tfrac{1}{2}\mathrm{tr}\bigl\{ Q_{jk}^{-1}\bigl( E_1[\mathbf{x}_t^{ik}(\mathbf{x}_t^{ik})^T] - 2 E_1[\mathbf{x}_t^{ik}(\mathbf{x}_{t-1}^{ik})^T] A_{jk}^T + A_{jk} E_1[\mathbf{x}_{t-1}^{ik}(\mathbf{x}_{t-1}^{ik})^T] A_{jk}^T \bigr) \bigr\} \Bigr]$$
$$+ \sum_{t=1}^{\tau} \Bigl[ -\tfrac{m}{2}\log(2\pi) - \tfrac{1}{2}\log|R_{jk}| - \tfrac{1}{2}\mathrm{tr}\bigl\{ R_{jk}^{-1}\bigl( \mathbf{y}_t^{ik}(\mathbf{y}_t^{ik})^T - 2\mathbf{y}_t^{ik} (E_1[\mathbf{x}_t^{ik}])^T C_{jk}^T + C_{jk} E_1[\mathbf{x}_t^{ik}(\mathbf{x}_t^{ik})^T] C_{jk}^T \bigr) \bigr\} \Bigr] \Bigr\}$$
$$+ \sum_{k=1}^{D} (1 - \hat{r}_{ijk}) \Bigl\{ \log(1 - \eta_k) - \tfrac{n}{2}\log(2\pi) - \tfrac{1}{2}\log|F_k| - \tfrac{1}{2}\mathrm{tr}\bigl\{ F_k^{-1}\bigl( E_0[\mathbf{x}_1^{ik}(\mathbf{x}_1^{ik})^T] - 2 E_0[\mathbf{x}_1^{ik}] o_k^T + o_k o_k^T \bigr) \bigr\}$$
$$+ \sum_{t=2}^{\tau} \Bigl[ -\tfrac{n}{2}\log(2\pi) - \tfrac{1}{2}\log|H_k| - \tfrac{1}{2}\mathrm{tr}\bigl\{ H_k^{-1}\bigl( E_0[\mathbf{x}_t^{ik}(\mathbf{x}_t^{ik})^T] - 2 E_0[\mathbf{x}_t^{ik}(\mathbf{x}_{t-1}^{ik})^T] E_k^T + E_k E_0[\mathbf{x}_{t-1}^{ik}(\mathbf{x}_{t-1}^{ik})^T] E_k^T \bigr) \bigr\} \Bigr]$$
$$+ \sum_{t=1}^{\tau} \Bigl[ -\tfrac{m}{2}\log(2\pi) - \tfrac{1}{2}\log|G_k| - \tfrac{1}{2}\mathrm{tr}\bigl\{ G_k^{-1}\bigl( \mathbf{y}_t^{ik}(\mathbf{y}_t^{ik})^T - 2\mathbf{y}_t^{ik} (E_0[\mathbf{x}_t^{ik}])^T B_k^T + B_k E_0[\mathbf{x}_t^{ik}(\mathbf{x}_t^{ik})^T] B_k^T \bigr) \bigr\} \Bigr] \Bigr\} \Biggr\} \qquad (C.1)$$

From (21)–(23), the objective function Q(Ψ, Ψ^old) in (C.1) is rewritten as

$$\mathcal{Q}(\Psi, \Psi^{\text{old}}) = \sum_{i=1}^{N} \sum_{j=1}^{K} \hat{z}_{ij} \Biggl\{ \log \pi_j + \sum_{k=1}^{D} \hat{r}_{ijk} \Bigl\{ \log \eta_k - \tfrac{n}{2}\log(2\pi) - \tfrac{1}{2}\log|S_{jk}| - \tfrac{1}{2}\mathrm{tr}\bigl\{ S_{jk}^{-1}\bigl( \hat{P}_{1,1|j}^{ik} - 2\hat{\mathbf{x}}_{1|j}^{ik} u_{jk}^T + u_{jk}u_{jk}^T \bigr) \bigr\}$$
$$+ \sum_{t=2}^{\tau} \Bigl[ -\tfrac{n}{2}\log(2\pi) - \tfrac{1}{2}\log|Q_{jk}| - \tfrac{1}{2}\mathrm{tr}\bigl\{ Q_{jk}^{-1}\bigl( \hat{P}_{t,t|j}^{ik} - 2\hat{P}_{t,t-1|j}^{ik} A_{jk}^T + A_{jk} \hat{P}_{t-1,t-1|j}^{ik} A_{jk}^T \bigr) \bigr\} \Bigr]$$
$$+ \sum_{t=1}^{\tau} \Bigl[ -\tfrac{m}{2}\log(2\pi) - \tfrac{1}{2}\log|R_{jk}| - \tfrac{1}{2}\mathrm{tr}\bigl\{ R_{jk}^{-1}\bigl( \mathbf{y}_t^{ik}(\mathbf{y}_t^{ik})^T - 2\mathbf{y}_t^{ik}(\hat{\mathbf{x}}_{t|j}^{ik})^T C_{jk}^T + C_{jk} \hat{P}_{t,t|j}^{ik} C_{jk}^T \bigr) \bigr\} \Bigr] \Bigr\}$$
$$+ \sum_{k=1}^{D} (1 - \hat{r}_{ijk}) \Bigl\{ \log(1 - \eta_k) - \tfrac{n}{2}\log(2\pi) - \tfrac{1}{2}\log|F_k| - \tfrac{1}{2}\mathrm{tr}\bigl\{ F_k^{-1}\bigl( \hat{L}_{1,1}^{ik} - 2\hat{\mathbf{s}}_1^{ik} o_k^T + o_k o_k^T \bigr) \bigr\}$$
$$+ \sum_{t=2}^{\tau} \Bigl[ -\tfrac{n}{2}\log(2\pi) - \tfrac{1}{2}\log|H_k| - \tfrac{1}{2}\mathrm{tr}\bigl\{ H_k^{-1}\bigl( \hat{L}_{t,t}^{ik} - 2\hat{L}_{t,t-1}^{ik} E_k^T + E_k \hat{L}_{t-1,t-1}^{ik} E_k^T \bigr) \bigr\} \Bigr]$$
$$+ \sum_{t=1}^{\tau} \Bigl[ -\tfrac{m}{2}\log(2\pi) - \tfrac{1}{2}\log|G_k| - \tfrac{1}{2}\mathrm{tr}\bigl\{ G_k^{-1}\bigl( \mathbf{y}_t^{ik}(\mathbf{y}_t^{ik})^T - 2\mathbf{y}_t^{ik}(\hat{\mathbf{s}}_t^{ik})^T B_k^T + B_k \hat{L}_{t,t}^{ik} B_k^T \bigr) \bigr\} \Bigr] \Bigr\} \Biggr\} \qquad (C.2)$$


APPENDIX D
FSDTM WITH SPATIAL CONSTRAINTS (FSDTM-SC)

In this appendix, we describe the parameter estimation for FSDTM with spatial constraints (FSDTM-SC). The difference between the two models is that, instead of using p(z | π) as in (4), the FSDTM-SC defines the distribution of the hidden variables z given the prior probabilities π as

$$p(\mathbf{z} \mid \pi) = \prod_{i=1}^{N} \prod_{j=1}^{K} \pi_{ij}^{z_{ij}} \qquad (D.1)$$

where the prior probabilities satisfy the constraints π_ij ≥ 0 and Σ_{j=1}^{K} π_ij = 1. As we can see from (4) and (D.1), one of the major differences between FSDTM and FSDTM-SC is that FSDTM uses a common prior distribution π_j for all observations y_i, whereas the prior distribution π_ij of FSDTM-SC is different for each observation y_i and depends on the neighboring observations. Adopting the idea in [20], the prior probability π_ij of the observation y_i belonging to the label Ω_j is assumed to follow an MRF, with the update step

$$\pi_{ij} = p\bigl(\hat{z}_{ij}^{(\text{old})} \mid \hat{z}_{\zeta_i j}^{(\text{old})}, \beta\bigr) = \frac{\exp\Bigl( -\sum_{m \in \zeta_i} H\bigl(\hat{z}_{ij}^{(\text{old})} \mid \hat{z}_{mj}^{(\text{old})}, \beta\bigr) \Bigr)}{\sum_{k=1}^{K} \exp\Bigl( -\sum_{m \in \zeta_i} H\bigl(\hat{z}_{ik}^{(\text{old})} \mid \hat{z}_{mk}^{(\text{old})}, \beta\bigr) \Bigr)} \qquad (D.2)$$

where (old) indicates the previous iteration. In (D.2), β is the temperature value; in this paper it is set to 200 (β = 200). ζ_i, with i ∉ ζ_i, denotes the neighborhood of observation y_i, and H(ẑ_ij^{(l)} | ẑ_mj^{(l)}, β) is the clique potential function given in (17) of [20]. The EM algorithm used to maximize the log-likelihood function of FSDTM-SC with respect to the parameters is the same as that of FSDTM in Section IV; the only difference in the parameter learning is that (36) is used to update the distribution π_j of FSDTM, whereas (D.2) is used to update the distribution π_ij of FSDTM-SC.
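A minimal sketch of the MRF prior update (D.2) for patches arranged on a regular grid, under the simplifying assumption that the negated clique potential is proportional to the neighbor's posterior, i.e., -H(ẑ_ij | ẑ_mj, β) = β ẑ_mj (an assumption made here for illustration; the actual potential is defined in (17) of [20] and may differ).

```python
import numpy as np

def mrf_prior_update(z_hat, grid_shape, beta=200.0):
    """Update pi_ij as in (D.2) on a 4-connected patch grid.
       z_hat: (N, K) posteriors from the previous iteration, with N = rows * cols.
       Assumes -H(z_ij | z_mj, beta) = beta * z_mj (assumption, not the paper's definition)."""
    rows, cols = grid_shape
    z = z_hat.reshape(rows, cols, -1)
    neigh = np.zeros_like(z)                       # sum of 4-connected neighbour posteriors
    neigh[1:, :, :] += z[:-1, :, :]; neigh[:-1, :, :] += z[1:, :, :]
    neigh[:, 1:, :] += z[:, :-1, :]; neigh[:, :-1, :] += z[:, 1:, :]
    energy = beta * neigh                          # -sum_{m in zeta_i} H(.) under the assumption
    energy -= energy.max(axis=2, keepdims=True)    # subtract max for numerical stability
    pi = np.exp(energy)
    pi /= pi.sum(axis=2, keepdims=True)            # normalize over the K labels as in (D.2)
    return pi.reshape(-1, z_hat.shape[1])
```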

ACKNOWLEDGMENT

The authors would like to thank the anonymous reviewers and the associate editor for their insightful comments that significantly improved the quality of this paper.

REFERENCES

[1] G. Doretto, A. Chiuso, Y. N. Wu, and S. Soatto, "Dynamic textures," Int. J. Comput. Vis., vol. 51, no. 2, pp. 91–109, 2003.
[2] A. Ravichandran, P. Favaro, and R. Vidal, "A unified approach to segmentation and categorization of dynamic textures," in Proc. Asian Conf. Comput. Vis., 2010, pp. 425–438.
[3] D. Chetverikov and R. Peteri, "A brief survey of dynamic texture description and recognition," in Proc. Int. Conf. Comput. Recognit. Syst., 2005, pp. 17–26.
[4] J. Barron, D. Fleet, and S. Beauchemin, "Performance of optical flow techniques," Int. J. Comput. Vis., vol. 12, no. 1, pp. 43–77, 1994.
[5] B. K. P. Horn, Robot Vision. New York, NY, USA: McGraw-Hill, 1986.
[6] J. Wang and E. Adelson, "Representing moving images with layers," IEEE Trans. Image Process., vol. 3, no. 5, pp. 625–638, Sep. 1994.
[7] B. Frey and N. Jojic, "Estimating mixture models of images and inferring spatial transformations using the EM algorithm," in Proc. IEEE Comput. Vis. Pattern Recognit., 1999, pp. 416–422.
[8] G. J. McLachlan and D. Peel, Finite Mixture Models. New York, NY, USA: Wiley, 2000.
[9] C. M. Bishop, Pattern Recognition and Machine Learning. New York, NY, USA: Springer-Verlag, 2006.
[10] A. P. Dempster, N. M. Laird, and D. B. Rubin, "Maximum likelihood from incomplete data via the EM algorithm," J. R. Statist. Soc., vol. 39, no. 1, pp. 1–38, 1977.
[11] C. Carson, S. Belongie, H. Greenspan, and J. Malik, "Blobworld: Image segmentation using expectation-maximization and its application to image querying," IEEE Trans. Pattern Anal. Mach. Intell., vol. 24, no. 8, pp. 1026–1038, Aug. 2002.
[12] M. N. Thanh and Q. M. J. Wu, "Gaussian mixture model based spatial neighborhood relationships for pixel labeling problem," IEEE Trans. Syst., Man, Cybern. B, Cybern., vol. 42, no. 1, pp. 193–202, Feb. 2012.
[13] M. N. Thanh and Q. M. J. Wu, "Robust Student's-t mixture model with spatial constraints and its application in medical image segmentation," IEEE Trans. Med. Imag., vol. 31, no. 1, pp. 103–116, Jan. 2012.
[14] R. Vidal and A. Ravichandran, "Optical flow estimation and segmentation of multiple moving dynamic textures," in Proc. IEEE Comput. Vis. Pattern Recognit., vol. 2, Jun. 2005, pp. 516–521.
[15] G. Doretto, D. Cremers, P. Favaro, and S. Soatto, "Dynamic texture segmentation," in Proc. IEEE 9th Int. Conf. Comput. Vis., vol. 2, Oct. 2003, pp. 1236–1242.
[16] A. B. Chan and N. Vasconcelos, "Modeling, clustering, and segmenting video with mixtures of dynamic textures," IEEE Trans. Pattern Anal. Mach. Intell., vol. 30, no. 5, pp. 909–926, May 2008.
[17] A. B. Chan and N. Vasconcelos, "Mixtures of dynamic textures," in Proc. IEEE Int. Conf. Comput. Vis., vol. 1, 2005, pp. 641–647.
[18] L. Barrington, A. B. Chan, and G. R. G. Lanckriet, "Modeling music as a dynamic texture," IEEE Trans. Audio, Speech, Lang. Process., vol. 18, no. 3, pp. 602–612, Mar. 2010.
[19] A. B. Chan and N. Vasconcelos, "Layered dynamic textures," IEEE Trans. Pattern Anal. Mach. Intell., vol. 31, no. 10, pp. 1862–1879, Oct. 2009.
[20] M. N. Thanh and Q. M. J. Wu, "Dynamic fuzzy clustering and its application in motion segmentation," IEEE Trans. Fuzzy Syst., vol. 21, no. 6, pp. 1019–1031, Dec. 2013.
[21] R. H. Shumway and D. S. Stoffer, "An approach to time series smoothing and forecasting using the EM algorithm," J. Time Ser. Anal., vol. 3, no. 4, pp. 253–264, 1982.
[22] R. H. Shumway and D. S. Stoffer, Time Series Analysis and Its Applications. New York, NY, USA: Springer-Verlag, 2000.
[23] M. H. Law, M. A. T. Figueiredo, and A. K. Jain, "Simultaneous feature selection and clustering using mixture models," IEEE Trans. Pattern Anal. Mach. Intell., vol. 26, no. 9, pp. 1154–1166, Sep. 2004.
[24] S. Vaithyanathan and B. Dom, "Generalized model selection for unsupervised learning in high dimensions," in Proc. Adv. Neural Inf. Process. Syst., 1999, pp. 970–976.
[25] C. Constantinopoulos, M. K. Titsias, and A. Likas, "Bayesian feature and model selection for Gaussian mixture models," IEEE Trans. Pattern Anal. Mach. Intell., vol. 28, no. 6, pp. 1013–1018, Jun. 2006.
[26] Y. H. Li, M. Dong, and J. Hua, "Simultaneous localized feature selection and model detection for Gaussian mixtures," IEEE Trans. Pattern Anal. Mach. Intell., vol. 31, no. 5, pp. 953–960, May 2009.
[27] S. Roweis and Z. Ghahramani, "A unifying review of linear Gaussian models," Neural Comput., vol. 11, no. 2, pp. 305–345, 1999.
[28] R. Unnikrishnan, C. Pantofaru, and M. Hebert, "A measure for objective evaluation of image segmentation algorithms," in Proc. IEEE Int. Conf. Comput. Vis., Jun. 2005, pp. 34–41.
[29] R. Peteri, S. Fazekas, and M. J. Huiskes, "DynTex: A comprehensive database of dynamic textures," Pattern Recognit. Lett., vol. 31, no. 12, pp. 1627–1632, 2010.
[30] J. Canny, "A computational approach to edge detection," IEEE Trans. Pattern Anal. Mach. Intell., vol. 8, no. 6, pp. 679–698, Nov. 1986.


Thanh Minh Nguyen received the B.E. (Hons.) degree in electrical and electronic engineering from the HCM City University of Technology, Vietnam, in 2004, the M.S. degree from the Department of Electrical Engineering, DaYeh University, Taiwan, in 2006, and the Ph.D. degree from the Department of Electrical Engineering, University of Windsor, Canada, in 2011. He is currently a Post-Doctoral Fellow with the Department of Electrical and Computer Engineering, University of Windsor. His research interests include computer vision and pattern recognition. The best example of his ingenuity is shown through the project on the Autonomous Racing Challenge, which took top honors at the University of Waterloo in 2008 and 2009. He is the first author of several prominent journal papers in his research field, including papers in the IEEE Transactions on Medical Imaging, the IEEE Transactions on Image Processing, the IEEE Transactions on Fuzzy Systems, the IEEE Transactions on Neural Networks and Learning Systems, the IEEE Transactions on Systems, Man, and Cybernetics, Part B, the IEEE Transactions on Information Technology in Biomedicine, and Image and Vision Computing. He has been serving as a reviewer for the most prominent journals in his field.


Qingming Jonathan Wu (M'92-SM'09) received the Ph.D. degree in electrical engineering from the University of Wales, Swansea, U.K., in 1990. He was with the National Research Council of Canada in 1995, where he became a Senior Research Officer and Group Leader. He is currently a Professor with the Department of Electrical and Computer Engineering, University of Windsor, Windsor, ON, Canada. He has published more than 250 peer-reviewed papers in computer vision, image processing, intelligent systems, robotics, and integrated microsystems. His current research interests include 3-D computer vision, active video object tracking and extraction, interactive multimedia, sensor analysis and fusion, and visual sensor networks. He holds the Tier 1 Canada Research Chair in automotive sensors and information systems. He is an Associate Editor for the IEEE Transactions on Systems, Man, and Cybernetics: Systems, the IEEE Transactions on Neural Networks and Learning Systems, and the International Journal of Robotics and Automation. He has served on technical program committees and international advisory committees for many prestigious conferences.
