
Variational Regularized 2-D Nonnegative Matrix Factorization

Bin Gao, W. L. Woo, Member, IEEE, and S. S. Dlay

Abstract—A novel approach for adaptive regularization of 2-D nonnegative matrix factorization is presented. The proposed matrix factorization is developed under the framework of maximum a posteriori probability and is adaptively fine-tuned using the variational approach. The method enables: 1) a generalized criterion for variable sparseness to be imposed onto the solution; and 2) prior information to be explicitly incorporated into the basis features. The method is computationally efficient and has been demonstrated on two applications, namely, extracting features from images and separating a single-channel source mixture. In addition, it is shown that the basis features of an information-bearing matrix can be extracted more efficiently using the proposed regularized priors. Experimental tests have been rigorously conducted to verify the efficacy of the proposed method.

Index Terms—Audio processing, machine learning, nonnegative matrix factorization, single channel blind source separation, sparsity-aware learning, variational regularization.

I. INTRODUCTION

Matrix factorization is becoming a commonly used technique for understanding the latent structure of observed data in various applications. There are many forms of matrix factorization, such as singular value decomposition (SVD) [1], [2] and nonnegative matrix factorization (NMF) [3]–[8]. Compared with SVD, NMF gives a part-based decomposition [7], which is unique under certain conditions [8], making it unnecessary to impose constraints in the form of orthogonality and statistical independence. These properties have led NMF to become a popular tool in many applications for exploratory analysis, for example, blind source separation (BSS) [9]–[11], data classification [12], [13], data mining [14], pattern recognition [15], object detection [16], and dimensionality reduction [17]. In conventional NMF, given a data matrix Y = [y_1, ..., y_L] ∈ ℝ_+^{K×L} with Y_{k,l} ≥ 0, NMF factorizes this matrix into a product of two nonnegative matrices, Y ≈ DH, where D ∈ ℝ_+^{K×d} and H ∈ ℝ_+^{d×L}, with K and L representing the total number of rows and columns in Y, respectively. The matrix D can be compressed and reduced to its integral components such that it contains a set of basis vectors, and H is a code matrix whose elements describe the amplitude of each basis vector at each time (or space) point. NMF was initially proposed by Paatero and Tapper [5].

Later, Lee and Seung [7] developed the multiplicative update (MU) algorithm to solve the NMF optimization problem based on the least squares distance and the Kullback–Leibler divergence. Other families of parameterized cost functions, such as the Beta divergence [18] and Csiszar's divergences [19], have also been presented. In addition to the MU algorithm, projected gradient updates have been developed by Lin [16] and Hoyer [20]. A sparseness constraint [20], [21] can be added to the cost function, and this is achieved by regularization using the L1 norm. A method for sparse NMF was introduced by Hoyer [20] based on minimizing a penalized least squares cost function. Stadlthanner et al. [21] extended sparse NMF to allow different sparsity constraints on D and H in order to resolve the permutation ambiguity in the NMF problem. Nonetheless, all the above methods require manual setting of the sparsity parameter.

Finding sparse solutions is an important tool for efficiently and meaningfully representing today's exponentially growing data. Unfortunately, although much research effort has been devoted to finding sparse solutions during the last decade, how to find optimally sparse solutions remains a relatively untouched issue. Conventional L1-norm sparsity regularization methods are based on a single regularization parameter, which is commonly determined in an ad hoc manner. These methods include the heuristic approach [22] and the cross-validation approach [23]. The problems with these approaches originate from the fact that they do not specifically define the degree of, or a quantification relating to, the optimality of the sparseness. Moreover, most statistical problems in their original formulation involve multiple regularization parameters, and these need to be determined simultaneously. Unfortunately, heuristic and cross-validation approaches are not effectively equipped to handle this type of problem. When these approaches are used to optimize multiple regularization parameters simultaneously, there is a tendency for the obtained solutions to be under- or over-sparse.

In recent years, several types of prior over D and H have been proposed [24]–[26]. It is assumed that the priors of D and H satisfy the exponential density and that the prior for the noise variance is chosen as an inverse gamma density. In particular, Gaussian distributions are chosen for both D and H in [26]. The model parameters and hyperparameters are adapted using Markov chain Monte Carlo [25]–[27]. In all cases, a fully Bayesian treatment is applied to approximate inference for both model parameters and hyperparameters. While these approaches increase the accuracy of matrix factorization, they only work when a large dataset is available. Moreover, they incur a significantly high computational cost at each iteration to adapt the hyperparameters, especially when the dimension of D is high. The use of sparse representation is strongly related to the principle of parsimony [28], i.e., among all possible accounts, the simplest is considered the best. If no formal prior information is given, parsimony can be considered a reasonable guiding principle to avoid overfitting.
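As background for the discussion above, the standard least-squares MU iteration of Lee and Seung [7] can be written in a few lines of NumPy. This is a generic textbook sketch with illustrative names (and a small constant added to avoid division by zero), not the regularized method developed later in this paper.

```python
import numpy as np

def nmf_mu(Y, d, n_iter=200, eps=1e-9, seed=0):
    """Minimal least-squares NMF with Lee-Seung multiplicative updates.

    Y : (K, L) nonnegative data matrix, approximated as Y ~= D @ H
    d : number of basis vectors (columns of D / rows of H).
    """
    rng = np.random.default_rng(seed)
    K, L = Y.shape
    D = rng.random((K, d)) + eps   # nonnegative random initialization
    H = rng.random((d, L)) + eps
    for _ in range(n_iter):
        # multiplicative updates keep D and H nonnegative by construction
        H *= (D.T @ Y) / (D.T @ D @ H + eps)
        D *= (Y @ H.T) / (D @ H @ H.T + eps)
    return D, H
```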


The recently developed 2-D sparse NMF deconvolution (SNMF2-D) model [29], [30] extends the NMF model to a sparse convolution of D and H, and it optimizes the following least squares cost function with a sparse penalty term on H:

$$C_{LS} = \frac{1}{2}\sum_{k,l}\bigl(Y_{k,l} - \tilde Z_{k,l}\bigr)^{2} + \lambda\,\varphi(\mathbf H) \tag{1}$$

for all k ∈ K and l ∈ L, where Z̃_{k,l} = Σ_{τ,φ,d} D̃^τ_{k−φ,d} H^φ_{d,l−τ}, D̃^τ_{k,d} = D^τ_{k,d} / √(Σ_{τ,k} (D^τ_{k,d})²), and ϕ(H) can be any function with positive derivative, such as the L_α norm (α > 0) given by ϕ(H) = ‖H‖_α = (Σ_{φ,d,l} |H^φ_{d,l}|^α)^{1/α}. The SNMF2-D is effective for extracting the basis D and code H in the presence of convoluted patterns. However, the drawbacks of SNMF2-D originate from its lack of a generalized criterion for controlling the sparsity of H. In practice, the sparsity parameter is set manually. When SNMF2-D imposes uniform sparsity on all elements of the code H, this is equivalent to enforcing each element to follow a fixed distribution according to the pre-selected sparsity parameter. In addition, by assigning the fixed distribution to each individual element code, all the element codes are effectively constrained to be stationary. However, some information-bearing signals, such as audio signals, are strictly nonstationary and have different temporal structures and sparseness. Hence, they cannot realistically be enforced by a fixed distribution. Moreover, since the SNMF2-D introduces many temporal shifts, more element codes will deviate from the fixed distribution. In such a situation, the obtained factorization will invariably suffer from either under- or over-sparseness, which subsequently leads to ambiguity in the extracted features. This strongly indicates that the present form of SNMF2-D is still technically lacking and is not readily suited for feature extraction, especially for observations involving nonstationary signals.

In this paper, a new matrix factorization algorithm is proposed for feature extraction. First, our proposed cost function is specially developed for the factorization of nonstationary signals that exhibit convolutive patterns. Second, our proposed algorithm overcomes all the limitations associated with the SNMF2-D discussed previously. Unlike the SNMF2-D, our model assigns: 1) a probability distribution to each element of H; and 2) a sparsity parameter associated with each probability distribution. This sets up a platform that enables the sparsity parameter to be individually optimized for each element code. In addition, each sparsity parameter is learned and adapted as part of the matrix factorization. This bypasses the need for manual selection in SNMF2-D and overcomes the problem of under- and over-sparse factorization. Third, our proposed algorithm enables the basis in D to be estimated more accurately for feature extraction. Since each pattern in Y has its own features, designing an appropriate basis to match these features is imperative. If these features share some degree of correlation, then this information should be captured to enable better part-based representations of each feature. Toward this end, we have developed a modified Gaussian prior on D to allow the proposed matrix factorization to capture the features of these patterns efficiently.

This paper is organized as follows. In Section II, the new algorithm for matrix factorization is derived. Section III demonstrates the efficiency of the proposed algorithm on toy examples for feature extraction. A real application of BSS using the proposed method and a comparison with other matrix factorization methods are presented in Section IV. Finally, Section V concludes this paper.

II. PROPOSED METHOD

A. Generative Model

In this section, the proposed NMF framework is derived. First, we consider the following generative model for Y:

$$\mathbf Y = \sum_{\tau=0}^{\tau_{\max}}\sum_{\phi=0}^{\phi_{\max}} \overset{\downarrow\phi}{\mathbf D^{\tau}}\;\overset{\rightarrow\tau}{\mathbf H^{\phi}} + \mathbf V = \sum_{\tau=0}^{\tau_{\max}}\sum_{\phi=0}^{\phi_{\max}}\sum_{d=1}^{d_{\max}} \overset{\downarrow\phi}{\mathbf D^{\tau}_{d}}\;\overset{\rightarrow\tau}{\mathbf H^{\phi}_{d}} + \mathbf V \tag{2}$$

where H^φ ∼ p(H^φ | Λ^φ) = Π_{d=1}^{d_max} Π_{l=1}^{l_max} p(H^φ_{d,l} | λ^φ_{d,l}), V ∼ N(0, σ²I), and D^τ ∼ N_m(D^τ | 0, Σ_τ). These are explained in Section II-B. The vertical arrow in D^{τ↓φ} denotes the downward shift operator, which moves each element in the matrix D^τ down by φ rows, and the horizontal arrow in H^{φ→τ} denotes the right shift operator, which moves each element in the matrix H^φ to the right by τ columns. In scalar representation, the (k, l)th element in Y is given by Y_{k,l} = Σ_{d=1}^{d_max} Σ_{τ=0}^{τ_max} Σ_{φ=0}^{φ_max} D^τ_{k−φ,d} H^φ_{d,l−τ} + V_{k,l}, where D^τ_{k',d} is the (k', τ, d)th element of D and H^φ_{d',l'} is the (d', φ, l')th element of H. In (2), D^τ_d is constructed from the dth column of D^τ, which represents the (d, τ)th basis vector of the factorization. Similarly, H^φ_d is constructed from the dth row of H^φ, which represents the (d, φ)th code vector of the factorization. The matrix D^τ = {D^τ_{k,d} | k = 1, ..., K and d = 1, ..., d_max} denotes the τth slice of the basis D, H^φ = {H^φ_{d,l} | d = 1, ..., d_max and l = 1, ..., L} denotes the φth slice of the code H, and Λ^φ = {λ^φ_{d,l} | d = 1, ..., d_max and l = 1, ..., L} denotes the φth slice of the sparsity parameter Λ. The matrix V represents the noise, which is assumed to be independently and identically distributed (i.i.d.) Gaussian N(·) with variance σ². The terms d_max, τ_max, φ_max, and l_max are the maximum number of columns in D^τ, of τ shifts, of φ shifts, and the column length of Y, respectively. The prior over D^τ is assumed to be a zero-mean modified multivariate Gaussian with covariance matrix Σ_τ, and N_m(·) denotes the modified Gaussian distribution given in (6). On the other hand, the prior over H is assumed to be an exponential distribution, where each individual element in H is constrained to this distribution with an independent decay parameter λ^φ_{d,l}. The model in (2) differs from the conventional SNMF2-D, where λ^φ_{d,l} is simply set to a fixed constant, i.e., λ^φ_{d,l} = λ for all d, l, φ. Such a setting imposes only uniform constant sparsity on all element codes in H, which enforces each element to follow an identical fixed distribution determined by the selected constant sparsity parameter. The ensuing consequence is that the resulting factorization will be either under- or over-sparse.
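The generative model in (2) is built entirely from the two shift operators and per-slice matrix products. The following sketch, under the assumption of zero-padded shifts and with purely illustrative function names, shows how the reconstruction Z = Σ_τ Σ_φ D^{τ↓φ} H^{φ→τ} can be computed; it is not the authors' implementation.

```python
import numpy as np

def shift_down(M, phi):
    """Downward shift: move rows of M down by phi, zero-padding the top."""
    out = np.zeros_like(M)
    out[phi:, :] = M[:M.shape[0] - phi, :]
    return out

def shift_right(M, tau):
    """Right shift: move columns of M right by tau, zero-padding the left."""
    out = np.zeros_like(M)
    out[:, tau:] = M[:, :M.shape[1] - tau]
    return out

def reconstruct(D, H):
    """Z = sum_tau sum_phi shift_down(D[tau], phi) @ shift_right(H[phi], tau).

    D : list of (K, d_max) slices D^tau;  H : list of (d_max, L) slices H^phi.
    """
    Z = np.zeros((D[0].shape[0], H[0].shape[1]))
    for tau, Dt in enumerate(D):
        for phi, Hp in enumerate(H):
            Z += shift_down(Dt, phi) @ shift_right(Hp, tau)
    return Z
```

With D stored as a list of τ_max + 1 slices and H as a list of φ_max + 1 slices, reconstruct(D, H) returns the K × L approximation of Y implied by (2).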


B. Formulation of the Proposed Algorithm

In order to formulate the proposed algorithm, we first define D = [D^0 D^1 ··· D^{τ_max}], Λ = [Λ^1 Λ^2 ··· Λ^{φ_max}], and H = [H^0 H^1 ··· H^{φ_max}], and then choose a prior distribution p(D, H) over the factors {D, H}. The posterior can be found by using Bayes' theorem as

$$p\bigl(\mathbf D,\mathbf H\mid\mathbf Y,\sigma^{2},\boldsymbol\Lambda\bigr) = \frac{p\bigl(\mathbf Y\mid\mathbf D,\mathbf H,\sigma^{2}\bigr)\,p(\mathbf D,\mathbf H\mid\boldsymbol\Lambda)}{p(\mathbf Y)} \tag{3}$$

where the denominator is a constant and, therefore, the log-posterior of D and H can be expressed as

$$\log p\bigl(\mathbf D,\mathbf H\mid\mathbf Y,\sigma^{2},\boldsymbol\Lambda\bigr) = \log p\bigl(\mathbf Y\mid\mathbf D,\mathbf H,\sigma^{2}\bigr) + \log p(\mathbf D,\mathbf H\mid\boldsymbol\Lambda) + \mathrm{const.} \tag{4}$$

The likelihood of the observations given D and H is written as¹

$$p\bigl(\mathbf Y\mid\mathbf D,\mathbf H,\sigma^{2}\bigr) = \frac{1}{\sqrt{2\pi\sigma^{2}}}\exp\!\left[-\frac{\Bigl\|\mathbf Y - \sum_{\tau}\sum_{\phi}\sum_{d}\overset{\downarrow\phi}{\mathbf D^{\tau}_{d}}\;\overset{\rightarrow\tau}{\mathbf H^{\phi}_{d}}\Bigr\|_{F}^{2}}{2\sigma^{2}}\right] \tag{5}$$

where ‖·‖_F denotes the Frobenius norm. The second term on the right-hand side of (4) consists of the prior distribution of D and H, which are assumed to be jointly independent. In our proposed model, the prior over D is a factorial model p(D) = Π_{τ=0}^{τ_max} p(D^τ), where the τth slice of D is assumed to be distributed as N_m(D^τ | 0, Σ_τ), i.e., a zero-mean modified multivariate Gaussian with covariance matrix Σ_τ, which we will now develop. Since D is nonnegative, it is natural to assume that it satisfies the multivariate rectified Gaussian [31]

$$p\bigl(\mathbf D^{\tau}\bigr) = \Phi\!\bigl(-\mathrm{diag}^{-1}(\boldsymbol\Sigma_{\tau})\,\bar{\mathbf u}_{\tau}\bigr)\,\delta(\mathbf d^{\tau}) + \Bigl(\sqrt{2\pi\,|\boldsymbol\Sigma_{\tau}|}\Bigr)^{-1} 2^{KI} \exp\!\Bigl(-\tfrac{1}{2}\bigl(\mathbf d^{\tau}-\bar{\mathbf u}_{\tau}\bigr)^{T}\boldsymbol\Sigma_{\tau}^{-1}\bigl(\mathbf d^{\tau}-\bar{\mathbf u}_{\tau}\bigr)\Bigr)\,U(\mathbf d^{\tau}) \tag{6}$$

where d^τ = vec(D^τ) = [D^{τT}_1 ⋮ D^{τT}_2 ⋮ ··· ⋮ D^{τT}_I]^T, δ(d^τ) = +∞ for d^τ = 0 and 0 for d^τ ≠ 0, Φ(·) denotes the multivariate Gaussian cumulative distribution function, and U(d^τ) = 0 for d^τ ≤ 0 and 1 for d^τ > 0. Equation (6) practically resembles a half-normal prior. Considering the zero mean of the rectified Gaussian distribution (i.e., setting ū_τ = 0), we have

$$p\bigl(\mathbf D^{\tau}\bigr) = 0.5\,\delta(\mathbf d^{\tau}) + \Bigl(\sqrt{2\pi\,|\boldsymbol\Sigma_{\tau}|}\Bigr)^{-1} 2^{KI} \exp\!\Bigl(-\tfrac{1}{2}\,\mathbf d^{\tau T}\boldsymbol\Sigma_{\tau}^{-1}\mathbf d^{\tau}\Bigr). \tag{7}$$

¹To avoid cluttering the notation, we shall remove the upper limits from the summation terms. The upper limits can be inferred from (2).


In applications, D represents the basis vectors that span the domain of the input matrix Y. Although the exact values of D are case specific, in most cases the probability of having zero-valued basis vectors, i.e., Pr(D = 0), is very small. Therefore, we approximate the above as

$$p(\mathbf D^{\tau}) \propto \begin{cases}\exp\!\bigl(-\tfrac{1}{2}\,\mathbf d^{\tau T}\boldsymbol\Sigma_{\tau}^{-1}\mathbf d^{\tau}\bigr), & \mathbf d^{\tau} \ge 0\\ 0, & \mathbf d^{\tau} < 0\end{cases} \tag{8}$$

where I = d_max and

$$\boldsymbol\Sigma_{\tau} = \begin{bmatrix}\boldsymbol\Sigma_{1,1,\tau} & \cdots & \boldsymbol\Sigma_{1,I,\tau}\\ \vdots & \ddots & \vdots\\ \boldsymbol\Sigma_{I,1,\tau} & \cdots & \boldsymbol\Sigma_{I,I,\tau}\end{bmatrix}$$

is the covariance matrix of vec(D^τ). In (8), "T" denotes matrix transpose, vec(·) represents column vectorization, Σ_{i,j,τ} = E[D^τ_i D^{τT}_j] is the cross-correlation between the basis vectors D^τ_i and D^τ_j, and E[·] denotes the expectation. The covariance matrix Σ_τ can be partitioned as

$$\boldsymbol\Sigma_{\tau} = \underbrace{\begin{bmatrix}\boldsymbol\Sigma_{1,1,\tau} & \mathbf 0 & \cdots & \mathbf 0\\ \mathbf 0 & \boldsymbol\Sigma_{2,2,\tau} & \cdots & \mathbf 0\\ \vdots & & \ddots & \vdots\\ \mathbf 0 & \cdots & \mathbf 0 & \boldsymbol\Sigma_{I,I,\tau}\end{bmatrix}}_{\boldsymbol\Sigma_{\mathrm{diag},\tau}} + \underbrace{\begin{bmatrix}\mathbf 0 & \boldsymbol\Sigma_{1,2,\tau} & \cdots & \boldsymbol\Sigma_{1,I,\tau}\\ \boldsymbol\Sigma_{2,1,\tau} & \mathbf 0 & \ddots & \vdots\\ \vdots & \ddots & \ddots & \boldsymbol\Sigma_{I-1,I,\tau}\\ \boldsymbol\Sigma_{I,1,\tau} & \cdots & \boldsymbol\Sigma_{I,I-1,\tau} & \mathbf 0\end{bmatrix}}_{\boldsymbol\Sigma_{\mathrm{off},\tau}} \tag{9}$$

In (9), 0 is a K × K matrix with zero elements. The inverse covariance matrix can be approximated as

$$\boldsymbol\Sigma_{\tau}^{-1} = \bigl(\boldsymbol\Sigma_{\mathrm{diag},\tau} + \boldsymbol\Sigma_{\mathrm{off},\tau}\bigr)^{-1} \approx \boldsymbol\Sigma_{\mathrm{diag},\tau}^{-1} - \boldsymbol\Sigma_{\mathrm{diag},\tau}^{-1}\boldsymbol\Sigma_{\mathrm{off},\tau}\boldsymbol\Sigma_{\mathrm{diag},\tau}^{-1} = \boldsymbol\Omega_{\mathrm{diag},\tau} - \boldsymbol\Omega_{\mathrm{off},\tau} \tag{10}$$

where Ω_diag,τ = Σ_diag,τ^{-1} is the inverse covariance matrix of Σ_diag,τ and Ω_off,τ = Σ_diag,τ^{-1} Σ_off,τ Σ_diag,τ^{-1}. The (i, j)th sub-matrix of Ω_off,τ is given by

$$\boldsymbol\Omega_{\mathrm{off},i,j,\tau} = \boldsymbol\Sigma_{i,i,\tau}^{-1}\,\boldsymbol\Sigma_{i,j,\tau}\,\boldsymbol\Sigma_{j,j,\tau}^{-1}. \tag{11}$$

Assuming that the elements within the same basis vector are uncorrelated, the above matrices simplify to Σ_{i,i,τ} = σ²_{i,τ} I, Σ_{j,j,τ} = σ²_{j,τ} I, and Σ_{i,j,τ} = c_{i,j,τ} I, where σ²_{i,τ} is the variance of the basis vector D^τ_i and c_{i,j,τ} is the cross-covariance between D^τ_i and D^τ_j. Thus, Ω_off,i,j,τ can be expressed as

$$\boldsymbol\Omega_{\mathrm{off},i,j,\tau} = \mu_{ij\tau}\,\mathbf I \tag{12}$$

where

$$\mu_{ij\tau} = \sigma_{i,\tau}^{-2}\,\sigma_{j,\tau}^{-2}\,c_{i,j,\tau}. \tag{13}$$


Using the above, we may cast (10) into two terms

$$-\log p(\mathbf D^{\tau}) \doteq \frac{1}{2}\operatorname{vec}(\mathbf D^{\tau})^{T}\boldsymbol\Omega_{\mathrm{diag},\tau}\operatorname{vec}(\mathbf D^{\tau}) - \frac{1}{2}\operatorname{vec}(\mathbf D^{\tau})^{T}\boldsymbol\Omega_{\mathrm{off},\tau}\operatorname{vec}(\mathbf D^{\tau}) = \gamma - \frac{1}{2}\operatorname{vec}(\mathbf D^{\tau})^{T}\boldsymbol\Omega_{\mathrm{off},\tau}\operatorname{vec}(\mathbf D^{\tau}) \tag{14}$$

where "≐" denotes equality up to constant terms. The first term γ = (1/2) vec(D^τ)^T Ω_diag,τ vec(D^τ) relates only to the power of D^τ, while the second term vec(D^τ)^T Ω_off,τ vec(D^τ) = Σ_{d,j (d≠j)} μ_{djτ} D^{τT}_d D^τ_j measures the sum of weighted correlations between D^τ_i and D^τ_j for all i, j, i ≠ j. Hence, the interesting information is actually contained in the second term, which represents the prior information on the basis vectors. By including this term, the underlying correlation between the different basis vectors can be incorporated into the matrix factorization to yield results that reflect this prior information. Therefore, with the factorial model in (14), the desired constraint assumes the form

$$f(\mathbf D) = -\sum_{\tau=0}^{\tau_{\max}}\log p\bigl(\mathbf D^{\tau}\bigr) \doteq -\sum_{\tau}\sum_{d,j\,(d\neq j)}\mu_{dj\tau}\,\mathbf D^{\tau T}_{d}\mathbf D^{\tau}_{j}. \tag{15}$$

In this paper, the probabilistic framework is used for the purpose of developing a platform to incorporate the statistical correlation between D^τ_i and D^τ_j into the matrix factorization as part of the regularization. In feature extraction, such a constraint is required in order to fully extract the basis, especially in situations where the patterns contain overlapping features. Although our proposed prior model for D stems from the modified Gaussian distribution, it is a combination of constrained and unconstrained parameterizations of the inverse covariance matrix. The prior on H is constrained to an exponential distribution with independent decay parameters, namely

$$p(\mathbf H\mid\boldsymbol\Lambda) = \prod_{\phi=0}^{\phi_{\max}} p\bigl(\mathbf H^{\phi}\mid\boldsymbol\Lambda^{\phi}\bigr) = \prod_{\phi=0}^{\phi_{\max}}\prod_{d=1}^{d_{\max}}\prod_{l=1}^{l_{\max}} p\bigl(H^{\phi}_{d,l}\mid\lambda^{\phi}_{d,l}\bigr) \tag{16}$$

where p(H^φ_{d,l} | λ^φ_{d,l}) = λ^φ_{d,l} exp(−λ^φ_{d,l} H^φ_{d,l}). Following (16), the negative log prior on H is defined as f(H) = −log p(H | Λ) = Σ_{φ,d,l} (λ^φ_{d,l} H^φ_{d,l} − log λ^φ_{d,l}). By substituting (5), (8), and (16) into (4), the negative log posterior of D and H is given as follows:

$$L \doteq \frac{1}{2\sigma^{2}}\Bigl\|\mathbf Y - \sum_{\tau}\sum_{\phi}\sum_{d}\overset{\downarrow\phi}{\mathbf D^{\tau}_{d}}\;\overset{\rightarrow\tau}{\mathbf H^{\phi}_{d}}\Bigr\|_{F}^{2} + f(\mathbf H) + f(\mathbf D) = \frac{1}{2\sigma^{2}}\Bigl\|\mathbf Y - \sum_{\tau}\sum_{\phi}\sum_{d}\overset{\downarrow\phi}{\mathbf D^{\tau}_{d}}\;\overset{\rightarrow\tau}{\mathbf H^{\phi}_{d}}\Bigr\|_{F}^{2} + \sum_{\phi,d,l}\Bigl(\lambda^{\phi}_{d,l}H^{\phi}_{d,l} - \log\lambda^{\phi}_{d,l}\Bigr) - \sum_{\tau}\sum_{d,j\,(d\neq j)}\mu_{dj\tau}\,\mathbf D^{\tau T}_{d}\mathbf D^{\tau}_{j}. \tag{17}$$

The sparsity term Σ_{φ,d,l} λ^φ_{d,l} H^φ_{d,l} forms the L1-norm regularization that resolves the permutation ambiguity by forcing all structure in H onto D. Therefore, the sparseness of the solution in (17) is highly dependent on the regularization parameters λ^φ_{d,l}.

1) Estimation of the Basis and Code: Defining Z = Σ_d Σ_τ Σ_φ D_d^{τ↓φ} H_d^{φ→τ}, the derivatives of (17) with respect to D^τ and H^φ of the proposed model are given by

$$\frac{\partial L}{\partial D^{\tau}_{k',d}} = -\frac{1}{\sigma^{2}}\sum_{\phi,l}\bigl(Y_{k'+\phi,l} - Z_{k'+\phi,l}\bigr)H^{\phi}_{d,l-\tau} - \sum_{j\neq d}\mu_{dj\tau}\,D^{\tau}_{k',j} \tag{18}$$

$$\frac{\partial L}{\partial H^{\phi}_{d',l'}} = -\frac{1}{\sigma^{2}}\sum_{\tau,k} D^{\tau}_{k-\phi,d'}\bigl(Y_{k,l'+\tau} - Z_{k,l'+\tau}\bigr) + \lambda^{\phi}_{d',l'}. \tag{19}$$

Using the approach in [5], the multiplicative learning rules in matrix notation become

$$\mathbf D^{\tau} \leftarrow \mathbf D^{\tau}\cdot\frac{\displaystyle\sum_{\phi}\overset{\uparrow\phi}{\mathbf Y}\;\Bigl(\overset{\rightarrow\tau}{\mathbf H^{\phi}}\Bigr)^{T}}{\displaystyle\sum_{\phi}\overset{\uparrow\phi}{\mathbf Z}\;\Bigl(\overset{\rightarrow\tau}{\mathbf H^{\phi}}\Bigr)^{T} + \mathbf D^{\tau}\boldsymbol\Psi_{\tau}} \tag{20}$$

$$\mathbf H^{\phi} \leftarrow \mathbf H^{\phi}\cdot\frac{\displaystyle\sum_{\tau}\Bigl(\overset{\downarrow\phi}{\mathbf D^{\tau}}\Bigr)^{T}\overset{\leftarrow\tau}{\mathbf Y}}{\displaystyle\sum_{\tau}\Bigl(\overset{\downarrow\phi}{\mathbf D^{\tau}}\Bigr)^{T}\overset{\leftarrow\tau}{\mathbf Z} + \boldsymbol\lambda^{\phi}} \tag{21}$$

where Ψ_τ is an I × I matrix whose (d, j)th element is given by μ_{djτ} and whose diagonal elements are zero.
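As a concrete illustration of (20) and (21), the following sketch performs one sweep of the multiplicative updates with zero-padded shift operators. It is a minimal NumPy rendering under the stated conventions (lists of slices, element-wise multiply and divide, a small constant to avoid division by zero); the function and variable names are illustrative and not taken from the paper.

```python
import numpy as np

def sd(M, phi):   # shift rows down by phi (zero-padded)
    out = np.zeros_like(M); out[phi:, :] = M[:M.shape[0] - phi, :]; return out

def su(M, phi):   # shift rows up by phi (zero-padded)
    out = np.zeros_like(M); out[:M.shape[0] - phi, :] = M[phi:, :]; return out

def sr(M, tau):   # shift columns right by tau (zero-padded)
    out = np.zeros_like(M); out[:, tau:] = M[:, :M.shape[1] - tau]; return out

def sl(M, tau):   # shift columns left by tau (zero-padded)
    out = np.zeros_like(M); out[:, :M.shape[1] - tau] = M[:, tau:]; return out

def update_DH(Y, D, H, Psi, lam, eps=1e-9):
    """One sweep of the multiplicative updates (20)-(21).

    Y   : (K, L) data;  D : list of (K, d) slices D^tau;  H : list of (d, L) slices H^phi
    Psi : list of (d, d) matrices Psi_tau (entries mu_{dj,tau}, zero diagonal)
    lam : list of (d, L) sparsity-parameter slices lambda^phi
    """
    Z = sum(sd(D[t], p) @ sr(H[p], t) for t in range(len(D)) for p in range(len(H)))
    # (21): code update
    for p in range(len(H)):
        num = sum(sd(D[t], p).T @ sl(Y, t) for t in range(len(D)))
        den = sum(sd(D[t], p).T @ sl(Z, t) for t in range(len(D))) + lam[p] + eps
        H[p] *= num / den
    Z = sum(sd(D[t], p) @ sr(H[p], t) for t in range(len(D)) for p in range(len(H)))
    # (20): basis update
    for t in range(len(D)):
        num = sum(su(Y, p) @ sr(H[p], t).T for p in range(len(H)))
        den = sum(su(Z, p) @ sr(H[p], t).T for p in range(len(H))) + D[t] @ Psi[t] + eps
        D[t] *= num / den
    return D, H
```

In the full method, lam would be refreshed by the adaptive rule derived in the next subsection rather than held fixed.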

2) Estimation of the Adaptive Sparsity Parameter: Since Hφ →0

is obtained directly from the original sparse code matrix Hφ , it suffices to compute the regularization parameters associated →0

d=1 l=1

(16)

φ,d,l

(18)

only with Hφ . Therefore, the cost function in (17) can be set with τmax = 0 as 1 F(H, λ) = 2σ 2

 2    φ max   ↓φ  φ  vec (Y) − I ⊗ D vec H      φ=0

F

+

φ max φ=0

φ

max  φ T T vec Hφ − λ log λφ 1 + d0T Ad0

φ=0

(22) with “⊗” is the Kronecker product, 1 = [1, . . . , 1]T is vector contains dmax × lmax ones, I is the identity matrix, d0 = vec(D0 ), and A is a (dmax K ) × (dmax K ) matrix whose diagonal elements are zeros while the off-diagonal elements are made up of μdj as appropriated by the vectorization of D0 .


Defining the following terms

$$\mathbf y = \operatorname{vec}(\mathbf Y),\qquad \mathbf D = \Bigl[\mathbf I\otimes\overset{\downarrow 0}{\mathbf D}\;\Big|\;\mathbf I\otimes\overset{\downarrow 1}{\mathbf D}\;\Big|\;\cdots\;\Big|\;\mathbf I\otimes\overset{\downarrow\phi_{\max}}{\mathbf D}\Bigr]$$

$$\mathbf h = \begin{bmatrix}\operatorname{vec}(\mathbf H^{0})\\ \operatorname{vec}(\mathbf H^{1})\\ \vdots\\ \operatorname{vec}(\mathbf H^{\phi_{\max}})\end{bmatrix},\qquad \boldsymbol\lambda = \begin{bmatrix}\boldsymbol\lambda^{0}\\ \boldsymbol\lambda^{1}\\ \vdots\\ \boldsymbol\lambda^{\phi_{\max}}\end{bmatrix},\qquad \boldsymbol\lambda^{\phi} = \begin{bmatrix}\lambda^{\phi}_{1,1}\\ \lambda^{\phi}_{2,1}\\ \vdots\\ \lambda^{\phi}_{d_{\max},l_{\max}}\end{bmatrix} \tag{23}$$

we can rewrite (22) in terms of h as

$$F(\mathbf h,\boldsymbol\lambda) = \frac{1}{2\sigma^{2}}\bigl\|\mathbf y - \mathbf D\mathbf h\bigr\|^{2} + \boldsymbol\lambda^{T}\mathbf h - (\log\boldsymbol\lambda)^{T}\mathbf 1 + \mathbf d_{0}^{T}\mathbf A\,\mathbf d_{0}. \tag{24}$$

Note that h, 1, and λ are vectors of dimension R × 1, where R = d_max × l_max × (φ_max + 1). To determine λ, we use the expectation-maximization (EM) algorithm and treat h as the hidden variable, so that the log-likelihood function can be optimized with respect to λ. To reiterate our aim, we are not developing a full Bayesian inference on the generative model in (2). Our aims are twofold: to develop a variational approximation to the posterior distribution of H and, by using this approximate distribution, to develop an adaptive strategy to infer the value of each sparsity parameter for every element in H. By using Jensen's inequality, it can be shown that, for any distribution Q(h), the log-likelihood function satisfies

$$\ln p\bigl(\mathbf y\mid\boldsymbol\lambda,\mathbf D,\sigma^{2}\bigr) \ge \int Q(\mathbf h)\,\ln\!\left(\frac{p\bigl(\mathbf y,\mathbf h\mid\boldsymbol\lambda,\mathbf D,\sigma^{2}\bigr)}{Q(\mathbf h)}\right)d\mathbf h. \tag{25}$$

One can easily check that the distribution that maximizes the right-hand side of (25) is given by Q(h) = p(h | y, λ, D, σ²), which is the posterior distribution of h. In this paper, the posterior distribution is represented in the form of the Gibbs distribution

$$Q(\mathbf h) = \frac{1}{Z_{h}}\exp\bigl[-F(\mathbf h)\bigr],\qquad\text{where } Z_{h} = \int\exp\bigl[-F(\mathbf h)\bigr]\,d\mathbf h. \tag{26}$$

The functional form of the Gibbs distribution in (26) is expressed in terms of F(h), and this is crucial as it will enable us to simplify the variational optimization of λ. The maximum likelihood estimate of λ can be expressed as

$$\begin{aligned}\boldsymbol\lambda_{ML} &= \arg\max_{\boldsymbol\lambda}\,\ln p\bigl(\mathbf y\mid\boldsymbol\lambda,\mathbf D,\sigma^{2}\bigr)\\ &= \arg\max_{\boldsymbol\lambda}\int Q(\mathbf h)\ln p\bigl(\mathbf y,\mathbf h\mid\boldsymbol\lambda,\mathbf D,\sigma^{2}\bigr)\,d\mathbf h\\ &= \arg\max_{\boldsymbol\lambda}\int Q(\mathbf h)\bigl[\ln p\bigl(\mathbf y\mid\mathbf h,\sigma^{2},\mathbf D\bigr) + \ln p(\mathbf h\mid\boldsymbol\lambda)\bigr]\,d\mathbf h\\ &= \arg\max_{\boldsymbol\lambda}\int Q(\mathbf h)\ln p(\mathbf h\mid\boldsymbol\lambda)\,d\mathbf h.\end{aligned} \tag{27}$$

Similarly

$$\begin{aligned}\sigma^{2}_{ML} &= \arg\max_{\sigma^{2}}\int Q(\mathbf h)\bigl[\ln p\bigl(\mathbf y\mid\mathbf h,\sigma^{2},\mathbf D\bigr) + \ln p(\mathbf h\mid\boldsymbol\lambda)\bigr]\,d\mathbf h\\ &= \arg\max_{\sigma^{2}}\int Q(\mathbf h)\ln p\bigl(\mathbf y\mid\mathbf h,\sigma^{2},\mathbf D\bigr)\,d\mathbf h.\end{aligned} \tag{28}$$

Since each element of H is constrained to be exponentially distributed with independent decay parameters, this gives p(h | λ) = Π_g λ_g exp(−λ_g h_g), where λ_g is the gth element of λ and h_g is the gth element of h. Therefore, (27) becomes

$$\boldsymbol\lambda_{ML} = \arg\max_{\boldsymbol\lambda}\int Q(\mathbf h)\sum_{g}\bigl(\ln\lambda_{g} - \lambda_{g}h_{g}\bigr)\,d\mathbf h. \tag{29}$$

The Gibbs distribution Q(h) treats h as the dependent variable while assuming all other parameters to be constant. As such, the functional optimization of λ in (29) is obtained by differentiating the terms within the integral with respect to λ_g, and the end result is given by

$$\lambda_{g} = \frac{1}{\displaystyle\int h_{g}\,Q(\mathbf h)\,d\mathbf h}\qquad\text{for } g = 1, 2, \ldots, R. \tag{30}$$

Since p(y | h, D, σ²) = (2πσ²)^{−N₀/2} exp(−(1/2σ²)‖y − Dh‖²) with N₀ = K × L, the iterative update rule for σ²_ML is given by

$$\begin{aligned}\sigma^{2}_{ML} &= \arg\max_{\sigma^{2}}\int Q(\mathbf h)\Bigl[-\frac{N_{0}}{2}\ln\bigl(2\pi\sigma^{2}\bigr) - \frac{1}{2\sigma^{2}}\bigl\|\mathbf y - \mathbf D\mathbf h\bigr\|^{2}\Bigr]\,d\mathbf h\\ &= \frac{1}{N_{0}}\int Q(\mathbf h)\,\bigl\|\mathbf y - \mathbf D\mathbf h\bigr\|^{2}\,d\mathbf h.\end{aligned} \tag{31}$$

Despite the simple forms of (30) and (31), the integrals are difficult to compute analytically, and it is therefore necessary to seek an approximation to Q(h). We note that the solution h naturally partitions its elements into distinct subsets h_P and h_M, consisting of components ∀p ∈ P such that h_p = 0 and components ∀m ∈ M such that h_m > 0. Thus, F(h) can be expressed as follows:

$$\begin{aligned}F(\mathbf h,\boldsymbol\lambda) &= \underbrace{\frac{1}{2\sigma^{2}}\bigl\|\mathbf y - \mathbf D_{M}\mathbf h_{M}\bigr\|^{2} + \boldsymbol\lambda_{M}^{T}\mathbf h_{M} + \mathbf d_{M}^{0T}\mathbf A_{M}\mathbf d_{M}^{0} - (\log\boldsymbol\lambda_{M})^{T}\mathbf 1_{M}}_{F(\mathbf h_{M},\,\boldsymbol\lambda_{M})}\\ &\quad+ \underbrace{\frac{1}{2\sigma^{2}}\bigl\|\mathbf y - \mathbf D_{P}\mathbf h_{P}\bigr\|^{2} + \boldsymbol\lambda_{P}^{T}\mathbf h_{P} + \mathbf d_{P}^{0T}\mathbf A_{P}\mathbf d_{P}^{0} - (\log\boldsymbol\lambda_{P})^{T}\mathbf 1_{P}}_{F(\mathbf h_{P},\,\boldsymbol\lambda_{P})}\\ &\quad+ \underbrace{\frac{1}{2\sigma^{2}}\Bigl[2\bigl(\mathbf D_{M}\mathbf h_{M}\bigr)^{T}\bigl(\mathbf D_{P}\mathbf h_{P}\bigr) - \bigl\|\mathbf y\bigr\|^{2}\Bigr]}_{U}\\ &= F(\mathbf h_{M},\boldsymbol\lambda_{M}) + F(\mathbf h_{P},\boldsymbol\lambda_{P}) + U.\end{aligned} \tag{32}$$

In (32), the term ‖y‖² in U is a constant, and the cross-term (D_M h_M)^T (D_P h_P) measures the orthogonality between D_M h_M and D_P h_P. We simplify the expression in (32) by discounting the contributions from these terms and let F(h) be approximated as F(h, λ) ≈ F(h_M, λ_M) + F(h_P, λ_P).


Given this approximation, Q(h) can be decomposed as

$$Q(\mathbf h,\boldsymbol\lambda) = \frac{1}{Z_{h}}\exp\bigl[-F(\mathbf h,\boldsymbol\lambda)\bigr] \approx \frac{1}{Z_{P}}\exp\bigl[-F(\mathbf h_{P},\boldsymbol\lambda_{P})\bigr]\,\frac{1}{Z_{M}}\exp\bigl[-F(\mathbf h_{M},\boldsymbol\lambda_{M})\bigr] = Q_{P}(\mathbf h_{P})\,Q_{M}(\mathbf h_{M}) \tag{33}$$

where D_P is the sub-matrix of D that corresponds to h_P, D_M is the sub-matrix of D that corresponds to h_M, Z_P = ∫ exp[−F(h_P, λ_P)] dh_P, and Z_M = ∫ exp[−F(h_M, λ_M)] dh_M. In order to characterize Q_P(h_P), we need to allow some positive deviation of h_P (any negative values of h_P will be rejected, since NMF only allows nonnegative values). Hence, h_P must take on zero and positive values in Q_P(h_P). The distribution Q_P(h_P) can be approximated by using the Taylor expansion about the maximum a posteriori (MAP) estimate h^MAP given in (21)

$$\begin{aligned}Q_{P}\bigl(\mathbf h_{P}\ge 0\bigr) &\propto \exp\!\left\{-\left.\frac{\partial F}{\partial\mathbf h_{P}}\right|_{\mathbf h^{MAP}}^{T}\mathbf h_{P} - \frac{1}{2}\mathbf h_{P}^{T}\mathbf C_{P}\mathbf h_{P}\right\}\\ &= \exp\!\left\{-\Bigl[\Bigl(\mathbf C\mathbf h^{MAP} - \frac{1}{\sigma^{2}}\mathbf D^{T}\mathbf y + \boldsymbol\lambda\Bigr)^{T}\Bigr]_{P}\mathbf h_{P} - \frac{1}{2}\mathbf h_{P}^{T}\mathbf C_{P}\mathbf h_{P}\right\}\end{aligned} \tag{34}$$

where C_P = (1/σ²) D_P^T D_P and C = (1/σ²) D^T D. Although Q_P(h_P) is obtained in the form of (34), its integral is difficult to evaluate and does not yield a closed-form analytical expression for the moments, which subsequently prohibits inference of the sparsity parameters. Alternatively, we may variationally approximate Q_P(h_P) by using a fixed-form distribution that yields a closed analytical expression for the moments. Since h_P takes on zero and positive values only, a suitable fixed-form distribution is the factorized exponential distribution given by

$$\hat Q_{P}\bigl(\mathbf h_{P}\ge 0\bigr) = \prod_{p\in P}\frac{1}{u_{p}}\exp\!\left(\frac{-h_{p}}{u_{p}}\right). \tag{35}$$

The variational parameters u = {u_p} for ∀p ∈ P are obtained by minimizing the Kullback–Leibler divergence between Q_P and Q̂_P

$$\begin{aligned}\mathbf u &= \arg\min_{\mathbf u}\int\hat Q_{P}(\mathbf h_{P})\,\ln\frac{\hat Q_{P}(\mathbf h_{P})}{Q_{P}(\mathbf h_{P})}\,d\mathbf h_{P}\\ &= \arg\min_{\mathbf u}\int\hat Q_{P}(\mathbf h_{P})\bigl[\ln\hat Q_{P}(\mathbf h_{P}) - \ln Q_{P}(\mathbf h_{P})\bigr]\,d\mathbf h_{P}.\end{aligned} \tag{36}$$

Solving (36) for u_p leads to the following update [32]:

$$u_{p} \leftarrow \frac{-\hat b_{p} + \sqrt{\hat b_{p}^{2} + 4\,(\hat{\mathbf C}\tilde{\mathbf u})_{p}/\tilde u_{p}}}{2\,(\hat{\mathbf C}\tilde{\mathbf u})_{p}/\tilde u_{p}} \tag{37}$$

where ũ_p denotes the current value of u_p. The detailed derivation for updating u_p is presented in the Appendix. The approximate distribution for the components h_M can be obtained by substituting F(h_M) into Q_M(h_M) as follows:

$$Q_{M}(\mathbf h_{M}) = \frac{1}{Z_{M}}\exp\bigl[-F(\mathbf h_{M},\boldsymbol\lambda_{M})\bigr] \propto \exp\!\left[-\Bigl(\frac{1}{2}\mathbf h_{M}^{T}\mathbf C_{M}\mathbf h_{M} - \frac{1}{\sigma^{2}}\mathbf y^{T}\mathbf D_{M}\mathbf h_{M} + \boldsymbol\lambda_{M}^{T}\mathbf h_{M}\Bigr)\right]. \tag{38}$$

In (38), Q_M(h_M) has a functional form equivalent to a multivariate Gaussian distribution. Therefore, Q_M(h_M) can be represented as the unconstrained Gaussian with mean h_M^MAP and covariance C_M^{-1}, where C_M is the sub-matrix of C. Substituting (33), (35), and (38) into (30), the sparsity parameter can be inferred as

$$\lambda_{g} = \begin{cases}\dfrac{1}{\int h_{g}\,Q_{M}(\mathbf h_{M})\,d\mathbf h_{M}} = \dfrac{1}{h_{g}^{MAP}}, & \text{if } g\in M\\[3mm] \dfrac{1}{\int h_{g}\,\hat Q_{P}(\mathbf h_{P})\,d\mathbf h_{P}} = \dfrac{1}{u_{g}}, & \text{if } g\in P\end{cases} \tag{39}$$

and its covariance X is given by

$$\mathbf X_{ab} = \begin{cases}\bigl(\mathbf C_{M}^{-1}\bigr)_{ab}, & \text{if } a, b\in M\\ u_{a}^{2}\,\delta_{ab}, & \text{otherwise.}\end{cases} \tag{40}$$

Similarly, the inference for σ² can be computed from (31) as

$$\sigma^{2} = \frac{1}{N_{0}}\Bigl[\bigl(\mathbf y - \mathbf D\langle\mathbf h\rangle\bigr)^{T}\bigl(\mathbf y - \mathbf D\langle\mathbf h\rangle\bigr) + \operatorname{Tr}\bigl(\mathbf D^{T}\mathbf D\,\mathbf X\bigr)\Bigr] \tag{41}$$

where

$$\langle h_{g}\rangle = \begin{cases}h_{g}^{MAP}, & \text{if } g\in M\\ u_{g}, & \text{if } g\in P.\end{cases}$$

Algorithm 1 presents the main steps of the proposed method. We term the above algorithm the variational regularized 2-D NMF (VRNMF2-D).

Algorithm 1 Proposed VRNMF2-D Method
1) Initialize D^τ and H^φ with nonnegative random values.
2) Compute Z = Σ_d Σ_τ Σ_φ D_d^{τ↓φ} H_d^{φ→τ} and update u_p using (37).
3) Assign λ_g^φ = 1/H_g^φ if g ∈ M and λ_g^φ = 1/U_g^φ if g ∈ P, where H^φ and U^φ are the matrix representations of (39).
4) Compute σ² = (1/N₀)[(y − D⟨h⟩)^T (y − D⟨h⟩) + Tr(D^T D X)].
5) Update H^φ ← H^φ · [Σ_τ (D^{τ↓φ})^T Y^{←τ}] / [Σ_τ (D^{τ↓φ})^T Z^{←τ} + λ^φ] as in (21).
6) Compute Z = Σ_d Σ_τ Σ_φ D_d^{τ↓φ} H_d^{φ→τ} using the updated H^φ.
7) Update D^τ ← D^τ · [Σ_φ Y^{↑φ} (H^{φ→τ})^T] / [Σ_φ Z^{↑φ} (H^{φ→τ})^T + D^τ Ψ_τ] as in (20).
8) Repeat steps 2 to 7 until convergence to a desired threshold is reached.
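The adaptive part of Algorithm 1 (steps 2 to 4) can be prototyped compactly once the problem is in the vectorized form of (24). The sketch below is a simplified illustration rather than the authors' code: it assumes the τ_max = 0 vectorization, treats the covariance X of (40) as diagonal so that the expression for (41) stays short, and uses illustrative names throughout.

```python
import numpy as np

def update_sparsity(y, Dmat, h, u, lam, sigma2, tol=1e-8):
    """Sketch of the adaptive sparsity update (Algorithm 1, steps 2-4).

    y     : (N0,) vectorized spectrogram;  Dmat : (N0, R) block basis matrix of (23)
    h     : (R,) current code (MAP estimate from the multiplicative update)
    u     : (R,) variational parameters of the factorized exponential (35)
    lam   : (R,) current sparsity parameters.  Returns updated (lam, u, sigma2).
    """
    C = (Dmat.T @ Dmat) / sigma2                     # C = (1/sigma^2) D^T D
    P = h <= tol                                     # passive set: h_p ~ 0
    M = ~P                                           # active set:  h_m > 0

    # variational parameters: positive root of the quadratic (48)
    b_hat = (C @ h - (Dmat.T @ y) / sigma2 + lam)[P]
    CP = C[np.ix_(P, P)]
    C_hat = CP + np.diag(np.diag(CP))                # C_hat = C_P + diag(C_P)
    uP = u[P]
    a = (C_hat @ uP) / np.maximum(uP, tol)           # (C_hat u~)_p / u~_p
    u = u.copy()
    u[P] = (-b_hat + np.sqrt(b_hat ** 2 + 4.0 * a)) / (2.0 * np.maximum(a, tol))

    # sparsity parameters (39): 1/<h_g>, with <h_g> = h_g^MAP (active) or u_g (passive)
    h_mean = np.where(P, u, h)
    lam = 1.0 / np.maximum(h_mean, tol)

    # noise variance (41); Tr(D^T D X) approximated with a diagonal X for brevity
    x_diag = np.where(P, u ** 2, 1.0 / np.maximum(np.diag(C), tol))
    resid = y - Dmat @ h_mean
    sigma2 = (resid @ resid + np.diag(Dmat.T @ Dmat) @ x_diag) / y.size
    return lam, u, sigma2
```

A full iteration would alternate this routine with the multiplicative updates (20) and (21) until convergence, as listed in Algorithm 1.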


Fig. 1. Real basis and code of the simulated mixed data.

Fig. 2. Estimated results based on uniform and constant sparsity factorization with small weight, i.e., λ^φ_{d,l} = 0.01, and without prior on D.

Fig. 3. Estimated results based on uniform and constant sparsity factorization with large weight, i.e., λ^φ_{d,l} = 100, and without prior on D.

Fig. 4. Estimated results based on adaptive sparsity factorization and without prior on D.

III. TOY EXAMPLES

In this section, the SNMF2-D with uniform sparsity parameter and the proposed method are tested on their ability to extract the basis and code from simulated mixed data. The simulated data are generated to have a high degree of pattern overlap. To investigate the effects of μ_{djτ} and λ^φ_{d,l} on the performance of feature extraction, the following four experiments have been conducted.
1) Uniform and constant sparsity factorization with a small weight on sparseness, e.g., λ^φ_{d,l} = 0.01, and without prior on D, i.e., μ_{djτ} = 0.
2) Uniform and constant sparsity factorization with a large weight on sparseness, e.g., λ^φ_{d,l} = 100, and without prior on D.
3) Adaptive sparsity factorization without prior on D.
4) Adaptive sparsity factorization with prior on D.
Fig. 1 shows the real basis (vertical panels) and code (horizontal panels) of the simulated mixed pattern. The basis D consists of one circle feature and one triangle feature. These features are convolved with the codes H given in the top panels to yield the data matrix Y, which is a mixture of both patterns. Figs. 2–5 show the matrix factorization results corresponding to each of the above experiments. It is seen that uniform sparsity factorization with either too small or too large a weight of sparseness fails to identify the correct basis and code. The major reason stems from the high degree of pattern overlap between the circle and triangle features in the mixed dataset.

Fig. 5. Estimated results based on adaptive sparsity factorization and with prior on D, i.e., μ_{djτ} = 10.

Since the sparseness is uncontrolled, the larger parts of the pattern overlap cause more errors in estimating the basis, while the code tends to be more ambiguous. This decreases the possibility of correctly assigning the basis to each feature and subsequently results in poorer extraction and reconstruction performance, as shown in Figs. 2 and 3. For example, the extracted codes (i.e., the upper panels of Fig. 2) are almost identical and thereby cause parts of the triangle and circle features to be missing from the figure.


In Fig. 3, the codes are too sparse and thereby force the algorithm to extract similar features from the mixed patterns. On the other hand, Fig. 4 shows a better extraction result using only the adaptive sparsity parameters (without imposing a prior on D), while Fig. 5 shows the best result when both regularizations (i.e., adaptive sparsity and the modified Gaussian prior) are used. However, if only adaptive sparsity is adopted, the performance may be sub-optimal, which is evident from Fig. 4, where the circle feature has not been fully extracted. Nonetheless, the performance of feature extraction also depends on the correlation between the real bases of the mixed pattern. Visual inspection of Fig. 1 shows that the real bases share some degree of commonality and therefore induce correlation. Thanks to the modified Gaussian prior, this correlation is explicitly modeled by μ_{ijτ} in the proposed method. This enables the estimated basis vectors D^τ_1 and D^τ_2 to take advantage of the correlation in learning the real basis directly from the mixed pattern. This explains why Fig. 5 shows better performance than Fig. 4. Therefore, the analysis results clearly indicate the importance of selecting the correct sparseness λ^φ_{d,l} for each element code and of incorporating the correlation μ_{ijτ} between the different basis vectors in order to arrive at optimal feature extraction performance. In the next section, the proposed method will be further tested on a real application of single channel source separation (SCSS). A series of performance comparisons with other matrix factorization methods will also be presented.

IV. REAL APPLICATION: SOURCE SEPARATION

The proposed method is tested on recorded audio signals. All experiments are conducted using a PC with an Intel Core i5 CPU at 3.2 GHz and 4 GB of RAM. The experiments consist of four types of audio sources (i.e., male speech, female speech, jazz, and piano music) and two types of mixtures (i.e., mixtures of music and speech signals, and mixtures between different types of music signals). All mixtures are sampled at 16 kHz. The frequency axis of the obtained spectrogram is logarithmically scaled and grouped into 175 frequency bins in the range of 50 Hz to 8 kHz with 24 bins per octave. This corresponds to twice the resolution of the equal-tempered musical scale. In the proposed method, the settings are as follows: 1) for the jazz and speech mixture, τ_max = 2, φ_max = 2, and μ_{djτ} = 0.5 for all d, j, τ; 2) for the jazz and piano mixture, τ_max = 6, φ_max = 9, and μ_{djτ} = 2 for all d, j, τ; and 3) for the piano and speech mixture, τ_max = 6, φ_max = 9, and μ_{djτ} = 1.5 for all d, j, τ. These parameters are selected after conducting a Monte Carlo experiment over 100 independent realizations of each mixture. The separation performance is evaluated in terms of the signal-to-distortion ratio (SDR), source-to-artifacts ratio (SAR), and source-to-interference ratio (SIR), which are forms of perceptual measure. MATLAB routines for computing these criteria are obtained from the SiSEC'08 webpage [33], [34].

Fig. 6. Time-domain representation and spectrogram of (a) jazz music, (b) male speech, and (c) mixed signal.

A. Time–Frequency Representation of SCSS

The SCSS problem can be treated as one observation and several unknown sources, namely y(t) = Σ_{d=1}^{d_max} x_d(t), where d = 1, ..., d_max denotes the source number and t = 1, 2, ..., T denotes the time index. The goal is to recover the sources x_d(t) when only the observed mixed signal y(t) is available. The time–frequency (TF) representation of the mixture y(t) is given by Y(f, t_s) = Σ_{d=1}^{d_max} X_d(f, t_s), where Y(f, t_s) and X_d(f, t_s) denote the TF components obtained by applying the short-time Fourier transform (STFT), e.g., Y(f, t_s) = STFT(y(t)). The time slots are given by t_s = 1, 2, ..., T_s and the frequencies by f = 1, 2, ..., F. Since each component is a function of t_s and f, this is represented as Y = [Y(f, t_s)]^{f=1,...,F}_{t_s=1,...,T_s} and X_d = [X_d(f, t_s)]^{f=1,...,F}_{t_s=1,...,T_s}. The power spectrogram is defined as the squared magnitude of the STFT and, hence, its matrix representation is given by |Y|^{.2} ≈ Σ_{d=1}^{d_max} |X_d|^{.2}, where the superscript ".2" represents element-wise operation. The matrices we seek to determine are {|X_d|^{.2}}_{d=1}^{d_max}, which are obtained using the proposed matrix factorization as |X̃_d|^{.2} = Σ_τ Σ_φ D_d^{τ↓φ} H_d^{φ→τ}, where D_d^τ and H_d^φ are estimated using (20) and (21). Once these matrices are estimated, the dth binary mask [35] is constructed according to

$$\mathrm{mask}_{d} = \begin{cases}1, & \text{if } |\tilde{\mathbf X}_{d}|^{.2} > |\tilde{\mathbf X}_{j}|^{.2},\; d\neq j\\ 0, & \text{otherwise.}\end{cases} \tag{42}$$

Finally, the estimated time-domain sources are obtained as x̃_d = STFT^{-1}(Mask_d · Y), where x̃_d = [x̃_d(1), ..., x̃_d(T)]^T denotes the dth recovered source in the time domain.
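The separation stage described above reduces to a few array operations once the per-source spectrogram estimates are available. The sketch below, using SciPy's STFT and inverse STFT with illustrative names, applies the winner-take-all binary mask of (42) to the mixture and resynthesizes each source; the log-frequency regrouping of the spectrogram used in the paper is omitted for brevity, so this is only an outline of the pipeline, not the authors' implementation.

```python
import numpy as np
from scipy.signal import stft, istft

def separate(y, model_spectrograms, fs=16000, nperseg=1024):
    """Binary-mask separation of a single-channel mixture, as in (42).

    y                  : 1-D time-domain mixture sampled at fs
    model_spectrograms : list of nonnegative arrays |X_d|^2 estimated by the
                         factorization; they are assumed to share the (F, Ts)
                         shape of the mixture STFT computed below.
    """
    f, t, Y = stft(y, fs=fs, nperseg=nperseg)          # mixture STFT Y(f, ts)
    stack = np.stack(model_spectrograms)               # (d_max, F, Ts)
    winner = np.argmax(stack, axis=0)                  # strongest source per TF bin
    sources = []
    for d in range(stack.shape[0]):
        mask = (winner == d).astype(float)             # binary mask (42)
        _, x_d = istft(mask * Y, fs=fs, nperseg=nperseg)
        sources.append(x_d)                            # estimated time-domain source
    return sources
```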

Fig. 7. Estimated D^τ_d and H^φ_d using uniform constant sparsity with μ_{djτ} = 0 and λ^φ_{d,l} = 0.01.

Fig. 8. Estimated D^τ_d and H^φ_d using uniform constant sparsity with μ_{djτ} = 0 and λ^φ_{d,l} = 100.

Fig. 9. Estimated D^τ_d and H^φ_d using the proposed adaptive sparsity with μ_{djτ} = 0.

Fig. 10. Spectrograms of the separated results. (a)–(d) Recovered sources using uniform sparsity factorization with μ_{djτ} = 0 and λ^φ_{d,l} = 0.01 and λ^φ_{d,l} = 100, respectively. (e) and (f) Recovered sources using the proposed method.

Fig. 11. Time-domain waveforms of the separated results. (a)–(b) and (c)–(d) Recovered sources using uniform sparsity factorization with μ_{djτ} = 0 and λ^φ_{d,l} = 0.01 and λ^φ_{d,l} = 100, respectively. (e) and (f) Recovered sources using the proposed method.

B. Impact of Regularizations on Matrix Factorization and Source Separation

The impact of regularization is investigated in this section. Specifically, we show that when the sparse constraints are not controlled, the resulting matrix factorization will be under- or over-sparse, which leads to ambiguity in the estimation of the recovered sources. Fig. 6 shows the case of a mixture of male speech and jazz music. Figs. 7 and 8 show the factorization results based on the uniform constant sparsity matrix factorizations described in Section III. The code H_d^φ in Fig. 7 reveals that the resulting factorization is under-sparse when all λ^φ_{d,l} are uniformly fixed to a small value, whereas Fig. 8 shows the case of over-sparse factorization when all λ^φ_{d,l} are uniformly fixed to a large value. It is also noted that the estimates of D_d^τ and H_d^φ are rather coarse and have resulted in poorer estimation of the recovered sources. On the other hand, Fig. 9 shows that the obtained factorization is just sparse enough when the proposed adaptive sparsity parameters are used.

Figs. 10 and 11(a)–(d) clearly reveal that good separation performance requires suitably controlled sparse regularization. In the case of under-sparse factorization, the basis associated with each source repeats too frequently in the spectrogram, and this leads to redundant information that still retains the mixed components, as noted in (a) and (b). In the case of over-sparse factorization, these bases occur too rarely in the spectrogram, and this results in too little information, which does not fully recover the original source, as noted in (c) and (d). The proposed method assigns a regularization parameter to each element code, which is individually and adaptively tuned to yield the optimal number of times the basis of a source recurs in the spectrogram. This is noted in (e) and (f), which clearly show the optimal separation result. Readers may compare the results in Figs. 10 and 11 with those of Fig. 6. In practice, the actual statistics for computing μ_{djτ} given in (13) may be unknown. In this case, the selection of μ_{ijτ} will depend on the type of sources and requires manual setting, which can be obtained by means of a Monte Carlo experiment.

Fig. 12. SDR results as a function of μ_{djτ} and sparseness.

Hence, for the purpose of investigating the effects of μ_{djτ} in conjunction with λ^φ_{d,l} on the separation performance, the following three cases have been considered.
Case 1: No sparseness, λ^φ_{d,l} = 0, with μ_{djτ} varied over 0, 0.5, 1.0, ..., 5.
Case 2: Uniform and constant sparseness, λ^φ_{d,l} = c, with μ_{djτ} varied over 0, 0.5, 1.0, ..., 5.
Case 3: Adaptive sparseness according to (39), with μ_{djτ} varied over 0, 0.5, 1.0, ..., 5.
The constant value of c in Case 2 is determined for each mixture type based on a Monte Carlo experiment of 100 independent realizations. For the mixture of music and speech, c = 1, and for the mixture of music and music, c = 0.4. The separation results in terms of the SDR are given in Fig. 12. According to Fig. 12, Case 3 yields the best overall performance, where the average improvement over the other two cases can be summarized as follows: 1) for the music mixture, the average SDR improvement is 1.1 dB per source; and 2) for the mixtures of music and speech (top two panels of Fig. 12), the improvement is 1.5 dB per source. The results also clearly indicate that there are certain values of μ_{djτ} for which the algorithm performs best. In the case of music and speech mixtures, the best performance is obtained when μ_{djτ} ranges from 0.5 to 1.5. For the music mixture, the best performance is obtained when μ_{djτ} ranges from 1.5 to 2.5. On the contrary, when μ_{djτ} is set either too low or too high, the separation performance tends to degrade. Based on many Monte Carlo realizations of testing, the proper range of μ_{djτ} is between 0 and 3. It is also worth pointing out that the separation results are rather coarse when the factorization is nonregularized (i.e., without prior pdf on D and H). Here, we see that for the music mixture the SDR is only 2.3 dB, and for the mixtures of music and speech the average SDRs are only 3.5 and 2.2 dB, respectively. By incorporating regularization (i.e., using μ_{djτ} and λ^φ_{d,l}), the performance increases significantly for all types of mixture. This is clearly evident in the case of the jazz and speech mixture, where the SDR result scales up to 7.5 dB, whereas without regularization the SDR result is only 3.5 dB. This amounts to a significant 4-dB performance improvement with the proposed regularization compared with no regularization.

Fig. 13. Convergence trajectory of the sparsity parameters (a) λ^{φ=0}_{1,1}, (b) λ^{φ=0}_{1,5}, (c) λ^{φ=0}_{1,10}, and (d) λ^{φ=0}_{1,15}.

C. Adaptive Behavior of Sparsity Parameter

In this section, the adaptive behavior of the sparsity parameters obtained by the proposed method is demonstrated. Several sparsity parameters have been selected to illustrate this adaptive behavior. Fig. 13 shows the convergence trajectories of four adaptive sparsity parameters λ^{φ=0}_{1,1}, λ^{φ=0}_{1,5}, λ^{φ=0}_{1,10}, and λ^{φ=0}_{1,15} corresponding to their respective element codes. All sparsity parameters are initialized as λ^φ_{d,l} = 1 for all d, l, φ and are subsequently adapted according to (39). After 150 iterations, the above sparsity parameters converge to their steady states. By examining Fig. 13, it is noted that the converged steady-state values are significantly different for each sparsity parameter (e.g., λ^{φ=0}_{1,1} = 0.96, λ^{φ=0}_{1,5} = 0.18, λ^{φ=0}_{1,10} = 0.29, and λ^{φ=0}_{1,15} = 0.08), even though they started from the same initial condition. This shows that each element code has its own sparseness. In addition, it is worth pointing out that, in the case of the jazz and speech mixture, the SDR result rises to 7.5 dB when λ^φ_{d,l} is adaptive with μ_{djτ} = 0.5. This represents a 1.3-dB per source improvement over the case of uniform constant sparsity (which is only 6.2 dB in Fig. 12). On the other hand, when no sparsity is imposed on the code, the SDR result immediately deteriorates to 5 dB. This represents a 2.5-dB per source depreciation compared with the proposed adaptive sparsity method. From the above, the results suggest that source separation performance is undermined when the uniform constant sparsity scheme is used. On the contrary, improved performance can be obtained by allowing the sparsity parameters to be individually adapted for each element code. In addition to the convergence trajectories, we have plotted the histogram of the converged adaptive sparsity parameters in Fig. 14.


Fig. 14. Histogram of the converged adaptive sparsity parameters.

TABLE I
COMPARISON RESULTS OF SPARSITY ASSIGNMENT BASED ON THE HISTOGRAM IN FIG. 14

Methods   SDR   SAR   SIR
Case 1    5.0   8.2   9.2
Case 2    5.8   8.7   9.0
Case 3    5.6   8.4   8.8
Case 4    5.4   8.3   9.6
Case 5    7.5   9.4   12.2

The figure suggests that the histogram can be represented as a bimodal distribution. We have used the Gaussian mixture model (GMM) [36] to learn the distribution of this histogram, and the result produces two Gaussian components with means 0.16 and 1.1. The global mean of the GMM is 0.92. With the GMM analysis, we can proceed to further investigate the assignment of sparsity parameters and compare it with the adaptive approach. We shall consider the following sparseness assignments.
Case 1: No sparseness, λ^φ_{d,l} = 0.
Case 2: Uniform and constant sparseness, λ^φ_{d,l} = c = 0.16, which corresponds to the mean of the first Gaussian component of the GMM.
Case 3: Uniform and constant sparseness, λ^φ_{d,l} = c = 1.1, which corresponds to the mean of the second Gaussian component of the GMM.
Case 4: Uniform and constant sparseness, λ^φ_{d,l} = c = 0.92, which corresponds to the global mean of the converged adaptive Λ.
Case 5: Adaptive sparseness according to (39).
In all cases, the regularization parameter for D is set as μ_{djτ} = 0.5. The separation results in terms of the SDR, SAR, and SIR are tabulated in Table I. In Table I, Case 1 shows that source separation without using sparseness leads to the poorest performance among all cases. When the sparsity parameter is set low, as in Case 2, the SDR and SAR performances are slightly better than in Cases 3 and 4, whose sparsity parameters are set higher. Hence, relative to Case 2, it can be said that Case 1 is under-sparse while Cases 3 and 4 are slightly over-sparse.


Nevertheless, it is noted that the SIR performance in Case 2 is not better than in Cases 1 and 4. This therefore suggests that, although Case 2 delivers a good competitive result among all cases of uniform constant sparseness, its performance is not optimum. On the other hand, it is seen that Case 5, which is based on adaptive sparseness, yields the best performance in terms of SDR, SAR, and SIR simultaneously. Based on the above analysis, it is clear that both low and high sparseness are necessary for the attainment of optimally sparse factorization. For example, Fig. 14 shows that, although the majority of the element codes should be assigned high sparseness, a small percentage of codes (approximately 10%) should be assigned low sparseness. However, this situation raises a consequential issue, since it is not possible to determine in advance which particular element code should be assigned low or high sparseness. This poses a difficult problem for the conventional SNMF2-D, which requires manual setting of the sparsity parameters. This therefore calls for adaptive sparseness to be imposed on each element code so that the parameters may be individually and adaptively optimized.

D. Comparison With Other NMF-Based SCSS Methods

In this section, a comparison of the proposed method with other regularized NMF-based source separation methods is undertaken. These consist of the following methods.
1) NMF with temporal continuity and sparseness criteria (NMF-TCS) [37], which is based on factorizing the magnitude spectrogram of the mixed signal into a sum of components, where the regularization incorporates the TCS into the separation framework.
2) NMFD with MUs based on Kullback–Leibler divergence minimization [38].
The comparison results are tabulated in Table II. In general, the above methods deliver good competitive results, especially in terms of the SIR. However, the best overall separation performance is still wielded by the proposed VRNMF2-D. By analyzing the results, we may summarize the average improvement of the proposed method over the NMF-TCS and NMFD methods as follows: 1) for the music mixture, the average improvement is 2.6 and 2.3 dB per source, respectively; 2) for the mixture of jazz music and speech, the improvement is 3.3 and 3 dB per source, respectively; and 3) for the mixture of piano music and speech, the improvement is 2.4 and 2.7 dB per source, respectively. In percentage terms, this translates to an average improvement of 91% against the NMF-TCS and NMFD methods. The reasons can be attributed to the following. First, the bases obtained via the NMF-TCS and NMFD methods are not adequate to capture the temporal dependency of the frequency patterns within the audio signal. Second, the NMFD is not unique: if the data does not span the positive octant adequately, a rotation of D and an opposite rotation of H can give the same results [29]. In NMF-TCS, the sparsity parameter is set manually, and it is therefore difficult to avoid under- or over-sparse resolution of the factorization. On the other hand, our proposed method bypasses the above limitations thanks


TABLE II
SEPARATION RESULTS USING DIFFERENT NMF METHODS

Mixtures                      Methods     SDR   SAR    SIR
Jazz music and speech         VRNMF2-D    7.5   9.5    12
                              NMF-TCS     4.2   6.4    8.3
                              NMFD        4.5   7.2    8.1
Piano music and speech        VRNMF2-D    5.3   9.71   12
                              NMF-TCS     2.9   4.2    7.9
                              NMFD        2.6   5.6    7.2
Jazz music and piano music    VRNMF2-D    5.4   12.4   9.2
                              NMF-TCS     2.8   7.3    7.4
                              NMFD        3.1   8.6    8.3

to the convolutive nature of the factorization, coupled with the adaptive assignment of the sparsity parameters and the inclusion of the correlation among the different basis vectors, which enforces the uniqueness of the factorization and leads to higher accuracy in estimating the temporal information and frequency patterns of the audio signals.

E. Comparison With Subspace Decomposition SCSS Method

One of the latest methods for separating sources from a single channel without training knowledge is the Hilbert-based subspace method [39], which is a hybrid of the Hilbert spectrum with empirical mode decomposition. By computing the spectral projections between the mixed signal and the individual intrinsic mode function components, a set of spectrally independent basis vectors can be obtained. These basis vectors are clustered into a given number of component sources. The Hilbert spectrum of the mixed signal is then projected onto the space spanned by each group of basis vectors, yielding the independent source subspaces. Finally, the estimated time-domain signals are reconstructed by applying the inverse Hilbert transformation to the independent source subspaces. Table III summarizes the results of the Hilbert-based subspace method (Hilbert-SCSS) and the proposed VRNMF2-D method. In comparison, the average performance improvement of our method over the Hilbert-based subspace method can be summarized as follows: 1) for the music mixture, the average SDR improvement is 1.5 dB per source; and 2) for the mixture of speech and music signals, the SDR improvement is 2 dB per source. The performance of the Hilbert-based subspace method relies too heavily on the derived frequency-independent basis vectors, which are stationary over time. Therefore, good separation results can only be obtained if the basis vectors are statistically independent over time. In addition, the distinguishability of the corresponding amplitude weighting vectors is also highly dependent on the independence of the basis vectors. Since the frequency features are too similar, it becomes difficult to obtain independent basis vectors. Hence, Table III shows relatively poorer performance when separating mixtures of speech and music signals. Compared with the Hilbert-based subspace method, the proposed method renders a more optimal part-based factorization. The factorization is unique under certain conditions (e.g., adaptive sparseness and nonnegative components), making it unnecessary to impose constraints in the form of statistical independence between the original sources.

TABLE III
SEPARATION RESULTS BETWEEN HILBERT-BASED SUBSPACE METHOD AND THE PROPOSED METHOD

Mixtures                      Methods        SDR   SAR    SIR
Jazz music and speech         VRNMF2-D       7.5   9.5    12
                              Hilbert-SCSS   3.6   8.34   8.2
Piano music and speech        VRNMF2-D       5.3   9.71   12
                              Hilbert-SCSS   3.2   7.23   7.1
Jazz music and piano music    VRNMF2-D       5.4   12.4   9.2
                              Hilbert-SCSS   3.9   7.32   6.7

Furthermore, the basis D^τ and sparse code H^φ in our proposed algorithm are derived separately and are nonstationary over time, thus leading to more robust separation results compared with the stationary basis vectors obtained from the Hilbert-based subspace method.

V. CONCLUSION

This paper presented a novel regularized nonnegative matrix 2-D factorization method. The impetus behind the proposed work is that the sparseness achieved by the conventional SNMF2-D is not efficient enough; in feature extraction and source separation, it is necessary to control the degree of sparseness explicitly for each element code. In addition, the conventional method does not incorporate correlation information between different basis vectors into the factorization process. The proposed method addresses the above and enjoys at least two significant advantages: first, the sparse regularization term is adaptively tuned to obtain the desired sparse decomposition; second, the modified Gaussian prior is formulated to express the correlation between different basis vectors, thus enabling the features of the components to be extracted more efficiently. In the proposed algorithm, the prior distribution of H [e.g., (16)] and the approximate distribution [e.g., (26)] are different. This is an open problem, and future research will attempt to address this point in order to arrive at a more accurate sparse solution.

APPENDIX
FORMULATION OF THE UPDATE RULE FOR u_p

In (36),

$$\begin{aligned}\int\hat Q_{P}(\mathbf h_{P})\ln\hat Q_{P}(\mathbf h_{P})\,d\mathbf h_{P} &= \sum_{p\in P}\int\hat Q_{P}(h_{p})\ln\hat Q_{P}(h_{p})\,dh_{p}\\ &= \sum_{p\in P}\int_{0}^{\infty}\frac{1}{u_{p}}\exp\bigl(-h_{p}/u_{p}\bigr)\bigl[-\ln u_{p} - h_{p}/u_{p}\bigr]\,dh_{p}\\ &= -\sum_{p\in P}\int_{0}^{\infty}\ln u_{p}\,\exp\bigl(-h_{p}/u_{p}\bigr)\,d\!\left(\frac{h_{p}}{u_{p}}\right) - \sum_{p\in P}\int_{0}^{\infty}\frac{h_{p}}{u_{p}}\exp\bigl(-h_{p}/u_{p}\bigr)\,d\!\left(\frac{h_{p}}{u_{p}}\right)\\ &= -\sum_{p\in P}\bigl(\ln u_{p} + 1\bigr)\end{aligned} \tag{43}$$


and

$$\begin{aligned}\int\hat Q_{P}(\mathbf h_{P})\ln Q_{P}(\mathbf h_{P})\,d\mathbf h_{P} &= -\int\hat Q_{P}(\mathbf h_{P})\Bigl[\Bigl(\mathbf C\mathbf h^{MAP} - \frac{1}{\sigma^{2}}\mathbf D^{T}\mathbf y + \boldsymbol\lambda\Bigr)_{P}^{T}\mathbf h_{P} + \frac{1}{2}\mathbf h_{P}^{T}\mathbf C_{P}\mathbf h_{P}\Bigr]\,d\mathbf h_{P}\\ &= -\frac{1}{2}\sum_{p\in P}\sum_{m\in P}C_{pm}\langle h_{p}h_{m}\rangle - \sum_{p\in P}\Bigl(\mathbf C\mathbf h^{MAP} - \frac{1}{\sigma^{2}}\mathbf D^{T}\mathbf y + \boldsymbol\lambda\Bigr)_{p}\langle h_{p}\rangle\end{aligned} \tag{44}$$

where ⟨·⟩ denotes the expectation under the Q̂_P(h_P) distribution, such that ⟨h_p h_m⟩ = u_p u_m and ⟨h_p⟩ = u_p, which leads to the minimization

$$\min_{\mathbf u}\;\hat{\mathbf b}_{P}^{T}\mathbf u + \frac{1}{2}\mathbf u^{T}\hat{\mathbf C}\mathbf u - \sum_{p\in P}\ln u_{p} \tag{45}$$

where b̂_P = (C h^MAP − (1/σ²) D^T y + λ)_P and Ĉ = C_P + diag(C_P). The optimization of (45) can be accomplished by expanding (45) as follows:

$$G(\mathbf u,\tilde{\mathbf u}) = \hat{\mathbf b}_{P}^{T}\mathbf u + \frac{1}{2}\sum_{p\in P}\frac{(\hat{\mathbf C}\tilde{\mathbf u})_{p}}{\tilde u_{p}}\,u_{p}^{2} - \sum_{p\in P}\ln u_{p}. \tag{46}$$

Taking the derivative of G(u, ũ) in (46) with respect to u and setting it to zero, we have

$$\frac{(\hat{\mathbf C}\tilde{\mathbf u})_{p}}{\tilde u_{p}}\,u_{p} + \hat b_{p} - \frac{1}{u_{p}} = 0. \tag{47}$$

The above equation is equivalent to the following quadratic equation:

$$\frac{(\hat{\mathbf C}\tilde{\mathbf u})_{p}}{\tilde u_{p}}\,u_{p}^{2} + \hat b_{p}\,u_{p} - 1 = 0. \tag{48}$$

REFERENCES

[1] W. T. Zhang and S. T. Lou, "Iterative algorithm for joint zero diagonalization with application in blind source separation," IEEE Trans. Neural Netw., vol. 22, no. 7, pp. 1107–1118, Nov. 2011.
[2] S. C. Alvarez, A. Cichocki, and L. C. Ribas, "An iterative inversion approach to blind source separation," IEEE Trans. Neural Netw., vol. 11, no. 6, pp. 1423–1437, Nov. 2000.
[3] G. Zhou, Z. Yang, S. Xie, and J. Yang, "Online blind source separation using incremental nonnegative matrix factorization with volume constraint," IEEE Trans. Neural Netw., vol. 22, no. 4, pp. 550–560, Nov. 2011.
[4] D. D. Lee and H. S. Seung, "Algorithms for non-negative matrix factorization," in Proc. NIPS, 2000, pp. 556–562.
[5] P. Paatero and U. Tapper, "Positive matrix factorization: A non-negative factor model with optimal utilization of error estimates of data values," Environmetrics, vol. 5, no. 2, pp. 111–126, 1994.
[6] A. Ozerov and C. Févotte, "Multichannel nonnegative matrix factorization in convolutive mixtures for audio source separation," IEEE Trans. Audio, Speech, Lang. Process., vol. 18, no. 3, pp. 550–563, Mar. 2010.
[7] D. Lee and H. Seung, "Learning the parts of objects by nonnegative matrix factorization," Nature, vol. 401, no. 6755, pp. 788–791, 1999.


[8] D. Donoho and V. Stodden, "When does non-negative matrix factorization give a correct decomposition into parts?" in Proc. Adv. Neural Inf. Process. Syst., vol. 17, 2003, pp. 1–8.
[9] N. Bertin, R. Badeau, and E. Vincent, "Enforcing harmonicity and smoothness in Bayesian non-negative matrix factorization applied to polyphonic music transcription," IEEE Trans. Audio, Speech, Lang. Process., vol. 18, no. 3, pp. 538–549, Mar. 2010.
[10] E. Vincent, N. Bertin, and R. Badeau, "Adaptive harmonic spectral decomposition for multiple pitch estimation," IEEE Trans. Audio, Speech, Lang. Process., vol. 18, no. 3, pp. 528–537, Mar. 2010.
[11] P. Smaragdis and J. C. Brown, "Non-negative matrix factorization for polyphonic music transcription," in Proc. IEEE Workshop Appl. Signal Process. Audio Acoust., Oct. 2003, pp. 177–180.
[12] Y. C. Cho and S. Choi, "Nonnegative features of spectro-temporal sounds for classification," Pattern Recognit. Lett., vol. 26, no. 9, pp. 1327–1336, Jul. 2005.
[13] M. D. Plumbley, "Algorithms for non-negative independent component analysis," IEEE Trans. Neural Netw., vol. 14, no. 3, pp. 534–543, May 2003.
[14] R. Zdunek and A. Cichocki, "Nonnegative matrix factorization with constrained second-order optimization," Int. J. Signal Process., vol. 87, no. 8, pp. 1904–1916, 2007.
[15] P. Sajda, S. Du, T. Brown, R. Stoyanova, D. Shungu, X. Mao, and L. Parra, "Nonnegative matrix factorization for rapid recovery of constituent spectra in magnetic resonance chemical shift imaging of the brain," IEEE Trans. Med. Imag., vol. 23, no. 12, pp. 1453–1465, Dec. 2004.
[16] C.-J. Lin, "Projected gradient methods for nonnegative matrix factorization," Int. J. Neural Comput., vol. 19, no. 10, pp. 2756–2779, Oct. 2007.
[17] O. Okun and H. Priisalu, "Unsupervised data reduction," Int. J. Signal Process., vol. 87, no. 9, pp. 2260–2267, 2007.
[18] R. Kompass, "A generalized divergence measure for nonnegative matrix factorization," Int. J. Neural Comput., vol. 19, no. 3, pp. 780–791, 2007.
[19] A. Cichocki, R. Zdunek, and S. I. Amari, "Csiszár's divergences for nonnegative matrix factorization: Family of new algorithms," in Proc. Int. Conf. Indepen. Comp. Anal. Blind Signal Separ., vol. 3889, Charleston, SC, Mar. 2006, pp. 32–39.
[20] P. O. Hoyer, "Non-negative matrix factorization with sparseness constraints," Int. J. Mach. Learn. Res., vol. 5, pp. 1457–1469, Aug. 2004.
[21] K. Stadlthanner, F. J. Theis, C. G. Puntonet, and E. W. Lang, "Extended sparse nonnegative matrix factorization," in Proc. Artif. Neural Netw., vol. 3512, 2005, pp. 249–256.
[22] S. S. Chen, D. L. Donoho, and M. A. Saunders, "Atomic decomposition by basis pursuit," Int. J. Sci. Comput., vol. 20, no. 1, pp. 33–61, 1998.
[23] M. J. L. Orr, "Regularization in the selection of radial basis function centers," Int. J. Neural Comput., vol. 7, no. 3, pp. 606–623, May 1995.
[24] A. T. Cemgil, "Bayesian inference for nonnegative matrix factorization models," Int. J. Comput. Intell. Neurosci., vol. 2009, p. 785152, Jan. 2009.
[25] S. Moussaoui, D. Brie, A. Mohammad-Djafari, and C. Carteret, "Separation of non-negative mixture of non-negative sources using a Bayesian approach and MCMC sampling," IEEE Trans. Signal Process., vol. 54, no. 11, pp. 4133–4145, Nov. 2006.
[26] M. N. Schmidt, O. Winther, and L. K. Hansen, "Bayesian non-negative matrix factorization," in Proc. 8th Int. Conf., Paraty, Brazil, 2009, pp. 540–547.
[27] R. Salakhutdinov and A. Mnih, "Bayesian probabilistic matrix factorization using Markov chain Monte Carlo," in Proc. 25th Int. Conf. Mach. Learn., 2008, pp. 880–887.
[28] M. Mørup and K. L. Hansen, "Tuning pruning in sparse non-negative matrix factorization," in Proc. 17th Eur. Signal Process. Conf., Glasgow, Scotland, 2009, pp. 1–5.
[29] M. Mørup and M. N. Schmidt, "Sparse non-negative matrix factor 2-D deconvolution," Inf. Math. Model., Univ. Denmark, Lyngby, Denmark, Tech. Rep., 2006.
[30] M. N. Schmidt and M. Mørup, "Nonnegative matrix factor 2-D deconvolution for blind single channel source separation," in Proc. Int. Conf. Indepen. Comp. Anal. Blind Signal Separat., vol. 3889, 2006, pp. 700–707.


[31] M. Harva and A. Kaban, "Variational learning for rectified factor analysis," Signal Process., vol. 87, no. 3, pp. 509–527, 2007.
[32] D. P. Bertsekas, Nonlinear Programming, 2nd ed. Belmont, MA: Athena Scientific, 1999.
[33] Signal Separation Evaluation Campaign (SiSEC 2008), 2008. [Online]. Available: http://sisec.wiki.irisa.fr
[34] E. Vincent, R. Gribonval, and C. Févotte, "Performance measurement in blind audio source separation," IEEE Trans. Audio, Speech, Lang. Process., vol. 14, no. 4, pp. 1462–1469, Jul. 2006.
[35] D. L. Wang, "On ideal binary mask as the computational goal of auditory scene analysis," in Speech Separation by Humans and Machines, P. Divenyi, Ed. Norwell, MA: Kluwer, 2005, pp. 181–197.
[36] L. R. Rabiner, "A tutorial on hidden Markov models and selected applications in speech recognition," Proc. IEEE, vol. 77, no. 2, pp. 257–286, Feb. 1989.
[37] T. Virtanen, "Monaural sound source separation by nonnegative matrix factorization with temporal continuity and sparseness criteria," IEEE Trans. Audio, Speech, Lang. Process., vol. 15, no. 3, pp. 1066–1074, Mar. 2007.
[38] P. Smaragdis, "Non-negative matrix factor deconvolution; extraction of multiple sound sources from monophonic inputs," in Proc. 5th Int. Conf. Independent Compon. Anal. Blind Signal Separat., Granada, Spain, Sep. 2004, pp. 1–8.
[39] M. K. I. Molla and K. Hirose, "Single-mixture audio source separation by subspace decomposition of Hilbert spectrum," IEEE Trans. Audio, Speech, Lang. Process., vol. 15, no. 3, pp. 893–900, Mar. 2007.

Bin Gao received the B.S. degree in communications and signal processing from Southwest Jiaotong University, Chengdu, China, in 2005, and the M.Sc. degree (with distinction) in communications and signal processing and the Ph.D. degree from Newcastle University, Newcastle upon Tyne, U.K., in 2007 and 2011, respectively. His Ph.D. research, supervised by Prof. Woo and Prof. Dlay, focused on single channel blind source separation. He is currently a Research Associate with Newcastle University. His current research interests include audio and image processing, machine learning, and structured probabilistic modeling for audio applications such as source separation, feature extraction, and denoising.

W. L. Woo (M'08) was born in Malaysia. He received the B.E. degree (first class hons.) in electrical and electronics engineering and the Ph.D. degree from Newcastle University, Newcastle upon Tyne, U.K. He is currently a Senior Lecturer with the School of Electrical, Electronic and Computer Engineering, Newcastle University. He has an extensive portfolio of relevant research, which has been supported by a variety of funding agencies, and has published over 250 papers on these topics in journals and international conference proceedings. He acts as a Consultant to a number of industrial companies on the use of statistical signal and image processing techniques. His current research interests include mathematical theory and algorithms for nonlinear signal and image processing, machine learning for signal processing, blind source separation, multidimensional signal processing, and signal/image deconvolution and restoration. Dr. Woo is a member of the Institution of Engineering and Technology. He received the IEE Prize and the British Scholarship in 1998. He serves on the editorial boards of several international signal processing journals, and actively participates in international conferences and workshops, serving on their organizing and technical committees.

S. S. Dlay received the B.Sc. (hons.) degree in electrical and electronic engineering and the Ph.D. degree in very large-scale integration design from Newcastle University, Newcastle upon Tyne, U.K. He was a Post-Doctoral Research Associate with Newcastle University in 1984 and helped to establish an Integrated Circuit Design Centre funded by the Engineering and Physical Sciences Research Council (EPSRC). In 1984, he became a Lecturer with the Department of Electronic Systems Engineering, University of Essex, Colchester, U.K. In 1986, he rejoined Newcastle University as a Lecturer with the School of Electrical, Electronic and Computer Engineering, and in 2001 he was promoted to Senior Lecturer. He has published over 250 research papers on topics ranging from biometrics and security to biomedical signal processing and the implementation of signal processing architectures. Prof. Dlay is a college member of the EPSRC and has been appointed a Personal Chair in Signal Processing Analysis. He received an EPSRC Scholarship and the Charles Hertzmann Award. He serves on many editorial boards and has played an active role in numerous international conferences, serving on technical and advisory committees as well as organizing special sessions.
