Automated Graph Regularized Projective Nonnegative Matrix Factorization for Document Clustering

Xiaobing Pei, Tao Wu, and Chuanbo Chen

Abstract—In this paper, a novel projective nonnegative matrix factorization (PNMF) method for enhancing clustering performance is presented, called automated graph regularized projective nonnegative matrix factorization (AGPNMF). The idea of AGPNMF is to extend the original PNMF by incorporating an automated graph regularization constraint into the PNMF decomposition. The key advantage of this approach is that AGPNMF simultaneously finds the graph weights matrix and the dimensionality reduction of the data. AGPNMF seeks to extract a data representation space that preserves the local geometric structure. This property makes AGPNMF more intuitive and more powerful than the original method for clustering tasks. The kernel trick is used to extend the AGPNMF model to input spaces transformed by a nonlinear map. The proposed method has been applied to the problem of document clustering using the well-known Reuters-21578, TDT2, and SECTOR data sets. Our experimental evaluations show that the proposed method enhances the performance of PNMF for document clustering.

Index Terms—Clustering, projective nonnegative matrix factorization.

I. Introduction

NONNEGATIVE MATRIX FACTORIZATION (NMF), proposed by Lee and Seung [1], [2], decomposes a matrix X into two nonnegative low-rank matrices W (basis matrix) and H (encoding matrix) such that X ≈ WH, which is useful for finding the basis information of nonnegative data. It is an emerging technique for dimensionality reduction. Recently, there have been attempts to extend and apply the method to diverse fields of science, such as face and object recognition [3]–[5], biomedical applications [6], clustering and classification [7], [8], and so on.

For some tasks, many authors have proposed extending the NMF algorithms with additional constraints placed on W and H. Zafeiriou et al. [9] and Kotsia et al. [10] incorporated discriminant information into the NMF decomposition with application to facial image processing problems. Wang et al. [11], Xue et al. [12], and Buciu and Pitas [13] proposed formulations that enforce constraints based on Fisher linear discriminant analysis for improved determination of spatially localized features. In [14], a technique for imposing additional constraints on the NMF minimization algorithm was proposed. This technique, the so-called local nonnegative matrix factorization (LNMF), is an extension of NMF that yields even more localized bases. Hoyer [15] proposed a method that combines a sparse coding constraint with the original NMF, which enforces statistical sparsity of the H matrix. As the sparsity of H increases, the basis vectors become more localized, i.e., the parts-based representation of the data in W becomes more and more pronounced. Hoyer [16] showed how explicitly incorporating the notion of sparseness improves the found decompositions. Zhan et al. [17] presented an extension of NMF that imposes an orthogonality constraint on the basis matrix and controls the sparseness of the coefficient matrix for robust learning of compact, local, part-based representations of face images. O'Grady and Pearlmutter [18] presented a convolutive extension of NMF that includes a sparseness constraint, and showed that the addition of a sparseness constraint can lead to the discovery of appropriate over-complete representations in music.

In recent years, more attention has been given to NMF for its application to clustering. Xu et al. [19] used NMF for document clustering and reported superior performance. Ding et al. [20] proposed convex NMF (CNMF), which restricts the feature basis to be a convex combination of the samples. A variant of CNMF, called cluster-NMF, is also proposed in [20]. Yuan and Oja [21] and Yang and Oja [22] proposed a variant of NMF called projective NMF (PNMF). PNMF approximately factorizes a projection matrix, minimizing the reconstruction error, into a positive low-rank matrix and its transpose. However, these PNMF methods do not use the intrinsic structure information of the original data. Ding et al. [23] presented a systematic analysis of three-factor NMF for clustering and provided new updating rules. Ding et al. [24] presented a systematic analysis and extensions of NMF and showed that NMF is equivalent to kernel k-means clustering and spectral clustering. Ding et al. [25] proved that KL-divergence-based NMF is identical to probabilistic LSI. These methods also do not use the intrinsic structure information of the original data.



In this paper, a novel PNMF method for enhancing clustering performance is presented, called automated graph regularized projective nonnegative matrix factorization (AGPNMF), which extends the original PNMF by incorporating an automated graph regularization constraint into the PNMF model. Similar to most dimensionality reduction methods, AGPNMF finds a small neighborhood around each data point and connects each point to its neighbors with appropriate weights. Fixing both the neighbors and the weights in advance is too strong a constraint for constructing the data weights matrix. Our goal is to find both neighbors and weights automatically by incorporating this search into the PNMF optimization. The key advantage of this approach is that AGPNMF simultaneously finds the graph weights matrix and the dimensionality reduction of the data. The method enforces the locality preserving property by incorporating locality preserving constraints into the original PNMF decomposition. Thus, a representation space with the locality preserving property is obtained from the given data sets using AGPNMF. The kernel trick is used to generalize the AGPNMF model to arbitrary nonlinear similarity functions. The proposed method has been applied to the problem of document clustering using the well-known Reuters-21578, TDT2, and SECTOR data sets. Our experimental evaluations show that the proposed method enhances the performance of PNMF for document clustering.

The remainder of this paper is organized as follows. In Section II we give a brief description of numerical approaches for the solution of the nonnegative matrix factorization problem. The automated graph regularized PNMF algorithm is presented in Section III. Section IV discusses the Kernel-AGPNMF model. The proposed method is applied to the problem of document clustering and the related experiments are described in Section V. Finally, conclusions are drawn in Section VI.

II. NMF Revisited

NMF, as a linear and nonnegative data representation technique, imposes a nonnegativity constraint on the learned basis vectors and encoding vectors as part of a subspace approach. Suppose a nonnegative data matrix X = [x_1, x_2, ..., x_n] ∈ R^{m×n} consists of n data points x_i, each with dimensionality m. Given a predetermined positive integer r < min(n, m), the data matrix X is approximately factorized into two matrices: an m×r basis matrix W (W ≥ 0) and an n×r coefficient matrix H (H ≥ 0). That is, NMF finds two nonnegative matrices W and H such that the product WH^T approximates X, namely X ≈ WH^T. W and H are smaller than the original matrix X, which results in a compressed version of the original data matrix X.

A. NMF Method

In order to find an approximate factorization X ≈ WH^T, the optimal choices of the matrices W and H are defined as those nonnegative matrices that minimize the Euclidean distance between X and WH^T. The NMF method was formulated in [1], [2] as

min_{W,H} f(W, H) = ||X − WH^T||^2, subject to W ≥ 0, H ≥ 0.    (1)

A popular approach, proposed by Lee and Seung [2], is the multiplicative update algorithm for problem (1). The Lee and Seung NMF algorithm alternately updates W ≥ 0 and H ≥ 0, fixing the other, by taking a step in a certain weighted negative gradient direction of the function f(W, H) defined in (1). For solving the optimization problem (1), the update rules presented by Lee and Seung in [2] are

H_{ij} = H_{ij} (X^T W)_{ij} / (H W^T W)_{ij}    (2)

W_{ij} = W_{ij} (X H)_{ij} / (W H^T H)_{ij}.    (3)
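For readers who want to experiment, the following is a minimal NumPy sketch of the multiplicative updates (2) and (3); the random initialization, the iteration count, and the small epsilon guard in the denominators are implementation choices made for this sketch, not details taken from [2].

```python
import numpy as np

def nmf_multiplicative(X, r, n_iter=200, eps=1e-10, seed=0):
    """Sketch of the Lee-Seung updates (2)-(3) for X ~ W H^T.

    X is an m x n nonnegative matrix, W is m x r, H is n x r,
    following the notation of Section II.
    """
    rng = np.random.default_rng(seed)
    m, n = X.shape
    W = rng.random((m, r))
    H = rng.random((n, r))
    for _ in range(n_iter):
        # Update (2): H_ij <- H_ij (X^T W)_ij / (H W^T W)_ij
        H *= (X.T @ W) / (H @ W.T @ W + eps)
        # Update (3): W_ij <- W_ij (X H)_ij / (W H^T H)_ij
        W *= (X @ H) / (W @ H.T @ H + eps)
    return W, H
```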

Starting from a positive pair (W, H), the updates (2) and (3) keep all entries positive in the subsequent iterates. In addition, Lee and Seung [2] proved that the updates cause the function value to be nonincreasing, namely, for any two successive iterations t and t + 1,

f(W^{t+1}, H^{t+1}) ≤ f(W^t, H^t)

where strict inequality holds unless the gradient of f(W, H) vanishes.

B. PNMF Method

Yuan and Oja [21] and Yang and Oja [22] introduced PNMF, which approximately factorizes a projection matrix, minimizing the reconstruction error, into a positive low-rank matrix and its transpose. The optimization model of PNMF is given by

min_W ||X − WW^T X||^2, subject to W ≥ 0    (4)

or

min_W ||X − WH^T||^2, subject to H = X^T W, W ≥ 0.    (5)

The optimization problems (4) and (5) mainly take account of the basis matrix W in order to obtain the projection matrix. Since the NMF model in this paper attempts to address a clustering problem, we should take account of the coefficient matrix H. Then, the optimization problem (5) is rewritten as follows (please see the experiment code given in [22]):

min_H ||X − WH^T||^2, subject to W = XH, H ≥ 0.    (6)

We rewrite the constraint W = XH as follows:

w_l = h_{1l} x_1 + ... + h_{nl} x_n,  l = 1, 2, ..., r.    (7)

Then we can interpret the constraint (7) in the following way. Each column w_l is a convex combination of the columns of the input data X; in particular, these columns capture a notion of centroids for clustering. If we now interpret the entries of H as posterior cluster probabilities (cluster indicators), this motivates the factorization model [20]

min_H ||X − XHH^T||^2, subject to H ≥ 0.    (8)


The optimization model (8) is also called cluster-NMF [20] because the degree of freedom in this factorization is the cluster indicator H.
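As a small illustration of this clustering reading of (8), the sketch below (in NumPy, assuming X is m×n and H is n×r) evaluates the cluster-NMF objective and reads cluster labels from H; taking the row-wise argmax of H as the cluster assignment is our illustrative choice, consistent with interpreting the entries of H as posterior cluster probabilities.

```python
import numpy as np

def cluster_nmf_objective(X, H):
    """Objective (8): ||X - X H H^T||_F^2 for nonnegative H (n x r)."""
    residual = X - X @ H @ H.T
    return np.sum(residual * residual)

def cluster_labels(H):
    """Assign each document (row of H) to the cluster with the largest entry."""
    return np.argmax(H, axis=1)
```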

III. PNMF with Automated Graph Constraints

Recall that PNMF and cluster-NMF try to find posterior cluster probabilities by approximately factorizing a matrix into a positive low-rank matrix and its transpose [20], [22]. However, these PNMF and cluster-NMF methods do not use the intrinsic structure information of the original data. One might hope that knowledge of the local geometric structure of the data can be exploited for better discovery of these posterior cluster probabilities. In this section, we propose an automated graph regularized PNMF algorithm, which incorporates an automated graph regularization constraint into the projective NMF decomposition.

A. Derivation of AGPNMF Model

1) Formulation for the Local Scatter: Belkin and Niyogi [26] presented a geometrically motivated algorithm for reconstructing a representation of data samples drawn from a low-dimensional manifold embedded in a high-dimensional space. The algorithm is very simple, involving a few local computations and one sparse eigenvalue problem. It provides a computationally efficient approach to nonlinear dimensionality reduction that has a low-dimensional, neighborhood-preserving character and a natural connection to clustering [26]. Since the PNMF model in this section attempts to address a clustering problem, we should take account of the local scatter of the samples. The local scatter can be characterized as follows: if two samples x_i and x_j are close in the intrinsic geometry of the data distribution, then h_i and h_j, the cluster indicators of these two samples, are also close to each other. Specifically, two data points x_i and x_j are viewed as lying within a local neighborhood provided that x_j belongs to the set of l-nearest neighbors of x_i or x_i belongs to the set of l-nearest neighbors of x_j. After performing PNMF, we obtain the cluster indicators h_i and h_j for the samples x_i and x_j. The local scatter is then defined by

J_L = (1/2) Σ_{i,j=1}^{n} ||z_i − z_j||^2 R_{ij} = Tr(H^T D H) − Tr(H^T R H) = Tr(H^T L H)    (9)

where z_j = [h_{j1}, h_{j2}, ..., h_{jr}]^T, R is the local scatter weights matrix, Tr(·) denotes the trace of a matrix, and D is a diagonal matrix whose entries are the column sums of R, i.e., D_{jj} = Σ_l R_{jl}, with L = D − R.

In order to ensure that the cluster indicators of the data X preserve the local scatter structure of the original samples, we should make the objective function J_L as small as possible, i.e., if two instances are linked by R, then their cluster indicators in the representation space are close as well.
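The following short sketch, assuming NumPy and a given nonnegative symmetric weights matrix R, computes J_L both via Tr(H^T L H) and via the pairwise form in (9); it is only meant to make the definition concrete.

```python
import numpy as np

def local_scatter(H, R):
    """J_L from eq. (9): Tr(H^T L H) with D_jj = sum_l R_jl and L = D - R."""
    D = np.diag(R.sum(axis=1))
    L = D - R
    return np.trace(H.T @ L @ H)

def local_scatter_pairwise(H, R):
    """Equivalent pairwise form 0.5 * sum_{i,j} ||z_i - z_j||^2 R_ij (sanity check)."""
    n = H.shape[0]
    return 0.5 * sum(R[i, j] * np.sum((H[i] - H[j]) ** 2)
                     for i in range(n) for j in range(n))
```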


2) Formulation for Learning the Local Scatter Weights Matrix R: Roweis and Saul [27] presented a nonlinear dimensionality reduction algorithm, called locally linear embedding (LLE), which is a geometrically motivated algorithm for reconstructing a representation of data samples drawn from a low-dimensional manifold embedded in a high-dimensional space. For data X, LLE expects each data point and its neighbors to lie on or close to a locally linear patch of the manifold, which governs how the weight coefficients S are constructed from [27]

min_S Σ_i ||x_i − Σ_{j∈N_i} S_{ij} x_j||^2    (10)

where the weight matrix S summarizes the contribution of the jth data point to the reconstruction of the ith data point, and N_i is the k-NN neighborhood of x_i. The original LLE in (10) requires that the weight matrix S preserve the k-NN neighborhood structure, i.e., S_{ij} ≠ 0 only for j ∈ N_i. This constraint is too strong for constructing the data weights matrix S. Thus, Kong et al. [28] presented a new approach to learn the pairwise similarity matrix S by relaxing the constraint so that S_{ij} may be nonzero even for j ∉ N_i, where S_{ij} represents the ith data point's contribution to the reconstruction of data point x_j. The approach for learning S is [28]

min_S ||X − XS||^2 + α Tr(S^T S) + β||S||_1, subject to S ≥ 0, α ≥ 0, β ≥ 0    (11)

where α and β are regularization parameters and ||S||_1 = Σ_{ij} |S_{ij}|. This approach to learning the similarity matrix bypasses k-NN entirely and computes S automatically through the added regularization terms Tr(S^T S) and ||S||_1. The local scatter weights matrix R in the objective function (9) is likewise too strong if it must preserve the k-NN neighborhood structure. We relax this to let R be nonzero elsewhere, following the same idea as in [28]. In other words, the local scatter weights matrix R can be given by the solution of the optimization problem (11) [28]. In this paper, we require R to be symmetric, and we obtain

R = (S + S^T) / 2.    (12)
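As an illustration of how R can be obtained in practice, the sketch below applies a multiplicative rule for problem (11) and then symmetrizes via (12). The update used here is the paper's rule (16) with the graph term dropped (γ = 0); the parameter values, iteration count, and epsilon guard are illustrative assumptions rather than prescribed settings.

```python
import numpy as np

def learn_similarity(X, alpha=0.01, beta=0.01, n_iter=200, eps=1e-10, seed=0):
    """Sketch of solving (11) and symmetrizing via (12).

    Multiplicative rule (rule (16) with gamma = 0):
    S_ik <- S_ik (X^T X)_ik / ((X^T X S + alpha S)_ik + beta/2).
    """
    rng = np.random.default_rng(seed)
    n = X.shape[1]
    S = rng.random((n, n))
    XtX = X.T @ X
    for _ in range(n_iter):
        S *= XtX / (XtX @ S + alpha * S + beta / 2.0 + eps)
    R = (S + S.T) / 2.0          # eq. (12)
    return S, R
```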

3) AGPNMF Model: In order to enhance the locality preserving properties of the PNMF or cluster-NMF decomposition, we combine the goal of minimum approximation error with the locality preserving constraints. In this way, a modified PNMF or cluster-NMF can be constructed that is derived from the minimization of the local scatter criterion

min_H OF(W, H) = ||X − WH^T||^2 + γ Tr(H^T L H)    (13)
subject to H ≥ 0, γ ≥ 0, W = XH.    (14)

On the other hand, we combine the optimization problems (11), (13), and (14) into one, in order to simultaneously find the graph weights matrix and the dimensionality reduction of the data. Hence, we formulate a novel cost function, which must be minimized with respect to both S and H

min_{H,S} OF(W, H, S) = ||X − WH^T||^2 + γ Tr(H^T L H) + ||X − XS||^2 + α Tr(S^T S) + β||S||_1    (15)


subject to H ≥ 0, S ≥ 0, γ ≥ 0, α ≥ 0, β ≥ 0, W = XH    (15')

where γ, α, and β are constants. The first term of the objective function (15) measures the approximation error between the original data and the reconstruction WH^T from the estimated cluster indicator space H. The second term controls the degree to which the local geometric structure among the samples is preserved. The third term minimizes the reconstruction error of the original data. The fourth term penalizes the complexity of S. The final term promotes the sparsity of S. The three parameters γ, α, and β balance the five terms in the objective function. Similar to most dimensionality reduction methods, AGPNMF finds a small neighborhood around each data point and connects each point to its neighbors with appropriate weights. The key advantage of this approach is that AGPNMF simultaneously finds the graph weights matrix and the dimensionality reduction of the data.

B. Learning Rules

The problem in (15) and (15') is not a convex optimization problem. Following the same majorization-minimization (MM) approach used for NMF [2], DNMF [9], LNMF [14], PNMF [22], and ILLE [28], we arrive at the following updating rules for S_{ik} and H_{ik}:

S_{ik} = S_{ik} (X^T X + (1/2)γHH^T)_{ik} / [(X^T X S + αS)_{ik} + (1/4)γ((HH^T)_{ii} + (HH^T)_{kk}) + β/2]    (16)

H_{ik} = H_{ik} (−B_{ik} + sqrt(B_{ik}^2 − 4A_{ik}C_{ik})) / (2A_{ik})    (17)

where

A_{ik} = (HH^T X^T X H + X^T X H H^T H + γDH)_{ik},
B_{ik} = −2(X^T X H)_{ik} H_{ik}, and
C_{ik} = −(1/2)γ(RH)_{ik} H_{ik}^2.

The derivation of the iterative rules (16) and (17) is given in the Appendix. The above decomposition is an unsupervised nonnegative matrix factorization method that decomposes the data into a low-dimensional representation space while enhancing the locality preserving characteristics among the data collection.

C. AGPNMF Algorithm

Summarizing the above description, the AGPNMF algorithm is as follows.
1) Initialize the matrices S and H.
2) Repeat the following steps until the maximum iteration number is reached:
   a) fixing H, update S by updating rule (16);
   b) fixing S, update H by updating rule (17).

How to choose or seed the initial matrices H and S is an open question in the AGPNMF algorithm. There are many initialization methods for NMF [29]–[31], as well as random generation. For NMF learning, different stopping criteria have been exploited [32], [33]. The most commonly used are the maximum iteration number and the error tolerance. In our experiments, H and S are randomly initialized, and we impose an iteration limit specified by the prescribed maximum iteration number.

Theorem 1: The objective function OF(W, H, S) in (15) is non-increasing under the update rules in (16) and (17). The objective function is invariant under these updates if and only if H and S are at a stationary point.

A detailed proof of the theorem is given in the Appendix.

IV. Kernel-AGPNMF

Kernel methods are utilized in many different research fields, such as noisy interpolation and pattern recognition [34]–[36]. In this section, we use the kernel trick to generalize the AGPNMF model to arbitrary nonlinear similarity functions. Consider a mapping φ(X) that maps the data X to a high-dimensional space, as in kernel machines. Substituting (15') into (15), (15) becomes

min_{H,S} ||φ(X) − φ(X)HH^T||^2 + γ Tr(H^T L H) + ||φ(X) − φ(X)S||^2 + α Tr(S^T S) + β||S||_1.    (18)

Indeed, it is easy to see that the minimization objective

||φ(X) − φ(X)HH^T||^2 + γ Tr(H^T L H) + ||φ(X) − φ(X)S||^2 + α Tr(S^T S) + β||S||_1
= Tr(φ(X)^T φ(X) − 2φ(X)^T φ(X)HH^T) + Tr(H^T φ(X)^T φ(X)HH^T H) + γ Tr(H^T L H)
+ Tr(φ(X)^T φ(X) − 2φ(X)^T φ(X)S + S^T φ(X)^T φ(X)S) + α Tr(S^T S) + β||S||_1.    (19)

The important point here is that the exact form of the mapping function is not needed; rather, only the inner product is required. In other words, the objective function depends only on the kernel K = φ(X)^T φ(X). In fact, the update rules for AGPNMF presented in (16) and (17) depend on X^T X only. Thus, it is possible to kernelize AGPNMF. It is easy to obtain the update rules for Kernel-AGPNMF (KAGPNMF)

S_{ik} = S_{ik} (K + (1/2)γHH^T)_{ik} / [(KS + αS)_{ik} + (1/4)γ((HH^T)_{ii} + (HH^T)_{kk}) + β/2]    (20)

H_{ik} = H_{ik} (−B_{ik} + sqrt(B_{ik}^2 − 4A_{ik}C_{ik})) / (2A_{ik})    (21)

where K is the kernel matrix,

A_{ik} = (HH^T K H + K H H^T H + γDH)_{ik},
B_{ik} = −2(KH)_{ik} H_{ik}, and
C_{ik} = −(1/2)γ(RH)_{ik} H_{ik}^2.
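Putting the pieces together, here is a hedged NumPy sketch of the AGPNMF iteration from Section III-C: it alternates rule (16) for S with the quadratic-root update for H derived in the Appendix ((36) and (37)), rebuilding R, D, and the quantities A, B, C at each pass. The parameter defaults of 0.01 follow the settings used in Section V, while the iteration cap, random initialization, and epsilon guard are implementation choices for this sketch. For the kernel variant (20) and (21), the Gram matrix X^T X is simply replaced by a kernel matrix K.

```python
import numpy as np

def agpnmf(X, r, gamma=0.01, alpha=0.01, beta=0.01, n_iter=200, eps=1e-10, seed=0):
    """Sketch of AGPNMF: X is m x n, H is the n x r cluster indicator,
    S is the n x n learned similarity matrix."""
    rng = np.random.default_rng(seed)
    n = X.shape[1]
    H = rng.random((n, r))
    S = rng.random((n, n))
    XtX = X.T @ X
    for _ in range(n_iter):
        HHt = H @ H.T
        d = np.diag(HHt)                 # the (HH^T)_ii terms in (16)
        # Update (16): fix H, update S.
        num_S = XtX + 0.5 * gamma * HHt
        den_S = XtX @ S + alpha * S + 0.25 * gamma * (d[:, None] + d[None, :]) + beta / 2.0
        S *= num_S / (den_S + eps)
        # Rebuild the graph from the current S, as in (12) and (9).
        R = (S + S.T) / 2.0
        D = np.diag(R.sum(axis=1))
        # Update for H: positive root of A h^2 + B h + C = 0, per (36)-(37).
        A = HHt @ XtX @ H + XtX @ HHt @ H + gamma * (D @ H)
        B = -2.0 * (XtX @ H) * H
        C = -0.5 * gamma * (R @ H) * (H ** 2)
        H = (-B + np.sqrt(B ** 2 - 4.0 * A * C)) / (2.0 * A + eps)
    return H, S
```

A usage example would call `agpnmf(X, r)` on a term-document matrix X and then assign each document to `np.argmax(H, axis=1)`, mirroring the cluster-indicator interpretation of Section II.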


TABLE I Clustering Results Comparison on TDT2
TABLE II Clustering Results Comparison on TDT2
TABLE III Clustering Results Comparison on SECTOR
TABLE IV Clustering Results Comparison on SECTOR

V. Experiment Results

In this section, we investigate the use of our proposed AGPNMF for document clustering. Several experiments are carried out to show the effectiveness of our algorithm for document clustering.

TABLE V Clustering Results Comparison on Reuters-21578

A. Data Sets

For our experiments, we use three well-known data sets, the TDT2 corpus,1 the Reuters-21578 corpus,2 and the SECTOR corpus,3 which have been widely used by researchers in the information retrieval area. The Reuters-21578 corpus is an archive of manually categorized newswire stories from Reuters Ltd. The classic Reuters-21578 collection was the main benchmark for document clustering evaluation. The TDT2 English corpus has been designed to include six months of material drawn on a daily basis from six English news sources, including two newswires (APW, NYT), two radio programs (VOA, PRI), and two television programs (CNN, ABC). In our experiments, we used the preprocessed TDT2 and Reuters-21578 data sets provided by Cai et al. [37],4 and documents that are associated with more than one class, as well as classes that contain fewer than 50 documents, are removed. The Industry Sector data set3 is a collection of corporate web pages organized into categories based on what a company produces or does.

1 http://www.nist.gov/speech/tests/tdt/tdt98/index.htm
2 http://kdd.ics.uci.edu/databases/reuters21578/reuters21578.html
3 http://www2.cs.cmu.edu/~TextLearning/datasets.html
4 http://www.cad.zju.edu.cn/home/dengcai/Data/data.html

TABLE VI Clustering Results Comparison on Reuters-21578


Fig. 1. Purity versus number of clusters on the TDT2.
Fig. 2. Entropy versus number of clusters on the TDT2.
Fig. 3. Purity versus number of clusters on the SECTOR.
Fig. 4. Entropy versus number of clusters on the SECTOR.
Fig. 5. Purity versus number of clusters on the Reuters-21578.
Fig. 6. Entropy versus number of clusters on the Reuters-21578.

In our experiments, we used the preprocessed Industry Sector data set given by Lin,5 which has been scaled and normalized to unit length. Documents that are associated with more than one class, as well as classes that contain fewer than 50 documents, are removed. Each document vector is normalized.

5 http://www.csie.ntu.edu.tw/~cjlin/libsvmtools

B. Evaluation Methods of Clustering

We use two measures to evaluate the accuracy of the clustering algorithms. The first measure is the purity, which represents the fraction of a cluster corresponding to the largest class of documents assigned to that cluster. The purity of cluster k is defined as [22]

Purity(k) = max_l (n_{lk})    (22)

where n_{lk} is the number of samples in cluster k that belong to original class l. The overall purity of the clustering solution is obtained as a weighted sum of the individual cluster purities and is given by

Purity = (1/n) Σ_k Purity(k).    (23)

In general, the larger the purity value, the better the clustering result [22]. The second measure is the entropy, which represents how classes are distributed over the various clusters. Following [22] and [38], the entropy of the entire clustering solution is computed as

Entropy = −(1/(n log_2 q)) Σ_k Σ_l n_{lk} log_2(n_{lk} / n_k)    (24)

where n_k = Σ_l n_{lk} and q is the number of original classes. Generally, the smaller the entropy value, the better the clustering quality [22], [38].
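For completeness, the following sketch computes the purity (22), (23) and entropy (24) of a clustering from ground-truth class labels and predicted cluster assignments; the function name and the convention of skipping empty counts in the logarithm are our own choices.

```python
import numpy as np

def purity_entropy(classes_true, clusters_pred):
    """Purity, eqs. (22)-(23), and entropy, eq. (24), of a clustering.

    n_lk is the number of documents of class l placed in cluster k,
    n_k is the size of cluster k, and q is the number of classes (q >= 2).
    """
    classes_true = np.asarray(classes_true)
    clusters_pred = np.asarray(clusters_pred)
    class_ids = np.unique(classes_true)
    cluster_ids = np.unique(clusters_pred)
    n, q = len(classes_true), len(class_ids)
    counts = np.array([[np.sum((classes_true == c) & (clusters_pred == k))
                        for k in cluster_ids] for c in class_ids], dtype=float)
    purity = counts.max(axis=0).sum() / n                    # (22)-(23)
    n_k = counts.sum(axis=0)
    ratios = np.where(counts > 0, counts / n_k, 1.0)         # term is 0 when n_lk = 0
    entropy = -(counts * np.log2(ratios)).sum() / (n * np.log2(q))   # (24)
    return purity, entropy
```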


Fig. 7. Performance (purity) of AGPNMF versus the parameter α with different weighting schemes on three document sets.
Fig. 8. Performance (entropy) of AGPNMF versus the parameter α with different weighting schemes on three document sets.
Fig. 9. Performance (purity) of AGPNMF versus the parameter γ with different weighting schemes on three document sets.
Fig. 10. Performance (entropy) of AGPNMF versus the parameter γ with different weighting schemes on three document sets.
Fig. 11. Performance (purity) of AGPNMF versus the parameter β with different weighting schemes on three document sets.
Fig. 12. Performance (entropy) of AGPNMF versus the parameter β with different weighting schemes on three document sets.

C. Comparison of Document Clustering Algorithms

In this section, the performance of clustering is evaluated on the Reuters-21578, TDT2, and SECTOR document databases. Our experiments in this subsection have two parts. First, to show document clustering performance, we compare the clustering performance of AGPNMF with four methods, namely k-means [39], NMF [2], CNMF [20], and PNMF [22]. In the following experiments, the parameters γ, α, and β are set to 0.01, 0.01, and 0.01, respectively. Tables I–VI and Figs. 1–6 show the experimental results on the Reuters-21578, TDT2, and SECTOR document databases. They report the entropy and purity values of the clusters obtained by running the five clustering algorithms on the different data sets when cluster numbers ranging from two to ten are selected. The evaluations are conducted by

running each algorithm three times, and the average performance over these three runs is recorded. From these experimental results, we can draw the following conclusions. Incorporating the automated graph regularization constraint into the PNMF formulation can enhance the performance of PNMF on the document clustering problem, and can perform better than k-means-based, NMF-based, CNMF-based, and PNMF-based clustering on the SECTOR, TDT2, and Reuters-21578 corpora.

Second, in our AGPNMF-based document clustering, one needs to select the values of the regularization parameters γ, α, and β. These parameters are explored in the experiments. The experimental results are shown in Figs. 7–12, which show the relationship between the clustering performance and the values of γ, α, and β. Here, the evaluations are conducted for numbers of clusters ranging from two to six, and the average performance over these five cluster settings is recorded. The values of γ, α, and β are varied from 0.001 to 1. The results of the experiment indicate that the best purity of AGPNMF-based clustering is obtained when choosing values of α ranging


from 0.01 to 0.5 on the TDT2, SECTOR, and Reuters-21578 corpora, with γ and β ranging from 0.01 to 0.1 on these corpora; the best entropy values of AGPNMF-based clustering are obtained over the same ranges, i.e., α from 0.01 to 0.5, and γ and β from 0.01 to 0.1, on the TDT2, SECTOR, and Reuters-21578 corpora.

VI. Conclusion

This paper addressed a novel representation space extraction method for the term-frequency matrix of a given document corpus. Based on knowledge of the geometric structure of the samples, we have presented a novel projective nonnegative matrix factorization algorithm that incorporates an automated graph regularization constraint into the PNMF decomposition and therefore has locality preserving properties. The key advantage of this approach is that AGPNMF simultaneously finds the graph weights matrix and the dimensionality reduction of the data. The proposed method has been applied to document clustering using three data sets, the Reuters-21578 corpus, the TDT2 corpus, and the SECTOR corpus. The results of the experiments indicate that our proposed method enhances the performance of PNMF for the document clustering problem. As with any newly developed technique, there exist some important issues that need further research. For example, methods for choosing the parameters of AGPNMF should be investigated. In addition, efficient learning algorithms should be investigated.

Appendix

To prove Theorem 1, we follow a procedure similar to that described in [2], [9], [14], [20], and [22]. Our proof makes use of an auxiliary function similar to that used in the MM algorithm [1], [2], [40], [41]. We first introduce a definition and Lemma 1 presented in [2], and Propositions 1–3 presented in [22].

Definition 1: G(h, h') is an auxiliary function for F(h) if the conditions G(h, h') ≥ F(h) and G(h, h) = F(h) are satisfied.

The auxiliary function is a useful concept because of the following lemma.

Lemma 1: If G(h, h') is an auxiliary function of F(h), then F(h) is non-increasing under the update [2]

h^{(t+1)} = arg min_h G(h, h^{(t)}).    (25)

Proof: By Definition 1, it is easy to conclude that F(h^{(t+1)}) ≤ G(h^{(t+1)}, h^{(t)}). By the optimization problem (25) and Definition 1, we have

G(h^{(t+1)}, h^{(t)}) ≤ G(h^{(t)}, h^{(t)}) = F(h^{(t)}).

Therefore

F(h^{(t+1)}) ≤ F(h^{(t)}).

Proposition 1: For any matrices A ∈ R_+^{r×r}, B ∈ R_+^{m×r}, and B' ∈ R_+^{m×r}, it holds that

Tr(B^T B' A) ≤ Σ_{ik} (B'A)_{ik} B_{ik}^2 / B'_{ik}.    (26)

Proposition 2: For any matrices A ∈ R_+^{r×r}, B ∈ R_+^{m×r}, and B' ∈ R_+^{m×r}, it holds that

Tr(B^T B' A) ≥ Σ_{ikl} A_{kl} B'_{ik} B'_{il} (1 + log(B_{ik} B_{il} / (B'_{ik} B'_{il}))).    (27)

Proposition 3: For any matrices A ∈ R_+^{m×r}, B ∈ R_+^{m×r}, and B' ∈ R_+^{m×r}, we have

Tr(A^T B) ≤ Σ_{ik} A_{ik} (B_{ik}^2 + B'_{ik}^2) / (2B'_{ik}).    (28)

Proof of Theorem 1: The generalized objective of the objective function (15) is

Q̃_F(W, H, S) = ||X − WH^T||^2 + γ Tr(H^T L H) + ||X − XS||^2 + α Tr(S^T S) + β||S||_1 + Tr(Ψ^T(W − XH))    (29)

obtained by introducing Lagrangian multipliers {ψ_{ij}}. We denote

L1_F(H) = ||X − WH^T||^2 + γ Tr(H^T L H) + Tr(Ψ^T(W − XH))
        = Tr(−2X^T W H^T − Ψ^T X H) + Tr(W^T W H^T H) + γ Tr(H^T(D − R)H) + Tr(X^T X + Ψ^T W)    (30)

L2_F(S) = γ Tr(H^T L H) + ||X − XS||^2 + α Tr(S^T S) + β||S||_1
        = γ Tr(H^T(D − R)H) + Tr(X^T X − 2X^T X S + S^T X^T X S) + α Tr(S^T S) + β||S||_1.    (31)

Now we will show that the update step for H in (17) is exactly the update in (25) with a proper auxiliary function. We have the following lemma.

Lemma 2: Let the function G1_F(H, H') be defined as

G1_F(H, H') = Tr(−2X^T W H'^T − Ψ^T X H') + Σ_{ik} (HW^T W)_{ik} H'_{ik}^2 / H_{ik} − Tr(A^T H')
            + Σ_{ik} A_{ik}(H_{ik}^2 + H'_{ik}^2) / (2H_{ik}) + γ Σ_{ik} (DH)_{ik} H'_{ik}^2 / H_{ik}
            − γ Σ_{ikl} R_{kl} H_{ik} H_{il} (1 + log(H'_{ik} H'_{il} / (H_{ik} H_{il}))) + Tr(X^T X + Ψ^T W)    (32)


then G1_F(H, H') is an auxiliary function of L1_F(H'). Here we denote A = 2X^T X H H^T H for notational brevity. By Propositions 1–3, it is easy to conclude that

G1_F(H, H') ≥ L1_F(H'), and G1_F(H', H') = L1_F(H').

Therefore the function G1_F(H, H') is an auxiliary function of L1_F(H'). Similarly, we have a lemma for S.

Lemma 3: Let the function G2_F(S, S') be defined as

G2_F(S, S') = Tr(X^T X − 2X^T X S' − γH^T R H) + Σ_{ik} (X^T X S)_{ik} S'_{ik}^2 / S_{ik} + α Σ_{ik} S_{ik} S'_{ik}^2 / S_{ik}
            + β Σ_{ik} (S'_{ik}^2 + S_{ik}^2) / (2S_{ik}) + γ Σ_{ik} ((HH^T)_{ii} + (HH^T)_{kk})(S'_{ik}^2 + S_{ik}^2) / (4S_{ik})    (33)

then G2_F(S, S') is an auxiliary function of L2_F(S'). By Propositions 1 and 2, it is easy to conclude that G2_F(S, S') ≥ L2_F(S') and G2_F(S', S') = L2_F(S'). Therefore the function G2_F(S, S') is an auxiliary function of L2_F(S').

With the help of the auxiliary functions G1_F(H, H') and G2_F(S, S'), the update rules for H and S can be derived by minimizing G1_F(H, H') and G2_F(S, S') as defined in (32) and (33), following [2], [9], [14], [20], and [22]. Solving the optimization problem H' = arg min_{H'} G1_F(H, H'), the update rule is derived by setting ∂G1_F(H, H')/∂H'_{ik} to zero for all H'_{ik}. We need to calculate the partial derivatives; ∂G1_F(H, H')/∂H'_{ik} is given by

∂G1_F(H, H')/∂H'_{ik} = (−2X^T W − X^T Ψ)_{ik} + 2(HW^T W)_{ik} H'_{ik}/H_{ik} − A_{ik} + A_{ik} H'_{ik}/H_{ik}
                        + 2γ(DH)_{ik} H'_{ik}/H_{ik} − γ(RH)_{ik} H_{ik}/H'_{ik}.    (34)

Setting ∂G1_F(H, H')/∂H'_{ik} = 0, we have

(−2X^T W − X^T Ψ)_{ik} + 2(HW^T W)_{ik} H'_{ik}/H_{ik} − A_{ik} + A_{ik} H'_{ik}/H_{ik} + 2γ(DH)_{ik} H'_{ik}/H_{ik} − γ(RH)_{ik} H_{ik}/H'_{ik} = 0.    (35)

Equation (35) can be expanded as

(−2X^T W − X^T Ψ)_{ik} H_{ik} H'_{ik} + 2(HW^T W)_{ik} H'_{ik}^2 − A_{ik} H_{ik} H'_{ik} + A_{ik} H'_{ik}^2 + 2γ(DH)_{ik} H'_{ik}^2 − γ(RH)_{ik} H_{ik}^2 = 0
⇔ (2HW^T W + 2γDH + A)_{ik} H'_{ik}^2 + (−2X^T W − X^T Ψ − A)_{ik} H_{ik} H'_{ik} − γ(RH)_{ik} H_{ik}^2 = 0.    (36)

Therefore, the update rule can be derived as

H'_{ik} = (−T2 + sqrt(T2^2 − 4T1 T3)) / (2T1)    (37)

where T1 = (2HW^T W + 2γDH + A)_{ik}, T2 = (−2X^T W − X^T Ψ − A)_{ik} H_{ik}, and T3 = −γ(RH)_{ik} H_{ik}^2. The Lagrangian multipliers can be determined by using the KKT conditions. According to

∂L1_F(H)/∂W = −2XH + 2WH^T H + Ψ = 0    (38)

one obtains

Ψ = 2XH − 2WH^T H    (39)

X^T Ψ = 2X^T X H − 2X^T W H^T H.    (40)

Substituting (15') and (40) into (37), the update rule becomes identical to (17). Since G1_F(H, H') is an auxiliary function of L1_F(H'), L1_F(H') is non-increasing under the update rule (37).

Similarly, the matrix S is updated by minimizing the auxiliary function G2_F(S, S'). Solving the optimization problem S' = arg min_{S'} G2_F(S, S'), the update rule is derived by setting ∂G2_F(S, S')/∂S'_{ik} to zero for all S'_{ik}. We need to calculate the partial derivatives; ∂G2_F(S, S')/∂S'_{ik} is given by

∂G2_F(S, S')/∂S'_{ik} = (−2X^T X − γHH^T)_{ik} + 2(X^T X S)_{ik} S'_{ik}/S_{ik} + 2αS_{ik} S'_{ik}/S_{ik}
                        + γ((HH^T)_{ii} + (HH^T)_{kk}) S'_{ik}/(2S_{ik}) + β S'_{ik}/S_{ik}.    (41)

Setting ∂G2_F(S, S')/∂S'_{ik} = 0, we have

(−2X^T X − γHH^T)_{ik} + 2(X^T X S)_{ik} S'_{ik}/S_{ik} + 2αS_{ik} S'_{ik}/S_{ik} + γ((HH^T)_{ii} + (HH^T)_{kk}) S'_{ik}/(2S_{ik}) + β S'_{ik}/S_{ik} = 0.    (42)

Equation (42) can be expanded as

2(−2X^T X − γHH^T)_{ik} S_{ik} + 4(X^T X S)_{ik} S'_{ik} + 4αS_{ik} S'_{ik} + 2βS'_{ik} + γ((HH^T)_{ii} + (HH^T)_{kk}) S'_{ik} = 0
⇔ 2(2X^T X + γHH^T)_{ik} S_{ik} = (4(X^T X S)_{ik} + 4αS_{ik} + 2β + γ((HH^T)_{ii} + (HH^T)_{kk})) S'_{ik}.    (43)

Therefore, the update rule can be derived as

S'_{ik} = S_{ik} (X^T X + (1/2)γHH^T)_{ik} / [(X^T X S + αS)_{ik} + (1/4)γ((HH^T)_{ii} + (HH^T)_{kk}) + β/2].    (44)

Since G2_F(S, S') is an auxiliary function of L2_F(S'), L2_F(S') is non-increasing under the update rule (44).
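As a quick, informal check of the non-increasing property claimed in Theorem 1, the sketch below applies the derived S update (16)/(44) repeatedly to small random nonnegative data and tracks the objective (15) with the constraint W = XH substituted; the data sizes, parameter values, and tolerance are arbitrary choices made for this sanity check.

```python
import numpy as np

def objective(X, H, S, gamma, alpha, beta):
    """Objective (15) with W = XH substituted, R = (S + S^T)/2, L = D - R."""
    R = (S + S.T) / 2.0
    L = np.diag(R.sum(axis=1)) - R
    rec_h = X - X @ H @ H.T
    rec_s = X - X @ S
    return (np.sum(rec_h ** 2) + gamma * np.trace(H.T @ L @ H)
            + np.sum(rec_s ** 2) + alpha * np.sum(S ** 2) + beta * np.abs(S).sum())

def update_S(X, H, S, gamma, alpha, beta, eps=1e-10):
    """One application of update rule (16)/(44)."""
    XtX = X.T @ X
    HHt = H @ H.T
    d = np.diag(HHt)
    num = XtX + 0.5 * gamma * HHt
    den = XtX @ S + alpha * S + 0.25 * gamma * (d[:, None] + d[None, :]) + beta / 2.0
    return S * num / (den + eps)

rng = np.random.default_rng(0)
X = rng.random((20, 30))       # 20 features, 30 documents (arbitrary sizes)
H = rng.random((30, 4))
S = rng.random((30, 30))
values = [objective(X, H, S, 0.01, 0.01, 0.01)]
for _ in range(50):
    S = update_S(X, H, S, 0.01, 0.01, 0.01)
    values.append(objective(X, H, S, 0.01, 0.01, 0.01))
print("objective non-increasing over 50 S-updates:",
      all(b <= a + 1e-8 for a, b in zip(values, values[1:])))
```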

Acknowledgment

The authors would like to thank the anonymous referees for useful suggestions and comments.

References [1] D. D. Lee and H. S. Seung, “Learning the parts of objects by nonnegative matrix factorization,” Nature, vol. 401, pp. 788–791, Oct. 1999. [2] D. D. Lee and H. S. Seung, “Algorithms for nonnegative matrix factorization,” in Advances in Neural Information Processing Systems 13, T. Leen, T. Dietterich, and V. Tresp, Eds. Cambridge, MA, USA: MIT Press, 2000. [3] I. Buciu and I. Pitas, “A new sparse image representation algorithm applied to facial expression recognition,” in Proc. IEEE Workshop Mach. Learn. Signal Process., 2004, pp. 539–548. [4] Y. Xue, C. S. Tong, W.-S. Chen, W. Zhang, and Z. He, “A modified nonnegative matrix factorization algorithm for face recognition,” in Proc. 18th Int. Conf. Pattern Recognit., 2006, pp. 495–498. [5] R. C. Zhi, M. Flierl, Q. Ruan, and W. B. Kleijn, “Graph-preserving sparse nonnegative matrix factorization with application to facial expression recognition,” IEEE Trans. Syst., Man, Cybern. B, Cybern., vol. 41, no. 1, pp. 38–52, Feb. 2011. [6] H. Kim and H. Park, “Sparse non-negative matrix factorizations via alternating non-negativity-constrained least squares for microarray data analysis,” Bioinformatics, vol. 23, no. 12, pp. 1495–1502, 2007. [7] Z. Yang, T. Hao, O. Dikmen, X. Chen, and E. Oja, “Clustering by nonnegative matrix factorization using graph random walk” Advances in Neural Information Processing Systems, 2012. [8] O. Zoidi, A. Tefas, and I. Pitas, “Multiplicative update rules for concurrent nonnegative matrix factorization and maximum margin classification,” IEEE Trans. Neural Netw. Learn. Syst., vol. 24, no. 3, pp. 422–434, Mar. 2013. [9] S. Zafeiriou, A. Tefas, I. Buciu, and I. Pitas, “Exploiting dis-criminant information in nonnegative matrix factorization with application to frontal face verification,” IEEE Trans. Neural Netw., vol. 17, no. 3, pp. 683–695, May 2006. [10] I. Kotsia, S. Zafeiriou, and I. Pitas, “A Novel discriminant non-negative matrix factorization algorithm with applications to facial image characterization problem,” IEEE Trans. Inf. Forensics Security, vol. 2, no. 3, pp. 588–595, Sep. 2007.

[11] Y. Wang, Y. Jia, C. Hu, and M. Turk, “Fisher non-negative matrix factorization for learning local features,” in Proc. Asian Conf. Comput. Vision, 2004, pp. 27–30. [12] Y. Xue, C. S. Tong, W.-S. Chen, W. Zhang, and Z. He, “A modified nonnegative matrix factorization algorithm for face recognition,” in Proc. 18th Int. Conf. Pattern Recognit., 2006, pp. 495–498. [13] I. Buciu and I. Pitas, “A new sparse image representation algorithm applied to facial expression recognition,” in Proc. IEEE Workshop Mach. Learn. Signal Process., 2004, pp. 539–548. [14] S. Z. Li, X. W. Hou, and H. J. Zhang, “Learning spatially localized, parts-based representation,” in Proc. CVPR, 2001, pp. 207–212. [15] P. O. Hoyer, “Non-negative sparse coding,” in Proc. Neural Netw. Signal Process., 2002, 557–565. [16] P. O. Hoyer, “Nonnegative matrix factorization with sparseness constraints [J],” J. Mach. Learn. Res., vol. 5, no. 9, pp. 1457–1469, 2004. [17] C. Zhan, W. Li, and P. Ogunbona, “Local representation of faces through extended NMF,” Electron. Lett., vol. 48, no. 7, pp. 373–375, 2012. [18] P. D. O’Grady and B. A. Pearlmutter, “Convolutive non-negative matrix factorisation with a sparseness constraint,” in Proc. 16th IEEE Signal Process. Soc. Workshop MLSP, 2006, pp. 427–432. [19] W. Xu, X. Liu, and Y. Gong, “Document clustering based on nonnegative matrix factorization,” in Proc. Int. Conf. Res. Develop. Inf. Retrieval, 2003, pp. 267–273. [20] C. Ding, T. Li, and M. Jordan, “Convex and semi-nonnegative matrix factorizations,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 32, no. 1, pp. 45–55, Jan. 2010. [21] Z. Yuan and E. Oja, “Projective nonnegative matrix factorization for image compression and feature extraction,” in Proc. Scand. Conf. Image Anal., 2005, pp. 333–342. [22] Z. Yang and E. Oja, “Linear and nonlinear projective nonnegative matrix factorization,” IEEE Trans. Neural Netw., vol. 21, no. 5, pp. 734–749, May 2010. [23] C. Ding, T. Li, W. Peng, and H. Park, “Orthogonal non-negative matrix tri-factorizations for clustering,” in Proc. Int. Conf. KDD, 2006, pp. 126–135. [24] C. Ding, X. He, and H. D. Simon, “On the equiva-lence of nonnegative matrix factorization and spectral clustering,” in Proc. SDM, 2005, pp. 606–610. [25] C. Ding, T. Li, and W. Peng, “Nonnegative matrix factorization and probabilistic latent semantic indexing: Equivalence chi-square statistic, and a hybrid method,” in Proc. AAAI, 2006, pp. 342–347. [26] M. Belkin and P. Niyogi, “Laplacian eigenmaps and spectral techniques for embedding and clustering,” Advances in Neural Information Processing Systems 14. Cambridge, MA, USA: MIT Press, 2001, pp. 585–591. [27] S. T. Roweis and L. K. Saul, “Nonlinear dimensionality reduction by locally linear embedding,” Science, vol. 290, pp. 2323–2326, Dec. 2000. [28] D. Kong, C. Ding, H. Huang, and F. Nie, “An iterative locally linear embedding algorithm,” in Proc. ICML, 2012. [29] S. Wild, “Seeding non-negative matrix factorization with the spherical k-means clustering,” M.S. thesis, Dept. Appl. Math., Univ. Colorado, Boulder, CO, USA, 2002. [30] S. Wild, J. Curry, and A. Dougherty, “Motivating non-negative matrix factorizations,” in Proc. 8th SIAM Conf. Appl. Linear Algebra, 2003. [31] C. Boutsidis and E. Gallopoulos, “On SVD-based initializa-tion for nonnegative matrix factorization,” Univ. Patras, Patras, Greece, Tech. Rep. HPCLAB-SCG-6/08-05, 2005, [32] C.-J. Lin, “Projected gradient methods for non-negative matrix factorization,” Dept. Comput. Sci., Nat. Taiwan Univ., Taipei, Taiwan, Inf. 
Support Serv. Tech. Rep. IS-STECH-95-013, 2005. [33] M. Chu, F. Diele, R. Plemmons, and S. Ragni. (2004). Optimality, Computation, and Interpretations of Nonnegative Matrix Factorizations [Online]. Available: http://www.wfu.edu/plemmons [34] P. Y. Han, A. T. B. Jin, and T. K. Ann, “Kernel discriminant embedding in face recognition,” J. Visual Commun. Image Represent., vol. 22, no. 7, pp. 634–642, 2011. [35] D. Zhou and Z. Tang, “Kernel-based improved discriminant analysis and its application to face recognition,” Soft Comput. J., vol. 14, no. 2, pp. 103–111, 2010. [36] A. S. Khalifa, R. A. Ammar, M. F. Tolba, and T. Fergany, “Dynamic online allocation of independent task onto heterogeneous computing systems to maximize load balancing,” in Proc. 8th IEEE Int. Symp. Signal Process. Inf. Technol., 2008, p. 418425. [37] D. Cai, X. He, X. Wu, and J. Han, “Non-negative matrix factorization on manifold,” in Proc. 8th IEEE ICDM, 2008, pp. 63–72.


[38] C. Ding, T. Li, W. Peng, and H. Park, “Orthogonal nonnegative matrix tri-factorizations for clustering,” in Proc. 12th ACM SIGKDD Int. Conf. Knowl. Disc. Data Mining, 2006, pp. 126–135. [39] S. Lloyd, “Least squares quantization in PCM,” IEEE Trans. Inf. Theory, vol. 28, no. 2, pp. 129–137, Mar. 1982. [40] J. de Leeuw and W. J. Heiser, “Convergence of correction matrix algorithms for multidimensional scaling,” in Geometric Representations of Relational Data, J. C. Lingoes, E. Roskam, and I. Borg, Eds. Ann Arbor, MI, USA: Mathesis Press, 1977, pp. 735–752. [41] A. P. Dempster, N. M. Laird, and D. B. Rubin, “Maximum likelihood from incomplete data via the EM algorithm,” J. Roy. Statist. Soc. Ser. B (Methodological), vol. 39, no. 1, pp. 1–38, 1977.

Xiaobing Pei received the Ph.D. degree in computer engineering from Huazhong University of Science and Technology, Hubei, China, in 2006. From 1997 to 2000, he served as a Software Engineer with the Shenzhen Bureau of Posts and Telecommunications, Shenzhen, China, and from 2000 to 2003, he was a Research Engineer and Software Developer with the Research and Development Department, Huawei Technologies, Shenzhen. He is currently an Associate Professor with the School of Software, Huazhong University of Science and Technology. His current research interests include machine learning, data mining, and software engineering.

Tao Wu is currently a Professor with the School of Software Engineering, Huazhong University of Science and Technology, Hubei, China. Since 2007, he has been serving as the Head of the Software Science and Engineering Department. He has published research papers in the Journal of Network and Computer Applications, Computer Modeling in Engineering and Sciences, the Journal of Software, and so on. His current research interests include software engineering and cloud computing, network and information security, and topology optimization.


Chuanbo Chen is currently a Professor and Dean of the Software College, Huazhong University of Science and Technology, Hubei, China. He has been a Professor and Researcher on image processing and software engineering, and a Research Leader in many projects, including national projects, local government projects, and enterprise projects. He has published more than 200 academic papers and four books. He is an information technology expert with over 20 years of experience as a Technical/Research Teacher, Senior Manager, and Systems Architect. He has extensive experience in design and development, in full lifecycle development including analysis, design, development, documentation, and implementation, and in mentoring and training other developers. He has led development teams and has been associated with large projects for large companies. He has provided project life cycle management, business cases, elaboration, architecture design, project iteration, coding review, testing, and project auditing.
