
IEEE TRANSACTIONS ON CYBERNETICS, VOL. 44, NO. 7, JULY 2014

Constrained Concept Factorization for Image Representation

Haifeng Liu, Member, IEEE, Genmao Yang, Zhaohui Wu, Senior Member, IEEE, and Deng Cai, Member, IEEE

Abstract—Matrix factorization based techniques, such as nonnegative matrix factorization and concept factorization, have attracted great attention in dimensionality reduction and data clustering. Previous studies show that both of them yield impressive results on image processing and document clustering. However, both are essentially unsupervised methods and cannot incorporate label information. In this paper, we propose a novel semisupervised matrix decomposition method that extracts the image concepts consistent with the known label information; we call the new approach constrained concept factorization. By requiring that data points sharing the same label have the same coordinates in the new representation space, this approach has more discriminating power. Experimental results on several corpora show the good performance of the new algorithm in terms of clustering accuracy and mutual information.

Index Terms—Clustering, dimensionality reduction, nonnegative matrix factorization, semisupervised learning.

I. Introduction

WITH the rapid growth of the number of digital images, there is an increasing demand for efficient data representation techniques to process images, especially in computer vision and pattern recognition. A good data representation can improve image processing by finding the latent semantic structure, enhancing generalization capability, alleviating the effect of the curse of dimensionality, speeding up the learning process, and improving model interpretability.

In this paper, we consider the problem of image representation and clustering. Generally, the representation space of images is of high dimensionality, and the original representation does not reveal the latent structure in the data explicitly; thus, clustering algorithms based on the original representation cannot perform the task well. Moreover, with the increasing size of data collections, algorithm efficiency is also a major challenge for image clustering.

Following the intuition that naturally occurring image data may be generated by structured systems with possibly far fewer degrees of freedom than the ambient dimension would suggest, various researchers have considered the case in which the data lie on or close to a lower dimensional subspace of the ambient space [1]–[8]. One then hopes to estimate the intrinsic low-dimensional properties of the data from random points lying on this unknown subspace. Recent studies have shown that image clustering performance can be improved significantly in low dimensional linear subspaces. In particular, matrix factorization based techniques have yielded impressive results. Typical matrix factorization algorithms include principal component analysis [9], singular value decomposition [10], vector quantization [11], nonnegative matrix factorization (NMF) [4], [12], and concept factorization (CF) [1].

Generally, the goal of all the above algorithms is to find new basis vectors that can be used to represent the data points. The new representation maps the data into a low dimensional space and gives a more meaningful semantic interpretation. Central to these matrix factorization algorithms is finding two or more matrix factors whose product is a good approximation to the original matrix. One factor matrix is regarded as the set of bases, and each data point is considered a linear combination of the found bases. For example, NMF aims to decompose a data matrix X into two matrix factors U and V so that UV^T provides a good approximation to X. NMF is distinguished by the requirement that the factor matrices be nonnegative. This nonnegativity constraint leads NMF to a parts-based representation of the object; therefore, it is an ideal dimensionality reduction algorithm for image processing, face recognition [4], [13], and document clustering [14], where it is natural to consider the object as a combination of parts forming a whole. The CF model is a variation of NMF in which each cluster is expressed by a linear combination of the data points and each data point is represented by a linear combination of the cluster centers; hence, the original data matrix X is approximated by the product of the three matrices X, W, and V^T.

Both NMF and CF are unsupervised learning algorithms. They cannot differentiate the labeled data from the unlabeled data, and thus cannot take advantage of label information when such information is available. Semisupervised algorithms address learning problems using a large amount of unlabeled data, together with the labeled data, to build better models.



Because semisupervised learning gives higher accuracy, it is of great interest, both in theory and in practice, to extend NMF and CF to semisupervised learning algorithms. The general way to extend an unsupervised algorithm to a semisupervised one is graph regularization: a graph is defined whose nodes are the labeled and unlabeled examples in the dataset and whose (possibly weighted) edges reflect the similarity of examples, and the graph is then added to the objective function as a regularizer. Recently, Cai et al. [15], [16] proposed a graph regularized NMF (GNMF) approach to encode the geometrical information of the data space. GNMF constructs a nearest-neighbor graph to model the local manifold structure. When label information is available, it can be naturally incorporated into the graph structure: if two data points share the same label, a large weight can be assigned to the edge connecting them, and if two data points have different labels, the corresponding weight is set to 0. This gives rise to semisupervised GNMF.

To the best of our knowledge, there has been no work extending concept factorization to the semisupervised setting. Although one could follow the same process as GNMF and build a semisupervised graph regularized CF, there are still some limitations. The major disadvantage of this approach is that there is no theoretical guarantee that data points from the same class will be mapped together in the new representation space, and it remains unclear how to select the weights in a principled manner.

To address the above problem, in this paper we propose a novel semisupervised image clustering method called constrained concept factorization (CCF), which extracts the image concepts consistent with the known label information based on matrix factorization. The CCF model guarantees that data points sharing the same label are mapped into the same concept in the low-dimensional space; thus, the obtained concepts can well capture the intrinsic semantic structure, and the images associated with similar concepts can be well clustered. Moreover, unlike semisupervised GNMF, our approach is parameter free, so there is no cost of tuning parameters to get the best result, and CCF can be applied to many real-world applications easily and efficiently.

The remainder of this paper is structured as follows. In Section II, we briefly review the background of matrix factorization models (both the NMF and CF models belong to this category) and semisupervised dimensionality reduction methods. Section III introduces the CCF algorithm in detail together with its theoretical analysis. A variety of experimental results are presented in Section IV. Finally, we provide concluding remarks in Section V.

II. Background

Given a data set of high dimensionality, matrix factorization is a common approach to compress the data by finding a set of basis vectors and the representation of each data point with respect to that basis.


Suppose we have n data points {x_i}_{i=1}^n. Each data point x_i ∈ R^m is m-dimensional and is represented by a vector; the vectors are placed in the columns, and the whole data set is represented by a matrix X = [x_1, ..., x_n] ∈ R^{m×n}. Matrix factorization can be mathematically defined as finding two matrices U ∈ R^{m×k} and V^T ∈ R^{k×n} whose product best approximates X

X ≈ UV^T.

Each column of U can be regarded as a basis vector that captures higher-level features in the data, and each column of V^T is the k-dimensional representation of the original input with respect to the new basis. In this sense, matrix factorization can also be regarded as a dimensionality reduction method, since it reduces the dimension from m to k.

The factorization of matrices is generally nonunique, and a number of different methods have been developed by incorporating different constraints. NMF [17] is one of the most popular matrix factorization methods; it focuses on the analysis of data matrices whose elements are nonnegative and adds nonnegativity constraints on both U and V^T. The nonnegativity constraints allow only additive combinations among different basis vectors; thus, it is believed that NMF can learn a parts-based representation [17]. Although the nonnegativity of NMF may be a desirable property for image processing, since the factorization result has a better semantic interpretation and the clustering result can easily be derived from it, there are still some limitations. One is that the nonnegativity requirement is not applicable to many applications where the data involve negative numbers. The second is that it is not clear how to effectively perform NMF in a transformed data space (i.e., a reproducing kernel Hilbert space) so that powerful kernel methods can be applied [1].

Concept factorization was proposed to address the above problems while inheriting all the strengths of the NMF method. The main distinguishing advantage of CF is that it can be applied to any type of data representation, either in the original space or in a kernel space [1]. In the CF model, the NMF model is rewritten by representing each base (cluster center) u_j as a linear combination of the data points

u_j = Σ_i w_ij x_i

where w_ij ≥ 0. Let W = [w_ij] ∈ R^{n×k}; CF tries to decompose the data matrix such that

X ≈ XWV^T.

Using the Frobenius norm to quantify the approximation, CF minimizes the objective function

O = ||X − XWV^T||^2.   (1)

The multiplicative updating rules minimizing the above objective function are given as [1]

w_jk ← w_jk (KV)_jk / (KWV^T V)_jk
v_jk ← v_jk (KW)_jk / (VW^T KW)_jk   (2)

where K = X^T X.
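To make the iteration in (2) concrete, the following is a minimal NumPy sketch of the CF updates (our illustration, with hypothetical function and variable names, not the implementation of [1]); it assumes the entries of K = X^T X are nonnegative so the multiplicative rules can be applied directly:

```python
import numpy as np

def cf_factorize(X, k, n_iter=200, eps=1e-10):
    """Concept factorization X ~= X W V^T via the multiplicative updates in (2)."""
    n = X.shape[1]
    K = X.T @ X                          # kernel of inner products; assumed nonnegative here
    rng = np.random.default_rng(0)
    W = rng.random((n, k))
    V = rng.random((n, k))
    for _ in range(n_iter):
        W *= (K @ V) / (K @ W @ (V.T @ V) + eps)   # w_jk <- w_jk (KV)_jk / (K W V^T V)_jk
        V *= (K @ W) / (V @ (W.T @ K @ W) + eps)   # v_jk <- v_jk (KW)_jk / (V W^T K W)_jk
    return W, V
```

Monitoring ||X − XWV^T||_F^2 across iterations gives a simple check that the objective (1) is nonincreasing.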


These multiplicative updating rules involve only the inner products of X, and thus CF can be easily kernelized; see [1] for details.

The representation matrix V^T learned by the above two methods is usually dense. Since each basis vector (column vector of U) can be regarded as a concept, the density of V^T indicates that each image is a combination of all the concepts. This is contrary to our common knowledge, since most images include only a few semantic concepts. Sparse coding (SC) [18]–[21] is a recently popular matrix factorization method that tries to solve this issue by adding a sparseness constraint on V^T, more specifically, on each column of V^T; in this way, SC learns a sparse representation. SC has several advantages for data representation. First, it yields sparse representations such that each data point is represented as a linear combination of a small number of basis vectors, so the data points can be interpreted in a more elegant way. Second, sparse representations naturally lend themselves to an indexing scheme that allows quick retrieval. Third, the sparse representation can be overcomplete, which offers a wide range of generating elements; potentially, this allows more flexibility in signal representation and more effectiveness at tasks such as signal extraction and data compression. Finally, there is considerable evidence that biological vision adopts sparse representations in early visual areas [22]. The idea presented in this paper can naturally be applied to sparse coding; however, a detailed discussion is beyond the scope of this paper.

Matrix factorization can also be regarded as a dimensionality reduction method, and recently there have been many works on semisupervised dimensionality reduction algorithms. Liu et al. [23] proposed a robust and scalable graph-based semisupervised learning algorithm. Yang et al. [24] considered prior information in the form of on-manifold coordinates of certain data samples and presented three nonlinear algorithms: semisupervised LLE, semisupervised ISOMAP, and semisupervised LTSA.¹ Cai et al. [25] proposed semisupervised discriminant analysis (SDA) to learn a discriminant function that is as smooth as possible on the data manifold. Zhang et al. [26] proposed a semisupervised dimensionality reduction algorithm that preserves the intrinsic structure of the unlabeled data as well as both the must-link and cannot-link constraints defined on the labeled examples in the projected low-dimensional space. Zhang et al. [27] augmented the alignment matrix of LTSA with an orthogonal projection based on the known parameter vectors, giving rise to an eigenvalue problem that characterizes the semisupervised manifold learning problem. He et al. [28] presented a method called maximum margin projection, which aims at maximizing the margin between positive and negative examples in each local neighborhood. Zhang et al. [29] utilized unlabeled data to maximize an optimality criterion of linear discriminant analysis (LDA) and used the constrained concave-convex procedure to solve the optimization problem, which leads to estimation of the class labels for the unlabeled data; the selected unlabeled data can then be used to augment the original labeled data set for performing LDA. Sugiyama et al. [30] proposed semisupervised local Fisher discriminant analysis, which preserves the global structure of unlabeled samples in addition to separating labeled samples of different classes from each other. The most recent work by Zhang et al. [31] presents two multimodal nonlinear techniques, trace ratio (TR) criterion based semisupervised LAE (S²LAE) and LLE (S²LLE), which apply the pairwise must-link and cannot-link constraints induced by the neighborhood graph to specify the types of neighboring pairs.²

¹Locally linear embedding (LLE), isometric feature mapping (ISOMAP), and local tangent space alignment (LTSA) are basic unsupervised nonlinear dimensionality reduction algorithms.

²Laplacian eigenmaps (LAE) is a basic unsupervised nonlinear dimensionality reduction algorithm.

III. Concept Factorization With Constraints

Both NMF and CF fail to take advantage of label information when such information is available, even though it could be used to improve the clustering accuracy. In this section, we introduce a novel matrix decomposition method, called CCF, which takes the label information as additional constraints. The central idea of CCF is to represent the label information by a constraint matrix and to incorporate this constraint matrix into the matrix decomposition. By doing this, the low-dimensional data representation becomes consistent with the known label information.

A. Objective Function

Consider a data set consisting of n data points {x_i}_{i=1}^n, among which the first p data points x_1, ..., x_p are each labeled with one of c classes. For these p data points, we represent the label information in a p × c label indicator matrix C, where c_ij = 1 if x_i is labeled with the jth class and c_ij = 0 otherwise. For example, consider n data points among which x_1, x_2, and x_3 are labeled with class I, x_4 and x_5 are labeled with class II, x_6 is labeled with class III, and the other n − 6 data points are unlabeled. The label indicator matrix C for this example is

C = [ 1 0 0
      1 0 0
      1 0 0
      0 1 0
      0 1 0
      0 0 1 ].

With the label indicator matrix C and the other n − p unlabeled data points, we define a constraint matrix A as

A = [ C_{p×c}   0
      0         I_{n−p} ]

where I_{n−p} is the (n − p) × (n − p) identity matrix. Recall that CF maps each data point x_i from the m-dimensional space to a point v_i in the k-dimensional space. In order that data points sharing the same label are mapped to the same point in the low-dimensional space (i.e., the same v_i), we impose the label constraints by introducing an auxiliary matrix Z with

V = AZ.   (3)


Based on this equation, we can guarantee that if x_i and x_j have the same label, then v_i = v_j. Replacing V in (1) with (3), we obtain the objective function of CCF

O = ||X − XWZ^T A^T||^2.   (4)
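As a concrete illustration of the constraint matrix and the objective (4), here is a small hedged sketch (our own, with hypothetical names); it assumes the labels of the first p points are given as integers in {0, ..., c−1} and that unlabeled points are marked with −1:

```python
import numpy as np

def constraint_matrix(labels, c):
    """Build the label indicator C (p x c) and the constraint matrix A (n x (c + n - p))."""
    labels = np.asarray(labels)
    n = labels.size
    p = int(np.sum(labels >= 0))          # labeled points are assumed to come first
    C = np.zeros((p, c))
    C[np.arange(p), labels[:p]] = 1.0     # c_ij = 1 iff x_i carries the jth label
    A = np.zeros((n, c + n - p))
    A[:p, :c] = C                         # labeled block
    A[p:, c:] = np.eye(n - p)             # identity block for unlabeled points
    return A

def ccf_objective(X, W, Z, A):
    """O = ||X - X W Z^T A^T||_F^2, the CCF objective in (4); V = A Z."""
    V = A @ Z
    return np.linalg.norm(X - X @ W @ V.T) ** 2
```

Because V = AZ copies one row of Z to every point carrying the corresponding label, points with the same label are forced to share the same low-dimensional representation, which is exactly the effect of (3).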

In this way, the original CF is extended to a semisupervised learning algorithm (CCF) by finding two matrix factors W and Z such that the product of X, W, Z, and the constraint matrix A is an approximation to the original matrix.

B. Multiplicative Algorithm

As in the CF model, it is difficult to find the global optimum of the objective function, so we describe an algorithm that obtains a local optimum of O by iterative updating rules. Define K = X^T X. Using the matrix properties ||A||^2 = Tr(A^T A), Tr(AB) = Tr(BA), and Tr(A) = Tr(A^T), the objective function can be rewritten as

O = Tr((X − XWZ^T A^T)^T (X − XWZ^T A^T))
  = Tr((I − WZ^T A^T)^T K (I − WZ^T A^T))
  = Tr(K − 2W^T KAZ + W^T KWZ^T A^T AZ).   (5)

Let α_ij and β_ij be the Lagrange multipliers for the constraints w_ij ≥ 0 and z_ij ≥ 0, and define α = [α_ij] and β = [β_ij]; then the Lagrangian is

L = O + Tr(αW^T) + Tr(βZ^T).

Fixing one variable, the partial derivatives of L with respect to the other variable are

∂L/∂W = −2KAZ + 2KWZ^T A^T AZ + α   (6)
∂L/∂Z = −2A^T KW + 2A^T AZW^T KW + β.   (7)

Using the Kuhn–Tucker conditions α_ij w_ij = 0 and β_ij z_ij = 0, we get the following equations:

(KAZ)_ij w_ij − (KWZ^T A^T AZ)_ij w_ij = 0   (8)
(A^T KW)_ij z_ij − (A^T AZW^T KW)_ij z_ij = 0.   (9)

These equations lead to the following updating rules:

w_ij ← w_ij (KAZ)_ij / (KWZ^T A^T AZ)_ij   (10)
z_ij ← z_ij (A^T KW)_ij / (A^T AZW^T KW)_ij.   (11)

We have the following theorem regarding the above iterative updating rules.

Theorem 1: The objective function O in (4) is nonincreasing under the update rules in (10) and (11). The objective function is invariant under these updates if and only if W and Z are at a stationary point.

Theorem 1 guarantees the convergence of the iterations in (10) and (11); therefore, the final solution is a local optimum. The proof of convergence is given in the next subsection.

The solution minimizing the objective function O is not unique: if W and Z are a solution, then WD and ZD^{-1} also form a solution for any positive diagonal matrix D. Therefore, we further normalize the solution to make it unique. Requiring each column w_c of W to satisfy w_c^T K w_c = 1 while keeping WZ^T unchanged, W and Z are updated with the normalization

Z = Z[diag(W^T KW)]^{1/2}   (12)
W = W[diag(W^T KW)]^{-1/2}.   (13)
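The updates (10)–(13) translate directly into code. Below is a hedged NumPy sketch of one CCF iteration (our illustration, not the authors' implementation); it assumes the entries of K are nonnegative, as required by this version of the algorithm:

```python
import numpy as np

def ccf_step(K, A, W, Z, eps=1e-10):
    """One multiplicative CCF update following (10)-(13); K = X^T X, assumed nonnegative."""
    AtA = A.T @ A
    # (10): w_ij <- w_ij (K A Z)_ij / (K W Z^T A^T A Z)_ij
    W = W * (K @ A @ Z) / (K @ W @ (Z.T @ AtA @ Z) + eps)
    # (11): z_ij <- z_ij (A^T K W)_ij / (A^T A Z W^T K W)_ij
    Z = Z * (A.T @ (K @ W)) / (AtA @ Z @ (W.T @ K @ W) + eps)
    # (12)-(13): rescale columns so that w_c^T K w_c = 1 while leaving W Z^T unchanged
    d = np.sqrt(np.diag(W.T @ K @ W)) + eps
    return W / d, Z * d
```

Iterating ccf_step and monitoring ||X − XWZ^T A^T||_F^2 should show a nonincreasing objective, consistent with Theorem 1.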

C. Convergence Proof

Following the proof of the NMF algorithm in [12], we also use an auxiliary function, as used in the expectation-maximization algorithm [32], to prove the convergence of our CCF algorithm. To make the proof self-contained, we restate the definition of an auxiliary function and the property that will be used to prove convergence.

Definition 1: G(x, x') is an auxiliary function for F(x) if the conditions

G(x, x') ≥ F(x),   G(x, x) = F(x)

are satisfied.

Lemma 1: If G is an auxiliary function, then F is nonincreasing under the update

x^{t+1} = arg min_x G(x, x^t).   (14)

Proof: F(x^{t+1}) ≤ G(x^{t+1}, x^t) ≤ G(x^t, x^t) = F(x^t). □

The equality F(x^{t+1}) = F(x^t) holds only if x^t is a local minimum of G(x, x^t). By iterating the updates in (14), the sequence of estimates converges to a local minimum x_min = arg min_x F(x). Next, we define an auxiliary function for our objective function and use Lemma 1 to show that the minimizer of this auxiliary function is exactly our update rule, thereby proving Theorem 1.

First, we prove the convergence of the update rule in (11). For any element z_ab in Z, we use F_{z_ab} to denote the part of O relevant to z_ab. Since the update is essentially element-wise, it is sufficient to show that each F_{z_ab} is nonincreasing under the update step of (11). We prove this by defining the auxiliary function G for F_{z_ab} as follows.

Lemma 2: The function

G(z, z^t_ab) = F_{z_ab}(z^t_ab) + F'_{z_ab}(z^t_ab)(z − z^t_ab) + [(A^T AZW^T KW)_ab / z^t_ab] (z − z^t_ab)^2   (15)

is an auxiliary function for F_{z_ab}, which is the part of O that is only relevant to z_ab.

Proof: The Taylor series expansion of F_{z_ab} is

F_{z_ab}(z) = F_{z_ab}(z^t_ab) + F'_{z_ab}(z^t_ab)(z − z^t_ab) + (1/2) F''_{z_ab} (z − z^t_ab)^2.

Since F''_{z_ab} = 2(A^T A)_aa (W^T KW)_bb and

(A^T AZW^T KW)_ab = Σ_k (A^T AZ)_ak (W^T KW)_kb
                  ≥ (A^T AZ)_ab (W^T KW)_bb
                  ≥ Σ_k (A^T A)_ak z^t_kb (W^T KW)_bb
                  ≥ z^t_ab (A^T A)_aa (W^T KW)_bb
                  ≥ (1/2) z^t_ab F''_{z_ab}

we have G(z, z^t_ab) ≥ F_{z_ab}(z). □

The auxiliary function for the objective function with regard to the variable w_ab is defined as follows.

Lemma 3: The function

G(w, w^t_ab) = F_{w_ab}(w^t_ab) + F'_{w_ab}(w^t_ab)(w − w^t_ab) + [(KWZ^T A^T AZ)_ab / w^t_ab] (w − w^t_ab)^2   (16)

is an auxiliary function for F_{w_ab}, which is the part of O that is only relevant to w_ab.

Proof: The proof is essentially similar to that of Lemma 2. Comparing G(w, w^t_ab) with the Taylor series expansion of F_{w_ab}, we only need to prove that (KWZ^T A^T AZ)_ab / w^t_ab ≥ (1/2) F''_{w_ab}. Since F''_{w_ab} = 2K_aa (Z^T A^T AZ)_bb and

(KWZ^T A^T AZ)_ab = Σ_i (KW)_ai (Z^T A^T AZ)_ib
                  ≥ (KW)_ab (Z^T A^T AZ)_bb
                  ≥ Σ_i K_ai w^t_ib (Z^T A^T AZ)_bb
                  ≥ w^t_ab K_aa (Z^T A^T AZ)_bb
                  ≥ (1/2) w^t_ab F''_{w_ab}

we have G(w, w^t_ab) ≥ F_{w_ab}(w). □

Proof of Theorem 1: From Lemma 3, we know that G(w, w^t_ab) is an auxiliary function for F_{w_ab}, and from Lemma 2, we know that G(z, z^t_ab) is an auxiliary function for F_{z_ab}. According to Lemma 1, by solving w^{t+1} = arg min_w G(w, w^t_ab), we obtain

w^{t+1}_ab = w^t_ab − w^t_ab F'_{w_ab}(w^t_ab) / [2(KWZ^T A^T AZ)_ab]   (17)
           = w^t_ab (KAZ)_ab / (KWZ^T A^T AZ)_ab   (18)

and by solving z^{t+1} = arg min_z G(z, z^t_ab), we obtain

z^{t+1}_ab = z^t_ab − z^t_ab F'_{z_ab}(z^t_ab) / [2(A^T AZW^T KW)_ab] = z^t_ab (A^T KW)_ab / (A^T AZW^T KW)_ab   (19)

which are exactly the same updates as in (10) and (11). Therefore, the objective function O in (4) is nonincreasing under these updates. □

D. Computational Complexity Analysis

In this subsection, we discuss the extra computational cost of our proposed algorithm compared to standard CF. The common way to express the complexity of an algorithm is big O notation [33]; however, it is not precise enough to differentiate the complexities of CF and CCF, so we count the arithmetic operations of each algorithm. The three operation abbreviations used in this paper are summarized in Table I; see [34] for more details about these abbreviations.

TABLE I. Abbreviations for Reporting Operation Counts

Based on the updating rules in (2), it is not hard to count the arithmetic operations of each iteration in CF; we summarize the results in Table II. For CCF, it is important to note that each row of A contains exactly one 1. Thus, no additions or multiplications are needed to compute AZ, and only nk fladd are needed to compute A^T B, where B is an arbitrary n × k matrix. The arithmetic operations of each iteration of CCF are also summarized in Table II. Compared to CF, CCF needs only 2nk more fladd per iteration, and this additional cost is dominated by the remaining cost of CCF. Thus, the overall cost of both CF and CCF in each iteration is O(n^2 k). Besides the multiplicative updating, both CF and CCF need to compute the kernel matrix K, which requires O(n^2 m) operations. If the multiplicative updates stop after t iterations, the overall cost of both CF and CCF is O(tn^2 k + n^2 m).

TABLE II. Computational Operation Counts for Each Iteration in CF and CCF (n is the number of sample points, m is the number of features, k is the number of factors, and p is the number of labeled data points.)
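The claim that multiplying by A adds almost no cost can be made concrete by never forming A explicitly. In the sketch below (our illustration, with hypothetical names), A is stored as the index map row_to_col, where row_to_col[i] is the column holding the single 1 in row i of A; AZ is then a pure row gather and A^T B a segmented sum costing about nk fladd:

```python
import numpy as np

def apply_A(Z, row_to_col):
    """A Z without forming A: row i of the result equals row row_to_col[i] of Z."""
    return Z[row_to_col]                      # gather only, no floating-point operations

def apply_At(B, row_to_col, q):
    """A^T B without forming A: row i of B is accumulated into row row_to_col[i]."""
    out = np.zeros((q, B.shape[1]))
    np.add.at(out, row_to_col, B)             # roughly n*k additions in total
    return out
```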

E. Algorithm for General Form

The above algorithm works only when K is nonnegative. In the following, we introduce an algorithm that finds a nonnegative solution in the general case. This algorithm follows an approach similar to the one described in [1], which is based on a theorem proposed in [35].

Theorem 2: Define the nonnegative general quadratic form as

Q(Y) = (1/2) Y^T BY + c^T Y

where Y is an m-dimensional nonnegative vector, B is a symmetric positive semidefinite matrix, and c is an arbitrary vector.


Let B = B^+ − B^−, where B^+ and B^− are two symmetric matrices defined as

B^+_ij = B_ij if B_ij > 0, and 0 otherwise
B^−_ij = |B_ij| if B_ij < 0, and 0 otherwise.

Then, the solution Y that minimizes Q(Y) can be obtained by the following iterative update:

y_i ← y_i [−c_i + sqrt(c_i^2 + 4(B^+ Y)_i (B^− Y)_i)] / [2(B^+ Y)_i].   (20)

Obviously, our objective O is a quadratic form of W (and of Z); thus, we only need to identify the corresponding B and c in the objective function and then apply Theorem 2 to obtain the solution for the general form.

For the variable W, the two parameters of the quadratic form O(W) are obtained by fixing Z and taking derivatives with respect to W at W = 0. B is the value of the second-order derivative with respect to W at W = 0

∂^2 O / (∂w_ij ∂w_kl) |_{W=0} = 2K_ik (Z^T A^T AZ)_lj.   (21)

c is the value of the first-order derivative with respect to W at W = 0

∂O / ∂w_ij |_{W=0} = −2(KAZ)_ij.   (22)

Let K = K^+ − K^−, where K^+ and K^− are symmetric matrices whose elements are all nonnegative

K^+_ij = K_ij if K_ij > 0, and 0 otherwise
K^−_ij = |K_ij| if K_ij < 0, and 0 otherwise.

Substituting B and c in (20) with the right-hand sides of (21) and (22), respectively, we obtain the multiplicative updating equation for each element w_ij of W

w_ij ← w_ij [(KAZ)_ij + sqrt((KAZ)_ij^2 + P^+_ij P^−_ij)] / (2P^+_ij)   (23)

where P^+ = K^+ WZ^T A^T AZ and P^− = K^− WZ^T A^T AZ. For the case that all elements of the matrix K are positive, the solution becomes

w_ij ← w_ij (KAZ)_ij / P^+_ij   (24)

which is exactly the same form as (10).

Similarly, fixing W, the corresponding coefficients for the variable Z are

∂O / ∂z_ij |_{Z=0} = −2(A^T KW)_ij   (25)
∂^2 O / (∂z_ij ∂z_kl) |_{Z=0} = 2δ_ik (A^T A)_ik (W^T KW)_lj   (26)

where δ_ik equals 1 if i = k and 0 otherwise. The update rule for Z is

z_ij ← z_ij [(A^T KW)_ij + sqrt((A^T KW)_ij^2 + Q^+_ij Q^−_ij)] / (2Q^+_ij)   (27)

where Q^+ = 2A^T AZW^T K^+ W and Q^− = 2A^T AZW^T K^− W. For the case that all elements of the matrix K are positive, the solution becomes

z_ij ← z_ij (A^T KW)_ij / Q^+_ij   (28)

which is exactly the same form as (11).
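For completeness, here is a hedged NumPy sketch of the general-form updates (23) and (27), splitting the kernel as K = K^+ − K^− (again our own illustration; the constant factors follow the formulas as printed above):

```python
import numpy as np

def ccf_step_general(K, A, W, Z, eps=1e-10):
    """General-form CCF updates (23) and (27) for kernels with negative entries."""
    Kp, Km = np.maximum(K, 0), np.maximum(-K, 0)     # K = K+ - K-
    AtA = A.T @ A
    S = Z.T @ AtA @ Z                                # Z^T A^T A Z
    KAZ = K @ A @ Z
    Pp, Pm = Kp @ W @ S, Km @ W @ S                  # P+ and P-
    W = W * (KAZ + np.sqrt(KAZ**2 + Pp * Pm)) / (2 * Pp + eps)
    AtKW = A.T @ (K @ W)
    Qp = 2 * AtA @ Z @ (W.T @ Kp @ W)                # Q+
    Qm = 2 * AtA @ Z @ (W.T @ Km @ W)                # Q-
    Z = Z * (AtKW + np.sqrt(AtKW**2 + Qp * Qm)) / (2 * Qp + eps)
    return W, Z
```

When all entries of K are nonnegative, K^− = 0 and these expressions reduce to the simpler forms (24) and (28).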


TABLE IV. Clustering Results on the Yale Face Database

Fig. 2. Clustering performance on Yale database. (a) Accuracy versus number of clusters. (b) Mutual information versus number of clusters.

IV. Experiments

In this section, we evaluate the CCF algorithm on image clustering to verify its effectiveness.

A. Data Corpora

We conducted the performance evaluation on three image data sets, described below. Their statistics are summarized in Table III, and sample images are shown in Fig. 1.

TABLE III. Statistics of Image Datasets

Fig. 1. Sample images for three data sets. (a) Yale sample images. (b) COIL20 sample images. (c) MNIST sample images.

1) Yale Database:³ The Yale database contains 165 gray-scale images of 15 individuals. There are 11 images per subject, one per facial expression or configuration: center/right/left-light, with/without glasses, happy, sad, sleepy, surprised, and wink. In all the experiments, the images are preprocessed so that the faces are located. The original images are first normalized in scale and orientation such that the two eyes are aligned at the same position; then the facial areas are cropped into the final images for clustering. Each image is 32 × 32 pixels (1024-dimensional) with 256 gray levels per pixel.

³Available at http://cvc.yale.edu/projects/yalefaces/yalefaces.html.


TABLE V. Clustering Results on the COIL20 Database

Fig. 3. Clustering performance on COIL20 database. (a) Accuracy versus number of clusters. (b) Mutual information versus number of clusters.

2) COIL20 Database:⁴ The Columbia Object Image Library (COIL-20) is a database of gray-scale images of 20 objects. The objects were placed on a motorized turntable against a black background; the turntable was rotated through 360° and a fixed camera took images at pose intervals of 5° for each object. Thus, each object has 72 images in total. The size of each image is the same as in the Yale database, so each image is represented by a 1024-dimensional vector in image space.

⁴Available at http://www.cs.columbia.edu/CAVE/software/softlib/coil-20.php.

3) MNIST Database:⁵ The MNIST database of handwritten digits has a training set of 60 000 examples and a test set of 10 000 examples. We randomly selected 1000 samples for the experiment. The digit images have been size-normalized and centered in a fixed-size frame, resulting in 28 × 28 gray-scale images.

⁵Available at http://yann.lecun.com/exdb/mnist/.
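For reference, a trivial sketch of how such images are arranged into the data matrix X used throughout the paper (hypothetical names; loading the image files is omitted): each h × w gray-scale image becomes one column of length m = h·w.

```python
import numpy as np

def build_data_matrix(images):
    """Stack gray-scale images (each an h x w array) as columns of X in R^{m x n}."""
    X = np.stack([img.reshape(-1) for img in images], axis=1).astype(float)
    return X / 255.0   # map 0-255 gray levels to [0, 1]; a common choice, not specified in the paper
```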

B. Evaluation Metrics

We use two metrics to evaluate the clustering performance [14], [16]. The result is evaluated by comparing the cluster label of each sample with the label provided by the data set.

The first metric is accuracy (AC), which measures the percentage of correct labels obtained. Given a data set containing n images, for each sample image x_i let l_i be the cluster label obtained by applying the different algorithms and r_i be the label provided by the data set. The accuracy is defined as

AC = [ Σ_{i=1}^{n} δ(r_i, map(l_i)) ] / n   (29)

where δ(x, y) is the delta function that equals one if x = y and zero otherwise, and map(l_i) is the mapping function that maps each cluster label l_i to the equivalent label from the data set.
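A hedged sketch of (29) in code: the paper does not spell out how map(·) is obtained, so this version assumes the usual best-match assignment between cluster labels and class labels computed with the Hungarian algorithm:

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def clustering_accuracy(true_labels, cluster_labels):
    """AC = (1/n) * sum_i delta(r_i, map(l_i)), with map(.) chosen by best matching."""
    r, l = np.asarray(true_labels), np.asarray(cluster_labels)
    classes, clusters = np.unique(r), np.unique(l)
    # match[i, j] = number of samples of class i that were assigned to cluster j
    match = np.array([[np.sum((r == ci) & (l == cj)) for cj in clusters] for ci in classes])
    row, col = linear_sum_assignment(-match)      # maximize the total number of matches
    return match[row, col].sum() / r.size
```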


TABLE VI. Clustering Results on the MNIST Database

Fig. 4. Clustering performance on MNIST database. (a) Accuracy versus number of clusters. (b) Mutual information versus number of clusters.

The second metric is the normalized mutual information. In clustering applications, mutual information is used to measure how similar two sets of clusters are. Given two sets of image clusters C and C', their mutual information metric MI(C, C') is defined as

MI(C, C') = Σ_{c_i ∈ C, c'_j ∈ C'} p(c_i, c'_j) · log [ p(c_i, c'_j) / (p(c_i) · p(c'_j)) ]   (30)

where p(c_i) and p(c'_j) denote the probabilities that an image arbitrarily selected from the data set belongs to the clusters c_i and c'_j, respectively, and p(c_i, c'_j) denotes the joint probability that this arbitrarily selected image belongs to both clusters at the same time. MI(C, C') takes values between zero and max(H(C), H(C')), where H(C) and H(C') are the entropies of C and C', respectively. It reaches the maximum max(H(C), H(C')) when the two sets of image clusters are identical, and it becomes zero when the two sets are completely independent. One important property of MI(C, C') is that its value is invariant under permutations of the cluster labels. In our experiments, we use the normalized metric

\widehat{MI}(C, C') = MI(C, C') / max(H(C), H(C'))   (31)

which takes values between zero and one.
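Equations (30) and (31) translate directly into code; the sketch below (our own illustration) estimates the probabilities from the empirical joint distribution of the two clusterings:

```python
import numpy as np

def normalized_mutual_info(labels_a, labels_b):
    """MI(C, C') / max(H(C), H(C')) as in (30)-(31), using natural logarithms."""
    a, b = np.asarray(labels_a), np.asarray(labels_b)
    n = a.size
    ca, cb = np.unique(a), np.unique(b)
    # empirical joint and marginal probabilities
    p_ab = np.array([[np.sum((a == i) & (b == j)) for j in cb] for i in ca]) / n
    p_a, p_b = p_ab.sum(axis=1), p_ab.sum(axis=0)
    nz = p_ab > 0
    mi = np.sum(p_ab[nz] * np.log(p_ab[nz] / np.outer(p_a, p_b)[nz]))
    h_a, h_b = -np.sum(p_a * np.log(p_a)), -np.sum(p_b * np.log(p_b))
    return mi / max(h_a, h_b)
```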

C. Performance Evaluations and Comparisons

To demonstrate the advantage of our algorithm on image clustering, we compare CCF with the following popular clustering algorithms. We not only compare CCF with the standard CF and NMF methods, to show the benefit of incorporating label information, but also with other semisupervised algorithms, to show the effectiveness of CCF:

1) the standard KMeans clustering method (KMeans);
2) nonnegative matrix factorization using the update rules based on Euclidean distance (NMF) [14];
3) traditional concept factorization (CF) [1];
4) semisupervised graph regularized nonnegative matrix factorization on manifold (SemiGNMF) [16], which incorporates the label information into the graph structure by modifying the weight matrix;
5) our proposed constrained concept factorization using NMF for initialization (CCF);
6) constrained K-means clustering with background knowledge (COP-Kmeans) [36];
7) semisupervised discriminant analysis (SDA) [25].

Evaluations are conducted with the number of clusters k ranging from 2 to 10. For any given k, k clusters are chosen randomly from the data set; 30% of the samples are used for semisupervised learning and the rest for testing. We apply the different algorithms for decomposition with the rank of the factorization set to k + 1. Once we obtain the new data representation, we apply k-means with cosine distance to get the clustering results. For the k-means clustering, k different initial centers are randomly selected and the clustering process is repeated 20 times; the result with the minimum k-means objective value is selected for the accuracy and mutual information measurements. We use the same initialization for all methods. We repeat the whole experiment procedure, including data selection, decomposition, and clustering, ten times for each k and calculate the average performance. The overall averages of accuracy and mutual information are recorded and listed below. In all the experiments, our proposed CCF algorithm consistently outperforms all the other algorithms.

Fig. 2 shows the plots of accuracy and normalized mutual information versus the number of clusters for the different algorithms on the Yale data set. The detailed values, as well as the standard deviations, are summarized in Table IV. For the Yale face dataset, we observe that the matrix factorization based algorithms obtain better results. This is because KMeans-based algorithms fail to discover the latent semantic information that is very important for the face images. Another observation is that SemiGNMF fails to make full use of the label information and in some cases performs even worse than CF and NMF; this is because there is no theoretical guarantee for SemiGNMF that data points sharing the same label are mapped sufficiently close to one another. On the Yale dataset, compared to the second best algorithm, CF, CCF achieves a 6.5% improvement in accuracy and an 8.2% improvement in normalized mutual information.

Fig. 3 shows the plots of accuracy and normalized mutual information for the COIL-20 data set. The detailed values, as well as the standard deviations, are summarized in Table V. From the figure, we can see that the advantage of our algorithm is obvious when the number of clusters k is small; as k increases, the improvement becomes smaller.


On average, however, on the COIL20 dataset, our CCF outperforms the second best algorithm, SemiGNMF, by 3.4% in accuracy and 3.9% in mutual information.

For the MNIST corpus, the accuracy and normalized mutual information are plotted in Fig. 4; the detailed values, as well as the standard deviations, are summarized in Table VI. In contrast to COIL-20, the improvement of our CCF algorithm over the other algorithms becomes larger as the number of clusters grows, especially for accuracy. Overall, our CCF performs better than the other algorithms; compared to CF, it improves the accuracy by 9.6% and the mutual information by 8.8%.

V. Conclusion

In this paper, we have presented a novel matrix factorization method, called CCF, which makes use of both labeled and unlabeled data points. CCF imposes the label information on the objective function as hard constraints; in this way, the new representations of the data points have more discriminating power. Moreover, our algorithm is parameter free, so CCF can be easily applied to a wide range of practical problems. The experimental results on three standard image databases have demonstrated the effectiveness of our approach.

References

[1] W. Xu and Y. Gong, "Document clustering by concept factorization," in Proc. ACM Int. Conf. Res. Develop. Inform. Retrieval, Jul. 2004, pp. 202–209.
[2] D. Tao, X. Li, X. Wu, and S. J. Maybank, "General averaged divergence analysis," in Proc. IEEE Int. Conf. Data Mining, 2007, pp. 302–311.
[3] D. Tao, X. Li, X. Wu, and S. J. Maybank, "Geometric mean for subspace selection," IEEE Trans. Pattern Anal. Mach. Intell., vol. 31, no. 2, pp. 260–274, Feb. 2009.
[4] D. D. Lee and H. S. Seung, "Learning the parts of objects by nonnegative matrix factorization," Nature, vol. 401, pp. 788–791, Oct. 1999.
[5] D. Cai, X. He, W. V. Zhang, and J. Han, "Regularized locality preserving indexing via spectral regression," in Proc. 16th ACM Int. Conf. Inform. Knowledge Manage., 2007, pp. 741–750.
[6] D. Cai, X. He, and J. Han, "Spectral regression: A unified subspace learning framework for content-based image retrieval," in Proc. 15th ACM Int. Conf. Multimedia, 2007, pp. 403–412.
[7] X. He, D. Cai, and W. Min, "Statistical and computational analysis of locality preserving projection," in Proc. 22nd Int. Conf. Mach. Learning, 2005, pp. 281–288.
[8] J. Ye, Q. Li, H. Xiong, H. Park, R. Janardan, and V. Kumar, "IDR/QR: An incremental dimension reduction algorithm via QR decomposition," IEEE Trans. Knowledge Data Eng., vol. 17, no. 9, pp. 1208–1222, Sep. 2005.
[9] I. T. Jolliffe, Principal Component Analysis. New York, NY, USA: Springer-Verlag, 1989.
[10] R. O. Duda, P. E. Hart, and D. G. Stork, Pattern Classification. Hoboken, NJ, USA: Wiley-Interscience, 2000.
[11] A. Gersho and R. M. Gray, Vector Quantization and Signal Compression. Boston, MA, USA: Kluwer, 1992.
[12] D. D. Lee and H. S. Seung, "Algorithms for non-negative matrix factorization," in Proc. NIPS, 2001, pp. 556–562.
[13] S. Li, X. Hou, H. Zhang, and Q. Cheng, "Learning spatially localized, parts-based representation," in Proc. IEEE Int. Conf. Comput. Vision Pattern Recognit., 2001, pp. 207–212.
[14] W. Xu, X. Liu, and Y. Gong, "Document clustering based on nonnegative matrix factorization," in Proc. Int. Conf. Res. Develop. Inform. Retrieval, Aug. 2003, pp. 267–273.
[15] D. Cai, X. He, X. Wu, and J. Han, "Non-negative matrix factorization on manifold," in Proc. 8th IEEE Int. Conf. Data Mining, 2008, pp. 63–72.
[16] D. Cai, X. He, J. Han, and T. S. Huang, "Graph regularized nonnegative matrix factorization for data representation," IEEE Trans. Pattern Anal. Mach. Intell., vol. 33, no. 8, pp. 1548–1560, Aug. 2011.


[17] D. D. Lee and H. S. Seung, "Learning the parts of objects by nonnegative matrix factorization," Nature, vol. 401, pp. 788–791, Oct. 1999.
[18] D. L. Donoho and M. Elad, "Optimally sparse representation in general (non-orthogonal) dictionaries via l1 minimization," Proc. Nat. Acad. Sci. (PNAS), vol. 100, no. 5, pp. 2197–2202, 2003.
[19] B. A. Olshausen and D. J. Field, "Sparse coding with an overcomplete basis set: A strategy employed by V1," Vision Res., vol. 37, no. 23, pp. 3311–3325, Dec. 1997.
[20] N. Morioka and S. Satoh, "Generalized lasso based approximation of sparse coding for visual recognition," in Proc. NIPS, 2011, pp. 181–189.
[21] Z. Jiang, G. Zhang, and L. Davis, "Submodular dictionary learning for sparse coding," in Proc. CVPR, 2012, pp. 3418–3425.
[22] B. A. Olshausen and D. J. Field, "Emergence of simple-cell receptive field properties by learning a sparse code for natural images," Nature, vol. 381, no. 6583, pp. 607–609, 1996.
[23] W. Liu, J. Wang, and S.-F. Chang, "Robust and scalable graph-based semisupervised learning," Proc. IEEE, vol. 100, no. 9, pp. 2624–2638, Sep. 2012.
[24] X. Yang, H. Fu, H. Zha, and J. Barlow, "Semisupervised nonlinear dimensionality reduction," in Proc. ICML, 2006, pp. 1065–1072.
[25] D. Cai, X. He, and J. Han, "Semisupervised discriminant analysis," in Proc. ICCV, 2007, pp. 1–7.
[26] D. Zhang, Z. Zhou, and S. Chen, "Semisupervised dimensionality reduction," in Proc. SIAM, 2007, pp. 629–634.
[27] Z. Zhang, H. Zha, and M. Zhang, "Spectral methods for semisupervised manifold learning," in Proc. CVPR, 2008, pp. 1–6.
[28] X. He, D. Cai, and J. Han, "Learning a maximum margin subspace for image retrieval," IEEE Trans. Knowl. Data Eng., vol. 20, no. 2, pp. 189–201, Feb. 2008.
[29] Y. Zhang and D.-Y. Yeung, "Semisupervised discriminant analysis via CCCP," in Proc. ECML/PKDD, 2008, pp. 644–659.
[30] M. Sugiyama, T. Ide, S. Nakajima, and J. Sese, "Semisupervised local Fisher discriminant analysis for dimensionality reduction," Mach. Learning, vol. 78, nos. 1–2, pp. 35–61, 2010.
[31] Z. Zhang, T. W. Chow, and M. Zhao, "Trace ratio optimization-based semisupervised nonlinear dimensionality reduction for marginal manifold visualization," IEEE Trans. Knowl. Data Eng., vol. 25, no. 5, pp. 1148–1161, May 2013.
[32] A. P. Dempster, N. M. Laird, and D. B. Rubin, "Maximum likelihood from incomplete data via the EM algorithm," J. Roy. Statist. Soc. B (Methodol.), vol. 39, no. 1, pp. 1–38, 1977.
[33] T. H. Cormen, C. E. Leiserson, R. L. Rivest, and C. Stein, Introduction to Algorithms, 2nd ed. Cambridge, MA, USA: MIT Press, 2001.
[34] G. W. Stewart, Matrix Algorithms Volume I: Basic Decompositions. Philadelphia, PA, USA: SIAM, 1998.
[35] F. Sha, Y. Lin, L. Saul, and D. Lee, "Multiplicative updates for nonnegative quadratic programming," Neural Comput., vol. 19, no. 8, pp. 2004–2031, 2007.
[36] K. Wagstaff, C. Cardie, S. Rogers, and S. Schroedl, "Constrained k-means clustering with background knowledge," in Proc. Int. Conf. Mach. Learning, 2001, pp. 577–584.


Haifeng Liu (M'11) is an Associate Professor with the College of Computer Science, Zhejiang University, Hangzhou, China. Her current research interests include machine learning, pattern recognition, web mining, and information dissemination.

Genmao Yang received the B.S. degree in computer science from Zhejiang University, Hangzhou, China, in 2012. He is currently pursuing the M.S. degree in computer science at the College of Computer Science, Zhejiang University. His current research interests include machine learning, computer vision, and information retrieval.

Zhaohui Wu (M’00–SM’05) received the Ph.D. degree in computer science from Zhejiang University, Hangzhou, China, and Kaiserslautern University, Kaiserslautern, Germany, in 1993. He is currently a Professor with the College of Computer Science and the Vice Principal with Zhejiang University. His current research interests include distributed artificial intelligence, grid computing and systems, and embedded ubiquitous computing. Dr. Wu is a Senior Member of the IEEE Computer Society.

Deng Cai (M’09) received the Ph.D. degree in computer science from the University of Illinois at Urbana-Champaign, Urbana, IL, USA, in 2009. He is a Professor with the State Key Laboratory of CAD&CG, College of Computer Science, Zhejiang University, Hangzhou, China. His current research interests include machine learning, data mining, and information retrieval.
