

A Unified Framework for Data Visualization and Coclustering

Lazhar Labiod and Mohamed Nadif

Abstract— We propose a new theoretical framework for data visualization. The framework is based on an iterative procedure that seeks an appropriate approximation of the data matrix A using two stochastic similarity matrices, one defined on the set of rows and one on the set of columns. This process converges to a steady state in which the approximated data Â is composed of g similar rows and l similar columns. Reordering A according to the first left and right singular vectors then yields an optimal data reorganization revealing homogeneous block clusters. Furthermore, we show that our approach is related to a Markov chain model, to the double k-means with g × l block clusters, and to spectral coclustering. Numerical experiments on simulated and real data sets show the interest of our approach.

Index Terms— Coclustering, data visualization, power method, stochastic data.

Manuscript received September 28, 2013; revised July 18, 2014; accepted September 14, 2014. Date of publication November 3, 2014; date of current version August 17, 2015. The authors are with the Department of Mathematics and Computer Sciences, LIPADE, University Paris Descartes, Paris 75270, France (e-mail: [email protected]; [email protected]). Digital Object Identifier 10.1109/TNNLS.2014.2359918

I. INTRODUCTION

To reveal a homogeneous block structure in a data matrix, many standard clustering methods can be used: the set of rows, hereafter called I, and the set of columns, hereafter called J, are clustered and the data matrix is reorganized according to the resulting partitions. However, this procedure is not efficient because it treats the two sets separately, whereas they should be considered simultaneously. The methods called block clustering methods, which consider the two sets simultaneously and organize the data into homogeneous blocks, are therefore more suitable. The earliest coclustering formulation is due to [5]. The basic idea of these methods consists in permuting objects and attributes so as to draw a correspondence structure on I × J. In the past few years, block clustering, also called biclustering, coclustering, or two-mode data analysis, has received a significant amount of attention as an important problem with many applications in the context of data mining. For data sets arising in text mining and bioinformatics, where data are represented in a very high-dimensional space, clustering both dimensions of the data matrix simultaneously is often more desirable than traditional one-side clustering. In algorithmic terms, coclustering often consists in interlacing row clusterings with column clusterings at each iteration [7]; it thereby exploits the duality between rows and columns, which makes it possible to deal efficiently with high-dimensional data. Many approaches can be used for this task. Dhillon [3] developed a spectral coclustering algorithm for word-document data: the largest left and right singular vectors of the normalized word-document matrix are computed and a final clustering step using k-means is applied to the data projected onto the topmost singular vectors. In [2], Dhillon et al. proposed an information-theoretic coclustering algorithm that treats a nonnegative matrix as an empirical joint probability distribution of two discrete random variables and casts coclustering as an optimization problem. Probabilistic model-based clustering techniques have also shown promising results

in several coclustering situations. For example, coclustering of binary data can be treated using the latent block Bernoulli model [4]. It is worth noting that, even when the aim is clustering, coclustering achieves better results for large and sparse data because it implicitly performs an adaptive dimensionality reduction at each iteration and thus overcomes the high dimensionality and sparsity of the data, leading, for example, to better document clustering accuracy than one-side clustering methods. Despite these advantages, all coclustering methods require the number of blocks to be known. In this brief, we do not tackle coclustering directly; instead, we show how an appropriate visualization of the data induces a reorganization into homogeneous blocks. We consider that data visualization can be used to understand data better. Prominent authors in the field of information visualization [8] have noted that the data-mining community gives minimal attention to information visualization, but believe there are hopeful signs that the narrow bridge between data mining and information visualization will be widened in the coming years. Bertin [12] described the visualization procedure as simplifying without destroying and was convinced that simplification was no more than regrouping similar things. Spath [10] considered that such matrix permutation approaches have a great advantage over cluster algorithms, because no information of any kind is lost and because the number of clusters does not have to be presumed; it is easily and naturally visible. Arabie and Hubert [11] referred to similar advantages, calling the approach a nondestructive data analysis and emphasizing its essential property that no transformation or reduction of the data itself actually takes place. For a detailed overview of information visualization methods, see [13] for instance. Aiming at the detection of homogeneous blocks within a data visualization approach, we propose a new theoretical framework whose contributions are as follows.

1) We develop an efficient iterative procedure to find an optimal simultaneous reordering of objects and attributes. We show that the solution is given as the steady state of a Markov chain process.

2) We establish the relationship with the double k-means optimization criterion, spectral coclustering, and random walk analysis, and show that these approaches share the same objective.

3) Unlike known coclustering methods, our approach has the potential to address the problem of the optimal choice of the number of clusters by allowing a natural identification of clusters.

The rest of the brief is organized as follows. Section II introduces the problem formulation and the aims of this brief. Section III is devoted to the proposed algorithm for data reordering. Section IV explains why our algorithm works well and why it is useful, particularly in the coclustering context. Section V presents numerical experiments on real and simulated data. Finally, Section VI concludes and summarizes the advantages of our contribution.

II. ITERATIVE STOCHASTIC MATRIX APPROXIMATION FRAMEWORK

An interesting connection between data matrices and graph theory can be established. Let A be an m × n data matrix; it can be seen as a



weighted bipartite graph G = (V, E), where V is the set of vertices and E is the set of edges. G = (V, E) is said to be bipartite if its vertices can be partitioned into two sets I and J such that every edge in E has exactly one end in I and the other in J: V = I ∪ J. The data matrix A can thus be viewed as a weighted bipartite graph where each node i in I corresponds to a row and each node j in J corresponds to a column. The edge between i and j has weight a_ij, the element of the matrix at the intersection of row i and column j. In this brief, we stipulate the existence of a homogeneous block structure in the data, which makes it possible to obtain a block diagonal matrix after an appropriate permutation. Given a rectangular data matrix A, a similarity function s(a_y, a_z) is a function such that s(a_y, a_z) = s(a_z, a_y), s(a_y, a_z) > 0 if y ≠ z, and s(a_y, a_z) = 0 if y = z. Below we define two stochastic similarity matrices as follows:

Φ_r = S_r D_r^{-1} and Φ_c = D_c^{-1} S_c    (1)

where S_r and S_c are, respectively, the row similarity and column similarity matrices, D_r = diag(S_r 𝟙) and D_c = diag(S_c 𝟙) are the degree matrices associated, respectively, with S_r and S_c, and 𝟙 is the vector of appropriate dimension whose entries are all equal to 1.
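For concreteness, the construction in (1) can be sketched in a few lines of numpy. The snippet below uses the similarities S_r = AA^T and S_c = A^T A, the choice that also appears in the figure captions of Section V; any symmetric nonnegative similarity fits the definition above. The function name and the eps guard are ours, not part of the brief.

```python
import numpy as np

def stochastic_similarity_matrices(A, eps=1e-12):
    """Build Phi_r = S_r D_r^{-1} (column stochastic) and
    Phi_c = D_c^{-1} S_c (row stochastic), as in (1)."""
    S_r = A @ A.T                       # m x m row-similarity matrix
    S_c = A.T @ A                       # n x n column-similarity matrix
    # (the diagonals could be zeroed to match s(a_y, a_y) = 0 exactly)
    d_r = S_r.sum(axis=1)               # degree vector S_r 1
    d_c = S_c.sum(axis=1)               # degree vector S_c 1
    Phi_r = S_r / (d_r[None, :] + eps)  # scale columns: Phi_r^T 1 = 1
    Phi_c = S_c / (d_c[:, None] + eps)  # scale rows: Phi_c 1 = 1
    return Phi_r, Phi_c
```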

In prior work, Meila and Shi [15] noted that for many natural problems, Φ_r and Φ_c are approximately block stochastic matrices, and hence the first g left eigenvectors of Φ_r are approximately piecewise constant over the g almost invariant row subsets. Likewise, the first l left eigenvectors of Φ_c are approximately piecewise constant over the l almost invariant column subsets. Because Φ_r is a column stochastic matrix (Φ_r^T 𝟙 = 𝟙) and Φ_c is a row stochastic matrix (Φ_c 𝟙 = 𝟙), the following iterative process:

Â^{(t+1)} = Φ_r^{(t)} Â^{(t)} Φ_c^{(t)} with Â^{(0)} = A    (2)

will converge to the approximated data Â, in which each row and each column has moved toward its prototype. In other words, this process converges to an equilibrium (steady) state. Let g be the multiplicity of the eigenvalue of Φ_r equal to 1 and l be the multiplicity of the eigenvalue of Φ_c equal to 1; the matrix Â is then composed of g ≪ m quasi-similar rows and l ≪ n quasi-similar columns, where each row and each column is represented by its prototype. Finally, the optimal reordering of the rows and columns is given by the sorted first left and right singular vectors of Â, guaranteeing a mutual reinforcement between row similarities and column similarities and thereby exploiting the duality between the set of rows and the set of columns.

III. ISMA ALGORITHM FOR DATA VISUALIZATION

Once the stochastic similarity matrices Φ_r and Φ_c are obtained using (1), the basic idea of the iterative stochastic matrix approximation (ISMA) algorithm consists in:

1) estimating A iteratively by applying the matrices Φ_r and Φ_c to the current estimate using the following update:

Â^{(t+1)} = Φ_r Â^{(t)} Φ_c.    (3)

This process converges to an equilibrium (steady) state;

2) extracting the first left and right singular vectors π_r and π_c of Â using the power method [1], a well-known technique for computing the largest left and right singular vectors of a data matrix. For the numerical computation of the leading singular vectors of Â, we use a variant of this method (Algorithm 1) adapted to the case of a rectangular data matrix.


Algorithm 1: Modified Power Method
Input: data Â
Initialize: π_r^{(0)} = Â 𝟙,  π_r^{(0)} ← π_r^{(0)} / ||π_r^{(0)}||
repeat
  π_c^{(t+1)} = Â^T π_r^{(t)},  π_c^{(t+1)} ← π_c^{(t+1)} / ||π_c^{(t+1)}||    (4)
  π_r^{(t+1)} = Â π_c^{(t+1)},  π_r^{(t+1)} ← π_r^{(t+1)} / ||π_r^{(t+1)}||    (5)
until stabilization of π_r and π_c

Algorithm 2: ISMA for Coclusters Visualization
Input: data A, Φ_r, and Φ_c
Output: data Â, sorted π_r, and π_c
repeat
  A^{(t+1)} = Φ_r^{(t)} A^{(t)} Φ_c^{(t)},  γ^{(t+1)} ← ||A^{(t+1)} − A^{(t)}||²
until |γ^{(t+1)} − γ^{(t)}| ≈ 0
Deduce: Â
Compute: π_r and π_c using Algorithm 1
Reorganize: A according to sorted π_r and π_c

It consists in starting with an arbitrary vector π_r^{(0)} and repeatedly performing the updates of π_c and π_r by alternating (4) and (5) until convergence. The ISMA algorithm then reorganizes the rows and the columns of the data matrix A according to the sorted π_r and π_c. It also makes it possible to locate the points corresponding to an abrupt change in the curves of the first left and right singular vectors π_r and π_c, and thus to assess the number of clusters and the rows or columns belonging to each cluster. The main steps of the ISMA algorithm are summarized in Algorithm 2.

In ISMA, each row and column of the data moves toward its prototype: for each row and each column we obtain a new point by applying one step of the ISMA algorithm. Thus, a new iteration of ISMA results in a new data matrix Â which is a new version of A. By iterating this process, ISMA yields a sequence of approximated data matrices (Â^{(1)}, Â^{(2)}, ...), a sequence of row stochastic similarity matrices (Φ_r^{(1)}, Φ_r^{(2)}, ...), and a sequence of column stochastic similarity matrices (Φ_c^{(1)}, Φ_c^{(2)}, ...), where Â^{(0)} is the original data matrix and Â^{(t+1)} is obtained by applying ISMA to Â^{(t)}. At first sight, ISMA seems uninteresting since it eventually leads to a data matrix in which both rows and columns coincide, for any starting data matrix. However, our practical experience shows that the data matrix first collapses very quickly into row and column blocks, and that these blocks then move toward each other relatively slowly. If we stop the ISMA iteration at this point, the algorithm has a potential application for data reordering and coclustering. In ISMA, we iterate the product Â = Φ_r A Φ_c; if Φ_r and Φ_c were kept constant, this would be the power method [1], and A would converge to the leading left and right singular vectors (the vectors of ones, i.e., a single block) with the rate of convergence given by the second eigenvalue. However, the dynamics of ISMA can be more complex if we also allow Φ_r and Φ_c to change after each iteration. In practice, Φ_r, Φ_c, and A quickly reach simultaneous quasi-stable row and column states in which rows and columns have collapsed into blocks that slowly approach each other. Thus, ISMA can be seen as refining the affinity matrix A into an almost blocky matrix and then (trivially) extracting the leading singular vectors (the left singular vector to reorder the rows and the right singular vector to reorder the columns) using the power method.
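To make Algorithms 1 and 2 concrete, the following numpy sketch implements the modified power method and the ISMA iteration. For simplicity it keeps Φ_r and Φ_c fixed across iterations (the brief also discusses letting them change); the tolerances, iteration caps, and function names are ours, not part of the original algorithms.

```python
import numpy as np

def modified_power_method(A_hat, max_iter=100, tol=1e-10):
    """Algorithm 1 (sketch): leading left/right singular vectors of A_hat."""
    pi_r = A_hat @ np.ones(A_hat.shape[1])   # pi_r^(0) = A_hat 1
    pi_r /= np.linalg.norm(pi_r)
    pi_c = A_hat.T @ pi_r
    for _ in range(max_iter):
        pi_c = A_hat.T @ pi_r                # update (4)
        pi_c /= np.linalg.norm(pi_c)
        pi_r_new = A_hat @ pi_c              # update (5)
        pi_r_new /= np.linalg.norm(pi_r_new)
        if np.linalg.norm(pi_r_new - pi_r) < tol:
            pi_r = pi_r_new
            break
        pi_r = pi_r_new
    return pi_r, pi_c

def isma(A, Phi_r, Phi_c, tol=1e-8, max_iter=200):
    """Algorithm 2 (sketch): iterate A <- Phi_r A Phi_c, then reorder A."""
    A_hat, gamma_prev = A.astype(float), np.inf
    for _ in range(max_iter):
        A_next = Phi_r @ A_hat @ Phi_c
        gamma = np.linalg.norm(A_next - A_hat) ** 2
        A_hat = A_next
        if abs(gamma - gamma_prev) < tol:    # |gamma^(t+1) - gamma^(t)| ~ 0
            break
        gamma_prev = gamma
    pi_r, pi_c = modified_power_method(A_hat)
    rows, cols = np.argsort(pi_r), np.argsort(pi_c)
    return A[np.ix_(rows, cols)], pi_r, pi_c
```

Calling isma(A, *stochastic_similarity_matrices(A)) would reproduce the pipeline described above, up to the stopping thresholds, which the brief does not specify.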



IV. RELATED WORKS

It is worth understanding the intuition behind ISMA and why it can perform well in data visualization. First, we present the ISMA algorithm as a Markov chain model; second, we link it to a spectral coclustering technique; finally, we show that ISMA is closely related to a relaxed double k-means.

A. Relationship With Markov Chain Model

The iterative formula (3) in ISMA can be decomposed into two random walk subproblems as follows:

A^{(t+1/2)} = A^{(t)} Φ_c and A^{(t+1)} = Φ_r A^{(t+1/2)}    (6)

where the left formula of (6) represents a random walk performed on J governed by the stochastic matrix Φ_c, and the right formula represents a random walk on I governed by the stochastic matrix Φ_r. Multiplying both sides of the right formula of (6) by 𝟙 produces A^{(t+1)} 𝟙 = Φ_r A^{(t+1/2)} 𝟙.

With π_r^{(t+1)} = A^{(t+1)} 𝟙 and π_r^{(t)} = A^{(t+1/2)} 𝟙, we can define a random walk model on I by π_r^{(t+1)} = Φ_r π_r^{(t)}. Using the same procedure, we can derive a random walk model on J by π_c^{(t+1)} = Φ_c^T π_c^{(t)}.

In this section, we state results that link certain properties of the eigenvalues and eigenvectors of a stochastic similarity matrix Φ_r to a block diagonal or perturbed block diagonal structure of the matrix. The main idea is to identify an almost block diagonal structure in order to find the best reordering of the data matrix. To see what the outcome of this random walk will be, we consider a Markov chain with multiple irreducible components, noted I_1, I_2, ..., I_g, where I_i ∩ I_j = ∅, I_k ⊂ I, 1 ≤ i ≠ j, k ≤ g. What will be the outcome of a random walk performed on the set of states I according to the transition matrix Φ_r? Following the Perron–Frobenius theorem [14], all the eigenvalues are real and contained in [−1, 1]. Because Φ_r is stochastic, for every right eigenvector there is a corresponding left eigenvector associated with the same eigenvalue λ_1 = 1, which is called the Perron root. The associated right eigenvector is the vector 𝟙 and the associated left eigenvector is π_r = (1/|I|) 𝟙, representing the equilibrium distribution, so that π_r^T 𝟙 = 1. In matrix notation, we have π_r = Φ_r π_r and Φ_r^T 𝟙 = 𝟙. The results above hold for a general Markov matrix; we now focus on the submatrices of Φ_r: we can decompose the set of states (objects) into invariant subsets (groups or blocks) I_1, I_2, ..., I_g. This means that whenever the Markov process is in one of the invariant sets, for example I_1, it will remain in I_1 thereafter. If we apply an appropriate ordering of the data objects (rows), the stochastic similarity matrix Φ_r appears in a block diagonal form Φ_r = Diag(Φ_r^1, ..., Φ_r^g), where each block Φ_r^k (k = 1, ..., g) is a Markov matrix. Again, due to the Perron–Frobenius theorem, each block possesses a unique right eigenvector 𝟙_k, of length |I_k|, corresponding to its Perron root λ_k = 1 and related to the left eigenvector π_r^k. In terms of the total stochastic similarity matrix Φ_r, the eigenvalue λ_1 is g-fold and the g corresponding left eigenvectors can be written as linear combinations of the g vectors of the form (0, ..., 0, π^k, 0, ..., 0)^T induced by each block Φ_r^k of length |I_k|, where π^k = (1/|I_k|) 𝟙, k = 1, ..., g. As a result,

the left eigenvectors corresponding to λ_1 = 1 are constant on each invariant set of states: π_r = (α_1 π_r^1 | ... | α_k π_r^k | ... | α_g π_r^g), where α_k is a constant depending on the initial condition π_r^{(0)}. The structure of π_r^{(t)} during the short-run stabilization makes the discovery of the data ordering straightforward: the key is to look for values of π_r^{(t)} which are approximately equal.

B. Relationship With Spectral Coclustering

ISMA is related to spectral coclustering in that it finds a low-dimensional embedding of the data, after which k-means or another clustering technique can be used to produce the final coclustering. In this brief, however, it is not necessary to compute any singular vector explicitly (as most coclustering methods do): to obtain a low-dimensional embedding for coclustering, the embedding only needs to be a good linear combination of the left singular vectors and of the right singular vectors, respectively. In this respect, ISMA is a very different approach from spectral coclustering. In spectral coclustering, the embedding is formed by the bottom left and right eigenvectors of a normalized data matrix [3]. In ISMA, the embedding is defined as a weighted linear combination of singular vectors: π_r is a linear combination of all the left singular vectors of Â and π_c is a weighted linear combination of all the right singular vectors of Â. The left and right embeddings turn out to be very interesting for data reordering and coclustering. From the start, the first largest left and right singular vectors of Â are not very interesting because they move toward the uniform distribution over a long run time. However, the intermediate π_r and π_c obtained by ISMA after a short run time are very interesting. This experimental observation suggests that an effective reordering may only require running ISMA for a few iterations. Let us define the (n + m) × (n + m) data matrix

M = ( 0  Â ; Â^T  0 )

where Â is normalized in a similar way to [3], i.e., Â ← D_r^{-1/2} Â D_c^{-1/2} with D_r = diag(Â 𝟙) and D_c = diag(Â^T 𝟙). M is then diagonalizable: there exists a nonsingular matrix Q of eigenvectors q_1, q_2, ..., q_{n+m} such that

Q^{-1} M Q = diag(λ_1, λ_2, ..., λ_{n+m}).    (7)

Assuming that the eigenvalues are ordered |λ_1| > |λ_2| ≥ |λ_3| ≥ ··· ≥ |λ_{n+m}| and expanding the initial approximation π^{(0)} = (π_r^{(0)}; π_c^{(0)}) in terms of the eigenvectors of M such that

π^{(0)} = c_1 q_1 + c_2 q_2 + ··· + c_{n+m} q_{n+m} with q_k = (u_k; v_k)

where the upper part u_k corresponds to the rows of Â and the lower part v_k to the columns of Â (c_k ≠ 0 is assumed), we have

π^{(t)} = M^{(t)} π^{(0)} = c_1 λ_1^{(t)} q_1 + c_2 λ_2^{(t)} q_2 + ··· + c_{n+m} λ_{n+m}^{(t)} q_{n+m} = λ_1^{(t)} ( c_1 q_1 + Σ_{k=2}^{n+m} c_k (λ_k/λ_1)^{(t)} q_k ).    (8)

As we have |λ_k| < |λ_1| for k = 2, ..., n + m, each of the factors (λ_k/λ_1)^{(t)} must approach 0 as t approaches infinity. Therefore,


the second term tends to zero, which implies that the power method converges to the eigenvector q_1 = (u_1; v_1) corresponding to the dominant eigenvalue λ_1. The rate of convergence is determined by the ratio |λ_2/λ_1|; if this ratio is close to one, the convergence is very slow. In (8), we expanded π^{(t)} as a linear combination of the eigenvectors of M. It is easy to see that π^{(t)} = (π_r^{(t)}; π_c^{(t)}) is the leading eigenvector of M^{(t)}, where π_r^{(t)} and π_c^{(t)} are the left and right singular vectors of Â^{(t)}, respectively. This implies that

( 0  Â^{(t)} ; Â^{T(t)}  0 ) ( π_r^{(0)} ; π_c^{(0)} ) = ( Â^{(t)} π_c^{(0)} ; Â^{T(t)} π_r^{(0)} ) = ( π_r^{(t)} ; π_c^{(t)} ).    (9)

More interestingly, (9) provides the updates (4) and (5) given in Algorithm 1. Also, instead of constructing M (as most spectral coclustering methods do), which is bigger and sparser than the approximated data Â, we provide a way to cocluster the data using not M but directly Â. Now, by exploiting the block structure of M in (8), we can deduce π_r^{(t)} as a linear combination of the left singular vectors of Â and π_c^{(t)} as a linear combination of the right singular vectors of Â. This implies

( π_r^{(t)} ; π_c^{(t)} ) = ( λ_1^{(t)} ( c_1 u_1 + Σ_{k=2}^{n} c_k (λ_k/λ_1)^{(t)} u_k ) ; λ_1^{(t)} ( c_1 v_1 + Σ_{k=n+1}^{n+m} c_k (λ_k/λ_1)^{(t)} v_k ) ).    (10)
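The correspondence in (9) between one multiplication by M and the two updates applied directly to Â is easy to check numerically. The short script below does so on random data, with the normalization omitted and the dimensions chosen arbitrarily for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
m, n = 6, 4
A_hat = rng.random((m, n))
pi_r, pi_c = rng.random(m), rng.random(n)

# One multiplication by M = [[0, A_hat], [A_hat^T, 0]] ...
M = np.block([[np.zeros((m, m)), A_hat],
              [A_hat.T, np.zeros((n, n))]])
top, bottom = np.split(M @ np.concatenate([pi_r, pi_c]), [m])

# ... matches the updates (4) and (5) applied directly to A_hat.
assert np.allclose(top, A_hat @ pi_c)
assert np.allclose(bottom, A_hat.T @ pi_r)
```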

C. Relationship With the Double k-Means Criterion

The detection of homogeneous blocks in A can be reached by looking for the three matrices R, C, and S minimizing the total squared residue measure

J(A, RSC^T) = ||A − RSC^T||²    (11)

where R ∈ ℝ^{m×g} and C ∈ ℝ^{n×l} are the two clustering index matrices and S ∈ ℝ^{g×l} is the summary matrix. The term RSC^T characterizes the information of A that can be described by the cluster structures. The clustering problem can thus be formulated as a matrix approximation problem in which the aim is to minimize the approximation error between the original data A and the matrix reconstructed from the cluster structures. For a fixed coclustering, the summary matrix S can be expressed as S = D_r^{-1} R^T A C D_c^{-1}, where D_r^{-1} ∈ ℝ^{g×g} and D_c^{-1} ∈ ℝ^{l×l} are two diagonal matrices defined as D_r^{-1} = diag^{-1}(R^T 𝟙) and D_c^{-1} = diag^{-1}(C^T 𝟙). Plugging S into the objective function (11), the expression to optimize becomes ||A − R(D_r^{-1} R^T A C D_c^{-1})C^T||² = ||A − R̃R̃^T A C̃C̃^T||², where R̃ = R D_r^{-1/2} and C̃ = C D_c^{-1/2}. Let Φ_r = R̃R̃^T and Φ_c = C̃C̃^T; both Φ_r and Φ_c are nonnegative and stochastic, with a constant trace. Thus, double k-means can be considered as looking for the best approximation Â = Φ_r A Φ_c, which is quite similar to our iterative formula in the ISMA algorithm.
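As an illustration of the fixed-partition step above, the sketch below computes the summary matrix S = D_r^{-1} R^T A C D_c^{-1} and the residue (11) from given row and column labels. The helper name is ours and every cluster is assumed nonempty.

```python
import numpy as np

def double_kmeans_residue(A, row_labels, col_labels, g, l):
    """Summary matrix S and squared residue ||A - R S C^T||^2 of (11)."""
    m, n = A.shape
    R = np.zeros((m, g)); R[np.arange(m), row_labels] = 1.0  # row indicators
    C = np.zeros((n, l)); C[np.arange(n), col_labels] = 1.0  # column indicators
    Dr_inv = np.diag(1.0 / R.sum(axis=0))                    # diag^{-1}(R^T 1)
    Dc_inv = np.diag(1.0 / C.sum(axis=0))                    # diag^{-1}(C^T 1)
    S = Dr_inv @ R.T @ A @ C @ Dc_inv                        # block means
    return S, np.linalg.norm(A - R @ S @ C.T) ** 2
```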


TABLE I R EORGANIZATION OF T OWNSHIPS AND C HARACTERISTICS

There is more than one way to formulate the relaxed double k-means criterion ||A − Φ_r A Φ_c||². In the double k-means optimization, the parameters g and l need to be fixed in advance, whereas ISMA does not require this knowledge: it provides a result from which a coclustering can be deduced with the right number of row and column clusters.

V. NUMERICAL EXPERIMENTS

We provide experimental results to illustrate the behavior of the ISMA algorithm. We argue that ISMA makes it possible to capture the trends of objects over a subset of attributes and then to reorganize the data matrix into homogeneous blocks. We apply our algorithm to different real-world and simulated data sets with different patterns to show the ability of ISMA to rediscover the hidden blocks in the data without fixing any parameters on the ordering of the rows and columns. The ISMA framework has the potential to address the question of the number of clusters underlying the data, detecting a suitable number of row and column clusters by analyzing the evolution of the first left and right singular vectors π_r and π_c of the approximated matrix Â. Furthermore, we show that ISMA has a very interesting application in the context of coclustering; coclustering may be used as a way to validate the data reordering.

A. Illustrative Example

Hereafter, we illustrate the reordering aim with an example consisting of nine characteristics (rows) and 16 townships (columns); each cell indicates the presence (1) or absence (0) of a characteristic in a township. Using (1), we first build Φ_r and Φ_c separately, obtaining two stochastic matrices. To obtain the approximation Â, we apply Algorithm 2 and obtain the first left and right singular vectors of Â. Finally, the 16 townships data set is reorganized into homogeneous blocks. This example was used by Niermann [6] for a data-ordering task, where the author aims to reveal a diagonal of homogeneous blocks (Table I). Obviously, Table I is more concise than the original data set, where the cities are given in alphabetical order {A, B, C, ..., P} and the characteristics in random order. It clearly appears that we can characterize each cluster of townships by a cluster of characteristics; for instance, {H, K} by {High School, Railway Station, Police Station}. The ISMA algorithm makes it possible to identify groups of towns and groups of characteristics that exhibit similar patterns. It is easy to observe (see Fig. 1) that the appropriate number of blocks is equal to 3 × 3. To illustrate the performance of ISMA, we propose to visualize a synthetic 2000 × 500 data set (SIM) generated according to a latent block Bernoulli mixture model with 3 × 3 blocks [4] (Fig. 2). The reordering task is to reorganize the data into homogeneous blocks. After the learning stage, the order of the rows and the order of the columns are given by the sorted vectors π_r and π_c. We then reorganize the rows and columns separately or simultaneously according to



Fig. 1. Townships: reordered S_r = AA^T according to π_r and reordered S_c = A^T A according to π_c.

Fig. 2. SIM: A, reordered Â, and A according to π_r and π_c simultaneously.

Fig. 4. Classic30: A, reordered Â, and A according to π_r and π_c simultaneously.

Fig. 5. Number of ISMA iterations necessary to achieve (left) γ^{(t+1)} = ||A^{(t+1)} − A^{(t)}||² ≈ 0 and (right) acceleration = |γ^{(t+1)} − γ^{(t)}| ≈ 0.

TABLE III CLUSTERING ACCURACY AND NORMALIZED MUTUAL INFORMATION (%)

Fig. 3. SIM: reordered S_r = AA^T according to π_r and reordered S_c = A^T A according to π_c.

TABLE II CONFUSION MATRIX EVALUATION ON ROWS AND COLUMNS DATA

the sorted π_r and π_c obtained, as shown in Fig. 3. The plots of the sorted π_r and π_c are shown on the right of these figures. It can be seen that our method effectively reconstructs all the blocks. The sorted π_r and π_c plots show that the number of abrupt changes in each plot corresponds to the true number of blocks.

B. Does This Data Reordering Make Any Sense?

A performance study has been conducted to evaluate our method. In this section, we try to answer the following question: is this reordering meaningful? To answer it, we use confusion matrices measuring the clustering performance of the coclustering result provided by our method. After the learning stage, the cluster indicators are given by the vectors π_r and π_c. It can be seen that our method efficiently reconstructs all the coclusters. From Table II, we observe that the data reordering provided by ISMA can be useful in a coclustering context. It is worth underlining that ISMA visualization does not destroy the data and, unlike most coclustering algorithms, does not require the number of blocks to be known.

To evaluate ISMA in terms of clustering, we used six data sets whose characteristics are reported in Table III. The first five data sets, commonly used in document clustering, are word-by-document matrices whose rows correspond to documents and columns to words; each cell denotes the frequency of a word in a document. Classic30, Classic400, and Classic3 [3] each contain three classes, namely Medline, Cisi, and Cranfield. CSTR consists of abstracts divided into four research areas: Natural Language Processing (NLP), Robotics/Vision, Systems, and Theory. WebKB4 consists of webpages collected from computer science departments and grouped into four classes: student, faculty, course, and project. The last data set, Leukemia, contains expression levels of genes taken over samples; it is well known in the academic community and can be divided into three clusters (ALL-B/ALL-T/AML). We compared the performance of ISMA with the spectral coclustering [3] discussed in Section IV-B using two evaluation metrics: accuracy (ACC), corresponding to the percentage of well-classified elements, and the normalized mutual information (NMI) [9]. In Table III, we observe that ISMA outperforms spectral coclustering in all situations. Fig. 4 shows that ISMA rediscovers well the block structure of the Classic30 data. Furthermore, ISMA needs only a few iterations to achieve |γ^{(t+1)} − γ^{(t)}| ≈ 0. The convergence behavior is empirically illustrated on the CSTR data set (Fig. 5).
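The two evaluation metrics can be computed as in the sketch below, assuming scikit-learn and scipy are available. The Hungarian matching used for ACC follows the usual convention; the brief does not detail the matching step, so this is an assumption on our part.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment
from sklearn.metrics import confusion_matrix, normalized_mutual_info_score

def clustering_accuracy(y_true, y_pred):
    """ACC: proportion of well-classified elements under the best
    one-to-one matching of predicted clusters to true classes."""
    cm = confusion_matrix(y_true, y_pred)
    rows, cols = linear_sum_assignment(-cm)   # maximize matched counts
    return cm[rows, cols].sum() / len(y_true)

def nmi(y_true, y_pred):
    """Normalized mutual information, as in [9]."""
    return normalized_mutual_info_score(y_true, y_pred)
```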


VI. CONCLUSION

In this brief, we presented a method called iterative stochastic matrix approximation for data visualization. The procedure consists in applying two stochastic matrices to the current data A in an iterative fashion and then computing the first (largest) left and right singular vectors, associated with the eigenvalue λ_1, of the approximated matrix Â. The leading singular vectors are then sorted in the same direction and the data matrix is reordered accordingly. Thus, the final result is a visualization of the data matrix as homogeneous blocks. One interesting application of the proposed procedure is coclustering: a simultaneous clustering of the rows and columns of the data can be deduced from the output of ISMA by studying the evolution of the elements of the first left and right singular vectors. A set of experiments highlights the benefits of reorganizing the data matrix. In the future, we plan to develop a more scalable version of ISMA using only matrix-vector multiplications. Finally, we expect to study the behavior of the ISMA version in which Φ_r and Φ_c are estimated iteratively.

REFERENCES

[1] G. H. Golub and C. F. Van Loan, Matrix Computations, 3rd ed. Baltimore, MD, USA: The Johns Hopkins Univ. Press, 1996, pp. 330–332.
[2] I. S. Dhillon, S. Mallela, and D. S. Modha, "Information-theoretic co-clustering," in Proc. 9th ACM SIGKDD Int. Conf. KDD, 2003, pp. 89–98.


[3] I. S. Dhillon, "Co-clustering documents and words using bipartite spectral graph partitioning," in Proc. 7th ACM SIGKDD Int. Conf. KDD, 2001, pp. 269–274.
[4] G. Govaert and M. Nadif, "Block clustering with Bernoulli mixture models: Comparison of different approaches," Comput. Statist. Data Anal., vol. 52, no. 6, pp. 3233–3245, 2008.
[5] J. A. Hartigan, "Direct clustering of a data matrix," J. Amer. Statist. Assoc., vol. 67, no. 337, pp. 123–129, 1972.
[6] S. Niermann, "Optimizing the ordering of tables with evolutionary computation," Amer. Statist., vol. 59, no. 1, pp. 41–46, 2005.
[7] L. Labiod and M. Nadif, "Co-clustering under nonnegative matrix tri-factorization," in Proc. 18th ICONIP, 2011, pp. 709–717.
[8] B. B. Bederson and B. Shneiderman, The Craft of Information Visualization: Readings and Reflections. San Mateo, CA, USA: Morgan Kaufmann, 2003.
[9] A. Strehl and J. Ghosh, "Cluster ensembles—A knowledge reuse framework for combining multiple partitions," J. Mach. Learn. Res., vol. 3, pp. 583–617, Mar. 2002.
[10] H. Spath, Cluster Analysis Algorithms for Data Reduction and Classification of Objects. Chichester, U.K.: Ellis Horwood, 1980.
[11] P. Arabie and L. J. Hubert, "An overview of combinatorial data analysis," in Clustering and Classification. River Edge, NJ, USA: World Scientific, 1996, pp. 5–63.
[12] J. Bertin, Graphics and Graphic Information Processing. San Mateo, CA, USA: Morgan Kaufmann, 1999, pp. 62–65.
[13] I. Liiv, "Seriation and matrix reordering methods: An historical overview," Statist. Anal. Data Mining, vol. 3, no. 2, pp. 70–91, 2010.
[14] R. A. Horn and C. R. Johnson, Matrix Analysis. Cambridge, U.K.: Cambridge Univ. Press, 1986.
[15] M. Meila and J. Shi, "A random walks view of spectral segmentation," in Proc. AISTATS, 2001.
