
Mean Vector Component Analysis for Visualization and Clustering of Nonnegative Data

Robert Jenssen, Member, IEEE

Abstract—Mean vector component analysis (MVCA) is introduced as a new method for visualization and clustering of nonnegative data. The method is based on dimensionality reduction by preserving the squared length, and implicitly also the direction, of the mean vector of the original data. The optimal mean-vector-preserving basis is obtained from the spectral decomposition of the inner-product matrix, and it is shown to capture clustering structure. MVCA corresponds to certain uncentered principal component analysis (PCA) axes. Unlike in traditional PCA, these axes do not in general correspond to the top eigenvalues. MVCA is shown to produce different visualizations and sometimes considerably improved clustering results for nonnegative data, compared with PCA.

Index Terms—Clustering, eigenvalues (spectrum), eigenvectors, inner-product matrix, mean vector, nonnegative data, principal component analysis, visualization.

I. INTRODUCTION

VISUALIZATION and clustering are of utmost importance in learning systems, and can be approached in a variety of ways [1]–[3]. A common feature of both tasks is that a dimensionality reduction of the input data is often carried out. A very old, but still dominant, method for dimensionality reduction relevant to this exposition is principal component analysis (PCA) [4]–[8]. PCA performs dimensionality reduction by projecting the original data onto a subspace spanned by the eigenvectors of the data correlation (or covariance) matrix corresponding to the top eigenvalues (spectrum). These are the principal axes, and the projected data are called the principal components. The principal components can be obtained directly from the eigenvectors of the data inner-product matrix, which shares its eigenvalues with the correlation matrix. If the dimension of the subspace is two or three, a visualization of the underlying structure of the high-dimensional input data is obtained. A clustering of the input data into groups [9] may be accomplished by executing, for example, the k-means algorithm [10] on the principal components. PCA is also commonly used as a preprocessing step for more advanced data analysis techniques, e.g., classification. Although PCA is commonly used for clustering and preprocessing, it is well known that PCA does not necessarily preserve any cluster structure present in the data.
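The PCA-plus-k-means pipeline described above can be sketched in a few lines (an illustrative example using scikit-learn; the data, cluster count, and dimensionality are placeholders and are not taken from the paper):

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans

# Placeholder nonnegative data: N samples in d dimensions.
rng = np.random.default_rng(0)
X = rng.random((500, 50))  # shape (N, d); any nonnegative data works here

# Project onto the top principal axes, then cluster the components.
components = PCA(n_components=2).fit_transform(X)
labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(components)
```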

Manuscript received April 13, 2012; revised December 14, 2012; accepted May 1, 2013. Date of publication June 10, 2013; date of current version September 27, 2013. The author is with the Department of Electrical Engineering, Department of Physics and Technology, University of Tromsø, Tromsø N-9037, Norway (e-mail: [email protected]). Color versions of one or more of the figures in this paper are available online at http://ieeexplore.ieee.org. Digital Object Identifier 10.1109/TNNLS.2013.2262774

In most cases, PCA assumes a centering of the input data such that the mean is zero. The centering is convenient and enables PCA to be derived as a maximum variance retaining method, or as a minimum mean square error reconstruction method [1]–[3]. PCA can also be shown to best preserve the 2-norm, or Frobenius (Hilbert–Schmidt) norm, of the inner-product matrix when represented as a vector (Section II). Recently, several extensions to PCA have appeared [11]–[14]. Many other advanced dimensionality reduction methods using some top eigenvectors, e.g., for clustering, have also been developed [15]–[20], differing in the way the data matrices are defined. See also [21] and [22].

In this paper, the aim is to bring out possible clustering structure in nonnegative data, primarily in the form of a visualization or a grouping of the data, or, in general, as a preprocessing step. Nonnegative data are abundant in important domains such as images and bag-of-words models. To that end, a new method for dimensionality reduction is proposed which, for nonnegative data, can be shown to capture the clustering structure of the data. This is a beneficial property compared with PCA. The approach taken is completely different from that of traditional PCA, which assumes centered data, as the mean vector of the nonnegative data is placed at the center of the analysis. The novel concept introduced here is to maximally preserve the squared-Euclidean length, and implicitly also the direction, of the mean vector of the data when performing dimensionality reduction. The mean, after all, provides a very important descriptor of the data, for example, in methods such as Bayesian classification under Gaussianity assumptions, Fisher discriminant analysis, or k-means clustering [1]–[3].

The key points and contributions of this paper are as follows.
1) A new dimensionality reduction method is derived that maximally preserves the squared-Euclidean length, and implicitly also the direction, of the mean vector of the nonnegative data.
2) The solution is given by certain eigenvalues and eigenvectors of the resulting nonnegative inner-product matrix of the uncentered data, referred to as mean vector values, which do not necessarily correspond to the top eigenvalues. The resulting method is fittingly called mean vector component analysis (MVCA).
3) An analysis is carried out showing that MVCA preserves the 1-norm of the nonnegative inner-product matrix and thereby naturally possesses the beneficial property of capturing the clustering structure in the data.
4) Because of the close connection between MVCA and uncentered PCA, the proposed method sheds new light


on the issue of centering in PCA, and the experiments and analysis are therefore primarily focused on highlighting differences between MVCA and PCA.

As mentioned, MVCA is closely linked to uncentered PCA (Section II). It is interesting in this context to note that in a paper using PCA for face recognition [23], based on a nonnegative inner-product matrix, it was observed that eigenvectors other than the top ones provided discriminative power. In the following, several experiments are conducted which show that MVCA produces different visualizations and sometimes considerably improved clustering results compared with PCA.

This paper is organized as follows. Section II reviews the properties of PCA. MVCA is then introduced in Section III. Section IV provides an analysis of MVCA for nonnegative data, and Section V discusses related work. Section VI describes MVCA for visualization, Section VII describes clustering results using MVCA, and Section VIII concludes the paper.

II. PRINCIPAL COMPONENT ANALYSIS

Let the data points $x_i \in \mathbb{R}^d$, $i = 1, \ldots, N$, be represented by a real matrix $X \in \mathbb{R}^{d \times N}$. It is well known that the inner-product matrix $X^\top X$ and the outer-product matrix $X X^\top$, which is the correlation matrix up to a constant $N$, share $r$ nonzero eigenvalues $\lambda_i$, $i = 1, \ldots, r$, where $r = \min(d, N)$ is the shared rank of the product matrices. Let $v_i$ be an eigenvector of $X^\top X$ and $u_i$ an eigenvector of $X X^\top$, both associated with the eigenvalue $\lambda_i$ ($u_i$ and $v_i$ are column vectors). These eigenvectors are related by $u_i = \lambda_i^{-1/2} X v_i$. See [24] for more details.

In PCA, the vectors $u_i$, $i = 1, \ldots, \hat{r}$, where $\hat{r} < r$, corresponding to the top eigenvalues $\lambda_i$, $i = 1, \ldots, \hat{r}$, provide an interesting basis for representing the data $X$. The projection onto $u_i$ takes the following form:

$$\mathrm{proj}_{u_i} X = u_i^\top X = \sqrt{\lambda_i}\, v_i^\top. \qquad (1)$$

The correlation matrix of the data projected onto the space spanned by these top $\hat{r}$ basis vectors will be diagonal, with $\lambda_i$, $i = 1, \ldots, \hat{r}$, on the diagonal. The principal components associated with the $i$th basis vector $u_i$ are therefore $\sqrt{\lambda_i}\, v_i^\top$.

A. Centered Versus Uncentered PCA

Most expositions of PCA assume that a (row) centering of the data is performed, such that the data set has zero mean vector $m$, i.e., $m = X 1_m = 0$, where $1_m = \frac{1}{N}\mathbf{1}$, $\mathbf{1}$ is the $(N \times 1)$ vector of ones, and $\mathbf{0}$ is the $(d \times 1)$ zero vector. The centering is convenient, as the correlation matrix then equals the covariance matrix, in which case the variance of the projected data will be $\sum_{i=1}^{\hat{r}} \lambda_i$. Hence, the variance of the original data is maximally retained when reducing the dimension according to the top eigenvalues. In addition, reconstruction of the original data can be guaranteed to be optimal in a squared error sense. See also [25].
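The relationship between the two eigendecompositions, and the projection in (1), can be verified numerically (a minimal NumPy sketch; the random data and variable names are illustrative only):

```python
import numpy as np

rng = np.random.default_rng(1)
d, N = 5, 40
X = rng.random((d, N))                      # nonnegative data, columns are samples

# Eigendecompositions of the inner-product and outer-product matrices.
lam_v, V = np.linalg.eigh(X.T @ X)          # (N x N), ascending eigenvalues
lam_u, U = np.linalg.eigh(X @ X.T)          # (d x d)
lam_v, V = lam_v[::-1], V[:, ::-1]          # sort in decreasing order
lam_u, U = lam_u[::-1], U[:, ::-1]

# Check u_i = lambda_i^{-1/2} X v_i for a shared nonzero eigenvalue (up to sign).
i = 0
u_from_v = X @ V[:, i] / np.sqrt(lam_v[i])
assert np.allclose(np.abs(u_from_v), np.abs(U[:, i]), atol=1e-8)

# Projection onto u_i: proj = u_i^T X = sqrt(lambda_i) v_i^T, as in eq. (1).
assert np.allclose(np.abs(U[:, i] @ X), np.sqrt(lam_v[i]) * np.abs(V[:, i]), atol=1e-8)
```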


Fig. 1. Illustration of centered and uncentered PCA. The principal axes are indicated at (0, 0) for centered PCA (dashed lines) and uncentered PCA (solid lines). The upper embedded plot shows the uncentered PCs fitted by Gaussians, and the lower embedded plot shows the centered PCs fitted by Gaussians.

As explained in [26], uncentered PCA has, however, been used in many application areas, such as climatology [27], astronomy [28], the study of microarrays [29], neuroimaging data [30], ecology [31], chemistry [32], and geology [33]. See also [24]. When using uncentered PCA, correlations (as opposed to central moments) are the key quantities. It is not difficult to provide examples where centered PCA and uncentered PCA behave quite differently. Fig. 1 shows a toy data set consisting of two groups displayed with different symbols for clarity. The dashed coordinate system placed at (0, 0) shows the centered principal axes. The thicker line width corresponds to the largest eigenvalue. As expected, the dominant direction for centered PCA follows the maximum variance direction in the data. The solid lines at (0, 0) show the uncentered PCA directions. The dominant direction is aligned quite closely with the mean vector of the data, which is indicated by the dashed line across the plot. In this particular example, uncentered PCA provides principal components in which the two groups are quite well separated in both directions. This is shown by the upper embedded plot, which shows the uncentered PCs fitted by Gaussians. The lower embedded plot shows the centered PCs fitted by Gaussians. In both cases, the dominant direction corresponds to the horizontal axis. For centered PCA, the two groups are only separated in the dominant direction. The centered dominant principal axis is nearly orthogonal to the uncentered dominant principal axis. See also the insightful analysis provided in [26], exploring the relationships between the two variants of PCA.
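The qualitative contrast illustrated in Fig. 1 is easy to reproduce (a sketch with synthetic two-cluster data in the nonnegative quadrant; the cluster parameters are arbitrary and not those used for the figure):

```python
import numpy as np

rng = np.random.default_rng(2)
# Two clusters in the nonnegative quadrant, roughly as in Fig. 1 (arbitrary parameters).
X = np.vstack([rng.normal([2, 8], 0.7, (100, 2)),
               rng.normal([8, 2], 0.7, (100, 2))]).T        # shape (d=2, N=200)

def top_axis(M):
    """Return the eigenvector of M with the largest eigenvalue."""
    w, U = np.linalg.eigh(M)
    return U[:, -1]

m = X.mean(axis=1, keepdims=True)
centered_axis = top_axis((X - m) @ (X - m).T)   # dominant centered PCA axis
uncentered_axis = top_axis(X @ X.T)             # dominant uncentered PCA axis

# The uncentered dominant axis is nearly aligned with the mean vector,
# while the centered dominant axis is nearly orthogonal to it.
print(np.abs(uncentered_axis @ m.ravel()) / np.linalg.norm(m))
print(np.abs(centered_axis @ m.ravel()) / np.linalg.norm(m))
```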

B. PCA Preserving 2-Norm of Inner-Product Matrix

Whether or not the data is centered, PCA is closely linked to the 2-norm, or Frobenius (Hilbert–Schmidt) norm, of the inner-product matrix $X^\top X$. In general, this matrix may take both positive and negative values. If the data is centered, the resulting matrix will necessarily take negative values. Expressed as an $(N^2 \times 1)$-dimensional vector with elements $[X^\top X]_{kl}$ for $k, l = 1, \ldots, N$, the squared 2-norm of $X^\top X$, normalized by the number of elements, is

$$\frac{1}{N^2} \left\| X^\top X \right\|_2^2 = \frac{1}{N^2} \sum_{k=1}^{N} \sum_{l=1}^{N} \left[ X^\top X \right]_{kl}^2 = \frac{1}{N^2} \sum_{i=1}^{r} \lambda_i^2. \qquad (2)$$

This result is obtained because $X^\top X$ is a symmetric matrix with orthonormal eigenvectors. It shows that PCA, based on the $\hat{r}$ top eigenvalues of $X^\top X$, best approximates the 2-norm of $X^\top X$.
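The identity in (2) can be checked numerically (a short sketch; any data matrix will do):

```python
import numpy as np

rng = np.random.default_rng(3)
X = rng.random((5, 40))                      # (d x N) nonnegative data
G = X.T @ X                                  # inner-product matrix
lam = np.linalg.eigvalsh(G)                  # eigenvalues of G

N = X.shape[1]
lhs = (G ** 2).sum() / N ** 2                # (1/N^2) * squared Frobenius norm
rhs = (lam ** 2).sum() / N ** 2              # (1/N^2) * sum of squared eigenvalues
assert np.isclose(lhs, rhs)
```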

III. MEAN VECTOR COMPONENT ANALYSIS

Unlike centered PCA, where the mean of the data plays no role, the approach taken in this paper is completely different. A new dimensionality reduction method is derived which maximally preserves the squared-Euclidean length, and implicitly also the direction, of the mean vector of nonnegative data. As mentioned, the mean vector is after all a very important descriptor of the data, at the center of a variety of supervised and unsupervised data analysis methods; see [1]–[3] for examples. The approach in the following is to construct an orthogonal coordinate system, represented by a basis, that is aligned as much as possible with the mean of the input data. The nonnegative data matrix $X$ is therefore not centered in the derivations of this section. The key quantity is the squared length of the mean vector of the data, i.e., $\|m\|^2$:

$$\|m\|^2 = (X 1_m)^\top X 1_m = 1_m^\top X^\top X 1_m. \qquad (3)$$

Thus, the squared length of the mean vector is directly connected to the inner-product matrix. This enables an analysis of $\|m\|^2$ in terms of uncentered PCA by expressing $X^\top X$ in terms of its eigenvalues and eigenvectors, $X^\top X = V \Lambda V^\top$, where the eigenvectors are represented by $V = [v_1, \ldots, v_N]$ and the eigenvalues by $\Lambda = \mathrm{diag}(\lambda_1, \ldots, \lambda_N)$ in decreasing order. Now, the following is obtained:

$$\|m\|^2 = \left( \Lambda^{\frac{1}{2}} V^\top 1_m \right)^\top \Lambda^{\frac{1}{2}} V^\top 1_m = \sum_{i=1}^{r} \left( \sqrt{\lambda_i}\, v_i^\top 1_m \right)^2 \qquad (4)$$

where $\tilde{m} = \Lambda^{\frac{1}{2}} V^\top 1_m$ is actually the mean of the data when projected onto all uncentered PCA axes, so that $\|m\|^2 = \|\tilde{m}\|^2$.

Equation (4) has several implications. As explained in (1), the projection of $X$ onto the $i$th uncentered PCA axis is given by $\sqrt{\lambda_i}\, v_i^\top$, with mean $\sqrt{\lambda_i}\, v_i^\top 1_m$. The overall squared length of $m$ is therefore composed of a sum over the squared means of all uncentered PCs.

The coefficients $(\sqrt{\lambda_i}\, v_i^\top 1_m)^2$ are referred to as mean vector values. Let the set $M$ contain the indexes of the eigenvalues corresponding to the $\hat{r}$ top mean vector values. The idea in MVCA is to create the data set $\tilde{X}_M = \Lambda_M^{\frac{1}{2}} V_M^\top$, maximally preserving the length $\|m\|^2$, since $\|\tilde{m}_M\|^2 = \sum_{i \in M} (\sqrt{\lambda_i}\, v_i^\top 1_m)^2$. Hence, $\tilde{X}_M$ is obtained by projecting $X$ onto the subspace spanned by the uncentered PCA axes $u_i$ corresponding to the top $\hat{r}$ mean vector values, where $\mathrm{proj}_{u_i} X = \sqrt{\lambda_i}\, v_i^\top$ are the mean vector components.

Thus, in general, MVCA is different from dimensionality reduction based on uncentered PCA, which is performed solely on the basis of the size of the eigenvalues of the inner-product matrix $X^\top X$.


Algorithm 1 MVCA for dimensionality reduction. Assumed input is the $(d \times N)$ data matrix $X$ and the desired dimensionality $\hat{r} \le d$, where $d$ is the dimensionality of the input data and $N$ is the number of samples.
1: Compute $X^\top X$ and find its eigenvalues and eigenvectors $\lambda_i, v_i$, $i = 1, \ldots, r$.
2: Compute the mean vector values $(\sqrt{\lambda_i}\, v_i^\top 1_m)^2$ and sort them in decreasing order.
3: Project the data $X$ onto the subspace spanned by the uncentered PCA axes $u_i$ corresponding to the top $\hat{r}$ mean vector values, where $\mathrm{proj}_{u_i} X = \sqrt{\lambda_i}\, v_i^\top$ are the $i$th mean vector components.
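A minimal NumPy sketch of Algorithm 1 follows (an illustrative implementation, not the author's code; the function name mvca and its interface are choices made here):

```python
import numpy as np

def mvca(X, r_hat):
    """Mean vector component analysis, following Algorithm 1.

    X     : (d, N) nonnegative data matrix, columns are samples.
    r_hat : desired output dimensionality.
    Returns the (r_hat, N) mean vector components and the chosen index set M.
    """
    N = X.shape[1]
    ones_m = np.full(N, 1.0 / N)                 # 1_m = (1/N) * ones

    lam, V = np.linalg.eigh(X.T @ X)             # eigendecomposition of X^T X
    lam, V = lam[::-1], V[:, ::-1]               # decreasing eigenvalues
    lam = np.clip(lam, 0.0, None)                # guard against tiny negative values

    mvv = (np.sqrt(lam) * (V.T @ ones_m)) ** 2   # mean vector values
    M = np.argsort(mvv)[::-1][:r_hat]            # indexes of the top mean vector values

    components = np.sqrt(lam[M])[:, None] * V[:, M].T   # proj_{u_i} X = sqrt(lam_i) v_i^T
    return components, M
```

As noted in Section V, the eigenvalues and eigenvectors can equivalently be obtained from the $(d \times d)$ matrix $X X^\top$, using the equivalent mean vector values $(u_i^\top m)^2$ discussed below, which may be cheaper when $d$ is much smaller than $N$.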

In MVCA, a large eigenvalue $\lambda_i$ does not guarantee that the data projected onto the corresponding uncentered PCA axis contributes to the value of $\|m\|^2$. This is because the structure of the eigenvector $v_i$ must also be considered. The result is that centered/uncentered PCA and MVCA may in general produce quite different results. In addition, as $v_i^\top = \lambda_i^{-1/2} u_i^\top X$, the following expression is obtained:

$$\|m\|^2 = \sum_{i=1}^{r} \left( u_i^\top m \right)^2 = \|m\|^2 \sum_{i=1}^{r} \cos^2(u_i, m) \qquad (5)$$

where $\sum_{i=1}^{r} \cos^2(u_i, m) = 1$. This result shows that the mean of the data $X$ as projected onto $u_i$ is given by $\sqrt{\lambda_i}\, v_i^\top 1_m = u_i^\top m$. Therefore, the contribution of an uncentered PCA axis $u_i$ to $\|m\|^2$ is determined by the cosine of the angle between $u_i$ and $m$, in the sense that the $u_i$'s most aligned with $m$ contribute the most. By necessity, $\sum_{i=1}^{r} \cos^2(u_i, m) = 1$. This result adds to the insight provided in [26].

MVCA is hence a new approach to uncentered PCA, not necessarily selecting a basis corresponding to the top eigenvalues of $X^\top X$, but instead selecting the basis contributing the most to the mean vector of the data, i.e., corresponding to the top mean vector values $(\sqrt{\lambda_i}\, v_i^\top 1_m)^2 = (u_i^\top m)^2$. The MVCA algorithm is summarized in Algorithm 1. The complexity of MVCA is the same as for PCA for each eigenvector calculation, however without the need for centering. For MVCA, leading mean vector values may correspond to smaller eigenvalues, so several eigenvectors may need to be calculated. Very small eigenvalues will correspond to very small mean vector values, which may be used as a stopping criterion for how many eigenvectors to compute.
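A quick numerical check of the decomposition in (5) (a sketch; the data is random and illustrative):

```python
import numpy as np

rng = np.random.default_rng(4)
X = rng.random((6, 30))                       # (d x N) nonnegative data
m = X.mean(axis=1)                            # mean vector

_, U = np.linalg.eigh(X @ X.T)                # uncentered PCA axes u_i
contrib = (U.T @ m) ** 2                      # (u_i^T m)^2, the mean vector values

assert np.isclose(contrib.sum(), m @ m)                   # eq. (5), first equality
assert np.isclose((contrib / (m @ m)).sum(), 1.0)         # the cos^2 terms sum to 1
```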

IV. MVCA, 1-NORM AND CLUSTERING STRUCTURE

Focusing only on nonnegative data enables an interpretation of MVCA in terms of the preservation of an element-wise matrix 1-norm, and provides tools in terms of Perron–Frobenius theory for analyzing the ability of MVCA to capture the clustering structure in the data.

A. MVCA Preserving 1-Norm of Inner-Product Matrix

In Section II-B, it is shown that PCA preserves the 2-norm of the inner-product matrix (centered or uncentered). Under

the nonnegativity assumption, it is straightforward to show that MVCA preserves the 1-norm of the inner-product matrix, as

$$\frac{1}{N^2} \left\| X^\top X \right\|_1 = \frac{1}{N^2} \sum_{k=1}^{N} \sum_{l=1}^{N} \left[ X^\top X \right]_{kl} = \sum_{i=1}^{r} \left( \sqrt{\lambda_i}\, v_i^\top 1_m \right)^2 \qquad (6)$$

because, due to nonnegativity, $\frac{1}{N^2}\|X^\top X\|_1 = 1_m^\top X^\top X 1_m$, which equals $\|m\|^2$, and both the eigenvalues and the eigenvectors of $X^\top X$ appear in the 1-norm expression. Hence, performing MVCA actually preserves the 1-norm of the nonnegative $X^\top X$. In addition, even if some eigenvalue $\lambda_i$ takes a large value, hence contributing significantly to the 2-norm of $X^\top X$, it may not necessarily contribute much to the 1-norm, if $v_i^\top 1_m$ takes a small value.

B. MVCA Preserving Clustering Structure

MVCA is based on the spectral properties of a nonnegative inner-product matrix, as the method is derived for nonnegative data. This enables the Perron–Frobenius theory of nonnegative matrices to be invoked to investigate more closely the clustering properties of MVCA.

Theorem 1 (Perron–Frobenius): Let $M$ be a real nonnegative matrix that is irreducible (the associated graph is strongly connected). Then, there exists a largest eigenvalue $\lambda$ (the Perron root) whose associated eigenvector $v$ contains only nonnegative (or nonpositive) elements. All other eigenvectors have both positive and negative elements. For a proof, see [34], [35].

Now, consider the following idealized situation. Assume the inner-product matrix of the data is block diagonal, and that each block is nonnegative. Hence, there exists an extreme cluster structure in the data with respect to inner products, where each cluster is distributed more or less along an axis of the $d$-dimensional input space, and no two or more clusters are distributed along the same axis. For simplicity, assume

$$X^\top X = \begin{bmatrix} 1_{N_1 \times N_1} & 0_{N_1 \times N_2} \\ 0_{N_2 \times N_1} & 1_{N_2 \times N_2} \end{bmatrix} \qquad (7)$$

where $1_{M \times M'}$ ($0_{M \times M'}$) is the $(M \times M')$ all-ones (zero) matrix. As the blocks are rank one, there will be only one nonzero eigenvalue associated with each block, and the eigenvalues are given by $\Lambda = \mathrm{diag}(N_1, N_2)$. The orthonormal eigenvectors of $X^\top X$ are given by the union of the eigenvectors of the blocks

$$V = \begin{bmatrix} \frac{1}{\sqrt{N_1}} 1_{N_1} & 0_{N_1} \\ 0_{N_2} & \frac{1}{\sqrt{N_2}} 1_{N_2} \end{bmatrix} \qquad (8)$$

where the subscripts denote the lengths of the vectors. In this idealized situation, the mean vector components are $\sqrt{\lambda_1}\, v_1^\top = [1_{N_1}^\top\ 0_{N_2}^\top]$ and $\sqrt{\lambda_2}\, v_2^\top = [0_{N_1}^\top\ 1_{N_2}^\top]$. Hence, the mean vector components are perfect cluster indicators. See also [15], [36] for related analysis with respect to different data matrices.
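The idealized block-diagonal example, and the 1-norm identity in (6), can be verified directly (a sketch; the block sizes follow the paper's choice of N1 = 20 and N2 = 5):

```python
import numpy as np

N1, N2 = 20, 5
N = N1 + N2

# Idealized block-diagonal inner-product matrix, as in eq. (7).
G = np.zeros((N, N))
G[:N1, :N1] = 1.0
G[N1:, N1:] = 1.0

lam, V = np.linalg.eigh(G)
lam, V = lam[::-1], V[:, ::-1]
ones_m = np.full(N, 1.0 / N)

# Mean vector values; only the two cluster-indicator eigenvectors contribute.
mvv = (np.sqrt(np.clip(lam, 0, None)) * (V.T @ ones_m)) ** 2
print(np.argsort(mvv)[::-1][:2])          # indexes of the two top mean vector values

# 1-norm identity (6): (1/N^2) times the sum of entries equals the sum of mean vector values.
assert np.isclose(G.sum() / N ** 2, mvv.sum())
```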


Fig. 2. (a) Near block-diagonal inner-product matrix. Gaussian random noise is added to the (20 × 20) block. (b) Normalized eigenvalues (solid lines) and normalized mean vector values (bars). (c) Plot of the first four eigenvectors. The solid (blue) line with nonzero values over the first 20 elements is $v_1$. The solid (red) line with nonzero values over the last five elements is $v_4$. These correspond to the two top mean vector values. (d) Scatter plot of the two top mean vector components $\sqrt{\lambda_1} v_1$ and $\sqrt{\lambda_4} v_4$. The angular structure is due to the cluster indicator nature of the eigenvectors picked by MVCA.

Assume for simplicity that $N_1 = 20$ and $N_2 = 5$, and that Gaussian random noise is added to the $(20 \times 20)$ block (still keeping it nonnegative), as shown in Fig. 2(a). Now, the picture changes. The normalized eigenvalues are shown in Fig. 2(b) as solid lines. There are now more than two nonzero eigenvalues, and the three largest eigenvalues correspond to the noisy (20 × 20) block. Fig. 2(c) shows the eigenvectors corresponding to the three largest eigenvalues. The solid (blue) line shows the top eigenvector. It is more or less constant, equal to $1/\sqrt{20}$, over the 20 elements corresponding to the noisy block, and zero for the elements corresponding to the (5 × 5) block. This is a cluster indicator vector. Eigenvectors two and three are, however, both positive and negative over the elements of the noisy block, and are not cluster indicators (shown as dashed and dash-dotted curves). Because of the Perron–Frobenius theory, it is, however, known that there should exist one more nonnegative eigenvector corresponding to the (5 × 5) block. Fig. 2(b) shows the normalized mean vector values corresponding to the eigenvalues as (red) bars. In addition to $\lambda_1$, the eigenvalue $\lambda_4$ corresponds to a large mean vector value. Indeed, the reason $\lambda_4$ corresponds to a large mean vector value is that the eigenvector $v_4$ is the nonnegative cluster indicator vector with nonzero values corresponding to the (5 × 5) block, shown in Fig. 2(c) as the solid (red) line.

It is clear that the block-structured eigenvectors will contribute to the mean vector of the data (the 1-norm of $X^\top X$), as the associated mean vector values $(\sqrt{\lambda_i}\, v_i^\top 1_m)^2$ will deviate more from zero compared with eigenvectors having both positive and negative values over a block. Hence, the eigenvalues and eigenvectors picked by MVCA will capture possible cluster structure in the data.

Fig. 2(d) shows a scatter plot of the two largest mean vector components, i.e., $\sqrt{\lambda_1}\, v_1$ versus $\sqrt{\lambda_4}\, v_4$ (M = {1, 4}). Note how the (5 × 5) block corresponds to the data points embedded


in approximately $[0\ 1]$, while the (20 × 20) block corresponds to the data points scattered around $[1\ 0]$. Even for inner-product matrices deviating significantly from block diagonal, this kind of angular structure in the scatter plot of the top mean vector components will appear if the eigenvectors picked by MVCA are (close to) cluster indicator vectors.

V. RELATED WORK

This paper is inspired especially by kernel entropy component analysis (kernel ECA) [36]. This is perhaps surprising, as kernel ECA and MVCA are developed from entirely different starting points, and are based on completely different matrices. In this paper, it is, however, shown that kernel ECA may be interpreted as MVCA in a kernel feature space, given a positive semi-definite (psd) kernel function.

In [36], it was shown that the quantity $V(p) = \int p^2(x)\,dx$, closely related to the Renyi entropy of the density $p(x)$, may be estimated as $\hat{V}(p) = 1_m^\top K 1_m = \sum_{i=1}^{N} (\sqrt{\delta_i}\, e_i^\top 1_m)^2$, using Parzen windowing based on a data set $x_1, \ldots, x_N$ and a psd kernel. Here, the uncentered (nonnegative) inner-product, or kernel, matrix $K$ is such that $K_{ij} = \langle \phi(x_i), \phi(x_j) \rangle$, where $\phi(\cdot)$ maps the input data to a kernel feature space. The eigenvalues and eigenvectors of $K$ are given by $\delta_i, e_i$, $i = 1, \ldots, N$. In kernel ECA, a data transformation is performed to maximally preserve the entropy of the input space data set, by projecting the kernel feature space data onto uncentered kernel PCA axes $\sqrt{\delta_i}\, e_i^\top$ corresponding to the top entropy values $(\sqrt{\delta_i}\, e_i^\top 1_m)^2$. The top entropy values do not in general correspond to the top eigenvalues of $K$. In passing, it was noted in [36] that the entropy estimate corresponds to $\hat{V}(p) = \|m\|^2$, where $m = \frac{1}{N} \sum_{l=1}^{N} \phi(x_l)$ is the mean vector of the data in kernel feature space. In [36], the importance of the mean vector was not realized because the emphasis was on entropy. In the context of this paper, however, it becomes clear that the preservation of entropy actually corresponds to MVCA in kernel feature space.

The novelty of this paper is to develop MVCA based on the input space nonnegative data set $X$. Hence, there is no need to select any kernel parameter, which is known to significantly influence the results, and which hampers the use of kernel methods in real applications. In addition, MVCA in the input space enables an analysis in terms of the $(d \times d)$ correlation matrix, as its eigenvalues and eigenvectors are equivalent to those of the inner-product matrix $X^\top X$. Working in a kernel feature space, the correlation matrix is not available, and one is instead forced to compute eigenvectors of the typically much larger $(N \times N)$ matrix $K$. The nonlinear capabilities of kernel methods are, however, lost in MVCA.

Another work that is to some extent related to MVCA is the so-called data spectroscopy method [37]. Using convolution operator theory, [37] showed that for a nonnegative matrix arising from convolution involving kernel functions, certain eigenvectors will be nonnegative over the support of clusters, or mixture densities, comprising the data set. This is closely related to the Perron–Frobenius theory of nonnegative matrices invoked here to analyze the clustering properties of MVCA for nonnegative inner-product matrices, although Perron–Frobenius theory was not mentioned in [37].
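Referring back to the kernel ECA connection above, the entropy values of [36] are simply MVCA's mean vector values computed on a kernel matrix (a sketch assuming a Gaussian kernel; the kernel width sigma is an arbitrary placeholder, not a recommendation from the paper):

```python
import numpy as np

def entropy_values(X_rows, sigma=1.0):
    """Kernel ECA 'entropy values': MVCA mean vector values of a Gaussian kernel matrix.

    X_rows : (N, d) data, one sample per row.
    sigma  : Gaussian kernel width (placeholder value).
    """
    sq_dists = ((X_rows[:, None, :] - X_rows[None, :, :]) ** 2).sum(-1)
    K = np.exp(-sq_dists / (2 * sigma ** 2))          # uncentered psd kernel matrix
    delta, E = np.linalg.eigh(K)                      # eigenvalues/eigenvectors of K
    ones_m = np.full(len(X_rows), 1.0 / len(X_rows))  # 1_m
    return (np.sqrt(np.clip(delta, 0, None)) * (E.T @ ones_m)) ** 2
```

Selecting kernel PCA axes by the largest of these values, rather than by the largest eigenvalues, corresponds to the kernel ECA transformation described in [36].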


Finally, [38] is also mentioned here, defining so-called eigenclusters as the nonnegative eigenvectors of a graph matrix. These nonnegative eigenvectors do not in general correspond to the top eigenvalues, and are closely related to the eigenvectors used in data spectroscopy.

VI. VISUALIZATION EXPERIMENTS

Visualization of high-dimensional data is very important for gaining an understanding of the underlying structures of the data. In most cases, one is interested in reducing the dimensionality of the data to two dimensions, to enable a visualization by a two-dimensional (2-D) plot. Here, the primary comparison will be between centered PCA and MVCA. For centered PCA, the visualization is by definition based on the two top eigenvalues of the centered inner-product matrix. For MVCA, the visualization is based on the two top mean vector values, corresponding to the uncentered inner-product matrix. Of special interest is the preservation of the clustering structure in the data when comparing MVCA and PCA. The data sets used are images and word-document data, represented by a bag-of-words version of the Newsgroups set¹.

A. Visualization Experiment I

As a first visualization experiment, the 1965 (28 × 20) Frey faces are embedded in a 2-D representation. This is a data set which cannot be divided into explicit classes, as the images represent the same person with varying facial expression, tilt, and so on. Fig. 3(a) and (b) shows the MVCA and the centered PCA visualizations, respectively. Notice how some of the faces are smiling. Of the faces which are not smiling, some are looking straight ahead, while others are looking more to the left or to the right. For PCA, it seems that the method best separates faces looking to the left from those looking to the right. Smiling faces and nonsmiling faces looking straight ahead are located quite close together, as indicated by the circles. MVCA, on the other hand, based on $\sqrt{\lambda_1} v_1$ and $\sqrt{\lambda_5} v_5$ (M = {1, 5}), seems to focus more on separating smiling faces from nonsmiling faces. In terms of facial expression, smiling faces versus nonsmiling faces represent the most dominant cluster structure in the data. The smiling faces are located in the lower part of the plot, whereas the nonsmiling faces are mostly located in the upper part of the plot, as indicated by circles.

For completeness, some alternative dimensionality reduction methods are also illustrated for this data set. Fig. 3(c) and (d) shows the Laplacian eigenmap [16] and the Isomap [19] solutions, using in both cases 8 nearest neighbors to build the neighborhood graph². These methods produce quite different embeddings.

In addition, the important issue of whether or not some outliers will change the visualization result is investigated. To that end, five random MNIST digits (representing 0; see the next section) are added to the Frey faces data set. In this process,

¹Frey faces, MNIST digits, and Newsgroups data obtained from http://cs.nyu.edu/~roweis/data.html.
²Using code obtained at http://www.math.ucla.edu/~wittman/mani/.


Fig. 3. Visualization of Frey faces using several different methods. (a) MVCA visualization based on $\sqrt{\lambda_1} v_1$ and $\sqrt{\lambda_5} v_5$. (b)–(d) Visualizations using PCA, Laplacian eigenmaps, and Isomap, respectively. In (e) and (f), five outliers are inserted into the data set. The outliers are embedded at (0, 0) in (e) and at (−493.2, 2932.2) in (f), and are in both cases outside of the illustrated region.

the dimensionality of the MNIST digits is reduced to that of the Frey faces. On performing MVCA, the optimal basis changes from M = {1, 5} to M = {1, 4}. Interestingly, for the Frey faces, this leads to a visualization which is exactly equal to the previous one, however based on different eigenvectors. The MNIST digits remain outliers in the MVCA representation and are in fact all five embedded at (0, 0). Fig. 3(e) shows the result; the point (0, 0) lies outside the region shown. For PCA, the outliers actually change the visualization, as shown in Fig. 3(f). The top two eigenvectors no longer represent the same embedding for the Frey faces. Also for PCA, the five MNIST images remain outliers, and are embedded at (−493.2, 2932.2), which is outside of the region shown.

B. Visualization Experiment II

The next visualization experiment focuses further on the ability of MVCA to capture the clustering structure. The data set is MNIST digits represented by classes 0 (2500 samples),

1 (500 samples) and 2 (2000 samples). Instead of providing a single 2-D visualization, in this case, MVCA and PCA are chosen to represent the data in three dimensions, and then three visualizations are provided by scatter plots between pairs of dimensions. The results are shown in Fig. 4(a)–(c) for MVCA and (d)–(f) for PCA (best viewed in color). For MVCA, M = {1, 2, 4}. It is evident from all plots that the classes overlap. The most striking difference occurs in the third visualization, i.e., Fig. 4(c) for MVCA compared with Fig. 4(e) for PCA, respectively. MVCA preserves the cluster structure in all three visualizations. For PCA, however, Fig. 4(e) shows the 2 and 1 classes embedded in the middle of the 0 class. In the case that MVCA and PCA are to be used as preprocessing methods, clearly, MVCA produces the most discriminating features for further processing. In this example, the number of samples used to represent the digits from each class influences the visualizations. With an equal number of class-wise samples, MVCA and PCA results tend to appear more similar compared with the case where

Fig. 4. Visualization of MNIST digits. (a)–(c) MVCA. (d)–(f) PCA.

the number of class-wise samples is unbalanced (results not shown). Future work will focus on discovering the underlying mechanisms yielding these results.

C. Visualization Experiment III

In the following, the words in the Newsgroups data are visualized. Even though there are groups of documents in the data set, it is very unclear how the words are distributed across the documents, and it is interesting to investigate whether or not MVCA gives a different result compared with PCA. There are 100 words in the bag. Randomly, 10% of the documents are selected, giving a total of 1625 documents distributed over the four categories comp, rec, sci, and talk. Fig. 5 (best viewed in color) shows both the centered PCA embedding of the 100 words (using Times New Roman font and blue color) and the MVCA embedding (using Palatino font and red color), based on the (1625 × 100) document-word matrix. In this case, MVCA is based on $\sqrt{\lambda_1} v_1$ and $\sqrt{\lambda_4} v_4$, i.e., M = {1, 4}. The visualized words indeed appear quite different. When zooming in on the embeddings (not shown here), one example of the differences between the methods is provided by the word government, which in the centered PCA embedding is close to, e.g., religion and christian. For MVCA, government is closer to technology and science. Another example is the very tight MVCA cluster of the words computer, course, windows, university, and case. In the centered PCA representation, these words are spread out over the plane. A third example is given by the words world and fact. For centered PCA, these two words appear together with question. For MVCA, these two words are close together, but question is much closer to, e.g., email.

To highlight the different visualizations more clearly, the table shown in Fig. 5 lists some words and their five nearest neighbors among the embedded points. For both MVCA and PCA, the pairwise distances between embedded words are quantized into five bins represented by a shade of gray for each table cell. The darker the shade of gray, the closer the neighboring word. The first five words of each table column correspond to MVCA and the remaining five words correspond to PCA. The word course, for example, has different neighbors in the two embeddings, and the distribution of distances to the neighbors differs too.

VII. CLUSTERING EXPERIMENTS

A widely used approach in data analysis is to cluster $\hat{r}$-dimensional PCA data using, for example, k-means [10]. In this section, we illustrate MVCA for clustering of nonnegative data, and compare it primarily with PCA, but also with some other methods. All permutations between cluster labels and the ground truth labels are compared, and we report clustering results as the combination resulting in the smallest error. The number of clusters is assumed to be known.

A. Angular MVCA k-Means Clustering

The MVCA representation of the data is here clustered using an angular k-means clustering algorithm, similar to [36]. The reason for this choice is that the MVCA embedding for nonnegative data, and hence a nonnegative inner-product matrix, often has an angular structure, in the sense that the clusters are distributed more or less in different angular directions with respect to the origin. Fig. 6 shows this by a scatter plot of a centered PCA embedding [Fig. 6(a)] and an MVCA embedding [Fig. 6(b)] using the classes crude and train, extracted from the Reuters bag-of-words data set³. The clusters are marked by different symbols for visualization purposes. The embeddings are different. The MVCA embedding appears ideal for k-means clustering using an angular similarity measure, which is the approach taken, and for which convergence to a local optimum is guaranteed. Clustering of PCA data (and data obtained with other dimensionality reduction techniques) is performed using both Euclidean distance-based k-means and angular k-means. Mean error results are obtained by running the clustering algorithms 100 times. In addition, the best results for each rank, or dimensionality, are reported.
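A sketch of the angular k-means step is given below (an illustrative spherical k-means style implementation, not the author's code; the function and variable names are chosen here):

```python
import numpy as np

def angular_kmeans(Z, k, n_iter=100, seed=0):
    """k-means with an angular (cosine) similarity measure.

    Z : (n_samples, n_dims) embedding, e.g., the MVCA components transposed.
    Returns integer cluster labels.
    """
    rng = np.random.default_rng(seed)
    Zn = Z / (np.linalg.norm(Z, axis=1, keepdims=True) + 1e-12)  # unit-normalize rows
    C = Zn[rng.choice(len(Zn), size=k, replace=False)]           # initial centroids

    for _ in range(n_iter):
        labels = np.argmax(Zn @ C.T, axis=1)                     # assign by largest cosine
        for j in range(k):
            members = Zn[labels == j]
            if len(members):
                c = members.mean(axis=0)
                C[j] = c / (np.linalg.norm(c) + 1e-12)           # renormalized centroid
    return labels
```

The MVCA components from the earlier sketch could be fed to this directly, e.g., as components.T.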

Fig. 5. Visualization of words in the Newsgroups data.

³http://www.zjucadcg.cn/dengcai/Data/TextData.html.

Fig. 6. (a) Centered PCA embedding. (b) MVCA embedding. Note the angular structure of the MVCA embedding, in the sense that clusters are distributed more or less in different angular directions with respect to the origin.

B. Clustering Experiment I

Fig. 7(a) shows the MVCA clustering result (error rate) for a subset of Reuters, as a function of the rank $\hat{r}$. The classes crude, train, and money-fx are used. There are 18933 words in the bag, and a total of 864 documents. The 10 top mean vector values correspond to M = {1, 3, 9, 6, 11, 14, 2, 15, 24, 36}, in that order. Obviously, MVCA has very different preferences when it comes to the eigenvalues and eigenvectors to use, compared with PCA. This is reflected in the clustering results obtained. The PCA clustering result is shown in Fig. 7(b). For a wide range of ranks, MVCA obtains an error rate around 10%, compared with PCA clustering at around 20%. The best MVCA results are as low as 3.9% errors, whereas PCA never achieves less than 20% errors. For completeness, clustering based on Laplacian eigenmaps, Isomap, and multidimensional scaling (MDS) [39] is shown in Fig. 7(c)–(e).


Fig. 7. Clustering error results for Reuters data as a function of dimensionality or rank. (a) MVCA. (b) PCA. (c) Laplacian eigenmaps. (d) Isomap. (e) MDS.

None of these alternative methods performs very well on this data set.

C. Clustering Experiment II

The Newsgroups data is revisited, this time focusing on the (100 × 1625) word-document matrix. MVCA clustering results are shown in Fig. 8(a), and PCA clustering results in Fig. 8(b). On this data set, PCA obtains a mean clustering result comparable with running k-means on the original data (60.3% errors), meaning that no gain is achieved by PCA clustering. MVCA, on the other hand, yields much lower average results, at less than 45% errors. The 10 top mean vector values correspond in this case to the eigenvalues M = {1, 2, 11, 3, 9, 15, 8, 18, 24, 22}, in that order. MVCA also produces good results compared with Isomap and MDS, as shown in Fig. 8(c) and (d) (Laplacian eigenmaps performed no better; results not shown).

D. Clustering Experiment III

In this clustering experiment, MVCA and PCA are used in an attempt to group images corresponding to three different visual concepts. SIFT [40] descriptors, represented by a 1000-dimensional codebook for each visual concept, are downloaded from the ImageNet database (image-net.org) [41]. The visual concepts used are strawberry, lemon, and australian terrier, represented by 1478, 1292, and 1079 images, respectively. The images within each category differ very much, as can be seen, e.g., for australian terrier at image-net.org/synset?wnid=n02096294. A crude approach is taken here. Each image is represented by the overall frequency of codewords present for the SIFT descriptors contained in the image. Hence, each image is represented as a 1000-dimensional vector. The local modeling strength of the SIFT descriptors is lost in this way, and one cannot expect the resulting data set to contain very discriminative features between the concepts. This is confirmed by executing k-means on this data set, which yields an average error rate of 64.5% over 20 iterations.

Fig. 8. Clustering error results for Newsgroups data as a function of dimensionality or rank. (a) MVCA. (b) PCA. (c) Isomap. (d) MDS.

Clustering based on MVCA and PCA is performed 20 times for each rank, and mean error rates are shown for each rank in Fig. 9. MVCA (M = {2, 1, 3, 5, 8, 4, 9, 6, 15, 10} for the first 10 components) is better able to extract the clustering structure in this case as well, obtaining an overall mean error over all ranks of 60.2%. Considering the very inhomogeneous data set, this is a considerable improvement. Upon closer inspection of the results (not shown here), it is clear that all methods have huge problems detecting the australian terrier group, while MVCA better discriminates between strawberry and lemon.

E. Fisher Discriminant Ratio and Correlation

As a final experiment quantifying the ability of MVCA to capture the clustering or class structure in a data set, the voting data set obtained from the UCI repository is studied [42]⁴. The two classes in the data are represented by U.S. House of Representatives Democrats and Republicans, respectively, with 232 congressmen's votes on 16 issues (only representatives who voted on all 16 issues are retained from the original data set). For MVCA, the first 10 elements in the set are M = {1, 2, 11, 3, 9, 15, 8, 18, 24, 22}. For each dimension, or rank, the Fisher discriminant ratio (FDR) is computed between the two classes (assumed known). Fig. 10 shows the result.

⁴http://archive.ics.uci.edu/ml

Fig. 9. Clustering results for ImageNet data as a function of rank.

The solid (blue) line corresponds to the MVCA embedding, and the dashed (red) line corresponds to PCA. For all dimensions up to 16, the MVCA FDR is greater than the PCA FDR. This illustrates that MVCA is capable of capturing the clustering or class structure of the data to a greater extent than PCA. The correlation between the different dimensions, or channels, of the MVCA and PCA representations is also shown, as the dotted solid line (MVCA, in blue) and the dotted dashed line (PCA, in red), respectively. The correlation in the MVCA representation is less than that of PCA.
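For reference, a per-dimension two-class FDR can be computed as follows (a sketch; the exact FDR variant used in the paper is not specified, so this standard form is an assumption):

```python
import numpy as np

def fdr_per_dimension(Z, y):
    """Two-class Fisher discriminant ratio for each dimension of an embedding.

    Z : (n_samples, n_dims) embedding (e.g., MVCA or PCA components).
    y : binary class labels (0 or 1).
    """
    A, B = Z[y == 0], Z[y == 1]
    return (A.mean(0) - B.mean(0)) ** 2 / (A.var(0) + B.var(0) + 1e-12)
```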

Fig. 10. Fisher discriminant ratio and correlation for the voting data set.

VIII. CONCLUSION

This paper presented MVCA as a new dimensionality reduction method for visualization and clustering of nonnegative data. MVCA preserves the squared-Euclidean length of the mean vector of the input data by projections onto certain uncentered PCA axes that do not necessarily correspond to the top eigenvalues of the nonnegative inner-product matrix, in contrast to traditional spectral methods such as PCA. Focusing only on nonnegative data enabled a discussion in terms of Perron–Frobenius theory, showing that MVCA picks eigenvectors that capture the cluster structure, as opposed to PCA, which is tuned to the variance of the data. Hence, if it is important to the user to preserve clustering structure, MVCA should be considered as an alternative to PCA.

Through a series of experiments on nonnegative data, it was shown that MVCA may produce different results than PCA, and, especially for clustering, considerably improved results. For completeness, it was also shown that MVCA performed well in the case of clustering compared with some alternative and well-known dimensionality reduction methods. MVCA did not always provide very different results compared with PCA, and a task for future work is therefore to analyze further the conditions under which MVCA gives different results than PCA. Finally, the use of MVCA in applications where PCA is traditionally used, such as denoising or anomaly detection, should be investigated.

ACKNOWLEDGMENT

The author would like to thank Dr. S. O. Skrøvseth at the Norwegian Center for Integrated Care and Telemedicine, University Hospital of North Norway, for insightful comments on an earlier version of this manuscript. He would also like to thank the three anonymous reviewers and the Associate Editor for their comments and suggestions.

REFERENCES

[1] S. Theodoridis and K. Koutroumbas, Pattern Recognition, 4th ed. San Diego, CA, USA: Academic, 2009.
[2] R. O. Duda, P. E. Hart, and D. G. Stork, Pattern Classification, 2nd ed. New York, NY, USA: Wiley, 2001.


[3] C. M. Bishop, Pattern Recognition and Machine Learning. New York, NY, USA: Springer-Verlag, 2006.
[4] K. Pearson, "On lines and planes of closest fit to systems of points in space," Phil. Mag., vol. 2, no. 6, pp. 559–572, 1901.
[5] H. Hotelling, "Analysis of a complex of statistical variables into principal components," J. Educ. Psychol., vol. 24, no. 1, pp. 417–441, 1933.
[6] I. T. Jolliffe, Principal Component Analysis, 2nd ed. New York, NY, USA: Springer-Verlag, 2002.
[7] E. Oja, "Simplified neuron model as a principal component analyzer," J. Math. Biol., vol. 15, no. 3, pp. 267–273, 1982.
[8] K. I. Diamantaras and S. Y. Kung, Principal Component Neural Networks. New York, NY, USA: Wiley, 1996.
[9] A. K. Jain, R. P. W. Duin, and J. Mao, "Statistical pattern recognition: A review," IEEE Trans. Pattern Anal. Mach. Intell., vol. 22, no. 1, pp. 4–37, Jan. 2000.
[10] J. MacQueen, "Some methods for classification and analysis of multivariate observations," in Proc. Berkeley Symp. Math. Stat. Probab., 1967, pp. 281–297.
[11] B. Schölkopf, A. J. Smola, and K.-R. Müller, "Nonlinear component analysis as a kernel eigenvalue problem," Neural Comput., vol. 10, no. 5, pp. 1299–1319, Jul. 1998.
[12] S. Moon and H. Qi, "Hybrid dimensionality reduction method based on support vector machine and independent component analysis," IEEE Trans. Neural Netw. Learn. Syst., vol. 23, no. 5, pp. 749–761, May 2012.
[13] X. Kong, C. Hu, H. Ma, and C. Han, "A unified self-stabilizing neural network algorithm for principal and minor components extraction," IEEE Trans. Neural Netw. Learn. Syst., vol. 23, no. 2, pp. 185–198, Feb. 2012.
[14] S. Liwicki, S. Zafeiriou, G. Tzimiropoulos, and M. Pantic, "Efficient online subspace learning with an indefinite kernel for visual tracking and recognition," IEEE Trans. Neural Netw. Learn. Syst., vol. 23, no. 10, pp. 1624–1636, Oct. 2012.
[15] A. Y. Ng, M. Jordan, and Y. Weiss, "On spectral clustering: Analysis and an algorithm," in Advances in Neural Information Processing Systems. Cambridge, MA, USA: MIT Press, 2002, pp. 849–856.
[16] M. Belkin and P. Niyogi, "Laplacian eigenmaps for dimensionality reduction and data representation," Neural Comput., vol. 15, no. 6, pp. 1373–1396, Jun. 2003.
[17] J. Shi and J. Malik, "Normalized cuts and image segmentation," IEEE Trans. Pattern Anal. Mach. Intell., vol. 22, no. 8, pp. 888–905, Aug. 2000.
[18] S. Roweis and L. Saul, "Nonlinear dimensionality reduction by locally linear embedding," Science, vol. 290, no. 5500, pp. 2323–2326, Dec. 2000.
[19] J. Tenenbaum, V. de Silva, and J. C. Langford, "A global geometric framework for nonlinear dimensionality reduction," Science, vol. 290, no. 5500, pp. 2319–2323, Dec. 2000.
[20] K. Q. Weinberger and L. K. Saul, "Unsupervised learning of image manifolds by semidefinite programming," Int. J. Comput. Vis., vol. 70, no. 1, pp. 77–90, 2006.
[21] L. K. Saul, K. Q. Weinberger, J. H. Ham, F. Sha, and D. D. Lee, "Spectral methods for dimensionality reduction," in Semisupervised Learning, O. Chapelle, B. Schölkopf, and A. Zien, Eds. Cambridge, MA, USA: MIT Press, 2005, ch. 1.
[22] C. J. C. Burges, "Geometric methods for feature extraction and dimensional reduction," in Data Mining and Knowledge Discovery Handbook: A Complete Guide for Researchers and Practitioners, O. Maimon and L. Rokach, Eds. Norwell, MA, USA: Kluwer, 2005, ch. 4.
[23] K. Balci and V. Atalay, "PCA for gender estimation: Which eigenvectors contribute?" in Proc. Int. Conf. Pattern Recognit., 2002, pp. 363–366.
[24] J. Shawe-Taylor and N. Cristianini, Kernel Methods for Pattern Analysis. Cambridge, U.K.: Cambridge Univ. Press, 2004.
[25] A. A. Miranda, Y.-A. Borgne, and G. Bontempi, "New routes from minimal approximation error to principal components," Neural Process. Lett., vol. 27, no. 3, pp. 197–207, 2008.
[26] J. Cadima and I. Jolliffe, "On relationships between uncentred and column-centred principal component analysis," Pakistan J. Stat., vol. 25, no. 4, pp. 473–503, 2009.
[27] H. van den Dool, Empirical Methods in Short-Term Climate Prediction. London, U.K.: Oxford Univ. Press, 2007.
[28] S. Clements, "Principal component analysis applied to the spectral energy distribution of blazars," in Proc. Blazar Continuum Variability, Astronomical Soc. Pacific Conf. Series, 1996, pp. 455–461.


[29] O. Alter and G. H. Golub, "Singular value decomposition of genome-scale mRNA lengths distribution reveals asymmetry in RNA gel electrophoresis band broadening," Proc. Nat. Acad. Sci., vol. 103, no. 32, pp. 11828–11833, Jul. 2006.
[30] F. Gwadry, C. Berenstein, J. van Horn, and A. Braun, "Implementation and application of principal component analysis on functional neuroimaging data," Univ. Inst. Syst. Res., School of Eng., Tech. Rep. TR 2001-47, 2001.
[31] C. ter Braak, "Principal components biplots and alpha and beta diversity," Ecology, vol. 64, no. 3, pp. 454–462, 1983.
[32] J. E. Jackson, A User's Guide to Principal Components. New York, NY, USA: Wiley, 1991.
[33] R. A. Reyment and K. G. Jöreskog, Applied Factor Analysis in the Natural Sciences. Cambridge, U.K.: Cambridge Univ. Press, 1993.
[34] C. D. Meyer, Matrix Analysis and Applied Linear Algebra. Philadelphia, PA, USA: SIAM, 2000.
[35] A. N. Langville and C. D. Meyer, Google's PageRank and Beyond: The Science of Search Engine Rankings. Princeton, NJ, USA: Princeton Univ. Press, 2006.
[36] R. Jenssen, "Kernel entropy component analysis," IEEE Trans. Pattern Anal. Mach. Intell., vol. 33, no. 5, pp. 847–860, May 2010.
[37] T. Shi, M. Belkin, and B. Yu, "Data spectroscopy: Eigenspaces of convolution operators and clustering," Ann. Stat., vol. 37, no. 6, pp. 3960–3984, 2009.
[38] S. Sarkar and K. L. Boyer, "Quantitative measures of change based on feature organization: Eigenvalues and eigenvectors," Comput. Vis. Image Understand., vol. 71, no. 1, pp. 110–136, 1998.
[39] M. Lee, "Determining the dimensionality of multidimensional scaling representations for cognitive modeling," J. Math. Psychol., vol. 45, no. 1, pp. 149–166, Feb. 2001.
[40] D. G. Lowe, "Distinctive image features from scale-invariant keypoints," Int. J. Comput. Vis., vol. 60, no. 2, pp. 91–110, 2004.

[41] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei, "ImageNet: A large-scale hierarchical image database," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., Jun. 2009, pp. 248–255.
[42] A. Frank and A. Asuncion, "UCI machine learning repository," Univ. of California, School Inf. Comput. Sci., Irvine, CA, USA, Tech. Rep., 2013.

Robert Jenssen (M'03) is an Associate Professor with the Department of Physics and Technology, University of Tromsø, Tromsø, Norway, and a Research Professor with the Norwegian Center for Integrated Care and Telemedicine. He was a Guest Researcher with the Technical University of Denmark, Kongens Lyngby, Denmark, from 2012 to 2013, Technical University of Berlin, Berlin, Germany, from 2008 to 2009, and University of Florida, Gainesville, FL, USA, from 2002 to 2003. He focused on developing an information theoretic approach to machine learning, with strong connections to Mercer kernel methods and to spectral clustering and dimensionality reduction methods. He received the Honorable Mention for the 2003 Pattern Recognition Journal Best Paper Award, the 2005 IEEE ICASSP Outstanding Student Paper Award, and the 2007 UiT Young Investigator Award. His paper, "Kernel Entropy Component Analysis," was the featured paper of the May 2010 issue of the IEEE TRANSACTIONS ON PATTERN ANALYSIS AND MACHINE INTELLIGENCE. In addition, his paper, "Kernel Entropy Component Analysis for Remote Sensing Image Clustering," received the IEEE Geoscience and Remote Sensing Society Letters Prize Paper Award in 2013. He served on the IEEE Signal Processing Society Machine Learning for Signal Processing Technical Committee from 2006 to 2009, and he is currently an Associate Editor of Pattern Recognition.
