
Similarity Learning of Manifold Data

Si-Bao Chen, Chris H. Q. Ding, Member, IEEE, and Bin Luo, Senior Member, IEEE

Abstract—Without constructing an adjacency graph for the neighborhood, we propose a method to learn similarity among sample points of a manifold in Laplacian embedding (LE), based on adding constraints of linear reconstruction and least absolute shrinkage and selection operator (LASSO)-type minimization. Two algorithms and corresponding analyses are presented to learn similarity for mix-signed and nonnegative data, respectively. The similarity learning method is further extended to kernel spaces. Experiments on both synthetic and real-world benchmark data sets demonstrate that the proposed LE with the new similarity gives better visualization and achieves higher accuracy in classification.

Index Terms—L1 minimization, Laplacian embedding (LE), manifold learning, reconstruction, similarity learning.

I. INTRODUCTION

Usually, the dimension of data is very high in many real-world pattern recognition applications. Discovering the potential low-dimensional intrinsic topological structures of such high-dimensional data is an indispensable preprocessing step for many further data analysis tasks, such as data visualization and pattern recognition. The classical dimensionality reduction methods, such as principal component analysis (PCA) [1] and multidimensional scaling (MDS) [2], usually work well when the data points lie close to some linear subspace, but tend to fail to discover nonlinear structures of the data set.

In recent years, many researchers in the machine learning and pattern recognition communities have devoted themselves to learning nonlinear low-dimensional manifolds from sample data points embedded in high-dimensional spaces. The most representative methods are isometric feature mapping (ISOMAP) [3], locally linear embedding (LLE) [4], local tangent space alignment (LTSA) [5], and Laplacian embedding (LE) [6]. Most of these manifold learning algorithms share the same framework, which consists of three general steps: 1) constructing an adjacency graph of the neighborhood among

Manuscript received February 2, 2014; revised June 18, 2014 and September 18, 2014; accepted September 19, 2014. Date of publication October 8, 2014; date of current version August 14, 2015. This work was supported in part by the Natural Science Foundation of China under Grant 61202228 and Grant 61073116, in part by the Doctoral Program Foundation of Institutions of Higher Education of China under Grant 20103401120005, and in part by the Collegiate Natural Science Fund of Anhui Province under Grant KJ2012A004 and Grant KJ2012A008. This paper was recommended by Associate Editor X. Wang. S.-B. Chen and B. Luo are with the MOE Key Laboratory of Intelligent Computing and Signal Processing, School of Computer Science and Technology, Anhui University, Hefei 230601, China (e-mail: [email protected]; [email protected]). C. H. Q. Ding is with the Department of Computer Science and Engineering, University of Texas at Arlington, Arlington, TX 76019 USA (e-mail: [email protected]). Color versions of one or more of the figures in this paper are available online at http://ieeexplore.ieee.org. Digital Object Identifier 10.1109/TCYB.2014.2359984

Fig. 1. Similarity construction on manifold among data points A, B, and C. Because A is closer to B than to C, standard/traditional pairwise kernel/similarity S(A, B) is bigger than S(A, C). However, on the manifold, there are many data points distributed between A and C, thus A and C should be more similar than A and B. For example, in classification or clustering, A and C will be in the same class whereas B will be in another class. The proposed manifold similarity satisfies this desired property: W(A, C) > W(A, B), whereas the traditional similarity does not properly describe the situation/structure.

sample points; 2) computing some kind of pairwise feature (similarity) matrix within the neighborhood to approximate the local manifold structure; and 3) minimizing a criterion function to obtain the low-dimensional embedding, which can be turned into solving an eigenvalue problem.

Most of these manifold learning algorithms share the same starting step of constructing an adjacency graph of the neighborhood to learn similarity between data points. Traditional similarity between data points is learned on an adjacency graph of the neighborhood, which is constructed from pairs of data points individually; no consideration is given to their local environment when learning similarity. Furthermore, the Euclidean distance between data points is used to construct the adjacency graph. As shown in Fig. 1, data point A is closer to B than to C, so the standard/traditional pairwise kernel/similarity S(A, B) is bigger than S(A, C). However, on the manifold, many data points are distributed between A and C, so A and C should be more similar than A and B. For example, in classification or clustering, A and C will be in the same class whereas B will be in another class. Therefore, the desirable similarity between A and C should be larger than that between A and B.

Two strategies, k-nearest-neighborhood (k-NN) and ε-nearest-neighborhood (ε-NN), are commonly used for constructing the adjacency graph of the neighborhood. However, given a manifold data set, it is hard to choose a proper neighbor number k or radius ε beforehand. Sometimes, due to varying sample densities or manifold curvatures, k-NN or ε-NN may need a different k or ε for each sample point.



Fig. 2. 1-D embedding and corresponding similarity matrices of LE and LEnewS on C-shape and S-shape manifold data sets. After performing the 1-D embedding, the numbers 1, 2, . . . are used to indicate the order of data points on the flattened 1-D manifold. For the visualization purpose, blue lines connect data points which are adjacent in the embedded 1-D subspace. Under each 1-D embedding graph is the corresponding similarity matrix used.

As shown in Fig. 1, it is difficult to choose k or ε such that data point C is a neighbor of A while B is not. To find a proper neighborhood, many methods have been proposed for particular algorithms. An automatic algorithm to detect the optimal parameter value of LLE is presented in [7], which is further extended to choose the optimal neighbor number for ISOMAP [8]. Locally estimated geodesic distance and relative transformation are utilized to optimize the neighborhood in [9] and [10]. A dynamical neighborhood for each point is selected and the tangent subspace is constructed based on sampling density and manifold curvature [11]. Neighborhood contraction and neighborhood expansion strategies are proposed to select neighbors with local linear approximation [12]. Stochastic k-neighborhoods are selected for supervised and unsupervised learning in [13].

However, a neighborhood is only an approximation of the local manifold, and there is a deviation between the neighborhood and the true local manifold. Describing the local manifold structure with an improper neighborhood introduces error into the following step of computing the pairwise feature (similarity) matrix, which eventually degrades the performance of the manifold learning algorithm. For example, the performance of LE varies greatly with different numbers of neighbors. Sometimes, the

performance degrades badly when an improper neighbor number is set. Fig. 2 shows two simple examples where LE fails to find the correct low-dimensional embedding with an improper number of neighbors.

On the other hand, the purpose of constructing the neighborhood is to let the pairwise feature (similarity) matrix, which is computed in the second step of manifold learning algorithms, represent the local manifold structure more precisely. If one can compute the pairwise feature (similarity) matrix properly so that it represents the local manifold structure precisely, then the first step of constructing the neighborhood can be omitted. Therefore, circumventing the construction of the neighborhood to learn the pairwise feature (similarity) matrix is worthy of investigation.

In the literature, distance-metric learning is very closely related to similarity learning [14]–[19]. Recently, Bian and Tao [20] considered constrained empirical risk minimization for distance metric learning. Bellet et al. learned similarity for provably accurate sparse linear classification [21] and good edit similarity by loss minimization [22]. There are also other similarity learning methods, such as Riemannian similarity [23], nonlinear metric [24], and hierarchical similarity [25]. Distance metrics have been learned from sparse pairwise constraints [26], under covariate shift [27], from a network [28], and by large


margin nearest neighbor classification [29]. A mixture of sparse distance metrics is learned in [30]. A metric is learned to generalize to new classes at near-zero cost in [31]. Semi-supervised metrics are learned using pairwise constraints [32] and via entropy regularization [33]. A low-rank metric with manifold regularization is learned in [34]. Maggini et al. [35] learned similarity neural networks (SNNs) for pairs of patterns by exploiting binary supervision on their similarity/dissimilarity relationships. Tang et al. [36] learned similarity with a multikernel method.

In this paper, we focus on similarity learning in LE [6], one of the most important manifold learning algorithms. The classical LE algorithm computes the pairwise similarity matrix based on a neighborhood, while we try to learn a new similarity matrix for LE without constructing a neighborhood. To our knowledge, there is no algorithm in the literature that learns the similarity matrix in LE for manifold data without the help of a neighborhood.

The basic idea of our method originates from the fact that LLE [4] reconstructs a data point from its neighbors and the fact that least absolute shrinkage and selection operator (LASSO)-type minimization [37] produces sparsity. We find that when a data point is reconstructed convexly by its neighbor points, the reconstruction coefficients share some common properties with similarity: closer sample points are more similar and the corresponding reconstruction coefficients are bigger, while distant sample points are less similar and the corresponding reconstruction coefficients are smaller [4]. Starting from Gaussian heat kernel similarities among all sample points, rather than only within neighborhoods in the input space, we learn a new similarity based on reconstruction and L1 minimization.

It is worthwhile to highlight some properties of the proposed approach here.
1) The similarity matrix of manifold data is learned among all sample points rather than within a neighborhood; therefore, no neighborhood graph construction is needed beforehand.
2) Two algorithms and corresponding analyses are proposed to learn similarity for mix-signed and nonnegative data, respectively.
3) Although we focus on similarity matrix learning in LE, the learned similarity matrix can also be applied to other manifold learning methods, such as MDS [2], ISOMAP [3], and locality preserving projections (LPP) [38], [39].

The rest of this paper is organized as follows. We first briefly review LE, then learn the new similarity without constructing a neighborhood, followed by a simple illustration. In Section III, we give two algorithms and accompanying analyses to learn the new similarity for mix-signed and nonnegative data, respectively. Learning similarity in kernel spaces is presented in Section IV. Section V gives experiments on visualization and classification, and Section VI concludes this paper.

II. LEARNING NEW SIMILARITY IN LE

A. LE

Given n data points X_1, X_2, . . . , X_n ∈ R^p, LE [6] pursues their low-dimensional representation Y_1, Y_2, . . . , Y_n ∈ R^d (d < p). LE constructs a weighted graph with the n points as nodes and a set of weighted edges connecting neighboring points. The embedding map is obtained by computing the eigenvectors of the graph Laplacian corresponding to the smallest nonzero eigenvalues. The whole procedure of LE is summarized as follows.

1) Constructing the Adjacency Graph: Let G denote a graph with n nodes, the ith node corresponding to data point X_i. k-NN is adopted since the parameter ε in ε-NN is hard to choose. Nodes i and j are connected if X_i is among the k nearest neighbors of X_j or X_j is among the k nearest neighbors of X_i.

2) Choosing the Weights: If there is an edge between nodes i and j, put a similarity weight S_ij on it; otherwise set it to zero. The similarity weight S_ij is defined with the heat kernel (rather than simply setting it to one, due to its better performance)

$$ S_{ij} = \exp\!\left(-\frac{d_{ij}^{2}}{2r}\right) \tag{1} $$

where d_ij = ||X_i − X_j|| and r > 0 is a suitable constant.

3) Eigenmaps: Compute the eigenvalues and eigenvectors of the generalized eigenvector problem

$$ Lv = \lambda D v \tag{2} $$

where D is the diagonal matrix with elements D_ii = Σ_j S_ij and L = D − S is the Laplacian matrix. Let v_0, v_1, . . . , v_d be the eigenvectors corresponding to the d + 1 smallest eigenvalues 0 = λ_0 ≤ λ_1 ≤ · · · ≤ λ_d. Omitting the trivial v_0, the ith point X_i is mapped into

$$ X_i \rightarrow Y_i = \big(v_1(i), v_2(i), \ldots, v_d(i)\big). \tag{3} $$
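As a concrete reference for the three steps above, the following is a minimal NumPy/SciPy sketch of classical LE; the function name, the k-NN construction details, and the toy data are illustrative assumptions rather than the paper's own code.

```python
# Minimal sketch of classical Laplacian embedding: k-NN graph, heat-kernel weights (1),
# and the generalized eigenproblem (2)-(3). Illustrative only.
import numpy as np
from scipy.linalg import eigh

def laplacian_embedding(X, k=8, r=1.0, d=2):
    """X: p-by-n data matrix (columns are samples). Returns an n-by-d embedding."""
    n = X.shape[1]
    D2 = np.sum((X[:, :, None] - X[:, None, :]) ** 2, axis=0)   # squared distances d_ij^2
    # Step 1: symmetric k-NN adjacency (i ~ j if i is in j's k-NN or vice versa).
    idx = np.argsort(D2, axis=1)[:, 1:k + 1]
    A = np.zeros((n, n), dtype=bool)
    A[np.repeat(np.arange(n), k), idx.ravel()] = True
    A = A | A.T
    # Step 2: heat-kernel weights S_ij = exp(-d_ij^2 / (2r)) on graph edges, zero elsewhere.
    S = np.where(A, np.exp(-D2 / (2.0 * r)), 0.0)
    # Step 3: generalized eigenproblem L v = lambda D v; drop the trivial eigenvector v_0.
    Dg = np.diag(S.sum(axis=1))
    L = Dg - S
    vals, vecs = eigh(L, Dg)          # eigenvalues in ascending order
    return vecs[:, 1:d + 1]           # rows: Y_i = (v_1(i), ..., v_d(i))

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    t = np.sort(rng.uniform(0, 3 * np.pi, 200))
    X = np.vstack([np.cos(t), 0.5 * np.sin(t)])   # a simple curved 2-D data set
    Y = laplacian_embedding(X, k=8, r=0.1, d=1)
    print(Y.shape)                                # (200, 1)
```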

Note that the similarity matrix S in LE only has nonzero entries among neighboring points. In practice, we find that the performance of LE heavily relies on the choice of the neighbor number k. Fig. 2 also shows that LE fails to find the correct low-dimensional map with an improper neighbor number k.

B. Learning New Similarity Without Constructing Neighborhood

It is arbitrary to set one fixed neighbor number k to construct the adjacency graph, and it is also arbitrary to set the similarity between sample points which are not within the (possibly improper) neighborhood to zero. Here we compute the similarity S_ij in (1) for all pairs of sample points X_i and X_j, i, j = 1, 2, . . . , n. However, this similarity matrix S = (S_ij) does not reflect the local manifold topology, and it needs further amendment.

Inspired by the idea of the first part of LLE [4], locally linear reconstruction of a sample point by its neighbor points can represent the local manifold structure. Let the data matrix be X = [X_1, X_2, . . . , X_n]; then the objective of locally linear reconstruction can be written as

$$ \min_{W_{ji}} \; \sum_{i=1}^{n} \Big\| X_i - \sum_{j \in N(i)} X_j W_{ji} \Big\|^2 = \| X - XW \|_F^2. \tag{4} $$


If we restrict the reconstruction weights W_ij to be nonnegative, W_ij ≥ 0, then each data point is reconstructed convexly by its neighbor points. By further restricting W_ij to be symmetric, W_ij = W_ji, the reconstruction weights behave like a similarity: closer sample points are more similar and the corresponding reconstruction weights are bigger, while distant sample points are less similar and the corresponding reconstruction weights are smaller [4]. Therefore, we add a linear reconstruction constraint to learn the new similarity between sample points, and the objective becomes

$$ \min_{W} \; \| X - XW \|_F^2 + \alpha \| W - S \|_F^2 \quad \text{s.t.} \; W_{ij} = W_{ji} \ge 0, \; W_{ii} = 0 \tag{5} $$

where the combining parameter α > 0. Here we relax the reconstruction weights W_ij so that they need not be zero outside a neighborhood, while still keeping the diagonal elements W_ii = 0 (i = 1, . . . , n). Without a neighborhood, as compensation, we adopt LASSO-type minimization of ||W||_1 to obtain sparsity of W. Then the minimization of the final objective function J(W) of our new similarity learning becomes

$$ \min_{W} \; \| X - XW \|_F^2 + \alpha \| W - S \|_F^2 + \beta \| W \|_1 \quad \text{s.t.} \; W_{ij} = W_{ji} \ge 0, \; W_{ii} = 0 \tag{6} $$

where the combining parameter β > 0.

Note that the diagonal elements S_ii of S in LE represent the similarity between data points and themselves, which are usually set to one. However, each data point is usually excluded from its own neighborhood, which leads to no self-loops when constructing the adjacency graph in the first step of LE; therefore, the diagonal elements S_ii of S in LE are usually set to zero. In fact, suppose v is the generalized eigenvector of (2) corresponding to eigenvalue λ; then ((D − I_n) − (S − I_n))v = (D − S)v = Lv = λDv. The diagonal elements of S only contribute to the diagonal elements of D, and do not contribute to the matrix L. Therefore, changing S_ii from one to zero only changes the normalizer v^T Dv = 1 of the generalized eigenvalue problem (2) to v^T(D − I_n)v = 1. The experiments show that there is little difference in LE embedding performance between setting S_ii to all ones and to all zeros. To facilitate the following similarity learning, we set S_ii = 0, i = 1, 2, . . . , n. One can also change the diagonal elements of the newly learned similarity matrix W from zero to one when performing LE with the new similarity (LEnewS).

With the algorithms presented in Section III, we can obtain the new similarity matrix W without the help of a neighborhood. Then, omitting the step of neighborhood construction completely and replacing S in the eigenvalue problem (2) with the newly learned similarity matrix W, we obtain the low-dimensional embedding map for all sample points.
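The pipeline implied by (1)-(6) can be summarized in a short sketch: build the dense heat-kernel matrix S over all pairs with zero diagonal, evaluate the objective J(W) of (6), and embed with a learned W in place of S. The bandwidth heuristic in the sketch follows the experimental setting 2r = σ min_i(max_j d_ij^2) given later in Section V-A; all names are illustrative assumptions.

```python
# Dense heat-kernel similarity S, objective J(W) of (6), and LE eigenmaps with a given W.
import numpy as np
from scipy.linalg import eigh

def heat_kernel_similarity(X, sigma=0.02):
    D2 = np.sum((X[:, :, None] - X[:, None, :]) ** 2, axis=0)
    r2 = sigma * np.min(np.max(D2, axis=1))       # plays the role of 2r in (1)
    S = np.exp(-D2 / r2)
    np.fill_diagonal(S, 0.0)                      # S_ii = 0, as discussed above
    return S

def objective_J(X, W, S, alpha=1.0, beta=1.0):
    """J(W) = ||X - XW||_F^2 + alpha ||W - S||_F^2 + beta ||W||_1 with W >= 0."""
    return (np.linalg.norm(X - X @ W, "fro") ** 2
            + alpha * np.linalg.norm(W - S, "fro") ** 2
            + beta * np.abs(W).sum())

def embed_with_similarity(W, d=2):
    """LE eigenmaps step (2)-(3) with any similarity matrix, e.g., the learned W."""
    Dg = np.diag(W.sum(axis=1))
    L = Dg - W
    vals, vecs = eigh(L, Dg)
    return vecs[:, 1:d + 1]
```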

C. Simple Illustration

First, we use the data shown in Fig. 1, where we argue that sim_manifold(A, C) should be larger than sim_manifold(A, B), even though distance(A, C) is larger than distance(A, B) [and thus, for the traditionally defined similarity, sim_trad(A, C) < sim_trad(A, B)]. Indeed, the standard Gaussian heat kernel gives sim_trad(A, B) = S(A, B) = 0.16039 > sim_trad(A, C) = 3.0132e−05. Applying the proposed manifold-based similarity learning method, we obtain sim_manifold(A, B) = W(A, B) = 0.009066 < sim_manifold(A, C) = W(A, C), as desired.

Fig. 3. Variation of the minimum swap distance of the 1-D embeddings of LE and LEnewS on the C-shape and bad S-shape manifold data sets with different settings of the neighborhood parameter k in k-NN and ε in ε-NN.

In contrast, LEnewS produces a simple unrolling of the original manifold and obtains zero minimum swap distance, indicating that the local topology is preserved perfectly.

III. ALGORITHMS AND ANALYSIS

In this section, we first provide an algorithm and accompanying analysis for the similarity learning presented in the previous section for general mix-signed data. Then, for nonnegative data, we present a more efficient algorithm and accompanying analysis.

A. Algorithm for Mix-Signed Data

When the elements of the data matrix X are mix-signed (some positive, some negative), we learn the similarity matrix W via the iterative updating formula

$$ W_{ij}^{(t+1)} = W_{ij}^{(t)} \sqrt{ \frac{ \big[ W^{(t)}(X^{T}X)^{-} + (X^{T}X)^{-}W^{(t)} + 2\big((X^{T}X)^{+} + \alpha S\big) \big]_{ij} }{ \big[ W^{(t)}(X^{T}X)^{+} + (X^{T}X)^{+}W^{(t)} + 2\alpha W^{(t)} + 2(X^{T}X)^{-} \big]_{ij} + \beta } } \tag{7} $$

where (X^T X)^+ and (X^T X)^- denote the positive and negative parts of X^T X, (X^T X)^+_ij = (|(X^T X)_ij| + (X^T X)_ij)/2 and (X^T X)^-_ij = (|(X^T X)_ij| − (X^T X)_ij)/2. The whole iterative updating algorithm of similarity learning for mix-signed data is summarized in Algorithm 1.

Algorithm 1 Algorithm of Learning Similarity for Mix-Signed Data
Input: A p-by-n data matrix X which contains n p-dimensional training samples, positive tuning parameters r, α and β.
Output: An n-by-n similarity matrix W among the n training samples.
1: Compute the initial similarity matrix S, with its elements being heat kernels S_ij = exp{−d_ij^2/(2r)};
2: Set W_ij^(0) = 1, W_ii^(0) = 0, S_ii = 0, i, j = 1, 2, . . . , n, t = 0;
3: repeat
4:   For each i, j = 1, 2, . . . , n, update W_ij^(t+1) as in (7);
5:   t = t + 1;
6: until t > MaxIterNum or |J(W^(t)) − J(W^(t−1))| < ε
7: return W^(t).

In the algorithm, there are three tuning parameters, r, α, and β, besides the data matrix X in the inputs. However, in practice, we find that our algorithms are robust to the parameters α and β. Therefore, we simply set α = β = 1 in all the experiments of this paper.
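A compact sketch of the update (7) is given below, assuming the positive/negative-part splitting of X^T X used in the proof that follows and the same initialization as in Algorithm 2; the convergence test on the maximum entrywise change is a simplification of the |J(W^(t)) − J(W^(t−1))| < ε criterion, and all names are illustrative.

```python
# Multiplicative update (7) for mix-signed data; illustrative sketch only.
import numpy as np

def learn_similarity_mixed(X, S, alpha=1.0, beta=1.0, max_iter=500, tol=1e-6):
    n = X.shape[1]
    G = X.T @ X
    Gp = (np.abs(G) + G) / 2.0          # (X^T X)^+
    Gm = (np.abs(G) - G) / 2.0          # (X^T X)^-
    W = np.ones((n, n)); np.fill_diagonal(W, 0.0)   # initialization as in Algorithm 2
    for _ in range(max_iter):
        num = W @ Gm + Gm @ W + 2.0 * (Gp + alpha * S)
        den = W @ Gp + Gp @ W + 2.0 * alpha * W + 2.0 * Gm + beta
        W_new = W * np.sqrt(num / (den + 1e-12))
        np.fill_diagonal(W_new, 0.0)
        W_new = (W_new + W_new.T) / 2.0  # numerical safeguard; update preserves symmetry
        if np.abs(W_new - W).max() < tol:
            W = W_new
            break
        W = W_new
    return W
```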

Theorem 1: The objective function in (6) decreases monotonically (i.e., it is nonincreasing) under the update rule (7) in Algorithm 1.

Proof: The minimization of the objective function (6) imposes the restriction W_ij ≥ 0, which produces a constrained optimization problem. We first show that the limiting solution of the update rule (7) satisfies the Karush–Kuhn–Tucker (KKT) condition [41] at convergence; this is proved in Proposition 1 below and establishes the correctness of the limiting solution. To prove that the iteration of the update rule (7) in Algorithm 1 converges, we adopt the method used in [42], where an upper-bound auxiliary function, which is differentiable and convex, is constructed to deduce the iteration update rule.

The objective function J(W) in (6) can be written as

$$ J(W) = \mathrm{Tr}\Big[ W^{T}\big(X^{T}X + \alpha I\big)W - 2\Big(X^{T}X + \alpha S - \tfrac{\beta}{2}E\Big)W + \big(X^{T}X + \alpha S^{T}S\big) \Big] \tag{8} $$

where Tr(M) is the trace of matrix M, I is the n-by-n identity matrix, and E is the n-by-n matrix with all entries equal to one. If we let H = W^T, A = X^T X + αI, and B = X^T X + αS − (β/2)E, we can rewrite the objective function as

$$ J(H) = \mathrm{Tr}\big[ HAH^{T} - 2BH^{T} \big] = \mathrm{Tr}\big[ HA^{+}H^{T} - HA^{-}H^{T} - 2B^{+}H^{T} + 2B^{-}H^{T} \big] \tag{9} $$

where A^+, A^-, B^+, and B^- can be any pairs of matrices with all nonnegative entries satisfying A = A^+ − A^- and B = B^+ − B^-. Note that we need not restrict A^+ and A^- to be the positive and negative parts of A, A^+_ij = (|A_ij| + A_ij)/2 and A^-_ij = (|A_ij| − A_ij)/2: for any nonnegative number C_ij ≥ 0, we can set A^+_ij = (|A_ij| + A_ij)/2 + C_ij and A^-_ij = (|A_ij| − A_ij)/2 + C_ij. The same property holds for B^+ and B^-.

Following the method in [42], we formulate an auxiliary function for the objective J(H) in (9) as

$$ Z(H, H') = \sum_{ik} \frac{(H'A^{+})_{ik} H_{ik}^{2}}{H'_{ik}} - \sum_{ikl} A^{-}_{kl} H'_{ik} H'_{il} \left(1 + \log \frac{H_{ik}H_{il}}{H'_{ik}H'_{il}}\right) - 2\sum_{ik} B^{+}_{ik} H'_{ik} \left(1 + \log \frac{H_{ik}}{H'_{ik}}\right) + \sum_{ik} B^{-}_{ik} \frac{H_{ik}^{2} + H'^{2}_{ik}}{H'_{ik}}. \tag{10} $$

From [42], we know that Z(H, H') satisfies J(H) ≤ Z(H, H') and J(H) = Z(H, H). As a function of the symmetric unknown matrix H, it has the property stated in Proposition 2 below. Based on Proposition 2, we can obtain the global minimum of the auxiliary function Z(H, H') by (16). Define

$$ H^{(t+1)} = \arg\min_{H} Z\big(H, H^{(t)}\big). \tag{11} $$


Then, we have J(H^(t)) = Z(H^(t), H^(t)) ≥ Z(H^(t+1), H^(t)) ≥ J(H^(t+1)). Thus J(H^(t)) is monotonically decreasing (nonincreasing) through the iterations. Substituting H = W^T, A^+ = (X^T X)^+ + αI, A^- = (X^T X)^-, B^+ = (X^T X)^+ + αS, and B^- = (X^T X)^- + (β/2)E into formula (16) of Proposition 2, we obtain the update (7) for the similarity matrix W.

Proposition 1: The limiting solution of the update rule (7) satisfies the KKT condition [41] of the optimization problem.

Proof: We construct the Lagrangian function for J(W)

$$ L(W) = J(W) - \mathrm{Tr}\big[\Lambda^{T} W\big] \tag{12} $$

where the Lagrangian multipliers Λ_ik enforce the nonnegativity constraints W_ik ≥ 0. Setting its gradient to zero gives ∂L(W)/∂W = ∂J(W)/∂W − Λ = 0. From the complementary slackness condition, we get

$$ \frac{\partial J(W)}{\partial W_{ij}}\, W_{ij} = \Lambda_{ij} W_{ij} = 0. \tag{13} $$

This is a fixed-point equation that the solution must satisfy at convergence. On the other hand, at convergence, W^(+∞) = W^(t+1) = W^(t) = W in update (7); rearranging it gives

$$ W_{ij}^{2} \Big\{ \big[ W(X^{T}X)^{+} + (X^{T}X)^{+}W + 2\alpha W + 2(X^{T}X)^{-} \big]_{ij} + \beta \Big\} = W_{ij}^{2} \Big\{ \big[ W(X^{T}X)^{-} + (X^{T}X)^{-}W \big]_{ij} + 2\big[(X^{T}X)^{+} + \alpha S\big]_{ij} \Big\}. \tag{14} $$

Noting that X^T X = (X^T X)^+ − (X^T X)^-, we get

$$ W_{ij}^{2} \Big\{ \big[ W X^{T}X + X^{T}X W + 2\alpha W \big]_{ij} - 2\big[ X^{T}X + \alpha S \big]_{ij} + \beta \Big\} = W_{ij}^{2}\, \frac{\partial J(W)}{\partial W_{ij}} = 0. \tag{15} $$

Equation (15) is equivalent to (13), since W_ij^2 = 0 if and only if W_ij = 0. Thus if (15) holds, then (13) also holds, and vice versa. This proves that the limiting solution of the update rule (7) satisfies the fixed-point equation.

Proposition 2: If H = H^T and H' = H'^T, then Z(H, H') in (10) is a convex function of the symmetric matrix H, and its global minimum is

$$ H_{ik} = \arg\min_{H} Z(H, H') = H'_{ik} \sqrt{ \frac{ (H'A^{-})_{ik} + (H'A^{-})_{ki} + B^{+}_{ik} + B^{+}_{ki} }{ (H'A^{+})_{ik} + (H'A^{+})_{ki} + B^{-}_{ik} + B^{-}_{ki} } }. \tag{16} $$

Proof: Since H_ik = H_ki, we can compute the first derivatives

$$ \frac{\partial Z(H, H')}{\partial H_{ik}} = \frac{2(H'A^{+})_{ik} H_{ik}}{H'_{ik}} + \frac{2(H'A^{+})_{ki} H_{ik}}{H'_{ik}} - \frac{2(H'A^{-})_{ik} H'_{ik}}{H_{ik}} - \frac{2(H'A^{-})_{ki} H'_{ik}}{H_{ik}} - 2B^{+}_{ik}\frac{H'_{ik}}{H_{ik}} - 2B^{+}_{ki}\frac{H'_{ik}}{H_{ik}} + 2B^{-}_{ik}\frac{H_{ik}}{H'_{ik}} + 2B^{-}_{ki}\frac{H_{ik}}{H'_{ik}}. \tag{17} $$

The second derivatives are

$$ \frac{\partial^{2} Z(H, H')}{\partial H_{ik}\,\partial H_{jl}} = \delta_{ij}\delta_{kl}\,\Psi_{ik} \tag{18} $$

where δ_ij equals 1 if i = j and 0 otherwise, and

$$ \Psi_{ik} = 2\,\frac{ (H'A^{+})_{ik} + (H'A^{+})_{ki} + B^{-}_{ik} + B^{-}_{ki} }{ H'_{ik} } + 2\,\frac{ \big[ (H'A^{-})_{ik} + (H'A^{-})_{ki} + B^{+}_{ik} + B^{+}_{ki} \big] H'_{ik} }{ H_{ik}^{2} }. \tag{19} $$

The Hessian matrix is a diagonal matrix with positive entries. Thus, Z(H, H') is a convex function of symmetric H. Therefore, we can obtain the global minimum by setting ∂Z(H, H')/∂H_ik = 0 in (17) and solving for H_ik. Rearranging, we get

$$ \frac{H_{ik}^{2}}{H'_{ik}} \Big[ (H'A^{+})_{ik} + (H'A^{+})_{ki} + B^{-}_{ik} + B^{-}_{ki} \Big] = H'_{ik} \Big[ (H'A^{-})_{ik} + (H'A^{-})_{ki} + B^{+}_{ik} + B^{+}_{ki} \Big] \tag{20} $$

which is equivalent to (16).

B. Algorithm for Nonnegative Data

If all the entries of the data matrix X are nonnegative, we propose a more efficient algorithm to learn the similarity matrix W

$$ W_{ij}^{(t+1)} = W_{ij}^{(t)} \, \frac{ 2\big[ X^{T}X + \alpha S \big]_{ij} }{ \big[ W^{(t)}X^{T}X + X^{T}XW^{(t)} + 2\alpha W^{(t)} \big]_{ij} + \beta }. \tag{21} $$

The whole similarity learning algorithm for nonnegative data is summarized in Algorithm 2.

Algorithm 2 Algorithm of Learning Similarity for Nonnegative Data
Input: A p-by-n data matrix X which contains n p-dimensional training samples, positive tuning parameters r, α and β.
Output: An n-by-n similarity matrix W among the n training samples.
1: Compute the initial similarity matrix S, with its elements being heat kernels S_ij = exp{−d_ij^2/(2r)};
2: Set W_ij^(0) = 1, W_ii^(0) = 0, S_ii = 0, i, j = 1, 2, . . . , n, t = 0;
3: repeat
4:   For each i, j = 1, 2, . . . , n, update W_ij^(t+1) as in (21);
5:   t = t + 1;
6: until t > MaxIterNum or |J(W^(t)) − J(W^(t−1))| < ε
7: return W^(t).

Once again, although we list the parameters α and β as inputs, we simply set α = β = 1 in all the experiments of this paper.
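The following is a short NumPy sketch of Algorithm 2, assuming a nonnegative data matrix and using the objective value J(W) of (6) for the stopping test; names and defaults are illustrative. Combined with the earlier sketches, the learned W would then be plugged into the eigenmaps step in place of S to obtain the LEnewS embedding.

```python
# Sketch of Algorithm 2: multiplicative update (21) for nonnegative data.
import numpy as np

def learn_similarity_nonnegative(X, S, alpha=1.0, beta=1.0, max_iter=500, eps=1e-8):
    assert np.all(X >= 0), "Algorithm 2 assumes a nonnegative data matrix"
    n = X.shape[1]
    G = X.T @ X
    num_const = 2.0 * (G + alpha * S)          # 2[X^T X + alpha S], fixed over iterations
    W = np.ones((n, n)); np.fill_diagonal(W, 0.0)
    J_prev = np.inf
    for _ in range(max_iter):
        den = W @ G + G @ W + 2.0 * alpha * W + beta
        W = W * num_const / (den + 1e-12)
        np.fill_diagonal(W, 0.0)
        # objective (6); W >= 0, so ||W||_1 is just the entry sum
        J = (np.linalg.norm(X - X @ W, "fro") ** 2
             + alpha * np.linalg.norm(W - S, "fro") ** 2 + beta * W.sum())
        if abs(J_prev - J) < eps:
            break
        J_prev = J
    return W
```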


Theorem 2: For nonnegative data, the objective function in (6) decreases monotonically (i.e., it is nonincreasing) under the update rule (21) in Algorithm 2.

Proof: When all the elements of the data matrix X are nonnegative, A = X^T X + αI = A^+. The objective function J(W) in matrix trace form (8) can be rewritten as

$$ J(W) = \mathrm{Tr}\big[ W^{T}AW - 2W^{T}B \big] = \mathrm{Tr}\big[ W^{T}A^{+}W - 2W^{T}B^{+} + 2W^{T}B^{-} \big] \tag{22} $$

where A = A^+ = X^T X + αI, B^+ = X^T X + αS, and B^- = (β/2)E are all nonnegative matrices. We follow the method adopted in [43] to deduce the iteration update rule and construct an upper-bound auxiliary function, which is differentiable and convex. Define

$$ K(W')_{ik} = K_{ik} = \frac{ 2(A^{+}W')_{ik} + 2B^{-}_{ik} }{ W'_{ik} } \tag{23} $$

$$ G(W, W') = J(W') + \mathrm{Tr}\big[ (W - W')^{T} \nabla J(W') \big] + \frac{1}{2}\, \mathrm{Vec}(W - W')^{T}\, \mathrm{diag}\big( \mathrm{Vec}(K(W')) \big)\, \mathrm{Vec}(W - W') \tag{24} $$

where Vec(M) is the vectorization of matrix M, which concatenates the columns of M one by one to form a large column vector, and diag(v) is the diagonalization of vector v, which constructs a diagonal matrix with the elements of v as its diagonal elements. Then, based on Proposition 3 below, G(W, W') in (24) is an auxiliary function of J(W) in (22). Define

$$ W^{(t+1)} = \arg\min_{W} G\big(W, W^{(t)}\big). \tag{25} $$

Then J(W^(t)) = G(W^(t), W^(t)) ≥ G(W^(t+1), W^(t)) ≥ J(W^(t+1)). Therefore, J(W^(t)) is monotonically decreasing (nonincreasing) as t becomes larger. Based on Proposition 4 below, we can obtain the global minimum of G(W, W') by (30). Then, substituting A^+ = X^T X + αI, B^+ = X^T X + αS, and B^- = (β/2)E into the iteration (30) of Proposition 4, we recover the update (21).

Proposition 3: G(W, W') in (24) is an auxiliary function of J(W) in (22), i.e., G(W, W') ≥ J(W) and G(W, W) = J(W).

Proof: It is obvious that G(W, W) = J(W). To prove G(W, W') ≥ J(W), rewrite the Taylor expansion of J(W) at W'

$$ \begin{aligned} J(W) &= J(W') + \sum_{ik} (W_{ik} - W'_{ik})\big(2AW' - 2B\big)_{ik} + \frac{1}{2} \sum_{ik}\sum_{jl} (W_{ik} - W'_{ik})\, \delta_{kl}\, 2A_{ij}\, (W_{jl} - W'_{jl}) \\ &= J(W') + \mathrm{Tr}\big[ (W - W')^{T} \nabla J(W') \big] + \mathrm{Tr}\big[ (W - W')^{T} A (W - W') \big] \\ &= J(W') + \mathrm{Tr}\big[ (W - W')^{T} \nabla J(W') \big] + \frac{1}{2}\,\mathrm{Vec}(W - W')^{T} (I \otimes 2A)\, \mathrm{Vec}(W - W') \end{aligned} \tag{26} $$

where ⊗ is the Kronecker product of matrices. Then, based on (24) and (26), G(W, W') ≥ J(W) is equivalent to

$$ \mathrm{Vec}(W - W')^{T} \Big[ \mathrm{diag}\big(\mathrm{Vec}(K(W'))\big) - I \otimes 2A \Big] \mathrm{Vec}(W - W') \ge 0. \tag{27} $$

To prove the semi-positive definiteness, consider the matrix

$$ M = \mathrm{diag}\big(\mathrm{Vec}(W')\big) \Big[ \mathrm{diag}\big(\mathrm{Vec}(K(W'))\big) - I \otimes 2A \Big] \mathrm{diag}\big(\mathrm{Vec}(W')\big) \tag{28} $$

whose element M_ab is just a positive rescaling of the corresponding component of diag(Vec(K(W'))) − I ⊗ 2A. Then diag(Vec(K(W'))) − I ⊗ 2A is semi-positive definite if and only if M is. Since

$$ \begin{aligned} v^{T} M v &= \sum_{ik}\sum_{jl} v_{ik} W'_{ik} \big( \delta_{ij}\delta_{kl} K_{ik} - \delta_{kl}\, 2A_{ij} \big) W'_{jl} v_{jl} \\ &= \sum_{ik} W'^{2}_{ik} K_{ik} v^{2}_{ik} - \sum_{ijk} 2A_{ij} W'_{ik} W'_{jk} v_{ik} v_{jk} \\ &= \sum_{ijk} 2A_{ij} W'_{jk} W'_{ik} v^{2}_{ik} + \sum_{ik} 2B^{-}_{ik} W'_{ik} v^{2}_{ik} - \sum_{ijk} 2A_{ij} W'_{ik} W'_{jk} v_{ik} v_{jk} \\ &= \sum_{ijk} 2A_{ij} W'_{ik} W'_{jk} \Big( \frac{v^{2}_{ik} + v^{2}_{jk}}{2} - v_{ik} v_{jk} \Big) + \sum_{ik} 2B^{-}_{ik} W'_{ik} v^{2}_{ik} \\ &= \sum_{ijk} A_{ij} W'_{ik} W'_{jk} \big( v_{ik} - v_{jk} \big)^{2} + \sum_{ik} 2B^{-}_{ik} W'_{ik} v^{2}_{ik} \;\ge\; 0 \end{aligned} \tag{29} $$

Therefore, G(W, W') ≥ J(W), and G(W, W') is an auxiliary function of J(W).

Proposition 4: If W = W^T and W' = W'^T, then G(W, W') in (24) is a convex function of the symmetric matrix W, and its global minimum is

$$ W_{ik} = \arg\min_{W} G(W, W') = W'_{ik}\, \frac{ B^{+}_{ik} + B^{+}_{ki} }{ (A^{+}W')_{ik} + (A^{+}W')_{ki} + B^{-}_{ik} + B^{-}_{ki} }. \tag{30} $$

Proof: Since W_ik = W_ki, the first derivatives of G(W, W') are

$$ \frac{\partial G(W, W')}{\partial W_{ik}} = \nabla J(W')_{ik} + \nabla J(W')_{ki} + K_{ik}(W - W')_{ik} + K_{ki}(W - W')_{ki} = \nabla J(W')_{ik} + \nabla J(W')_{ki} + (K_{ik} + K_{ki})(W - W')_{ik}. \tag{31} $$

The second derivatives in the Hessian matrix are

$$ \frac{\partial^{2} G(W, W')}{\partial W_{ik}\,\partial W_{jl}} = \delta_{ij}\delta_{kl}\,(K_{ik} + K_{ki}). \tag{32} $$

Thus the Hessian matrix is a diagonal matrix with positive entries. Therefore, G(W, W') is a convex function of the symmetric matrix W. We can obtain its global minimum by setting ∂G(W, W')/∂W_ik = 0 in (31) and solving for W_ik.


Rearranging, we obtain

$$ \begin{aligned} W_{ik} &= W'_{ik} - (K_{ik} + K_{ki})^{-1} \big[ \nabla J(W')_{ik} + \nabla J(W')_{ki} \big] \\ &= W'_{ik} - W'_{ik}\, \frac{ 2(A^{+}W')_{ik} + 2B^{-}_{ik} - 2B^{+}_{ik} + 2(A^{+}W')_{ki} + 2B^{-}_{ki} - 2B^{+}_{ki} }{ 2(A^{+}W')_{ik} + 2(A^{+}W')_{ki} + 2B^{-}_{ik} + 2B^{-}_{ki} } \\ &= W'_{ik}\, \frac{ B^{+}_{ik} + B^{+}_{ki} }{ (A^{+}W')_{ik} + (A^{+}W')_{ki} + B^{-}_{ik} + B^{-}_{ki} }. \end{aligned} $$

This completes the proof.

C. Relation Between These Two Algorithms

The iterative update (21) in Algorithm 2 can only be applied to a nonnegative data matrix X, while the iterative update (7) in Algorithm 1 has no such restriction. When all the elements of the data matrix X are nonnegative, we can adopt either of the two algorithms. In fact, if all the elements of X are nonnegative, then all the elements of X^T X are nonnegative, so we can set (X^T X)^-_ij = 0 in (7). Then the iterative update (21) is equivalent to (7) with the W_ij in the multiplicative factors updated every two iterative loops.

Fig. 4. Convergence performance comparison of the two similarity learning algorithms [the algorithm for mix-signed data, "Alg. (mix-signed)," and the algorithm for nonnegative data, "Alg. (nonnegative)"] on the previous C-shape and S-shape manifold data sets.

Fig. 4 shows the variation of the objective function value under these two algorithms on the previous C-shape and S-shape manifold data sets. From the figure, we can see that both algorithms converge to the same minimum. Both of them converge very fast, while Algorithm 2 is more efficient.

To speed up the similarity learning iteration for mix-signed data, especially for negative data, one can shift the data to be nonnegative by adding some positive constant to all elements of the data matrix X, and then apply Algorithm 2 to learn the similarity for nonnegative data.
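The shift described above can be written as a small wrapper; this sketch assumes the learn_similarity_nonnegative function from the Algorithm 2 sketch is available in the same session, and the shift constant is only one possible choice.

```python
# Shift mix-signed data to be nonnegative, then reuse the nonnegative update of Algorithm 2.
import numpy as np

def learn_similarity_shifted(X, S, alpha=1.0, beta=1.0, **kwargs):
    shift = max(0.0, -X.min())      # smallest constant making every entry of X nonnegative
    return learn_similarity_nonnegative(X + shift, S, alpha, beta, **kwargs)
```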


D. Computational Complexity

The computational complexity of Algorithm 1 for learning similarity among mix-signed data mainly lies in iteratively computing (7). The matrices (X^T X)^+, (X^T X)^-, and 2[(X^T X)^+ + αS] can be computed before the iteration. Note that W, (X^T X)^+, and (X^T X)^- are all symmetric matrices, so [W^(t)(X^T X)^+]_ij = [(X^T X)^+ W^(t)]_ji and [W^(t)(X^T X)^-]_ij = [(X^T X)^- W^(t)]_ji. Therefore, only two multiplications of n-by-n matrices dominate the computational load of (7), whose computational complexity is of order O(2n^3).

The computational complexity of Algorithm 2 for learning similarity among nonnegative data mainly lies in iteratively computing (21). The matrices X^T X and 2[X^T X + αS] can be computed before the iteration. Note that (W^(t) X^T X)^T = X^T X W^(t). Therefore, only one multiplication of two n-by-n matrices dominates the computational load of (21), whose computational complexity is of order O(n^3).

The k-NN search in computing the traditional LE heat kernel similarity matrix is of order O(kn log n). However, the parameter k in k-NN or ε in ε-NN is hard to determine beforehand for the best performance of the LE embedding. Usually, cross-validation is adopted to choose the parameter k or ε, which costs a lot of time.

IV. LEARNING NEW SIMILARITY IN KERNEL SPACES

Consider a nonlinear map φ : R^p → F, X_i → φ(X_i), which maps Euclidean data features into some potentially high- (and possibly infinite-) dimensional feature space F. We now learn similarity among the data in the F space. The mapped data matrix is φ(X) = [φ(X_1), φ(X_2), . . . , φ(X_n)]. The minimization objective function becomes

$$ \begin{aligned} & \|\phi(X) - \phi(X)W\|_F^2 + \alpha\|W - S\|_F^2 + \beta\|W\|_1 \\ &\quad = \mathrm{Tr}\Big[ W^{T}\big(\phi(X)^{T}\phi(X) + \alpha I\big)W - 2\Big(\phi(X)^{T}\phi(X) + \alpha S - \tfrac{\beta}{2}E\Big)W + \big(\phi(X)^{T}\phi(X) + \alpha S^{T}S\big) \Big]. \end{aligned} \tag{33} $$

In implementation, the mapping φ does not need to be computed explicitly. The mapping φ and the space F are determined implicitly by choosing a proper kernel function k, which equals the dot product between two mapped data samples φ(X_i) and φ(X_j) in the F space

$$ k(X_i, X_j) = \phi(X_i) \cdot \phi(X_j). \tag{34} $$

Based on Mercer's theorem of functional analysis, if k is a positive definite kernel, then there exists a mapping φ into a dot product space F such that (34) holds. Fortunately, our objective function for learning similarity in (33) depends only on the kernel matrix K = φ(X)^T φ(X). Furthermore, the iterative updates (7) and (21) in Algorithms 1 and 2 depend only on X^T X. Thus, it is possible to learn the similarity in kernel space. Note that most commonly used kernel functions, such as the Gaussian kernel and the cosine kernel, are nonnegative. Replacing X^T X with the kernel matrix K in the iterative update (21) for nonnegative data, we get

$$ W_{ij}^{(t+1)} = W_{ij}^{(t)} \, \frac{ 2\big[ K + \alpha S \big]_{ij} }{ \big[ W^{(t)}K + KW^{(t)} + 2\alpha W^{(t)} \big]_{ij} + \beta }. \tag{35} $$

By replacing (21) in Algorithm 2 for nonnegative data with (35), we obtain the algorithm for learning similarity in kernel space.
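A small sketch of the kernelized learner based on (35) follows, assuming a (nonnegative) Gaussian kernel; only the Gram matrix K enters the iteration, so the mapping φ never needs to be formed explicitly. The kernel width and stopping test are illustrative choices.

```python
# Kernelized similarity learning via update (35); illustrative sketch only.
import numpy as np

def gaussian_kernel(X, gamma=1.0):
    D2 = np.sum((X[:, :, None] - X[:, None, :]) ** 2, axis=0)
    return np.exp(-gamma * D2)

def learn_similarity_kernel(K, S, alpha=1.0, beta=1.0, max_iter=500, tol=1e-8):
    n = K.shape[0]
    W = np.ones((n, n)); np.fill_diagonal(W, 0.0)
    num = 2.0 * (K + alpha * S)                   # fixed numerator 2[K + alpha S]
    for _ in range(max_iter):
        den = W @ K + K @ W + 2.0 * alpha * W + beta
        W_new = W * num / (den + 1e-12)
        np.fill_diagonal(W_new, 0.0)
        if np.abs(W_new - W).max() < tol:
            W = W_new
            break
        W = W_new
    return W
```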


Fig. 5. Effects of tuning parameters α, β, and σ on minimum swap distance between LEnewS 1-D embedding data order and the expected unrolling of original manifold data order on the C-shape and bad S-shape datasets.

V. EXPERIMENTS

From Section III-C, we can see that Algorithms 1 and 2 converge to the same new similarity matrix, while Algorithm 2 is faster. In the following experiments, we simply adopt Algorithm 2 to learn the new similarity matrix and test the corresponding performance. All the experiments are conducted on an Intel i7 Q720 CPU at 1.60 GHz with 4 GB memory, using MATLAB 7.8 (2009a).

A. Effects of Tuning Parameters

There are three tuning parameters, α, β, and r, in Algorithm 2. In most of the experiments, we simply set α = β = 1 and 2r = σ min_i(max_j(d_ij^2)), where σ = 0.02. To test the effect of each tuning parameter, we fix the other tuning parameters as above and vary the selected one. Fig. 5 shows the variation of the minimum swap (adjacent data points) distance between the LEnewS 1-D embedding data order and the expected unrolling of the original manifold data order on the C-shape and bad S-shape data sets with different settings of the tuning parameters α, β, and σ.

From the figure, we can see that when α is set between one and two, the obtained similarity matrices are stable and lead to zero minimum swap distance. When α = 0, which turns the objective function (6) into L1-sparse coding, the obtained similarity matrices are poor, since the corresponding minimum swap distances of LEnewS are nonzero. The method is robust when the parameter β is set around one. Too large a β, which overemphasizes sparsity, yields poor similarity matrices and degrades the performance of LEnewS. However, when β = 0, which omits the sparsity constraint on the similarity matrix W in (6), the learnt similarity is not robust with respect to the performance of LEnewS. This shows that the third, sparsity-constraint term in the objective function (6) is necessary. The performance of the learned similarity varies across data sets when different Gaussian heat kernel parameters σ (and the corresponding r) are set; usually, setting σ around 0.02 produces better performance. As in traditional LE, the Gaussian heat kernel parameter σ in the proposed methods should be chosen carefully.

B. Embedding of Noisy 3-D Manifold Benchmarks

Typical manifold benchmarks for dimensionality reduction from the literature, such as the Swiss roll and the punctured sphere, are expected to be mapped from 3-D to 2-D to extract the embedding. LE with parameter k = 8 (or 6, or 12) in k-NN has been verified to be effective for some of these mappings. However, when there is noise in these manifold benchmarks, LE may lack robustness to it.

Fig. 6. 2-D embeddings of LE and LEnewS on 3-D manifold benchmarks with only one outlier.


Fig. 7. 2-D visualizations of classical LE (left) and LEnewS (right) on two handwritten letter and digit data sets. (a) Binary alphadigits. (b) Mixed National Institute of Standards and Technology (MNIST).

In this section, we test the robustness of LEnewS on eight typical manifold benchmarks, i.e., Swiss roll, Swiss hole, corner planes, punctured sphere, twin peaks, 3-D clusters, toroidal helix, and Gaussian, with only one outlier as noise. In total, n = 800 data points are generated for each manifold benchmark with default settings.1 The outlier for each benchmark is simply generated as the farthest vertex of the external cube of the manifold benchmark. The new similarity is learned on the noisy 3-D manifold benchmarks, and the noisy benchmarks are then embedded into 2-D subspaces. LE using k-NN and ε-NN similarity matrices is also tested for comparison; the parameters of k-NN and ε-NN are tuned so that LE gives its best results. Fig. 6 shows the 2-D embedding results of LE and LEnewS, with each row corresponding to one manifold benchmark. From the figure, we can see that LE is very fragile to the outlier/noise, whereas LEnewS is more robust and can effectively find the proper low-dimensional embedding.
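For reference, here is a sketch of the noisy-benchmark construction for one benchmark (the Swiss roll); the toolbox cited in footnote 1 is not reproduced, and placing the outlier at the bounding-box vertex farthest from the data centroid is an assumption about how "the farthest vertex of the external cube" is chosen.

```python
# Swiss roll with a single outlier at a bounding-box vertex; illustrative sketch only.
import numpy as np

def swiss_roll_with_outlier(n=800, seed=0):
    rng = np.random.default_rng(seed)
    t = 1.5 * np.pi * (1 + 2 * rng.uniform(size=n))
    h = 21.0 * rng.uniform(size=n)
    X = np.vstack([t * np.cos(t), h, t * np.sin(t)])             # 3-by-n Swiss roll
    lo, hi = X.min(axis=1), X.max(axis=1)
    centroid = X.mean(axis=1)
    # among the 8 bounding-box vertices, pick the one farthest from the data centroid
    corners = np.array([[a, b, c] for a in (lo[0], hi[0])
                                  for b in (lo[1], hi[1])
                                  for c in (lo[2], hi[2])]).T     # 3-by-8
    outlier = corners[:, np.argmax(np.linalg.norm(corners - centroid[:, None], axis=0))]
    return np.hstack([X, outlier[:, None]])                       # 3-by-(n+1)
```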

C. Visualization

To show the low-dimensional embedding performance of LEnewS, we use 2-D visualization to show the structure of the data points. We select four letters ("C," "P," "X," and "Z") from the binary alphadigits data set and four digits ("0," "3," "6," and "9") from the MNIST handwritten digits data set. Both can be downloaded from the web page of Sam Roweis.2

We tune the parameters of LE, the neighbor number k and the sigma factor r in computing the heat kernel similarity, so that LE gives its best performance. For LEnewS, the sigma factor r in computing the heat kernel similarity is chosen the same as that of LE, while the combining parameters α and β are both set to one. The 2-D embedding results are shown in Fig. 7(a) and (b). The left figures in Fig. 7(a) and (b) are the best results of LE. From the figures, we can see that "C" and "P" in binary alphadigits, and "0" and "6" in MNIST, are still collapsed together. In the right figures, obtained with LEnewS, the image data points are distributed more evenly, with much clearer visualizations.

D. Classification of LEnewS

To demonstrate the classification performance of LE with the newly learnt similarity, we conduct extensive experiments on five image databases and ten data sets taken from the UCI machine learning repository.3 The five image databases consist of three famous face databases (ORL,4 Yale,5 and YaleB6), a toy image database, COIL20,7 and a facial expression database, JAFFE.8 The ten UCI data sets include iris, wine, car, sonar, ionosphere, glass, ecoli, zoo, image, and Statlog. Details about these databases can be found on the left side of Table I.

1 http://www.math.ucla.edu/~wittman/mani/
2 http://www.cs.nyu.edu/~roweis/data.html
3 http://archive.ics.uci.edu/ml/
4 http://www.cl.cam.ac.uk/research/dtg/attarchive/facedatabase.html
5 http://cvc.yale.edu/projects/yalefaces/yalefaces.html
6 http://cvc.yale.edu/projects/yalefacesB/yalefacesB.html
7 http://www.cs.columbia.edu/CAVE/software/softlib/coil-20.php
8 http://www.kasrl.org/jaffe.html


TABLE I. CLASSIFICATION ACCURACY OF LE AND LEnewS.

The neighbor number k and the sigma factor r in computing the heat kernel similarity of LE are tuned such that LE reaches its best classification performance. The sigma factor r in computing the heat kernel similarity of LEnewS is chosen to be the same as that of LE, and the combining parameters in LEnewS are α = β = 1. The leave-one-out cross-validation strategy, which is widely used in nonlinear manifold learning, is adopted in this experiment. The classification accuracy is computed by the nearest neighbor classifier in the embedded low-dimensional subspace with the Euclidean distance as metric.

We first investigate the variation of classification performance with different choices of the embedded dimension. Fig. 8 shows the variation of the classification accuracy of LE and LEnewS on the ORL, Yale, YaleB, and COIL20 databases. From the figure, we can see that LEnewS outperforms LE consistently across different embedded dimensions. Then, for all 15 databases, the embedded dimension is tuned such that LE reaches its best classification accuracy. The resulting top classification accuracies on the embedded space of LE and LEnewS are reported in the right side of Table I. From the results, we can see that LEnewS tends to outperform the best of LE consistently.

Fig. 8. Classification performance variations of classical LE and LEnewS on four image databases with different numbers of embedded dimensions.

E. Classification of LPP With New Similarity

Since LE is a nonlinear manifold learning method, it is hard to generalize it to embed data points outside the training set. To test the generalization performance of the proposed similarity learning method, we perform the linear extension of LE, LPP [38], with the new similarity, which is denoted as LPPnewS. Four image databases, ORL, Yale, YaleB, and COIL20, are selected for testing. Each database is randomly split into a training set and a testing set with different numbers of images per subject for training. We train Laplacianfaces (PCA+LPP) [39] and Laplacianfaces with the new similarity (PCA+LPPnewS) on the training sets to obtain linear projection matrices, project the testing sets into low-dimensional subspaces, and test the classification performance of Laplacianfaces and PCA+LPPnewS on the embedded testing sets. The classification accuracy is computed by the nearest neighbor classifier in the embedded low-dimensional subspace with the Euclidean distance as metric.

Fig. 9 shows the classification performances of Laplacianfaces and PCA+LPPnewS with different numbers of images per subject for training. From the figure, we can see that Laplacianfaces with the new similarity outperforms classical Laplacianfaces consistently.

Fig. 9. Classification performance variations of classical Laplacianfaces (PCA+LPP) and Laplacianfaces with new similarity (PCA+LPPnewS) on four image databases with different numbers of training samples per subject.
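A sketch of the linear extension used in this subsection follows: PCA followed by LPP, where the LPP graph weight matrix can be either the heat-kernel S (Laplacianfaces) or the learned W (PCA+LPPnewS). The PCA dimension, the ridge term, and the function names are illustrative assumptions rather than the paper's exact protocol; new samples are then embedded by the returned linear map, which is what allows testing on points outside the training set.

```python
# PCA+LPP projection learned from a given similarity matrix W; illustrative sketch only.
import numpy as np
from scipy.linalg import eigh

def lpp_projection(X, W, d=2):
    """X: q-by-n training data (columns = samples); W: n-by-n similarity. Returns q-by-d A."""
    Dg = np.diag(W.sum(axis=1))
    L = Dg - W
    A_mat = X @ L @ X.T
    B_mat = X @ Dg @ X.T + 1e-8 * np.eye(X.shape[0])   # small ridge for numerical stability
    vals, vecs = eigh(A_mat, B_mat)                     # ascending generalized eigenvalues
    return vecs[:, :d]                                  # columns a_1, ..., a_d

def pca_projection(X, d=50):
    Xc = X - X.mean(axis=1, keepdims=True)
    U, _, _ = np.linalg.svd(Xc, full_matrices=False)    # assumes d <= min(p, n)
    return U[:, :d]

def fit_pca_lpp(X_train, W, pca_dim=50, lpp_dim=2):
    mean = X_train.mean(axis=1, keepdims=True)
    P = pca_projection(X_train, pca_dim)                # p-by-pca_dim
    Z = P.T @ (X_train - mean)                          # pca_dim-by-n
    A = lpp_projection(Z, W, lpp_dim)                   # pca_dim-by-lpp_dim
    return lambda X_new: A.T @ (P.T @ (X_new - mean))   # maps p-by-m data to lpp_dim-by-m
```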


F. Classification of LEnewKS

To test the performance of the new similarity learnt in kernel spaces as computed in (35), we test the classification accuracy of LE with the new similarity learnt in different kernel spaces (LEnewKS), with different kernel functions, on the ORL database. Several kernel functions (linear, polynomial, cosine, chi-squared, and Gaussian) are adopted in the experiments. We first test LEnewKS with ten-fold cross-validation as different numbers of embedded dimensions are chosen, and then we test LEnewKS when different numbers of samples per person are taken as the training set.

Fig. 10. Classification performances of LEnewKS with different kernel functions.

Fig. 10 shows the classification performance variations of LEnewKS with different kernel functions. From the figure, we can see that LEnewKS outperforms classical LE consistently with most of the kernel functions, and LEnewKS with the Gaussian kernel achieves the best results.

VI. CONCLUSION

We learned the similarity among sample points of a manifold based on linear reconstruction and L1 minimization. The step of constructing an adjacency graph for the neighborhood, which many manifold learning methods adopt, was omitted completely. Two algorithms and corresponding analyses were presented to learn similarity for mix-signed and nonnegative data, respectively. We also extended the method to kernel spaces. The experimental results on both synthetic and real-world benchmark data sets demonstrate that LE and LPP with the new similarity provide a better representation (visualization) and achieve higher accuracy in classification.

Since Algorithm 2 for nonnegative data is more efficient than Algorithm 1 for mix-signed data, one may shift the data to be nonnegative by adding some positive constant to all elements of the data matrix. Whether or not to augment the feature dimensionality by appending an appropriate bias (constant feature) should be investigated. Since adding a constant feature will increase the similarity between data points, how the augmented constant feature affects the learned similarity will be studied in the future.

ACKNOWLEDGMENT

The authors would like to thank the Editor-in-Chief, the handling Associate Editor, and the anonymous reviewers for their support and constructive comments on this paper.

REFERENCES

[1] I. Jolliffe, Principal Component Analysis, 2nd ed. New York, NY, USA: Springer, 2002.
[2] T. F. Cox and M. A. A. Cox, Multidimensional Scaling, 2nd ed. New York, NY, USA: Chapman & Hall, 2001.
[3] J. B. Tenenbaum, V. D. Silva, and J. C. Langford, “A global geometric framework for nonlinear dimensionality reduction,” Science, vol. 290, pp. 2319–2323, Dec. 2000.
[4] S. T. Roweis and L. K. Saul, “Nonlinear dimensionality reduction by locally linear embedding,” Science, vol. 290, pp. 2323–2326, Dec. 2000.
[5] Z. Zhang and H. Zha, “Principal manifolds and nonlinear dimensionality reduction via tangent space alignment,” SIAM J. Sci. Comput., vol. 26, no. 1, pp. 313–338, 2004.
[6] M. Belkin and P. Niyogi, “Laplacian eigenmaps for dimensionality reduction and data representation,” Neural Comput., vol. 15, no. 6, pp. 1373–1396, Jun. 2003.


[7] O. Kouropteva, O. Okun, and M. Pietikainen, “Selection of the optimal parameter value for the locally linear embedding algorithm,” in Proc. 1st Int. Conf. Fuzzy Syst. Knowl. Discov., 2002, pp. 359–363.
[8] O. Samko, A. Marshall, and P. Rosin, “Selection of the optimal parameter value for the ISOMAP algorithm,” Pattern Recognit. Lett., vol. 27, no. 9, pp. 968–979, 2006.
[9] G. Wen, L. Jiang, and J. Wen, “Using locally estimated geodesic distance to optimize neighborhood graph for isometric data embedding,” Pattern Recognit., vol. 41, no. 7, pp. 2226–2236, Jul. 2008.
[10] G. Wen, “Relative transformation-based neighborhood optimization for isometric embedding,” Neurocomputing, vol. 72, nos. 4–6, pp. 1205–1213, 2009.
[11] X. Gao and J. Liang, “The dynamical neighborhood selection based on the sampling density and manifold curvature for isometric data embedding,” Pattern Recognit. Lett., vol. 32, no. 2, pp. 202–209, 2011.
[12] Z. Zhang, J. Wang, and H. Zha, “Adaptive manifold learning,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 34, no. 2, pp. 253–265, Feb. 2012.
[13] D. Tarlow, K. Swersky, I. Charlin, L. Sutskever, and R. S. Zemel, “Stochastic K-neighborhood selection for supervised and unsupervised learning,” in Proc. 30th Int. Conf. Mach. Learn., 2013, pp. 199–207.
[14] G. Chechik, V. Sharma, U. Shalit, and S. Bengio, “Large scale online learning of image similarity through ranking,” J. Mach. Learn. Res., vol. 11, pp. 1109–1135, Mar. 2010.
[15] K. Weinberger, J. Blitzer, and L. Saul, “Distance metric learning for large margin nearest neighbor classification,” in Proc. Adv. Neural Inf. Process. Syst., vol. 18, 2006, pp. 1473–1480.
[16] J. V. Davis, B. Kulis, P. Jain, S. Sra, and I. S. Dhillon, “Information-theoretic metric learning,” in Proc. 24th Int. Conf. Mach. Learn., 2007, pp. 209–216.
[17] C. Chen, J. Zhang, and R. Fleischer, “Distance approximating dimension reduction of Riemannian manifolds,” IEEE Trans. Syst., Man, Cybern. B, Cybern., vol. 40, no. 1, pp. 208–217, Feb. 2010.
[18] Y. Xiao, B. Liu, Z. Hao, and L. Cao, “A similarity-based classification framework for multiple-instance learning,” IEEE Trans. Cybern., vol. 44, no. 4, pp. 500–515, Apr. 2014.
[19] J. Song, Y. Yang, X. Li, Z. Huang, and Y. Yang, “Robust hashing with local models for approximate similarity search,” IEEE Trans. Cybern., vol. 44, no. 7, pp. 1225–1236, Jul. 2014.
[20] W. Bian and D. Tao, “Constrained empirical risk minimization framework for distance metric learning,” IEEE Trans. Neural Netw. Learn. Syst., vol. 23, no. 8, pp. 1194–1205, Aug. 2012.
[21] A. Bellet, A. Habrard, and M. Sebban, “Similarity learning for provably accurate sparse linear classification,” in Proc. 29th Int. Conf. Mach. Learn., 2012, pp. 1871–1878.
[22] A. Bellet, A. Habrard, and M. Sebban, “Good edit similarity learning by loss minimization,” Mach. Learn., vol. 89, nos. 1–2, pp. 5–35, 2012.
[23] L. Cheng, “Riemannian similarity learning,” in Proc. Int. Conf. Mach. Learn. (ICML), 2013, pp. 540–548.
[24] D. Kedem, S. Tyree, K. Q. Weinberger, F. Sha, and G. R. G. Lanckriet, “Non-linear metric learning,” in Proc. Adv. Neural Inf. Process. Syst., 2012, pp. 2582–2590.
[25] N. Verma, D. Mahajan, S. Sellamanickam, and V. Nair, “Learning hierarchical similarity metrics,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), Providence, RI, USA, 2012, pp. 2280–2287.
[26] A. Mignon and F. Jurie, “PCCA: A new approach for distance learning from sparse pairwise constraints,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), Providence, RI, USA, 2012, pp. 2666–2672.
[27] B. Cao, X. Ni, J.-T. Sun, G. Wang, and Q. Yang, “Distance metric learning under covariate shift,” in Proc. 22nd Int. Joint Conf. Artif. Intell. (IJCAI), 2011, pp. 1204–1210.
[28] B. Shaw, B. C. Huang, and T. Jebara, “Learning a distance metric from a network,” in Proc. Adv. Neural Inf. Process. Syst. (NIPS), 2011, pp. 1899–1907.
[29] K. Park, C. Shen, Z. Hao, and J. Kim, “Efficiently learning a distance metric for large margin nearest neighbor classification,” in Proc. 25th AAAI Conf. Artif. Intell. (AAAI), 2011, pp. 453–458.
[30] Y. Hong, Q. Li, J. Jiang, and Z. Tu, “Learning a mixture of sparse distance metrics for classification and dimensionality reduction,” in Proc. IEEE Int. Conf. Comput. Vis. (ICCV), Barcelona, Spain, 2011, pp. 906–913.
[31] T. Mensink, J. Verbeek, F. Perronnin, and G. Csurka, “Metric learning for large scale image classification: Generalizing to new classes at near-zero cost,” in Proc. 12th Eur. Conf. Comput. Vis., Florence, Italy, 2012, pp. 488–501.


[32] M. S. Baghshah and S. B. Shouraki, “Semi-supervised metric learning using pairwise constraints,” in Proc. 21st Int. Joint Conf. Artif. Intell. (IJCAI), 2009, pp. 1217–1222.
[33] G. Niu, B. Dai, M. Yamada, and M. Sugiyama, “Information-theoretic semi-supervised metric learning via entropy regularization,” in Proc. 29th Int. Conf. Mach. Learn., 2012, pp. 89–96.
[34] G. Zhong, K. Huang, and C.-L. Liu, “Low rank metric learning with manifold regularization,” in Proc. IEEE 11th Int. Conf. Data Min. (ICDM), Vancouver, BC, Canada, 2011, pp. 1266–1271.
[35] M. Maggini, S. Melacci, and L. Sarti, “Learning from pairwise constraints by similarity neural networks,” Neural Netw., vol. 26, pp. 141–158, Feb. 2012.
[36] Y. Tang, L. Li, and X. Li, “Learning similarity with multikernel method,” IEEE Trans. Syst., Man, Cybern. B, Cybern., vol. 41, no. 1, pp. 131–138, Feb. 2011.
[37] R. Tibshirani, “Regression shrinkage and selection via the lasso,” J. Roy. Statist. Soc. B, vol. 58, no. 1, pp. 267–288, 1994.
[38] X. He and P. Niyogi, “Locality preserving projections,” in Proc. Conf. Adv. Neural Inf. Process. Syst., 2003, pp. 153–160.
[39] X. He, S. Yan, Y. Hu, P. Niyogi, and H. Zhang, “Face recognition using Laplacianfaces,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 27, no. 3, pp. 328–340, Mar. 2005.
[40] D. Luo, C. H. Q. Ding, F. Nie, and H. Huang, “Cauchy graph embedding,” in Proc. 28th Int. Conf. Mach. Learn. (ICML), 2011, pp. 553–560.
[41] S. Boyd and L. Vandenberghe, Convex Optimization. Cambridge, U.K.: Cambridge Univ. Press, 2004.
[42] C. Ding, T. Li, and M. Jordan, “Convex and semi-nonnegative matrix factorizations,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 32, no. 1, pp. 45–55, Jan. 2010.
[43] D. D. Lee and H. S. Seung, “Algorithms for non-negative matrix factorization,” in Proc. Adv. Neural Inf. Process. Syst. (NIPS), vol. 13, 2000, pp. 556–562.

Si-Bao Chen received the B.S. and M.S. degrees in probability and statistics and the Ph.D. degree in computer science from Anhui University, Hefei, China, in 2000, 2003, and 2006, respectively. From 2006 to 2008, he was a Post-Doctoral Researcher at the Department of Electronic Engineering and Information Science, University of Science and Technology of China, Hefei. He has been at Anhui University since 2008, and is currently an Associate Professor. His current research interests include pattern recognition, machine learning, image processing, and computer vision.


Chris H. Q. Ding received the Ph.D. degree from Columbia University, New York, NY, USA, in 1987. He was with the California Institute of Technology, Pasadena, CA, USA, the Jet Propulsion Laboratory, Pasadena, CA, USA, and the Lawrence Berkeley National Laboratory, Berkeley, CA, USA. Since 2007, he has been with the University of Texas at Arlington, Arlington, TX, USA. His current research interests include machine learning, data mining, bioinformatics, information retrieval, web link analysis, and high performance computing. He has published about 200 papers, which have received over 13 000 Google Scholar citations.

Bin Luo received the B.Eng. degree in electronics and the M.Eng. degree in computer science from Anhui University, Hefei, China, in 1984 and 1991, respectively, and the Ph.D. degree in computer science from the University of York, York, U.K., in 2002. From 1996 to 1997, he was awarded a British Council Visiting Scholarship under the Sino-British Friendship Scholarship Scheme. He was a Research Associate with the University of York from 2000 to 2004 and a Research Fellow at British Telecom, London, U.K., in 2006. He visited the University of Greenwich, London, U.K., in 2007, as a Visiting Professor, and the University of New South Wales, Sydney, NSW, Australia, in 2008, as a Visiting Fellow. He has published 160 papers in journals, edited books, and refereed conferences. He is currently a Professor with Anhui University. His current research interests include graph spectral analysis, large image database retrieval, image and graph matching, statistical pattern recognition, digital watermarking, and information security.

Prof. Luo is the Chair of the IEEE Hefei Subsection. He was one of the General Chairs of the International Symposium on Information Technologies and Applications in Education, Xiamen, China, in 2008. He has served as a reviewer for international academic journals, including the IEEE TRANSACTIONS ON PATTERN ANALYSIS AND MACHINE INTELLIGENCE, Pattern Recognition, Pattern Recognition Letters, the International Journal of Pattern Recognition and Artificial Intelligence, Knowledge and Information Systems, and Neurocomputing. He has been on the program committees of international academic conferences such as GMPR08, MPIS08, FGCN08, Applied Computing08, ICMLC08, ICMLC2007, and ICNC2007.
