TPSLVM: A Dimensionality Reduction Algorithm Based on Thin Plate Splines

Xinwei Jiang, Junbin Gao, Tianjiang Wang, and Daming Shi

Abstract—Dimensionality reduction (DR) has been considered one of the most significant tools for data analysis. One family of DR algorithms is based on latent variable models (LVM), and LVM-based models can handle the preimage problem easily. In this paper, we propose a new LVM-based DR model, named the thin plate spline latent variable model (TPSLVM). Compared to the well-known Gaussian process latent variable model (GPLVM), the proposed TPSLVM is more powerful especially when the dimensionality of the latent space is low. Moreover, TPSLVM is robust to shift and rotation. This paper investigates two extensions of TPSLVM, i.e., the back-constrained TPSLVM (BC-TPSLVM) and TPSLVM with dynamics (TPSLVM-DM), as well as their combination BC-TPSLVM-DM. Experimental results show that TPSLVM and its extensions provide better data visualization and more efficient dimensionality reduction compared to PCA, GPLVM, ISOMAP, etc.

Index Terms—Data visualization, dimensionality reduction, latent variable models, unsupervised learning.

Manuscript received July 19, 2012; revised July 29, 2013; accepted December 9, 2013. Date of publication January 2, 2014; date of current version September 12, 2014. This work was supported in part by the Australian Research Council (ARC) through Grant DP130100364, in part by the Fundamental Research Funds for the Central Universities, China University of Geosciences, and in part by the National Natural Science Foundation of China under Grant 61073094. This paper was recommended by Associate Editor F. Hoffmann.

X. Jiang is with the School of Computer Science, China University of Geosciences, Wuhan 430074, China (e-mail: [email protected]). J. Gao is with the School of Computing and Mathematics, Charles Sturt University, Bathurst, NSW 2795, Australia (e-mail: [email protected]). T. Wang is with the School of Computer Science and Technology, Huazhong University of Science and Technology, Wuhan 430074, China (e-mail: [email protected]). D. Shi is with the School of Computer Science and Technology, Harbin Institute of Technology, Harbin 150001, China (e-mail: [email protected]).

Color versions of one or more of the figures in this paper are available online at http://ieeexplore.ieee.org. Digital Object Identifier 10.1109/TCYB.2013.2295329

I. Introduction

EXTRACTING low-dimensional structure from high-dimensional data arises frequently in modern applications such as data clustering and visualization. Dimensionality reduction (DR), which maps data points in a high-dimensional space into embeddings in a low-dimensional space, is an important aspect of many statistical learning tasks [1]. The curse of dimensionality, as the difficulty of learning from high-dimensional data is often called, leads to problems such as high computation and storage costs, ineffective analysis, inefficient transmission, and complex relational structures. DR holds the key to tackling this practical problem and is a common and often necessary step in most machine learning applications involving high-dimensional data analysis, such as biological engineering [2] and computer vision [3], where high-dimensional data are often easily generated but difficult to analyze. DR [1] is also considered one of the most significant preprocessing steps in data analysis.

In order to analyze complex data, plenty of machine learning algorithms have been developed to discover the low-dimensional embeddings of data manifolds lying in a high-dimensional observed space. Traditionally, DR methods are categorized into two major classes: supervised algorithms [4], including the conventional linear discriminant analysis (LDA), and unsupervised ones [5], [6], including the classical principal component analysis (PCA). In the past two decades, researchers have mainly focused on unsupervised learning for dimensionality reduction tasks.

The classical PCA is the most well-known DR model. It models the linear relationship between the high-dimensional observed space and the low-dimensional space by maximizing the variance of the projected low-dimensional data, and the process learns a linear mapping from the high-dimensional space to the low-dimensional space. In 1999, Tipping and Bishop [7] made an assumption over the data generation process in which the observed high-dimensional data are generated from low-dimensional latent variables. Thus, the high-dimensional data are considered to be dominated by a few generating factors/features that depict the low-dimensional counterparts of the high-dimensional observations. Hence, an algorithm that learns latent variables is regarded as a DR algorithm. Under the linear generation mechanism, Tipping and Bishop proposed the so-called probabilistic principal component analysis (PPCA), which models the mapping from a low-dimensional space (latent variables) to a high-dimensional space (observed variables). This technique belongs to the family of latent variable models (LVM), in which a probabilistic framework is introduced to model the data generative procedure. Model learning is typically conducted by Bayesian learning with certain assumptions, e.g., that the low-dimensional latent variables obey Gaussian distributions; this can also be seen as a special factor analysis (FA) model [7]. The core of a latent variable model is to learn the generative process of the high-dimensional data under a model assumption, whereas a direct DR algorithm, such as the classical PCA, learns the corresponding low-dimensional projection from the high-dimensional observation data directly.

The advantage of PCA/PPCA is that they are easily understandable and tractable, and in many cases their performance is satisfactory. However, real-world data may not meet the linear assumption, and when that assumption is violated the results can become unsatisfactory [8]. To address this issue, researchers have been working on nonlinear DR models, such as MDS [9], Sammon mapping [10], ISOMAP [5], locally linear embedding (LLE) [6], kernel PCA (KPCA) [11], locality preserving projection (LPP) [12], local spline embedding (LSE) [13], flexible manifold embedding (FME) [14], Cauchy graph embedding [15], the Gaussian process latent variable model (GPLVM) [16], and the relevance units latent variable model (RULVM) [17].

There are two major classes of nonlinear DR algorithms: spectral approaches and LVM-based methods. The first category includes MDS, Sammon mapping, ISOMAP, LLE, and KPCA. These algorithms are based on spectral analysis, in which one uses proximity data to obtain the low-dimensional embedding. In these methods, generally speaking, the information of the high-dimensional data is summarized in a similarity/dissimilarity matrix, such as a kernel or distance matrix, and then the low-dimensional embedded structure is revealed by eigen-analysis. Each method has its own way of building the similarity/dissimilarity matrix that encodes the information contained in the observation data. Solving a spectral DR model is typically fast, with no iterative optimization, and in many circumstances the performance is satisfactory. However, because spectral decomposition is used to directly obtain the final lower-dimensional embedding of the high-dimensional observation data, and the corresponding algorithms run in batch mode, once for all the input data points, the main drawback of these models is that most of them are incapable of projecting a new high-dimensional data point into its low-dimensional embedding (known as the out-of-sample problem or the backward mapping problem) and incapable of projecting back from the low-dimensional space to the observed data space (known as the preimage problem or the forward mapping problem), although there are some extensions [18] and novel DR models [12], [13], [19] that try to address these two problems. The out-of-sample capability is critical for handling new data without relearning the whole low-dimensional manifold, which is computationally expensive and performed off-line. The preimage capability is critical for working back in the original observation space, either for visualization, such as using KPCA to de-noise images, or for operations such as tracking in the video space [20]. The reason why spectral approaches suffer from these two problems is that they do not offer a mechanism to learn a mapping between the low-dimensional and high-dimensional spaces. Learning such mappings is very important in many applications; for example, a mapping from the low-dimensional space to the high-dimensional space is very useful in data generation and interpolation tasks. Lawrence [16] and van der Maaten et al. [1] provide a detailed comparison of these spectral approaches.

As an alternative approach, a latent variable model is capable of learning a mapping from the low-dimensional space
to the high-dimensional observation space. Algorithms based on latent variable models will be termed LVM-based DR models throughout this paper. PPCA, GPLVM, and RULVM can be viewed as the most representative DR models in this category. In the setting of LVM-based DR models, given the high-dimensional observation data set Y = \{y_n\}_{n=1}^{N}, the goal is to learn the corresponding low-dimensional points X = \{x_n\}_{n=1}^{N} by modeling y = f(x) with an unknown function f. The X's are typically termed latent variables since, in this unsupervised task, they are initially unknown. Basically, we can consider this kind of model as a general regression model in supervised learning; the only difference is that in an LVM the covariates X are learned along with the regression model parameters. For example, a linear regression model y = f(x) + \epsilon = Wx + \epsilon is employed in PPCA, where \epsilon is assumed to be isotropic Gaussian noise [7]. Similarly, GPLVM uses Gaussian process regression (GPR) to represent the relationship between the latent variables X and the observations Y nonlinearly. Motivated by the dual representation of the PPCA model, GPLVM has been seen as an effective nonlinear dimensionality reduction technique. Basically, the idea behind GPLVM is identical to that of dual PPCA: both optimize the marginal likelihood with respect to the latent variables X directly rather than with respect to the mapping matrix W of PPCA. By imposing a Gaussian process prior on the mapping function f in GPLVM and incorporating the Gaussian likelihood of the observed data set, marginalizing the mapping function results in a type-II maximum likelihood optimization with respect to the latent variables under the Bayesian learning framework. GPLVM usually captures nonlinearity in the data better and offers greater potential for interpreting low-dimensional latent variables; for example, this mechanism has been applied successfully in movement modeling and generation, see [21] and [22].

A similar latent variable model based on a kernel machine, the relevance units machine (RUM) rather than GPR, called RULVM, has been proposed recently in [17]. From the perspective of the representer theorem [23], RUM can be viewed as a sparse GPR that directly learns the sparse support/relevance/critical vectors from the training data instead of choosing them from the training data as in classical sparse GPR [24], [25] and the typical Bayesian sparse model, the relevance vector machine (RVM) [26]. Directly learning the support/relevance/critical vectors leads to superior sparsity and robustness against overfitting and underfitting. Additionally, for this kind of model the preimage problem can be handled easily, and LVM-based models sometimes outperform the spectral models significantly. On the other hand, in order to learn the latent variables from data sets with these models, we have to resort to time-consuming iterative optimization, such as the scaled conjugate gradient (SCG) or EM algorithms, meaning that LVM-based models are somewhat slower than the spectral approaches.

In this paper, motivated by the well-known GPLVM, we develop a new latent variable model, called the thin plate spline latent variable model (TPSLVM). From the perspective of an
LVM-based DR model, it can be seen as the LVM where the mapping from latent space to observation space is modeled by a thin plate spline (TPS) function [27]. Furthermore, based on the close link between TPS and GP (which will be briefly discussed in the next paragraph), it can also be interpreted under the framework of GPLVM with a fairly peculiar covariance function. However, there are several differences. First, TPSLVM directly takes advantage of the kernel machine representation [17] and imposes an interpretable prior over the mapping function rather than a prior over an implicit function space as in GPLVM. Second, we have an explicit mapping relation between the latent variables and the high-dimensional observation space, while the standard GPLVM does not offer this facility; hence the inference becomes much simpler.

The idea of TPSLVM stems from two motivations. First, TPS can be seen as a generalization of standard multivariate linear regression, where the parametric model is replaced by a suitable non-parametric function. Similarly, a GP can also be represented in terms of a number of basis functions, which is the essence of the representer theorem [23]. The connection between Kriging/GPR and TPS is discussed in [25], [28], and [29]. Although Kriging/GPR has close links with TPS and empirical comparisons have suggested that their predictive performance is often similar [30], Hutchinson and Gessler [30] cited many data sets in which splines and kriging predict almost equally well and suggested that splines, being automatic, may be more attractive than Kriging/GPR. Azari and Muller [31] and Wahba [32] also indicated that splines outperform Kriging/GPR in some circumstances. For example, [32] claimed that in some data sets with highly correlated noise, splines are superior to Kriging/GPR because the apparent spatial correlations may be regarded as local trends, and methods based on stationary stochastic processes cannot be right for some types of spatial structure [33]. However, other researchers demonstrated that Kriging/GPR outperforms TPS in certain scenarios [33]. Despite the debate, it is clear that Kriging/GPR and TPS can be equivalent in some cases [25]. Second, GPLVM is theoretically based on the LVM framework to carry out dimensionality reduction. Indeed, LVM is a powerful method in probabilistic settings: it relates additional latent variables to a set of observed variables through a set of parameters; the latent variables are then marginalized and the parameters are found by maximizing the likelihood. GPLVM instead marginalizes random functions (non-parametric mappings) and optimizes the latent variables [16]. Based on the same idea used in GPLVM, TPSLVM can be developed.

The main merit of the proposed TPSLVM is that it is more suitable than the well-known GPLVM for learning latent spaces of low dimensionality, especially when some kinds of shift and rotation are embedded in the high-dimensional observation data. Since one of the main purposes of DR algorithms is to visualize data in a low-dimensional space such as 2-D/3-D, TPSLVM will benefit this process. Early results of this research were reported at the IJCNN conference [34].

The rest of the paper is organized as follows. In Section II, we will briefly review GPLVM.
Section III will introduce TPSLVM, including its three extended forms. Then, real-world data sets will be used to verify and evaluate the performance of the newly proposed algorithm in Section IV. Finally, concluding remarks and comments will be given in Section V.

II. Gaussian Process Latent Variable Models

GPLVM, introduced in [16], is a nonlinear extension of the classical PPCA [7]. Before introducing our new TPSLVM, we first recall PPCA and GPLVM. Consider a given set of observed data that is already centered, i.e., has zero mean, Y = [y_1, \cdots, y_N]^T \subset R^D, where each datum y_n is an observed vector in the D-dimensional space R^D, and define the latent variables of Y to be X = [x_1, \cdots, x_N]^T, where x_n \in R^q is the corresponding latent point of y_n. Typically we have D \gg q. As mentioned earlier, PPCA simply assumes that each latent point x_n is related to its corresponding observed variable y_n through a linear mapping with some noise

y_n = W x_n + \epsilon_n    (1)

in which the linear mapping is parameterized by an unknown matrix W \in R^{D \times q} and \epsilon_n is a noise term assumed to follow a Gaussian distribution. The goal of this model is to learn the latent variables X as well as the mapping W. Under a probabilistic framework, there exist several different specifications over the parameters and variables in the basic model (1). The classical PPCA takes the easiest strategy by imposing a unit Gaussian prior over the latent variables X

P(X) = \prod_{n=1}^{N} \mathcal{N}(x_n \mid 0, I).    (2)

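As a quick illustration of the generative model (1) with the prior (2), the following is a small numpy sketch of ours (not code from the paper): it samples from the model and checks that the sample covariance of y approaches WW^T + \beta^{-1}I, the matrix C that appears in (3) below.

```python
import numpy as np

rng = np.random.default_rng(0)
D, q, N = 5, 2, 20000
W = rng.normal(size=(D, q))          # linear mapping in (1)
beta = 4.0                           # noise precision

X = rng.normal(size=(N, q))                                 # latent points drawn from the prior (2)
Y = X @ W.T + rng.normal(scale=beta ** -0.5, size=(N, D))   # observations generated via (1)

C_empirical = Y.T @ Y / N                    # sample covariance (the data are zero-mean by construction)
C_model = W @ W.T + np.eye(D) / beta         # covariance appearing in the marginal likelihood (3)
print(np.max(np.abs(C_empirical - C_model))) # small for large N
```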
To avoid simultaneously learning both X and W, PPCA marginalizes out the latent variables X based on their prior, giving rise to a marginal likelihood with respect to the unknown mapping matrix W:

P(Y \mid W, \beta) = \prod_{n=1}^{N} \mathcal{N}(y_n \mid 0, C)    (3)

with the covariance matrix given by C = WW^T + \beta^{-1} I. Then the unknown parameters W can be obtained by maximizing the marginal likelihood; it has been proved that there exists an analytic solution to this maximization [7]. Finally, once the mapping matrix W has been learned, it is easy to determine a backward mapping from the observed space to the latent space according to the posterior of X, and to obtain all the latent points with this mapping.

In the second strategy, the dual representation of PPCA (DPPCA) imposes a Gaussian prior on the mapping matrix W instead of on the latent variables X. Specifically, the prior over the mapping W is defined by

P(W) = \prod_{i=1}^{D} \mathcal{N}(w_i \mid 0, I)    (4)

where each parameter vector w_i is the i-th row of the matrix W, of dimension q; that is, the rows of W independently follow the same unit Gaussian of dimension q. Once again, marginalizing W out leads to the marginal likelihood with respect to the latent variables X, which can be optimized straightforwardly. Under the assumption of a Gaussian data likelihood, we get the marginal likelihood for Y as follows:

P(Y \mid X, \beta) = \prod_{i=1}^{D} \frac{1}{(2\pi)^{N/2} |K|^{1/2}} \exp\left( -\frac{1}{2} Y_{:,i}^T K^{-1} Y_{:,i} \right)    (5)

where the kernel matrix is K = XX^T + \beta^{-1} I and Y_{:,i} is the i-th column of Y. Interestingly, this likelihood can be considered as a product of D independent Gaussian processes, each associated with a different dimension of the data set, all using the linear kernel function. We can use any (nonlinear) kernel function in place of the linear one for the D independent Gaussian processes, resulting in the classical GPLVM. Finally, if q < D, a reduced dimensional representation of the data is obtained.

Since GPLVM performs very well on many real-world data sets, it has become popular in many applications [16], [21], [35]. Meanwhile, many extensions of this model have been developed to further improve performance. The first kind of extension attempts to incorporate prior knowledge on the latent variables. For instance, the Gaussian process dynamical model (GPDM) [36] allows dynamics to be modeled in the latent space. The back-constrained version of GPLVM was proposed in [37] to constrain the latent points to be a smooth function of the corresponding data points, which forces points that are close in data space to be close in latent space. Sparse coding [38] encourages sparsity of the latent representation. The Bayesian Gaussian process latent variable model [39] enables Bayesian training of GPLVM, which is robust to overfitting and is able to automatically infer the dimensionality of the latent space by using automatic relevance determination (ARD). Another type of extended GPLVM comprises models that incorporate constraints on the generated outputs [40], [41], which can directly encode prior knowledge with respect to specific problems. In addition to the GPLVM extensions in unsupervised settings, there are many supervised forms of GPLVM that make use of extra label information to further improve the performance of DR models [42], [43].

III. Thin Plate Spline Latent Variable Models

Motivated by GPLVM, we propose a novel LVM-based DR model, named the thin plate spline latent variable model (TPSLVM), which is especially suitable for data sets with low intrinsic dimensionality and with some kinds of shift and rotation. Many computer vision applications will benefit from this model. We will also find that TPSLVM can be viewed as GPLVM with a special kernel function, and the relation between GP and TPS will again be explained in the full Bayesian framework. This section is divided into six parts. First, the TPS model is briefly reviewed, and then we derive the TPSLVM model by relating the latent points to the corresponding observed data through the TPS model directly instead of using a linear model
as in DPPCA. Subsequently, the scaled conjugate gradient (SCG) algorithm is applied to learn the low-dimensional latent variables. The fourth subsection introduces two extended forms of TPSLVM, the fifth addresses the out-of-sample issue for TPSLVM, and a complexity analysis concludes the section.

A. Classical Thin Plate Spline Model

A TPS is a physically motivated 2-D interpolation scheme for arbitrarily spaced tabulated data (x_n; f(x_n)) with x_n \in R^2. These splines are the generalization of the natural cubic splines in 1-D [27]. The idea behind the thin plate spline is to choose a function f(x) minimizing the bending energy (see [44], the Wikipedia entry on thin plate splines at http://en.wikipedia.org/wiki/Thin_plate_spline, and the demo at http://elonen.iki.fi/code/tpsdemo/index.html for a geometrical interpretation)

J = \iint \left[ \left( \frac{\partial^2 f}{\partial x^2} \right)^2 + 2\left( \frac{\partial^2 f}{\partial x \partial y} \right)^2 + \left( \frac{\partial^2 f}{\partial y^2} \right)^2 \right] dx\, dy    (6)

and, for the more typical smoothing thin plate spline, the criterion is

J_S = \sum_{n=1}^{N} \| y_n - f(x_n) \|^2 + \lambda J    (7)

where (x_n, y_n) represents the n-th of the N samples and \lambda is the smoothness parameter. For this variational problem, it can be shown that there exists a unique minimizer [27] with a fixed weight parameter \lambda, which is presented later in (10). Since it has been proven that the functional J is invariant under shift and rotation of the coordinates [44], [45], TPS has several desirable properties, which are as follows.
1) It is globally smooth, separable into affine and non-affine components, and contains the least possible non-affine warping component to achieve the mapping [46].
2) It is well suited to low-dimensional smoothing problems, especially in modeling of 2-D and 3-D spatial data [29], [47].
3) It is invariant under shift and rotation of the coordinates [48], [49].
For these reasons, it has been widely used in image registration and matching applications in order to model deformation [50]–[52] and coordinate warping [50], [53] among different images. For example, image registration aims to choose two sets of corresponding coordinate information in two images, either from different imaging techniques or from the same imaging technique but different subjects. In other words, it maps the pixel information obtained from image A to a corresponding image B (R^2 \to R^2). The reason why TPS is employed is that it can match the corresponding coordinate information exactly while keeping the deformation energy of the entire image to a minimum [54].

Given a data set D = \{(x_1, y_1), \cdots, (x_N, y_N)\} = \{(x_1^1, x_1^2, y_1), \cdots, (x_N^1, x_N^2, y_N)\}, where x^1, x^2, y are scalar values, let us define

\eta(\gamma) = \begin{cases} \frac{1}{8\sqrt{\pi}} \gamma^2 \log \gamma & (\gamma > 0) \\ 0 & (\gamma = 0) \end{cases}    (8)

and

C = [c_1, \cdots, c_N] = \begin{bmatrix} 1 & 1 & \cdots & 1 \\ x_1^1 & x_2^1 & \cdots & x_N^1 \\ x_1^2 & x_2^2 & \cdots & x_N^2 \end{bmatrix}.    (9)

The TPS model is defined as follows:

y = s(x) = \sum_{i=1}^{N} a_i \eta(\|x - x_i\|) + \sum_{j=1}^{3} b_j \phi_j(x)    (10)

where x = (x^1, x^2), the basis functions are \phi_1(x^1, x^2) = 1, \phi_2(x^1, x^2) = x^1, and \phi_3(x^1, x^2) = x^2, and the parameters are a = [a_1, a_2, \cdots, a_N]^T and b = [b_1, b_2, b_3]^T. In order for s(x^1, x^2) to have square integrable second derivatives, we require that

\sum_{i=1}^{N} a_i = 0 \quad \text{and} \quad \sum_{i=1}^{N} a_i x_i^1 = \sum_{i=1}^{N} a_i x_i^2 = 0.    (11)

Thus, modeling a TPS is equivalent to learning the parameters a and b from the data. Let us further define E = [E_{ij}]_{i,j=1}^{N} with

E_{ij} = \eta(\|c_i - c_j\|) = \frac{1}{8\sqrt{\pi}} \|x_i - x_j\|^2 \log \|x_i - x_j\|

and then (10), together with the boundary conditions (11), can be rewritten as

\begin{bmatrix} E & C^T \\ C & 0 \end{bmatrix} \begin{bmatrix} a \\ b \end{bmatrix} = \begin{bmatrix} y \\ 0 \end{bmatrix}    (12)

with y = [y_1, y_2, \cdots, y_N]^T. Therefore, the parameters a and b can be found by solving this linear system.
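To make this concrete, here is a minimal numpy sketch of fitting a classical 2-D TPS by assembling and solving the block system in (12). The function names and the tiny random data set are ours for illustration, not from the authors' implementation; with lam = 0 the spline interpolates the data exactly, while a positive lam adds the usual smoothing relaxation in the spirit of (7).

```python
import numpy as np

def tps_basis(r):
    # eta(gamma) = gamma^2 log(gamma) / (8 sqrt(pi)), with eta(0) = 0, as in (8)
    out = np.zeros_like(r)
    nz = r > 0
    out[nz] = r[nz] ** 2 * np.log(r[nz]) / (8.0 * np.sqrt(np.pi))
    return out

def fit_tps(X, y, lam=0.0):
    """Fit y = s(x) for 2-D inputs X (N, 2) and targets y (N,) by solving (12)."""
    N = X.shape[0]
    dists = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    E = tps_basis(dists) + lam * np.eye(N)        # E_ij = eta(||x_i - x_j||)
    C = np.vstack([np.ones(N), X.T])              # rows [1; x^1; x^2], as in (9)
    lhs = np.block([[E, C.T], [C, np.zeros((3, 3))]])
    rhs = np.concatenate([y, np.zeros(3)])
    sol = np.linalg.solve(lhs, rhs)
    return sol[:N], sol[N:]                       # coefficients a and b

def tps_predict(Xnew, X, a, b):
    # s(x) = sum_i a_i eta(||x - x_i||) + b_1 + b_2 x^1 + b_3 x^2, as in (10)
    dists = np.linalg.norm(Xnew[:, None, :] - X[None, :, :], axis=-1)
    return tps_basis(dists) @ a + np.hstack([np.ones((Xnew.shape[0], 1)), Xnew]) @ b

# Tiny usage example on random data
rng = np.random.default_rng(0)
X = rng.normal(size=(30, 2))
y = np.sin(X[:, 0]) + 0.1 * rng.normal(size=30)
a, b = fit_tps(X, y)
print(tps_predict(X[:5], X, a, b))
```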

B. TPS-Based Latent Variable Models

In unsupervised DR settings, the observed data is of high dimension, and the goal of LVM-based DR models is to find the corresponding low-dimensional latent variables embedded in the high-dimensional observation data by modeling the mapping from the unknown latent space to the observed space, which is also unknown. Indeed, there exist infinitely many solutions for the mapping function and the latent variables. GPLVM is initially derived by assuming that the mapping from latent space to observation space is a linear function, and then the kernel trick is applied to allow the model to implicitly encode prior knowledge about the unknown mapping function. However, in this part we model the mapping function directly by a TPS, because of the special properties of TPS summarized in Section III-A. It is expected that TPSLVM, which learns the latent variables from the TPS model with unknown inputs, will be more suitable than the typical GPLVM for data sets whose intrinsic dimensionality is low, especially if some kind of translation and rotation is present.

First, since the output of an LVM is high-dimensional, we need to introduce the multitask extension of the TPS model. Following the same notation as in GPLVM, let the latent variables X = [x_1, \cdots, x_N] with x_i \in R^q be the latent counterparts of the (centered) observed data set Y = [y_1, \cdots, y_N], where each y_i is an observed vector in the D-dimensional space R^D. To introduce the multitask TPS model, we define some new notation as follows:

Y = [y_1, \cdots, y_N] = [y_{1,:}^T, \cdots, y_{D,:}^T]^T \subset R^{D \times N}
X = [x_1, \cdots, x_N] = [x_{1,:}^T, \cdots, x_{q,:}^T]^T \subset R^{q \times N}
E = [e_1, \cdots, e_N] \subset R^{N \times N} \text{ with } e_j = [\eta(\|x_j - x_1\|), \ldots, \eta(\|x_j - x_N\|)]^T
C = [c_1, \cdots, c_N] = [X; 1] \subset R^{(q+1) \times N}
A = [a_1, \cdots, a_N] = [a_{1,:}^T, \cdots, a_{D,:}^T]^T \subset R^{D \times N}
B = [b_1, \cdots, b_{q+1}] = [b_{1,:}^T, \cdots, b_{D,:}^T]^T \subset R^{D \times (q+1)}

where u_{k,:} denotes a row vector and the multidimensional basis function \eta(\gamma) is defined as [55]

\eta(\gamma) = \begin{cases} \dfrac{(-1)^{q/2+1}}{8\sqrt{\pi}\,(2 - q/2)!}\, \gamma^{4-q} \ln \gamma & \text{if } q = 2 \text{ or } 4 \\[2mm] \dfrac{\Gamma(q/2 - 2)}{16\pi^{q/2}}\, \gamma^{4-q} & \text{otherwise.} \end{cases}

Thus we can write the multitask TPS model as

s_i = s(x_i) = \sum_{j=1}^{N} a_j \eta(\|x_i - x_j\|) + \sum_{j=1}^{q+1} b_j \phi_j(x_i) = A e_i + B c_i.    (13)

Note that here the input x_i is a vector in the latent space with dimension q, rather than the two scalar coordinates used in the original TPS (10). Motivated by the same idea used in DPPCA and GPLVM [16], TPSLVM is developed by relating the latent variables to the corresponding observed data through the TPS model instead of the linear model in GPLVM. First, we define the likelihood of the data in the following way:

P(Y \mid X, A, B) = \prod_{i=1}^{N} \mathcal{N}(y_i \mid s_i, \beta^{-1} I)    (14)

where s_i = s(x_i) is the multitask TPS model (13) that maps the low-dimensional latent points to the corresponding high-dimensional observed samples. Similarly to what we have done in GPLVM, Gaussian priors are placed over the parameters A and B in the multitask TPS model as follows:

P(A) = \prod_{j=1}^{D} \mathcal{N}(a_{j,:} \mid 0, I) \quad \text{and} \quad P(B) = \prod_{j=1}^{D} \mathcal{N}(b_{j,:} \mid 0, I)    (15)

with a_{j,:} \in R^N and b_{j,:} \in R^{q+1}. Subsequently, we marginalize out the parameters A and B, giving rise to the marginal likelihood (refer to [16] for similar derivations in detail)

P(Y \mid X, \beta) = \int P(Y \mid X, A, B, \beta)\, P(A)\, P(B)\, dA\, dB = \frac{1}{(2\pi)^{DN/2} |K|^{D/2}} \exp\left( -\frac{1}{2} \mathrm{tr}\left( Y K^{-1} Y^T \right) \right)    (16)

with the covariance matrix defined by

K = \beta^{-1} I + E^T E + C^T C.    (17)

The marginal likelihood of TPSLVM in (16) looks like that of GPLVM/DPPCA in (5), except for a different form of the covariance matrix K. However, the
covariance matrix in TPSLVM separates the linear part C and the high order nonlinear part E explicitly. Since TPS is more powerful in low-dimensional space, it is expected that with this novel covariance function, TPSLVM will show better performance than GPLVM with typical covariance functions, such as RBF and MLP in data sets with low intrinsic dimensionality, which is very common in many computer vision applications.

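As a concrete illustration of (16) and (17), the following is a minimal numpy sketch of ours (not the authors' MATLAB code) that builds E and C from a candidate latent matrix X and evaluates the log marginal likelihood.

```python
import numpy as np

def tps_eta(r):
    # q = 2 case of the multidimensional TPS basis: eta(r) = r^2 log r / (8 sqrt(pi)), eta(0) = 0
    out = np.zeros_like(r)
    nz = r > 0
    out[nz] = r[nz] ** 2 * np.log(r[nz]) / (8.0 * np.sqrt(np.pi))
    return out

def tpslvm_covariance(X, beta):
    """K = beta^{-1} I + E^T E + C^T C for latent X of shape (q, N), as in (17)."""
    q, N = X.shape
    dists = np.linalg.norm(X[:, :, None] - X[:, None, :], axis=0)  # (N, N) pairwise distances
    E = tps_eta(dists)
    C = np.vstack([X, np.ones((1, N))])                            # C = [X; 1], shape (q+1, N)
    return np.eye(N) / beta + E.T @ E + C.T @ C

def log_marginal_likelihood(Y, X, beta):
    """log P(Y | X, beta) for observations Y of shape (D, N), as in (16)/(18)."""
    D, N = Y.shape
    K = tpslvm_covariance(X, beta)
    _, logdet = np.linalg.slogdet(K)
    quad = np.trace(Y @ np.linalg.solve(K, Y.T))   # tr(Y K^{-1} Y^T)
    return -0.5 * D * N * np.log(2 * np.pi) - 0.5 * D * logdet - 0.5 * quad

# Tiny smoke test with random data
rng = np.random.default_rng(1)
q, D, N = 2, 5, 40
X = rng.normal(size=(q, N))
Y = rng.normal(size=(D, N))
print(log_marginal_likelihood(Y, X, beta=1.0))
```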
C. Algorithm Derivation

Given the marginal likelihood, we can use the type-II maximum likelihood method to learn the latent variables X directly. This is equivalent to finding the maximum of L = \log P(Y \mid X, \beta) with respect to X by an optimization solver such as the SCG method. Now consider the log likelihood L = \log P(Y \mid X, \beta) of (16)

L = -\frac{DN}{2}\log(2\pi) - \frac{D}{2}\log|K| - \frac{1}{2}\mathrm{tr}\left( Y K^{-1} Y^T \right).    (18)

It is easy to compute the partial derivatives of L with respect to X. We first compute the gradient of the log likelihood L with respect to the covariance matrix K, and then combine it with \partial K / \partial X_{mn} to obtain the partial derivatives of L with respect to X through the chain rule. First of all, the gradient of L with respect to K is

\frac{\partial L}{\partial K} = \frac{1}{2}\left( K^{-1} Y^T Y K^{-1} - D K^{-1} \right)    (19)

and, for the specific basis function \eta used in TPSLVM in the case of q = 2, the gradient of K with respect to X_{mn} is (m = 1, \cdots, q;\; n = 1, \cdots, N)

\frac{\partial K}{\partial X_{mn}} = \frac{\partial E^T}{\partial X_{mn}} E + E^T \frac{\partial E}{\partial X_{mn}} + \frac{\partial C^T}{\partial X_{mn}} C + C^T \frac{\partial C}{\partial X_{mn}}    (20)

where \partial E / \partial X_{mn} is an N \times N matrix with all elements being zero except that its n-th row and n-th column are the vector

[2(X_{mn} - X_{m1})(\log\|x_n - x_1\| + 1), \cdots, 2(X_{mn} - X_{mN})(\log\|x_n - x_N\| + 1)]    (21)

and \partial C / \partial X_{mn} is a (q+1) \times N matrix with all elements being zero except that its (m, n)-element is 1. Thus, C^T \partial C / \partial X_{mn} is an N \times N matrix with all elements being zero except that its n-th column is the vector [X_{m1}, \cdots, X_{mN}]^T. Similarly, for the parameter \beta in the covariance matrix, the gradient can be written as

\frac{\partial L}{\partial \beta} = -\frac{1}{2} \mathrm{tr}\left( (K^{-1} Y^T Y K^{-1} - D K^{-1}) \beta^{-2} \right).    (22)

Note that we use the chain rule for derivatives of structured matrices [56] throughout the derivation

\frac{\partial t(f(X))}{\partial X_{mn}} = \frac{\partial t(E)}{\partial X_{mn}} = \mathrm{tr}\left[ \left( \frac{\partial t(E)}{\partial E} \right)^T \frac{\partial E}{\partial X_{mn}} \right].    (23)

To further constrain the latent variables X, we can place a Gaussian prior over X, leading to a MAP solution. Let us assume the prior over X is

P(X \mid \gamma) = \prod_{i=1}^{N} \mathcal{N}(x_i \mid 0, \gamma^{-1} I)    (24)

and then, by combining it with the marginal likelihood, we can evaluate the posterior distribution over X as follows:

P(X \mid Y, \beta, \gamma) \propto P(Y \mid X, \beta)\, P(X \mid \gamma).    (25)
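The derivation above can be sanity-checked numerically. The following small sketch of ours compares the analytic gradient (19) against central finite differences of the log likelihood (18) on random symmetric inputs.

```python
import numpy as np

def log_likelihood(K, Y):
    # L = -DN/2 log(2*pi) - D/2 log|K| - 1/2 tr(Y K^{-1} Y^T), as in (18); Y has shape (D, N)
    D, N = Y.shape
    _, logdet = np.linalg.slogdet(K)
    quad = np.trace(Y @ np.linalg.solve(K, Y.T))
    return -0.5 * D * N * np.log(2 * np.pi) - 0.5 * D * logdet - 0.5 * quad

def grad_wrt_K(K, Y):
    # dL/dK = 1/2 (K^{-1} Y^T Y K^{-1} - D K^{-1}), as in (19)
    D = Y.shape[0]
    Kinv = np.linalg.inv(K)
    return 0.5 * (Kinv @ Y.T @ Y @ Kinv - D * Kinv)

# Finite-difference check on a random symmetric positive definite K
rng = np.random.default_rng(0)
N, D = 6, 3
A = rng.normal(size=(N, N))
K = A @ A.T + N * np.eye(N)
Y = rng.normal(size=(D, N))
G = grad_wrt_K(K, Y)
eps = 1e-6
i, j = 1, 4
dK = np.zeros((N, N)); dK[i, j] = eps
numeric = (log_likelihood(K + dK, Y) - log_likelihood(K - dK, Y)) / (2 * eps)
print(numeric, G[i, j])   # the two values should agree closely
```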

Note that in the process of adding constraints/priors, we do not have to marginalize the latent variables X; the prior can therefore be seen as adding a regularizer to TPSLVM, and it also opens the door to utilizing other constraints, such as dynamics and back constraints, to further improve the performance of TPSLVM (see Section III-D). In our experiments, we learn the latent variables of the original TPSLVM so as to maximize the posterior distribution over the latent variables X by using the SCG algorithm. The code is based on Neil D. Lawrence's MATLAB package Optimi (http://ml.sheffield.ac.uk/~neil/fgplvm).

D. Extensions

To further incorporate prior knowledge about the latent manifold, we can impose an application-specific prior to assist learning latent embeddings in TPSLVM. As examples, we investigate two TPSLVM extensions, TPSLVM with dynamics (TPSLVM-DM) and back-constrained TPSLVM (BC-TPSLVM). When a dynamic structure is added to the latent space, TPSLVM is able to handle time series data. In addition, with back constraints TPSLVM can force points that are close in data space to be close in latent space. Finally, we can also use the two extensions simultaneously to fit a data set effectively. We will verify the two extensions in the experiments section.

As a general rule, any kind of constraint can be imposed on the latent variables X through an appropriate prior that is practically meaningful. If we do not marginalize the latent variables X, the introduction of priors over the latent variables results in several significant extended TPSLVMs. We have seen the first example in (24) and (25), where we impose a Gaussian prior over X; optimizing the latent variables X in this model gives rise to the MAP solution. Alternatively, since we do not have to marginalize the latent variables X when solving the original TPSLVM, we can also add dynamics to the latent space simply by imposing a special sequential prior on X instead of the Gaussian distribution, leading to TPSLVM with dynamics (TPSLVM-DM). In principle, the idea of TPSLVM-DM is identical to that of GPLVM with dynamics (GPLVM-DM) recently developed in [36], except that different kernels are used. GPLVM-DM adds dynamics to GPLVM by assuming that the data are presented in temporal order. Initially, a Markov chain distribution over the latent space is defined by P(x_n \mid x_{n-1}), which results in a
prior distribution

P(X) = P(x_1) \prod_{n=2}^{N} P(x_n \mid x_{n-1}).    (26)

Of course, the marginal likelihood P(Y) can be obtained by combining this prior with P(Y \mid X), but this is generally intractable. However, the MAP estimate can be obtained from this combination. Furthermore, rather than the simple Markov chain, [36] suggests using a GP to relate x_n to x_{n-1}. With the GP prediction at each time step, the joint log-likelihood over the latent variables X and the observations Y is given by

\log P(Y, X) = -\frac{DN}{2}\log 2\pi - \frac{D}{2}\log|K| - \frac{1}{2}\mathrm{tr}\left( K^{-1} Y^T Y \right) - \frac{qN}{2}\log 2\pi - \frac{q}{2}\log|K_x| - \frac{1}{2}\mathrm{tr}\left( K_x^{-1} (\hat{X} - \tilde{X})(\hat{X} - \tilde{X})^T \right)    (27)

where \hat{X} = [x_2, \cdots, x_N]^T and \tilde{X} = [x_1, \cdots, x_{N-1}]^T. The kernel K_x is the dynamics covariance matrix associated with the dynamics GP and is constructed on the matrix \tilde{X}. Thus, learning the dynamic model from the observations Y is equivalent to minimizing the negative log posterior -\log P(X \mid Y) \propto -\log P(Y, X) with respect to the latent variables X.

Furthermore, although our original TPSLVM provides a smooth mapping from latent space to the observation space, it only guarantees that samples close in latent space will be close in data space. However, as with the analogous problem in GPLVM, points close in data space may not be close in latent space. In order to address this issue, inspired by BC-GPLVM [37], we propose the use of back constraints in TPSLVM (BC-TPSLVM). The idea behind this kind of model is to constrain the latent points to be a smooth function of the corresponding data points, which forces points that are close in data space to be close in latent space. The back-constraint smooth function can be written as

x_{mn} = f_m(y_n, \alpha)    (28)

where \alpha are the parameters of the smooth function. Typically, we can use a linear model, a kernel based regression (KBR) model, or a multilayer perceptron (MLP) model to represent this function. As the function f_m is fully parameterized in terms of a set of new parameters \alpha, the learning process becomes maximizing the likelihood (5) with respect to the latent variables X and the parameters \alpha. With the same kinds of back-constraint functions, BC-TPSLVM can also be developed; it involves computing \partial L / \partial \alpha via the chain rule. Finally, we can also combine the back-constraint and dynamics priors to obtain the back-constrained TPSLVM with dynamics (BC-TPSLVM-DM).

E. Learning Latent Variables for Testing Data

It is straightforward to learn the low-dimensional latent variables for the training data in TPSLVM. However, in the application of DR techniques, it is more important to find a low-dimensional representation x_* \in R^q for a given new observation y_* \in R^D. This is the so-called out-of-sample problem.

There are two ways to address this issue for TPSLVM. First, recall that in the previous subsection BC-TPSLVM was introduced, where the mapping from the high-dimensional observation space to the latent space is explicitly modeled by (28); so, given a new testing observation y_*, its low-dimensional latent representation can be evaluated by (28) straightforwardly. This is a simple approach to handling the out-of-sample problem. Alternatively, we can take advantage of the fact that TPSLVM is probabilistic, without making use of the back constraints. As the mapping from the latent space to the observation space is modeled by GPR with the TPS kernel, for a new observation y_* the posterior distribution P(y_* \mid x_*, X, Y) can be written directly in terms of the GP predictive distribution

P(y_* \mid x_*, X, Y) = \mathcal{N}\left( y_* \mid Y^T K^{-1} k_{X,*},\; (k_{**} - k_{X,*}^T K^{-1} k_{X,*}) I \right)    (29)

where k_{X,*} is a column vector computed by the kernel function between the learned latent variables X and the new testing point x_*. Following ideas discussed in [16], it is better to first initialize the corresponding latent representation x_* by PCA or by the more effective approximation \{x_* = x_i\} = \arg\max_i \log p(y_i, y_*), i = 1, \ldots, N. Based on this testing method, the training and testing procedure for TPSLVM is shown in Algorithm 1.

Algorithm 1 Training and Testing TPSLVM
Input: high-dimensional training inputs Y \subset R^{D \times N}, a prefixed latent dimensionality p, and a high-dimensional testing datum y_* \in R^D.
Output: s = \{X, x_*\}.
1: Initialize the latent variables X = PPCA(Y, p) and the kernel parameter \beta = 1;
2: Optimize \{X, \beta\} = \arg\max_{X, \beta} \log P(Y \mid X, \beta);
3: Initialize the latent variable for the testing datum: \{x_* = x_i\} = \arg\max_i \log p(y_i, y_*), i = 1, \ldots, N;
4: Optimize \{x_*\} = \arg\max_{x_*} \log p(y_* \mid x_*, X, Y);
5: return X, x_*

F. Complexity Analysis

Like the original GPLVM, the proposed TPSLVM also requires an inverse of the covariance matrix, costing O(N^3) time in each gradient evaluation step, which renders the approach impractical when the number of training data N is large, so we have to find an efficient approximation to handle large data sets. We note that the sparse approximations developed for GPLVM [24], [57] can be adapted to TPSLVM to further reduce the computational complexity. Developing a more practical algorithm will be our future work.
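For orientation, a compact sketch of steps 1 and 2 of Algorithm 1 (PCA-style initialization followed by maximizing the log posterior) could look as follows. This is our own illustration in Python/scipy with numerical gradients, not the authors' MATLAB/SCG implementation, and the helper names are made up; a practical version would use the analytic gradients of Section III-C.

```python
import numpy as np
from scipy.optimize import minimize

def tps_eta(r):
    out = np.zeros_like(r)
    nz = r > 0
    out[nz] = r[nz] ** 2 * np.log(r[nz]) / (8.0 * np.sqrt(np.pi))
    return out

def neg_log_posterior(params, Y, q, gamma=1e-2):
    # params packs the latent matrix X (q x N) and log(beta)
    D, N = Y.shape
    X = params[:q * N].reshape(q, N)
    beta = np.exp(params[-1])
    dists = np.linalg.norm(X[:, :, None] - X[:, None, :], axis=0)
    E = tps_eta(dists)
    C = np.vstack([X, np.ones((1, N))])
    K = np.eye(N) / beta + E.T @ E + C.T @ C                 # covariance (17)
    _, logdet = np.linalg.slogdet(K)
    ll = (-0.5 * D * N * np.log(2 * np.pi) - 0.5 * D * logdet
          - 0.5 * np.trace(Y @ np.linalg.solve(K, Y.T)))     # log likelihood (16)/(18)
    log_prior = -0.5 * gamma * np.sum(X ** 2)                # Gaussian prior (24), up to a constant
    return -(ll + log_prior)

def train_tpslvm(Y, q=2, iters=200):
    # Step 1: initialize X from the leading principal directions of the centered data
    D, N = Y.shape
    Yc = Y - Y.mean(axis=1, keepdims=True)
    U, S, Vt = np.linalg.svd(Yc, full_matrices=False)
    X0 = Vt[:q] * S[:q, None] / np.sqrt(N)
    params0 = np.concatenate([X0.ravel(), [0.0]])            # log(beta) = 0
    # Step 2: maximize the log posterior (quasi-Newton solver, finite-difference gradients)
    res = minimize(neg_log_posterior, params0, args=(Yc, q),
                   method="L-BFGS-B", options={"maxiter": iters})
    X = res.x[:q * N].reshape(q, N)
    beta = np.exp(res.x[-1])
    return X, beta

# Usage on toy data
rng = np.random.default_rng(2)
Y = rng.normal(size=(10, 30))
X, beta = train_tpslvm(Y, q=2, iters=50)
print(X.shape, beta)
```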

IV. Experiments

In this section, we compare the proposed TPSLVM (including its extensions) with the classic spectral approaches PCA, Sammon mapping, ISOMAP, LLE, and KPCA, plus the state-of-the-art LVM-based model GPLVM, to illustrate the performance of TPSLVM on nine data sets. In order to assess the performance of these models in visualizing high-dimensional data sets, we perform dimensionality reduction to a 2-D latent space for visualization. Moreover, the nearest neighbor classification error is measured in the low-dimensional latent space to objectively evaluate the quality of the visualization. More importantly, three computer vision applications are employed to test our conclusion that TPSLVM is more suitable than GPLVM with typical kernel functions for data sets whose latent space is of low dimension and which exhibit some kind of translation and rotation. Finally, in order to further compare the performance of the proposed TPSLVM with GPLVM, we learn the DR model from training data and then map the testing data to the 2-D latent space to perform kNN classification on three data sets. The experimental results verify that the proposed TPSLVM is an efficient LVM-based DR model and outperforms GPLVM when the dimensionality of the latent space is low.

For a fair comparison, we run 1000 iterations for all the models; the kernel used for KPCA is RBF, and the kernels for GPLVM are RBF, MLP, and Poly from Neil D. Lawrence's MATLAB package Kern, among which the best kernel function is chosen manually. Since the kernel parameter for KPCA also needs to be selected manually, we varied it between 0.3 and 3.75. For ISOMAP and LLE, we choose the number of neighbors from 2 to 20 optimally. The initializations of the latent variables X in both TPSLVM and GPLVM are X = PPCA(Y), using the code in Neil D. Lawrence's FGPLVM package. The code for GPLVM and the spectral approaches is based on Neil D. Lawrence's MATLAB package FGPLVM (http://ml.sheffield.ac.uk/~neil/fgplvm) and Laurens van der Maaten's MATLAB Toolbox for Dimensionality Reduction [1], respectively. Note that we only apply the spectral approaches PCA, Sammon mapping, and KPCA to the first three data sets, as both ISOMAP and LLE as implemented in van der Maaten's package reduce the number of data points in order to avoid numerical instability, so we cannot compare the nearest neighbor classification error fairly. Subsequently, we compare TPSLVM with all five classic spectral approaches, PCA, Sammon mapping, ISOMAP, LLE, and KPCA, and with GPLVM on three real-world computer vision tasks, which illustrate the experimental results more clearly. Moreover, we verify the extended TPSLVMs, BC-TPSLVM and TPSLVM-DM, on the next two data sets. Finally, three data sets are used to further test the performance of TPSLVM when the dimensionality of the latent space is low.

Fig. 1. Oil flow data set visualized by PCA, KPCA, GPLVM (with RBF kernel) and TPSLVM, respectively. (a) PCA. (b) KPCA. (c) GPLVM. (d) TPSLVM.

A. Oil Flow Data

The oil flow data set [11] consists of 12-dimensional measurements of oil flow within a pipeline. There are three phases of flow associated with the data and 1000 samples in the data set. For all seven models, we use 600 samples (200 points from each class) to learn the corresponding 2-D latent data for the purpose of data visualization. For GPLVM, the RBF kernel is superior to the MLP and Poly kernels. As can be seen from Fig. 1, the LVM-based DR approaches are more effective than the spectral models, and TPSLVM is remarkably superior to GPLVM because TPSLVM brings the latent points that belong to the same class in the original feature space much closer together than GPLVM does.

Fig. 2. Iris data set visualized by (a) KPCA, (b) GPLVM (with RBF kernel), and (c) TPSLVM, and Swiss roll data set visualized by (d) Sammon mapping, (e) GPLVM (with MLP kernel), and (f) TPSLVM.


TABLE I Nearest Neighbor Classification Error

B. Iris Data

The Iris data set [58] contains three classes of 50 instances each, where each class refers to a type of iris plant. There are four features for each instance. All 150 data points are utilized to learn the 2-D latent space. GPLVM with the RBF kernel achieves better results than with the other two kernels. Fig. 2(a)-(c) leads to the same conclusion as stated for the oil flow data set.

C. Swiss Roll Data

The Swiss roll data set [5] is often used to illustrate DR algorithms. It is a 3-D data set with an intrinsic dimension of two. Here we use 400 samples, created by randomly sampling from a Gaussian mixture model with four different means, to visualize the data in 2-D space. For GPLVM, the optimal kernel function is MLP instead of RBF. What can be seen from Fig. 2(d)-(f) is that the developed TPSLVM outperforms all other models significantly on this data set.

Furthermore, to objectively evaluate the quality of visualization, for each of the data sets mentioned above, we compare the class of each data point with the class label of its nearest neighbor in the 2-D latent space supplied by each method. Table I tells us that TPSLVM outperforms GPLVM with the optimal kernel function among the RBF, MLP, and Poly kernels, as well as the classic spectral approaches, especially for the Swiss roll data, while TPSLVM gives slightly more errors than GPLVM and KPCA on the Iris data set. Note that Sammon mapping fails to learn latent variables for the Iris data set.

In order to illustrate the experimental results more clearly, we further compare TPSLVM and its extensions with GPLVM and the five spectral models on the following three real-world computer vision tasks. In order to verify our conclusion that TPSLVM is more suitable than GPLVM with typical kernel functions for latent spaces of low dimensionality, and especially for data sets exhibiting some kind of translation and rotation, the COIL-20 data set [59], the UMIST Face data set [60], and the Teapots data set [61] are chosen, because TPS is a model well suited to low-dimensional problems and is rotationally invariant.

D. COIL-20 Data

In this experiment we use the duck object from the COIL-20 data set, with different views taken along a viewing circle. Therefore, the intrinsic dimension of this set is one. The duck object contains 71 views, and each view is a 128 × 128 grayscale image. All images are viewed as 16384-dimensional vectors for the seven DR models, and the latent dimension is 2 as well. As can be seen from Fig. 3, the spectral approaches are superior to the LVM-based models. However, the result from TPSLVM is better than that from GPLVM because the manifold curves constructed by TPSLVM are smoother than those from GPLVM, which demonstrates that the proposed TPSLVM is more effective than GPLVM especially when the intrinsic dimensionality of the latent space is very low.

Fig. 3. Coil data set visualized by ISOMAP, LLE, GPLVM (with RBF kernel), and TPSLVM, respectively. (a) ISOMAP. (b) LLE. (c) GPLVM. (d) TPSLVM.

E. UMIST Face Data

The UMIST face data set [60] consists of 564 face images of individuals (mixed race/gender/appearance) with a resolution of 112 × 92. Each person is shown in a range of poses from profile to frontal views. Thus the intrinsic dimension for each subject is also one. Similarly, each image is considered as a vector with dimensionality 10304, and the 2-D latent spaces are learned by TPSLVM, GPLVM and the spectral methods (LLE is chosen since it performs better than PCA, ISOMAP, Sammon mapping, and KPCA in this experiment) for comparison. Fig. 4 shows that TPSLVM outperforms both GPLVM and the spectral methods remarkably, especially when different kinds of subjects are present. This can be seen from the visualizations for the first two subsets, consisting of the images of the first one and first two subjects, respectively, where TPSLVM and LLE successfully discover the rotation in the 2-D space. In contrast, GPLVM fails to discover this intrinsic property of the data set. However, when multiple subjects are used, TPSLVM is still able to discriminate different subjects, while LLE and GPLVM cannot separate each person clearly. Again this argument is verified on the last subset, with the faces of the first 6 subjects and 162 images.

Fig. 4. Visualization comparison for the UMIST Face data set by different DR algorithms with different data set settings. The best kernel function for GPLVM in this data set is RBF. (a) LLE with 1 face and 38 images. (b) LLE with 2 faces and 63 images. (c) LLE with 6 faces and 162 images. (d) GPLVM with 1 face and 38 images. (e) GPLVM with 2 faces and 63 images. (f) GPLVM with 6 faces and 162 images. (g) TPSLVM with 1 face and 38 images. (h) TPSLVM with 2 faces and 63 images. (i) TPSLVM with 6 faces and 162 images.

F. Teapots Data

The teapots data set includes 400 RGB color images at a resolution of 76 × 101 taken by viewing 360 degrees of a rotating teapot [61]. Similarly, the intrinsic dimension of this set is
one as well. In this experiment, we use 100 images to compare the TPSLVM and GPLVM families and the other five spectral DR models. Each image is treated as a 23028-dimensional vector, and the dimensionality is reduced to two for visualization. For the GPLVM-DM and TPSLVM-DM models, the GP dynamics is applied in the latent space with the RBF plus white noise kernel. Moreover, the back-constraint function used in BC-GPLVM-DM and BC-TPSLVM-DM is KBR. It can be seen from Fig. 5 that, as in the results for the COIL-20 data set, the spectral models ISOMAP and LLE [Fig. 5(a) and (b)] can find the embedded 2-D manifold exactly. In contrast, neither the original GPLVM nor the original TPSLVM [Fig. 5(c) and (f)] obtains the intrinsic manifold precisely, although the manifold curve from the original TPSLVM is smoother than that from the original GPLVM. Further, we tested TPSLVM-DM and BC-TPSLVM-DM on this data set. Fig. 5(d) and (e) and Fig. 5(g) and (h) show that the extended TPSLVMs significantly improve the performance of the original TPSLVM and successfully find the rotation embedded in the data set, especially in the case when we use both the back-constraint and dynamics conditions for
TPSLVM. By contrast, the extended GPLVMs fail to find the intrinsic manifold.

Fig. 5. Visualization comparison between different DR algorithms with structural priors. The best kernel function for GPLVM in this data set is Poly. (a) ISOMAP. (b) LLE. (c) BC-GPLVM. (d) GPLVM-DM. (e) BC-GPLVM-DM. (f) BC-TPSLVM. (g) TPSLVM-DM. (h) BC-TPSLVM-DM.

G. Vowel Data

For a fair comparison, in this experiment we verify BC-TPSLVM on the single-speaker vowel data set [62], which was used to illustrate the performance of BC-GPLVM. This data set includes the cepstral coefficients and deltas of ten distinct vowel phonemes and was acquired as part of a vocal joystick system [62]. The main reason why we use this data set is that the original GPLVM and TPSLVM without back constraints would fail to separate the data; the non-back-constrained models tend to fragment the distinct vowels, whereas the back-constrained models keep alike vowels closer together. Throughout this experiment, 900 points are randomly selected from 2700 data points with nine categories, where 100 samples are picked for each category. The back-constraint smooth function for both BC-TPSLVM and BC-GPLVM is MLP.

Fig. 6 depicts the visualization results on the vowel data using the different methods, among which ISOMAP and LLE show good performance, whereas PCA, Sammon mapping, and KPCA fail to find the low-dimensional structure embedded in this data set. The LVM-based models GPLVM and TPSLVM perform badly if no back constraint is applied. However, when the back constraint is added to GPLVM and TPSLVM, their performance improves significantly. We can also find that BC-TPSLVM separates the classes in this data set more clearly than BC-GPLVM.

Fig. 6. Visualization results for vowel data set under different DR algorithms and schemes. The best kernel function for GPLVM in this data set is MLP. (a) ISOMAP. (b) GPLVM. (c) BC-GPLVM. (d) LLE. (e) TPSLVM. (f) BC-TPSLVM.

H. Other Data Sets

In order to further compare the performance of the proposed TPSLVM with GPLVM, we learn the DR model from training data and then map the testing data to the 2-D latent space to perform kNN classification. Higher classification accuracy rates indicate better performance of the DR models.

TABLE II Classification Accuracy Rates in 2-D Latent Space

First, the wine data set [58], consisting of three classes with 178 instances in R^{13}, is tested. 90 data points (30 for each class) are picked randomly from the data set as training data and the remaining 88 data points are used as testing data. Another data set is the USPS data [63], consisting of grayscale handwritten digit images from the U.S. Postal Service. The images are of size 16 × 16, with pixel values in the range zero to one. In this experiment, we try to distinguish the digits 0, 1, 2, 3, and 4 with 250 training samples (50 for each class), and then compare TPSLVM with GPLVM on 2308 testing samples in terms of the classification errors. Finally, the oil flow data is reused, with 300 randomly chosen training data points and 700 testing data points. Table II shows the comparison of the classification accuracy rates. It can be seen that TPSLVM outperforms GPLVM, which verifies that TPSLVM is more powerful than GPLVM in a low-dimensional latent space.
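For reference, the latent-space kNN evaluation used in this section (and the 1-NN error of Table I) amounts to the following; this is our own sketch with illustrative variable names, not the script used to produce the tables.

```python
import numpy as np

def knn_predict(Z_train, y_train, Z_test, k=1):
    # Classify each test point by majority vote among its k nearest
    # training points in the low-dimensional latent space.
    d = np.linalg.norm(Z_test[:, None, :] - Z_train[None, :, :], axis=-1)
    idx = np.argsort(d, axis=1)[:, :k]
    votes = y_train[idx]
    return np.array([np.bincount(v).argmax() for v in votes])

def accuracy(Z_train, y_train, Z_test, y_test, k=1):
    return float(np.mean(knn_predict(Z_train, y_train, Z_test, k) == y_test))

# Toy usage: 2-D latent points with integer class labels
rng = np.random.default_rng(3)
Z_train = rng.normal(size=(90, 2)); y_train = rng.integers(0, 3, size=90)
Z_test = rng.normal(size=(40, 2)); y_test = rng.integers(0, 3, size=40)
print(accuracy(Z_train, y_train, Z_test, y_test, k=1))
```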


V. Conclusion and Future Work In this paper, we developed a novel latent variable model based on the thin plate splines, termed TPSLVM. It has strong association with the so-called GPLVM. We conclude that the proposed TPSLVM can be interpreted under the framework of GPLVM with a fairly peculiar covariance function. TPSLVM can be seen as using TPS to non-linearly model the mapping from the (unknown) low-dimensional latent space to highdimensional observation space. The property of shift and rotation invariance of TPS empowers TPSLVM this property so that it is able to exactly model the intrinsic shift and rotation structures embedded in high-dimensional observation. Besides, compared to the well-known GPLVM, TPSLVM is better suited to low-dimensional smoothing problems, such as data visualization in 2-D/3-D space. In addition, we also extended TPSLVM to BC-TPSLVM and TPSLVM-DM. The experimental results show that the proposed TPSLVM and its extensions outperform GPLVM significantly. For the further work, there are two possible directions that are worth trying. Firstly, inspired by our recent work in supervised DR [43], the supervised extension of TPSLVM can be developed to further improve the performance of the novel DR model. Secondly, based on the properties of invariance under shift and rotation TPSLVM and its extensions can be applied in computer vision tasks, such as face recognition and modeling human motions, etc. References [1] L. van der Maaten, E. Postma, and H. van den Herik, “Dimensionality reduction: A comparative review,” Tilburg Univ., Tilburg, The Netherlands, Tech. Rep., 2009. [2] X. Li and L. Shu, “Kernel based nonlinear dimensionality reduction for microarray gene expression data analysis,” Expert Syst. Appl., vol. 36, no. 4, pp. 7644–7650, May 2009. [3] G. Rosman, M. M. Bronstein, A. M. Bronstein, and R. Kimmel, “Nonlinear dimensionality reduction by topologically constrained isometric embedding,” Int. J. Comput. Vision, vol. 89, no. 1, pp. 56–68, 2010. [4] I. Rish, G. Grabarnik, G. Cecchi, F. Pereira, and G. J. Gordon, “Closedform supervised dimensionality reduction with generalized linear models,” in Proc. ICML, 2008, pp. 832–839. [5] J. B. Tenenbaum, V. de Silva, and J. C. Langford, “A global geometric framework for nonlinear dimensionality reduction,” Science, vol. 290, pp. 2319–2323, Dec. 2000. [6] S. T. Roweis and L. K. Saul, “Nonlinear dimensionality reduction by locally linear embedding,” Science, vol. 290, pp. 2323–2326, Dec. 2000. [7] M. E. Tipping and C. M. Bishop, “Probabilistic principal component analysis,” J. Roy. Statist. Soc., Ser. B, vol. 61, no. 3, pp. 611–622, 1999. [8] S. Mika, G. Ratsch, J. Weston, B. Scholkopf, and K.-R. Mullers, “Fisher discriminant analysis with kernels,” in Proc. IEEE Signal Process. Soc. Workshop, 1999, pp. 41–48. [9] K. V. Mardia, J. T. Kent, and J. M. Bibby, Multivariate Analysis. London, U.K.: Academic Press, 1979. [10] J. W. J. Sammon, “A nonlinear mapping for data structure analysis,” IEEE Trans. Comput., vol. C-18, no. 5, pp. 401–409, May 1969. [11] B. Scholkopf, A. Smola, and K. R. Muller, “Nonlinear component analysis as a kernel eigenvalue problem,” Neur. Comput., vol. 10, no. 5, pp. 1299–1319, 1998. [12] X. He and P. Niyogi, “Locality preserving projections,” in Proc. NIPS, 2003. [13] S. Xiang, F. Nie, C. Zhang, and C. Zhang, “Nonlinear dimensionality reduction with local spline embedding,” IEEE Trans. Knowl. Data Eng., vol. 21, no. 9, pp. 1285–1298, Sep. 2009. [14] F. Nie, D. Xu, I. W.-H. Tsang, and C. 
References

[1] L. van der Maaten, E. Postma, and H. van den Herik, “Dimensionality reduction: A comparative review,” Tilburg Univ., Tilburg, The Netherlands, Tech. Rep., 2009.
[2] X. Li and L. Shu, “Kernel based nonlinear dimensionality reduction for microarray gene expression data analysis,” Expert Syst. Appl., vol. 36, no. 4, pp. 7644–7650, May 2009.
[3] G. Rosman, M. M. Bronstein, A. M. Bronstein, and R. Kimmel, “Nonlinear dimensionality reduction by topologically constrained isometric embedding,” Int. J. Comput. Vision, vol. 89, no. 1, pp. 56–68, 2010.
[4] I. Rish, G. Grabarnik, G. Cecchi, F. Pereira, and G. J. Gordon, “Closed-form supervised dimensionality reduction with generalized linear models,” in Proc. ICML, 2008, pp. 832–839.
[5] J. B. Tenenbaum, V. de Silva, and J. C. Langford, “A global geometric framework for nonlinear dimensionality reduction,” Science, vol. 290, pp. 2319–2323, Dec. 2000.
[6] S. T. Roweis and L. K. Saul, “Nonlinear dimensionality reduction by locally linear embedding,” Science, vol. 290, pp. 2323–2326, Dec. 2000.
[7] M. E. Tipping and C. M. Bishop, “Probabilistic principal component analysis,” J. Roy. Statist. Soc., Ser. B, vol. 61, no. 3, pp. 611–622, 1999.
[8] S. Mika, G. Ratsch, J. Weston, B. Scholkopf, and K.-R. Mullers, “Fisher discriminant analysis with kernels,” in Proc. IEEE Signal Process. Soc. Workshop, 1999, pp. 41–48.
[9] K. V. Mardia, J. T. Kent, and J. M. Bibby, Multivariate Analysis. London, U.K.: Academic Press, 1979.
[10] J. W. Sammon, Jr., “A nonlinear mapping for data structure analysis,” IEEE Trans. Comput., vol. C-18, no. 5, pp. 401–409, May 1969.
[11] B. Scholkopf, A. Smola, and K.-R. Muller, “Nonlinear component analysis as a kernel eigenvalue problem,” Neural Comput., vol. 10, no. 5, pp. 1299–1319, 1998.
[12] X. He and P. Niyogi, “Locality preserving projections,” in Proc. NIPS, 2003.
[13] S. Xiang, F. Nie, C. Zhang, and C. Zhang, “Nonlinear dimensionality reduction with local spline embedding,” IEEE Trans. Knowl. Data Eng., vol. 21, no. 9, pp. 1285–1298, Sep. 2009.
[14] F. Nie, D. Xu, I. W.-H. Tsang, and C. Zhang, “Flexible manifold embedding: A framework for semi-supervised and unsupervised dimension reduction,” IEEE Trans. Image Process., vol. 19, no. 7, pp. 1921–1932, Jul. 2010.

[15] D. Luo, C. Ding, F. Nie, and H. Huang, “Cauchy graph embedding,” in Proc. ICML, 2011.
[16] N. Lawrence, “Probabilistic non-linear principal component analysis with Gaussian process latent variable models,” J. Mach. Learn. Res., vol. 6, pp. 1783–1816, Dec. 2005.
[17] J. Gao, J. Zhang, and D. Tien, “Relevance units latent variable model and nonlinear dimensionality reduction,” IEEE Trans. Neural Netw., vol. 21, no. 1, pp. 123–135, Jan. 2010.
[18] Y. Bengio, J. F. Paiement, P. Vincent, O. Delalleau, N. L. Roux, and M. Ouimet, “Out-of-sample extensions for LLE, Isomap, MDS, Eigenmaps, and spectral clustering,” in Proc. Adv. NIPS, 2003.
[19] S. Xiang, F. Nie, Y. Song, C. Zhang, and C. Zhang, “Embedding new data points for manifold learning via coordinate propagation,” Knowl. Inf. Syst., vol. 19, no. 2, pp. 159–184, 2009.
[20] P. Arias, G. Randall, and G. Sapiro, “Connecting the out-of-sample and pre-image problems in kernel methods,” in Proc. IEEE Conf. CVPR, 2007.
[21] C. H. Ek, P. H. S. Torr, and N. D. Lawrence, “Gaussian process latent variable models for human pose estimation,” in Proc. Int. Conf. Mach. Learn. Multimodal Interact., 2007, pp. 132–143.
[22] N. D. Lawrence and A. J. Moore, “Hierarchical Gaussian process latent variable models,” in Proc. ICML, 2007, pp. 481–488.
[23] B. Schölkopf and A. J. Smola. (2002). Learning with Kernels [Online]. Available: http://www.learning-with-kernels.org
[24] E. Snelson and Z. Ghahramani, “Sparse Gaussian processes using pseudo-inputs,” in Proc. Adv. NIPS, 2006, pp. 1257–1264.
[25] C. E. Rasmussen and C. K. I. Williams, Gaussian Processes for Machine Learning. Cambridge, MA, USA: MIT Press, 2006.
[26] M. E. Tipping, “Sparse Bayesian learning and the relevance vector machine,” J. Mach. Learn. Res., vol. 1, pp. 211–244, Jun. 2001.
[27] G. Wahba, Spline Models for Observational Data. Philadelphia, PA, USA: SIAM, 1990.
[28] N. A. C. Cressie, Statistics for Spatial Data. New York, NY, USA: Wiley, 1993.
[29] D. W. Nychka, “Spatial-process estimates as smoothers,” in Smoothing and Regression. New York, NY, USA: Wiley, 2000, pp. 393–424.
[30] M. Hutchinson and P. Gessler, “Splines—More than just a smooth interpolator,” Geoderma, vol. 62, pp. 45–67, Mar. 1994.
[31] A. S. Azari and H.-G. Muller, “Preaveraged localized orthogonal polynomial estimators for surface smoothing and partial differentiation,” J. Amer. Statist. Assoc., vol. 87, no. 420, pp. 1005–1017, 1992.
[32] G. Wahba, “Comment on Cressie,” Amer. Statist., vol. 44, no. 3, pp. 255–256, 1990.
[33] G. M. Laslett, “Kriging and splines: An empirical comparison of their predictive performance in some applications,” J. Amer. Statist. Assoc., vol. 89, no. 426, pp. 391–400, 1994.
[34] X. Jiang, J. Gao, D. Shi, and T. Wang, “Thin plate spline latent variable models for dimensionality reduction,” in Proc. IJCNN, 2012.
[35] B. Ferris, D. Fox, and N. Lawrence, “WiFi-SLAM using Gaussian process latent variable models,” in Proc. IJCAI, 2007.
[36] J. M. Wang, D. J. Fleet, and A. Hertzmann, “Gaussian process dynamical models,” in Proc. Adv. NIPS, 2005, pp. 1441–1448.
[37] N. D. Lawrence and J. Quinonero-Candela, “Local distance preservation in the GPLVM through back constraints,” in Proc. ICML, 2006, pp. 513–520.
[38] J. Yang, K. Yu, and T. Huang, “Efficient highly over-complete sparse coding using a mixture model,” in Proc. ECCV, 2010.
[39] M. K. Titsias and N. D. Lawrence, “Bayesian Gaussian process latent variable model,” in Proc. Int. Conf. AISTATS, 2010.
[40] M. Salzmann and R. Urtasun, “Implicitly constrained Gaussian process regression for monocular non-rigid pose estimation,” in Proc. Adv. NIPS, 2010, pp. 1441–1448.
[41] A. Varol, M. Salzmann, P. Fua, and R. Urtasun, “A constrained latent variable model,” in Proc. IEEE Conf. CVPR, 2012.
[42] R. Urtasun and T. Darrell, “Discriminative Gaussian process latent variable model for classification,” in Proc. ICML, 2007, pp. 927–934.
[43] X. Jiang, J. Gao, T. Wang, and L. Zheng, “Supervised latent linear Gaussian process latent variable model for dimensionality reduction,” IEEE Trans. Syst., Man, Cybern. B, Cybern., vol. 42, no. 6, pp. 1620–1632, Dec. 2012.
[44] J. Duchon, “Splines minimizing rotation-invariant semi-norms in Sobolev spaces,” in Constructive Theory of Functions of Several Variables, vol. 571. Berlin, Germany: Springer, 1976, pp. 85–100.
[45] C. Gu and G. Wahba, “Semiparametric analysis of variance with tensor product thin plate splines,” J. Roy. Statist. Soc., Ser. B, vol. 55, no. 2, pp. 353–368, 1993.


[46] B. Brown and S. Rusinkiewicz, “Non-rigid range-scan alignment using thin-plate splines,” in Proc. Symp. 3D Data Process., Visualiz., Transmiss., 2004.
[47] N. Grevstad, “Image alignment with thin-plate splines,” Purdue Univ., West Lafayette, IN, USA, Tech. Rep., 2003.
[48] C. C. Holmes and B. K. Mallick, “Generalized nonlinear modeling with multivariate free-knot regression splines,” J. Amer. Statist. Assoc., vol. 98, no. 462, pp. 352–368, 2003.
[49] J. C. McCall and M. M. Trivedi, “Pose invariant affect analysis using thin-plate splines,” in Proc. ICPR, 2004.
[50] B. Zitova and J. Flusser, “Image registration methods: A survey,” Image Vision Comput., vol. 21, no. 11, pp. 977–1000, 2003.
[51] D. Yang and A. Sluzek, “A low-dimensional local descriptor incorporating TPS warping for image matching,” Image Vision Comput., vol. 28, no. 8, pp. 1184–1195, 2010.
[52] J. Yang, J. P. Williams, Y. Sun, R. S. Blum, and C. Xu, “A robust hybrid method for nonrigid image registration,” Pattern Recognit., vol. 44, no. 4, pp. 764–776, 2011.
[53] G. Donato and S. Belongie, “Approximation methods for thin plate spline mappings and principal warps,” Oct. 2002.
[54] Y.-C. Tsai, H.-D. Lin, Y.-C. Hu, C.-L. Yu, and K.-P. Lin, “Thin-plate spline technique for medical image deformation,” J. Med. Biol. Eng., vol. 20, no. 4, pp. 203–210, 2000.
[55] D. Eberly. (2008). Thin-Plate Splines [Online]. Available: www.geometrictools.com/Documentation/ThinPlateSplines.pdf
[56] K. B. Petersen and M. S. Pedersen, The Matrix Cookbook, Version 20081110, Technical University of Denmark, 2008.
[57] N. D. Lawrence, “Learning for larger datasets with the Gaussian process latent variable model,” in Proc. Int. Conf. AISTATS, 2007, pp. 243–250.
[58] A. Frank and A. Asuncion. (2010). UCI Machine Learning Repository [Online]. Available: http://archive.ics.uci.edu/ml
[59] S. A. Nene, S. K. Nayar, and H. Murase, “Columbia object image library (COIL-20),” Columbia Univ., New York, NY, USA, Tech. Rep., 1996.
[60] D. B. Graham and N. M. Allinson, “Characterizing virtual eigensignatures for general purpose face recognition,” in Face Recognition: From Theory to Applications, NATO ASI Series F, Computer and Systems Sciences, vol. 163, pp. 446–456, 1998.
[61] K. Weinberger and L. Saul, “Unsupervised learning of image manifolds by semidefinite programming,” in Proc. IEEE Conf. CVPR, 2004, pp. 988–995.
[62] J. A. Bilmes, J. Malkin, X. Li, S. Harada, K. Kilanski, K. Kirchhoff, et al., “The vocal joystick,” in Proc. IEEE Conf. Acoust., Speech Signal Process., 2006.
[63] J. J. Hull, “A database for handwritten text recognition research,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 16, no. 5, pp. 550–554, May 1994.


Xinwei Jiang received the Ph.D. degree from Huazhong University of Science and Technology, Wuhan, China, in 2012. He is currently a Lecturer with China University of Geosciences, Wuhan. His current research interests include nonparametric statistical models, machine learning, and dimensionality reduction.

Junbin Gao received the B.Sc. degree in computational mathematics from Huazhong University of Science and Technology (HUST), Wuhan, China, in 1982, and the Ph.D. degree from Dalian University of Technology, Dalian, China, in 1991. He is currently a Professor of Computing Science with the School of Computing and Mathematics, Charles Sturt University, Bathurst, NSW, Australia. From 2001 to 2005, he was a Lecturer and a Senior Lecturer in Computer Science at the University of New England, Armidale, NSW. From 1982 to 2001, he was an Associate Lecturer, Lecturer, Associate Professor, and Professor with the Department of Mathematics at HUST. His current research interests include machine learning, data mining, Bayesian learning and inference, and image analysis.

Tianjiang Wang is currently a Professor with the School of Computer Science, Huazhong University of Science and Technology, Wuhan, China. He has completed several related research projects and published more than 30 papers in related areas. His current research interests include machine learning, computer vision, and data mining.

Daming Shi received the master’s and the Ph.D. degrees in mechanical control from Harbin Institute of Technology, Harbin, China, and the second Ph.D. degree in computer science from the University of Southampton, Southampton, U.K. He was an Assistant Professor at Nanyang Technological University, Singapore, and a Reader at Middlesex University, London, U.K. He is currently a Distinguished Professor at Harbin Institute of Technology. His current research interests include machine learning, medical image processing, pattern recognition, and neural networks.
