
Adaptive Data Embedding Framework for Multiclass Classification

Tingting Mu, Member, IEEE, Jianmin Jiang, Yan Wang, and John Y. Goulermas, Senior Member, IEEE

Abstract—The objective of this paper is the design of an engine for the automatic generation of supervised manifold embedding models. It proposes a modular and adaptive data embedding framework for classification, referred to as DEFC, which is realized in different stages, including initial data preprocessing, relation feature generation, and embedding computation. For the computation of embeddings, the concepts of friend closeness and enemy dispersion are introduced to better control, at a local level, the relative positions of the intraclass and interclass data samples. These are shown to be general cases of the global information setup utilized in the Fisher criterion, and are employed for the construction of different optimization templates to drive the DEFC model generation. For model identification, we use a simple but effective bilevel evolutionary optimization, which searches for the optimal model and its best model parameters. The effectiveness of DEFC is demonstrated with experiments using noisy synthetic datasets possessing nonlinear distributions and real-world datasets from different application fields.

Index Terms—Bilevel evolutionary optimization, classification, embedding, proximity information.

I. INTRODUCTION

Perceptually meaningful structure of multivariate data often has fewer independent degrees of freedom than the input dimensionality [1], [2]. Embedding data points into a lower dimensional "description" space capable of capturing such intrinsic degrees of freedom has become an essential preprocessing stage before conducting further analysis of the data. Therefore, a large number of algorithms on embedding and manifold learning have been proposed in recent years [3], [4]. These algorithms mainly differ in the way they compute embeddings in order to preserve certain properties and characteristics of the original high-dimensional data.

Manuscript received August 25, 2011; revised January 15, 2012; accepted May 10, 2012. Date of publication June 18, 2012; date of current version July 16, 2012. This work was supported in part by the European Framework 7 Program, EcoGem Project, under Contract 260097. T. Mu and Y. Wang are with the School of Computing, Informatics and Media, University of Bradford, Bradford BD7 1DP, U.K. (e-mail: [email protected]; [email protected]). J. Jiang is with the School of Computer Science and Technology, Tianjin University, Tianjin 300072, China, and also with the Department of Computing, University of Surrey, Guildford GU2 7XH, U.K. (e-mail: [email protected]). J. Y. Goulermas is with the Department of Electrical Engineering and Electronics, University of Liverpool, Liverpool L69 3GJ, U.K. (e-mail: [email protected]). Color versions of one or more of the figures in this paper are available online at http://ieeexplore.ieee.org. Digital Object Identifier 10.1109/TNNLS.2012.2200693

For instance, principal component analysis (PCA) [5], latent semantic indexing (LSI) [6], and multidimensional scaling [7] attempt to preserve the data variance, the k-rank approximation of the input feature matrix, and the Euclidean distances between all the original data points, respectively. Various other methods on manifold learning and spectral analysis [1], [2], [8]–[16], e.g., orthogonal locality preserving projections (OLPP) and orthogonal neighborhood preserving projections (ONPP) [8], target the preservation of the intrinsic geometry of the data, as captured by the aggregate pairwise proximity information based on constructed local neighborhood graphs.

The above methods work in an unsupervised manner, and without utilizing any label or output information; they only provide a compact and informative representation of the data. By generating embeddings based only on feature-based properties of the data, however, the final performance may suffer when a classification task is the ultimate objective. This is because real-world classification datasets often possess incompatible arrangements between the input features and the output labels, such as when patterns from different/same classes are located closely/distantly in the feature space. In such cases, embeddings computed by unsupervised techniques may accentuate such incompatibilities and lead to increased classification errors.

Therefore, supervised embedding techniques have been developed to generate embeddings and simultaneously improve the separability between classes [17]–[32]. Most of these methods attempt to move sets of intraclass points close together in the embedded space and sets of interclass points apart. They achieve this through the use of various rules for controlling the relative positions between the embeddings. For instance, Fisher discriminant analysis (FDA) [17] and maximum margin criterion (MMC) [18] encourage all the points from the same class to stay close, while all the points from different classes move apart. Discriminant neighborhood embedding (DNE) [19], marginal Fisher analysis (MFA) [20], and the methods in [27]–[29] only encourage neighboring intra- and interclass points to be close and far, respectively. Other methods use mixtures of such rules. For example, local FDA [23] keeps neighboring intraclass points together and all enemy points apart; repulsion OLPP (OLPP-R) [26] keeps all the intraclass points together and only neighboring enemy ones apart; also, repulsion ONPP [26] and discriminative ONPP [30] preserve the local data structure within each class separately, while encouraging neighboring interclass points to move apart. Various objective functions along with constraints are formulated to dispatch the underlying optimization for such methods. These include different types of trace
optimization that can be solved by an eigen-solver [33]–[35] or semidefinite programming [15], [36]. Other supervised methods that rely on either learning a good subspace projection or using neighbor constraints to induce linear and nonlinear classifiers include [37], [38].

Given the large number of embedding algorithms developed by machine learning researchers, which possess similar goals but differ in technical details, it is very difficult for users from application fields such as bioinformatics, biomechanics, natural language processing, and text mining to select the most appropriate algorithm to process their data. This is because the technical formulations of the available methods alone cannot always justify the suitability of a method to a specified task, and performance is often coupled to and strongly dependent on the inherent characteristics of the particular dataset. In this paper, we provide an effective modular framework, or engine, to alleviate this problem. Instead of fixing the embedding generation model and then training it with a given dataset, it attempts to generate a model that is optimally suitable to the given dataset, without requiring any expert settings from the user. The approach we follow is adaptive to the characteristics of the dataset, and it is flexible because the generated model can not only assume the form of many existing embedding algorithms but also give rise to many new formulations. These new formulations can vary in simple ways, such as basic data preprocessing, or in more complex ways, such as the way intra- and interclass neighborhoods are constructed or how their optimization, constraints, and regularization are defined.

The proposed data embedding framework for classification (DEFC) is modularly built on various components, that is, data preprocessing, relation feature generation, embedding computation, and optimization. The final optimization is based on an efficient evolutionary optimization operating upon two nested levels. At the upper level, the optimal embedding model is created, while at the lower level the numerical parameters of the created model are optimized.

The organization of this paper is as follows. The main data preprocessing and relation generation components are described in Sections II-A and II-B. Details on the model formulation and the introduced friend and enemy concepts are given in Section II-C. Sections II-D and II-E describe the model construction and optimization stages. Section III reports the experimental results and comparative analyses, while Section IV concludes this paper.

II. PROPOSED FRAMEWORK

Let {x_i}_{i=1}^n denote a set of data points (samples) of dimension d belonging to c different classes, where x_i = [x_{i1}, x_{i2}, ..., x_{id}]^T is the i-th row of the n × d feature matrix X = [x_{ij}]. Their corresponding class information is modeled as an n × c binary matrix Y = [y_{ij}], where y_{ij} = 1 if the i-th sample belongs to the j-th class C_j, and y_{ij} = 0 otherwise. By n_i (for i = 1, ..., c), we denote the number of points from the i-th class. The objective is to learn a mapping function ψ: R^d → R^k from X, or from both X and Y, which generates the k-dimensional (k < d) embeddings {z_i = ψ(x_i)}_{i=1}^n, with z_i = [z_{i1}, z_{i2}, ..., z_{ik}]^T.

Using matrix notation, this mapping is also denoted as Z = Ψ(X), where Z = [z_{ij}] is the n × k embedding matrix. In addition to reducing the dimensionality of the original features and recovering the low-dimensional manifolds the data samples lie on, this mapping must also enhance class discriminability in the embedded space. The mapping can subsequently be used to project m new testing or query points {x̃_i}_{i=1}^m, where x̃_i = [x̃_{i1}, x̃_{i2}, ..., x̃_{id}]^T. This leads to new embeddings {z̃_i = ψ(x̃_i)}_{i=1}^m, composing the resulting m × k embedding matrix Z̃ = [z̃_{ij}], where z̃_i = [z̃_{i1}, z̃_{i2}, ..., z̃_{ik}]^T. The applied mapping Z̃ = Ψ(X̃) should also enable higher discriminability in the embedded than in the original space.

Different from many existing embedding algorithms, which work directly on the original input features X, the proposed DEFC supports the computation of embeddings from a new set of features obtained through selective transformations of the data samples, and also from relation features. Along with the proposed embedding computation, these constitute the three main processing components of DEFC, and are described in Sections II-A, II-B, and II-C, respectively.

A. Basic Data Preprocessing

Data preprocessing techniques, e.g., normalization and whitening, can potentially improve the performance of many machine learning methods [39]. Here, for DEFC, we apply a wide set of such simple techniques, which can be applied in a mutually exclusive manner depending on the model parameters that control them. These are as follows (a short code sketch of these transforms is given at the end of this subsection).
1) Scaling: It makes each sample x_i a unit 2-norm vector by scaling along each dimension.
2) Centering: It subtracts the mean (1/n) Σ_{i=1}^n x_i from each sample, so that each dimension of the data possesses zero mean.
3) Standardization: It subtracts the mean from each sample and divides each feature by the corresponding standard deviation, in order to transform the data to samples with zero mean and unit variance.
4) PCA: It applies PCA to the data, by eigen-decomposing the input data covariance matrix and linearly transforming the centered data using the matrix M_r, whose columns are the r eigenvectors corresponding to the largest r nonzero eigenvalues. This decorrelates the data by making the new covariance diagonal and eliminates the subspace of lower variance.
5) Whitening: It applies the linear transformation M_r Λ_r^{-1/2} to the centered data, where Λ_r is the eigenvalue matrix of the covariance with the largest r nonzero eigenvalues. It operates like PCA, but it additionally spheres the data by imposing an identity covariance on the transformed samples.

The above preprocessing methods are reformulated in matrix presentation in Table I. Details on how these techniques are employed in DEFC are given in Section II-D.
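To make this stage concrete, the following is a minimal NumPy sketch of the training/query forms summarized in Table I below. The function name, the train/query interface, and the eigenvalue tolerance are illustrative choices and not part of the DEFC specification; only the matrix formulas follow the description above.

```python
import numpy as np

def preprocess(X, Xq, method="standardization", r=None):
    """Fit one of the Table I transforms on the training matrix X (n x d)
    and apply the corresponding query form to Xq (m x d)."""
    n, d = X.shape
    if method == "scaling":
        # s^{-1} X : each sample is scaled to unit 2-norm
        return (X / np.linalg.norm(X, axis=1, keepdims=True),
                Xq / np.linalg.norm(Xq, axis=1, keepdims=True))
    mu = X.mean(axis=0)                      # training mean (1/n) sum_i x_i
    Xc, Xqc = X - mu, Xq - mu                # centering uses the training mean
    if method == "centering":
        return Xc, Xqc
    if method == "standardization":
        z = X.std(axis=0, ddof=1)            # per-feature standard deviation
        return Xc / z, Xqc / z
    # PCA / whitening: eigen-decompose the training covariance matrix
    C = Xc.T @ Xc / (n - 1)
    evals, evecs = np.linalg.eigh(C)
    order = np.argsort(evals)[::-1]          # sort eigenvalues descending
    evals, evecs = evals[order], evecs[:, order]
    r = r or np.sum(evals > 1e-12)           # keep the r largest nonzero ones
    Mr, Lr = evecs[:, :r], evals[:r]
    if method == "pca":
        return Xc @ Mr, Xqc @ Mr
    if method == "whitening":
        W = Mr / np.sqrt(Lr)                 # M_r Lambda_r^{-1/2}
        return Xc @ W, Xqc @ W
    raise ValueError(method)
```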

TABLE I
BASIC DATA PREPROCESSING TECHNIQUES IN MATRIX PRESENTATION, GIVEN THE INPUT FEATURE MATRIX X AND THE QUERY MATRIX X̃. THE FUNCTION diag(·) RETURNS THE CORRESPONDING DIAGONAL MATRIX FOR A VECTOR INPUT, OR THE CORRESPONDING DIAGONAL VECTOR FOR A MATRIX INPUT

Method | Training samples | Query samples
Scaling | s^{-1} X, where s = diag(√diag(XX^T)) | s̃^{-1} X̃, where s̃ = diag(√diag(X̃X̃^T))
Centering | (I_{n×n} − (1/n) 1_{n×n}) X | X̃ − (1/n) 1_{m×n} X
Standardization | (I_{n×n} − (1/n) 1_{n×n}) X z^{-1}, where z = diag(√((1/(n−1)) diag(X^T (I_{n×n} − (1/n) 1_{n×n}) X))) | (X̃ − (1/n) 1_{m×n} X) z^{-1}
PCA | (I_{n×n} − (1/n) 1_{n×n}) X M_r | (X̃ − (1/n) 1_{m×n} X) M_r
Whitening | (I_{n×n} − (1/n) 1_{n×n}) X M_r Λ_r^{-1/2} | (X̃ − (1/n) 1_{m×n} X) M_r Λ_r^{-1/2}

TABLE II
LIST OF (DIS)SIMILARITY MEASURES BETWEEN TWO INPUT VECTORS x_i AND x_j

Dot-product based:
  Dot-product | x_i^T x_j
  Polynomial kernel | (1 + x_i^T x_j)^p, p > 0
  Cosine similarity | x_i^T x_j / (‖x_i‖ ‖x_j‖)
  Tanimoto coefficient | x_i^T x_j / (‖x_i‖² + ‖x_j‖² − x_i^T x_j)
Euclidean based:
  Euclidean norm | ‖x_i − x_j‖
  Gaussian kernel | exp(−‖x_i − x_j‖²/σ), σ > 0
  Alternative weight [26] | decreasing function of ‖x_i − x_j‖² with parameter τ > 0, as defined in [26]
Correlation based:
  Pearson's correlation | Σ_{t=1}^d (x_{it} − x̄_i)(x_{jt} − x̄_j) / ( √(Σ_{t=1}^d (x_{it} − x̄_i)²) √(Σ_{t=1}^d (x_{jt} − x̄_j)²) ), where x̄_i = (1/d) Σ_{t=1}^d x_{it}
B. Generation of Relation Features

Compared to the original input features (or their preprocessed versions in Table I), relational values obtained by a (dis)similarity measure φ(·, ·) between samples provide an alternative way of describing the information content of the dataset. This is because samples can be adequately characterized through their relative positions, which are represented by their (dis)similarities with other samples [40]. Previous research [41] has demonstrated the effectiveness of relational values used as the input to classifiers instead of the original samples. Our former work [42] also indicates that embeddings computed from relational values possess similar discriminant power to those computed from the original features. Therefore, we design DEFC to optionally support the use of relational values (or relation features) between an input sample and all the training samples as the input to the embedding computation module. This gives rise to an n × n feature matrix R = [r_{ij}] for the training samples, with its ij-th element computed as φ(x_i, x_j), and an m × n feature matrix R̃ = [r̃_{ij}] for the m query samples, with φ(x̃_i, x_j) being its ij-th element. Because relation features capture the interactions between samples, they can facilitate the discovery of nonlinear structures within the data. Another advantage is that, when the original data dimensionality is extremely high (d ≫ n), the relation features produce a computationally tractable arrangement for those embedding methods whose computational complexity depends on d; these include the projection-based spectral embedding algorithms of PCA, LSI, LPP, FDA, etc. In general, any (dis)similarity measure can be used to compute relation features. In DEFC, we employ popular ones from the literature, based on the Euclidean and the dot-product measures, as well as Pearson's correlation coefficient; these are summarized in Table II.
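As a concrete illustration of this stage, the sketch below builds the training relation matrix R and the query relation matrix R̃ with the Gaussian kernel of Table II; the function name, the choice of measure, and the vectorized distance computation are illustrative assumptions, and any other Table II measure could be substituted.

```python
import numpy as np

def relation_features(X, Xq, sigma=1.0):
    """Relation features of Section II-B: R is n x n with r_ij = phi(x_i, x_j),
    and Rq is m x n with entries phi(x~_i, x_j), here using the Gaussian kernel
    phi(a, b) = exp(-||a - b||^2 / sigma) from Table II."""
    def gaussian(A, B):
        # squared Euclidean distances between all rows of A and all rows of B
        sq = (A ** 2).sum(1)[:, None] + (B ** 2).sum(1)[None, :] - 2.0 * A @ B.T
        return np.exp(-np.maximum(sq, 0.0) / sigma)
    R = gaussian(X, X)      # n x n, relations among training samples
    Rq = gaussian(Xq, X)    # m x n, relations of query samples to training samples
    return R, Rq
```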



C. Computation of Embeddings

Most embedding algorithms for classification consider both the local neighborhood and class structures of the data to be of importance, as opposed to unsupervised ones, which compute embeddings solely based on feature proximities. These structures are preserved or sharpened by modifying the relative positions between the embedded points. One common way of doing this is to minimize Σ_{i,j=1}^n w_{ij} ‖z_i − z_j‖² under an orthogonality condition. The weights {w_{ij}}_{i,j=1}^n control the degree of "closeness" between the embedded points, and are typically computed from the neighborhood information and/or the class labels of the original samples [26], [29]. In this paper, instead of evaluating the relative positions between all the points, we formulate specific measures to quantify separately the relative positions between the intraclass samples and also between the interclass samples, and then we use these measures to formulate the optimization problem for computing the embeddings within the proposed DEFC.

1) Friend Closeness and Enemy Dispersion: The labeled data samples can be characterized with pairwise constraints expressed as (x_i, x_j, δ_ij), where δ_ij = +1 if x_i and x_j are from the same class, and δ_ij = −1 otherwise [43], [44]. Although in this case the indicator δ_ij relies only on class information, it can be extended to incorporate local neighborhood information [19]. In our case, one possibility is the three-value indicator

  δ_ij = +1, if x_i and x_j are from the same class and also undirected k_F-NNs
         −1, if x_i and x_j are from different classes and also undirected k_E-NNs
          0, otherwise.    (1)

The search for the undirected k_F-nearest neighbors (NNs), for the case of δ_ij = +1, is restricted to samples of the same class only. The search for the undirected k_E-NNs, for the case of δ_ij = −1, is restricted to samples from the two classes of x_i and x_j only. We refer to x_i and x_j as pairwise friends when δ_ij = +1, and as pairwise enemies when δ_ij = −1.

Let S = [s_ij] denote the n × n similarity matrix (used in the above NN searches) computed using the similarity measures in Table II, from either the original samples X or their
preprocessed versions as described in Table I. Then, the indicator δ_ij can be defined as a function of S, the number k_F or k_E of NNs, and the label matrix Y; that is, δ_ij ≡ δ_ij(k_E, k_F, S, Y). For notational simplicity, we will denote it as δ_ij(k_F, S, Y) for pairwise friends, and δ_ij(k_E, S, Y) for pairwise enemies. In the following, we introduce two measurable concepts of friend closeness (C_F) and enemy dispersion (D_E). These are given by

  C_F = Σ_{i,j=1}^n [δ_ij(k_F, S, Y) (δ_ij(k_F, S, Y) + 1) / 2] g(s_ij) ‖z_i − z_j‖²    (2)

  D_E = Σ_{i,j=1}^n [δ_ij(k_E, S, Y) (δ_ij(k_E, S, Y) − 1) / 2] g(s_ij) ‖z_i − z_j‖²    (3)

where g(·) is a weight function set to either g(s_ij) = 1 or g(s_ij) = s_ij. The objective of C_F and D_E is to evaluate the aggregated weighted distances between the embedded pairwise friends and enemies, respectively. To reinforce class separability in the embedded space, one must generate embeddings that minimize C_F and maximize D_E. In the weighted version (i.e., g(s_ij) = s_ij), more similar friends in C_F have larger weights s_ij and, thus, they are forced to come even closer in the embedded space; this improves within-class compactness. At the same time, more similar enemies in D_E are very likely to be boundary points, so larger weights s_ij force them further apart from each other in the embedded space; this improves the overall between-class separation.

It can be noted that the notion of simultaneously minimizing and maximizing the two measures C_F and D_E to achieve better separability bears similarity with standard methods such as the Fisher criterion [17]. This makes use of the within-class (S_w) and between-class (S_b) scatter, which are defined (in the embedded space) as

  S_w = Σ_{t=1}^c Σ_{z_i ∈ C_t} (z_i − (1/n_t) Σ_{z_j ∈ C_t} z_j)(z_i − (1/n_t) Σ_{z_j ∈ C_t} z_j)^T    (4)

  S_b = Σ_{t=1}^c n_t ((1/n_t) Σ_{z_j ∈ C_t} z_j − (1/n) Σ_{k=1}^n z_k)((1/n_t) Σ_{z_j ∈ C_t} z_j − (1/n) Σ_{k=1}^n z_k)^T.    (5)
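To make (1)–(3) concrete, the following sketch constructs the indicator δ_ij and evaluates C_F and D_E for a given embedding matrix Z. The symmetric neighbor construction and the simplified enemy search (over all samples of other classes, rather than the two classes of each pair) are illustrative simplifications, not the exact procedure described above.

```python
import numpy as np

def friend_enemy_indicator(S, y, kF, kE):
    """delta_ij of (1): +1 for same-class undirected kF-NN pairs, -1 for
    different-class undirected kE-NN pairs, 0 otherwise. S is an n x n
    similarity matrix (Table II) and y an integer label array."""
    n = len(y)
    delta = np.zeros((n, n), dtype=int)
    for i in range(n):
        same = np.where((y == y[i]) & (np.arange(n) != i))[0]
        diff = np.where(y != y[i])[0]
        for cand, k, val in ((same, kF, 1), (diff, kE, -1)):
            if len(cand):
                nn = cand[np.argsort(-S[i, cand])[:k]]   # k most similar candidates
                delta[i, nn] = val
    pos = (delta == 1) | (delta.T == 1)     # undirected: keep a pair if either
    neg = (delta == -1) | (delta.T == -1)   # direction selected it
    return pos.astype(int) - neg.astype(int)

def closeness_dispersion(Z, S, delta, weighted=False):
    """Friend closeness C_F of (2) and enemy dispersion D_E of (3)."""
    D2 = ((Z[:, None, :] - Z[None, :, :]) ** 2).sum(-1)  # ||z_i - z_j||^2
    G = S if weighted else np.ones_like(S)               # g(s_ij) = s_ij or 1
    CF = (delta * (delta + 1) / 2 * G * D2).sum()        # only friend pairs survive
    DE = (delta * (delta - 1) / 2 * G * D2).sum()        # only enemy pairs survive
    return CF, DE
```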

However, the difference is that the Fisher criterion permits all the intraclass and interclass samples to contribute to the scatter calculations, while C_F and D_E operate on selective pairwise friend and enemy contributions. To further analyze this relationship, we observe that S_w and S_b can be re-expressed with a more concise matrix notation, using only the embedding and label matrices Z and Y, respectively, as

  S_w = Z^T (D(W_w) − W_w) Z    (6)

  S_b = Z^T (D(W_b) − W_b) Z    (7)

where

  D(W) = diag(W × 1_{n×1})    (8)

  W_w = Y (Y^T Y)^{-1} Y^T    (9)

  W_b = (1/n) 1_{n×n} − Y (Y^T Y)^{-1} Y^T.    (10)

The above weight matrices W_w = [w_ij^(w)] and W_b = [w_ij^(b)] contain indicator elements which depend only on the label information Y. The optimizing quantities of S_w and S_b in the Fisher criterion are typically based on their traces, because the sum of the eigenvalues corresponds to the overall variance along the principal data directions. It can be easily verified from (6) and (7) that these traces can be calculated (in the embedded space) as

  trace[S_w] = (1/2) Σ_{i,j=1}^n w_ij^(w) ‖z_i − z_j‖²    (11)

  trace[S_b] = (1/2) Σ_{i,j=1}^n w_ij^(b) ‖z_i − z_j‖²    (12)
             = trace[Z^T D(W_b) Z] − Σ_{i,j=1}^n w_ij^(b) z_i^T z_j
             = − Σ_{i,j=1}^n w_ij^(b) z_i^T z_j    (13)

where the last equality holds because D(W_b) = 0_{n×n}. Thus, in the above, minimizing trace[S_w] is equivalent to penalizing the dissimilarities (norm distances) between intraclass points. Maximizing trace[S_b] is equivalent to rewarding the dissimilarities (norms) between interclass points or penalizing their similarities (inner products).

Furthermore, as mentioned before, the indicator elements w_ij^(w) and w_ij^(b) in (9) and (10) can be defined solely upon the label information Y. Specifically, w_ij^(w) takes the value 1/n_t if x_i and x_j (or z_i and z_j) are from the same t-th class, and the value 0 otherwise. For w_ij^(b), we have the value 1/n − 1/n_t if x_i and x_j are from the same t-th class, and 1/n otherwise. We can see that the two traces of the Fisher criterion in (11) and (12) are very special cases of the introduced friend closeness C_F and enemy dispersion D_E, respectively. Specifically, both sets (2), (3) and (11), (12) make use of indicators; the former use (1/2)δ_ij(δ_ij ± 1), while the latter use (1/2)w_ij^(w,b). However, the exact values of such indicators are conceptually irrelevant, as they mainly aim to achieve various normalizations. The important difference here is that the δ_ij depend not only on the labels Y, but also on the local similarities S utilized by the k_F nearest friend and k_E nearest enemy searches, whereas w_ij^(w) (and w_ij^(b)) depend only on global label information in Y, without any need for S. Another difference between (2), (3) and (11), (12) is that the latter imply that weights g(s_ij) = 1 are always used, which does not distinguish stronger friend similarities or proximal border instances. Finally, in (11) and (12) the two traces can be expressed in terms of each other and the total dissimilarity sum, since

  trace[S_b] = (1/2) Σ_{i,j=1}^n (1/n − w_ij^(w)) ‖z_i − z_j‖²
             = (1/(2n)) Σ_{i,j=1}^n ‖z_i − z_j‖² − trace[S_w]    (14)

while for (2) and (3) this is not possible due to the separately treated local friend and enemy dependencies.

Between the global character of the Fisher reformulation in (11) and (12), and the fine-grained local search proposed in (2) and (3), we can define intermediate models. For example, we can directly extend the Fisher model in (11) and (12) by simply incorporating local search. One straightforward way of doing so is to redefine the indicators w_ij^(w) and w_ij^(b) to include a k_N-NN search using the proximity matrix S. For example, w_ij^(w) ≡ w_ij^(w)(k_N, S, Y) = 1/n_t if x_i and x_j are from the same t-th class and also undirected k_N-NNs of each other (where the search is deployed over all samples from all classes), and 0 otherwise. The redefinition is similar for w_ij^(b) ≡ w_ij^(b)(k_N, S, Y). The Fisher-based extensions of C_F and D_E are now given as

  C_F = Σ_{i,j=1}^n [w_ij^(w)(k_N, S, Y) / 2] g(s_ij) ‖z_i − z_j‖²    (15)

  D_E = Σ_{i,j=1}^n [w_ij^(b)(k_N, S, Y) / 2] g(s_ij) ‖z_i − z_j‖²    (16)

where the weight function g(s_ij) is included. Given a data matrix X and a measure for computing S, the quantities C_F and D_E in both models in (2), (3), (15), and (16) are functions of the numbers of NNs (k_E, k_F, and k_N) and of S. These parameters control the degree of fine-graininess of the local structure taken into account. When k_E, k_F, and k_N reach their maximum values, C_F and D_E become measures describing only the global class structure of the data. Specifically, by setting k_N = n, C_F and D_E in (15) and (16) assume the within- and between-class scatter matrix trace values of the Fisher criterion, respectively (when g(s_ij) = 1).

2) Friend- and Enemy-Based Optimization Templates: In order to generate optimal embeddings that reduce dimensionality as well as improve class compactness and separability, we must optimize the two measures C_F and D_E. This can take place in various ways, including the simultaneous minimization of C_F and maximization of D_E, the individual minimization of C_F, or the individual maximization of D_E. Various trace optimization templates can be used to achieve these options. Here, we modify the templates suggested by [20] and [33] and consider regularization for preventing overfitting [45], [46]. By reformulating (2), (3), (15), and (16) in matrix form, defining two template matrices A = [a_ij] and B = [b_ij], and employing the linear constraint Z = XP, we enable the proposed DEFC to support the following optimization problems for computing embeddings.

1) Minimization of C_F and maximization of D_E by either

  max_{P P^T = I_{k×k}} trace[P X^T (A − λB) X P^T]    (17)

or

  max_{P (X^T B X + λ I_{d×d}) P^T = I_{k×k}} trace[P X^T A X P^T]    (18)

where λ > 0 is the regularization parameter. When (2) and (3) are used, the two template matrices are the Laplacian matrices A = D(A′) − A′ and B = D(B′) − B′, where the proximity matrices A′ = [a′_ij] and B′ = [b′_ij] are given by

  a′_ij = δ_ij(k_E, S, Y) (δ_ij(k_E, S, Y) − 1) g(s_ij)    (19)

  b′_ij = δ_ij(k_F, S, Y) (δ_ij(k_F, S, Y) + 1) g(s_ij)    (20)

dropping the constant denominators as they do not affect the optimization. When (15) and (16) are used, A′ and B′ are computed by

  a′_ij = w_ij^(b)(k_N, S, Y) g(s_ij)    (21)

  b′_ij = w_ij^(w)(k_N, S, Y) g(s_ij).    (22)

2) Maximization of D_E by

  max_{P P^T = I_{k×k}} trace[P X^T A X P^T]    (23)

where A is computed using either (19) or (21).

3) Minimization of C_F by

  min_{P P^T = I_{k×k}} trace[P X^T B X P^T]    (24)

with B defined using either (20) or (22).

The optimal projection matrix P* can be obtained by solving a generalized eigen-decomposition problem. Then, embeddings for the m query points are computed by Z̃ = X̃P*. When relation features are used, the transformation Z = RP is applied and, consequently, the input X of all the above optimization problems is replaced with R, while embeddings for the query points are computed by Z̃ = R̃P*.
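As an illustration of how templates such as (17) and (18) can be solved, the sketch below computes a projection with NumPy/SciPy eigen-decompositions. It is a simplified stand-in under the assumption that A and B are the Laplacian template matrices of (19)–(22); the handling of the remaining templates and the postprocessing of the projection are omitted.

```python
import numpy as np
from scipy.linalg import eigh

def solve_template(X, A, B, lam=1.0, k=2, template=17):
    """Simplified solver for two trace-optimization templates of Section II-C2.
    template 17:  max trace[P X^T (A - lam*B) X P^T]   s.t.  P P^T = I
    template 18:  max trace[P X^T A X P^T]             s.t.  P (X^T B X + lam*I) P^T = I
    Returns a d x k projection so that the embeddings are Z = X @ P."""
    d = X.shape[1]
    if template == 17:
        M = X.T @ (A - lam * B) @ X
        w, V = eigh((M + M.T) / 2)                      # ordinary symmetric problem
    elif template == 18:
        M = X.T @ A @ X
        C = X.T @ B @ X + lam * np.eye(d)               # regularized constraint matrix
        w, V = eigh((M + M.T) / 2, (C + C.T) / 2)       # generalized eigenproblem
    else:
        raise ValueError("only templates (17) and (18) are sketched here")
    return V[:, np.argsort(w)[::-1][:k]]                # k leading eigenvectors
```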

D. Model Construction

In this section, we describe the proposed DEFC framework in terms of the three previously presented processing components. Given an input matrix X of a set of training samples and their associated label matrix Y, DEFC operates along the following stages.
1) The original input samples X are optionally preprocessed with one of the five preprocessing transformations listed in Table I. We denote the output from this stage as X_p, and use an integer parameter i_p ∈ ℵ_{0,5} to control which preprocessing option is selected (with 0 corresponding to no preprocessing applied). The notation ℵ_{a,b} here corresponds to the set of integer numbers from a to b, inclusive.
2) Subsequently, the output from the previous stage is optionally subjected to the calculation of relation features R, using one of the eight (dis)similarity measures from Table II. We denote the output from this stage as X_r and use an integer parameter i_r ∈ ℵ_{0,8} to control the relation feature calculation (0 corresponds to no relation feature generation).

3) The model can then optionally subject the relation features to one of the preprocessing schemes of Table I, similar to Stage 1 above. If, however, no relation feature generation is applied in Stage 2, then no preprocessing takes place here, to avoid repeating the preprocessing step redundantly. The output from this stage is denoted as X_t, and the parameter controlling the preprocessing scheme as i_t ∈ ℵ_{0,5}.
4) The transformed features X_p from Stage 1 are used to estimate the similarity matrix S needed in the neighborhood searches of Section II-C1. S is calculated using seven similarity measures from Table II (excluding the Euclidean norm, as this gives a dissimilarity). The controlling parameter of this stage is i_s ∈ ℵ_{1,7}.
5) At this stage, the embeddings are calculated as described in Section II-C2, using X_t instead of X. The model can be constructed here with various options controlled by different parameters. The two versions of friend closeness and enemy dispersion in (2), (3), (15), and
(16) are controlled by i_em ∈ ℵ_{1,2}; the two types of weights g(s_ij) are controlled by i_e^g ∈ ℵ_{1,2}; and the four versions of optimization templates in (17), (18), (23), and (24) by i_eo ∈ ℵ_{1,4}.
6) Finally, the obtained projection vectors P* are optionally orthonormalized, depending on the control parameter i_o ∈ ℵ_{0,1}.

Table III summarizes all the available options of the DEFC and their corresponding numerical model parameters for all stages.

TABLE III
SWITCHES AND NUMERICAL PARAMETERS OF THE DEFC MODEL FOR ALL STAGES. FOR EACH DIFFERENT SWITCH, THERE ARE DIFFERENT OPTIONAL NUMERICAL PARAMETERS

Stage | Model parameters (switches) | Model parameters (numerical)
1) Preprocessing | i_p ∈ ℵ_{0,5}. i_p = 0: none, i_p = 1: scaling, i_p = 2: centering, i_p = 3: standardization, i_p = 4: PCA, i_p = 5: whitening. | N/A
2) Relation computation | i_r ∈ ℵ_{0,8}. i_r = 0: none, i_r = 1: dot, i_r = 2: polynomial, i_r = 3: cosine, i_r = 4: Tanimoto, i_r = 5: Euclidean, i_r = 6: Gaussian, i_r = 7: alt. weight, i_r = 8: correlation. | i_r = 2: p; i_r = 6: σ; i_r = 7: τ
3) Preprocessing relation | i_t ∈ ℵ_{0,5}. i_t = 0: none, i_t = 1: scaling, i_t = 2: centering, i_t = 3: standardization, i_t = 4: PCA, i_t = 5: whitening. | N/A
4) Proximity computation | i_s ∈ ℵ_{1,7}. i_s = 1: dot, i_s = 2: polynomial, i_s = 3: cosine, i_s = 4: Tanimoto, i_s = 5: Gaussian, i_s = 6: alt. weight, i_s = 7: correlation. | i_s = 2: p; i_s = 5: σ; i_s = 6: τ
5) Embedding generation | i_em ∈ ℵ_{1,2}, i_e^g ∈ ℵ_{1,2}, i_eo ∈ ℵ_{1,4}. i_em = 1: (2) and (3), i_em = 2: (15) and (16). i_e^g = 1: g(s_ij) = 1, i_e^g = 2: g(s_ij) = s_ij. i_eo = 1: (17), i_eo = 2: (18), i_eo = 3: (23), i_eo = 4: (24). | All: k; i_em = 1: k_E and/or k_F; i_em = 2: k_N; i_eo = 1, 2: λ
6) Postprocessing | i_o ∈ ℵ_{0,1}. i_o = 0: none, i_o = 1: projection orthonormalization. | N/A

The modular structure of the DEFC permits it to generate many linear and nonlinear embedding algorithms different from existing ones. Additionally, many popular embedding algorithms can be represented by specific parameterizations. For example, for i_p = 0 and i_r = 0, when i_em = 2 and i_e^g = 1, and with k_E and k_F set to their maximum values, we obtain the MMC and a regularized version of FDA, for i_eo = 1 and i_eo = 2, respectively. When i_em = 1, i_e^g = 1, and i_eo = 1 with λ = 1, the DNE is obtained. When i_em = 1, i_e^g = 1, and i_eo = 2 with λ = 0, a version of MFA is obtained with a slightly different way of choosing NNs. By changing i_r = 0 to i_r = 2 or 6, which activates the computation of relation features using kernel functions, we obtain the kernel extension of the corresponding embedding algorithms, as discussed in [8] and [20].

E. Model Identification

In this section, we explain how to automatically identify a model under the DEFC framework and generate the optimal embeddings for a given classification task. At the upper level, the shape of the model is controlled by an eight-dimensional switching integer vector I = [i_p, i_r, i_t, i_s, i_em, i_e^g, i_eo, i_o], whose elements correspond to the multiple design components described in the six processing steps of Section II-D and the middle column of Table III. This exhibits strong expressive power, as it can represent a large number of 6 × 9 × 6 × 7 × 16 × 2 = 72 576 different models. For each of these models, a different set of numerical parameters is needed to instantiate it, at the lower level, into a working embedding generation algorithm. All these parameters are summarized in the third column of Table III. For example, when a Gaussian kernel is used for relation feature generation in Stage 2 (for i_r = 6), σ becomes a parameter of the model, which does not exist when correlation is used (for i_r = 8). Other parameters, such as the embedding dimensionality k in Stage 5, are always needed for all available models. For a given model switching vector I, we use ν(I) to denote its corresponding numerical parameters required to be optimized. For example, for I = [0, 6, 1, 6, 2, 1, 2, 0] we have ν(I) = [σ, τ, k, k_N, λ]; these include the Gaussian width σ for the relation features, τ for the alternative weight of the similarity matrix, the embedding dimensionality k, k_N for the NN search of (15) and (16), and the regularizer λ in (18). Similarly, the model I = [5, 2, 1, 4, 1, 2, 1, 1] has numerical parameter vector ν(I) = [p, k, k_F, k_E, λ], and I = [4, 0, 0, 1, 1, 2, 3, 0] has ν(I) = [k, k_E].

In order to fully identify an embedding generator model, we need to select both an optimal switch I* and an optimal
parameter set ν*(I*) for the particular I*. Since the objective of this paper is the generation of supervised embeddings to be used as the input to a given classifier, the optimization underlying the model identification needs to take into account some classification performance. For the purpose of enhancing generalization, this performance needs to be based on an out-of-sample estimation error, computed from the samples X and labels Y of the training dataset. Therefore, we define an objective function E(I, ν(I), X, Y, ·), which estimates the misclassification rate of a given classifier on embeddings Z generated by a model instantiated by parameters I and ν(I) (as implemented by the six stages in Section II-D), using a cross-validation partitioning of the training set or a separate independent validation set. The overall model identification procedure can be formulated through the following bilevel optimization problem:

  min_I  E(I, ν*(I), X, Y, ·)
  s.t.  I ∈ ℵ_{0,5} × ℵ_{0,8} × ℵ_{0,5} × ℵ_{1,7} × ℵ_{1,2} × ℵ_{1,2} × ℵ_{1,4} × ℵ_{0,1}
        ν*(I) ∈ argmin_{ν(I)} E(I, ν(I), X, Y, ·)
               s.t.  ν(I) ∈ V(I)    (25)

where V(I) is the set of feasible values for the numerical parameters of the model corresponding to I. The above bilevel optimization problem is a mixed one, since at the upper level the parameters are all discrete, while at the lower level they are both discrete and continuous. There are various methods for solving this problem [47]. In this paper, we make use of evolutionary optimization, and specifically of a nested genetic algorithm (GA) [48], to optimize the two levels. For this particular problem, GAs can take advantage of the problem structure and efficiently scan the solution space, without the need to enumerate all possible models and optimize each parameter set exhaustively. For each individual, that is, a potential solution switch vector I from the GA population of the upper level, another population is processed, whose members are the numerical parameters ν(I). The fitness of each model I from the upper level population corresponds to the negative misclassification rate E, which is in turn calculated using the optimal parameters ν*(I) from the lower level population. Regarding the GA setup, a Gaussian mutation is used for the continuous parameters, and a simple label swap for the discrete ones, with a mutation rate of 0.1. For crossover, a uniform one is used for the discrete parameters, and heuristic and intermediate arithmetical crossovers for the continuous ones. A portion (80%) of the new offspring is created by crossover, with the remainder produced by mutating selected parents. The final solution from the GA is the set of optimal parameters I* and ν*(I*).
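The bilevel structure of (25) can be illustrated with the following skeleton, in which the out-of-sample error E is estimated by cross-validated 1-NN and a crude mutation-only search stands in for the two GA populations. The build_embedding and sample_parameters helpers are hypothetical placeholders for the six DEFC stages and for the feasible set V(I), and the operators do not reproduce the exact crossover and mutation scheme described above.

```python
import random
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

# Upper-level search space: the eight switches of I (see Table III).
SWITCH_RANGES = [(0, 5), (0, 8), (0, 5), (1, 7), (1, 2), (1, 2), (1, 4), (0, 1)]

def build_embedding(I, nu, X):
    # Placeholder for the six DEFC stages of Section II-D: here it only centers
    # the data and keeps the leading nu["k"] principal directions.
    Xc = X - X.mean(axis=0)
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    return Xc @ Vt[: nu["k"]].T

def sample_parameters(I, d):
    # Placeholder sampler over the feasible set V(I): only k is sampled here.
    return {"k": random.randint(1, max(1, min(d, 10)))}

def estimate_error(I, nu, X, y):
    # E(I, nu(I), X, Y, .): cross-validated 1-NN misclassification rate.
    Z = build_embedding(I, nu, X)
    return 1.0 - cross_val_score(KNeighborsClassifier(1), Z, y, cv=5).mean()

def lower_level(I, X, y, budget=20):
    # Crude stand-in for the lower-level GA: sample nu(I) and keep the best.
    best = (None, np.inf)
    for _ in range(budget):
        nu = sample_parameters(I, X.shape[1])
        err = estimate_error(I, nu, X, y)
        if err < best[1]:
            best = (nu, err)
    return best

def upper_level(X, y, generations=10, pop_size=25):
    # Crude stand-in for the upper-level GA over the switch vectors I.
    pop = [[random.randint(lo, hi) for lo, hi in SWITCH_RANGES] for _ in range(pop_size)]
    best = (None, None, np.inf)
    for _ in range(generations):
        scored = []
        for I in pop:
            nu, err = lower_level(I, X, y)
            scored.append((err, I, nu))
            if err < best[2]:
                best = (I, nu, err)
        scored.sort(key=lambda t: t[0])
        parents = [I for _, I, _ in scored[: max(2, pop_size // 2)]]
        # next generation: copy a parent and mutate each switch with rate 0.1
        pop = [[random.randint(lo, hi) if random.random() < 0.1 else g
                for g, (lo, hi) in zip(random.choice(parents), SWITCH_RANGES)]
               for _ in range(pop_size)]
    return best  # (I*, nu*(I*), estimated error E)
```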

III. EXPERIMENTAL ANALYSIS AND COMPARISONS

To evaluate the proposed DEFC, we compare it with nine popular embedding algorithms, which include the three unsupervised ones PCA [5], LSI [6], and OLPP [8], and six supervised ones, namely FDA [21], MFA [20], DNE [19], OLPP-R [26], and the kernel extensions of MFA and FDA [20], [49], denoted as KMFA and KFDA, respectively. An overview of these algorithms can be found in [4]. The main conceptual difference between DEFC and such methods is that DEFC is able to evaluate a large number of potential embedding models and automatically generate an optimally suitable one for a given dataset, while the existing methods are based on single fixed models. Also, as mentioned in previous sections, FDA/KFDA, MFA/KMFA, and DNE are actually models that can be potentially generated by DEFC.

We use five multiclass classification datasets, which include the two synthetic Banana and Swiss roll datasets possessing nonlinear distributions, and three real-world datasets from different application fields, which are the satellite images (Satimage) and the DNA dataset from the Statlog collection,¹ and also the Reuters documents from the "Reuters-21578 Text Categorization Test Collection" that contains articles taken from the Reuters newswire.² Samples from each dataset are divided into three independent partitions for the purpose of training, validation, and testing, where the number of training samples is deliberately kept small in order to show that our proposed method can discover robust embeddings given only a small number of examples. The characteristics of all datasets are summarized in Table IV.

¹ Available at http://www.csie.ntu.edu.tw/∼cjlin/libsvmtools/datasets/multiclass.html.
² Available at http://archive.ics.uci.edu/ml/support/Reuters-21578+Text+Categorization+Collection.

TABLE IV
DETAILS AND PARTITIONING OF THE FIVE DATASETS USED

Dataset | Classes | Features | Training | Validation | Testing
Banana | 2 | 2 | 400 | 2000 | 2900
Swiss roll | 4 | 3 | 400 | 2400 | 2800
Satimage | 6 | 36 | 489 | 3946 | 2000
DNA | 3 | 180 | 497 | 1503 | 1186
Reuters | 11 | 69 593 | 500 | 1230 | 1229

To visually analyze the data, we also display in Fig. 1 all the proximity matrices computed with the Gaussian kernel (and also with the cosine similarity for Reuters). We use the original features from the training sets and order the samples so that intraclass ones appear consecutively. From Fig. 1(a)–(c), it can be seen that the original features of the Banana, Swiss roll, and Satimage datasets show discernible class separations, but with the class structures not separated distinctly, which implies the existence of noise and interclass sample proximities. In the case of the DNA dataset, Fig. 1(d) shows that these problems are worse, while Fig. 1(e) and (f) shows that the Reuters data appear more distinct with the cosine similarity rather than the Gaussian.

Fig. 1. Comparison of the class structures based on the proximity matrices computed from the original features. (a) Banana. (b) Swiss Roll. (c) Satimage. (d) DNA. (e) Reuters (Gaussian). (f) Reuters (cosine).

Two nonparametric classifiers are selected for model identification and classification in (25). One is a simple nonlinear classifier, the 1-NN, and the other a simple linear classifier, Fisher's linear discriminant analysis (FLDA) [50]. The choice of these two is based on our objective to investigate the discriminating ability of the resulting embeddings, as well as on the fact that using a sophisticated classifier would dilute the contribution of the embeddings to the final classification performance. The Gaussian kernel is employed for KFDA and KMFA to incorporate nonlinearities.
Parameters for existing embedding algorithms are tuned by grid search when there are no more than two parameters, and by a GA otherwise. The model identification and parameter optimization procedures for all experiments and all methods are performed using the validation set, with the final model assessments using the test sets. The GA toolbox of MATLAB is used to implement the bilevel model optimization. All experiments were conducted using MATLAB R2011a on a machine with a 3.06-GHz CPU and 4-GB memory running Mac OS X.

A. Comparative Analysis With Synthetic Datasets

We first perform a qualitative analysis, by illustrating the computed 2-D embeddings for the Banana and Swiss roll datasets, using PCA, LSI, OLPP, FDA, MFA, DNE, OLPP-R, KFDA, KMFA, and the proposed DEFC, where the classification performance for model identification in (25) is based on FLDA. Comparisons between the distributions of the original samples and their embeddings can be made using Figs. 2 and 3.

Fig. 2. Illustration of the original and the embedded Banana test samples generated with different methods. The corresponding test set accuracies using FLDA classification are also shown in parentheses. (a) Original (70.71%). (b) PCA (37.61%). (c) MFA (58.48%). (d) KFDA (88.21%). (e) KMFA (82.97%). (f) DEFC (89.41%).

Fig. 3. Illustration of the original and the embedded Swiss roll test samples generated with different methods. The corresponding test set accuracies using FLDA classification are also shown in parentheses. (a) Original (70.71%). (b) PCA (37.61%). (c) LSI (60.57%). (d) OLPP (71.00%). (e) FDA (70.68%). (f) MFA (71.11%). (g) KFDA (89.11%). (h) KMFA (85.86%). (i) DEFC (89.32%).

Because, for the Banana set, the three unsupervised methods of PCA, LSI, and OLPP output identical embeddings, we only display PCA in Fig. 2.

Also, for both Banana and Swiss roll, some of the supervised methods FDA, MFA, DNE, and OLPP-R give very similar embeddings, so we only display MFA in Fig. 2, and FDA and MFA in Fig. 3. The latter methods are selected as representatives in the figures, because they perform slightly better in classification than the ones left out.

It can be seen from Fig. 2(a)–(c) that the unsupervised and the linear supervised embedding methods generate embeddings that have distributions almost identical to those of the original samples. Since they preserve the intrinsic geometry of the original feature space, the (model assessment using the test set) classification accuracies for FLDA in Fig. 2(b) and (c) are close to that of Fig. 2(a). In Fig. 2(d)–(f), however, we can see that the three supervised nonlinear methods KFDA, KMFA, and DEFC are able to reshape the distributions of the embeddings into more discriminant configurations. The FLDA accuracies in these three figures, compared to Fig. 2(a), show around 25%–35% improvement. Fig. 2(f) shows that DEFC produces the best FLDA accuracy and achieves good between-class separability, while KFDA in Fig. 2(d) follows as the second best method.

Similar experiments were conducted for the Swiss roll dataset, seeking a description space with improved class separability while reducing the d = 3 original dimensions to k = 2. It can be seen from Fig. 3(a)–(f) that all the unsupervised and linear supervised methods do not work well, in terms of redistributing the classes and increasing accuracy. This is because the Swiss roll dataset does not have redundant features, and these methods are only able to discover a 2-D planar view of the data; therefore, they cannot compress the data without causing information loss. In this case, more complex nonlinear supervised algorithms are able to represent the data using fewer features and with improved class discriminability.

Fig. 4. Embedded test samples for the Banana (top row) and Swiss roll data (bottom row) computed by various well-performing DEFC models. The switching model vector I = [i_p, i_r, i_t, i_s, i_em, i_e^g, i_eo, i_o] is shown, along with the order in which it appeared in the upper level GA population. The test set classification accuracies are also shown and are annotated by whether 1-NN or FLDA is used. (a) I = [5, 5, 0, 1, 1, 1, 2, 0], second, FLDA: 88.21%. (b) I = [2, 5, 0, 6, 2, 2, 2, 0], third, FLDA: 87.62%. (c) I = [5, 5, 0, 1, 1, 1, 2, 1], first, 1-NN: 87.45%. (d) I = [0, 2, 3, 2, 2, 1, 3, 1], second, FLDA: 88.64%. (e) I = [3, 2, 3, 6, 2, 2, 2, 1], third, FLDA: 88.14%. (f) I = [0, 5, 0, 5, 1, 1, 2, 0], first, 1-NN: 86.71%.

As can be seen in Fig. 3(a), (g), and (i), both KFDA and DEFC generate embeddings with comparatively clear class separations and with improved classification rates. Although KMFA is based on the same model selection procedure, it does not perform as well. In Figs. 2(f) and 3(i), we present the optimal DEFC models I* with the highest validation FLDA performance given by (25). Fig. 4 also shows various embeddings computed by other well-performing DEFC models, corresponding to the top second and third GA individuals using FLDA, and the top first (optimal) ones using 1-NN validation performance. All optimal and near-optimal models seem to perform and separate the classes well, which shows the robustness of the proposed framework.

B. Benchmark Evaluation

To evaluate the accuracy of the proposed framework, we compare DEFC with the nine existing algorithms using the datasets in Table IV. For DEFC and all competing methods, model selection is conducted using both the 1-NN and FLDA classifiers separately, leading to two sets of optimal settings, which for DEFC we present in Table VI. Models selected based on 1-NN validation performance are assessed with 1-NN but using the testing portion of the datasets, as shown in Table IV, and similarly for FLDA. Table V summarizes the performances from all experiments. For the Reuters dataset only, because of its very high dimensionality d = 69 593, FLDA (last table column) cannot be applied, as it requires the eigen-decomposition of a d × d matrix; thus, we report instead the linear classification performance obtained by a linear support vector machine. Also, the table does not report accuracies for the embedding algorithms of PCA, LSI, OLPP, FDA, DNE, OLPP-R, and MFA, because these are based on an eigen-decomposition of a d × d matrix. Therefore, only the accuracies for KMFA, KFDA, and DEFC are recorded, as these require the computationally cheaper eigen-decomposition of an n × n matrix due to their kernel or relation feature extension.

It can be seen from Table V that DEFC outperforms all the existing methods on all the datasets. It can be noticed that some existing embedding algorithms even reduce the performance compared to that of the original features for certain dataset and classifier cases, e.g., the Satimage and Reuters datasets. For the Satimage data with the 1-NN classifier, many supervised embedding methods, such as DNE, FDA, KFDA, and KMFA, perform even worse than unsupervised ones, such as PCA and LSI. In Table V, the performance ranking of the various existing embedding methods differs notably for the different datasets. This indicates that the generation of an embedding model that is optimally suitable to a given dataset can be much more effective than having a fixed model shape for all cases. Therefore, the robustness of DEFC is justifiable from its flexible and adaptive design as well as its data-driven optimization, which allows it to derive the best fit model for the problem at hand. Additionally, the numbers of computed embeddings recorded in Table VI show that DEFC can drastically reduce feature dimensionalities; for example, the DNA dataset was reduced from d = 180 to fewer than k = 9 dimensions, and Reuters from d = 69 593 to k = 10 (also, see Table IV for dataset information).

In Fig. 5, we display the proximity matrices computed from the DEFC embeddings, using the same samples and similarity measures as in Fig. 1, in order to demonstrate the improved class separability. Comparison of Figs. 1 and 5 shows that the DEFC proximity matrices have much more distinct class structures than the original ones, e.g., Figs. 1(d) and 5(d) for the DNA dataset. Also, for the Reuters data, the original proximity matrices computed with the Gaussian kernel in Fig. 1(e) and the cosine similarity in Fig. 1(f) show absent and indistinct class structures, respectively. However, the corresponding embedded proximity matrices in Fig. 5(e) and (f) show distinctly the existence of 11 classes with both the Gaussian and cosine measures.

Fig. 5. Comparison of the class structures based on the proximity matrices computed from the DEFC embeddings. (a) Banana. (b) Swiss roll. (c) Satimage. (d) DNA. (e) Reuters (Gaussian). (f) Reuters (cosine).

Table VI presents the optimal embedding algorithms I* generated by DEFC through the upper level GA, based on either the 1-NN or the FLDA validation performance of (25).

Various observations can be made from Table VI, including the following. Regardless of whether 1-NN or FLDA is used, the same technique (controlled by i_p) is usually chosen to preprocess the original features of the same dataset. Also, standardization and whitening seem to be the most frequently selected preprocessing methods. In the second stage, relation feature generation seems to be critical for all the datasets, as it enables the discovery of nonlinear structure within the data. The Euclidean distance (i_r = 5) and the polynomial kernel (i_r = 2) are more frequently selected than other (dis)similarity measures for calculating relation features. For the calculation of the proximity matrix S, the polynomial kernel (i_s = 2) is most often chosen to model the local structure of the data, with the dot product (i_s = 1), the Gaussian kernel (i_s = 5), and the alternative weight (i_s = 6) following in frequency. For the embedding computation of the fifth stage, the optimization template in (18) (i_eo = 2) is identified as the most beneficial in the vast majority of experiments. Another observation is that almost all the embedding algorithms created by DEFC are new models. The only existing one is produced for the Swiss roll dataset, based on FLDA performance; this is the KFDA method with a polynomial kernel, which corresponds to the setting i_p = i_t = 0, i_r = i_em = i_eo = 2, and i_e^g = 1, without k-NN search (k_N reached its maximum value).

We also compare the within- and between-class scatter quantities in (11) and (12), defined in the original Fisher criterion, with the proposed friend closeness and enemy dispersion in (2), (3), (15), and (16). The Reuters dataset, which possesses the highest feature dimensionality, is used for the evaluation. To compare these three sets of quantities objectively, the same relation features computed by the polynomial kernel are used as the input to generate embeddings, and no feature preprocessing is applied. Also, the same optimization template in (17) is employed to minimize the within-class scatter or friend closeness, while maximizing the between-class scatter or enemy dispersion. In all cases, the same FLDA classifier is used to assess the classification performance. The resulting classification accuracies are 85.76%, 88.12%, and 86.74% for the original Fisher criterion, the proposed criterion in (2) and (3), and the one in (15) and (16), respectively. Compared to the original Fisher criterion, the two proposed designs of friend closeness and enemy dispersion are more flexible, as they take into account the local feature and global class structures of the data distributions.

C. Computational Cost Analysis

Given a classification task, DEFC generates the optimal embedding model as well as its corresponding optimal model parameters, using the bilevel GA-based search. Letting n_I denote the total number of candidate models evaluated by the upper level GA, and t_i the computational time expended in the lower level for determining the parameters of the i-th candidate model, DEFC requires Σ_{i=1}^{n_I} t_i time for overall model identification. Existing embedding methods only have single fixed models and only require the optimization of their model parameters. In the following, we summarize the model selection costs for the experimented methods.

TABLE V
PERFORMANCE COMPARISON IN TERMS OF CLASSIFICATION ACCURACY USING THE TEST PARTITION OF DIFFERENT DATASETS. THE FINAL DIMENSIONALITIES k OF THE EMBEDDINGS ARE SHOWN IN PARENTHESES

Methods | Banana (%) 1-NN | Banana (%) FLDA | Swiss Roll (%) 1-NN | Swiss Roll (%) FLDA | Satimage (%) 1-NN | Satimage (%) FLDA | DNA (%) 1-NN | DNA (%) FLDA | Reuters (%) 1-NN | Reuters (%) Linear
Original | 84.79 (2) | 58.48 (2) | 83.51 (3) | 70.71 (3) | 85.90 (36) | 71.50 (36) | 66.19 (180) | 87.69 (180) | 43.12 (69 593) | 83.56 (69 593)
PCA | 84.76 (2) | 58.48 (2) | 55.14 (2) | 37.61 (2) | 86.50 (11) | 74.25 (3) | 78.08 (4) | 88.95 (77) | NA | NA
LSI | 84.76 (2) | 58.48 (2) | 65.50 (2) | 60.57 (2) | 86.05 (16) | 74.20 (3) | 76.90 (5) | 89.38 (74) | NA | NA
OLPP | 84.79 (2) | 58.48 (2) | 79.46 (2) | 71.00 (2) | 85.70 (33) | 74.25 (3) | 78.33 (14) | 89.21 (70) | NA | NA
FDA | 85.00 (2) | 58.48 (2) | 68.86 (2) | 70.68 (2) | 83.25 (5) | 79.60 (2) | 87.18 (2) | 87.69 (2) | NA | NA
DNE | 84.79 (2) | 58.48 (2) | 77.07 (2) | 71.14 (2) | 84.90 (6) | 79.95 (2) | 87.86 (3) | 88.20 (2) | NA | NA
OLPP-R | 84.76 (2) | 58.48 (2) | 79.07 (2) | 66.96 (2) | 85.65 (32) | 80.30 (2) | 88.36 (3) | 88.95 (73) | NA | NA
MFA | 85.03 (2) | 58.48 (2) | 78.96 (2) | 71.11 (2) | 86.00 (20) | 75.20 (5) | 87.77 (2) | 87.94 (174) | NA | NA
KFDA | 85.48 (2) | 88.00 (2) | 85.68 (2) | 89.11 (2) | 84.30 (6) | 83.10 (28) | 86.30 (24) | 90.30 (152) | 87.14 (8) | 77.30 (343)
KMFA | 81.41 (2) | 86.34 (2) | 82.61 (2) | 85.86 (2) | 80.55 (4) | 83.60 (21) | 89.88 (3) | 90.39 (49) | 58.75 (98) | 73.15 (440)
DEFC | 87.45 (2) | 89.41 (2) | 85.79 (2) | 89.32 (2) | 86.85 (22) | 85.25 (34) | 91.57 (7) | 90.98 (8) | 91.38 (10) | 91.95 (10)

TABLE VI
Optimal I* = [i_p, i_r, i_t, i_s, i_em, i_e^g, i_eo, i_o] Model Vectors (Columns, at Top Portion of Table) Generated by DEFC Using Validation Performance from 1-NN and FLDA, and for All Datasets

                           Banana        Swiss roll    Satimage      DNA           Reuters
                           1-NN   FLDA   1-NN   FLDA   1-NN   FLDA   1-NN   FLDA   1-NN   FLDA
Preprocessing (i_p)        5      5      0      0      3      3      3      3      5      3
Relation (i_r)             5      5      5      2      4      2      4      6      2      2
Rel. preprocessing (i_t)   0      0      0      0      2      3      3      3      2      2
Proximity (i_s)            1      6      5      2      2      1      2      7      5      6
C_F and D_E (i_em)         1      1      1      2      1      2      1      1      2      1
Weighting (i_e^g)          1      1      1      1      2      2      2      2      1      2
Optimization (i_eo)        2      2      2      2      2      2      2      2      2      1
Orthonormalization (i_o)   1      0      0      1      1      0      1      1      0      1

For PCA and LSI, we only need to determine the embedding dimensionality k, which on our hardware corresponds to an average model parameterization time t_i of around 20 s for the employed datasets. For the remaining existing methods, the number of parameters to be tuned varies between 2 and 5 (e.g., the embedding dimensionality k, the numbers of interclass and intraclass NNs, and the kernel and regularization parameters), and t_i varies between about 10 min and 1 h, depending on the method and the dataset. For DEFC, each of the n_I candidate models contains between one and six parameters (see Section II-E for more details), and the model parameterization time t_i varies between about 8 min and 2 h, depending on the form of the model and the dataset. To obtain the DEFC results in Table V, the upper GA level evaluated around n_I = 60 to 150 different candidate models. The memory requirements of DEFC are very low, because the GA search only operates on small population matrices, using a very low-cost encoding scheme that mixes discrete integer and real-valued parameters. Both GA levels use small populations of 25 individuals, and the chromosomes are short (8 genes for the upper level, and 1–6 genes for the lower one); an illustrative sketch of this encoding follows.
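As an illustration of this lightweight encoding (the gene ranges below are rough assumptions; the exact option counts of each switch are defined in Section II of the paper), the following sketch initializes the two small populations and reports their memory footprint.

```python
import numpy as np

rng = np.random.default_rng(0)

# Assumed (illustrative) numbers of options for the 8 integer switches
# [i_p, i_r, i_t, i_s, i_em, i_e^g, i_eo, i_o]; gene values index the options.
GENE_RANGES = [6, 7, 4, 8, 2, 2, 2, 2]

POP_SIZE = 25  # both GA levels use small populations of 25 individuals

# Upper level: integer-coded chromosomes of 8 genes (one model vector each).
upper_population = np.stack(
    [rng.integers(0, high, size=POP_SIZE) for high in GENE_RANGES], axis=1)

# Lower level: real-coded chromosomes of 1-6 genes (model parameters such as
# k, neighborhood sizes, or kernel widths); here 4 genes scaled to [0, 1].
n_params = 4
lower_population = rng.random((POP_SIZE, n_params))

print(upper_population.shape, lower_population.shape)   # (25, 8) (25, 4)
print("memory in bytes:",
      upper_population.nbytes + lower_population.nbytes)
```

Even with 64-bit storage, the two population matrices occupy only a few kilobytes, which is consistent with the low memory requirements noted above.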

The high computational times of DEFC, necessitated by the very wide model space it supports (see Section II-E), may sometimes be a burden for certain applications. In such cases, it is possible to pursue a suboptimal solution by narrowing down the search space of candidate models. For example, as observed in Table VI, standardization is frequently selected to preprocess the original features (i_p = 3); the polynomial kernel and the Euclidean distance are popular measures for computing relation features (i_r = 2, 5); and the polynomial and Gaussian kernels are popular for proximity computation (i_s = 2, 5). Also, the optimization template in (18) appears to be quite effective for most of the studied datasets (i_eo = 2). Thus, the computational requirements can be reduced by conducting the search with these options fixed. Additionally, for simplification, one could use i_e^g = 2 to fix the weight functions, and even disable the preprocessing of relation features (i_t = 0) and the postprocessing of projections (i_o = 0). This finally leads to a total of 2 × 2 × 2 = 8 potential models, reduced from the full total of 72 576, and makes an exhaustive search for the best model possible (a minimal enumeration sketch is given below).
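A minimal sketch of this reduced search, assuming (as an inference from the 2 × 2 × 2 count) that the third remaining binary choice is the C_F/D_E design i_em, simply enumerates the eight candidate models; each one would then be tuned and cross-validated as usual.

```python
from itertools import product

# Fixed switches, following the simplification described above:
# standardization (i_p = 3), no relation-feature preprocessing (i_t = 0),
# fixed weighting (i_e^g = 2), optimization template (18) (i_eo = 2),
# and no postprocessing of the projections (i_o = 0).
FIXED = {"i_p": 3, "i_t": 0, "i_e_g": 2, "i_eo": 2, "i_o": 0}

# Free switches (2 x 2 x 2 = 8 candidate models): relation measure,
# proximity measure, and -- assumed here -- the two C_F/D_E designs.
FREE = {"i_r": (2, 5), "i_s": (2, 5), "i_em": (1, 2)}

candidate_models = [
    {**FIXED, **dict(zip(FREE, values))}
    for values in product(*FREE.values())
]

assert len(candidate_models) == 8
for model in candidate_models:
    print(model)  # each dict is one model whose parameters are then tuned
```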


To test this simplified DEFC version, we conducted experiments using the Reuters dataset. The obtained 1-NN accuracy is 90.89% with a total computational time of 10 h, and the FLDA accuracy is 91.70% with a computational time of 9 h. This performance is almost as good as the one recorded in Table V, while the computational time is reduced by more than 75%. Of course, depending on the time constraints, users can further simplify the DEFC model search by imposing even more frugal switch options, or by experimentally narrowing them down to the most effective switch subset for an entire class of datasets of interest with similar statistical properties. In such cases, any subsequent parameter fine-tuning will be required for only very few models, or even a single one, which could make the computational requirements comparable to those of other supervised methods with fixed models.

IV. CONCLUSION

In this paper, we investigated embedding computation to facilitate multiclass classification tasks. A flexible and adaptive embedding framework was proposed, capable of acting as an engine for generating new embedding algorithms. It was based on a modular definition of various components, such as data preprocessing, relation feature generation, and the introduced concepts of friend closeness and enemy dispersion. The latter were shown to be generalizations of Fisher's criterion, incorporating local neighborhood information in the intra- and interclass sample proximity estimations. Different optimization models were incorporated for embedding generation, while the DEFC engine was based on a bilevel formulation utilizing two simple optimizers that maximize validation or cross-validation performance: one for model identification at the upper level and one for parameter training at the lower level. Experimental results demonstrated the effectiveness of the proposed framework by comparing it with nine existing embedding algorithms (unsupervised, supervised, linear, and nonlinear) on different synthetic and real-world datasets.



Jianmin Jiang received the B.Sc. degree from the Shandong Mining Institute, Jiangsu, China, the M.Sc. degree from the China University of Mining and Technology, Beijing, China, and the Ph.D. degree from the University of Nottingham, Nottingham, U.K., in 1982, 1984, and 1994, respectively. He is currently a Professor of media computing with the University of Surrey, Guildford, U.K. He is also a Consulting Professor with the Chinese Academy of Sciences and Southwest University, Chongqing, China. His current research interests include image and video processing in compressed domains, digital video coding, stereo image coding, medical imaging, computer graphics, machine learning and AI applications in digital media processing, and retrieval and analysis. Prof. Jiang is a fellow of the Institution of Electrical Engineers, a fellow of the Royal Society of Arts, a member of the Engineering and Physical Sciences Research Council College, and an EU FP-6/7 evaluator. He was the recipient of the Outstanding Overseas Chinese Young Scholarship Award (Type-B) from the Chinese National Sciences Foundation in 2000 and the Outstanding Overseas Academics Award from the Chinese Academy of Sciences in 2004. He is a Chartered Engineer.

Tingting Mu (M’05) received the B.Eng. degree in electronic engineering and information science from the University of Science and Technology of China, Hefei, China, and the Ph.D. degree in electrical engineering and electronics from the University of Liverpool, Liverpool, U.K., in 2004 and 2008, respectively. She is currently a Post-Doctoral Researcher with the School of Computing, Informatics and Media, University of Bradford, Bradford, U.K. Her current research interests include machine learning, pattern recognition, data mining, and mathematical modeling, with applications to information retrieval, text mining, and bioinformatics and biomedical engineering.

John Y. Goulermas (M’98–SM’10) received the B.Sc. degree (first class) in computation from the University of Manchester Institute of Science and Technology (UMIST), Manchester, U.K., in 1994, and the M.Sc. (by research) and Ph.D. degrees from the Control Systems Center, Department of Electrical and Electronic Engineering, UMIST, in 1996 and 2000, respectively. He is currently a Reader with the Department of Electrical Engineering and Electronics, University of Liverpool, Liverpool, U.K. His current research interests include machine learning, mathematical modeling and optimization, and image processing with applications to biomedical engineering and industrial monitoring and control.

Yan Wang received the B.Eng. degree in computer science and technology from the Beijing Jiaotong University, Beijing, China, in 2001, the M.Sc. degree in information technology, and the Ph.D. degree in computer science from the University of Nottingham, Nottingham, U.K., in 2003 and 2009, respectively. He is currently a Post-Doctoral Researcher with the School of Computing, Informatics and Media, University of Bradford, Bradford, U.K. His current research interests include computer vision, image processing, geographical information systems, and advanced driver assistance systems.
