LETTER

Communicated by Purushottam Kar

Guaranteed Classification via Regularized Similarity Learning

Zheng-Chu Guo [email protected]

Yiming Ying [email protected] College of Engineering, Mathematics and Physical Sciences, University of Exeter, EX4 4QF, UK

Learning an appropriate (dis)similarity function from the available data is a central problem in machine learning, since the success of many machine learning algorithms critically depends on the choice of a similarity function to compare examples. Although many approaches to similarity metric learning have been proposed, there has been little theoretical study of the links between similarity metric learning and the classification performance of the resulting classifier. In this letter, we propose a regularized similarity learning formulation associated with general matrix norms and establish its generalization bounds. We show that the generalization error of the resulting linear classifier can be bounded by the derived generalization bound of similarity learning. This shows that good generalization of the learned similarity function guarantees good classification by the resulting linear classifier. Our results extend and improve those obtained by Bellet, Habrard, and Sebban (2012). Because the techniques used there depend on the notion of uniform stability (Bousquet & Elisseeff, 2002), the bound obtained holds true only for Frobenius matrix-norm regularization. Our techniques, using the Rademacher complexity (Bartlett & Mendelson, 2002) and a related Khinchin-type inequality, enable us to establish bounds for regularized similarity learning formulations associated with general matrix norms, including the sparse $L^1$-norm and the mixed $(2,1)$-norm.

Z.-C. Guo is now at the Department of Mathematics, Zhejiang University, Hangzhou 310027, China.

Neural Computation 26, 497–522 (2014) doi:10.1162/NECO_a_00556

1 Introduction

The success of many machine learning algorithms heavily depends on how to specify the similarity or distance metric between examples. For instance, the k-nearest neighbor (k-NN) classifier depends on a distance (dissimilarity) function to identify the nearest neighbors for classification. Most

© 2014 Massachusetts Institute of Technology


information retrieval methods rely on a similarity function to identify the data points that are most similar to a given query. Kernel methods rely on the kernel function to represent the similarity between examples. Hence, how to learn an appropriate (dis)similarity function from the available data is a central problem in machine learning, which we refer to as similarity metric learning throughout the letter. Recently, considerable research effort has been devoted to similarity metric learning, and many methods have been proposed. They can be broadly divided into two main categories. The first category is the one-stage approach, in which the similarity (kernel) function and the classifier are learned together. Multiple kernel learning (Lanckriet, Cristianini, Bartlett, Ghaoui, & Jordan, 2004; Varma & Babu, 2009) is a notable one-stage approach that aims to learn an optimal kernel combination from a prescribed set of positive semidefinite (PSD) kernels. Another exemplary one-stage approach is indefinite kernel learning, which is motivated by the fact that in many applications, potential kernel matrices could be nonpositive semidefinite. Such cases include hyperbolic tangent kernels (Smola, Óvári, & Williamson, 2001) and the protein sequence similarity measures derived from Smith-Waterman and BLAST scores (Saigo, Vert, Ueda, & Akutsu, 2004). Indefinite kernel learning (Chen, Garcia, Gupta, Rahimi, & Cazzanti, 2009; Ying, Campbell, & Girolami, 2009) aims to learn a PSD kernel matrix from a prescribed indefinite kernel matrix and is mostly restricted to transductive settings. Other methods (Wu & Zhou, 2005; Wu, 2013) have analyzed regularization networks such as ridge regression and SVM given a prescribed indefinite kernel instead of aiming to learn an indefinite kernel function from data.
The generalization analysis for such one-stage methods has been studied (see, e.g., Chen et al., 2009; Cortes, Mohri, & Rostamizadeh, 2010a; Ying & Campbell, 2009). The second category of similarity metric learning is the two-stage method, in which the processes of learning the similarity function and training the classifier are separate. One exemplary two-stage approach, referred to as metric learning (Bar-Hillel, Hertz, Shental, & Weinshall, 2005; Davis, Kulis, Jain, Sra, & Dhillon, 2007; Hoi, Liu, Lyu, & Ma, 2006; Jin, Wang, & Zhou, 2009; Weinberger, Blitzer, & Saul, 2005; Xing, Jordan, Russell, & Ng, 2002; Ying, Huang, & Campbell, 2009), often focuses on learning a Mahalanobis distance metric defined, for any $x, x' \in \mathbb{R}^d$, by $d_M(x, x') = \sqrt{(x - x')^T M (x - x')}$. Here, $M$ is a positive semidefinite (PSD) matrix. Another example of such methods (Chechik, Sharma, Shalit, & Bengio, 2010; Maurer, 2008) is bilinear similarity learning, which focuses on learning a similarity function defined, for any $x, x' \in \mathbb{R}^d$, by $s_M(x, x') = x^T M x'$ with $M$ being a PSD matrix. These methods are mainly motivated by the natural intuition that the similarity score between examples in the same class should be larger than that between examples from distinct classes. The k-NN classification using the similarity metric learned from these methods was empirically


shown to achieve better accuracy than that using the standard Euclidean distance. Although many two-stage approaches for similarity metric learning have been proposed, in contrast to the one-stage methods there is relatively little theoretical work on whether similarity-based learning guarantees good generalization of the resultant classification. For instance, generalization bounds were recently established for metric and similarity learning (Cao, Guo, & Ying, 2012; Jin et al., 2009; Maurer, 2008) under different statistical assumptions on the data. However, it is not clear whether good generalization bounds for metric and similarity learning (Cao et al., 2012; Jin et al., 2009) lead to good classification by the resultant k-NN classifiers, so there is as yet no theoretical guarantee for the empirical success noted above. Recently, Bellet, Habrard, and Sebban (2012) proposed a regularized similarity learning approach, mainly motivated by the $(\epsilon, \gamma, \tau)$-good similarity functions introduced in Balcan and Blum (2006) and Balcan, Blum, and Srebro (2008). In particular, they showed that the proposed similarity learning can theoretically guarantee good generalization for classification. However, because the techniques used depend on the notion of uniform stability (Bousquet & Elisseeff, 2002), the generalization bounds hold true only for strongly convex matrix-norm regularization (e.g., the Frobenius norm). In this letter, we consider a new similarity learning formulation associated with general matrix-norm regularization terms. Its generalization bounds are established for various matrix regularizations including the Frobenius norm, sparse $L^1$-norm, and mixed $(2,1)$-norm (see definitions below). The learned similarity matrix is used to design a sparse classification algorithm, and we prove that the generalization error of its resultant linear classifier can be bounded by the derived generalization bound for similarity learning.
This implies that the proposed similarity learning with general matrix-norm regularization guarantees good generalization for classification. Our techniques, using the Rademacher complexity (Bartlett & Mendelson, 2002) and an important Khinchin-type inequality for Rademacher variables, enable us to derive bounds for general matrix-norm regularization, including the sparse $L^1$-norm and mixed $(2,1)$-norm regularization. The remainder of this letter is organized as follows. In section 2, we propose the similarity learning formulations with general matrix-norm regularization terms and state the main theorems. In particular, the results are illustrated with various examples. Related work is discussed in section 3. The generalization bounds for similarity learning are established in section 4. In section 5, we develop a theoretical link between the generalization bounds of the proposed similarity learning method and the generalization error of the linear classifier built from the learned similarity function. Section 6 estimates the Rademacher averages and gives the proof


for the examples in section 2. Section 7 summarizes this letter and points out some possible directions for future research.

2 Regularization Formulation and Main Results

In this section, we introduce the regularized formulation of similarity learning and state our main results. Before doing so, we introduce some notation and background material. Denote, for any $n \in \mathbb{N}$, $\mathbb{N}_n = \{1, 2, \ldots, n\}$. Let $z = \{z_i = (x_i, y_i) : i \in \mathbb{N}_m\}$ be a set of training samples drawn identically and independently from a distribution $\rho$ on $Z = X \times Y$. Here, the input space $X$ is a domain in $\mathbb{R}^d$, and $Y = \{-1, 1\}$ is called the output space. Let $S^d$ denote the set of $d \times d$ symmetric matrices. For any $x, x' \in X$, we consider $K_A(x, x') = x^T A x'$ as a bilinear similarity score parameterized by a symmetric matrix $A \in S^d$. The symmetry of $A$ guarantees the symmetry of the similarity score $K_A$, that is, $K_A(x, x') = K_A(x', x)$. The aim of similarity learning is to learn a matrix $A$ from the given set of training samples $z$ such that the similarity score $K_A$ between examples with the same label is larger than that between examples with different labels. A natural approach to achieve this aim is to minimize the following empirical error:
$$\mathcal{E}_z(A) = \frac{1}{m} \sum_{i \in \mathbb{N}_m} \Big(1 - \frac{1}{mr} \sum_{j \in \mathbb{N}_m} y_i y_j K_A(x_i, x_j)\Big)_+ , \qquad (2.1)$$
where $r > 0$ is the margin. Note that $\sum_{j \in \mathbb{N}_m} y_i y_j K_A(x_i, x_j) = \sum_{\{j : y_j = y_i\}} K_A(x_i, x_j) - \sum_{\{j : y_j \neq y_i\}} K_A(x_i, x_j)$. Minimizing this empirical error encourages, for any $i$, that with margin $r$, the average similarity score between examples of the same class as $y_i$ is relatively larger than that between examples of classes distinct from $y_i$. To avoid overfitting, we add a matrix-regularization term to the empirical error and reach the following regularization formulation:
$$A_z = \arg\min_{A \in S^d} \big[\mathcal{E}_z(A) + \lambda \|A\|\big], \qquad (2.2)$$
where $\lambda > 0$ is the regularization parameter. Here, the notation $\|A\|$ denotes a general matrix norm. For instance, it can be the sparse $L^1$-norm $\|A\|_1 = \sum_{k \in \mathbb{N}_d} \sum_{\ell \in \mathbb{N}_d} |A_{k\ell}|$, the $(2,1)$-mixed norm $\|A\|_{(2,1)} := \sum_{k \in \mathbb{N}_d} \big(\sum_{\ell \in \mathbb{N}_d} A_{k\ell}^2\big)^{1/2}$, the Frobenius norm $\|A\|_F = \big(\sum_{k,\ell \in \mathbb{N}_d} A_{k\ell}^2\big)^{1/2}$, or the trace norm $\|A\|_{tr} := \sum_{\ell \in \mathbb{N}_d} \sigma_\ell(A)$, where $\{\sigma_\ell(A) : \ell \in \mathbb{N}_d\}$ denotes the singular values of the matrix $A$. When $\|\cdot\|$ is the Frobenius norm, formulation 2.2 was proposed by Bellet et al. (2012).
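As a concrete reading of equations 2.1 and 2.2, the empirical error and the regularized objective can be written in a few lines. The following is a minimal numpy sketch (the function names and toy data are ours, not from the letter); it also spells out two of the matrix norms above:

```python
import numpy as np

def l1_norm(A):        # sparse L1-norm: sum of absolute entries
    return float(np.abs(A).sum())

def mixed_21_norm(A):  # (2,1)-mixed norm: sum over rows of row-wise L2 norms
    return float(np.sqrt((A ** 2).sum(axis=1)).sum())

def empirical_error(A, X, y, r):
    """E_z(A) = (1/m) sum_i (1 - (1/(m r)) sum_j y_i y_j K_A(x_i, x_j))_+  (eq. 2.1)."""
    m = len(y)
    K = X @ A @ X.T                                   # K_A(x_i, x_j) = x_i^T A x_j
    margins = (np.outer(y, y) * K).sum(axis=1) / (m * r)
    return float(np.maximum(0.0, 1.0 - margins).mean())

def objective(A, X, y, r, lam, norm=l1_norm):
    """Regularized objective E_z(A) + lambda * ||A|| of eq. 2.2."""
    return empirical_error(A, X, y, r) + lam * norm(A)

rng = np.random.default_rng(0)
X = rng.normal(size=(20, 5))
y = rng.choice([-1.0, 1.0], size=20)
# A = 0 gives E_z(0) = 1 and zero penalty, so the objective at 0 is exactly 1;
# this is the comparison used in section 4 to show that ||A_z|| <= 1/lambda.
print(objective(np.zeros((5, 5)), X, y, r=1.0, lam=0.1))  # 1.0
```

Any minimizer must therefore do at least as well as $A = 0$, which is exactly how the constraint set $\mathcal{A}$ of section 4 arises.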


The first contribution of this letter is to establish generalization bounds for regularized similarity learning (see equation 2.2) with general matrix norms. Specifically, define
$$\mathcal{E}(A) = \int_Z \Big(1 - \frac{1}{r} \int_Z y y' K_A(x, x')\, d\rho(x', y')\Big)_+ d\rho(x, y). \qquad (2.3)$$
The target of generalization analysis for similarity learning is to bound $\mathcal{E}(A_z) - \mathcal{E}_z(A_z)$. Its special case with the Frobenius matrix norm was established in Bellet et al. (2012) using uniform stability techniques (Bousquet & Elisseeff, 2002), which, however, cannot deal with non-strongly convex matrix norms such as the $L^1$-norm, the $(2,1)$-mixed norm, and the trace norm. Our new analysis techniques are able to deal with general matrix norms; they depend on the concept of Rademacher averages (Bartlett & Mendelson, 2002), defined as follows:

Definition 1. Let $F$ be a class of uniformly bounded functions. For every integer $n$, we call
$$R_n(F) := \mathbb{E}_z \mathbb{E}_\sigma \sup_{f \in F} \frac{1}{n} \sum_{i \in \mathbb{N}_n} \sigma_i f(z_i)$$
the Rademacher average over $F$, where $\{z_i : i \in \mathbb{N}_n\}$ are independent random variables distributed according to some probability measure and $\{\sigma_i : i \in \mathbb{N}_n\}$ are independent Rademacher random variables, that is, $P(\sigma_i = 1) = P(\sigma_i = -1) = \frac{1}{2}$.

Before stating our generalization bounds for similarity learning, we first introduce some notation. For any $B, A \in \mathbb{R}^{n \times d}$, let $\langle B, A \rangle = \mathrm{trace}(B^T A)$, where $\mathrm{trace}(\cdot)$ denotes the trace of a matrix. For any matrix norm $\|\cdot\|$, its dual norm $\|\cdot\|_*$ is defined, for any $B$, by $\|B\|_* = \sup_{\|A\| \le 1} \mathrm{trace}(B^T A)$. Denote $X_* = \sup_{x, x' \in X} \|x \, x'^T\|_*$. Let the Rademacher average with respect to the dual matrix norm be defined by
$$R_m := \mathbb{E}_{z,\sigma} \sup_{\tilde{x} \in X} \Big\| \frac{1}{m} \sum_{i \in \mathbb{N}_m} \sigma_i y_i x_i \tilde{x}^T \Big\|_*. \qquad (2.4)$$

Now we can state the generalization bounds for similarity learning, which are closely related to the Rademacher average with respect to the dual matrix norm $\|\cdot\|_*$.

Theorem 1. Let $A_z$ be the solution to algorithm 2.2. Then for any $0 < \delta < 1$, with probability at least $1 - \delta$, there holds
$$\mathcal{E}_z(A_z) - \mathcal{E}(A_z) \le \frac{6 R_m}{r\lambda} + \frac{2 X_*}{r\lambda} \sqrt{\frac{2 \log \frac{1}{\delta}}{m}}. \qquad (2.5)$$


The proof of theorem 1 is given in section 4. Recently, some studies (Cao et al., 2012; Kar & Jain, 2011; Kumar, Niculescu-Mizil, Kavukcuoglu, & Daumé, 2012) have given generalization bounds for similarity (kernel) learning where the involved empirical term is a U-statistic. Hence, the natural idea in those studies is to explore properties of U-statistics (Clémençon, Lugosi, & Vayatis, 2008; de la Peña & Giné, 1999) for analyzing the related similarity learning formulations. The proof of theorem 1 differs from these earlier approaches (Cao et al., 2012; Kar & Jain, 2011) since the empirical term, equation 2.1, in formulation 2.2 is not a U-statistic (there is more discussion of this topic in section 3). Following the exact argument of the proof of theorem 1, a similar result also holds if we switch the positions of $\mathcal{E}_z(A_z)$ and $\mathcal{E}(A_z)$: for any $0 < \delta < 1$, with probability at least $1 - \delta$, we have
$$\mathcal{E}(A_z) \le \mathcal{E}_z(A_z) + \frac{6 R_m}{r\lambda} + \frac{2 X_*}{r\lambda} \sqrt{\frac{2 \log \frac{1}{\delta}}{m}}.$$
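To get a feel for the scale of these bounds, the right-hand side of equation 2.5 is easy to evaluate numerically once $R_m$ and $X_*$ are known. The sketch below is ours, with purely illustrative constants (it plugs in Frobenius-type estimates of the form $X_* \le c^2$, $R_m \le c^2/\sqrt{m}$; see example 2 below):

```python
import math

def theorem1_rhs(R_m, X_star, r, lam, delta, m):
    """Right-hand side of bound (2.5): 6 R_m/(r lam) + (2 X_*/(r lam)) sqrt(2 log(1/delta)/m)."""
    return 6 * R_m / (r * lam) + (2 * X_star / (r * lam)) * math.sqrt(2 * math.log(1 / delta) / m)

c, r, lam, delta = 1.0, 1.0, 1.0, 0.05   # illustrative values only
for m in (100, 10000):
    print(m, theorem1_rhs(c**2 / math.sqrt(m), c**2, r, lam, delta, m))
```

Both terms decay like $1/\sqrt{m}$, so the gap between the empirical and true errors shrinks at the usual parametric rate.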

The second contribution of this letter is to investigate the theoretical relationship between similarity learning, equation 2.2, and the generalization error of the linear classifier built from the learned metric $A_z$. We show that the generalization bound for similarity learning gives an upper bound for the generalization error of a linear classifier produced by the linear support vector machine (SVM; Vapnik, 1998), defined by
$$f_z = \arg\min_{f \in F_z} \frac{1}{m} \sum_{i \in \mathbb{N}_m} \big(1 - y_i f(x_i)\big)_+ \quad \text{s.t.} \quad \Omega(f) := \sum_{j \in \mathbb{N}_m} |\alpha_j| \le 1/r, \qquad (2.6)$$
where $F_z = \{f : f = \sum_{j \in \mathbb{N}_m} \alpha_j K_{A_z}(x_j, \cdot), \ \alpha_j \in \mathbb{R}\}$ is a sample-dependent hypothesis space. The empirical error of $f \in F_z$ associated with $z$ is defined by
$$\mathcal{E}_z(f) = \frac{1}{m} \sum_{i \in \mathbb{N}_m} \big(1 - y_i f(x_i)\big)_+.$$
The true generalization error is defined as
$$\mathcal{E}(f) = \int_Z \big(1 - y f(x)\big)_+ d\rho(x, y).$$
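Equation 2.6 is a hinge-loss minimization over an $L^1$-ball of radius $1/r$ in the coefficients $\alpha$. One simple way to approximate its solution is projected subgradient descent with the standard sort-based projection onto the $L^1$ ball; the sketch below is ours (the letter only defines the minimizer, not an algorithm for computing it), with the learned similarity replaced by the identity matrix for illustration:

```python
import numpy as np

def project_l1(v, radius):
    """Euclidean projection of v onto {w : ||w||_1 <= radius} (sort-based)."""
    if np.abs(v).sum() <= radius:
        return v
    u = np.sort(np.abs(v))[::-1]
    css = np.cumsum(u)
    idx = np.arange(1, len(u) + 1)
    rho = np.nonzero(u * idx > css - radius)[0][-1]
    theta = (css[rho] - radius) / (rho + 1.0)
    return np.sign(v) * np.maximum(np.abs(v) - theta, 0.0)

def hinge_error(alpha, K, y):
    """(1/m) sum_i (1 - y_i f(x_i))_+ with f(x_i) = sum_j alpha_j K[i, j]."""
    return float(np.maximum(0.0, 1.0 - y * (K @ alpha)).mean())

def train_linear_svm(K, y, r, steps=300, lr=0.5):
    """Approximate f_z of eq. 2.6: minimize the empirical hinge loss s.t. ||alpha||_1 <= 1/r."""
    m = len(y)
    alpha = np.zeros(m)
    for _ in range(steps):
        active = (1.0 - y * (K @ alpha)) > 0          # examples with positive hinge loss
        grad = -(K.T @ (y * active)) / m              # subgradient of the mean hinge loss
        alpha = project_l1(alpha - lr * grad, 1.0 / r)
    return alpha

rng = np.random.default_rng(1)
X = rng.normal(size=(40, 3))
y = np.sign(X[:, 0]) + (X[:, 0] == 0)                 # labels from the first coordinate
K = X @ X.T                                           # K_A(x_j, x_i) with A = identity
alpha = train_linear_svm(K, y, r=0.5)
print(hinge_error(alpha, K, y))                       # well below the value 1.0 at alpha = 0
```

The projection keeps every iterate feasible for the constraint $\Omega(f) \le 1/r$, which is exactly the sparsity-inducing constraint that makes the resulting classifier sparse in the coefficients.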


Now we are in a position to state the relationship between similarity learning and the generalization error of the linear classifier:

Theorem 2. Let $A_z$ and $f_z$ be defined by equations 2.2 and 2.6, respectively. Then for any $0 < \delta < 1$, with confidence at least $1 - \delta$, there holds
$$\mathcal{E}(f_z) \le \mathcal{E}_z(A_z) + \frac{4 R_m}{\lambda r} + \frac{2 X_*}{\lambda r} \sqrt{\frac{2 \log \frac{1}{\delta}}{m}}. \qquad (2.7)$$

The proof of theorem 2 is established in section 5. Theorems 1 and 2 depend critically on two terms: the constant $X_*$ and the Rademacher average $R_m$. Below, we list estimates of these two terms for different matrix norms. For any vector $x = (x_1, x_2, \ldots, x_d) \in \mathbb{R}^d$, denote $\|x\|_\infty = \max_{\ell \in \mathbb{N}_d} |x_\ell|$.

Example 1. Consider the sparse $L^1$-norm defined, for any $A \in S^d$, by $\|A\| = \sum_{k,\ell \in \mathbb{N}_d} |A_{k\ell}|$. Let $A_z$ and $f_z$ be defined, respectively, by equations 2.2 and 2.6. Then we have the following results:

a. $X_* \le \big(\sup_{x \in X} \|x\|_\infty\big)^2$ and $R_m \le 2 \big(\sup_{x \in X} \|x\|_\infty\big)^2 \sqrt{\frac{e \log(d+1)}{m}}$.

b. For any $0 < \delta < 1$, with confidence at least $1 - \delta$, there holds
$$\mathcal{E}_z(A_z) - \mathcal{E}(A_z) \le \frac{12 \big(\sup_{x \in X} \|x\|_\infty\big)^2}{r\lambda} \sqrt{\frac{e \log(d+1)}{m}} + \frac{2 \big(\sup_{x \in X} \|x\|_\infty\big)^2}{r\lambda} \sqrt{\frac{2 \log \frac{1}{\delta}}{m}}. \qquad (2.8)$$

c. For any $0 < \delta < 1$, with confidence at least $1 - \delta$, there holds
$$\mathcal{E}(f_z) \le \mathcal{E}_z(A_z) + \frac{4 \big(\sup_{x \in X} \|x\|_\infty\big)^2}{\lambda r} \sqrt{\frac{2 e \log(d+1)}{m}} + \frac{2 \big(\sup_{x \in X} \|x\|_\infty\big)^2}{\lambda r} \sqrt{\frac{2 \log \frac{1}{\delta}}{m}}. \qquad (2.9)$$
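The constants of example 1 can be sanity-checked numerically: for the $L^1$ matrix norm, the dual norm is the entrywise maximum, so $\|v \tilde{x}^T\|_* = \|v\|_\infty \|\tilde{x}\|_\infty$, and $R_m$ can be estimated by Monte Carlo. This sketch is ours; the sup over the domain $X$ is approximated by the sampled points themselves, so the estimate is if anything an underestimate of the true sup:

```python
import numpy as np

rng = np.random.default_rng(2)
m, d, trials = 50, 10, 200
X = rng.uniform(-1.0, 1.0, size=(m, d))      # so sup_x ||x||_inf <= 1
y = rng.choice([-1.0, 1.0], size=m)

estimates = []
for _ in range(trials):
    sigma = rng.choice([-1.0, 1.0], size=m)
    v = (sigma * y) @ X / m                   # (1/m) sum_i sigma_i y_i x_i
    # || v x~^T ||_* = ||v||_inf * ||x~||_inf for the dual (entrywise-max) norm
    estimates.append(np.max(np.abs(v)) * np.max(np.abs(X)))
R_hat = float(np.mean(estimates))

bound = 2.0 * np.sqrt(np.e * np.log(d + 1) / m)   # example 1(a) with sup ||x||_inf = 1
print(R_hat, bound)                               # the estimate sits well below the bound
```

The empirical average lands comfortably under the theoretical bound, reflecting the slack in the Khinchin-type argument of section 6.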

For any vector $x \in \mathbb{R}^d$, let $\|x\|_F$ be the standard Euclidean norm. Considering the regularized similarity learning with the Frobenius matrix norm, we have the following result:

Example 2. Consider the Frobenius matrix norm defined, for any $A \in S^d$, by $\|A\| = \big(\sum_{k,\ell \in \mathbb{N}_d} |A_{k\ell}|^2\big)^{1/2}$. Let $A_z$ and $f_z$ be defined by equations 2.2 and 2.6, respectively. Then we have the following estimates:

a. $X_* \le \big(\sup_{x \in X} \|x\|_F\big)^2$ and $R_m \le \big(\sup_{x \in X} \|x\|_F\big)^2 \sqrt{\frac{1}{m}}$.


b. For any $0 < \delta < 1$, with confidence at least $1 - \delta$, there holds
$$\mathcal{E}_z(A_z) - \mathcal{E}(A_z) \le \frac{6 \big(\sup_{x \in X} \|x\|_F\big)^2}{r\lambda \sqrt{m}} + \frac{2 \big(\sup_{x \in X} \|x\|_F\big)^2}{r\lambda} \sqrt{\frac{2 \log \frac{1}{\delta}}{m}}. \qquad (2.10)$$

c. For any $0 < \delta < 1$, with confidence at least $1 - \delta$, there holds
$$\mathcal{E}(f_z) \le \mathcal{E}_z(A_z) + \frac{4 \big(\sup_{x \in X} \|x\|_F\big)^2}{\lambda r \sqrt{m}} + \frac{2 \big(\sup_{x \in X} \|x\|_F\big)^2}{\lambda r} \sqrt{\frac{2 \log \frac{1}{\delta}}{m}}. \qquad (2.11)$$

We end this section with two remarks. First, theorem 2 and the examples show that a good similarity (i.e., a small $\mathcal{E}_z(A_z)$ for similarity learning) guarantees good classification (i.e., a small generalization error $\mathcal{E}(f_z)$). Second, the bounds in example 2 are consistent with those in Bellet et al. (2012).

3 Related Work

In this section, we discuss studies on similarity metric learning that are related to our work. Many similarity metric learning methods have been motivated by the intuition that the similarity score between examples in the same class should be larger than that between examples from distinct classes (Bar-Hillel, Hertz, Shental, & Weinshall, 2005; Cao et al., 2012; Chechik et al., 2010; Hoi et al., 2006; Jin et al., 2009; Maurer, 2008; Weinberger et al., 2005; Xing et al., 2002). Jin et al. (2009) established generalization bounds for regularized metric learning algorithms using the concept of uniform stability (Bousquet & Elisseeff, 2002), which, however, works only for strongly convex matrix regularization terms. Very recently, Cao et al. (2012) established generalization bounds for metric and similarity learning with general matrix-norm regularization using techniques of Rademacher averages and U-statistics. However, there were no theoretical links between the similarity metric learning and the generalization performance of classifiers based on the learned similarity matrix. Here, we focus on how to learn a good linear similarity function $K_A$ that guarantees a good classifier derived from the learned similarity function.
In addition, our formulation, equation 2.2, is quite distinct from similarity metric learning methods (Cao et al., 2012; Chechik et al., 2010), since they are based on pairwise or triplet-wise constraints and consider the following pairwise empirical


objective function:
$$\frac{1}{m(m-1)} \sum_{i, j = 1, i \neq j}^{m} \big(1 - y_i y_j (K_A(x_i, x_j) - r)\big)_+. \qquad (3.1)$$

Our equation 2.2 is less restrictive since the empirical objective function is defined over an average of similarity scores and does not require positive semidefiniteness of the similarity function $K$. Balcan et al. (2008) developed a theory of $(\epsilon, \gamma, \tau)$-good similarity functions, defined as follows, which investigates the theoretical relationship between the properties of a similarity function and its performance in linear classification:

Definition 2 (Balcan & Blum, 2006). A similarity function $K$ is an $(\epsilon, \gamma, \tau)$-good similarity function in hinge loss for a learning problem $P$ if there exists a random indicator function $R(x)$ defining a probabilistic set of "reasonable points" such that the following conditions hold:

1. $\mathbb{E}_{(x,y) \sim P} [1 - y g(x)/\gamma]_+ \le \epsilon$, where $g(x) = \mathbb{E}_{(x',y') \sim P} [y' K(x, x') \mid R(x')]$.
2. $\Pr_{x'} [R(x')] \ge \tau$.
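Condition 1 of Definition 2 can be checked empirically on a finite sample: fix a subset of points as the "reasonable" set, average $y' K(x, x')$ over it, and measure the hinge loss at margin $\gamma$. The sketch below is ours (the choice of similarity, data, and reasonable set is purely illustrative):

```python
import numpy as np

def goodness_hinge(K, X, y, reasonable, gamma):
    """Empirical version of E[1 - y g(x)/gamma]_+ with
    g(x) = average of y' K(x, x') over the reasonable points."""
    R = np.flatnonzero(reasonable)
    g = np.array([np.mean([y[j] * K(x, X[j]) for j in R]) for x in X])
    return float(np.maximum(0.0, 1.0 - y * g / gamma).mean())

# Toy check: dot-product similarity on two well-separated classes.
rng = np.random.default_rng(3)
X = np.vstack([rng.normal(+2, 0.1, (20, 2)), rng.normal(-2, 0.1, (20, 2))])
y = np.array([1.0] * 20 + [-1.0] * 20)
reasonable = np.ones(40, dtype=bool)        # tau = 1: every point is reasonable
eps = goodness_hinge(lambda a, b: float(a @ b), X, y, reasonable, gamma=1.0)
print(eps)   # near 0: every point clears the margin for this toy data
```

A small value of `eps` here corresponds to a small $\epsilon$ in the definition, which by theorem 3 below translates into a low-error linear separator over the landmark features.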

The first condition can be interpreted as "a $1 - \epsilon$ proportion of points $x$ are on average $2\gamma$ more similar to random reasonable points of the same class than to random reasonable points of the distinct classes," and the second condition as "at least a $\tau$ proportion of the points should be reasonable." The following theorem implies that, given an $(\epsilon, \gamma, \tau)$-good similarity function and enough landmarks, there exists a separator $\alpha$ with error arbitrarily close to $\epsilon$.

Theorem 3 (Balcan & Blum, 2006). Let $K$ be an $(\epsilon, \gamma, \tau)$-good similarity function in hinge loss for a learning problem $P$. For any $\epsilon_1 > 0$ and $0 < \delta \le \gamma \epsilon_1 / 4$, let $S = \{x_1, \ldots, x_{d_{land}}\}$ be a potentially unlabeled sample of $d_{land} = \frac{2}{\tau} \big(\log(2/\delta) + 16 \frac{\log(2/\delta)}{(\epsilon_1 \gamma)^2}\big)$ landmarks drawn from $P$. Consider the mapping $\phi^S_i(x) = K(x, x_i)$, $i \in \{1, \ldots, d_{land}\}$. Then, with probability at least $1 - \delta$ over the random sample $S$, the induced distribution $\phi^S(P)$ in $\mathbb{R}^{d_{land}}$ has a linear separator $\alpha$ of error at most $\epsilon + \epsilon_1$ at margin $\gamma$.

Balcan et al. (2008) mentioned that the linear separator can be estimated by solving the following linear program given $d_u$ potentially unlabeled samples and $d_l$ labeled samples:
$$\min_{\alpha} \sum_{i=1}^{d_l} \Big(1 - \sum_{j=1}^{d_u} \alpha_j y_i K(x_i, x'_j)\Big)_+ \quad \text{s.t.} \quad \sum_{j=1}^{d_u} |\alpha_j| \le 1/\gamma. \qquad (3.2)$$


Algorithm 3.2 is quite similar to the linear SVM, equation 2.6, that we use in this letter. Our work is distinct from Balcan et al. (2008) in two respects. First, the similarity function $K$ is predefined in algorithm 3.2, while we aim to learn a similarity function $K_{A_z}$ from the regularized similarity learning formulation, equation 2.2. Second, although the separators are both trained by the linear SVM, the classification algorithm, equation 3.2, used in Balcan et al. (2008) was designed using two different sets of examples: a set of labeled samples of size $d_l$ to train the classifier and another set of unlabeled samples of size $d_u$ to define the mapping $\phi^S$. In this letter, we use the same set of training samples for both similarity learning, equation 2.2, and the classification algorithm, equation 2.6. The recent work by Bellet et al. (2012) is closest to ours. Specifically, they considered the similarity learning formulation (see equation 2.2) with Frobenius-norm regularization. Generalization bounds for similarity learning were derived using uniform stability arguments (Bousquet & Elisseeff, 2002), which cannot deal with, for instance, the $L^1$-norm and $(2,1)$-norm regularization terms. In addition, their results on the relationship between similarity learning and the classification performance of the learned matrix were quoted from Balcan et al. (2008) and hence require two separate sets of samples to train the classifier. Most recently, there has been considerable interest in two-stage approaches for multiple kernel learning (Cortes, Mohri, & Rostamizadeh, 2010b; Kar, 2013), which perform competitively with one-stage approaches (Lanckriet et al., 2004; Varma & Babu, 2009). In particular, Kar (2013) studied generalization guarantees for the following regularization formulation for learning a similarity (kernel) function, proposed in Kumar et al. (2012):

$$\arg\min_{\mu \ge 0} \frac{2}{m(m-1)} \sum_{1 \le i < j \le m} \big(1 - y_i y_j K_\mu(x_i, x_j)\big)_+ + \frac{\lambda}{2} \Omega(\mu), \qquad (3.3)$$
where $K_\mu = \sum_{\ell=1}^{p} \mu_\ell K_\ell$ is a positive linear combination of base kernels $\{K_\ell : \ell = 1, 2, \ldots, p\}$, $\lambda > 0$, and $\Omega(\cdot)$ is a regularization term that can be, for instance, the Frobenius norm or the $L^1$-norm. Specifically, Kar (2013) established elegant generalization bounds for this two-stage multiple kernel learning using techniques of Rademacher averages (Bartlett & Mendelson, 2002; Kakade, Sridharan, & Tewari, 2008; Kakade, Shalev-Shwartz, & Tewari, 2012) and U-statistics (Cao et al., 2012; Clémençon et al., 2008). The empirical error term, equation 2.1, in our formulation, equation 2.2, is not a U-statistic, so the techniques in Kar (2013) and Cao et al. (2012) cannot be applied directly to our case. Kar and Jain (2011, 2012) introduced an extended framework of Balcan and Blum (2006) and Balcan et al. (2008) in the general setting of supervised learning. The authors proposed a general goodness criterion for similarity
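The pairwise objective 3.3 is also easy to state in code. The sketch below is ours (the base kernels, data, and the quadratic choice of $\Omega$ are illustrative assumptions):

```python
import numpy as np

def mkl_objective(mu, Ks, y, lam, omega=lambda mu: float(mu @ mu)):
    """Objective (3.3): (2/(m(m-1))) sum_{i<j} [1 - y_i y_j K_mu(x_i, x_j)]_+
    + (lam/2) Omega(mu), with K_mu = sum_l mu_l K_l and mu >= 0."""
    m = len(y)
    Kmu = np.tensordot(mu, Ks, axes=1)          # sum_l mu_l K_l, shape (m, m)
    iu = np.triu_indices(m, k=1)                # the pairs i < j
    hinge = np.maximum(0.0, 1.0 - (np.outer(y, y) * Kmu)[iu])
    return float(2.0 / (m * (m - 1)) * hinge.sum() + 0.5 * lam * omega(mu))

rng = np.random.default_rng(4)
X = rng.normal(size=(10, 3))
y = rng.choice([-1.0, 1.0], size=10)
Ks = np.stack([X @ X.T,                                          # linear kernel
               np.exp(-((X[:, None] - X[None]) ** 2).sum(-1))])  # Gaussian kernel
print(mkl_objective(np.array([0.5, 0.5]), Ks, y, lam=0.1))
```

Unlike equation 2.1, this empirical term averages over unordered pairs, which is precisely what makes it a U-statistic and lets the U-statistics machinery cited above apply to it but not to our formulation.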


functions, which can handle general supervised learning tasks and also subsumes the goodness condition of Balcan et al. (2008). There, efficient algorithms were constructed with provable generalization error bounds. The main distinction between that work and ours is that we aim to learn a similarity function, whereas in their work a similarity function is defined in advance.

4 Generalization Bounds for Similarity Learning

In this section, we establish generalization bounds for the similarity learning formulation, equation 2.2, with general matrix-norm regularization terms. Recall that the true error for similarity learning is defined by
$$\mathcal{E}(A) = \int_Z \Big(1 - \frac{1}{r} \int_Z y y' K_A(x, x')\, d\rho(x', y')\Big)_+ d\rho(x, y).$$

The target of generalization analysis for similarity learning is to bound the empirical error $\mathcal{E}_z(A_z)$ by the true error $\mathcal{E}(A_z)$. By the definition of $A_z$, we know that
$$\mathcal{E}_z(A_z) + \lambda \|A_z\| \le \mathcal{E}_z(0) + \lambda \|0\| = 1,$$
which implies that $\|A_z\| \le 1/\lambda$. Denote
$$\mathcal{A} = \big\{ A \in S^d : \|A\| \le 1/\lambda \big\}.$$
Hence, one can easily see that the solution $A_z$ to formulation 2.2 belongs to $\mathcal{A}$. Now we are ready to prove the generalization bounds for similarity learning stated as theorem 1 in section 2.

Proof of theorem 1. Our proof is divided into two steps.

Step 1. Let $\mathbb{E}_z$ denote the expectation with respect to the samples $z$. Observe that $\mathcal{E}_z(A_z) - \mathcal{E}(A_z) \le \sup_{A \in \mathcal{A}} [\mathcal{E}_z(A) - \mathcal{E}(A)]$. Also, for any $z = (z_1, \ldots, z_k, \ldots, z_m)$ and $\tilde{z} = (z_1, \ldots, \tilde{z}_k, \ldots, z_m)$, $1 \le k \le m$, there holds
$$\Big| \sup_{A \in \mathcal{A}} [\mathcal{E}_z(A) - \mathcal{E}(A)] - \sup_{A \in \mathcal{A}} [\mathcal{E}_{\tilde{z}}(A) - \mathcal{E}(A)] \Big| \le \sup_{A \in \mathcal{A}} |\mathcal{E}_z(A) - \mathcal{E}_{\tilde{z}}(A)|$$
$$\le \frac{1}{m^2 r} \sup_{A \in \mathcal{A}} \Big\{ \sum_{i=1, i \ne k}^{m} \big| y_i y_k K_A(x_k, x_i) - y_i \tilde{y}_k K_A(\tilde{x}_k, x_i) \big| + \Big| \sum_{j \in \mathbb{N}_m} \big( y_k y_j K_A(x_k, x_j) - \tilde{y}_k y_j K_A(\tilde{x}_k, x_j) \big) \Big| \Big\}$$
$$\le \frac{2}{m^2 r} \sup_{A \in \mathcal{A}} \sum_{i \in \mathbb{N}_m} \big( |y_i y_k K_A(x_k, x_i)| + |y_i \tilde{y}_k K_A(\tilde{x}_k, x_i)| \big) \le \frac{4 X_*}{m r \lambda}.$$


Applying McDiarmid's inequality (McDiarmid, 1989; see lemma 1 in the appendix) to the term $\sup_{A \in \mathcal{A}} [\mathcal{E}_z(A) - \mathcal{E}(A)]$, with probability at least $1 - \delta$ there holds
$$\sup_{A \in \mathcal{A}} [\mathcal{E}_z(A) - \mathcal{E}(A)] \le \mathbb{E}_z \sup_{A \in \mathcal{A}} [\mathcal{E}_z(A) - \mathcal{E}(A)] + \frac{2 X_*}{r\lambda} \sqrt{\frac{2 \log(\frac{1}{\delta})}{m}}. \qquad (4.1)$$
Now we are in a position to estimate the expectation on the right-hand side of equation 4.1 by standard symmetrization techniques.

Step 2. We divide the term $\mathbb{E}_z \sup_{A \in \mathcal{A}} [\mathcal{E}_z(A) - \mathcal{E}(A)]$ into two parts as follows:
$$\mathbb{E}_z \sup_{A \in \mathcal{A}} [\mathcal{E}_z(A) - \mathcal{E}(A)] = \mathbb{E}_z \sup_{A \in \mathcal{A}} \Big[ \frac{1}{m} \sum_{i \in \mathbb{N}_m} \Big(1 - \frac{1}{mr} \sum_{j \in \mathbb{N}_m} y_i y_j K_A(x_i, x_j)\Big)_+ - \mathcal{E}(A) \Big]$$
$$= \mathbb{E}_z \sup_{A \in \mathcal{A}} \Big[ \frac{1}{m} \sum_{i \in \mathbb{N}_m} \Big(1 - \frac{1}{r} \mathbb{E}_{(x',y')} y_i y' K_A(x_i, x')\Big)_+ - \mathcal{E}(A) + \frac{1}{m} \sum_{i \in \mathbb{N}_m} \Big(1 - \frac{1}{mr} \sum_{j \in \mathbb{N}_m} y_i y_j K_A(x_i, x_j)\Big)_+ - \frac{1}{m} \sum_{i \in \mathbb{N}_m} \Big(1 - \frac{1}{r} \mathbb{E}_{(x',y')} y_i y' K_A(x_i, x')\Big)_+ \Big]$$
$$\le I_1 + I_2,$$
where
$$I_1 := \mathbb{E}_z \sup_{A \in \mathcal{A}} \Big[ \frac{1}{m} \sum_{i \in \mathbb{N}_m} \Big(1 - \frac{1}{r} \mathbb{E}_{(x',y')} y_i y' K_A(x_i, x')\Big)_+ - \mathcal{E}(A) \Big]$$
and
$$I_2 := \mathbb{E}_z \sup_{A \in \mathcal{A}} \Big[ \frac{1}{m} \sum_{i \in \mathbb{N}_m} \Big(1 - \frac{1}{mr} \sum_{j \in \mathbb{N}_m} y_i y_j K_A(x_i, x_j)\Big)_+ - \frac{1}{m} \sum_{i \in \mathbb{N}_m} \Big(1 - \frac{1}{r} \mathbb{E}_{(x',y')} y_i y' K_A(x_i, x')\Big)_+ \Big].$$

Now let $\bar{z} = \{\bar{z}_1, \bar{z}_2, \ldots, \bar{z}_m\}$ be an i.i.d. sample independent of $z$. We first estimate $I_1$ using standard symmetrization techniques. To this end, we rewrite $\mathcal{E}(A)$ as $\mathbb{E}_{\bar{z}} \frac{1}{m} \sum_{i \in \mathbb{N}_m} \big(1 - \frac{1}{r} \mathbb{E}_{(x',y')} \bar{y}_i y' K_A(\bar{x}_i, x')\big)_+$. Then we have
$$I_1 = \mathbb{E}_z \sup_{A \in \mathcal{A}} \Big[ \frac{1}{m} \sum_{i \in \mathbb{N}_m} \Big(1 - \frac{1}{r} \mathbb{E}_{(x',y')} y_i y' K_A(x_i, x')\Big)_+ - \mathbb{E}_{\bar{z}} \frac{1}{m} \sum_{i \in \mathbb{N}_m} \Big(1 - \frac{1}{r} \mathbb{E}_{(x',y')} \bar{y}_i y' K_A(\bar{x}_i, x')\Big)_+ \Big]$$
$$\le \mathbb{E}_{z,\bar{z}} \sup_{A \in \mathcal{A}} \Big[ \frac{1}{m} \sum_{i \in \mathbb{N}_m} \Big(1 - \frac{1}{r} \mathbb{E}_{(x',y')} y_i y' K_A(x_i, x')\Big)_+ - \frac{1}{m} \sum_{i \in \mathbb{N}_m} \Big(1 - \frac{1}{r} \mathbb{E}_{(x',y')} \bar{y}_i y' K_A(\bar{x}_i, x')\Big)_+ \Big].$$
By the standard Rademacher symmetrization technique and the contraction property of the Rademacher average (see lemma 2 in the appendix), we have
$$I_1 \le 2 \mathbb{E}_{z,\sigma} \sup_{A \in \mathcal{A}} \frac{1}{m} \sum_{i \in \mathbb{N}_m} \sigma_i \Big(1 - \frac{1}{r} \mathbb{E}_{(x',y')} y_i y' K_A(x_i, x')\Big)_+ \le \frac{4}{r} \mathbb{E}_{z,\sigma} \sup_{A \in \mathcal{A}} \Big| \Big\langle \frac{1}{m} \sum_{i \in \mathbb{N}_m} \sigma_i y_i x_i \int_Z y' x'^T d\rho(x', y'), A \Big\rangle \Big|$$
$$\le \frac{4}{r\lambda} \mathbb{E}_{z,\sigma} \Big\| \frac{1}{m} \sum_{i \in \mathbb{N}_m} \sigma_i y_i x_i \int_Z y' x'^T d\rho(x', y') \Big\|_* \le \frac{4}{r\lambda} \mathbb{E}_{z,\sigma} \sup_{\tilde{x} \in X} \Big\| \frac{1}{m} \sum_{i \in \mathbb{N}_m} \sigma_i y_i x_i \tilde{x}^T \Big\|_* = \frac{4 R_m}{r\lambda},$$
where the last inequality follows from the fact that $\langle A, B \rangle \le \|A\| \|B\|_* \le \frac{1}{\lambda} \|B\|_*$ for any $A \in \mathcal{A}$ and $B \in \mathbb{R}^{d \times d}$.

Similarly, we can estimate $I_2$. By the Lipschitz continuity of the hinge loss,
$$I_2 \le \mathbb{E}_z \sup_{A \in \mathcal{A}} \frac{1}{m} \sum_{i \in \mathbb{N}_m} \frac{1}{r} \Big| \mathbb{E}_{(x',y')} y_i y' K_A(x_i, x') - \frac{1}{m} \sum_{j \in \mathbb{N}_m} y_i y_j K_A(x_i, x_j) \Big|$$
$$= \mathbb{E}_z \sup_{A \in \mathcal{A}} \frac{1}{m} \sum_{i \in \mathbb{N}_m} \frac{1}{r} \Big| \Big\langle \mathbb{E}_{(x',y')} y_i y' x' x_i^T - \frac{1}{m} \sum_{j \in \mathbb{N}_m} y_i y_j x_j x_i^T, A \Big\rangle \Big|$$
$$\le \frac{1}{r\lambda} \mathbb{E}_z \sup_{x \in X} \Big\| \mathbb{E}_{(x',y')} y' x' x^T - \frac{1}{m} \sum_{j \in \mathbb{N}_m} y_j x_j x^T \Big\|_* = \frac{1}{r\lambda} \mathbb{E}_z \sup_{x \in X} \Big\| \mathbb{E}_{z'} \frac{1}{m} \sum_{j \in \mathbb{N}_m} y'_j x'_j x^T - \frac{1}{m} \sum_{j \in \mathbb{N}_m} y_j x_j x^T \Big\|_*.$$
Following the standard Rademacher symmetrization technique (Bartlett & Mendelson, 2002), from the above estimate we can further bound $I_2$ as follows:
$$I_2 \le \frac{1}{r\lambda} \mathbb{E}_{z,z'} \sup_{x \in X} \Big\| \frac{1}{m} \sum_{j \in \mathbb{N}_m} y'_j x'_j x^T - \frac{1}{m} \sum_{j \in \mathbb{N}_m} y_j x_j x^T \Big\|_* \le \frac{1}{r\lambda} \mathbb{E}_{z,z',\sigma} \sup_{x \in X} \Big\| \frac{1}{m} \sum_{j \in \mathbb{N}_m} \sigma_j \big( y'_j x'_j x^T - y_j x_j x^T \big) \Big\|_*$$
$$\le \frac{2}{r\lambda} \mathbb{E}_{z,\sigma} \sup_{x \in X} \Big\| \frac{1}{m} \sum_{j \in \mathbb{N}_m} \sigma_j y_j x_j x^T \Big\|_* = \frac{2 R_m}{r\lambda}.$$


The desired result follows by combining equation 4.1 with the above estimates for $I_1$ and $I_2$. This completes the proof of the theorem.

5 Guaranteed Classification via Good Similarity

In this section, we investigate the theoretical relationship between the generalization error of similarity learning and that of the linear classifier built from the learned similarity metric $K_{A_z}$. In particular, we show that the generalization error of similarity learning gives an upper bound for the generalization error of the linear classifier, as stated in theorem 2 in section 2. Before giving the proof of theorem 2, we first establish generalization bounds for the linear SVM algorithm, equation 2.6. Recall that the algorithm was defined by

$$f_z = \arg\min_{f \in F_z} \frac{1}{m} \sum_{i \in \mathbb{N}_m} \big(1 - y_i f(x_i)\big)_+ \quad \text{s.t.} \quad \Omega(f) := \sum_{j \in \mathbb{N}_m} |\alpha_j| \le 1/r,$$
where
$$F_z = \Big\{ f : f = \sum_{j \in \mathbb{N}_m} \alpha_j K_{A_z}(x_j, \cdot), \ \alpha_j \in \mathbb{R} \Big\}.$$
The generalization analysis of the linear SVM algorithm, equation 2.6, aims to estimate the term $\mathcal{E}(f_z) - \mathcal{E}_z(f_z)$. For any $z$, one can easily see that the solution to the algorithm belongs to the set $F_{z,r}$, where
$$F_{z,r} = \Big\{ f = \sum_{j \in \mathbb{N}_m} \alpha_j K_{A_z}(x_j, \cdot) : \Omega(f) = \sum_{j \in \mathbb{N}_m} |\alpha_j| \le 1/r, \ \alpha_j \in \mathbb{R} \Big\}.$$

To perform the generalization analysis, we seek a sample-independent set that contains, for any $z$, the sample-dependent hypothesis space $F_z$. Specifically, we define a sample-independent hypothesis space by
$$F_m = \Big\{ f = \sum_{i \in \mathbb{N}_m} \alpha_i K_A(u_i, \cdot) : \|A\| \le 1/\lambda, \ u_i \in X, \ \alpha_i \in \mathbb{R} \Big\}.$$

Recalling that $\|A_z\| \le \lambda^{-1}$ for any $z$, one can easily see that $F_z$ is a subset of $F_m$. It follows that for any $z$, the solution to the linear SVM algorithm, equation 2.6, lies in the set $F_{m,r}$, which is given by
$$F_{m,r} = \big\{ f \in F_m : \Omega(f) \le 1/r \big\}.$$
The following theorem states the generalization bounds of the linear SVM for classification:

Theorem 4. Let $f_z$ be the solution to the algorithm, equation 2.6. For any $0 < \delta < 1$, with probability at least $1 - \delta$, we have
$$\mathcal{E}(f_z) - \mathcal{E}_z(f_z) \le \frac{4 R_m}{\lambda r} + \frac{2 X_*}{\lambda r} \sqrt{\frac{2 \log \frac{1}{\delta}}{m}}. \qquad (5.1)$$

Proof. By McDiarmid's inequality, for any $0 < \delta < 1$, with confidence $1 - \delta$, there holds
$$\mathcal{E}(f_z) - \mathcal{E}_z(f_z) \le \sup_{f \in F_{z,r}} \big[\mathcal{E}(f) - \mathcal{E}_z(f)\big] \le \sup_{f \in F_{m,r}} \big[\mathcal{E}(f) - \mathcal{E}_z(f)\big] \le \mathbb{E}_z \sup_{f \in F_{m,r}} \big[\mathcal{E}(f) - \mathcal{E}_z(f)\big] + \frac{2 X_*}{\lambda r} \sqrt{\frac{2 \log \frac{1}{\delta}}{m}}.$$

Next, all we need is to estimate the first part of the right-hand side of the above inequality. Let $\bar{z}$ be a sample independent of $z$ and with the same distribution as $z$:
\[
\begin{aligned}
\mathbb{E}_z \Big[\sup_{f \in \mathcal{F}_{m,r}} \big(\mathcal{E}(f) - \mathcal{E}_z(f)\big)\Big]
&= \mathbb{E}_z \Big[\sup_{f \in \mathcal{F}_{m,r}} \mathbb{E}_{\bar z}\big(\mathcal{E}_{\bar z}(f) - \mathcal{E}_z(f)\big)\Big]
\le \mathbb{E}_{z, \bar z} \Big[\sup_{f \in \mathcal{F}_{m,r}} \big(\mathcal{E}_{\bar z}(f) - \mathcal{E}_z(f)\big)\Big] \\
&\le 2\, \mathbb{E}_{z,\sigma} \sup_{\|A\| \le 1/\lambda} \ \sup_{\sum_j |\alpha_j| \le 1/r} \frac{1}{m} \sum_{i \in \mathbb{N}_m} \sigma_i \Big(1 - \sum_{j \in \mathbb{N}_m} \alpha_j y_i K_A(x_i, u_j)\Big)_+ \\
&\le 4\, \mathbb{E}_{z,\sigma} \sup_{\|A\| \le 1/\lambda} \ \sup_{\sum_j |\alpha_j| \le 1/r} \Big| \frac{1}{m} \sum_{i \in \mathbb{N}_m} \sigma_i \sum_{j \in \mathbb{N}_m} \alpha_j y_i K_A(x_i, u_j) \Big| \\
&\le \frac{4}{r}\, \mathbb{E}_{z,\sigma} \sup_{\|A\| \le 1/\lambda} \sup_{x \in \mathcal{X}} \Big| \Big\langle \frac{1}{m} \sum_{i \in \mathbb{N}_m} \sigma_i y_i x_i x^T,\ A \Big\rangle \Big|
\le \frac{4}{\lambda r}\, \mathbb{E}_{z,\sigma} \sup_{x \in \mathcal{X}} \Big\| \frac{1}{m} \sum_{i \in \mathbb{N}_m} \sigma_i y_i x_i x^T \Big\|_* .
\end{aligned}
\]
Here we also use the standard Rademacher symmetrization technique and the contraction property of the Rademacher average. This completes the proof.

Now we are in a position to give the proof of theorem 2.

Proof of theorem 2. If we take $\alpha^0 = \big(\frac{y_1}{mr}, \ldots, \frac{y_m}{mr}\big)^T$, then $f_z^0 = \frac{1}{mr} \sum_{j \in \mathbb{N}_m} y_j K_{A_z}(x_j, \cdot)$. One can easily see that $\sum_{j \in \mathbb{N}_m} |\alpha_j^0| = 1/r$, which means $f_z^0 \in \mathcal{F}_{z,r}$. From theorem 4 and the definition of $f_z$, we get
\[
\begin{aligned}
\mathcal{E}(f_z) &\le \mathcal{E}_z(f_z) + \frac{4 R_m}{\lambda r} + \frac{2 X_*}{\lambda r} \sqrt{\frac{2 \log \frac{1}{\delta}}{m}}
\le \mathcal{E}_z(f_z^0) + \frac{4 R_m}{\lambda r} + \frac{2 X_*}{\lambda r} \sqrt{\frac{2 \log \frac{1}{\delta}}{m}} \\
&= \frac{1}{m} \sum_{i \in \mathbb{N}_m} \Big(1 - \frac{1}{mr} \sum_{j \in \mathbb{N}_m} y_i y_j K_{A_z}(x_i, x_j)\Big)_+ + \frac{4 R_m}{\lambda r} + \frac{2 X_*}{\lambda r} \sqrt{\frac{2 \log \frac{1}{\delta}}{m}} \\
&= \mathcal{E}_z(A_z) + \frac{4 R_m}{\lambda r} + \frac{2 X_*}{\lambda r} \sqrt{\frac{2 \log \frac{1}{\delta}}{m}}.
\end{aligned}
\]
This completes the proof of the theorem.

6 Estimating Rademacher Averages

Theorems 1, 2, and 4 critically depend on the estimation of the Rademacher average $R_m$ defined by equation 2.4. In this section, we establish a


self-contained proof for this estimation and prove the examples listed in section 2. For notational simplicity, denote by $x_{i\ell}$ the $\ell$th variable of the $i$th sample $x_i \in \mathbb{R}^d$.

Proof of example 1. The dual norm of the $L^1$-norm is the $L^\infty$-norm. Hence,
\[
X_* = \sup_{x, x' \in \mathcal{X}} \sup_{\ell, k \in \mathbb{N}_d} |x_\ell\, (x')_k| = \sup_{x \in \mathcal{X}} \|x\|_\infty^2. \tag{6.1}
\]
Also, the Rademacher average can be rewritten as
\[
R_m = \mathbb{E}_{z,\sigma} \sup_{x \in \mathcal{X}} \Big\| \frac{1}{m} \sum_{j \in \mathbb{N}_m} \sigma_j y_j x_j x^T \Big\|_\infty
\le \Big[\sup_{x \in \mathcal{X}} \|x\|_\infty\Big] \, \mathbb{E}_{z,\sigma} \max_{\ell \in \mathbb{N}_d} \Big| \frac{1}{m} \sum_{j \in \mathbb{N}_m} \sigma_j y_j x_{j\ell} \Big|. \tag{6.2}
\]

Now let $U_\ell(\sigma) = \frac{1}{m} \sum_{j \in \mathbb{N}_m} \sigma_j y_j x_{j\ell}$ for any $\ell \in \mathbb{N}_d$. By Jensen's inequality, for any $\eta > 0$, we have
\[
e^{\eta^2 (\mathbb{E}_\sigma \max_{\ell \in \mathbb{N}_d} |U_\ell(\sigma)|)^2} - 1
\le \mathbb{E}_\sigma \big[ e^{\eta^2 (\max_{\ell \in \mathbb{N}_d} |U_\ell(\sigma)|)^2} \big] - 1
= \mathbb{E}_\sigma \max_{\ell \in \mathbb{N}_d} \big[ e^{\eta^2 |U_\ell(\sigma)|^2} - 1 \big]
\le \sum_{\ell \in \mathbb{N}_d} \mathbb{E}_\sigma \big[ e^{\eta^2 |U_\ell(\sigma)|^2} - 1 \big]. \tag{6.3}
\]
Furthermore, for any $\ell \in \mathbb{N}_d$, there holds
\[
\mathbb{E}_\sigma \big[ e^{\eta^2 |U_\ell(\sigma)|^2} \big] - 1
= \sum_{k \ge 1} \frac{1}{k!} \eta^{2k}\, \mathbb{E}_\sigma |U_\ell|^{2k}
\le \sum_{k \ge 1} \frac{1}{k!} \eta^{2k} (2k-1)^k \big(\mathbb{E}_\sigma |U_\ell|^2\big)^k
\le \sum_{k \ge 1} \big( 2 e \eta^2\, \mathbb{E}_\sigma |U_\ell|^2 \big)^k,
\]
where the first inequality follows from the Khinchin-type inequality (see lemma 3 in the appendix) and the second inequality holds due to Stirling's inequality, $e^{-k} k^k \le k!$. Now set $\eta = \big[ 2 \sqrt{e}\, \max_{\ell \in \mathbb{N}_d} (\mathbb{E}_\sigma |U_\ell|^2)^{1/2} \big]^{-1}$. The above quantity can then be upper-bounded by
\[
\mathbb{E}_\sigma \big[ e^{\eta^2 |U_\ell(\sigma)|^2} \big] - 1 \le \sum_{k \ge 1} 2^{-k} = 1, \qquad \forall \ell \in \mathbb{N}_d.
\]


Putting the above estimation back into equation 6.3 implies that
\[
e^{\eta^2 (\mathbb{E}_\sigma \max_{\ell \in \mathbb{N}_d} |U_\ell(\sigma)|)^2} - 1 \le d.
\]
That means
\[
\begin{aligned}
\mathbb{E}_\sigma \max_{\ell \in \mathbb{N}_d} |U_\ell(\sigma)|
&= \mathbb{E}_\sigma \max_{\ell \in \mathbb{N}_d} \Big| \frac{1}{m} \sum_{j \in \mathbb{N}_m} \sigma_j y_j x_{j\ell} \Big|
\le \sqrt{\log(d+1)}\; \eta^{-1} \\
&= 2 \sqrt{e \log(d+1)}\; \max_{\ell \in \mathbb{N}_d} \big( \mathbb{E}_\sigma |U_\ell|^2 \big)^{1/2}
= 2 \sqrt{e \log(d+1)}\; \max_{\ell \in \mathbb{N}_d} \Big( \mathbb{E}_\sigma \Big| \frac{1}{m} \sum_{j \in \mathbb{N}_m} \sigma_j y_j x_{j\ell} \Big|^2 \Big)^{1/2} \\
&= 2 \sqrt{e \log(d+1)}\; \max_{\ell \in \mathbb{N}_d} \Big( \frac{1}{m^2} \sum_{j,k \in \mathbb{N}_m} \mathbb{E}_\sigma[\sigma_j \sigma_k]\, y_j y_k x_{j\ell} x_{k\ell} \Big)^{1/2} \\
&= 2 \sqrt{e \log(d+1)}\; \max_{\ell \in \mathbb{N}_d} \Big( \frac{1}{m^2} \sum_{j \in \mathbb{N}_m} (x_{j\ell})^2 \Big)^{1/2}
\le 2 \Big[\sup_{x \in \mathcal{X}} \|x\|_\infty\Big] \sqrt{\frac{e \log(d+1)}{m}}. \tag{6.4}
\end{aligned}
\]
Putting the above estimation back into equation 6.2 implies that
\[
R_m \le 2 \Big[\sup_{x \in \mathcal{X}} \|x\|_\infty\Big]^2 \sqrt{\frac{e \log(d+1)}{m}}.
\]
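As a numerical aside (our own sketch, not part of the letter's analysis), estimation 6.4 can be probed by Monte Carlo: with features scaled so that $\sup_x \|x\|_\infty = 1$, the empirical value of $\mathbb{E}_\sigma \max_\ell |U_\ell(\sigma)|$ should sit below $2\sqrt{e \log(d+1)/m}$ and shrink like $1/\sqrt{m}$. All names and the synthetic data below are our own:

```python
import numpy as np

rng = np.random.default_rng(0)

def rademacher_max(m, d, trials=2000):
    """Monte Carlo estimate of E_sigma max_l |(1/m) sum_j sigma_j y_j x_{jl}|
    for one fixed sample with entries in [-1, 1], so sup ||x||_inf = 1."""
    x = rng.uniform(-1.0, 1.0, size=(m, d))            # rows are the samples x_j
    y = rng.choice([-1.0, 1.0], size=m)                # labels y_j
    sigma = rng.choice([-1.0, 1.0], size=(trials, m))  # Rademacher draws
    u = (sigma * y) @ x / m                            # U_l(sigma), shape (trials, d)
    return float(np.abs(u).max(axis=1).mean())

d = 50
for m in (100, 400, 1600):
    bound = 2 * np.sqrt(np.e * np.log(d + 1) / m)  # estimation 6.4 with sup||x||_inf = 1
    print(f"m={m}: empirical {rademacher_max(m, d):.4f} <= bound {bound:.4f}")
```

The bound is loose by a constant factor, as expected; what matters is the common $\sqrt{\log(d+1)/m}$ decay.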

The other desired results in the example follow directly from combining the above estimation with theorems 1 and 2.

We now turn our attention to the similarity learning formulation, equation 2.2, with the Frobenius norm regularization.

Proof of example 2. The dual norm of the Frobenius norm is the Frobenius norm itself. Consequently, $X_* = \sup_{x, x' \in \mathcal{X}} \|x (x')^T\|_F = \sup_{x \in \mathcal{X}} \|x\|_F^2$. The Rademacher average can be rewritten as
\[
R_m = \mathbb{E}_{z,\sigma} \sup_{x \in \mathcal{X}} \Big\| \frac{1}{m} \sum_{j \in \mathbb{N}_m} \sigma_j y_j x_j x^T \Big\|_F .
\]


By Cauchy's inequality, there holds
\[
R_m = \mathbb{E}_{z,\sigma} \sup_{x \in \mathcal{X}} \|x\|_F \Big\| \frac{1}{m} \sum_{j \in \mathbb{N}_m} \sigma_j y_j x_j \Big\|_F
\le \Big[\sup_{x \in \mathcal{X}} \|x\|_F\Big] \, \mathbb{E}_z \Big( \mathbb{E}_\sigma \Big\| \frac{1}{m} \sum_{j \in \mathbb{N}_m} \sigma_j y_j x_j \Big\|_F^2 \Big)^{1/2}
\]
\[
= \Big[\sup_{x \in \mathcal{X}} \|x\|_F\Big] \, \mathbb{E}_z \Big( \frac{1}{m^2} \sum_{j \in \mathbb{N}_m} \|x_j\|_F^2 \Big)^{1/2}
\le \frac{\sup_{x \in \mathcal{X}} \|x\|_F^2}{\sqrt{m}}. \tag{6.5}
\]

Then the desired results can be derived by combining the above estimation with theorems 1 and 2.

The above generalization bound for the similarity learning formulation, equation 2.2, with the Frobenius norm regularization is consistent with that given in Bellet et al. (2012), where the result holds true under the assumption that $\sup_{x \in \mathcal{X}} \|x\|_F \le 1$. Next, we provide the estimation of $R_m$ for the mixed (2,1)-norm and the trace norm, respectively.

Example 3. Consider the similarity learning formulation, equation 2.2, with the mixed (2,1)-norm regularization $\|A\|_{(2,1)} = \sum_{k \in \mathbb{N}_d} \big( \sum_{\ell \in \mathbb{N}_d} |A_{k\ell}|^2 \big)^{1/2}$. Then we have the following estimation:

a. $X_* \le \big[\sup_{x \in \mathcal{X}} \|x\|_F\big] \big[\sup_{x \in \mathcal{X}} \|x\|_\infty\big]$ and
\[
R_m \le 2 \Big[\sup_{x \in \mathcal{X}} \|x\|_F\Big] \Big[\sup_{x \in \mathcal{X}} \|x\|_\infty\Big] \sqrt{\frac{e \log(d+1)}{m}}.
\]

b. For any $0 < \delta < 1$, with confidence at least $1 - \delta$, there holds
\[
\mathcal{E}_z(A_z) - \mathcal{E}(A_z) \le \frac{12 \big[\sup_{x \in \mathcal{X}} \|x\|_F\big] \big[\sup_{x \in \mathcal{X}} \|x\|_\infty\big]}{r \lambda} \sqrt{\frac{e \log(d+1)}{m}}
+ \frac{2 \big[\sup_{x \in \mathcal{X}} \|x\|_F\big] \big[\sup_{x \in \mathcal{X}} \|x\|_\infty\big]}{r \lambda} \sqrt{\frac{2 \log \frac{1}{\delta}}{m}}. \tag{6.6}
\]

c. For any $0 < \delta < 1$, with probability at least $1 - \delta$, there holds
\[
\mathcal{E}(f_z) \le \mathcal{E}_z(A_z) + \frac{4 \big[\sup_{x \in \mathcal{X}} \|x\|_F\big] \big[\sup_{x \in \mathcal{X}} \|x\|_\infty\big]}{\lambda r} \sqrt{\frac{2 e \log(d+1)}{m}}
+ \frac{2 \big[\sup_{x \in \mathcal{X}} \|x\|_F\big] \big[\sup_{x \in \mathcal{X}} \|x\|_\infty\big]}{\lambda r} \sqrt{\frac{2 \log \frac{1}{\delta}}{m}}.
\]


Proof. The dual norm of the (2,1)-norm is the (2,∞)-norm, which implies that $X_* = \sup_{x, x' \in \mathcal{X}} \|x (x')^T\|_{(2,\infty)} = \big[\sup_{x \in \mathcal{X}} \|x\|_F\big] \big[\sup_{x' \in \mathcal{X}} \|x'\|_\infty\big]$ and
\[
\mathbb{E}_{z,\sigma} \sup_{x \in \mathcal{X}} \Big\| \frac{1}{m} \sum_{j \in \mathbb{N}_m} \sigma_j y_j x_j x^T \Big\|_{(2,\infty)}
\le \Big[\sup_{x \in \mathcal{X}} \|x\|_F\Big] \, \mathbb{E}_{z,\sigma} \max_{\ell \in \mathbb{N}_d} \Big| \frac{1}{m} \sum_{j \in \mathbb{N}_m} \sigma_j y_j x_{j\ell} \Big|
\le 2 \Big[\sup_{x \in \mathcal{X}} \|x\|_F\Big] \Big[\sup_{x \in \mathcal{X}} \|x\|_\infty\Big] \sqrt{\frac{e \log(d+1)}{m}},
\]
where the last inequality follows from estimation 6.4. We complete the proof by combining the above estimation with theorems 1 and 2.

We briefly discuss the case of the trace-norm regularization, $\|A\| = \|A\|_{\mathrm{tr}}$. In this case, the dual norm of the trace norm is the spectral norm defined, for any $B \in S_d$, by $\|B\|_* = \max_{\ell \in \mathbb{N}_d} \sigma_\ell(B)$, where $\{\sigma_\ell : \ell \in \mathbb{N}_d\}$ are the singular values

of matrix $B$. Observe, for any $u, v \in \mathbb{R}^d$, that $\|u v^T\|_* = \|u\|_F \|v\|_F$. Hence, the constant $X_* = \sup_{x, x' \in \mathcal{X}} \|x (x')^T\|_* = \sup_{x \in \mathcal{X}} \|x\|_F^2$. In addition,
\[
\begin{aligned}
R_m &= \mathbb{E}_{z,\sigma} \sup_{x \in \mathcal{X}} \Big\| \frac{1}{m} \sum_{j \in \mathbb{N}_m} \sigma_j y_j x_j x^T \Big\|_*
= \mathbb{E}_{z,\sigma} \sup_{x \in \mathcal{X}} \|x\|_F \Big\| \frac{1}{m} \sum_{j \in \mathbb{N}_m} \sigma_j y_j x_j \Big\|_F \\
&= \Big[\sup_{x \in \mathcal{X}} \|x\|_F\Big] \, \mathbb{E}_{z,\sigma} \Big\| \frac{1}{m} \sum_{j \in \mathbb{N}_m} \sigma_j y_j x_j \Big\|_F
\le \Big[\sup_{x \in \mathcal{X}} \|x\|_F\Big] \, \mathbb{E}_z \Big( \mathbb{E}_\sigma \Big\| \frac{1}{m} \sum_{j \in \mathbb{N}_m} \sigma_j y_j x_j \Big\|_F^2 \Big)^{1/2} \\
&= \Big[\sup_{x \in \mathcal{X}} \|x\|_F\Big] \, \mathbb{E}_z \Big( \frac{1}{m^2} \sum_{j \in \mathbb{N}_m} \|x_j\|_F^2 \Big)^{1/2}
\le \frac{\sup_{x \in \mathcal{X}} \|x\|_F^2}{\sqrt{m}}. \tag{6.7}
\end{aligned}
\]

Indeed, the above estimation for $R_m$ is optimal. To see this, we observe from de la Peña and Giné (1999, theorem 1.3.2) that
\[
\Big( \mathbb{E}_\sigma \Big\| \frac{1}{m} \sum_{j \in \mathbb{N}_m} \sigma_j y_j x_j \Big\|_F^2 \Big)^{1/2}
\le \sqrt{2}\; \mathbb{E}_\sigma \Big\| \frac{1}{m} \sum_{j \in \mathbb{N}_m} \sigma_j y_j x_j \Big\|_F .
\]


Combining the above fact with equation 6.7, we can obtain
\[
R_m = \Big[\sup_{x \in \mathcal{X}} \|x\|_F\Big] \, \mathbb{E}_{z,\sigma} \Big\| \frac{1}{m} \sum_{j \in \mathbb{N}_m} \sigma_j y_j x_j \Big\|_F
\ge \frac{1}{\sqrt{2}} \Big[\sup_{x \in \mathcal{X}} \|x\|_F\Big] \, \mathbb{E}_z \Big( \mathbb{E}_\sigma \Big\| \frac{1}{m} \sum_{j \in \mathbb{N}_m} \sigma_j y_j x_j \Big\|_F^2 \Big)^{1/2}
= \frac{1}{\sqrt{2}} \Big[\sup_{x \in \mathcal{X}} \|x\|_F\Big] \, \mathbb{E}_z \Big( \frac{1}{m^2} \sum_{j \in \mathbb{N}_m} \|x_j\|_F^2 \Big)^{1/2}.
\]
Hence, the estimation, equation 6.7, for $R_m$ is optimal up to the constant $\frac{1}{\sqrt{2}}$. Furthermore, ignoring the further estimation $\frac{1}{m} \mathbb{E}_z \big( \sum_{j \in \mathbb{N}_m} \|x_j\|_F^2 \big)^{1/2} \le \frac{1}{\sqrt{m}} \sup_{x \in \mathcal{X}} \|x\|_F$, the above estimations show that the estimation of $R_m$ in the case of trace-norm regularization is the same as the estimation, equation 6.5, for the Frobenius norm regularization. Consequently, the generalization bounds for similarity learning and the relationship between similarity learning and the linear SVM are the same as those stated in example 2. It is a bit disappointing that there is no improvement when using the trace norm. The possible reason is that the spectral norm and the Frobenius norm of $B$ coincide when $B$ takes the rank-one form $B = x y^T$ for any $x, y \in \mathbb{R}^d$.

We end this section with a comment on an alternative way to estimate the Rademacher average $R_m$. Kakade et al. (2008, 2012) developed elegant techniques for estimating Rademacher averages of linear predictors. In particular, the following theorem was established:

Theorem 5 (Kakade et al., 2008, 2012). Let $\mathcal{W}$ be a closed convex set, let $f : \mathcal{W} \to \mathbb{R}$ be $\beta$-strongly convex with respect to $\|\cdot\|$, and assume that $f^*(0) = 0$. Assume $\mathcal{W} \subseteq \{w : f(w) \le f_{\max}\}$. Furthermore, let $\mathcal{X} = \{x : \|x\|_* \le X\}$ and $\mathcal{F} = \{w \mapsto \langle w, x \rangle : w \in \mathcal{W}, x \in \mathcal{X}\}$. Then we have
\[
R_n(\mathcal{F}) \le X \sqrt{\frac{2 f_{\max}}{\beta n}} .
\]
To apply theorem 5, we rewrite the Rademacher average $R_m$ as
\[
R_m = \mathbb{E}_{z,\sigma} \sup_{\tilde{x} \in \mathcal{X}} \Big\| \frac{1}{m} \sum_{i \in \mathbb{N}_m} \sigma_i y_i x_i \tilde{x}^T \Big\|_*
= \mathbb{E}_{z,\sigma} \sup_{\|A\| \le 1,\, A \in S_d} \ \sup_{\tilde{x} \in \mathcal{X}} \Big\langle \frac{1}{m} \sum_{i \in \mathbb{N}_m} \sigma_i y_i x_i \tilde{x}^T,\ A \Big\rangle
= \mathbb{E}_{z,\sigma} \sup_{\|A\| \le 1,\, A \in S_d} \ \sup_{\tilde{x} \in \mathcal{X}} \Big\langle \frac{1}{m} \sum_{i \in \mathbb{N}_m} \sigma_i y_i x_i,\ A \tilde{x} \Big\rangle. \tag{6.8}
\]


Now let $\mathcal{W} := \{w = A\tilde{x} : \|A\| \le 1,\ A \in S_d,\ \tilde{x} \in \mathcal{X}\}$. Let us consider the sparse $L^1$-norm defined, for any $A \in S_d$, by $\|A\|_1 = \sum_{k, \ell \in \mathbb{N}_d} |A_{k\ell}|$. In this case, we observe that $\|w\|_1 \le \|A\|_1 \|\tilde{x}\|_\infty \le \sup_{x \in \mathcal{X}} \|x\|_\infty$. Let $f(w) = \|w\|_q^2$ with $q = \frac{\log d}{\log d - 1}$, which is $\big(\frac{1}{\log d}\big)$-strongly convex with respect to the norm $\|\cdot\|_1$. Then for any $w \in \mathcal{W}$, we have $\|w\|_q^2 \le \|w\|_1^2 \le \sup_{x \in \mathcal{X}} \|x\|_\infty^2$. Combining these observations with equation 6.8 allows us to obtain the estimation
\[
R_m \le \Big[\sup_{x \in \mathcal{X}} \|x\|_\infty^2\Big] \sqrt{\frac{2 \log d}{m}} .
\]
Similarly, for the mixed (2,1)-norm $\|A\|_{(2,1)} = \sum_{k \in \mathbb{N}_d} \big( \sum_{\ell \in \mathbb{N}_d} |A_{k\ell}|^2 \big)^{1/2}$, observe that $\|w\|_1 \le \|A\|_{(2,1)} \|\tilde{x}\|_F \le \sup_{x \in \mathcal{X}} \|x\|_F$. Applying theorem 5 with $f(w) = \|w\|_q^2$ (again with $q = \frac{\log d}{\log d - 1}$), we will have the estimation
\[
R_m \le \Big[\sup_{x \in \mathcal{X}} \|x\|_F\Big] \Big[\sup_{x \in \mathcal{X}} \|x\|_\infty\Big] \sqrt{\frac{2 \log d}{m}} .
\]
Hence, the estimations for the above two cases are similar to our estimations in the above examples. Our estimation is more straightforward, as it directly uses the Khinchin-type inequality, in contrast to the advanced convex-analysis techniques used in Kakade et al. (2008, 2012). However, for the case of trace-norm regularization ($\|A\| = \|A\|_{\mathrm{tr}}$), one would expect, using the techniques in Kakade et al. (2008, 2012), that the estimation for $R_m$ is the same as that for the sparse $L^1$-norm. The main hurdle for such a result is the estimation of $\|w\|_1 = \|A\tilde{x}\|_1$ by the trace norm of $A$. Indeed, by the discussion following our estimation, equation 6.7, directly using the Khinchin-type inequality, we know that our estimation is optimal. Hence, one cannot expect the estimation of $R_m$ in the case of trace-norm regularization to be the same as that in the case of sparse $L^1$-norm regularization in our particular case of the similarity learning formulation, equation 2.2.

We end this section with an open question: it is not clear to us how to establish a generic result for estimating the Rademacher average $R_m$ given by equation 6.8. Such a generic result is expected to be very similar to theorem 5 of Kakade et al. (2008, 2012). Its main advantage would be a unifying estimation of $R_m$ for different matrix norms, which could then be instantiated into examples 1, 2, and 3.

7 Conclusion

In this letter, we have considered a regularized similarity learning formulation, equation 2.2. Its generalization bounds were established for various matrix-norm regularization terms such as the Frobenius norm, sparse $L^1$-norm, and mixed (2,1)-norm. We proved that the generalization error of the linear classifier based on the learned similarity function can be bounded by the derived generalization bound of similarity learning.
This guarantees that good generalization of similarity learning (see equation 2.2) with general matrix-norm regularization implies good classification generalization of the resulting linear classifier. Our techniques using the Rademacher complexity (Bartlett & Mendelson, 2002) and the important Khinchin-type inequality for Rademacher variables allow us to obtain new bounds for similarity learning with general matrix-norm regularization terms.

There are several possible directions for future work. First, we may consider similarity learning algorithms with general loss functions. It is expected that under some convexity conditions on the loss functions, better results could be obtained. Second, one usually focuses on the excess misclassification error when considering classification problems. Hence, in the future, we would like to study the theoretical link between the generalization bounds of similarity learning and the excess misclassification error of the classifier built from the learned similarity function.

Appendix

In this appendix, we collect the facts used for establishing the generalization bounds in sections 4 and 5.

Definition 3. We say that a function $f : \prod_{k=1}^m \Omega_k \to \mathbb{R}$ has bounded differences $\{c_k\}_{k=1}^m$ if, for all $1 \le k \le m$,
\[
\max_{z_1, \ldots, z_m,\, z_k'} \big| f(z_1, \ldots, z_{k-1}, z_k, z_{k+1}, \ldots, z_m) - f(z_1, \ldots, z_{k-1}, z_k', z_{k+1}, \ldots, z_m) \big| \le c_k .
\]

Lemma 1 (McDiarmid's inequality; McDiarmid, 1989). Suppose $f : \prod_{k=1}^m \Omega_k \to \mathbb{R}$ has bounded differences $\{c_k\}_{k=1}^m$. Then for all $\epsilon > 0$, there holds
\[
\Pr_z \big\{ f(z) - \mathbb{E}_z f(z) \ge \epsilon \big\} \le \exp\Big( \frac{-2 \epsilon^2}{\sum_{k=1}^m c_k^2} \Big).
\]
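To illustrate lemma 1 concretely (a sketch of our own, not part of the letter), take $f(z)$ to be the empirical mean of $m$ independent $[0,1]$-valued variables; then $c_k = 1/m$, $\sum_k c_k^2 = 1/m$, and the lemma gives $\Pr\{f(z) - \mathbb{E} f(z) \ge \epsilon\} \le e^{-2 m \epsilon^2}$:

```python
import math
import random

random.seed(0)

def deviation_probability(m, eps, trials=20000):
    """Empirical frequency of (sample mean - true mean) >= eps for
    uniform [0, 1] draws; the mean has bounded differences c_k = 1/m."""
    hits = 0
    for _ in range(trials):
        mean = sum(random.random() for _ in range(m)) / m
        if mean - 0.5 >= eps:
            hits += 1
    return hits / trials

m, eps = 200, 0.05
bound = math.exp(-2 * m * eps ** 2)  # lemma 1 with sum_k c_k^2 = 1/m
print(deviation_probability(m, eps), "<=", bound)
```

The empirical frequency is far below the bound here, since McDiarmid's inequality only uses boundedness, not the variance of the summands.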

We need the following contraction property of the Rademacher averages, which is essentially implied by theorem 4.12 in Ledoux and Talagrand (1991); see also Bartlett and Mendelson (2002) and Koltchinskii and Panchenko (2002).

Lemma 2. Let $\mathcal{F}$ be a class of uniformly bounded real-valued functions on $(\Omega, \mu)$ and $m \in \mathbb{N}$. If for each $i \in \{1, \ldots, m\}$, $\varphi_i : \mathbb{R} \to \mathbb{R}$ is a function having a Lipschitz constant $c_i$, then for any $\{x_i\}_{i \in \mathbb{N}_m}$,
\[
\mathbb{E} \sup_{f \in \mathcal{F}} \sum_{i \in \mathbb{N}_m} \epsilon_i \varphi_i(f(x_i)) \le 2\, \mathbb{E} \sup_{f \in \mathcal{F}} \sum_{i \in \mathbb{N}_m} c_i \epsilon_i f(x_i). \tag{A.1}
\]
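Lemma 2 can be checked exactly on a toy instance (our own construction, not from the letter): take a small finite class of linear functions, the contraction $\varphi_i(t) = |t|$ (1-Lipschitz with $\varphi_i(0) = 0$), and enumerate all $2^m$ sign vectors so both expectations are computed exactly:

```python
from itertools import product

# Finite function class: f_w(x) = <w, x> for a handful of weight vectors.
X = [(1.0, 0.5), (-0.3, 2.0), (0.7, -1.1), (2.0, 0.2), (-1.5, 1.0), (0.1, 0.9)]
W = [(1.0, 0.0), (0.0, 1.0), (0.5, -0.5), (-1.0, 0.3)]

def dot(w, x):
    return sum(a * b for a, b in zip(w, x))

def rademacher_sup(values):
    """Exact E_eps sup_f sum_i eps_i * values[f][i], enumerating all signs."""
    m = len(values[0])
    total = 0.0
    for eps in product((-1, 1), repeat=m):
        total += max(sum(e * v for e, v in zip(eps, row)) for row in values)
    return total / 2 ** m

plain = [[dot(w, x) for x in X] for w in W]            # f(x_i)
contracted = [[abs(v) for v in row] for row in plain]  # phi_i(t) = |t|, c_i = 1

lhs = rademacher_sup(contracted)
rhs = 2 * rademacher_sup(plain)
print(lhs, "<=", rhs)
```

Inequality A.1 holds on every such instance; the factor 2 is generally not tight for a specific class.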


Another important property of the Rademacher average, which is used in the proof of the generalization bounds for similarity learning, is the following Khinchin-type inequality (see de la Peña & Giné, 1999, theorem 3.2.2).

Lemma 3. For $n \in \mathbb{N}$, let $\{f_i \in \mathbb{R} : i \in \mathbb{N}_n\}$, and let $\{\sigma_i : i \in \mathbb{N}_n\}$ be a family of i.i.d. Rademacher random variables. Then for any $1 < p < q < \infty$, we have
\[
\Big( \mathbb{E}_\sigma \Big| \sum_{i \in \mathbb{N}_n} \sigma_i f_i \Big|^q \Big)^{1/q}
\le \Big( \frac{q-1}{p-1} \Big)^{1/2} \Big( \mathbb{E}_\sigma \Big| \sum_{i \in \mathbb{N}_n} \sigma_i f_i \Big|^p \Big)^{1/p}.
\]
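Because the expectation in lemma 3 is over only $2^n$ sign patterns, the inequality can be verified exactly on a small example (our own sketch, not part of the letter); with $p = 2$ and $q = 4$, the constant is $\sqrt{(q-1)/(p-1)} = \sqrt{3}$:

```python
from itertools import product
import math

def rademacher_moment(f, p):
    """Exact E_sigma |sum_i sigma_i f_i|^p by enumerating all sign vectors."""
    n = len(f)
    total = sum(abs(sum(s * fi for s, fi in zip(sigma, f))) ** p
                for sigma in product((-1, 1), repeat=n))
    return total / 2 ** n

f = [0.3, -1.2, 0.7, 2.0, -0.5]
p, q = 2, 4
lhs = rademacher_moment(f, q) ** (1 / q)
rhs = math.sqrt((q - 1) / (p - 1)) * rademacher_moment(f, p) ** (1 / p)
print(lhs, "<=", rhs)
```

Note that the second moment is computable in closed form, $\mathbb{E}_\sigma |\sum_i \sigma_i f_i|^2 = \sum_i f_i^2$, which gives a quick consistency check on the enumeration.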

Acknowledgments

We are grateful to the referees for their invaluable comments and suggestions on this letter. This work was supported by the EPSRC under grant EP/J001384/1.

References

Balcan, M.-F., & Blum, A. (2006). On a theory of learning with similarity functions. In Proceedings of the 23rd International Conference on Machine Learning (pp. 73–80). New York: ACM Press.
Balcan, M.-F., Blum, A., & Srebro, N. (2008). Improved guarantees for learning via similarity functions. In Proceedings of the 21st Annual Conference on Learning Theory (pp. 287–298). Madison, WI: Omnipress.
Bar-Hillel, A., Hertz, T., Shental, N., & Weinshall, D. (2005). Learning a Mahalanobis metric from equivalence constraints. Journal of Machine Learning Research, 6, 937–965.
Bartlett, P. L., & Mendelson, S. (2002). Rademacher and Gaussian complexities: Risk bounds and structural results. Journal of Machine Learning Research, 3, 463–482.
Bellet, A., Habrard, A., & Sebban, M. (2012). Similarity learning for provably accurate sparse linear classification. In Proceedings of the 29th International Conference on Machine Learning (pp. 1871–1878). New York: ACM Press.
Bousquet, O., & Elisseeff, A. (2002). Stability and generalization. Journal of Machine Learning Research, 2, 499–526.
Cao, Q., Guo, Z.-C., & Ying, Y. (2012). Generalization bounds for metric and similarity learning. Unpublished manuscript.
Chechik, G., Sharma, V., Shalit, U., & Bengio, S. (2010). Large scale online learning of image similarity through ranking. Journal of Machine Learning Research, 11, 1109–1135.
Chen, Y., Garcia, E. K., Gupta, M. R., Rahimi, A., & Cazzanti, L. (2009). Similarity-based classification: Concepts and algorithms. Journal of Machine Learning Research, 10, 747–776.
Clémençon, S., Lugosi, G., & Vayatis, N. (2008). Ranking and empirical minimization of U-statistics. Annals of Statistics, 36, 844–874.


Cortes, C., Mohri, M., & Rostamizadeh, A. (2010a). Generalization bounds for learning kernels. In Proceedings of the 27th International Conference on Machine Learning (pp. 247–254). New York: ACM Press.
Cortes, C., Mohri, M., & Rostamizadeh, A. (2010b). Two-stage learning kernel algorithms. In Proceedings of the 27th International Conference on Machine Learning (pp. 239–246). New York: ACM Press.
Davis, J. V., Kulis, B., Jain, P., Sra, S., & Dhillon, I. S. (2007). Information-theoretic metric learning. In Proceedings of the 24th International Conference on Machine Learning (pp. 209–216). New York: ACM Press.
de la Peña, V. H., & Giné, E. (1999). Decoupling: From dependence to independence. New York: Springer.
Hoi, S. C., Liu, W., Lyu, M. R., & Ma, W. Y. (2006). Learning distance metrics with contextual constraints for image retrieval. In IEEE Computer Society Conference on Computer Vision and Pattern Recognition (pp. 2072–2078). Piscataway, NJ: IEEE Press.
Jin, R., Wang, S., & Zhou, Y. (2009). Regularized distance metric learning: Theory and algorithm. In Y. Bengio, J. D. Lafferty, C. K. I. Williams, & A. Culotta (Eds.), Advances in neural information processing systems, 22 (pp. 862–870). Cambridge, MA: MIT Press.
Kakade, S. M., Shalev-Shwartz, S., & Tewari, A. (2012). Regularization techniques for learning with matrices. Journal of Machine Learning Research, 13, 1865–1890.
Kakade, S. M., Sridharan, K., & Tewari, A. (2008). On the complexity of linear prediction: Risk bounds, margin bounds, and regularization. In D. Koller, D. Schuurmans, Y. Bengio, & L. Bottou (Eds.), Advances in neural information processing systems, 21 (pp. 793–800). Cambridge, MA: MIT Press.
Kar, P. (2013). Generalization guarantees for a binary classification framework for two-stage multiple kernel learning. CoRR abs/1302.0406.
Kar, P., & Jain, P. (2011). Similarity-based learning via data driven embeddings. In Advances in neural information processing systems (pp. 1998–2006). Cambridge, MA: MIT Press.
Kar, P., & Jain, P. (2012). Supervised learning with similarity functions. In J. Shawe-Taylor, R. S. Zemel, P. L. Bartlett, F. C. N. Pereira, & K. Q. Weinberger (Eds.), Advances in neural information processing systems, 24 (pp. 215–223). Red Hook, NY: Curran.
Koltchinskii, V., & Panchenko, D. (2002). Empirical margin distributions and bounding the generalization error of combined classifiers. Annals of Statistics, 30(1), 1–50.
Kumar, A., Niculescu-Mizil, A., Kavukcuoglu, K., & Daumé III, H. (2012). A binary classification framework for two-stage multiple kernel learning. In Proceedings of the 29th International Conference on Machine Learning (pp. 1295–1302). New York: ACM Press.
Lanckriet, G. R. G., Cristianini, N., Bartlett, P., Ghaoui, L. E., & Jordan, M. I. (2004). Learning the kernel matrix with semidefinite programming. Journal of Machine Learning Research, 5, 27–72.
Ledoux, M., & Talagrand, M. (1991). Probability in Banach spaces: Isoperimetry and processes. New York: Springer.


Maurer, A. (2008). Learning similarity with operator-valued large-margin classifiers. Journal of Machine Learning Research, 9, 1049–1082.
McDiarmid, C. (1989). On the method of bounded differences. In Surveys in combinatorics. Cambridge: Cambridge University Press.
Saigo, H., Vert, J. P., Ueda, N., & Akutsu, T. (2004). Protein homology detection using string alignment kernels. Bioinformatics, 20, 1682–1689.
Smola, A. J., Óvári, Z. L., & Williamson, R. C. (2001). Regularization with dot-product kernels. In T. K. Leen, T. G. Dietterich, & V. Tresp (Eds.), Advances in neural information processing systems, 13 (pp. 308–314). Cambridge, MA: MIT Press.
Vapnik, V. N. (1998). Statistical learning theory. New York: Wiley.
Varma, M., & Babu, B. R. (2009). More generality in efficient multiple kernel learning. In Proceedings of the 26th International Conference on Machine Learning (pp. 1065–1072). New York: ACM Press.
Weinberger, K. Q., Blitzer, J., & Saul, L. K. (2005). Distance metric learning for large margin nearest neighbor classification. In L. Saul, Y. Weiss, & L. Bottou (Eds.), Advances in neural information processing systems, 17 (pp. 1473–1480). Cambridge, MA: MIT Press.
Wu, Q. (2013). Regularization networks with indefinite kernels. Journal of Approximation Theory, 166, 1–18.
Wu, Q., & Zhou, D. X. (2005). SVM soft margin classifiers: Linear programming versus quadratic programming. Neural Computation, 17, 1160–1187.
Xing, E. P., Jordan, M. I., Russell, S., & Ng, A. (2002). Distance metric learning with application to clustering with side-information. In S. Becker, S. Thrun, & K. Obermayer (Eds.), Advances in neural information processing systems, 15 (pp. 505–512). Cambridge, MA: MIT Press.
Ying, Y., & Campbell, C. (2009). Generalization bounds for learning the kernel. In Proceedings of the 22nd Annual Conference on Learning Theory.
Ying, Y., Campbell, C., & Girolami, M. (2009). Analysis of SVM with indefinite kernels. In Y. Bengio, D. Schuurmans, J. D. Lafferty, C. K. I. Williams, & A. Culotta (Eds.), Advances in neural information processing systems, 22 (pp. 2205–2213). Cambridge, MA: MIT Press.
Ying, Y., Huang, K., & Campbell, C. (2009). Sparse metric learning via smooth optimization. In Y. Bengio, D. Schuurmans, J. D. Lafferty, C. K. I. Williams, & A. Culotta (Eds.), Advances in neural information processing systems, 22 (pp. 2214–2222). Cambridge, MA: MIT Press.

Received June 13, 2013; accepted September 29, 2013.

