
IEEE TRANSACTIONS ON NEURAL NETWORKS AND LEARNING SYSTEMS, VOL. 26, NO. 6, JUNE 2015

Generalized Multiple Kernel Learning With Data-Dependent Priors

Qi Mao, Ivor W. Tsang, Shenghua Gao, and Li Wang

Abstract— Multiple kernel learning (MKL) and classifier ensembles are two mainstream methods for solving learning problems in which some sets of features/views are more informative than others, or in which the features/views within a given set are inconsistent. In this paper, we first present a novel probabilistic interpretation of MKL such that maximum entropy discrimination with a noninformative prior over multiple views is equivalent to the formulation of MKL. Instead of using the noninformative prior, we introduce a novel data-dependent prior based on an ensemble of kernel predictors, which enhances the prediction performance of MKL by leveraging the merits of the classifier ensemble. With the proposed probabilistic framework of MKL, we propose a hierarchical Bayesian model to learn the proposed data-dependent prior and the classification model simultaneously. The resultant problem is convex, and other information (e.g., instances with either missing views or missing labels) can be seamlessly incorporated into the data-dependent priors. Furthermore, a variety of existing MKL models can be recovered under the proposed MKL framework and can be readily extended to incorporate these priors. Extensive experiments demonstrate the benefits of our proposed framework in supervised and semisupervised settings, as well as in tasks with partial correspondence among multiple views.

Index Terms— Data fusion, dirty data, missing views, multiple kernel learning, partial correspondence, semisupervised learning.

I. INTRODUCTION

Learning with multiple sets of features (denoted as multiple views) attracts a great deal of attention in the literature. Several learning paradigms have been proposed to deal with this issue. Multiview learning [1], [2] focuses on the problem in which data examples are represented by multiple independent sets of features for the purpose of improving each view classifier through the others. It is extremely helpful in the semisupervised setting, where unlabeled data are used to ensure consistency among the hypotheses from each view [3], [4]. More recently, multiview learning with partially observed views has been studied [5], [6]. In this case, some

Manuscript received May 2, 2013; revised December 23, 2013 and June 23, 2014; accepted June 24, 2014. Date of publication July 25, 2014; date of current version May 15, 2015. This work was supported by the Australian Research Council Future Fellowship under Grant FT130100746. Q. Mao is with the Department of Electrical and Computer Engineering, Duke University, Durham, NC 27708 USA (e-mail: [email protected]). I. W. Tsang is with the Centre for Quantum Computation and Intelligent Systems, University of Technology Sydney, Ultimo NSW 2007, Australia (e-mail: [email protected]). S. Gao is with ShanghaiTech University, Shanghai 200031, China (e-mail: [email protected]). L. Wang is with the Department of Mathematics, University of California at San Diego, La Jolla, CA 92093 USA (e-mail: [email protected]). Color versions of one or more of the figures in this paper are available online at http://ieeexplore.ieee.org. Digital Object Identifier 10.1109/TNNLS.2014.2334137

views of one instance are missing. We name an instance with partially observed views a partially correspondent instance, as opposed to a fully correspondent instance. The implicit assumption here is that each view alone has sufficient information to learn a discriminative classifier, referred to as a view classifier. However, this assumption is quite restrictive because some views might not be as informative as others.

To tackle the discrepancy of multiple views, multiple kernel learning (MKL) [7], [8] has become a popular learning paradigm for combining multiple views in the supervised setting, in which base kernels are constructed from each view. The weighted combination of these kernels can be automatically learned from training data [8]. MKL is particularly effective for many diverse views because errors incurred by one view can be rectified by other views [9], and it has been successfully applied in many recognition tasks [10], [11]. The strategy of combining base kernels requires that each instance include all views; otherwise, it is infeasible for MKL owing to the different sizes of the base kernels. Hence, MKL cannot be directly used to model partially observed views. A simple way is to remove all the instances with missing views, but much useful information is lost as a result of this reduction.

Another way to handle the discrepancy of multiple views is the classifier combination (ensemble) approach [12]–[14], which has also been explored to combine multiple view classifiers to achieve a robust prediction. Because view classifiers are learned independently, they can take full advantage of the data and views of interest. This in some sense avoids the negative effect of other views. Moreover, the individual learning process also allows various learning models to be used for each view [e.g., a support vector machine (SVM) trained in the supervised setting, in which some instances with missing views can be dropped without any loss of information, and LapSVM [15] in the semisupervised setting, where a small number of labeled data and a large amount of unlabeled data are available].

In this paper, we introduce a new model which takes advantage of both approaches and learns a better classification model than either of them. More importantly, the new model can be easily adapted to other settings without information loss, such as those caused by dirty data with partially correspondent instances and missing labels, which are very challenging for existing MKL methods. To the best of our knowledge, this paper presents the first attempt to marry MKL and an ensemble of kernel predictors for learning from multiple views.



Fig. 1. Examples of partial correspondence problems. Each row of the largest rectangle represents an input object x̃ by M feature functions, that is, x = [ψ_1(x̃), ψ_2(x̃), ..., ψ_M(x̃)]. Gray region in the mth column: existence of ψ_m(x̃). White region: nonexistence of ψ_m(x̃). (a) Full correspondence. (b) Partial correspondence.

Specifically, we present a hierarchical Bayesian model where MKL is formulated on the fully correspondent instances with the view classifiers from the ensemble method as the prior. As previously noted, the prior from ensemble methods can be learned from either the supervised setting with partial views or the semisupervised setting with missing labels, so the proposed method is also readily adaptable to these settings. Extensive experiments demonstrate that our proposed method outperforms all the baseline methods, including state-of-the-art MKL and the ensemble of kernel predictors, in both supervised and semisupervised settings.

The main contributions of this paper are listed as follows.

1) We discover that MKL coincides with maximum entropy discrimination with a noninformative prior in the multiview setting. This discovery inspires a new probabilistic framework of MKL.
2) A novel data-dependent prior is introduced in terms of the view classifiers. The framework with the proposed data-dependent prior can naturally integrate MKL with an ensemble of kernel predictors as the prior.
3) We propose to learn the new prior from data by hierarchical Bayesian modeling so that the sparsity of the coefficients in both MKL and the ensemble of kernel predictors can be controlled flexibly. The resultant problem is convex.
4) A variety of existing MKL formulations and the ensemble of kernel predictors are special cases of the proposed framework. We also provide a natural way of integrating these MKL formulations with an ensemble of kernel predictors.
5) The proposed MKL framework (including other existing MKL models) can now be easily adapted to partially correspondent problems and semisupervised learning.

This paper is structured as follows. A new generalized problem setting is presented for MKL in Section II. In Section III, a novel probabilistic framework of MKL is presented. In Section IV, a hierarchical Bayesian model is proposed to learn a new data-dependent prior in a hierarchical structure. The extensions of various MKL methods and other related MKL works are discussed in Sections V and VI, respectively. Experimental results are shown in Section VII. We conclude this paper in Section VIII.

II. GENERALIZED LEARNING SETTING

We tackle supervised classification problems where multiple feature functions are given in advance. Suppose that the training data set {(x_1, y_1), ..., (x_N, y_N)} with a collection of feature functions {ψ_1, ..., ψ_M} is given, where x_i ∈ R^D and y_i ∈ {−1, +1}, ∀i. Here, ψ_m(x) is induced by the RKHS H_m for any given input x such that the inner product of x_i and x_j in H_m can be calculated by a certain kernel function, that is, κ_m(x_i, x_j) = ⟨ψ_m(x_i), ψ_m(x_j)⟩_{H_m}. Each input object x̃ is represented by a set of feature functions {ψ_1, ψ_2, ..., ψ_M} with M views, so the mth view of the input object x̃ is denoted as ψ_m(x̃). The feature representation of x̃ is denoted by x = [ψ_1(x̃), ψ_2(x̃), ..., ψ_M(x̃)].

In this paper, we deal with a generalized learning setting where each instance can be represented by multiple views, but all views may not perform equally well. This is similar to the MKL setting, but we do not assume that each instance must have a full set of views. In other words, we allow some views of one instance to be missing. This is quite similar to multiview learning with partially observed views [5], [6], but the views in this case might not consistently perform well, and all views are involved in the testing process without any loss of information. We also allow for missing labels, that is, unlabeled data for any given instance of the training data. Fig. 1 shows the two different settings. Next, we discuss these variants in detail.

A. Full Correspondence Setting

Every instance must be represented by the full set of views. In other words, views are correspondent according to the instance they are associated with. Fig. 1(a) shows this setting. This requirement naturally holds in a multiple kernel learning setting, where a set of predefined kernel functions is generally applied to the full set or a subset of the original features. Hence, all views have to be provided in advance. It is worth noting that a fully correspondent instance means that this instance must have a full set of views. Similarly, a fully correspondent data set means that all instances in this data set must have a full set of views.
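As a concrete illustration of this setup, the sketch below builds one base kernel per view from precomputed feature representations of a fully correspondent data set. It is a minimal NumPy sketch, not the authors' code; linear kernels are used purely for illustration, whereas the paper allows any kernel function κ_m induced by the RKHS H_m.

```python
import numpy as np

def view_kernels(views):
    """Compute one linear base kernel per view, kappa_m(x_i, x_j) = <psi_m(x_i), psi_m(x_j)>.

    `views` is a list of M arrays, each of shape (N, D_m): the m-th array stacks
    psi_m(x_i) for all N training objects (a fully correspondent data set).
    Returns a list of M kernel matrices of shape (N, N).
    """
    return [Psi @ Psi.T for Psi in views]

# Toy example with N = 5 objects and M = 2 views of different dimensionality.
rng = np.random.default_rng(0)
views = [rng.normal(size=(5, 3)), rng.normal(size=(5, 7))]
kernels = view_kernels(views)
print([K.shape for K in kernels])   # [(5, 5), (5, 5)]
```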


B. Partial Correspondence Setting

Each instance is represented by a set of views, but some views may be missing at random [as shown in Fig. 1(b)]. This setting arises when the original input objects are not available because of either copyright issues or neglect following the feature extraction procedure. It is similar to the data collection process in multiview learning with partially observed views, except that the test instances are assumed to be fully correspondent. This is reasonable when the test input objects are given and one can extract the full set of views from them. As the counterpart of multiview learning, this setting is MKL with partially observed views.

One interesting application is to merge the preprocessed features of the same data sets from different research groups. The input objects themselves may not be given because of privacy or copyright issues; moreover, the indexes associated with some instances may be missing or incorrect when processing is conducted over a randomly sampled subset of the original data by different groups. Hence, we do not have enough information to merge different views for one instance because of the missing view correspondence of some instances. Fortunately, we may have access to a small subset of input objects as exemplars, so we can construct a small set of fully correspondent instances. It is also possible that the labels of some instances are missing. Fig. 2 shows the learning diagram in the generalized setting for the merged data sets with missing views and labels. Besides feature views, some groups may only provide a kernel matrix or similarity matrix over a subset of instances without indexing. It is usually difficult for machine learning methods to explore all these special forms of data.

In the following experiments, we simulate the removal of view correspondence by setting the index of one instance to "?". In other words, this instance leads to M virtual instances, each with only one view available and the others missing, obtained by breaking the correspondence of a fully correspondent instance. An example is shown in Fig. 2: even though the second instance in group 2 has index 2, we treat it as two different instances after merging. This reorganization is equivalent to removing the correspondence of views in this instance, and it will be used in Section VII-B to simulate the partial correspondence setting.

Next, we propose a unified framework to learn a model in the generalized setting by exploring instances with both full correspondence and partial correspondence. Before discussing the proposed learning method, we present a novel probabilistic viewpoint of MKL.

Fig. 2. Learning diagram in a data-merging setting where both views and labels are allowed to be missing. In this example, three groups provide the preprocessed features for different subsets of input objects. The integer on the left-hand side is the index of the input object in the target data set. "?": missing index of an input object. y: class labels.
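To make the simulation of partial correspondence in Section II-B concrete: breaking the view correspondence of one fully correspondent instance yields M single-view virtual instances. A minimal sketch in plain Python; the dictionary-based representation is our illustrative choice, not the authors' code.

```python
def break_correspondence(instance):
    """Split one fully correspondent instance into M single-view virtual instances.

    `instance` is a dict {view_name: feature_vector, ..., "y": label}; each virtual
    instance keeps exactly one view (the others are treated as missing) and the label.
    """
    label = instance["y"]
    views = {k: v for k, v in instance.items() if k != "y"}
    return [{name: feat, "y": label} for name, feat in views.items()]

x2 = {"view1": [0.3, 1.2], "view2": [4.0, 0.1, 2.2], "y": +1}
print(break_correspondence(x2))  # two virtual instances, one per view
```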

III. MKL: A NOVEL PROBABILISTIC INTERPRETATION

Following the notations in Section II, we further define the feature function φ_m(x) = [ψ_m(x); 1] with a constant value 1 augmented at the end of the mth feature function ψ_m, ∀m. The augmented kernel function becomes K_m(x_i, x_j) = ⟨φ_m(x_i), φ_m(x_j)⟩ = κ_m(x_i, x_j) + 1. It is worth noting that the feature function can be applied on a subset of features [7]. The linear discriminative function with a given model θ = [θ_1; ...; θ_M] is formulated as the feature concatenation

$$ f(x;\theta) = \langle \theta, \Phi(x)\rangle = \sum_{m=1}^{M} \langle \theta_m, \phi_m(x)\rangle \qquad (1) $$

where Φ(x) = [φ_1(x); ...; φ_M(x)].

From the probabilistic point of view, we consider that θ is sampled from the joint density function p(Θ) over the random variables Θ. The expectation of the decision values f(x; θ) over all θ sampled from p(Θ) is used to predict the label of the given instance x

$$ \mathbb{E}_{\theta\sim p(\Theta)}[f(x;\theta)] = \int f(x;\theta)\, p(\theta)\, d\theta \qquad (2) $$

which is beneficial for reducing the risk of overfitting via model averaging [16]. To parameterize the joint density function p(Θ), a multivariate normal distribution with zero mean is generally used [17], [18]. One can assume $p(\theta) = \prod_{m=1}^{M} \mathcal{N}(\theta_m \mid 0, \sigma_m I)$, where σ_m I is the covariance matrix with σ_m ≥ 0 and I is the identity matrix. In this paper, we do not assume a specific expression for p(θ), but a prior distribution p_0(θ) as

$$ p_0(\theta) = \prod_{m=1}^{M} \mathcal{N}(\theta_m \mid 0, \sigma_m I) \qquad (3) $$

which might be different from p(θ). We propose to learn p(Θ) by minimizing the empirical risk and making p(θ) and p_0(θ) as close as possible [19]. In other words, the specific optimization problem is written as

$$ \min_{p(\Theta)\in\mathcal{P},\ \xi\ge 0}\ \mathrm{KL}(p(\Theta)\,\|\,p_0(\Theta)) + C\sum_{i=1}^{N}\xi_i \quad \text{s.t.}\ y_i\,\mathbb{E}_{\theta\sim p(\Theta)}[f(x_i;\theta)] \ge 1-\xi_i\ \ \forall i \qquad (4) $$

with $\mathcal{P} = \{p(\Theta) \mid \int p(\theta)\,d\theta = 1,\ p(\theta)\ge 0\}$, the relative entropy $\mathrm{KL}(p(\Theta)\|p_0(\Theta)) = \int p(\theta)\log(p(\theta)/p_0(\theta))\,d\theta$, and the tradeoff parameter C > 0. This resembles Bayesian regularization [20] in maximum likelihood estimation. Problem (4) is similar to the soft-margin SVM with slack variables ξ_i for error tolerance. The following proposition gives the explicit representation of the linear discriminative function.

Proposition 1: Problem (4) with the prior (3) is equivalent to the following optimization problem:

$$ \max_{\alpha\in\mathcal{A}}\ -\frac{1}{2}\sum_{m=1}^{M}\sigma_m\,\alpha^T (K_m\odot yy^T)\alpha + \mathbf{1}^T\alpha \qquad (5) $$

and the corresponding discriminative function is

$$ \mathbb{E}_{\theta\sim p(\Theta)}[f(x;\theta)] = \sum_{m=1}^{M}\sigma_m \langle \Lambda_m(\alpha),\ \phi_m(x)\rangle \qquad (6) $$

where α = [α_1; ...; α_N] is the vector of dual variables, 1 is the vector of all ones, the feasible set is $\mathcal{A} = \{\alpha\in\mathbb{R}^N \mid 0\le \alpha\le C\mathbf{1}\}$, y = [y_1; ...; y_N], the operator ⊙ is the element-wise product, and $\Lambda_m(\alpha) = \sum_{i=1}^{N}\alpha_i y_i \phi_m(x_i)$.

Proof: First, we obtain the dual problem of (4) by Lagrangian duality. The Lagrangian function with dual variables α ≥ 0 and τ_i ≥ 0 can be formulated as

$$ L(p(\Theta),\xi;\alpha,\tau) = \int p(\theta)\log\frac{p(\theta)}{p_0(\theta)}\,d\theta + C\sum_{i=1}^{N}\xi_i - \sum_{i=1}^{N}\alpha_i\Big(y_i\int \langle\theta,\Phi(x_i)\rangle\, p(\theta)\,d\theta - 1 + \xi_i\Big) - \sum_{i=1}^{N}\tau_i\xi_i. $$

By setting the derivatives with respect to the primal variables p(θ) and ξ to zero, we obtain the following KKT conditions together with the probability distribution constraint over p(Θ):

$$ p(\theta) = \frac{1}{Z(\alpha)}\,p_0(\theta)\exp\Big(\sum_i \alpha_i y_i \langle\theta,\Phi(x_i)\rangle\Big), \qquad C\mathbf{1} - \alpha - \tau = 0 \qquad (7) $$

where $Z(\alpha) = \int p_0(\theta)\exp\big(\sum_i \alpha_i y_i\langle\theta,\Phi(x_i)\rangle\big)\,d\theta$ is the partition function. Substituting them back into the Lagrangian function, we obtain its dual problem

$$ \max_{\alpha\in\mathcal{A}}\ -\log Z(\alpha) + \mathbf{1}^T\alpha. \qquad (8) $$

Let $\Lambda_m(\alpha) = \sum_{i=1}^{N}\alpha_i y_i\phi_m(x_i)$. The prior (3) is substituted into Z(α) as

$$ Z(\alpha) = \prod_{m=1}^{M}\int \mathcal{N}(\theta_m\mid 0,\sigma_m I)\exp(\langle\theta_m,\Lambda_m(\alpha)\rangle)\,d\theta_m = \prod_{m=1}^{M}\exp\Big(\frac{\sigma_m}{2}\langle\Lambda_m(\alpha),\Lambda_m(\alpha)\rangle\Big). \qquad (9) $$

By substituting it back into (8), we obtain (5). To obtain (6), we first compute the following derivative of log Z(α):

$$ \partial_{\alpha_i}\log Z(\alpha) = \langle \mathbb{E}_{\theta\sim p(\Theta)}[\theta],\ y_i\Phi(x_i)\rangle = \sum_{m=1}^{M}\sigma_m\langle\Lambda_m(\alpha),\ y_i\phi_m(x_i)\rangle $$

from the definition of the partition function and from (9), respectively. Clearly, the expectation of θ_m over p(Θ) is $\mathbb{E}_{\theta\sim p(\Theta)}[\theta_m] = \sigma_m\Lambda_m(\alpha)$. By substituting it into (2), we obtain (6). The proof is completed.

According to Proposition 1, we have the following corollary, which indicates that the proposed framework is indeed a probabilistic interpretation of MKL.

Corollary 1: Given the prior distribution (3) with fixed σ_m, ∀m, Problem (4) is a formulation of MKL where σ_m is the weight coefficient of the mth kernel K_m.

Although a novel probabilistic interpretation of MKL is revealed here, we are still confronted by two critical questions: 1) Can we construct a new prior to model the partial correspondence problem instead of the noninformative prior (3)? 2) How can we learn this prior from data? In the following, we will propose a method for tackling both problems in a unified framework.
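To make Corollary 1 concrete, the sketch below solves the dual (5) for fixed kernel weights σ_m with plain projected gradient ascent over the box A = {0 ≤ α ≤ C1} and then evaluates the decision rule (6) on a test point. This is a minimal illustrative NumPy sketch; the solver choice, step size, and iteration count are our assumptions, not part of the paper, and the base kernels κ_m are assumed to be precomputed.

```python
import numpy as np

def solve_mkl_dual(base_kernels, y, sigma, C, lr=1e-3, iters=5000):
    """Projected gradient ascent on (5): max_a -0.5 * sum_m sigma_m a'Q_m a + 1'a, 0 <= a <= C."""
    N = len(y)
    Q = [(K + 1.0) * np.outer(y, y) for K in base_kernels]   # Q_m = K_m ⊙ (y y^T), K_m = kappa_m + 1
    Q_sum = sum(s * Qm for s, Qm in zip(sigma, Q))
    alpha = np.zeros(N)
    for _ in range(iters):
        grad = 1.0 - Q_sum @ alpha          # gradient of the concave objective in (5)
        alpha = np.clip(alpha + lr * grad, 0.0, C)
    return alpha

def decision_value(base_kernels_test, y, alpha, sigma):
    """Decision rule (6): sum_m sigma_m * sum_i alpha_i y_i (kappa_m(x_i, x) + 1)."""
    return sum(s * ((k_col + 1.0) @ (alpha * y)) for s, k_col in zip(sigma, base_kernels_test))

# Toy usage: two precomputed base kernels on N = 4 training points, one test point.
rng = np.random.default_rng(1)
X = rng.normal(size=(4, 3)); y = np.array([1.0, -1.0, 1.0, -1.0])
kernels = [X @ X.T, np.exp(-0.5 * np.sum((X[:, None] - X[None]) ** 2, axis=-1))]
alpha = solve_mkl_dual(kernels, y, sigma=[0.5, 0.5], C=10.0)
x_test = rng.normal(size=3)
kernels_test = [X @ x_test, np.exp(-0.5 * np.sum((X - x_test) ** 2, axis=1))]
print(decision_value(kernels_test, y, alpha, sigma=[0.5, 0.5]))
```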


IV. BAYESIAN MKL WITH DATA-DEPENDENT PRIOR

We first explain why the prior (3) is noninformative, and then introduce a new data-dependent prior for modeling the partial correspondence problem. We then propose to learn the new prior via hierarchical Bayesian modeling.

A. New Data-Dependent Prior

As previously mentioned, the prior distribution p_0(Θ) plays an important role in the probabilistic modeling of MKL since the estimated distribution p(Θ) is constrained to be as close to p_0(Θ) as possible. To further investigate this prior, we make the following observation:

$$ \mathbb{E}_{\theta\sim p_0(\Theta)}[f(x;\theta)] = \sum_{m=1}^{M}\langle \mathbb{E}_{\theta\sim p_0(\Theta)}[\theta_m],\ \phi_m(x)\rangle = 0 $$

where the second equality follows from the multivariate normal distribution with zero mean (3). In other words, if the prior (3) is used for prediction, the decision value for any input x will be 0. Hence, this prior does not take the input into account. We call this prior a noninformative prior.

We construct a data-dependent prior by assuming a multivariate normal distribution with nonzero mean, that is, $\mathbb{E}_{\theta\sim p_0(\Theta)}[\theta_m] \ne 0$ for some m. This inspires us to define the following prior:

$$ p_0(\theta) = \prod_{m=1}^{M}\mathcal{N}(\theta_m \mid \mu_m\beta_m,\ \sigma_m I) \qquad (10) $$

where β_m is a feature vector in the RKHS H_m with an augmented feature 1 at the end and μ_m ∈ R is a parameter that controls the sparsity of the mean of p_0(θ). Therefore, we obtain $\mathbb{E}_{\theta\sim p_0(\Theta)}[\theta_m] = \mu_m\beta_m$, which might not be zero for some m. As a result, the decision function becomes

$$ \mathbb{E}_{\theta\sim p_0(\Theta)}[f(x;\theta)] = \sum_{m=1}^{M}\mu_m\langle\beta_m,\ \phi_m(x)\rangle \qquad (11) $$

where β = {β_1, ..., β_M} are predefined parameters. We will discuss the interpretation and assignment of β in Section IV-B.


We have the following proposition when the new prior is incorporated into the proposed framework.

Proposition 2: Following the representation of Proposition 1, Problem (4) with the prior (10) is equivalent to the following optimization problem:

$$ \max_{\alpha\in\mathcal{A}}\ -\sum_{m=1}^{M}\Big(\frac{1}{2}\sigma_m\,\alpha^T Q_m\alpha + \mu_m\langle\beta_m,\Lambda_m(\alpha)\rangle\Big) + \mathbf{1}^T\alpha \qquad (12) $$

and the corresponding discriminative function is

$$ \mathbb{E}_{\theta\sim p(\Theta)}[f(x;\theta)] = \sum_{m=1}^{M}\langle\hat{\theta}_m,\ \phi_m(x)\rangle \qquad (13) $$

where $\hat{\theta}_m = \sigma_m\Lambda_m(\alpha) + \mu_m\beta_m$ and $Q_m = K_m\odot yy^T$.

Proof: The proof can be derived similarly to that of Proposition 1. The differences are: 1) the partition function

$$ Z(\alpha) = \prod_{m=1}^{M}\int \mathcal{N}(\theta_m\mid\mu_m\beta_m,\sigma_m I)\exp(\langle\theta_m,\Lambda_m(\alpha)\rangle)\,d\theta_m = \prod_{m=1}^{M}\exp\Big(\frac{\sigma_m}{2}\langle\Lambda_m(\alpha),\Lambda_m(\alpha)\rangle + \mu_m\langle\beta_m,\Lambda_m(\alpha)\rangle\Big) $$

and 2) its derivative

$$ \partial_{\alpha_i}\log Z(\alpha) = \sum_{m=1}^{M}\big\langle \sigma_m\Lambda_m(\alpha) + \mu_m\beta_m,\ y_i\phi_m(x_i)\big\rangle. $$

By replacing the corresponding terms in Proposition 1, we can readily obtain (12) and (13).

B. Interpretation of β

The prior distribution (10) crucially depends on the assignment of the parameter β. Next, we give details of the interpretation and assignment of β for the novel learning paradigm of partial correspondence and its extension to semisupervised learning.

1) Partial Correspondence: As discussed in Section I, view classifiers can explore all instances in the training data regardless of view correspondences. By considering β_m = [w_m; b_m], ∀m, we observe that (11) coincides with the hypothesis of ensembles of kernel predictors (EKP) [14], where {w_m, b_m} are the solutions of SVMs on ψ_m with tradeoff parameter C_m > 0

$$ \min_{w_m, b_m}\ \frac{1}{2}\|w_m\|^2 + C_m\sum_{i=1}^{N}\ell\big(y_i,\ \langle w_m,\psi_m(x_i)\rangle_{\mathcal{H}_m} + b_m\big). \qquad (14) $$

SVM with the hinge loss [21] is used to solve (14) on the mth view, and C_m is chosen by cross validation on the training data. In fact, this is the first step of EKP [14]. The detailed connections of the proposed framework to EKP are presented in Section V.

If β_m = [w_m; b_m] is set by EKP, the proposed framework with the new data-dependent prior can be interpreted as searching for a better model in between MKL and EKP, because the framework (4) aims to minimize the empirical risk of MKL while staying as close to EKP as possible. It is also clear that the discriminative function (13) combines that of MKL and that of EKP. A further interpretation can be made via the equivalence to an ℓ1-SVM on the feature vector f_i formed by the decision values of instance x_i under all the kernel functions ψ_m, ∀m, and the chosen models {w_m, b_m}, that is, f_i = [f_i^1, ..., f_i^M] where $f_i^m = \langle w_m,\psi_m(x_i)\rangle_{\mathcal{H}_m} + b_m$. Therefore, the proposed framework with the new prior can employ both the original features and the features generated from the view classifiers.

2) Semisupervised Setting: Once β_m is assigned as the model of the mth view classifier, kernel-based classifiers in different learning settings can be used. In semisupervised learning, we can learn a classifier given a small number of labeled data and a large number of unlabeled data. The unlabeled data can be readily explored in our learning framework through β via the prior (10). Given the set of labeled data {(x_1, y_1), ..., (x_N, y_N)} and a set of unlabeled data {x_{N+1}, ..., x_{N'}}, the classifier learned by most semisupervised learning methods [15], [22] can be represented by $w_m = \sum_{i=1}^{N'}\eta_i^m\psi_m(x_i)$ in terms of a representer theorem. Similarly, the decision values f_i = [f_i^1, ..., f_i^M] with $f_i^m = \langle w_m,\psi_m(x_i)\rangle_{\mathcal{H}_m} + b_m$ for the ith labeled instance over ψ_m can be calculated from the learned model {w_m, b_m}.

Nevertheless, Proposition 2 does not make it possible to obtain the parameters μ and σ, and EKP needs a separate data set to learn μ in its second stage. In the following section, we propose to learn the distributions of μ and σ from data, simultaneously with the model parameter α, given β chosen by EKP in the first stage.
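The first stage described above (per-view SVMs whose models define β_m and whose decision values form f_i) can be sketched as follows. This is an illustrative Python example using scikit-learn rather than the LIBSVM setup of the paper, and the fixed C_m value stands in for the per-view cross validation the authors perform.

```python
import numpy as np
from sklearn.svm import SVC

def view_classifier_features(view_kernels_train, y, view_kernels_test, C_m=1.0):
    """Train one SVM per view on its precomputed kernel and return the decision-value
    features f_i = [f_i^1, ..., f_i^M] for the training and test instances."""
    f_train, f_test, models = [], [], []
    for K_tr, K_te in zip(view_kernels_train, view_kernels_test):
        clf = SVC(kernel="precomputed", C=C_m).fit(K_tr, y)
        models.append(clf)                               # implicitly defines beta_m = [w_m; b_m]
        f_train.append(clf.decision_function(K_tr))      # f_i^m = <w_m, psi_m(x_i)> + b_m
        f_test.append(clf.decision_function(K_te))
    return np.column_stack(f_train), np.column_stack(f_test), models

# Toy usage with two views and a linear kernel per view.
rng = np.random.default_rng(2)
X1, X2 = rng.normal(size=(20, 4)), rng.normal(size=(20, 6))
y = np.where(X1[:, 0] + 0.1 * rng.normal(size=20) > 0, 1, -1)
K_train = [X1 @ X1.T, X2 @ X2.T]
F_train, F_test, _ = view_classifier_features(K_train, y, K_train)
print(F_train.shape)   # (20, 2): one decision-value feature per view
```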

C. Hierarchical Bayesian Estimation

It is rather difficult to set the parameters μ = [μ_1; ...; μ_M] and σ = [σ_1; ...; σ_M] by hand, so we propose to learn them from the training data. We consider them as hidden random variables following certain distributions that are parameterized by hyperparameters. As discussed above, σ and μ are nonnegative: σ is the coefficient vector of the combination of kernels and μ is the coefficient vector of the combination of view classifiers. Depending on the property of the given data, both coefficient vectors can be sparse or nonsparse. Hence, it is appropriate to use the exponential distribution here, since its hyperparameter can control how much probability mass lies near to or far from zero.

Consequently, we define $p_0(\sigma\mid\lambda) = \prod_{m=1}^{M}p_0(\sigma_m\mid\lambda)$ with exponential distributions $p_0(\sigma_m\mid\lambda) = \lambda\exp(-\lambda\sigma_m)$, σ_m ≥ 0, λ > 0, ∀m, and $p_0(\mu\mid\gamma) = \prod_{m=1}^{M}p_0(\mu_m\mid\gamma)$ with exponential distributions $p_0(\mu_m\mid\gamma) = \gamma\exp(-\gamma\mu_m)$, μ_m ≥ 0, γ > 0, ∀m. The hierarchical prior distribution is formulated as

$$ p_0(\Theta) = \int\!\!\int p_0(\Theta\mid\mu,\sigma)\,p_0(\sigma\mid\lambda)\,p_0(\mu\mid\gamma)\,d\mu\,d\sigma \qquad (15) $$

where p_0(Θ | μ, σ) is the new data-dependent prior (10). In this case, we are going to learn the distributions of both μ and σ instead of real scalars. Following Proposition 1, we derive Proposition 3 for the proposed framework with the hierarchical prior.

Proposition 3: Problem (4) with the hierarchical prior (15) and exponential distributions p_0(σ|λ), p_0(μ|γ) is equivalent to the following convex optimization problem:

$$ \min_{\alpha\in\mathcal{A}}\ -\sum_{m=1}^{M}\Big(\log\Big(1-\frac{\alpha^T Q_m\alpha}{2\lambda}\Big) + \log\Big(1-\frac{v_m}{\gamma}\Big)\Big) - \mathbf{1}^T\alpha \qquad (16) $$

and the corresponding discriminative function is (13) with

$$ \sigma_m = \frac{1}{\lambda - \frac{1}{2}\alpha^T Q_m\alpha},\qquad \mu_m = \frac{1}{\gamma - v_m}\qquad \forall m \qquad (17) $$

and $v_m = \langle\beta_m,\Lambda_m(\alpha)\rangle$.

Proof: The proof is similar to that of Proposition 1. The only difference is that the prior distribution (15) is used to calculate Z(α) as

$$ Z(\alpha) = \int p_0(\Theta)\exp\Big(\sum_{m=1}^{M}\langle\theta_m,\Lambda_m(\alpha)\rangle\Big)d\theta = \int p_0(\sigma\mid\lambda)\prod_{m=1}^{M}\exp\Big(\frac{\sigma_m}{2}\langle\Lambda_m(\alpha),\Lambda_m(\alpha)\rangle\Big)d\sigma \cdot \int p_0(\mu\mid\gamma)\prod_{m=1}^{M}\exp\big(\mu_m\langle\beta_m,\Lambda_m(\alpha)\rangle\big)d\mu = \prod_{m=1}^{M}\frac{\lambda}{\lambda-\frac{1}{2}\langle\Lambda_m(\alpha),\Lambda_m(\alpha)\rangle}\cdot\frac{\gamma}{\gamma-\langle\beta_m,\Lambda_m(\alpha)\rangle}. $$

The second equality comes from the results of Z(α) in Proposition 2, whereas the last equality holds provided that the partition function is bounded so that $\frac{1}{2}\|\Lambda_m(\alpha)\|^2 < \lambda$ and $\langle\beta_m,\Lambda_m(\alpha)\rangle < \gamma$, ∀m, together with the nonnegative supports of σ and μ. The derivative of log Z(α) turns out to be

$$ \partial_{\alpha_i}\log Z(\alpha) = \sum_{m=1}^{M}\langle \mathbb{E}_{\theta\sim p(\Theta)}[\theta_m],\ y_i\phi_m(x_i)\rangle $$

where the expectation of θ_m over p(Θ) is derived as

$$ \mathbb{E}_{\theta\sim p(\Theta)}[\theta_m] = \frac{\Lambda_m(\alpha)}{\lambda-\frac{1}{2}\langle\Lambda_m(\alpha),\Lambda_m(\alpha)\rangle} + \frac{\beta_m}{\gamma-\langle\beta_m,\Lambda_m(\alpha)\rangle}. $$

By replacing the corresponding terms in Proposition 1, we can readily obtain (16) and (17). The convexity of (16) can readily be verified by the composition property [23].

According to Proposition 3, the v_m's in (16) can be written as

$$ v_m = \sum_{i=1}^{N}\alpha_i y_i\langle\beta_m,\phi_m(x_i)\rangle = \alpha^T(\mathbf{f}_m\odot y). \qquad (18) $$

As mentioned in Section IV-B, any kernel-based classifier trained on the feature function ψ_m can be used to obtain $\mathbf{f}_m = [f_1^m; \ldots; f_N^m]$, so the framework is easily adapted to either a supervised or a semisupervised setting; the difference depends only on how f_i is obtained for the ith labeled instance.
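Given a solution α of (16), the weights in (17) and the quantities v_m in (18) are available in closed form. A minimal NumPy sketch (the variable names are ours):

```python
import numpy as np

def weights_from_alpha(alpha, y, Q_list, F, lam, gamma):
    """Recover sigma_m, mu_m via (17) and v_m via (18).

    Q_list : list of M matrices Q_m = K_m * (y y^T) (element-wise), shape (N, N)
    F      : array of shape (N, M); column m holds the view-classifier decision values f_m
    """
    v = np.array([alpha @ (F[:, m] * y) for m in range(F.shape[1])])             # (18)
    sigma = np.array([1.0 / (lam - 0.5 * alpha @ Qm @ alpha) for Qm in Q_list])  # (17)
    mu = 1.0 / (gamma - v)                                                       # (17)
    return sigma, mu, v
```

The implicit feasibility conditions of (16), namely ½α^TQ_mα < λ and v_m < γ, keep the denominators above positive.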

D. Convex Optimization

To solve Problem (16) with (18), sequential quadratic programming [24] is used. Let g(α) be the objective of (16). The gradient and Hessian matrix of g(α) can be derived as

$$ \nabla g(\alpha) = \sum_{m=1}^{M}\Big(\frac{1}{\zeta_m}Q_m\alpha + \frac{u_m}{\vartheta_m}\Big) - \mathbf{1} $$

$$ \nabla^2 g(\alpha) = \sum_{m=1}^{M}\Big(\frac{1}{\zeta_m^2}Q_m\alpha\alpha^T Q_m + \frac{1}{\zeta_m}Q_m + \frac{u_m u_m^T}{\vartheta_m^2}\Big) $$

where $\zeta_m = \lambda - \frac{1}{2}\alpha^T Q_m\alpha$, $\vartheta_m = \gamma - v_m$, and $u_m = \mathbf{f}_m\odot y$. Given Mercer kernels, ∇²g(α) is positive semidefinite, so Problem (16) is convex. Given α^(k), the quadratic approximation of g(α) at this point is used to find a proper descent direction by solving the following optimization problem with optimal solution ᾱ^(k):

$$ \min_{\alpha\in\mathcal{A}}\ g(\alpha^{(k)}) + \nabla g(\alpha^{(k)})^T\Delta + \frac{1}{2}\Delta^T\nabla^2 g(\alpha^{(k)})\Delta \qquad (19) $$

where Δ = α − α^(k). This problem is a constrained quadratic program (QP) with box constraints. Although ᾱ^(k) ∈ A, it may be infeasible for (16) because the implicit constraints might not be satisfied. To find the next feasible point, backtracking line search is used to choose a proper step size s such that all the implicit constraints are strictly satisfied, and the update rule becomes α^(k+1) = α^(k) + s(ᾱ^(k) − α^(k)). The feasibility of such a step is guaranteed by the convexity of Problem (16). We name the resulting procedure, summarized in Algorithm 1, maximum entropy discrimination with feature concatenation (MEDFC).

Algorithm 1 Sequential Quadratic Programming for Maximum Entropy Discrimination With Feature Concatenation
1: Input: strictly feasible solution α = 0
2: repeat
3:   Obtain Δ by solving the quadratic program (19)
4:   if |∇g(α)^T Δ / 2| < 10⁻⁶ then
5:     break
6:   end if
7:   Choose step size s by backtracking line search
8:   Update α = α + sΔ
9: until convergence
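The quantities that drive Algorithm 1 are just the objective, gradient, and Hessian above. A minimal NumPy sketch of these building blocks is given below; a QP solver for the subproblem (19) and the backtracking line search are omitted, and the variable names are ours.

```python
import numpy as np

def medfc_oracle(alpha, Q_list, U, lam, gamma):
    """Objective g(alpha) of (16) and its gradient/Hessian as given in Section IV-D.

    Q_list : list of M matrices Q_m = K_m * (y y^T), shape (N, N)
    U      : array (N, M); column m is u_m = f_m * y (element-wise)
    """
    N, M = U.shape
    g, grad, hess = 0.0, -np.ones(N), np.zeros((N, N))
    for m in range(M):
        Qa = Q_list[m] @ alpha
        zeta = lam - 0.5 * alpha @ Qa           # zeta_m, must stay > 0 (implicit constraint)
        vartheta = gamma - alpha @ U[:, m]      # vartheta_m, must stay > 0 (implicit constraint)
        g += -np.log(zeta / lam) - np.log(vartheta / gamma)
        grad += Qa / zeta + U[:, m] / vartheta
        hess += (np.outer(Qa, Qa) / zeta**2 + Q_list[m] / zeta
                 + np.outer(U[:, m], U[:, m]) / vartheta**2)
    return g - alpha.sum(), grad, hess
```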


E. Properties

In this section, we analyze the properties of our proposed framework (4) with the hierarchical prior (15). According to the properties of the exponential distribution, the proposed model behaves as follows. On the one hand, if λ → ∞, p_0(σ|λ) is sharply peaked around σ = 0. On the other hand, p_0(σ|λ) tends toward a uniform distribution as λ → 0. This property of the exponential distribution is shown in Fig. 4(d). By increasing λ from 0 to ∞, the σ learned by (16) varies from dense to sparse. Similar effects hold for the hyperparameter γ and the EKP coefficients μ drawn from the exponential distribution p_0(μ|γ). Hence, we can vary λ and γ to obtain different degrees of sparsity in the coefficients of the combination of kernels and of the ensemble of view classifiers, simultaneously.¹ This property is desirable because of its flexibility for arbitrarily defined base kernels and view classifiers.

In the case of γ = ∞ and μ = 0, the learned weights of the base kernels are σ_m = (λ − ½α^TQ_mα)⁻¹, ∀m, according to (17). The smaller the value of α^TQ_mα is, the more shrinkage is imposed on ⟨θ_m⟩_p, which is an abbreviation for $\mathbb{E}_{\theta\sim p(\Theta)}[\theta_m]$. By increasing λ, the shrinkage effect becomes more severe. Proposition 4 connects the sparsity of the proposed model with the group lasso.

Proposition 4: Problem (4) with the hyperprior (15) and exponential distributions p_0(σ|λ) and p_0(μ|γ) has the following properties when γ → ∞.
1) The relative entropy in Problem (4) is a function of ‖⟨θ_m⟩_p‖ and λ, where each component is regarded as one group.
2) When λ → ∞, we obtain an approximation of the group lasso $\sum_{m=1}^{M}\|\langle\theta_m\rangle_p\|$.

Proof: According to the proof of Proposition 3, we can rewrite the relative entropy term as

$$ \mathrm{KL}(p(\Theta)\|p_0(\Theta)) = \int p(\Theta)\log\Big[\frac{p_0(\Theta)\exp\big(\sum_{m=1}^{M}\theta_m^T\Lambda_m(\alpha)\big)}{Z(\alpha)\,p_0(\Theta)}\Big]d\Theta = \sum_{m=1}^{M}\langle\theta_m\rangle_p^T\Lambda_m(\alpha) - \sum_{m=1}^{M}\log\frac{\lambda}{\lambda-\frac{1}{2}\|\Lambda_m(\alpha)\|^2}. $$

The expectation of θ_m over p(Θ) is $\langle\theta_m\rangle_p = \Lambda_m(\alpha)/(\lambda - \frac{1}{2}\alpha^T Q_m\alpha)$ according to the proof of Proposition 3, which we reformulate as $\Lambda_m(\alpha) = \langle\theta_m\rangle_p(\lambda - \frac{1}{2}\|\Lambda_m(\alpha)\|^2)$. By computing the norm on both sides, we obtain the equality $\|\Lambda_m(\alpha)\| = \|\langle\theta_m\rangle_p\|(\lambda - \frac{1}{2}\|\Lambda_m(\alpha)\|^2)$, which can be solved in closed form as $\|\Lambda_m(\alpha)\| = (-1 \pm \sqrt{1+2\lambda\|\langle\theta_m\rangle_p\|^2})/\|\langle\theta_m\rangle_p\|$. The nonnegativity of the norm enforces the unique solution $\|\Lambda_m(\alpha)\| = (-1 + \sqrt{1+2\lambda\|\langle\theta_m\rangle_p\|^2})/\|\langle\theta_m\rangle_p\|$. By substituting these equalities back into the relative entropy, we obtain a function Ω(‖⟨θ_m⟩_p‖, λ) of the expectations ‖⟨θ_m⟩_p‖ and the parameter λ:

$$ \mathrm{KL}(p(\Theta)\|p_0(\Theta)) = \sum_{m=1}^{M}\Big(\sqrt{1+2\lambda\|\langle\theta_m\rangle_p\|^2} - 1 - \log\frac{1+\sqrt{1+2\lambda\|\langle\theta_m\rangle_p\|^2}}{2}\Big) = \Omega(\|\langle\theta_m\rangle_p\|,\lambda) - M. $$

When λ approaches +∞, KL(p(Θ)‖p_0(Θ)) becomes $\sqrt{2\lambda}\sum_{m=1}^{M}\|\langle\theta_m\rangle_p\| - M$, since $\lim_{a\to+\infty}(1/a)\log a = 0$. This implies that this regularization term is an approximation of the group lasso $\sum_{m=1}^{M}\|\langle\theta_m\rangle_p\|$, which is frequently used in MKL formulations to induce sparsity over kernels. The proof is completed.

As shown in Section IV-C, the proposed framework can readily incorporate decision values from any kernel-based classifier as prior information. This property supports the extension of the proposed framework to the semisupervised setting, where the prior is obtained from both labeled and unlabeled data, as well as to other settings where more information can be captured by a kernel-based method. Our framework has the same complexity with the prior information f_m as without it, and the optimization problem remains convex.

¹ As observed in [25]–[28], sparseness of the weights of base kernels is not always beneficial in practice. λ in our proposed method has similar functionality to p in ℓ_p MKL [26], [27]. Similarly, the coefficients of the combination of view classifiers can be sparse or nonsparse, depending on the given data; this is controlled by the parameter γ in our model.

V. EXTENSIONS OF EXISTING MKL METHODS TO THE GENERALIZED LEARNING SETTING

MKL has been a popular research topic in the machine learning area [8], [14], [25], [28]–[34]. Various formulations have been explored to improve prediction accuracy as well as efficiency, which are discussed in the survey papers [29], [35] and the references therein. Instead of the Bayesian inference mentioned in Section IV-C, the generalized maximum likelihood (GML) method [36] can obtain a point estimate by computing the mode of the posterior. The following proposition gives the approach for obtaining the GML estimate of the hidden variables σ in the hierarchical Bayesian framework with the given prior p_0(σ|λ).

Proposition 5: Given the prior p_0(σ|λ) with p_0(Θ) defined in (15), the GML estimate σ^gml and the model parameter α in Proposition 1 are obtained by solving the following minimax problem:

$$ \min_{\sigma\ge 0}\max_{\alpha\in\mathcal{A}}\ -\log p_0(\sigma\mid\lambda) + h(\alpha) - \frac{1}{2}\sum_{m=1}^{M}\sigma_m\alpha^T Q_m\alpha + \mathbf{1}^T\alpha \qquad (20) $$

where σ is a real vector rather than a random variable and $h(\alpha) = \sum_{m=1}^{M}\log(\gamma - \langle\beta_m,\Lambda_m(\alpha)\rangle)$. If log p_0(σ|λ) is concave with respect to σ, an alternating method is guaranteed to reach the global optimum.

Proof: The posterior distribution p(θ | {(x_1, y_1), ..., (x_N, y_N)}) is represented implicitly; it is learned from the training data. Similarly, p(σ) is also a posterior distribution conditioned on the training data. According to the GML method, the GML estimate σ^gml is the largest mode of p(σ):

$$ \sigma^{\mathrm{gml}} = \arg\max_{\sigma\ge 0}\ p(\sigma) = \arg\max_{\sigma\ge 0}\int\!\!\int p(\theta,\sigma,\mu)\,d\theta\,d\mu. $$

If p(σ) is sharply peaked around the point estimate σ^gml, then $p(\theta)\approx\int p(\theta,\mu,\sigma^{\mathrm{gml}})\,d\mu$. By replacing p_0(σ) with the GML point estimate σ^gml, the partition function can be simplified to

$$ Z(\alpha\mid\sigma^{\mathrm{gml}}) = \int p_0(\theta\mid\sigma^{\mathrm{gml}})\exp\Big(\sum_{m=1}^{M}\langle\theta_m,\Lambda_m(\alpha)\rangle\Big)d\theta = p_0(\sigma^{\mathrm{gml}}\mid\lambda)\prod_{m=1}^{M}\exp\Big(\frac{\sigma_m^{\mathrm{gml}}}{2}\langle\Lambda_m(\alpha),\Lambda_m(\alpha)\rangle\Big)\cdot\int p_0(\mu\mid\gamma)\prod_{m=1}^{M}\exp\big(\mu_m\langle\beta_m,\Lambda_m(\alpha)\rangle\big)d\mu = p_0(\sigma^{\mathrm{gml}}\mid\lambda)\cdot\prod_{m=1}^{M}\exp\Big(\frac{\sigma_m^{\mathrm{gml}}}{2}\langle\Lambda_m(\alpha),\Lambda_m(\alpha)\rangle\Big)\cdot\frac{\gamma}{\gamma-\langle\beta_m,\Lambda_m(\alpha)\rangle}. $$

According to Proposition 1, we can then readily obtain (20). If log p_0(σ|λ) is concave with respect to σ, the alternating method is guaranteed to reach the global solution, since the minimax problem is a saddle-point problem [37].

Detailed analyses and extensions of existing MKL methods with EKP as the prior are presented in the following. We illustrate that our framework is more general than these existing MKL formulations and also provides them with a natural extension to incorporate the ensemble of view classifiers. According to Proposition 5, the connections to existing MKL methods are as follows.

A. Exponential Distribution Prior

According to Section IV-C, we obtain the optimization problem with the exponential distribution as

$$ \min_{\sigma\ge 0}\max_{\alpha\in\mathcal{A}}\ \lambda\sum_{m=1}^{M}\sigma_m + h(\alpha) - \frac{1}{2}\sum_{m=1}^{M}\sigma_m\alpha^T Q_m\alpha + \mathbf{1}^T\alpha. $$

If γ = ∞ and λ is the dual variable for the constraint $\sum_{m=1}^{M}\sigma_m = 1$, we recover the MKL formulation with the ℓ1 norm on the weights of the kernels [8], [30], [32], [34]. In general, we could have the constraint $\sum_{m=1}^{M}\sigma_m = \tau$ or $\sum_{m=1}^{M}\sigma_m \le \tau$ for any parameter τ > 0 [38]. In this case, the number of nonzero weights is directly constrained.

B. Half-Normal Distribution Prior

The half-normal distribution p_0(σ|λ) is defined as the product of half-normal distributions on each element of σ, that is, $p_0(\sigma\mid\lambda)\propto\prod_{m=1}^{M}\exp(-\frac{\lambda}{2}\sigma_m^2)$, σ ≥ 0. By substituting this prior into Problem (20), we obtain the optimization problem

$$ \min_{\sigma\ge 0}\max_{\alpha\in\mathcal{A}}\ \frac{\lambda}{2}\|\sigma\|_2^2 + h(\alpha) - \frac{1}{2}\sum_{m=1}^{M}\sigma_m\alpha^T Q_m\alpha + \mathbf{1}^T\alpha $$

where the ℓ2 norm is imposed on the weights of the kernels. The case γ = ∞ has been explored in [25] and [28]. Unlike the exponential distribution, the half-normal distribution generally produces a dense solution for σ.

C. Composite Distribution Prior

A composite distribution can be defined as the product of distributions. The super-Gaussian distribution [39] is defined as the product of an exponential distribution and a half-normal distribution, that is, $p_0(\sigma\mid\lambda)\propto\prod_{m=1}^{M}\exp(-\lambda_1\sigma_m-\lambda_2\sigma_m^2)$. The optimization problem becomes

$$ \min_{\sigma\ge 0}\max_{\alpha\in\mathcal{A}}\ \lambda_1\mathbf{1}^T\sigma + \lambda_2\|\sigma\|_2^2 + h(\alpha) - \frac{1}{2}\sum_{m=1}^{M}\sigma_m\alpha^T Q_m\alpha + \mathbf{1}^T\alpha. $$

If γ = ∞, this recovers the elastic net MKL [33]. If $p_0(\sigma\mid\lambda)\propto\prod_{m=1}^{M}\exp(-\lambda\sigma_m^p)$ with p ≥ 1, the optimization problem is $\min_{\sigma\ge 0}\max_{\alpha\in\mathcal{A}}\ \lambda\|\sigma\|_p^p + h(\alpha) - \frac{1}{2}\sum_{m=1}^{M}\sigma_m\alpha^T Q_m\alpha + \mathbf{1}^T\alpha$, which is ℓ_p-MKL [29], [31], [34] if γ = ∞.

D. Ensemble of Kernel Predictors

This approach [14] consists of two stages. In the first stage, the view classifiers are chosen by cross validation on the training data, as described in Section IV-B; SVM is used for the view classifiers. After obtaining {w_m, b_m}, ∀m, the parameter μ can be learned by solving the following optimization problem:

$$ \min_{\mu\in\mathcal{S}}\ \sum_{i=1}^{N}\ell\Big(y_i,\ \sum_{m=1}^{M}\mu_m\big(\langle w_m,\psi_m(x_i)\rangle_{\mathcal{H}_m}+b_m\big)\Big) \qquad (21) $$

where S = {μ : μ ≥ 0, 1^Tμ = 1}. A separate training sample is used to learn the nonnegative coefficients μ. In [14], ℓ2 regularization over μ can also be employed, which yields dense weights for the view classifiers; for that case, the exponential distribution in our model can similarly be replaced by a half-normal distribution p_0(μ|γ). In this paper, we consider the simplex constraint in EKP, or the sparse representation induced by the exponential distribution.

If λ → ∞, then σ → 0, and according to (13) we obtain the decision function

$$ \mathbb{E}_{\theta\sim p(\Theta)}[f(x;\theta)] = \sum_{m=1}^{M}\mu_m\langle\beta_m,\phi_m(x)\rangle = \sum_{m=1}^{M}\mu_m\big(\langle w_m,\psi_m(x)\rangle + b_m\big) $$

where β_m = [w_m; b_m], ∀m. Substituting it into the primal problem (4), we obtain an empirical risk which is the same as the objective in (21), and p(Θ) = p_0(Θ) at the optimum, so there is no regularization on μ. When the exponential distribution p_0(μ_m) is used as the prior, GML estimation of μ as in Proposition 5 yields results similar to (21). Hence, EKP can be considered a special case of our proposed framework.

VI. RELATED WORK

Apart from the methods discussed in Section V, probabilistic interpretations of MKL have also been explored recently in the literature. The hierarchical prior in [18] has a similar structure, but the additional prior distribution on μ in (15) leads to the novel model (16), which combines MKL and EKP. Their formulation is nonconvex and cannot be used to derive a block-norm formulation, whereas our models are convex, and the regularization formulation of Bayesian MKL is given in Proposition 4.


Moreover, their empirical results show that their Bayesian MKL performs worse than ℓ2-MKL and other methods because of its sparse solution. We will show later that our Bayesian models outperform all baselines on most of the data sets, with dense weights and a shrinkage effect. Another branch of MKL [40], [41] models the combination of kernels over the simplex (e.g., with a Dirichlet distribution), whereas we derive the weights σ from a prior distribution with only a nonnegativity assumption. This allows us to make connections with various MKL methods. Moreover, [40] and [41] involve importance sampling during variational inference, which is time consuming even when the number of kernels is moderate. Reference [17] is limited to conjugate priors and its problem is nonconvex, whereas our model is convex, so global convergence is guaranteed, and more information can be easily incorporated.

Our framework is built on maximum entropy discrimination (MED) [19], but it is very different from nonstationary kernel combination (NKC) [42] and the maximum entropy discrimination Markov network (MaxEnDNet) [16]. The assumption that each kernel follows a Gaussian distribution restricts NKC, and its nonconvex formulation forces it to be optimized approximately (e.g., by omitting the entropy terms and approximating the prediction function). MaxEnDNet only works in the linear case, where parameter estimation is feasible, whereas our model allows each view to contain more than one feature and to lie in an RKHS. Hence, MaxEnDNet cannot be directly used for MKL owing to its infeasible parameter estimation when the feature space is infinite. The shrinkage effect in MaxEnDNet is on the model parameters, whereas in our model it is on the weights of the kernels with a group structure. The most important difference is that our framework can easily explore more information via the new prior, that is, partial correspondence and unlabeled data, whereas the other methods cannot.

For the semisupervised setting, few methods are specifically designed for multiple kernel learning [43], [44]. For example, Tsuda et al. [43] proposed a graph-based semisupervised model in the transductive setting, which cannot conduct out-of-sample prediction. Wang et al. [44] proposed a semisupervised MKL based on the assumption of conditional expectation consensus. Our proposed framework can use any semisupervised learning method to construct the view classifiers for handling unlabeled data; hence, it is more flexible for handling various semisupervised assumptions.

We also note that the methods in [45] and [46] integrate localized classifier ensemble learning and multiple kernel learning in a unified framework. They use either a Gaussian mixture model [45] or a bagging method [46] to construct multiple MKL models from subsets of the training data, and then use the mixture of MKL models for prediction. MKL methods [11], [34] have also been applied to structured output prediction problems. These approaches are very different from the proposed paradigm, which uses view classifiers as components. Moreover, the proposed hierarchical Bayesian framework can naturally integrate MKL and view classifiers in a single convex optimization. More importantly, our framework can explore partially correspondent instances (see Section II), whereas the aforementioned methods cannot.

Fig. 3. Intermediate results of MEDFC on the Sonar data with γ = 10², λ = 10, and C = 10. (a) Weights of kernels. (b) Weights of view classifiers. (c) Number of iterations on 20 replications. (d) Objective value in terms of iterations in the 11th replication.

Fig. 4. Parameter sensitivity of MEDFC on data set Sonar with respect to the mean of testing accuracy over 20 replications. (a)–(c) Obtained by fixing one parameter and varying others. (d) Exponential density functions with different hyperparameters used in the experiments.

VII. EXPERIMENTS

We empirically investigate the proposed framework with the hierarchical prior in both a supervised setting and a semisupervised setting, namely MEDFC and Semi-MEDFC, respectively. The baseline methods we consider in this paper are MKL with uniform weights (AverageMKL), the ensemble of kernel predictors (EKP) [14], SimpleMKL² [30], LevelMKL³ [32],

TABLE I. Testing accuracy (in %) of compared methods in the supervised setting with full correspondence, where N is the number of data points, D is the number of features, and M is the number of kernels. Paired t-test with one star for the significance level of 0.9 and two stars for the significance level of 0.95, comparing MEDFC with the other methods. The best results are shown in bold.

SMOMKL(p)⁴ [31] with p ∈ {1.1, 2, 3}, and Bayesian efficient MKL⁵ with sparse and nonsparse weights (BEMKL-s and BEMKL-ns, respectively) [17]. View classifiers are trained either by LIBSVM⁶ [21] in the supervised setting or by Primal LapSVM⁷ [47] in the semisupervised setting, since we only consider the manifold assumption as a showcase of the proposed framework. Both AverageMKL and EKP can use Primal LapSVM as the base solver, so we name their variants in the semisupervised setting Semi-AverageMKL and Semi-EKP; these are used as the baselines in the semisupervised setting. Even though other assumptions used in semisupervised learning are also feasible, we do not consider them as suitable baselines, for fair comparison.

² http://asi.insa-rouen.fr/enseignants/~arakoto/code/mklindex.html
³ http://appsrv.cse.cuhk.edu.hk/~zlxu/toolbox/level_mkl.html
⁴ http://research.microsoft.com/en-us/um/people/manik/code/SMOMKL/download.html
⁵ http://users.ics.aalto.fi/gonen/icml12.php
⁶ http://www.csie.ntu.edu.tw/~cjlin/libsvm/
⁷ http://www.dii.unisi.it/~melacci/lapsvmp/

We investigate our proposed method by comparing it against the baseline methods in three different settings. First, we demonstrate the effectiveness and efficiency of our methods, parameter sensitivity analysis, and the convergence of the sequential quadratic programming in the supervised MKL setting. Then, we conduct experiments with two data-dependent priors constructed from either partially correspondent instances in the supervised setting or missing labels in the semisupervised setting. All the experiments are described in detail in the following subsections.

A. Supervised Setting With Full Correspondence

For the experiments in the supervised setting with full correspondence, two different settings are considered. First, we follow the traditional experiments on UCI data sets with a widely used multiple kernel setting [17]. We then conduct experiments on a real-world application, namely protein subcellular localization [48].

1) UCI Data Sets: Experiments are conducted on six UCI data sets. The statistical information of these data sets is shown in Table I. 70% of the data is used as the training data and the remaining 30% as the test data. The training data is normalized

to have zero mean and unit standard deviation, and the test data is normalized using the same mean and standard deviation from the training set. The same set of kernels is generated as in [17]: RBF kernels with ten bandwidths ({2⁻³, 2⁻², ..., 2⁶}) for each individual dimension as well as for the full set of dimensions, and polynomial kernels with three different degrees ({1, 2, 3}) for each individual feature and for the full set of features. All kernel matrices are precomputed and normalized to have unit trace. We run the compared methods on the same set of 20 randomly drawn replications, and the mean and standard deviation of the results are reported.

The tradeoff parameter C in both the MKL methods and the SVMs is tuned in the range [10⁻², 10²]. For EKP, the view classifiers are SVMs learned from the training data with the tradeoff parameter C tuned by fivefold cross validation in the first stage. In the second stage of EKP, the coefficients of the combination of view classifiers are also learned from the training data, since we do not have a separate validation set, for fair comparison. The default setting in the publicly available code is used for BEMKL-s and BEMKL-ns. In addition to C, the additional parameter in SMOMKL for the ℓp regularizer on the weights of kernels is tuned in the range [10⁻³, 10³]. Both λ and γ of MEDFC are tuned over {0.1, 1, 5, 10, 20, 30, 50, 10², 10³} according to the curve of the exponential distribution, which uniformly covers the space shown in Fig. 4(d). The view classifiers used by EKP in the first stage are used as the prior of MEDFC. In the case of fully correspondent views, both the view classifiers and MKL can explore all instances in the training set.

The testing accuracy of the compared methods on the six UCI data sets is shown in Table I. The results show that the proposed MEDFC outperforms the other methods, including the MKL methods and EKP. This is in line with the analysis that MEDFC seeks a better model in between MKL and EKP. We also observe that the BEMKL methods perform worse than the other MKL methods, especially on the Liver data. This empirical discrepancy may result from their nonconvex problem on the different replications used here, as in [17], because the solutions are affected by initialization.

Fig. 3(a) and (b) shows the learned weights σ and μ with γ = 10² and λ = 10. We observe that μ is sparser than σ when compared on the same scale of the y-axis. For the Sonar data, this means that a small number of view classifiers is useful, but it is preferred that most kernels are combined. This observation is consistent with the priors as exponential distributions p_0(σ|λ) and p_0(μ|γ), and MKL methods with dense weights demonstrate better results than those with sparse weights, as shown by both BEMKL and SMOMKL.
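A minimal sketch of the base-kernel construction described above (per-dimension and all-dimension RBF and polynomial kernels, followed by unit-trace normalization). The interpretation of the "bandwidth" as the gamma multiplier of the squared distance is our assumption, not necessarily the exact parameterization of [17].

```python
import numpy as np
from sklearn.metrics.pairwise import rbf_kernel, polynomial_kernel

def unit_trace(K):
    """Normalize a kernel matrix to have unit trace."""
    return K / np.trace(K)

def base_kernels(X):
    """RBF kernels with bandwidths 2^-3..2^6 and polynomial kernels of degree 1-3,
    on each individual feature and on the full feature set."""
    N, D = X.shape
    feature_sets = [X[:, [d]] for d in range(D)] + [X]
    kernels = []
    for Xs in feature_sets:
        for bw in 2.0 ** np.arange(-3, 7):
            kernels.append(unit_trace(rbf_kernel(Xs, gamma=bw)))
        for degree in (1, 2, 3):
            kernels.append(unit_trace(polynomial_kernel(Xs, degree=degree)))
    return kernels

X = np.random.default_rng(3).normal(size=(30, 4))
print(len(base_kernels(X)))   # (4 + 1) feature sets x (10 + 3) kernels = 65
```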


TABLE II. CPU time (in seconds) of compared methods in the supervised setting with full correspondence.

Table II shows the CPU time of the compared methods in the training procedure, excluding the time for learning the view classifiers. We observe that MEDFC is slower than the other baselines on some data sets, such as Pima. This may be partially because the second-order approximation is not accurate in the first few iterations, so more iterations are required to reach convergence. We plot the number of iterations and the convergence curve on the Sonar data in Fig. 3(c) and (d), respectively. Parameter sensitivity of MEDFC is shown as an exemplar on the Sonar data set in Fig. 4. Following the hyperparameters of the exponential distribution in Fig. 4(d), the testing accuracy on Sonar varies smoothly with all parameters. The best average results over 20 replications are obtained with γ = 10², λ = 10, and C = 10, which lie inside the parameter space.

2) Protein Subcellular Localization: Protein subcellular localization is a crucial ingredient in many important inferences about cellular processes, including the prediction of protein function and protein interactions [48], [49]. We use the protein sequence kernels which consider all motifs, including motifs with gaps [48]. The PSORTb (v.2.0) data [49] with 69 kernels are used in the experiments and are available online.⁸ The psort+ data set contains 514 instances and four classes, whereas the psort− data set contains 1444 instances and five classes. To take advantage of the binary view classifiers, we evaluate the learning performance in the one-versus-the-rest setting. In total, there are nine binary classification problems. Experiments are conducted following the same setting as on the UCI data sets.

⁸ http://raetschlab.org//suppl/protsubloc

Table III shows the results on the first seven binary classification problems. We find that the results of the remaining two problems are similar to the observations reported in Table III, so we do not report them here. Similar to Table I, we can infer from these observations that the proposed MEDFC outperforms the other methods, including the MKL methods and EKP, which is consistent with the analysis in Section IV.

B. Supervised Setting With Partial Correspondence

In this section, we first conduct experiments by constructing simulated missing views on UCI data sets to reuse

the ensemble of classifiers from the supervised setting with full correspondence, and investigate the changes as more instances with partial views are added. Experiments on one real data set are then simulated by randomly sampling missing views.

1) UCI Data Sets: The experiments in this section attempt to provide empirical support for our method with EKP as the proposed prior, which boosts the performance of both MKL and EKP. Under partial correspondence, MKL can only use the fully correspondent instances in the training set, whereas EKP can use all the training data. By shrinking the size of the fully correspondent training data for MKL, we can simulate the partial correspondence setting by changing the ratio of fully correspondent instances. As mentioned in Section II, breaking a fully correspondent instance is equivalent to an instance with a missing index, so it indirectly forms M virtual instances, one per view. By dropping the instances with missing views, the base kernel over each view remains exactly the same, but the number of fully correspondent instances is reduced; in contrast, the ratio of partially correspondent instances increases. The ratio of fully correspondent labeled instances for training is denoted r, and we vary r ∈ {20%, 40%, 60%, 80%, 100%}. The training data are fully correspondent if r = 100%. Since EKP can explore all the training instances, we reuse all the view classifiers chosen by EKP in the full correspondence case.

Fig. 5 shows the testing accuracy of the compared methods as a function of the correspondence ratio on the UCI data sets. We observe that the prior of EKP greatly improves MKL when the correspondence ratio is small. As the correspondence ratio gradually increases, the improvement of MKL from the prior of EKP decreases, since MKL takes more and more correspondent training instances into account. However, even at r = 100%, MEDFC still outperforms the other methods, with a large improvement on Sonar compared with the other MKL methods. This is mainly due to the different information used by MKL and EKP: MKL considers the discrepancy of views, whereas EKP considers the decision values of the view classifiers as newly generated features.

2) Pendigits Data Set: The pendigits data set⁹ is a pen-based digit recognition task with 10 classes and contains four different feature representations. The data set is split into independent training and test sets with 7494 samples for training and 3498 samples for testing. Following [50], the linear kernel is used for all four feature representations. The binary classification problem 5 versus 9 is explored in this experiment. Unlike their experiments, we consider the case of

⁹ http://mkl.ucsd.edu/dataset/pendigits


TABLE III. Testing accuracy (in %) of compared methods in the supervised setting with full correspondence. Paired t-test with one star for the significance level of 0.9 and two stars for the significance level of 0.95, comparing MEDFC with the other methods. The best results are shown in bold.

Fig. 5. Testing accuracy of compared methods on UCI data sets by varying the correspondence ratio from 20% to 100%. (a) Liver. (b) Heart. (c) Sonar. (d) Ionosphere. (e) Pima. (f) Wdbc.

partial views by keeping 50% of the instances as fully correspondent, while the missing views are randomly sampled according to missing ratios of 50% and 80%. Moreover, we consider two additional types of baseline: zero-filling and mean-filling approaches, and we name the original baselines the dropping approaches. The zero-filling approaches are MKL methods trained on the same data but with missing views filled by zeros. Similarly, the mean-filling approaches are MKL methods trained on the same data but with missing views filled by the mean of the observed instances of the corresponding view. Results on pendigits comparing the various baselines are reported in Table IV.
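The two filling strategies are simple to state in code. A minimal NumPy sketch, under the assumption that each view is stored as an (N, D_m) array with all-NaN rows marking missing views (our representation, not the authors'):

```python
import numpy as np

def zero_fill(view):
    """Replace missing rows (all-NaN) of one view's feature matrix by zeros."""
    filled = view.copy()
    missing = np.isnan(filled).all(axis=1)
    filled[missing] = 0.0
    return filled

def mean_fill(view):
    """Replace missing rows by the mean of the observed instances of this view."""
    filled = view.copy()
    missing = np.isnan(filled).all(axis=1)
    filled[missing] = filled[~missing].mean(axis=0)
    return filled

view = np.array([[1.0, 2.0], [np.nan, np.nan], [3.0, 6.0]])
print(zero_fill(view))   # missing second instance becomes [0, 0]
print(mean_fill(view))   # missing second instance becomes [2, 4]
```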

For the dropping approaches, results similar to those on the UCI data sets are observed. However, the performance of some MKL methods, such as AverageMKL, SimpleMKL, and LevelMKL, is degraded after filling the missing views, while others may gain a slight improvement. This implies that filling strategies may introduce noisy information and therefore might not be good for all methods; only the proposed method MEDFC leverages the observed information without noise. When the number of missing views increases, the performance of all methods drops owing to the decrease in useful information, but all results show that MEDFC outperforms the baseline methods.

C. Semisupervised Learning With Manifold Assumption

As mentioned previously, the initialization of the prior can be obtained from classifiers learned in a semisupervised setting. The only difference from the supervised setting is that


TABLE IV. Accuracy (in %) of compared methods on the pendigits data set with 5 versus 9 in the supervised setting with partial correspondence. The ratio of partially correspondent instances is 50%, while the ratio of missing views is varied between 50% and 80%. The missing views are randomly sampled five times, and results are reported as mean and standard deviation. Paired t-test with one star for the significance level of 0.9 and two stars for the significance level of 0.95, comparing MEDFC with the other methods. The best results are shown in bold.

TABLE V. Accuracy (in %) of compared methods on unlabeled data in the transductive setting. Paired t-test with one star for the significance level of 0.9 and two stars for the significance level of 0.95, comparing Semi-MEDFC with the other methods. The best results are shown in bold.

Three data sets are used in this experiment: Digit, Coil, and USPS.10 They fit the manifold assumption well [51]. Each data set contains 1500 instances, with either 10 or 100 labeled instances and the rest unlabeled, over 12 replications. For the parameter setting of Primal LapSVM, we follow [51] and [52]: the intrinsic and ambient regularization parameters γI and γA are tuned over the grid [10^{-6}, 10^{2}], and the normalized graph Laplacian is used, where the bandwidth of the heat kernel is set to the mean of the pairwise distances among instances and the number of nearest neighbors is set to 5. The base kernels are constructed as follows: 1) RBF kernels with 10 different bandwidths {0.5, 1, 2, 5, 7, 10, 12, 15, 17, 20} and 2) polynomial kernels of degree {1, 2, 3}, giving 13 kernels in total. For Semi-AverageMKL, the Laplacian matrix of each view is normalized to unit trace and the matrices are added with uniform weights, analogous to the combination of kernels in MKL. The parameters of the compared methods are the same as those in the supervised setting.

10 http://olivier.chapelle.cc/ssl-book/benchmarks.html
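A sketch of this base-kernel and graph construction, assuming scikit-learn/SciPy utilities and interpreting each listed bandwidth as the σ of an RBF kernel exp(-||x-y||²/(2σ²)), might look as follows; the helper names are illustrative, not the authors' code.

```python
import numpy as np
from scipy.sparse.csgraph import laplacian
from sklearn.metrics.pairwise import rbf_kernel, polynomial_kernel, euclidean_distances
from sklearn.neighbors import kneighbors_graph

def build_base_kernels(X):
    """10 RBF kernels with the listed bandwidths plus polynomial kernels of degree 1-3."""
    bandwidths = [0.5, 1, 2, 5, 7, 10, 12, 15, 17, 20]
    kernels = [rbf_kernel(X, gamma=1.0 / (2.0 * s ** 2)) for s in bandwidths]
    kernels += [polynomial_kernel(X, degree=d) for d in (1, 2, 3)]
    return kernels                                    # 13 kernels in total

def normalized_graph_laplacian(X, n_neighbors=5):
    """5-NN graph with heat-kernel weights; bandwidth = mean pairwise distance."""
    D = euclidean_distances(X)
    sigma = D.mean()
    A = kneighbors_graph(X, n_neighbors=n_neighbors, mode="distance").toarray()
    A = np.where(A > 0, np.exp(-(A ** 2) / (2.0 * sigma ** 2)), 0.0)
    A = np.maximum(A, A.T)                            # symmetrize the k-NN graph
    return laplacian(A, normed=True)

# usage: kernels = build_base_kernels(X); L = normalized_graph_laplacian(X)
# For a Semi-AverageMKL-style combination, each view's Laplacian can additionally
# be scaled to unit trace before the uniform-weight sum.
```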

Table V gives the transductive accuracy on the three data sets under two configurations for all the compared methods. We first observe that MEDFC outperforms the MKL methods and EKP over all configurations in the supervised setting. When the semisupervised view classifiers are used to initialize the prior, Semi-MEDFC outperforms the supervised methods by a large margin, except for a marginal reduction on Digit-100, and also outperforms its semisupervised counterparts. These observations imply that a prior constructed from either labeled or unlabeled data can help the proposed framework improve upon both MKL methods and EKP.

VIII. CONCLUSION

In this paper, we first present a novel interpretation of MKL from the probabilistic perspective. This interpretation inspires a new framework with the proposed data-dependent prior, and hierarchical Bayesian modeling is employed to estimate this prior from data. Moreover, the proposed MKL framework can be easily adapted to either partially correspondent views or missing labels, and extensions of various existing MKL methods to this setting are also proposed. Experiments demonstrate competitive results compared with state-of-the-art MKL methods and EKP in both supervised and semisupervised settings. We observe that view classifiers are useful for improving MKL when the correspondence ratio is small in the partial correspondence problem, and that unlabeled data can be used to further improve the proposed supervised model. In future work, we will study the MKL setting in which the test data have partially correspondent views.

REFERENCES

[1] S. Rüping and T. Scheffer, "Learning with multiple views," in Proc. Int. Conf. Mach. Learn. (ICML) Workshop Learn. Multiple Views, Aug. 2005.
[2] K. Sridharan and S. M. Kakade, "An information theoretic framework for multi-view learning," in Proc. Annu. Conf. Comput. Learn. Theory (COLT), Jul. 2008.
[3] A. Blum and T. Mitchell, "Combining labeled and unlabeled data with co-training," in Proc. Annu. Conf. Comput. Learn. Theory (COLT), 1998.
[4] S. Yu, B. Krishnapuram, R. Rosales, H. Steck, and R. Rao, "Bayesian co-training," in Neural Information Processing Systems. Cambridge, MA, USA: MIT Press, 2007.
[5] M. Amini, N. Usunier, and C. Goutte, "Learning from multiple partially observed views—An application to multilingual text categorization," in Neural Information Processing Systems. Cambridge, MA, USA: MIT Press, 2009, pp. 28–36.
[6] B. Quanz and J. Huan, "CoNet: Feature generation for multi-view semi-supervised learning with partially observed views," in Proc. 21st ACM Int. Conf. Inform. Knowl. Manag., Oct. 2012, pp. 1273–1282.
[7] F. R. Bach, G. R. Lanckriet, and M. I. Jordan, "Multiple kernel learning, conic duality, and the SMO algorithm," in Proc. 21st Int. Conf. Mach. Learn. (ICML), Jul. 2004, p. 6.
[8] S. Sonnenburg, G. Rätsch, C. Schäfer, and B. Schölkopf, "Large scale multiple kernel learning," J. Mach. Learn. Res., vol. 7, pp. 1531–1565, Dec. 2006.
[9] M. Christoudias, R. Urtasun, and T. Darrell, "Bayesian localized multiple kernel learning," Dept. Electr. Eng. Comput. Sci., Univ. California, Berkeley, CA, USA, Tech. Rep. UCB/EECS-2009-96, Jul. 2009.
[10] R. Vemulapalli, J. K. Pillai, and R. Chellappa, "Kernel learning for extrinsic classification of manifold features," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2013, pp. 1782–1789.
[11] X. Xu, I. W. Tsang, and D. Xu, "Soft margin multiple kernel learning," IEEE Trans. Neural Netw. Learn. Syst., vol. 24, no. 5, pp. 749–761, May 2013.
[12] S. Tulyakov, S. Jaeger, V. Govindaraju, and D. Doermann, "Review of classifier combination methods," in Machine Learning in Document Analysis and Recognition. Berlin, Germany: Springer-Verlag, 2008, pp. 361–386.
[13] T. Gao and D. Koller, "Active classification based on value of classifier," in Neural Information Processing Systems. Cambridge, MA, USA: MIT Press, 2011.
[14] C. Cortes, M. Mohri, and A. Rostamizadeh, "Ensembles of kernel predictors," in Proc. Uncertainty Artif. Intell. (UAI), 2011, pp. 145–152.
[15] M. Belkin, P. Niyogi, and V. Sindhwani, "Manifold regularization: A geometric framework for learning from labeled and unlabeled examples," J. Mach. Learn. Res., vol. 7, pp. 2399–2434, Dec. 2006.
[16] J. Zhu and E. P. Xing, "Maximum entropy discrimination Markov networks," J. Mach. Learn. Res., vol. 10, pp. 2531–2569, Dec. 2009.
[17] M. Gönen, "Bayesian efficient multiple kernel learning," in Proc. 29th Int. Conf. Mach. Learn. (ICML), 2012.
[18] R. Tomioka and T. Suzuki, "Regularization strategies and empirical Bayesian learning for MKL," arXiv:1011.3090v2, 2011.
[19] T. Jebara, "Discriminative, generative, and imitative learning," Ph.D. dissertation, Dept. Archit., Massachusetts Inst. Technol., Cambridge, MA, USA, 2002.
[20] M. E. Tipping, "Sparse Bayesian learning and the relevance vector machine," J. Mach. Learn. Res., vol. 1, pp. 211–244, Sep. 2001.
[21] C.-C. Chang and C.-J. Lin, "LIBSVM: A library for support vector machines," ACM Trans. Intell. Syst. Technol., vol. 2, no. 3, pp. 1–27, Apr. 2011.
[22] T. Joachims, "Transductive inference for text classification using support vector machines," in Proc. 16th Int. Conf. Mach. Learn. (ICML), 1999.
[23] S. Boyd and L. Vandenberghe, Convex Optimization. Cambridge, U.K.: Cambridge Univ. Press, 2004.
[24] J. Nocedal and S. Wright, Numerical Optimization. New York, NY, USA: Springer-Verlag, 1999.
[25] C. Cortes, M. Mohri, and A. Rostamizadeh, "L2 regularization for learning kernels," in Proc. 25th Conf. Uncertainty Artif. Intell. (UAI), 2009, pp. 109–116.

[26] Z. Xu, R. Jin, S. Zhu, M. R. Lyu, and I. King, "Smooth optimization for effective multiple kernel learning," in Proc. 24th Conf. Amer. Assoc. Artif. Intell. (AAAI), Jul. 2010.
[27] M. Kloft, U. Brefeld, S. Sonnenburg, P. Laskov, K.-R. Müller, and A. Zien, "Efficient and accurate ℓp-norm multiple kernel learning," in Advances in Neural Information Processing Systems. Cambridge, MA, USA: MIT Press, 2009.
[28] M. Varma and B. R. Babu, "More generality in efficient multiple kernel learning," in Proc. 26th Annu. Int. Conf. Mach. Learn. (ICML), 2009.
[29] M. Kloft, U. Brefeld, S. Sonnenburg, and A. Zien, "ℓp-norm multiple kernel learning," J. Mach. Learn. Res., vol. 12, pp. 953–997, Mar. 2011.
[30] A. Rakotomamonjy, F. Bach, S. Canu, and Y. Grandvalet, "SimpleMKL," J. Mach. Learn. Res., vol. 9, pp. 2491–2521, Sep. 2008.
[31] S. V. N. Vishwanathan, Z. Sun, N. Ampornpunt, and M. Varma, "Multiple kernel learning and the SMO algorithm," in Advances in Neural Information Processing Systems. Cambridge, MA, USA: MIT Press, 2010.
[32] Z. Xu, R. Jin, I. King, and M. Lyu, "An extended level method for multiple kernel learning," in Advances in Neural Information Processing Systems. East Lansing, MI, USA: Michigan State Univ., 2008.
[33] H. Yang, Z. Xu, I. King, and M. R. Lyu, "Efficient sparse generalized multiple kernel learning," IEEE Trans. Neural Netw., vol. 22, no. 3, pp. 433–446, Mar. 2011.
[34] Q. Mao and I. W.-H. Tsang, "Efficient multitemplate learning for structured prediction," IEEE Trans. Neural Netw. Learn. Syst., vol. 24, no. 2, pp. 248–261, Feb. 2013.
[35] M. Gönen and E. Alpaydin, "Multiple kernel learning algorithms," J. Mach. Learn. Res., vol. 12, pp. 2211–2268, Jul. 2011.
[36] J. Berger, Statistical Decision Theory and Bayesian Analysis. New York, NY, USA: Springer-Verlag, 1985.
[37] J. M. Borwein and A. S. Lewis, Convex Analysis and Nonlinear Optimization. New York, NY, USA: Springer-Verlag, 2000.
[38] Z. Xu, R. Jin, J. Ye, M. R. Lyu, and I. King, "Non-monotonic feature selection," in Proc. 26th Annu. Int. Conf. Mach. Learn. (ICML), 2009.
[39] A. Hyvärinen, "Sparse code shrinkage: Denoising of nongaussian data by maximum likelihood estimation," Neural Comput., vol. 11, no. 7, pp. 1739–1768, Oct. 1999.
[40] T. Damoulas and M. A. Girolami, "Probabilistic multi-class multi-kernel learning: On protein fold recognition and remote homology detection," Bioinformatics, vol. 24, no. 10, pp. 1264–1270, Mar. 2008.
[41] M. Girolami and S. Rogers, "Hierarchic Bayesian models for kernel learning," in Proc. 22nd Int. Conf. Mach. Learn. (ICML), 2005.
[42] D. P. Lewis, T. Jebara, and W. S. Noble, "Nonstationary kernel combination," in Proc. 23rd Int. Conf. Mach. Learn. (ICML), 2006.
[43] K. Tsuda, H. Shin, and B. Schölkopf, "Fast protein classification with multiple networks," Bioinformatics, vol. 21, no. 2, pp. ii59–ii65, 2005.
[44] S. Wang, S. Jiang, Q. Huang, and Q. Tian, "S3MKL: Scalable semi-supervised multiple kernel learning for image data mining," in Proc. Int. Conf. Multimedia (MM), 2010.
[45] Y. Song et al., "Localized multiple kernel learning for realistic human action recognition in videos," IEEE Trans. Circuits Syst. Video Technol., vol. 21, no. 9, pp. 1193–1202, Sep. 2011.
[46] J. Xiao and Y. Liu, "Traffic incident detection using multiple-kernel support vector machine," Transp. Res. Rec.: J. Transp. Res. Board, vol. 2324, pp. 45–52, 2012.
[47] S. Melacci and M. Belkin, "Laplacian support vector machines trained in the primal," J. Mach. Learn. Res., vol. 12, pp. 1149–1184, Mar. 2011.
[48] C. S. Ong and A. Zien, "An automated combination of kernels for predicting protein subcellular localization," in Proc. 8th Workshop Algorithms Bioinform. (WABI), 2008.
[49] J. L. Gardy et al., "PSORTb v.2.0: Expanded prediction of bacterial protein subcellular localization and insights gained from comparative proteome analysis," Bioinformatics, vol. 21, no. 5, pp. 617–623, 2005.
[50] M. Gönen and E. Alpaydin, "Cost-conscious multiple kernel learning," Pattern Recognit. Lett., vol. 31, no. 9, pp. 959–965, 2010.
[51] O. Chapelle, B. Schölkopf, and A. Zien, Semi-Supervised Learning. Cambridge, MA, USA: MIT Press, 2006.
[52] Q. Mao and I. Tsang, "Parameter-free spectral kernel learning," in Proc. Uncertainty Artif. Intell. (UAI), Jul. 2010.


Qi Mao received the bachelor's and master's degrees in computer science from Anhui University, Hefei, China, and Nanjing University, Nanjing, China, in 2005 and 2009, respectively, and the Ph.D. degree from the School of Computer Engineering, Nanyang Technological University, Singapore. He is currently a Post-Doctoral Associate with Duke University, Durham, NC, USA.

Ivor W. Tsang received the Ph.D. degree in computer science from the Hong Kong University of Science and Technology, Hong Kong, in 2007. He was the Deputy Director of the Centre for Computational Intelligence, Nanyang Technological University, Singapore. He is an Australian Future Fellow and an Associate Professor with the Centre for Quantum Computation and Intelligent Systems, University of Technology at Sydney, Ultimo, NSW, Australia. He has authored more than 100 research papers in refereed international journals and conference proceedings, including JMLR, TPAMI, TNN/TNNLS, NIPS, ICML, UAI, AISTATS, SIGKDD, IJCAI, AAAI, ACL, ICCV, CVPR, and ICDM. Dr. Tsang was a recipient of the 2008 Natural Science Award (Class II) from the Ministry of Education, China, in 2009, which recognized his contributions to kernel methods. He was also a recipient of the prestigious Australian Research Council Future Fellowship for his research regarding Machine Learning on Big Data in 2013. In addition, he received the prestigious IEEE TRANSACTIONS ON NEURAL NETWORKS Outstanding 2004 Paper Award in 2006, the 2014 IEEE TRANSACTIONS ON MULTIMEDIA Prized Paper Award, and a number of Best Paper Awards and Honors from reputable international conferences, including the Best Student Paper Award at CVPR 2010, the Best Paper Award at ICTAI 2011, and the Best Poster Award Honorable Mention at ACML 2012. He was also a recipient of the Microsoft Fellowship in 2005 and the ECCV 2012 Outstanding Reviewer Award.

Shenghua Gao received the B.E. degree from the University of Science and Technology of China, Hefei, China, in 2008, and the Ph.D. degree from Nanyang Technological University, Singapore, in 2013. He was a Post-Doctoral Fellow with the Advanced Digital Sciences Center, Singapore, from 2012 to 2014. He is an Assistant Professor with ShanghaiTech University, Shanghai, China. His current research interests include computer vision and machine learning. Dr. Gao was a recipient of the Microsoft Research Fellowship in 2010.

Li Wang received the bachelor’s degree in information and computing science from the China University of Mining and Technology, Xuzhou, China, in 2006, and the master’s degree in computational mathematics from Xi’an Jiaotong University, Xi’an, China, in 2009. She is currently pursuing the Ph.D. degree with the Department of Mathematics, University of California at San Diego, La Jolla, CA, USA. Her current research interests include large-scale polynomial optimization, semiinfinite polynomial programming, and machine learning.
