

Learning Instance Correlation Functions for Multilabel Classification

Huawen Liu, Xuelong Li, Fellow, IEEE, and Shichao Zhang, Senior Member, IEEE

Abstract—Multilabel learning has a wide range of potential applications in reality. It has attracted a great deal of attention in the past years and has been extensively studied in many fields, including image annotation and text categorization. Although many efforts have been made in multilabel learning, two challenging issues remain, i.e., how to exploit the correlations and how to tackle the high dimensionality of multilabel data. In this paper, an effective algorithm is developed for multilabel classification that utilizes those data that are relevant to the targets. The key is the construction of a coefficient-based mapping between training and test instances, where the mapping relationship exploits the correlations among the instances, rather than the explicit relationship between the variables and the class labels of the data. Further, an ℓ1-norm penalty is imposed on the mapping relationship to make the model sparse, weakening the impact of noisy data. Our empirical study on eight public datasets shows that the proposed method is more effective in comparison with state-of-the-art multilabel classifiers.

Index Terms—ℓ1-norm, instance-based learning, k-nearest neighbors (kNNs), multilabel classification, partial least square (PLS) regression.

I. INTRODUCTION

In the real world, objects are often associated with multiple categories simultaneously. For example, Trans. Cyber. may be associated with IEEE, cybernetics, and journal; an image portraying the Great Wall may be tagged with ancient architecture, cultural heritage, and China; and a document reporting on the 2014 World Cup may be linked with sport, soccer, Brazil, and so on. Such data are known as multilabel data. They are prevalent in many application domains, including text categorization, information retrieval, computer vision, image and video annotation, the semantic Web, and bioinformatics [1], [2].

Manuscript received May 22, 2015; revised September 16, 2015 and November 29, 2015; accepted January 17, 2016. This work was supported in part by the China 973 Program under Grant 2013CB329404, in part by the National Science Foundation (NSF) of China under Grant 61572443, Grant 61450001, and Grant 61170131, and in part by the NSF of Zhejiang Province under Grant LY14F020012. This paper was recommended by Associate Editor R. Tagliaferri. (Corresponding author: Shichao Zhang.) H. Liu is with the Department of Computer Science, Zhejiang Normal University, Jinhua 321004, China (e-mail: [email protected]). X. Li is with the Center for OPTical IMagery Analysis and Learning (OPTIMAL), State Key Laboratory of Transient Optics and Photonics, Xi’an Institute of Optics and Precision Mechanics, Chinese Academy of Sciences, Xi’an 710119, Shaanxi, China (e-mail: [email protected]). S. Zhang is with the Department of Computer Science, Zhejiang Gongshang University, Hangzhou 310018, China (e-mail: [email protected]). Color versions of one or more of the figures in this paper are available online at http://ieeexplore.ieee.org. Digital Object Identifier 10.1109/TCYB.2016.2519683

Multilabel learning refers to the process of constructing models from given multilabel data, allowing users to analyze the patterns or behaviors behind the data and to handle new scenarios [2]. Unlike traditional supervised learning, where each datum is tagged with exactly one class label, multilabel learning concerns data associated with more than one class label simultaneously. Since multilabel learning has a wide range of potential applications in reality, it has attracted increasing attention from many fields, and a great number of learning algorithms, such as MLStacking [3], AdaBoost.MH [4], and MLkNN [5], have been proposed in the past years. For example, Tsoumakas et al. [6] transformed multilabel data into single-label data by taking k labels randomly as a new label at one time, and then constructed a model with an ensemble strategy. Liu et al. [7] exploited sparse logistic regression to explore multilabel data, while Cheng and Hullermeier [8] developed a learning method, IBLR-ML, for multilabel classification problems by combining instance-based learning and logistic regression.

Although many efforts have been made, two challenges still remain for multilabel learning: how to exploit the correlations effectively and how to cope with the high dimensionality of multilabel data efficiently. Solving these two challenges would advance the widespread application of multilabel learning algorithms in reality. For the former, several statistical learning techniques have been extensively studied to capture the correlations of variables (or features) and class labels. As a typical example, Ji et al. [9] extracted a common subspace shared among multiple labels by using ridge regression. To tackle the high-dimensional problems, dimension reduction and feature selection are two routine solutions, as they are in traditional learning methods [10]. For instance, Gu et al. [11] modeled the label correlations with a matrix-variate normal prior distribution and then formulated feature selection as a mixed integer programming problem, and Lee and Kim [12] performed feature selection for multilabel classification by using mutual information. However, the results of dimension reduction are difficult to interpret, while feature selection may lose some useful information.

In this paper, we propose a simple yet effective multilabel learning algorithm that sidesteps these problems.




Specifically, we construct a multilabel classification model using a kind of mapping relationship among the instances, represented as a weighted combination form, rather than the correlations between the variables and the class labels of the data. We adopt partial least square (PLS) regression to model the mapping relationship of the instances, and then predict the class labels of new instances with the technique of discriminant analysis. Furthermore, a sparsity-inducing term, the ℓ1-norm, is imposed on the relationship to make the model sparse, improving its robustness and alleviating the overfitting problem [13].

Our method is, to some extent, a general framework for instance-based multilabel learning methods. Although instance-based learning methods, e.g., k-nearest neighbors (kNNs), have been widely applied in multilabel learning, a common limitation of this type of algorithm is that its performance heavily relies on k, whose optimal value varies from data to data and is difficult to determine. Note that several statistical learning techniques, such as multivariate linear regression [14], PLS, and canonical correlation analysis (CCA), have also been used for multilabel problems in the literature (see [10], [15], [16]). However, our method differs from the existing ones in that we make use of the mapping relationship among the instances, instead of the correlations between the variables and the labels of the data, to construct a classification model.

In a nutshell, the key contributions of this paper are highlighted as follows.
1) We propose a new learning framework for multilabel data that utilizes the relationship among the instances, rather than the relationship between the variables and the class labels of the data.
2) We exploit the technique of PLS regression to explore the mapping relationship, represented as a weighted combination form, of the instances to construct our classification model.
3) To improve the robustness of our model, a sparsity-inducing term, i.e., the ℓ1-norm, is imposed on the weighted coefficients, making the model sparse. Ultimately, it can effectively handle the problems of overfitting and noisy data.

The remainder of this paper is organized as follows. Section II reviews the related work on multilabel learning. The proposed learning framework for multilabel data is presented in Section III. The experimental results of our method and the comparison algorithms on the datasets are provided in Section IV. Section V concludes this paper.

II. RELATED WORK

Due to its wide range of potential applications, a rich body of work on multilabel learning has been produced in the past years. In this section, we briefly review the related multilabel learning methods; more details can be found in the surveys [1], [2], [17] and the references therein. For multilabel data, two different strategies are available.

The first strategy, called algorithm adaptation, extends traditional learning algorithms, such as support vector machines (SVMs), AdaBoost, and kNN, with some constraints so that they can appropriately handle multilabel data when building models. Typical examples of this kind include backpropagation for multilabel learning (BP-MLL) [18] and AdaBoost.MH [4]. However, the performance of this kind of method is not as good as expected in some cases, since these methods are deeply rooted in traditional learning and take few characteristics of multilabel data into account.

The second strategy, named problem transformation, transforms multilabel data into conventional data, allowing off-the-shelf learning algorithms to work on them as usual. Afterward, the individually trained classifiers are fused into an overall one by using ensemble techniques [19]. Binary relevance (BR), label powerset (LP), and random k-labelsets [6] are representative examples. They, however, do not fully consider the correlations of labels and variables, which are usually dependent on each other in reality. Besides, they may suffer from the problem of unbalanced data, especially when there are a large number of labels within the data [7], [20].

To capture the correlations or dependencies, several techniques, including label ranking [21], pairwise dependency [22], probabilistic graphical models [23], tree or directed acyclic graph structures [24], random graphs [25], hypercubes [26], compressive sensing [27], Bayesian networks [28], and information entropy [29], have been adopted in multilabel learning. For instance, Wu et al. [30] modeled the label correlations in the form of a hierarchical tree, which is constructed by the induction of one-against-all SVM classifiers at each node; if labels co-occur frequently at leaf nodes, they are supposed to be relevant. Although these learning methods may improve empirical performance to some extent, they do not address the issue of high dimensionality.

For the high-dimensional problems, dimension reduction and variable (or feature) selection are two frequently used solutions. The former seeks a succinct and low-dimensional variable space that preserves the intrinsic characteristics of the original data. For instance, Ji et al. [9] extracted a common subspace shared between the label and variable spaces using ridge regression. Besides, multivariate statistical methods, such as CCA and PLS, have also been used for dimension reduction [15], [16], [31]. A common limitation of dimension reduction is that the derived results are hard to interpret; when the dimensionality of the data is very high, interpretation becomes practically impossible.

The second solution, variable selection, aims at finding a subset of representative variables with which to construct models, discarding the unimportant ones [32]. As a result, it can improve the efficiency of learning and even the performance of the final models [33]. In the context of multilabel learning, conventional selection methods, such as ReliefF, the F-statistic, correlation-based feature selection, χ², and mutual information, have been extended to fit multilabel data [12], [34], [35]. Note that most selection methods work under the framework of problem transformation, leading to high computation costs [36]. Moreover, they do not consider the dependencies among the class labels [37].


Instance-based learning is capable of coping with the aforementioned problems, because it constructs learning models directly from the training instances themselves [38]. In multilabel learning, the kNN method has been extensively studied, and several kNN-based learning methods, such as MLkNN [5], BRkNN, and LPkNN [6], have been developed. Besides, many multilabel classifiers take kNN as their base classifier during the classification stage [36]. The reason for the popularity of kNN is that it can easily be extended to the multilabel case while retaining relatively good performance. However, kNN is sensitive to noisy data, and its performance varies with k, whose optimal value depends on the data at hand and is hard to determine [39]. Meanwhile, multiple-instance learning has also been considered in [40] and [41].

Here we propose a new multilabel learning algorithm using weighted combinations of the instances, which are explored by PLS regression. To some extent, our method is a kind of instance-based learning method, yet it does not suffer from the aforementioned problems. Moreover, it is a weighted form of kNN and is free from setting the parameter k. In the literature, several learning algorithms also adopt PLS or CCA to build multilabel classification models; sparse PLS (sPLS) [16] and regularized CCA [31] are representative examples. However, our method is distinct from the existing ones in that PLS is used here to explore the mapping relationship among the instances, instead of the correlations between the variables and the labels.

III. PROPOSED LEARNING METHOD

A. Problem Formulation

Let X = R^d be a d-dimensional instance space, and let L = {0, 1}^q be the space of q possible labels {l_1, l_2, ..., l_q}. An instance (or example) x ∈ X, denoted as a vector of variable values x = (x_1, x_2, ..., x_d), is associated with a set of labels y = {l_1, l_2, ..., l_p} ⊆ L. Formally, y can be represented as a vector of values corresponding to the labels, i.e., y = (L_1, L_2, ..., L_q), where L_i = 1 means that the ith label l_i is relevant to x; otherwise l_i is irrelevant to x. A data collection D = {(X_i, Y_i) | 1 ≤ i ≤ n} consisting of n independent and identically distributed instances is called a multilabel dataset, where X_i ∈ X and Y_i ∈ L. Formally, X = {X_i}_{i=1}^n and Y = {Y_i}_{i=1}^n are denoted as n × d and n × q matrices, respectively.

Given the multilabel data D, the task of multilabel learning is to generate a multilabel learning model (function or hypothesis) h : X → Y such that some specific evaluation or loss functions are satisfied. During prediction or classification, for any unseen instance (x, ?) with unknown relevant labels, the multilabel classifier h predicts a proper subset of the labels for x.

B. Learning Schema

Generally speaking, multilabel learning algorithms can be represented as in Fig. 1, where D is divided into two parts: 1) a training dataset D_t and 2) a test dataset D_s, i.e., D = D_t ∪ D_s. D_t and D_s are usually disjoint, i.e., D_t ∩ D_s = ∅.
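To make this notation concrete, the following minimal NumPy sketch (all sizes and names are illustrative, not taken from the paper) builds an n × d feature matrix X, an n × q binary label matrix Y, and the disjoint split D = D_t ∪ D_s:

```python
import numpy as np

rng = np.random.default_rng(0)

n, d, q = 200, 100, 6                        # instances, variables, labels
X = rng.standard_normal((n, d))              # instance space X = R^d
Y = (rng.random((n, q)) < 0.3).astype(int)   # label vectors in L = {0, 1}^q

n_t = 150                                    # disjoint split D = D_t U D_s
Xt, Yt = X[:n_t], Y[:n_t]                    # training part D_t
Xs, Ys = X[n_t:], Y[n_t:]                    # test part D_s (Ys is hidden at test time)
```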

Fig. 1. General diagram of multilabel learning.

Most of the existing learning methods construct h by explicitly identifying the mapping relationship f between X_t and Y_t. The underlying assumption is that D_t has the same data distribution as D_s. Thus, they only need to explore the mapping relationship f between the input space X_t and the label space Y_t of D_t when constructing the model h. Once f is available, it can be applied to D_s directly. Based on this analysis, the classification model h can be derived from f and then used to obtain the final prediction results for unknown data. For example, let f be the mapping function such that Y_t = f(X_t); the prediction result (i.e., the set of labels) for X_s is then Ŷ_s = f(X_s).

Usually, the mapping relationship f is represented as a real-valued function f : X_t × Y_t → R. From the statistical perspective, the function f(x, y) can be written as a conditional probability P(y | x), standing for the confidence of y being a proper label of x. The classification purpose of h can thus be achieved by introducing some threshold θ, i.e., h(x) = {y | f(x, y) ≥ θ, y ∈ L}. Consequently, for any label y ∈ L, if f(x, y) is larger than the given threshold θ, y is considered relevant to x; otherwise, y is considered irrelevant to x.

Our Goal: Rather than explicitly exploring the mapping relationship f between X_t and Y_t as defined above, our goal in this paper is twofold: modeling the mapping relationship between the instances X_t and X_s (the function g in Fig. 1), and exploiting this type of relationship to construct the classification model h. The underlying assumption is that the input space X shares the same relationship with the label space Y, because they are associated with the same instances. Therefore, the relationship g learned from X (i.e., X_t and X_s) can be applied to Y (Y_t and Y_s).
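For reference, the scoring-and-thresholding rule h(x) = {y | f(x, y) ≥ θ} described above (which SWIM also uses at prediction time, with θ = 0) can be sketched as follows; the score matrix here is a hypothetical stand-in for f, not output of any particular method.

```python
import numpy as np

def threshold_decision(scores, theta=0.5):
    """h(x) = {y | f(x, y) >= theta}: keep the labels whose scores pass the threshold."""
    return (scores >= theta).astype(int)

# scores[i, j] plays the role of f(x_i, l_j), e.g., a confidence P(l_j | x_i)
scores = np.array([[0.9, 0.2, 0.7],
                   [0.1, 0.6, 0.4]])
print(threshold_decision(scores))            # [[1 0 1]
                                             #  [0 1 0]]
```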



C. Modeling the Relationship

To model the mapping relationship g, the similarities or distances among the instances, which are at the heart of instance-based and unsupervised learning methods, can be used. However, simply selecting the most similar or nearest instances tends to be sensitive to noisy data and raises the overfitting problem [42]. In particular, when the dimensionality of the data is high, all instances, including noisy ones, tend to be close to each other [39], rendering instance-based learning methods ineffective. To alleviate this problem, we resort to PLS, a special kind of linear regression [43], to model the mapping relationship g between X_t and X_s.

PLS was initially used to model the correlations of two sets of variables by means of latent variables called components [44]. After obtaining the latent variables, PLS projects the response variables onto the latent space to achieve the purpose of regression. Since PLS is capable of identifying the common latent variables, it is often taken as a tool for dimension reduction.

Let X^T be the transpose of X. X_t^T and X_s^T are d × n_t and d × n_s matrices, where n_t and n_s are the numbers of instances within X_t and X_s, respectively. Since X_t and X_s are drawn from the same data distribution, there is a common latent instance space in which X_t and X_s share the same or similar properties. Thus, we use PLS to seek the latent space from X_t and X_s. Formally, X_t and X_s can be decomposed as

    X_t^T = T P^T + E_t,    X_s^T = S Q^T + E_s        (1)

where T ∈ R^{d×m} and S ∈ R^{d×m} are the matrices of m latent vectors (called score vectors) extracted from X_t and X_s, respectively. The loading matrices P ∈ R^{n_t×m} and Q ∈ R^{n_s×m} are the coefficients of the latent vectors for X_t^T and X_s^T, respectively, and E_t and E_s are the corresponding random errors. With a linear transformation, T and S can be constructed from X_t^T and X_s^T as

    T = X_t^T W,    S = X_s^T C        (2)

where W ∈ R^{n_t×m} and C ∈ R^{n_s×m} are called weight matrices. According to this definition, the latent vectors T and S are the direction vectors of the data distributions encoded within X_t^T and X_s^T. Since X_t^T and X_s^T are drawn from the same data distribution, T and S can be linearly transformed into each other. Let Θ be an m × m diagonal matrix; S can then be represented as

    S = T Θ + E_d        (3)

where E_d is the residual matrix. Eventually, from (1) and (2) we have

    X_s^T = X_t^T W Θ Q^T + E_{s1} = X_t^T B + E_{s1}        (4)

where B = W Θ Q^T. This is a regression of X_s^T on X_t^T. We take the regression model (4) as our mapping function g, i.e., X_s^T = g(X_t^T) in Fig. 1, because it represents the mapping relationship of X_s to X_t. Once X_s^T and X_t^T are available, the corresponding weight matrices W and C, as well as T, S, Q, and Θ, can be obtained using PLS. Consequently, the proper set of labels Y_s for X_s can be determined by the mapping function g as Ŷ_s^T = g(Y_t^T). It should be pointed out that the weight matrices W and C consist of the eigenvectors of X_t X_s^T X_s X_t^T and X_s X_t^T X_t X_s^T, respectively, and can be obtained by singular value decomposition (SVD). However, the computing and storage costs of SVD are relatively high when the dimensionality d is high.

As mentioned above, the latent vectors T and S share similar properties or behaviors, because they encode most of the information of X_t and X_s, respectively. Thus they are highly correlated and can be linearly transformed into each other [see (3)]. From the statistical point of view, the column vectors t and s of T and S, i.e., the corresponding directions of X_t and X_s, respectively, have maximal sample covariance. Meanwhile, t and s can be derived from w and c, i.e., the column vectors of W and C, respectively. Thus, we can obtain w and c by solving the following optimization problem:

    arg max_{w,c} cov(X_t^T w, X_s^T c)
    s.t. w^T w = 1, c^T c = 1        (5)

where cov(·, ·) denotes the sample covariance. Further, (5) is equivalent to

    arg min_{w,c} (1/2) ||X_t^T w − X_s^T c||_2^2
    s.t. w^T w = 1, c^T c = 1.        (6)

Equation (6) can be solved as a regression problem, or by the power iteration method [44], in an alternating manner. Indeed, let l(w, c) be the objective function of the above formula. It is bilinear with respect to w and c: if w (or c) is fixed, the function is linear in c (or w) and can be taken as a regression of c (or w) on w (or c). Therefore, given an initial value of w (or c), we can derive the values of c and w alternately.

After obtaining the first pair of singular vectors w and c, we can extract the remaining singular vectors w_i and c_i (i = 2, ..., m) by applying the same procedure sequentially. For example, once the first pair w and c is available, its information is removed from X_t and X_s, respectively; the next pair of singular vectors is then obtained by applying the same extraction procedure to the residual matrices. This process is repeated until no information remains in the residual matrices or enough vectors have been derived.

Once the weight matrices W and C are available, we can obtain Θ and Q accordingly, yielding the regression model of X_s^T on X_t^T. Thus, any instance x_{si} in X_s can be represented as a weighted combination of the instances x_{tl} in X_t, that is,

    x_{si} = (x_{t1} x_{t2} ... x_{t n_t}) (b_{i1} b_{i2} ... b_{i n_t})^T.        (7)

Based on (4) and (7), we can predict the proper set of labels for a new instance. The predicted result would be accurate if the training data were free of noise. However, real-world data often contain noise arising from different sources, and if noisy instances are retained in the model, the predicted results may be questionable. It is worth noting that each instance x_{si} ∈ X_s is a linear combination of all instances within X_t, whether normal or noisy, with different weighting coefficients. This implies that noisy data may also be used to construct the classification model, resulting in adverse effects on prediction. Fortunately, the quantity of noisy data is relatively small; such data do not change the main direction of the covariances and have weak effects on the training process. Thus, the corresponding coefficients (e.g., b_{ik}) of the noisy instances (e.g., x_{tk}) within the linear combination of x_{si} are small.
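As a point of reference, the unpenalized problem (5)/(6) can be solved in one shot: the columns of W and C are the leading singular vectors of the cross-product matrix X_t X_s^T, which matches the eigenvector characterization given after (4). A minimal NumPy sketch follows (illustrative names; Xt and Xs are assumed to be centered n_t × d and n_s × d arrays).

```python
import numpy as np

def pls_instance_weights(Xt, Xs, m):
    """Unpenalized solution of (5)/(6): leading singular vectors of Xt Xs^T.

    The columns of W (n_t x m) and C (n_s x m) are the directions whose
    scores Xt^T w and Xs^T c have maximal sample covariance.
    """
    M = Xt @ Xs.T                                    # n_t x n_s cross-product matrix
    U, _, Vt = np.linalg.svd(M, full_matrices=False)
    return U[:, :m], Vt[:m].T                        # W, C
```

Forming and decomposing M explicitly is what the NIPALS-style iteration of Section III-D avoids, while also accommodating the sparsity penalty introduced next.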


To weaken the effects of noisy data, a natural and effective solution is to shrink the small coefficients to zero, that is, to make the weighted combination of the instances sparse. Intuitively, the larger a coefficient, the more important the corresponding instance; instances with small coefficients have little impact on predicting labels for new instances. As a result, shrinking them to zero does not greatly change the prediction results, and it yields a more robust model.

We adopt the ℓ1-norm penalty as the constraint to penalize the weight vectors w and c in our model, although several other sparsity-inducing terms, including the ℓ0-norm, the ℓ1,2-norm, and the smoothly clipped absolute deviation, have been introduced in [45]. As a matter of fact, the ℓ1-norm penalty performs variable selection by forcing some coefficients to be exactly zero [46], leading to sparse vectors w and c. Specifically, we impose the sparsity constraint on w and c in (6) as follows:

    arg min_{w,c} (1/2) ||X_t^T w − X_s^T c||_2^2 + λ_w ||w||_1 + λ_c ||c||_1
    s.t. w^T w = 1, c^T c = 1        (8)

where λ_w and λ_c are two constants that tune the effects of the constraints. Let l(w, c, λ_w, λ_c) be the objective function of (8). Under these constraints, each element w_i (or c_i) of w (or c) is penalized and shrinks in comparison with λ_w (or λ_c); as a result, the weight coefficients of instances that are small relative to λ_w and λ_c become exactly zero.

The objective function l(w, c, λ_w, λ_c) is nonsmooth due to the ℓ1-norm regularization terms, but the minimization problem is convex. In fact, if one of the weight vectors is fixed at a time, it turns into a classical LASSO problem [47]. Additionally, l(w, c, λ_w, λ_c) is still a bilinear function of the weight vectors, although the ℓ1-norm penalty terms are included. Thus, we solve the optimization problem by applying off-the-shelf LASSO methods alternately. For example, consider obtaining w when c is fixed. Differentiating (8) with respect to w and setting the result to zero, we have

    w = Soft(X_t X_s^T c, λ_w 1)        (9)

where Soft(μ, ν) = sign(μ)(|μ| − ν)_+ is the soft-thresholding function applied componentwise to the vector μ, 1 is the all-ones vector, sign(·) is the sign function (sign(w_i) = 1 if w_i ≥ 0 and −1 otherwise), and (·)_+ returns the non-negative part. Similarly, we can obtain a sparse weight vector c when w is fixed:

    c = Soft(X_s X_t^T w, λ_c 1).        (10)
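The componentwise soft-thresholding operator in (9) and (10) is simple to implement; a minimal NumPy sketch follows (the function name is illustrative).

```python
import numpy as np

def soft_threshold(mu, nu):
    """Componentwise Soft(mu, nu) = sign(mu) * max(|mu| - nu, 0)."""
    return np.sign(mu) * np.maximum(np.abs(mu) - nu, 0.0)

# entries whose magnitude does not exceed nu are shrunk exactly to zero
print(soft_threshold(np.array([0.30, -0.05, 0.80]), 0.1))   # shrinks to [0.2, 0.0, 0.7]
```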


D. Our Learning Algorithm

Based on the discussions above, we propose a novel multilabel learning method called sparse weighted instance-based multilabel (SWIM) learning. It mainly exploits weighted instances to construct the multilabel model h. Since the objective function l(w, c, λ_w, λ_c) of (8) is bilinear, we adopt nonlinear iterative PLS (NIPALS) [44] to obtain the sparse instance weights efficiently.

Algorithm 1: SWIM Method for Multilabel Data
Input: multilabel data with known labels D_t = (X_t, Y_t); multilabel data D_s = (X_s, ?); λ_w and λ_c
Output: a label set Y_s for X_s
  Initialize the relevant parameters, e.g., W = Q = Θ = ∅
  For i = 1, ..., m
      t = 1; s = 1; st = 0
      Repeat
          st = st + 1; s_0 = s
          w = Soft(|X_t s / (s^T s)|, λ_w 1);  w = w / (w^T w)
          t = X_t^T w
          c = Soft(|X_s t / (t^T t)|, λ_c 1);  c = c / (c^T c)
          s = X_s^T c
      Until |s − s_0| ≤ ε or st ≥ 100
      p = X_t t / (t^T t);  q = X_s s / (s^T s)
      Q = [Q, q];  W = [W, w];  Θ_ii = t^T s
      Obtain the residual information of X_t and X_s: X_t^T = X_t^T − t p^T;  X_s^T = X_s^T − s q^T
  End For
  Construct the multilabel model h with W, Q, and Θ
  Return Y_s^T = Y_t^T W Θ Q^T
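A compact NumPy sketch of Algorithm 1 is given below. It is a rough illustration, not the authors' MATLAB implementation: it reuses the soft_threshold helper sketched above, follows the updates (9) and (10) and omits the absolute value that appears inside Soft in Algorithm 1, and adds tiny epsilons to guard the divisions; all parameter names and defaults are illustrative.

```python
import numpy as np

def swim_fit(Xt, Xs, m=25, lam_w=0.1, lam_c=0.1, eps=1e-6, max_iter=100):
    """Sparse NIPALS-style loop of Algorithm 1: extract m latent pairs."""
    Xt, Xs = Xt.astype(float).copy(), Xs.astype(float).copy()
    n_t, d = Xt.shape
    n_s = Xs.shape[0]
    W, Q, theta = np.zeros((n_t, m)), np.zeros((n_s, m)), np.zeros(m)
    tiny = 1e-12                                    # guards against all-zero vectors
    for i in range(m):
        s = np.ones(d)                              # initial score vector of X_s
        for _ in range(max_iter):
            s_old = s
            w = soft_threshold(Xt @ s / (s @ s + tiny), lam_w)
            w = w / (w @ w + tiny)                  # normalization as in Algorithm 1
            t = Xt.T @ w                            # latent (score) vector of X_t
            c = soft_threshold(Xs @ t / (t @ t + tiny), lam_c)
            c = c / (c @ c + tiny)
            s = Xs.T @ c                            # latent (score) vector of X_s
            if np.linalg.norm(s - s_old) <= eps:
                break
        p = Xt @ t / (t @ t + tiny)                 # loading vectors
        q = Xs @ s / (s @ s + tiny)
        W[:, i], Q[:, i], theta[i] = w, q, t @ s    # collect W, Q and diag(Theta)
        Xt -= np.outer(p, t)                        # deflation: X_t^T <- X_t^T - t p^T
        Xs -= np.outer(q, s)                        # deflation: X_s^T <- X_s^T - s q^T
    return W, Q, theta

def swim_predict(Yt, W, Q, theta):
    """Y_s^T = Y_t^T W Theta Q^T, then binarize at zero (Yt assumed centered)."""
    scores = (Yt.T @ W @ np.diag(theta) @ Q.T).T    # n_s x q continuous predictions
    return (scores > 0).astype(int)
```

Centering Xt, Xs, and Yt (subtracting column means) before calling these helpers matches the assumption, stated below, that the prediction threshold can be set to zero.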

NIPALS estimates and updates the latent vectors t and s and the coefficients w and c alternately, rather than computing X_t X_s^T or X_s X_t^T directly. The details of our learning method are presented in Algorithm 1. It mainly consists of two nested loops, where the inner loop is an NIPALS-style procedure. The inner loop aims at obtaining the succinct structures of the weight coefficients w and c, thereby making the instance weights sparse. Given an initial vector s (or t), the NIPALS-style procedure repeats until a given convergence criterion is met, e.g., the maximal number of iterations reaches a prespecified number, or the difference between s (or t) and its previous value is less than a given small value.

Since a single latent variable is not enough to represent all the information encoded within X, more latent variables are needed. The major purpose of the outer loop is to identify more latent information from the data. Specifically, after the first pair of latent variables t and s is obtained by the inner loop, its information is removed from the original data; the second pair of latent variables, having maximal covariance with the remaining data, can then be discovered in a similar way. Eventually, we construct the multilabel learning model h using W, Q, and Θ after m iterations.

The last step of SWIM is to make a prediction for the instances X_s. The label set Y_s is determined according to the learned model h, i.e., (4). Since the predicted values of Y_s are continuous, they are transformed into the corresponding binary values (i.e., {0, 1}) by comparing them with a threshold θ. Usually, θ is set to zero, because Y_t is assumed to be centered before training: each positive value of Y_s is reassigned to one and the others to zero.

In SWIM, two parameters need to be tuned carefully. The first is the number of latent variables m. There is no widely accepted solution for choosing the optimal m; generally, the more latent variables, the better the performance of h and the higher the complexity.



The value of m is often provided by users in advance or determined by criteria such as the predicted residual sum of squares or cross validation. The second is the pair of regularization constants λ_w and λ_c, which control the tradeoff between model parsimony and data fitting. Larger values lead to greater sparsity but relatively poorer performance. Cross validation, the area under the curve, the Akaike information criterion, the Bayesian information criterion, and the degrees of freedom of models are popular techniques for determining their optimal values [48]. In practice, they are often set empirically for the sake of simplicity. In our experience, setting λ_w = λ_c = 0.1 and m = 25 is appropriate (see Section IV-B4 for more information).

Note that the inner loop of SWIM, i.e., the NIPALS-style stage, works in a power-iteration manner. Given an initial value of s, the inner loop converges and terminates after st iterations. Assume there are n_t and n_s instances within D_t and D_s, respectively, i.e., D_t ∈ R^{n_t×d} and D_s ∈ R^{n_s×d}. The time complexity of computing w is O(n_t), while that of t is O(d); the cases of c and s are analogous. Thus, the inner loop requires O(max(n_t + n_s, d) st) time to converge. Typically, st is a small number (st = 5 to 10). If m pairs of latent variables are considered, the total time complexity of SWIM is O(n m st), where n = max(n_t + n_s, d).

E. Relationship to the Existing Methods

In the literature, several NIPALS-like methods have been developed and applied in multilabel learning; typical examples include sPLS [16], regularized CCA [31], and sparse CCA (sCCA) [15]. Although these multilabel learning methods are also implemented with NIPALS-style strategies, they construct the model h with the function f (see Fig. 1), i.e., the correlations between the instance space and the label space (X_t and Y_t), rather than the mapping relationship between the instances (X_t and X_s), as our method does.

Instance-based learning techniques, e.g., kNN, have been extensively studied in multilabel learning. For example, IBLR-ML [8] combines logistic regression and kNN to construct a multilabel classification model, and MLkNN [5] exploits the label information of the k nearest neighbors, together with the maximum a posteriori principle, to determine the label sets of unseen instances. Although SWIM also belongs to the instance-based learning methods, it differs from conventional kNN methods in the following respects. First, SWIM exploits PLS to explore the relationship among the instances. Second, the prediction results are determined by similar instances in a weighted manner instead of by the labels of the k nearest neighbors; indeed, our method is a general form of kNN. Third, our method is free from the parameter k and insensitive to noisy data, because a sparse structure is adopted during the training stage.

IV. EXPERIMENTAL STUDY

To evaluate the effectiveness and robustness of the proposed method, we conducted comparison experiments on eight multilabel datasets with 14 popular multilabel learning algorithms. This section reports our experimental settings and results.

TABLE I. DESCRIPTION OF EXPERIMENTAL DATA, WHERE INS., VAR., LAB., CARD., AND DENS. REFER TO INSTANCES, VARIABLES (FEATURES), LABELS, CARDINALITY, AND DENSITY, RESPECTIVELY

A. Experimental Settings

1) Datasets: We carried out the comparison experiments on eight multilabel datasets of different types and sizes: Arts, Education, Entertainment, Health, Recreation, Reference, Science, and Social. These datasets have been frequently used to validate the performance of multilabel classification models in the literature and can be downloaded from the MULAN Website.¹ Table I summarizes the statistics, sizes, and label cardinality information of these datasets, where the Ins., Var., and Lab. columns denote the total numbers of instances, variables, and class labels of the datasets, respectively. The Card. column refers to the average number of class labels per instance in the corresponding dataset, and the Dens. column is the cardinality divided by the number of labels. From this table, one may observe that the multilabel datasets vary in the numbers of labels and differ greatly in the numbers of variables.

2) Evaluation Criteria: Traditional learning algorithms are often evaluated by precision or accuracy, i.e., the number of correctly predicted instances relative to the total number of instances in a test dataset. However, this is not appropriate for multilabel learning, because the output of a multilabel classifier involves multiple class labels at the same time. In our experiments, we adopted four criteria to evaluate the performance of the multilabel learning methods: Hamming loss (HL), ranking loss (RL), one error (OE), and average precision (AP) [1].
1) Hamming Loss: This criterion is the percentage of labels that are classified incorrectly by the multilabel classifier. The misclassified labels include relevant labels that are not predicted and irrelevant labels that are predicted.
2) Ranking Loss: Ranking the predicted labels is important in multilabel learning, especially for label ranking methods, because the relevant labels are expected to be output first and ranked higher than the irrelevant ones. RL measures the degree of mis-ordering, i.e., how often irrelevant labels are ranked higher than relevant ones in the predicted results.

¹http://mlkd.csd.auth.gr/multilabel.html


3) One Error: Like the criterion of accuracy in traditional learning tasks, OE summarizes how often the most relevant label (i.e., the top-ranked label in each prediction) is irrelevant to the true labels of the instance.
4) Average Precision: This measure places more emphasis on the relevant labels. It refers to the average fraction of relevant labels among the labels ranked above each relevant label. (An illustrative code sketch of these four criteria is given at the end of this subsection.)

3) Comparison Methods: To make a comprehensive comparison, we carried out three groups of experiments. The first group compared SWIM with statistical multilabel learning algorithms. Statistical learning techniques have recently been studied extensively in multilabel learning; typical examples include PLS and CCA, both of which are good at measuring the correlations between two sets of variables. We took PLS [44], sPLS [16], PPLS-MD [10], CCA [31], and sCCA [15] as the comparison methods.

The second group compared SWIM with instance-based learning methods. As mentioned above, kNN has been extensively studied in multilabel learning, and SWIM is a general framework for instance-based learning methods. This group therefore aims at showing the effectiveness of SWIM in comparison with the kNN-based classifiers LPkNN [17], BRkNN [6], and MLkNN [5].

The third group compared our method with classical multilabel learning algorithms, such as BP-MLL [18], AdaBoost.MH [4], HOMER [49], MLStacking [3], pruned problem transformation [50], and ClassifierChain [51]. These learning algorithms represent different learning techniques and have relatively good performance. For instance, BP-MLL [18] and AdaBoost.MH [4] extend the traditional neural network and AdaBoost learning algorithms to the multilabel case, while the rest belong to the problem transformation family. More details of these learning methods are provided in the related work section and the references therein.

The proposed algorithm and the statistical ones were implemented in MATLAB. The remaining multilabel classifiers, i.e., the instance-based and classical ones, were compared under the MULAN package [52], which contains off-the-shelf implementations. All experiments were conducted on a Pentium IV machine with a 1.7-GHz CPU and 1 GB of main memory. Throughout the experiments, tenfold cross-validation was adopted, and the final results are the averages over the ten rounds.
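The four criteria can be computed as follows; this is a rough NumPy rendering of the standard definitions in [1], not the evaluation code used in the paper. Y is the true n × q binary label matrix, Y_pred its binary prediction, and F holds real-valued relevance scores (higher means more relevant).

```python
import numpy as np

def hamming_loss(Y, Y_pred):
    """Fraction of label positions predicted incorrectly."""
    return np.mean(Y != Y_pred)

def one_error(Y, F):
    """Fraction of instances whose top-ranked label is not relevant."""
    top = np.argmax(F, axis=1)
    return np.mean(Y[np.arange(len(Y)), top] == 0)

def ranking_loss(Y, F):
    """Average fraction of (relevant, irrelevant) pairs ordered wrongly."""
    losses = []
    for y, f in zip(Y, F):
        rel, irr = f[y == 1], f[y == 0]
        if len(rel) == 0 or len(irr) == 0:
            continue
        wrong = np.sum(rel[:, None] <= irr[None, :])   # irrelevant scored >= relevant
        losses.append(wrong / (len(rel) * len(irr)))
    return np.mean(losses)

def average_precision(Y, F):
    """For each relevant label, the fraction of relevant labels ranked at or above it."""
    aps = []
    for y, f in zip(Y, F):
        if y.sum() == 0:
            continue
        order = np.argsort(-f)                          # labels by decreasing score
        ranked_y = y[order]
        ranks = np.where(ranked_y == 1)[0] + 1          # 1-based ranks of relevant labels
        precisions = np.cumsum(ranked_y)[ranked_y == 1] / ranks
        aps.append(precisions.mean())
    return np.mean(aps)
```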


B. Experimental Results and Discussion

1) Comparing to the Statistical Learning Methods: Since SWIM exploits penalized PLS to construct models, in the first group of experiments we compared SWIM with the statistical learning methods PLS, sPLS, PPLS-MD, CCA, and sCCA. CCA and its variants were chosen because they can also be used to obtain a common latent space between two sets of variables. The difference between CCA and PLS is that the variables in the latent space identified by CCA have maximal correlations, rather than the maximal covariances sought by PLS. It should be pointed out that the statistical learning methods explore the correlations between the variable space and the label space, i.e., the mapping function f in Fig. 1, while SWIM extracts the mapping function g from the instances (see Fig. 1).

During the whole experimental procedure, the parameters involved in the learning algorithms were assigned the same values for the sake of fair comparison. For example, in the sparse variants of CCA and PLS (i.e., SWIM, sCCA, PPLS-MD, and sPLS), the regularization parameters λ_w and λ_c were set to 0.1. In addition, all learning algorithms used the same number of latent variables (m = 25) to build classification models. The experimental results for the evaluation criteria are presented in Fig. 2, where the notation "↓" (or "↑") indicates that the lower (or higher) the curve, the better the performance of the classifier.

According to the results in Fig. 2, SWIM surpassed the statistical learning algorithms, except PPLS-MD, in most cases. For example, SWIM had the lowest HL on the datasets. Besides, SWIM achieved the best RL, OE, and AP on seven datasets compared with the other statistical learning methods. For RL (OE), SWIM was slightly worse than sPLS on Education (Arts); even so, the differences between them were very small, and they were not significantly different from each other under a statistical t-test.

Compared with PPLS-MD, SWIM achieved slightly poorer performance. For example, the AP of PPLS-MD was higher than that of SWIM on five datasets, and PPLS-MD outperformed SWIM for OE and RL in several cases. Even so, SWIM had lower HL than PPLS-MD on six out of eight datasets. The underlying reason is that PPLS-MD exploits the sparsity of the label space to explore the mapping relationship f, whereas SWIM does not consider the sparsity of the label space; it only models the mapping relationship g of the instances in the variable space and then applies g to the label space directly.

Intuitively, SWIM, PPLS-MD, and sPLS have similar properties, because they are sparse variants of PLS, notwithstanding their different purposes. This assertion was confirmed by the experiments: as shown in Fig. 2, SWIM and sPLS exhibited analogous behaviors, achieving better or worse performance on the evaluation criteria in similar cases. Compared with PLS, sPLS performed relatively worse on the Entertainment dataset; this is reasonable, because sPLS may lose some information when making the model sparse.

An interesting observation is that CCA and sCCA had relatively poor performance compared with PLS and its variants. The underlying reason is that CCA and its variants aim at identifying latent variables whose correlations are maximal, while PLS tries to discover latent variables with maximal covariances; indeed, maximal correlations do not necessarily imply good discriminant capability in classification and prediction.

2) Comparing to the kNN-Based Learning Methods: As mentioned above, SWIM takes the weights of the instances into account when predicting class labels, and it is a general framework for instance-based learning methods to some extent.



Fig. 2. Performance comparison of SWIM with CCA, sCCA, PLS, PPLS-MD, and sPLS, where the notation ↓ (or ↑) means that the lower (or higher) the curve, the better the performance of the classifier. (a) HL (↓). (b) RL (↓). (c) OE (↓). (d) AP (↑).

Thus, the second group of experiments aims at showing the effectiveness of SWIM compared with instance-based methods. We considered three popular kNN-based classifiers, BRkNN, LPkNN, and MLkNN, as baselines. The experimental results are presented in Fig. 3, where k = 5 for each classifier.

Observing Fig. 3, SWIM outperformed BRkNN and LPkNN significantly, showing that these problem transformation algorithms had relatively poor performance. In particular, LPkNN achieved the worst performance in the experiments. A likely reason is that the label densities of these datasets are very low while the numbers of labels are large, resulting in insufficient positive and negative instances to build models.

Compared with MLkNN, SWIM still showed its superiority for HL, OE, and AP. For RL, SWIM was no worse than MLkNN in most cases, except on Education and Social, where MLkNN had slightly better performance. However, when k = 3, the RL values achieved by MLkNN on these two datasets were 0.101 and 0.073, respectively, which are higher than those of SWIM. This indicates that the performance of instance-based learning algorithms relies on the parameter k, whose optimal value is difficult to determine in practice.

3) Comparing to the Classical Learning Methods: Apart from the statistical and instance-based multilabel classifiers, the classical multilabel learning methods were also used for a performance comparison with SWIM. These classifiers represent different learning techniques and have relatively good performance and efficiency in practice.

Table II reports the performance comparison of the multilabel classifiers under the four evaluation metrics. In the table, bold values denote that the corresponding classifier achieved the best performance on the dataset (i.e., in the same row). Additionally, a pairwise t-test between SWIM and the others was carried out to determine whether the performance of the proposed method was significantly different from the rest. Throughout this paper, a difference is considered significant at the 95% significance level if its p-value is less than 0.05. The notation "∗" in the table indicates that the performance of the corresponding classifier was significantly worse than that of SWIM.

According to the experimental results, our learning algorithm had promising performance and was comparable to the classical ones in most cases. For example, for HL, SWIM had the best performance on seven out of eight datasets. On the Education dataset, the HL of SWIM was slightly higher than that of MLStacking, while still lower than the others, and the difference between SWIM and the best one was very small.



Fig. 3. Performance comparison of SWIM with MLkNN, BRkNN, and LPkNN, where the notation ↓ (or ↑) means that the lower (or higher) the curve, the better the performance of the classifier. (a) HL (↓). (b) RL (↓). (c) OE (↓). (d) AP (↑).

Similar situations can also be found for RL and OE, where SWIM still took the predominant place compared with the classical multilabel classifiers. For AP, SWIM was significantly superior to the others, achieving the best performance on the datasets.

It is noticeable that the algorithm adaptation methods, e.g., BP-MLL and AdaBoost.MH, had relatively poor performance compared with the other learning algorithms. As shown in Table II, BP-MLL achieved the highest HL, RL, and OE and the lowest AP in most cases; these results were significantly worse than those of SWIM under a pairwise t-test.

4) Parameter Effects: Generally speaking, the regularization parameters may affect the performance of sparse learning methods, and SWIM, which involves the two regularization parameters λ_w and λ_c, is no exception. To illustrate the effects of λ_w and λ_c on the proposed method, we ran SWIM on the datasets and measured HL for different values of λ_w and λ_c. Fig. 4 presents the experimental results, where λ_w and λ_c (set to the same value) varied from 0.05 to 1.0.

The experimental results in Fig. 4 show that the HL of SWIM increased on the whole as the regularization parameters increased from 0.05 to 1.0, but the increase was not large. For instance, the HL of SWIM on the Social dataset was 0.0226 at λ_w (λ_c) = 0.05 and increased to 0.0262 at λ_w (λ_c) = 1.0. Apart from HL, similar trends were found for RL, OE, and AP. This indicates that setting λ_w (λ_c) to 0.1 in all experiments is reasonable.
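The sweep above can be reproduced in spirit with a few lines; this sketch chains together the hypothetical helpers from the earlier sketches (swim_fit, swim_predict, hamming_loss) and the toy Xt, Xs, Yt, Ys arrays from the first sketch in Section III, so the numbers it prints are illustrative only.

```python
# center with the training means, as assumed by the SWIM sketches
Xt_c, Xs_c = Xt - Xt.mean(axis=0), Xs - Xt.mean(axis=0)
Yt_c = Yt - Yt.mean(axis=0)

for lam in (0.05, 0.1, 0.2, 0.5, 1.0):
    W, Q, theta = swim_fit(Xt_c, Xs_c, m=25, lam_w=lam, lam_c=lam)
    Ys_pred = swim_predict(Yt_c, W, Q, theta)
    print(f"lambda = {lam:.2f}   HL = {hamming_loss(Ys, Ys_pred):.4f}")
```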

Fig. 4. HL of SWIM with different values of the regularization parameters λw and λc .

The second parameter of SWIM is m, the number of latent variables. Generally speaking, the performance of statistical learning algorithms depends on m: the more latent variables, the better the performance but the higher the complexity. As mentioned above, choosing an optimal value for m is still an open question, and its value is often determined empirically in the literature. In all the comparison experiments, we used 25 latent variables, i.e., m = 25, to construct models. To verify whether this choice was appropriate, we conducted an additional experiment with SWIM using different numbers of latent variables on the datasets.



TABLE II. PERFORMANCE COMPARISON OF SWIM WITH THE CLASSICAL MULTILABEL CLASSIFIERS, WHERE THE BOLD VALUES ARE THE BEST ONES AND ∗ DENOTES THAT THE CLASSIFIER IS SIGNIFICANTLY WORSE THAN SWIM IN A STATISTICAL t-TEST

Fig. 5 shows the APs of SWIM, where λ_w (λ_c) = 0.1 and m varied from 1 to 50. From the experimental results, one may observe that setting m = 25 was appropriate for the whole set of comparison experiments, because the APs of SWIM increased only slightly and tended to be stable after the number of latent variables reached 25 (see the dashed vertical line in the middle of Fig. 5), although they improved greatly at the beginning.

Fig. 5. APs of SWIM with different numbers of latent variables.

This is reasonable and consistent with our intuition, because the more the latent variables, the more discriminative information SWIM captures. Apart from AP, we also found similar results for HL, RL, and OE; owing to space limitations, we do not report them individually.

V. CONCLUSION

In this paper, we proposed a novel multilabel learning method that uses the mapping relationship among the instances, represented as a form of weighted combinations. Specifically, the proposed method exploits PLS regression to explore the mapping relationship of the instances, so that each instance can be represented as a weighted combination of the others. To further improve robustness, the ℓ1-norm penalty is imposed on the instance weights, making the classification model sparse. Our empirical results on eight public datasets with a broad range of multilabel learning methods show that the proposed sparse learning approach is competitive and outperforms popular multilabel learning algorithms in most cases. In the future, we will take kernel properties into account when using PLS regression to explore the correlations of multilabel data.


ACKNOWLEDGMENT The authors would like to thank the anonymous referees and the associate editor for their valuable comments and suggestions, which have improved this paper vastly. R EFERENCES [1] M.-L. Zhang and Z.-H. Zhou, “A review on multi-label learning algorithms,” IEEE Trans. Knowl. Data Eng., vol. 26, no. 8, pp. 1819–1837, Aug. 2014. [2] E. Gibaja and S. Ventura, “Multi-label learning: A review of the state of the art and ongoing research,” Wiley Interdiscipl. Rev. Data Min. Knowl. Disc., vol. 6, no. 4, pp. 411–444, 2014. [3] G. Tsoumakas et al., “Correlation-based pruning of stacked binary relevance models for multi-label learning,” in Proc. MLD, Bled, Slovenia, 2009, pp. 101–116. [4] D. Benbouzid, R. Busa-Fekete, N. Casagrande, F.-D. Collin, and B. Kégl, “MULTIBOOST: A multi-purpose boosting package,” J. Mach. Learn. Res., vol. 13, pp. 549–553, Mar. 2012. [5] M.-L. Zhang and Z.-H. Zhou, “ML-KNN: A lazy learning approach to multi-label learning,” Pattern Recognit., vol. 40, no. 7, pp. 2038–2048, 2007. [6] G. Tsoumakas, I. Katakis, and I. Vlahavas, “Random k-labelsets for multi-label classification,” IEEE Trans. Knowl. Data Eng., vol. 23, no. 7, pp. 1079–1089, Jul. 2011. [7] H. Liu, S. Zhang, and X. Wu, “MLSLR: Multilabel learning via sparse logistic regression,” Inf. Sci., vol. 281, pp. 310–320, Oct. 2014. [8] W. Cheng and E. Hullermeier, “Combining instance-based learning and logistic regression for multilabel classification,” Mach. Learn., vol. 76, nos. 2–3, pp. 211–225, 2009. [9] S. Ji, L. Tang, S. Yu, and J. Ye, “A shared-subspace learning framework for multi-label classification,” ACM Trans. Knowl. Disc. Data, vol. 4, no. 2, 2010, Art. ID 8. [10] H. Liu, Z. Ma, S. Zhang, and X. Wu, “Penalized partial least square discriminant analysis with 1 -norm for multi-label data,” Pattern Recognit., vol. 48, no. 5, pp. 1724–1733, 2015. [11] Q. Gu, Z. Li, and J. Han, “Correlated multi-label feature selection,” in Proc. 20th ACM Int. Conf. Inf. Knowl. Manag., Glasgow, U.K., 2011, pp. 1087–1096. [12] J. Lee and D.-W. Kim, “Feature selection for multi-label classification using multivariate mutual information,” Pattern Recognit. Lett., vol. 34, no. 3, pp. 349–357, 2013. [13] L. Shao, R. Yan, X. Li, and Y. Liu, “From heuristic optimization to dictionary learning: A review and comprehensive comparison of image denoising algorithms,” IEEE Trans. Cybern., vol. 44, no. 7, pp. 1001–1013, Jul. 2014. [14] Y. Su, X. Gao, X. Li, and D. Tao, “Multivariate multilinear regression,” IEEE Trans. Syst., Man, Cybern. B, Cybern., vol. 42, no. 6, pp. 1560–1573, Dec. 2012. [15] D. Chu, L.-Z. Liao, M. K. Ng, and X. Zhang, “Sparse canonical correlation analysis: New formulation and algorithm,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 35, no. 12, pp. 3050–3065, Dec. 2013. [16] D. Lee, W. Lee, Y. Lee, and Y. Pawitan, “Sparse partial leastsquares regression and its applications to high-throughput data analysis,” Chemometr. Intell. Lab. Syst., vol. 109, no. 1, pp. 1–8, 2011. [17] G. Tsoumakas, I. Katakis, and I. Vlahavas, “Mining multi-label data,” in Data Mining and Knowledge Discovery Handbook, 1st ed. New York, NY, USA: Springer, 2010, pp. 667–685. [18] M.-L. Zhang and Z.-H. Zhou, “Multilabel neural networks with applications to functional genomics and text categorization,” IEEE Trans. Knowl. Data Eng., vol. 18, no. 10, pp. 1338–1351, Oct. 2006. [19] J. Shen, Y. Zhao, S. Yan, and X. Li, “Exposure fusion using boosting Laplacian pyramid,” IEEE Trans. Cybern., vol. 44, no. 9, pp. 1579–1590, Sep. 2014. [20] J. 
Zhang, X. Wu, and V. S. Sheng, “Active learning with imbalanced multiple noisy labeling,” IEEE Trans. Cybern., vol. 45, no. 5, pp. 1095–1107, May 2015. [21] J. Furnkranz, E. Hullermeier, E. L. Mencía, and K. Brinker, “Multilabel classification via calibrated label ranking,” Mach. Learn., vol. 73, no. 2, pp. 133–153, 2008. [22] E. L. Mencía, S.-H. Park, and J. Fürnkranz, “Efficient voting prediction for pairwise multilabel classification,” Neurocomputing, vol. 73, nos. 7–9, pp. 1164–1176, 2010.


[23] Y. Guo and W. Xue, “Probabilistic multi-label classification with sparse feature learning,” in Proc. 23rd Int. Joint Conf. Artif. Intell., Beijing, China, 2013, pp. 1373–1379. [24] W. Bi and J. T. Kwok, “Multi-label classification on tree- and DAG-structured hierarchies,” in Proc. ICML, Bellevue, WA, USA, 2011, pp. 17–24. [25] H. Su and J. Rousu, “Multilabel classification through random graph ensembles,” Mach. Learn., vol. 99, no. 2, pp. 231–256, 2015. [26] F. Tai and H.-T. Lin, “Multilabel classification with principal label space transformation,” Neural Comput., vol. 24, no. 9, pp. 2508–2542, 2012. [27] D. J. Hsu, S. M. Kakade, J. Langford, and T. Zhang, “Multi-label prediction via compressed sensing,” in Proc. NIPS, Vancouver, BC, Canada, 2009, pp. 772–780. [28] L. E. Sucar et al., “Multi-label classification with Bayesian networkbased chain classifiers,” Pattern Recognit. Lett., vol. 41, pp. 14–22, May 2014. [29] S. Zhu, X. Ji, W. Xu, and Y. Gong, “Multi-labelled classification using maximum entropy method,” in Proc. ACM SIGIR, Salvador, Brazil, 2005, pp. 274–281. [30] Q. Wu, Y. Ye, H. Zhang, T. W. S. Chow, and S.-S. Ho, “ML-TREE: A tree-structure-based approach to multilabel learning,” IEEE Trans. Neural Netw. Learn. Syst., vol. 26, no. 3, pp. 430–443, Mar. 2015. [31] L. Sun, S. Ji, and J. Ye, “Canonical correlation analysis for multilabel classification: A least-squares formulation, extensions, and analysis,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 33, no. 1, pp. 194–200, Jan. 2011. [32] H. Liu, X. Wu, and S. Zhang, “A new supervised feature selection method for pattern classification,” Comput. Intell., vol. 30, no. 2, pp. 342–361, 2014. [33] R. Hong et al., “Image annotation by multiple-instance learning with discriminative feature mapping and selection,” IEEE Trans. Cybern., vol. 44, no. 5, pp. 669–680, May 2014. [34] H. Huang, C. Ding, D. Kong, and H. Zhao, “Multi-label ReliefF and F-statistic feature selections for image annotation,” in Proc. CVPR, Providence, RI, USA, 2012, pp. 2352–2359. [35] S. Jungjit, A. A. Freitas, M. Michaelis, and J. Cinatl, “Two extensions to multi-label correlation-based feature selection: A case study in bioinformatics,” in Proc. IEEE SMC, Manchester, U.K., 2013, pp. 1519–1524. [36] G. Madjarov, D. Kocev, D. Gjorgjevikj, and S. Džeroski, “An extensive experimental comparison of methods for multi-label learning,” Pattern Recognit., vol. 45, no. 9, pp. 3084–3104, 2012. [37] Y. Liu, F. Tang, and Z. Zeng, “Feature selection based on dependency margin,” IEEE Trans. Cybern., vol. 45, no. 6, pp. 1209–1221, Jun. 2015. [38] Y. Fu, X. Zhu, and A. K. Elmagarmid, “Active learning with optimal instance subset selection,” IEEE Trans. Cybern., vol. 43, no. 2, pp. 464–475, Apr. 2013. [39] H. Liu and S. Zhang, “Noisy data elimination using mutual k-nearest neighbor for classification mining,” J. Syst. Softw., vol. 85, no. 5, pp. 1067–1074, 2012. [40] Q. Wang, Y. Yuan, P. Yan, and X. Li, “Saliency detection by multipleinstance learning,” IEEE Trans. Cybern., vol. 43, no. 2, pp. 660–672, Apr. 2013. [41] Y. Xiao, B. Liu, Z. Hao, and L. Cao, “A similarity-based classification framework for multiple-instance learning,” IEEE Trans. Cybern., vol. 44, no. 4, pp. 500–515, Apr. 2014. [42] Y. Xu et al., “Data uncertainty in face recognition,” IEEE Trans. Cybern., vol. 44, no. 10, pp. 1950–1961, Oct. 2014. [43] Y. Dong, D. Tao, X. Li, J. Ma, and J. Pu, “Texture classification and retrieval using shearlets and linear regression,” IEEE Trans. Cybern, vol. 45, no. 3, pp. 
358–369, Mar. 2015. [44] A.-L. Boulesteix and K. Strimmer, “Partial least squares: A versatile tool for the analysis of high-dimensional genomic data,” Briefings Bioinformat., vol. 8, no. 1, pp. 32–44, 2007. [45] F. Bach, R. Jenatton, J. Mairal, and G. Obozinski, “Optimization with sparsity-inducing penalties,” Found. Trends Mach. Learn., vol. 4, no. 1, pp. 1–106, Jan. 2012. [46] C. Hou, F. Nie, X. Li, D. Yi, and Y. Wu, “Joint embedding learning and sparse regression: A framework for unsupervised feature selection,” IEEE Trans. Cybern, vol. 44, no. 6, pp. 793–804, Jun. 2014. [47] J. Friedman, T. Hastie, and R. Tibshirani, “Regularization paths for generalized linear models via coordinate descent,” J. Stat. Softw., vol. 33, no. 1, pp. 1–22, 2010. [48] E. Süli and D. F. Mayers, An Introduction to Numerical Analysis. Cambridge, U.K.: Cambridge Univ., 2003.



[49] G. Tsoumakas, I. Katakis, and I. Vlahavas, “Effective and efficient multilabel classification in domains with large number of labels,” in Proc. MMD, Antwerp, Belgium, 2008, pp. 1–15. [50] J. Read, “A pruned problem transformation method for multi-label classification,” in Proc. New Zealand Comput. Sci. Res. Student Conf., Christchurch, New Zealand, 2008, pp. 143–150. [51] J. Read, B. Pfahringer, G. Holmes, and E. Frank, “Classifier chains for multi-label classification,” Mach. Learn., vol. 85, no. 3, pp. 335–359, 2011. [52] G. Tsoumakas, E. Spyromitros-Xioufis, J. Vilcek, and I. Vlahavas, “Mulan: A Java library for multi-label learning,” J. Mach. Learn. Res., vol. 12, pp. 2411–2414, Jul. 2011.

Xuelong Li (M’02–SM’07–F’12) is a Full Professor with the Center for OPTical IMagery Analysis and Learning (OPTIMAL), State Key Laboratory of Transient Optics and Photonics, Xi’an Institute of Optics and Precision Mechanics, Chinese Academy of Sciences, Xi’an, Shaanxi, China.

Huawen Liu received the master’s and Ph.D. degrees in computer science from Jilin University, Changchun, China, in 2007 and 2010, respectively. He is currently an Associate Professor with the Department of Computer Science, Zhejiang Normal University, Jinhua, China. His current research interests include data mining, feature selection, sparse learning, and machine learning.

Shichao Zhang (SM’05) received the Ph.D. degree from CIAE, Beijing, China. He is a China “1000-Plan” Distinguished Professor with Zhejiang Gongshang University, Hangzhou, China, and a Guangxi “BaGui Scholar” Distinguished Professor with Guangxi Normal University, Guilin, China. His current research interests include machine learning and information quality. Dr. Zhang is a member of the ACM.
