
Multilabel Classification Using Error-Correcting Codes of Hard or Soft Bits

Chun-Sung Ferng and Hsuan-Tien Lin, Member, IEEE

Abstract—We formulate a framework for applying error-correcting codes (ECCs) on multilabel classification problems. The framework treats some base learners as noisy channels and uses ECC to correct the prediction errors made by the learners. The framework immediately leads to a novel ECC-based explanation of the popular random k-label sets (RAKEL) algorithm using a simple repetition ECC. With the framework, we empirically compare a broad spectrum of off-the-shelf ECC designs for multilabel classification. The results not only demonstrate that RAKEL can be improved by applying some stronger ECC, but also show that the traditional binary relevance approach can be enhanced by learning more parity-checking labels. Our research on different ECCs also helps to understand the tradeoff between the strength of ECC and the hardness of the base learning tasks. Furthermore, we extend our research to ECC with either hard (binary) or soft (real-valued) bits by designing a novel decoder. We demonstrate that the decoder improves the performance of our framework.

Index Terms—Error-correcting codes (ECCs), multilabel classification.

Manuscript received August 17, 2012; revised May 9, 2013; accepted June 10, 2013. Date of publication July 3, 2013; date of current version October 15, 2013. This work was supported in part by the National Science Council of Taiwan under Grant NSC 99-2628-E-002-017 and Grant 101-2628-E-002-029-MY2. The authors are with the Department of Computer Science and Information Engineering, National Taiwan University, Taipei 10617, Taiwan (e-mail: [email protected]; [email protected]). Color versions of one or more of the figures in this paper are available online at http://ieeexplore.ieee.org. Digital Object Identifier 10.1109/TNNLS.2013.2269615

I. INTRODUCTION

MULTILABEL classification is an extension of traditional multiclass classification. In particular, the latter aims at accurately associating a single label with an instance, whereas the former aims at associating a label set. Because of the increasing application needs in domains such as music categorization [1] and scene analysis [2], multilabel classification has been attracting much research attention in recent years.

Error-correcting code (ECC) roots from the information-theoretic pursuit of communication [3]. In particular, the ECC studies how to accurately recover a desired signal block after transmitting the block's encoding through a noisy communication channel. When the desired signal block is the single label (of some instances) and the noisy channel consists of binary classifiers, it is shown that a suitable use of the ECC improves the association (prediction) accuracy of multiclass classification [4]. Several designs, including some classic ECC [4] and adaptively constructed ECC [5], [6], reach promising empirical performance for multiclass classification.

Although the benefits of the ECC are well established for multiclass classification, the corresponding use for multilabel

classification remains an ongoing research direction. Reference [7] took the first step in this direction by proposing a multilabel classification approach that applies a classic ECC, the Bose-Chaudhuri-Hocquenghem (BCH) code. Although this approach shows some good experimental results over existing multilabel classification approaches, a more rigorous study is still needed to understand the advantages and disadvantages of different ECC designs; such a study is the main focus of this paper.

In this paper, we formalize the framework for applying the ECC on multilabel classification. The framework is more general than the existing ECC studies for both multiclass classification [4] and multilabel classification [7]. Then, we conduct a thorough study with a broad spectrum of classic ECC designs: the repetition code, the Hamming code (HAM), the BCH code, and the low-density parity-check code (LDPC). The four designs range from the simplest ECC idea to state-of-the-art ECC in communication systems. Interestingly, the framework allows us to give a novel ECC-based explanation of a popular algorithm: random k-label sets (RAKEL) [8]. In particular, RAKEL can be viewed as a special type of repetition code coupled with a batch of internal multilabel classifiers. We empirically demonstrate that RAKEL can be improved by replacing its repetition code with HAM, a slightly stronger ECC. Furthermore, even better performance can be achieved when replacing the repetition code with the BCH code. When compared with the traditional binary relevance (BR) approach without the ECC, multilabel classification with the ECC can perform significantly better. The empirical results justify the validity of the framework.

In addition, we design a new decoder for linear ECC using multiplications to approximate EXCLUSIVE-OR (XOR) operations. This decoder is able to handle not only ordinary binary bits from the channels, called hard inputs, but also real-valued bits, called soft inputs. For multilabel classification using the ECC, the soft inputs can be used to represent the confidence of the internal classifiers. Our newly designed decoder allows a proper use of the detailed confidence information to produce more accurate predictions. The experimental results show that this decoder indeed improves the performance of the multilabel classification with ECC (ML-ECC) framework with soft inputs.

The paper is organized as follows. We introduce the multilabel classification problem in Section I-A and present the related works in Section I-B. Section II illustrates the framework and demonstrates its effectiveness. Section III presents a new decoder for hard or soft inputs. We empirically study the performance of the framework and the proposed


decoder in Section IV. Finally, the conclusion is given in Section V.

A short version of the paper appeared in the 2011 Asian Conference on Machine Learning [9]. The paper was then enriched by the novel decoder for dealing with soft bits, the comparison with other ECC designs, and broader experiments. The paper was also the core of the first author's M.S. thesis [10].

A. Problem Setup

Multilabel classification aims at mapping an instance x ∈ R^d to a label set Y ⊆ L = {1, 2, . . . , K}, where K is the number of classes. Following the hypercube view of [11], the label set Y can be represented as a binary vector y ∈ {0, 1}^K, where y[i] is 1 if the ith label is in Y, and 0 otherwise. Consider a training data set D = {(x_n, y_n)}_{n=1}^{N}. A multilabel classification algorithm uses D to locate a multilabel classifier h: R^d → {0, 1}^K such that h(x) predicts y well on future test examples (x, y). There are several loss functions for evaluating whether ỹ = h(x) predicts y well. Two common ones are:

1) subset 0/1 loss: ε_{0/1}(ỹ, y) = [[ỹ ≠ y]];

2) Hamming loss: ε_{HL}(ỹ, y) = (1/K) Σ_{i=1}^{K} [[ỹ[i] ≠ y[i]]].

The [[·]] is the Boolean indicator function, which is 1 if the argument is true, and 0 otherwise. Reference [12] showed that the two loss functions focus on different statistics of the underlying probability distribution from a Bayesian perspective. In the algorithmic derivations that follow, we work with the 0/1 and Hamming loss functions, following the final remark of [12] to focus on the loss functions that are related to our algorithmic goals; in Section IV, on the other hand, we empirically evaluate the proposed framework on a wider range of other loss functions [8].
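To make the two loss functions concrete, the following minimal NumPy sketch evaluates them on a toy batch of predictions; the function names and the toy data are ours, chosen only for illustration.

```python
import numpy as np

def subset_01_loss(Y_pred, Y_true):
    # fraction of examples whose predicted label set differs from the
    # true label set in at least one of the K positions
    return float(np.mean(np.any(Y_pred != Y_true, axis=1)))

def hamming_loss(Y_pred, Y_true):
    # fraction of individual label bits predicted wrongly,
    # averaged over the K labels and over all examples
    return float(np.mean(Y_pred != Y_true))

# toy example: 3 test examples, K = 4 labels
Y_true = np.array([[1, 0, 1, 0], [0, 1, 0, 0], [1, 1, 0, 1]])
Y_pred = np.array([[1, 0, 1, 0],   # exact match
                   [0, 1, 1, 0],   # one wrong bit
                   [1, 0, 0, 1]])  # one wrong bit
print(subset_01_loss(Y_pred, Y_true))  # 2/3: two examples are not exact matches
print(hamming_loss(Y_pred, Y_true))    # 2/12: two wrong bits out of 3 * 4
```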

B. Related Works

The hypercube view [11] unifies many existing problem transformation approaches [8], which transform multilabel classification into one or more reduced learning tasks. For instance, one simple problem transformation approach is called BR, which learns one binary classifier per individual label. Another simple problem transformation approach is called label powerset (LP), which transforms multilabel classification into one multiclass classification task with a huge number of extended labels. One popular problem transformation approach that lies between BR and LP is called RAKEL [8], which transforms multilabel classification into many multiclass classification tasks with a smaller number of extended labels.

Some existing problem transformation approaches focus on compressing the label set vector y [11], [13], [14]—removing the redundancy within the binary signals (label sets) to form shorter codewords—which follows a classic task in information theory based on Shannon's first theorem [3]. Another classic task in information theory aims at expansion—adding redundancy in the (longer) codewords to ensure robust decoding against noise contamination. The power of expansion


is characterized by Shannon's second theorem [3]. The ECC aims at using the power of expansion systematically. In particular, the ECC works by encoding a block of signal into a longer codeword b before passing it through the noisy channel and then decoding the received codeword b̃ back to the block appropriately. Then, under some assumptions [15], the block can be perfectly recovered, resulting in zero block-decoding error. In some cases, the block can only be almost perfectly recovered, resulting in a few bit-decoding errors. If we take the block as the label set y for every example (x, y) and a batch of base learners as a channel that outputs the contaminated block b̃, the block-decoding error corresponds to ε_{0/1}, whereas the bit-decoding error corresponds to a scaled version of ε_{HL}. Such a correspondence motivates us to study whether suitable ECC designs can be used to improve multilabel classification, which will be formalized in Section II.

Most of the commonly used ECCs in communication systems are binary ECCs. That is, the codeword b is a binary vector. We apply this kind of ECC to multilabel classification, and review some of them in Section I-C. Another kind of ECC is the real-valued ECC, which uses real-valued vectors as the codewords. References [16] and [17] took this direction and designed special encoding and decoding functions for multilabel classification. Reference [16] used canonical correlation analysis to find the most linearly predictable transform of the original labels as the encoding function; [17] used metric learning to locate a good encoding function. Both works used approximate Bayesian inference for decoding. Real-valued ECC was also used for multiclass classification, such as in [18].

C. Review of Classic ECC

We review the four ECC designs that can be plugged into the framework that will be discussed in Section II. We also use these designs in the empirical study. The four designs cover a broad spectrum of practical choices in terms of strength. Here, M is the length of the codeword b and K is the length of y.

1) Repetition Code: One of the simplest ECCs is the repetition (REP) code [15], for which every bit in y is repeated ⌊M/K⌋ or ⌊M/K⌋ + 1 times in b during encoding. The decoding takes a majority vote using the received copies of each bit. Because of the majority vote, the repetition code corrects up to m_REP = ⌊M/2K − 1/2⌋ bit errors in b. We will discuss the connection between REP and the RAKEL algorithm in Section II-A.

2) Hamming on Repetition Code: A slightly more complicated ECC than REP is the Hamming code (HAM) [19], which can correct m_HAM = 1 bit error in b by adding some parity-check bits (XOR operations of some bits in y). One typical choice of HAM is HAM(7, 4), which encodes any y with K = 4 to b with M = 7. Note that m_HAM = 1 is worse than m_REP when M is large. Thus, we consider applying HAM(7, 4) on every 4 (permuted) bits of REP [9]. The resulting code is named Hamming on repetition (HAMR). It is not hard to compute m_HAMR by analyzing the REP and HAM parts separately. When M is a multiple of seven, m_HAMR = 2⌊2M/7K − 1/2⌋, which is generally better than m_REP, especially when M/K is large. Thus, HAMR is slightly stronger than REP for ECC purposes. We include HAMR in our paper to verify whether a simple inclusion of some parity bits for the ECC can readily improve the performance for multilabel classification. A small illustrative sketch of HAM(7, 4) follows below.
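As an illustration of how HAM(7, 4) adds parity-check bits and corrects a single bit error, here is a minimal sketch using one standard systematic generator/parity-check pair; these particular matrices are a common textbook choice and are not necessarily the ones used in our implementation.

```python
import numpy as np

# generator matrix of a systematic Hamming(7,4) code: 4 data bits + 3 parity bits
G = np.array([[1, 0, 0, 0],
              [0, 1, 0, 0],
              [0, 0, 1, 0],
              [0, 0, 0, 1],
              [1, 1, 0, 1],   # p1 = d1 ^ d2 ^ d4
              [1, 0, 1, 1],   # p2 = d1 ^ d3 ^ d4
              [0, 1, 1, 1]])  # p3 = d2 ^ d3 ^ d4

# parity-check matrix consistent with G above (H @ G = 0 mod 2)
H = np.array([[1, 1, 0, 1, 1, 0, 0],
              [1, 0, 1, 1, 0, 1, 0],
              [0, 1, 1, 1, 0, 0, 1]])

def ham74_encode(d):
    return G @ d % 2

def ham74_decode(b):
    s = H @ b % 2                       # syndrome: all-zero means no detectable error
    if s.any():
        # with a single bit error, the syndrome equals the column of H at that position
        err = int(np.argmax((H.T == s).all(axis=1)))
        b = b.copy()
        b[err] ^= 1                     # flip the erroneous bit back
    return b[:4]                        # systematic code: the first 4 bits are the data

d = np.array([1, 0, 1, 1])
b = ham74_encode(d)
b_noisy = b.copy(); b_noisy[5] ^= 1     # inject one bit error
assert (ham74_decode(b_noisy) == d).all()
```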


3) BCH Code: The BCH code [20], [21] can be viewed as a sophisticated extension of HAM and allows correcting multiple bit errors. BCH with length M = 2^p − 1 has (M − K) parity bits, and it can correct m_BCH = (M − K)/p bits of error [15], which is in general stronger than REP and HAMR. The caveat is that the decoder of BCH is more complicated than those of REP and HAMR. We include BCH in our paper because it is one of the most popular ECCs in real-world communication systems. In addition, we compare BCH with HAMR to see if a strong ECC can do better for multilabel classification.

4) Low-Density Parity-Check Code: LDPC [15] has recently been drawing much research attention in communications. LDPC consists of parity bits that are formed from small (hence the name low density) subsets of the original bits, and shares an interesting connection between ECC and Bayesian learning during decoding [15]. Although it is difficult to state the strength of LDPC in terms of a single m_LDPC, LDPC is shown to approach the theoretical limit in some special channels [22], which makes it a state-of-the-art ECC. We choose to include LDPC in our paper to see whether it is worthwhile to go beyond BCH with more sophisticated encoders/decoders.

II. ML-ECC FRAMEWORK

We now describe the proposed ML-ECC framework in detail. The main idea is to use an ECC encoder enc(·): {0, 1}^K → {0, 1}^M to expand the original label set y ∈ {0, 1}^K to a codeword b ∈ {0, 1}^M that contains redundant information. Then, instead of learning a multilabel classifier h(x) between x and y, we learn a multilabel classifier h̃(x) between x and the corresponding b. In other words, we transform the original multilabel classification problem into another (larger) multilabel classification task. During prediction, we use h(x) = dec ∘ h̃(x), where dec(·): {0, 1}^M → {0, 1}^K is the corresponding ECC decoder, to get a multilabel prediction ỹ ∈ {0, 1}^K. The simple steps of the framework are as follows.

1) Parameter: an ECC with encoder enc(·) and decoder dec(·); a base multilabel learner A_b.

2) Training: Given D = {(x_n, y_n)}_{n=1}^{N}, return h̃ = A_b({(x_n, enc(y_n))}_{n=1}^{N}).

3) Prediction: Given any x, return h(x) = dec ∘ h̃(x) by ECC decoding.

This algorithm is simple and general. It can be coupled with any block-coding ECC and any base learner A_b to form a new multilabel classification algorithm. For instance, the ML-BCHRF method [7] uses the BCH code (see Section I-C) as the ECC and BR on random forest as the base learner A_b. Note that [7] did not describe why ML-BCHRF may lead to improvements in multilabel classification.
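The training and prediction steps above can be captured in a few lines. The sketch below is only illustrative: `encode`/`decode` stand for any block ECC on {0, 1}^K ↔ {0, 1}^M, and `base_learner` for any multilabel learner A_b with fit/predict (e.g., BR or a k-powerset learner); these names are placeholders rather than an actual implementation.

```python
import numpy as np

class MLECC:
    """Minimal sketch of the ML-ECC framework (illustrative only)."""

    def __init__(self, encode, decode, base_learner):
        self.encode = encode          # enc: {0,1}^K -> {0,1}^M
        self.decode = decode          # dec: {0,1}^M -> {0,1}^K
        self.base = base_learner      # multilabel learner A_b with fit/predict

    def fit(self, X, Y):
        # transform each label set y_n into a codeword b_n, then train A_b on (x_n, b_n)
        B = np.array([self.encode(y) for y in Y])
        self.base.fit(X, B)
        return self

    def predict(self, X):
        # predict codewords with the trained learner, then ECC-decode them back to labels
        B_tilde = self.base.predict(X)
        return np.array([self.decode(b) for b in B_tilde])
```

Plugging in a repetition-style encode/decode pair together with a 3-powerset base learner recovers the RAKEL-like instantiation discussed in Section II-A.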

Next, we show a simple theorem that connects the ML-ECC framework with ε_{0/1}. As introduced in Section I-C, many ECCs can guarantee to correct up to m bit-flipping errors in a codeword of length M. Then, if ε_{HL} of h̃ is low, the ML-ECC framework guarantees that ε_{0/1} of h is low. The guarantee is formalized as follows.

Theorem 1: Consider an ECC that can correct up to m bit errors in a codeword of length M. Then, for any T test examples {(x_t, y_t)}_{t=1}^{T}, let b_t = enc(y_t). If

    ε_{HL}(h̃) = (1/T) Σ_{t=1}^{T} ε_{HL}(h̃(x_t), b_t) ≤ ε

then h = dec ∘ h̃ satisfies

    ε_{0/1}(h) = (1/T) Σ_{t=1}^{T} ε_{0/1}(h(x_t), y_t) ≤ εM / (m + 1).

Proof: When the average Hamming loss of h̃ is at most ε, the classifier h̃ makes at most εTM bits of error over all the b_t. As the ECC corrects up to m bits of error in one b_t, an adversary has to place at least m + 1 bits of error on b_t to make h(x_t) different from y_t. The number of such b_t can therefore be at most εTM/(m + 1). Thus, ε_{0/1}(h) is at most εTM/(T(m + 1)) = εM/(m + 1).

From Theorem 1, it appears that we should simply use some stronger ECC, for which m is larger. Nevertheless, we are applying the ECC in a learning scenario. Thus, ε is not a fixed value, but depends on whether A_b can learn h̃ well from D. Stronger ECC usually contains redundant bits that come from complicated compositions of the original bits in y, and the compositions may not be easy to learn. The tradeoff is revealed when applying the ECC to multiclass classification [6]. We will study the ECC with different strengths and empirically verify the tradeoff in Section IV.
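As a concrete illustration of the bound, take M = 127 with a code that corrects up to m = 31 bit errors (matching the BCH setting on the scene data set analyzed in Section IV-C), and suppose, hypothetically, that the base learners reach an average codeword Hamming loss of ε = 0.05; the value of ε is ours and chosen only for arithmetic. Theorem 1 then guarantees

```latex
\epsilon_{0/1}(h) \;\le\; \frac{\epsilon M}{m+1} \;=\; \frac{0.05 \times 127}{31 + 1} \;\approx\; 0.198
```

that is, at most about one in five test examples can have an imperfectly recovered label set under this worst-case accounting.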

A. ECC View of RAKEL

An immediate use of the ML-ECC framework is to explain the RAKEL algorithm proposed by [8]. Define a k-label set as a size-k subset of L. Each iteration of RAKEL randomly selects a (different) k-label set and builds a multilabel classifier on the k labels with LP. After running for R iterations, RAKEL obtains a size-R ensemble of LP classifiers. The prediction on each label is done by a majority vote from the classifiers associated with the label.

Equivalently, we can draw (with replacement) M = Rk labels first before building the LP classifiers. Then, selecting k-label sets is equivalent to encoding y by a variant of REP, which we call the RAKEL repetition (RREP) code. Similar to REP, each bit y[i] is repeated several times in b, as label i is involved in several k-label sets. After encoding y to b, each LP classifier, called a k-powerset classifier, acts as a subchannel that transmits a size-k sub-block of the codeword b. The prediction procedure follows the decoder of the usual REP. The ECC view above decomposes the original RAKEL into two parts: the ECC and the base learner A_b. We will empirically study how the two parts affect the performance of multilabel classification in Section IV.
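The RREP view can be sketched as follows; the sampling details (here, k distinct labels per set, with the sets drawn independently) are a simplification of RAKEL's selection of different k-label sets, and all names are ours.

```python
import numpy as np

rng = np.random.default_rng(0)

def rrep_positions(K, R, k):
    # R random k-label sets; flattening them fixes which original label
    # each of the M = R*k codeword bits repeats
    return np.concatenate([rng.choice(K, size=k, replace=False) for _ in range(R)])

def rrep_encode(y, positions):
    # every codeword bit is a copy of one original label bit
    return y[positions]

def rrep_decode(b_tilde, positions, K):
    # majority vote over the received copies of each label
    ones = np.zeros(K)
    copies = np.zeros(K)
    for bit, j in zip(b_tilde, positions):
        ones[j] += bit
        copies[j] += 1
    return (ones * 2 > copies).astype(int)   # strict majority; ties predict 0

positions = rrep_positions(K=6, R=10, k=3)   # M = 30 codeword bits
y = np.array([1, 0, 0, 1, 0, 1])
b = rrep_encode(y, positions)
print(rrep_decode(b, positions, K=6))        # recovers y when every label got >= 1 copy
```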


TABLE I. DATA SET CHARACTERISTICS

B. Comparison with ECC with Real-Valued Codewords

As mentioned in Section I-B, [16] and [17] proposed using a real-valued ECC for multilabel classification. There are two major differences between the real-valued ECC and the binary ML-ECC framework. In terms of codewords, the real-valued codewords of [16] and [17] were linear combinations of labels. In contrast, the binary codewords introduced in Section I-C are parity checks (XOR) of labels, which are linear combinations of labels under the Galois field GF(2). In terms of base learners, the transformed learning tasks of real-valued codewords are regression instead of classification. In [16] and [17], the regression task of each codeword bit was learned separately, which is like applying the BR base learner. In contrast, the transformed learning task of binary codewords can be learned with other base learners such as the k-powerset.

III. GEOMETRIC DECODER FOR LINEAR ECC

The real-valued codewords motivate us to design a new decoder for binary ECC in the ML-ECC framework. The goal of the new decoder is to use the channel measurement information—the real-valued confidence for a bit to be 1. Such information is available to the decoder when using the BR base learner (channel) with probabilistic outputs. The real-valued confidence may help to improve the performance of the framework by focusing more on the highly confident bits.

The off-the-shelf ECC decoders, especially syndrome decoding algorithms, usually do not exploit the channel measurement information, but take advantage of the algebraic structure of the ECC to locate possible bit errors. From the hypercube view, they decode a vertex of {0, 1}^M (a binary prediction on codewords) to a vertex of {0, 1}^K (a binary prediction on labels). The proposed decoder, on the contrary, uses the information and considers the geometry of the hypercube to perform interior-to-interior decoding from [0, 1]^M to [0, 1]^K. That is, the proposed decoder is a soft-input soft-output decoder. The soft input bits carry the channel measurement information, and the value of each bit represents the confidence in the bit being 1. The soft prediction of a label is the confidence in whether the label is present. For evaluation, the soft predictions are then rounded to {0, 1}^K as in [11]. We call the proposed decoder the geometric decoder, and the off-the-shelf decoders algebraic decoders.

Here, we focus on linear codes, whose encoding function can be written as a matrix–vector multiplication under the Galois field GF(2). The repetition code, the HAMR code, the BCH code, and the LDPC code are all linear codes. Let G be the generating matrix of a linear code, with g_ij ∈ {0, 1}. The encoding is done by b = enc(y) = G · y (mod 2), or equivalently we may write the formula in terms of XOR operations as b_i = ⊕_{j: g_ij = 1} y_j. That is, the codeword bit b_i is the result of the XOR of some label bits y_j. The XOR operations are equivalent to multiplications if we map 1 → −1 and 0 → 1. By defining b̂_i = 1 − 2b_i and ŷ_j = 1 − 2y_j, the encoding can also be written as b̂_i = ∏_{j: g_ij = 1} ŷ_j. We denote this form as multiplication encoding.

It is difficult to generalize the XOR operation from binary to real values, but multiplication by itself can be defined


on real values. We take this advantage and use it to form our geometric decoder. Our geometric decoder finds the ỹ that minimizes the L2 distance between b̃ and the multiplication encoding result of ỹ as follows:

    dec_geo(b̃) = argmin_{ỹ ∈ [0,1]^K} Σ_{i=1}^{M} ( (1 − 2b̃_i) − ∏_{j: g_ij = 1} (1 − 2ỹ_j) )².

Here, the squared L2 distance between codewords is an approximation of the Hamming distance in the binary space {0, 1}^M. For the repetition code, as only one y_j is considered for each b_i, the optimal solution of the problem is the same as averaging the predictions on the same label, for each label. For general linear codes, however, it is difficult to find the global optimum because the optimization problem may not be convex. Instead, we may apply a variant of the coordinate descent optimization algorithm to find a local minimum. That is, in each step, we optimize only one ỹ_j while fixing all the other coordinates. To optimize one ỹ_j, we only have to solve a second-order single-variable optimization problem, which enjoys an efficient analytic solution.

The benefit of using the soft-output geometric decoder is that the multiplication-approximated XOR preserves some geometric information. That is, close points in [0, 1]^K remain close after multiplication encoding. In addition, approximating XOR by multiplication allows us to consider soft input, i.e., confidence information, during decoding.
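A minimal sketch of the coordinate-descent decoder is given below for illustration only; the initialization, the number of sweeps, and the final rounding are arbitrary choices of the sketch rather than details prescribed by the derivation above.

```python
import numpy as np

def geometric_decode(b_soft, G, n_sweeps=50, seed=0):
    """Soft-input geometric decoder for a linear code with M x K generating
    matrix G over GF(2).  b_soft in [0,1]^M is the confidence that each
    codeword bit equals 1; returns hard and soft label predictions."""
    M, K = G.shape
    rng = np.random.default_rng(seed)
    b_hat = 1.0 - 2.0 * np.asarray(b_soft, dtype=float)    # map [0,1] -> [1,-1]
    t = rng.uniform(-0.1, 0.1, size=K)                     # t[j] plays the role of 1 - 2*y_tilde[j]
    rows = [np.nonzero(G[:, j])[0] for j in range(K)]      # codeword bits that use label j
    for _ in range(n_sweeps):
        for j in range(K):
            # for each codeword bit i involving label j, the product of the other
            # factors is a constant c_i once the remaining coordinates are fixed
            c = np.array([np.prod([t[l] for l in np.nonzero(G[i])[0] if l != j])
                          for i in rows[j]])
            denom = float(np.dot(c, c))
            if denom > 0:
                # minimize sum_i (b_hat[i] - c_i * t_j)^2: a single-variable quadratic
                t[j] = float(np.clip(np.dot(b_hat[rows[j]], c) / denom, -1.0, 1.0))
    y_soft = (1.0 - t) / 2.0                                # back to [0,1]^K
    return (y_soft >= 0.5).astype(int), y_soft
```

For the repetition code, every row of G contains a single 1, so each c_i equals 1 and the update reduces to averaging the received copies of a label, matching the observation above.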


IV. EXPERIMENTS

We examine the performance of the ML-ECC framework on seven real-world data sets [23]. The names and statistics of these data sets are shown in Table I. Each data set originally comes with its own training and test sets [23]. To estimate the stability of each approach while maintaining the characteristics of the training/test sets, we randomly resplit the data set for 30 different runs while keeping the training/test ratio unchanged, except for the largest tmc2007 data set. For tmc2007, we randomly subsample 5% for training and another 5% for testing in each run. All the results are then reported with the mean and standard error on the test sets over the 30 runs.

We consider four ECCs, from the simplest to very sophisticated ones: RREP (see Section II-A), HAMR, BCH, and LDPC (see Section I-C). For RREP, k is set to 3. When coupled with the 3-powerset learner, ML-ECC with the RREP code is then equivalent to the RAKEL algorithm. We take the BCH algorithm in [24], which uses the Berlekamp–Massey algorithm for decoding; we take the LDPC algorithm in [25] and set the decoding method to probability propagation. Then, for each ECC, we first consider a 3-powerset with either random forest, nonlinear support vector machine (SVM), or logistic regression as the multiclass classifier inside the 3-powerset. We randomly permute the bits of b and apply an inverse permutation on b̃ for the ECCs other than RREP to ensure that each 3-powerset works on diverse sub-blocks. The results are shown in Sections IV-A to IV-C. In addition to the 3-powerset base learners, we also consider BR base learners in Section IV-D.

We take the default random forest from Weka [26] with 60 trees. For the nonlinear SVM, we use LIBSVM [27] with the Gaussian kernel and choose (C, γ) by cross validation on the training data from {2^{-5}, 2^{-3}, . . . , 2^{7}} × {2^{-7}, 2^{-5}, . . . , 2^{3}}. In addition, we use LIBLINEAR [28] for the logistic regression and choose the parameter C by cross validation from {2^{-5}, 2^{-3}, . . . , 2^{7}}. Furthermore, we compare the performance of the proposed geometric decoder and the off-the-shelf decoders in Section IV-E, and compare the soft-decoded ML-ECC framework with the real-valued ECC approaches in Section IV-F.

In addition to the 0/1 loss and Hamming loss mentioned in Section I-A, we also list the minimum of precision and recall (MPR) loss (1 minus MPR), the microaveraged F1 score, and the macroaveraged F1 score for reference. The MPR loss is defined as ε_MPR(ỹ, y) = 1 − y · ỹ / max(‖y‖₁, ‖ỹ‖₁). The macro-F1 score takes the average of the F1 score for each label. The micro-F1 score calculates the F1 score on the predictions of all labels as a whole.

The experiments conducted in this paper are generally broader than existing works related to multilabel classification with the ECC in terms of the data sets, the codes, the channels, and the base learners, as shown in Table II. The goal of the experiments is not only to justify that the framework is promising, but also to rigorously identify the best codes, channels, and base learners for solving general multilabel classification tasks via the ECC.

A. Validity of ML-ECC Framework

Initially, we demonstrate the validity of the ML-ECC framework by comparing it with the simple BR approach and the RAKEL algorithm. We fix the codeword length M to about 20 times the number of labels K. The numbers are in the form 2^p − 1 for integer p because the BCH code only works on such lengths. More experiments on different codeword lengths are presented in Section IV-B. Following the description in Section II-A, we implement the RAKEL algorithm using the ML-ECC framework with RREP and k-powerset learners, and fix k = 3. To be fair in comparison, we also use 3-powerset learners for the ML-ECC methods using either the HAMR, BCH, or LDPC code. The simple BR approach is considered as the baseline. For clarity of the figures, instead of showing absolute values of

evaluation measures, we show the amount of improvement made by each method over the BR approach (the higher the better). Firstly, we look at the results using random forests as the multiclass learner in k-powerset. As shown in Fig. 1(a), all of RAKEL and ML-ECC methods have positive improvement on 0/1 over BR, which means that they perform better than BR in terms of 0/1 . HAMR improves more on 0/1 than RAKEL (RREP) on five out of the seven data sets (scene, emotions, yeast, tmc2007, and medical) and achieves similar 0/1 with it on the other two. This verifies that using some parity bits instead of repetition improves the strength of ECC, which in turn improves the 0/1 loss. Along the same direction, BCH performs even better than both HAMR and RAKEL, especially on medical data set. The superior performance of BCH justifies that the ECC is useful for multilabel classification. On the other hand, another sophisticated code, LDPC, performs worse (improves less) than BCH on every data set, and is even worse than RAKEL (RREP) on the emotions and yeast data sets, which suggest that LDPC may not be a good choice for the ML-ECC framework. Next, we look at  H L shown in Fig. 1(b). The Hamming loss of HAMR is comparable with that of RAKEL (RREP), where each wins on two data sets, and both of them are better than the BR baseline. BCH beats both HAMR and RAKEL on the tmc2007, genbase, and medical data sets but loses on the other four data sets. It performs even worse than BR on two data sets. LDPC performs the worst in terms of Hamming loss among the codes on all data sets, even worse than the BR baseline. Thus, simpler codes such as RREP and HAMR perform better in terms of  H L . A stronger code such as BCH may guard 0/1 better, but it can pay more in terms of  H L . We also report the MPR loss MPR in Fig. 1(c). The result is similar to that on 0/1 . BCH code is the best, followed by HAMR. RAKEL (RREP) is better than the BR baseline on six data sets and worse than both HAMR and BCH on five data sets. Similar results show up when using the Gaussian SVM or logistic regression as the base learner, as shown in Table III, which lists the average ranking of each code given the base learner and the measure. The average ranking is calculated by ranking each code on each data set and averaging the ranks. The boldface entries are the best (highest rank) one for the given base learner. The results validate that the performance of multilabel classification can be improved by applying the ECC. More specifically, we may improve the RAKEL algorithm by learning some parity bits instead of repetitions. With this experiment, we suggest using HAMR when hoping to improve 0/1 , MPR , and F1 scores while maintaining promising  H L (to RREP). Even better 0/1 and macro-F1 can be achieved using the BCH code with some tradeoffs to  H L and micro-F1 . B. Comparison of Codeword Length Now, we compare the length of codewords M. With larger M, the codes can correct more errors but the base learners have to take longer time to train. Through experimenting different


TABLE II. FOCUS OF EXISTING WORKS THAT ARE RELATED TO THE ML-ECC FRAMEWORK

Fig. 1. Performance of ML-ECC using the 3-powerset with random forests. (a) 0/1 loss. (b) Hamming loss. (c) MPR loss.

TABLE III. AVERAGE RANKINGS OF BR, RAKEL (k = 3), AND ML-ECC USING 3-POWERSET BASE LEARNERS

M, we may find a better tradeoff between performance and efficiency. The performance of the ML-ECC framework with different codeword lengths on the scene data set is shown in Fig. 2. Here, the base learner is again the 3-powerset with random forests. The codeword length M varies from 31 to 127, which is about 5–20 times of the number of labels K . We do not include shorter codewords because their performance is not stable. BCH only allows M = 2 p − 1 and thus we conduct experiments of BCH on those codeword lengths. We first look at the 0/1 loss in Fig. 2(a). The horizontal axis is the codeword length M and the vertical axis is the 0/1 loss on the test set. We see that 0/1 of RREP stays around 0.335 no matter how long the codewords are. This implies

that the power of repetition bits reaches its limit very soon. For example, when all the 3-powerset combinations of labels are learned, additional repetitions give very limited improvements. Therefore, methods using repetition bits only, such as RAKEL, cannot take advantage from the extra bits in the codewords. The 0/1 of HAMR and BCH are slightly decreasing with M, but the differences between M = 63 and M = 127 are generally small (smaller than the differences between M = 31 and M = 63, in particular). This shows that learning some parity bits provides additional information for prediction, which cannot be learned easily from repetition bits, and such information remains beneficial for longer codewords, comparing with repetition bits. One reason is that the number of 3-powerset combinations of parity bits is exponentially more than that of combinations of labels. The performance of LDPC is not as stable as the other codes, possibly because of its sophisticated decoding step. Somehow, we still see that its 0/1 decreases slightly with M. HL versus M for each ECC is shown in Fig. 2(b). The  H L of RREP is the lowest among the codes when M is small, but it remains almost constant when M ≥ 63, while HL of HAMR and BCH are still decreasing. This matches our finding that extra repetition bits give limited information. When M = 127, BCH is comparable with RREP in terms of  H L . HAMR is even better than RREP at that codeword length, and becomes the best code regarding  H L . Thus, although a stronger code such as BCH may guard 0/1 better, it can pay more in terms of  H L . As stated in Section I-A and II, the base learners serve as the channels in the ML-ECC framework and the performance of base learners may be affected by the codes. Therefore, using a strong ECC does not always improve multilabel classification

Fig. 2. Performance on scene versus codeword length (ML-ECC using the 3-powerset with random forests). (a) 0/1 loss. (b) Hamming loss. (c) Bit error rate.

performance. Next, we verify the tradeoff by measuring the bit error rate (BER) of h̃, which is defined as the Hamming loss between the predicted codeword h̃(x) and the actual codeword b. A higher bit error rate implies that the transformed task is harder. The BER versus M for each ECC is shown in Fig. 2(c). RREP has an almost constant bit error rate. HAMR also has a nearly constant bit error rate, but at a higher value. The bit error rate of BCH is similar to that of HAMR when the codeword is short, but the bit error rate increases with M. One explanation is that some of the parity bits are harder to learn than repetition bits. The ratio between repetition bits and parity bits of both the RREP and HAMR codes is constant in M (RREP has no parity bits and HAMR has three parity bits for every four repetition bits), whereas BCH has more parity bits with larger M. The different bit error rates justify the tradeoff between the strength of the ECC and the hardness of the base learning tasks. With more parity bits, we can correct more bit errors, but may have harder tasks to learn; when using fewer parity bits or even no parity bits, we cannot correct many errors, but we enjoy simpler learning tasks.

Similar results show up on other data sets with all three base learners. The performance on the yeast data set with the 3-powerset and random forests is shown in Fig. 3. Because the number of labels in the yeast data set is about twice that in the scene data set, the codeword length here ranges from 63 to 255, which is also about twice as long as that in the experiments on the scene data set. Again, we see that the benefits of parity bits remain valid for longer codewords than those of repetition bits and that more parity bits make the transformed task harder to learn. This result points out the tradeoff between the strength of the ECC and the hardness of the base learning tasks.

C. Bit Error Analysis

To further analyze the difference between different ECC designs, we zoom in to M = 127 of Fig. 2. The instances are divided into groups according to the number of bit errors on that instance. The relative frequency of each group, i.e., the ratio of the group size to the total number of instances, is plotted in Fig. 4(a). The average ε_{0/1} and ε_{HL} of each group are also plotted in Fig. 4(b) and (c). The curve of each

ECC forms two peak regions in Fig. 4(a). In addition to the peak at 0, which means no bit error happens on the instances, the other peak varies from one code to another. The positions of the peaks suggest the hardness of the transformed learning task, similar to our findings in Fig. 2(c). We can clearly see the difference on the strength of different ECCs from Fig. 4(b). BCH can tolerate up to 31 bit errors, but its 0/1 sharply increases over 0.8 for 32 bit errors. HAMR can correct 13 bit errors perfectly, and its 0/1 increases slowly when more errors occur. Both RREP and LDPC can perfectly correct only 9 bit errors, but LDPC is able to sustain a low 0/1 even when there are 32 bit errors. It would be interesting to study the reason behind this long tail from a Bayesian network perspective. We can also look at the relation between the number of bit errors and  H L , as shown in Fig. 4(c). The BCH curve grows sharply when the number of bit errors is larger than 31, which links to the inferior performance of BCH over RREP in terms of  H L . The LDPC curve grows much slower, but its rightsided peak in Fig. 4(a) still leads to higher overall  H L . On the other hand, RREP and HAMR enjoy a better balance between the peak position in Fig. 4(a) and the growth in Fig. 4(c) and thus lower overall  H L . The transformed learning task of more sophisticated ECC is harder, as shown in Fig. 4(a). The reason is that sophisticated ECC contains many parity bits, which are the XOR of labels, and the parity bits are harder to learn by the base learners. We demonstrate this in Fig. 5 using the scene data set while fixing M = 127. The codeword bits are divided into groups according to the number of labels XOR’ed to form the bit. The relative frequency of each group is plotted in Fig. 5(a). We can see that all codeword bits of RREP are formed by one label, and the bits of HAMR are formed by one or three labels. For BCH and LDPC, the number of labels XOR’ed in the bits may be none (0) to all (six) labels, whereas most of the bits are the XOR of half of the labels (three labels). Next, we show how well the base learners learned on each group in Fig. 5(b). Here, the base learner is 3-powerset with random forests. This figure suggests that the parity bits (XOR’ing two or more labels) result in harder learning tasks and higher bit error rates than original labels (XOR’ing one label). One exception is the bits XOR’ed from all (six) labels,

Fig. 3. Performance on yeast versus codeword length (ML-ECC using the 3-powerset with random forests). (a) 0/1 loss. (b) Hamming loss. (c) Bit error rate.

Fig. 4. Bit errors and losses on the scene data set with M = 127. (a) Relative frequency. (b) 0/1 loss. (c) Hamming loss versus number of bit errors.

Fig. 5. Parity bits on the scene data set, six labels, 127-bit codeword. (a) Relative frequency. (b) Bit error rate versus number of labels XOR'ed.

which is easier to learn than original labels. The reason is that the bit XOR’ed from all labels is equivalent to the indicator of odd number of labels, and a constant predictor works well for this because in the scene data set about 92% of all instances are of one or three labels. As BCH and LDPC have many bits XOR’ed from two to four labels, their bit error rates are higher than RREP and HAMR, as shown in Fig. 2(c). These findings also appear on other data sets and other base learners, such as medical data set (45 labels, M = 1023) shown in Fig. 6. BCH and LDPC have many bits XOR’ed from about half of the labels, and the transformed learning

tasks of such bits are harder to learn than those of the original labels.

D. Comparison Using BR Learners

In addition to the 3-powerset base learners, we also consider BR base learners, which simply build a classifier for each bit in the codeword. If we couple the ML-ECC framework with RREP and BR, the resulting algorithm is almost the same as the original BR. For example, using RREP and BR with SVM is equivalent to using BR with bootstrap-aggregated SVM. In addition to the ECCs used in the ML-ECC framework, we also include the calibrated label ranking (CLR) algorithm [30]

Fig. 6. Parity bits on the medical data set, 45 labels, 1023-bit codeword. (a) Relative frequency. (b) Bit error rate versus number of labels XOR'ed.

in our comparison. CLR augments the BR approach by pairwise preference learning on the labels. It encodes the K -bit label vector to K (K + 1)/2 bits, but the encoding focus on label preferences instead of error correction. We first compare the performance using the BR base learner with random forests. The ML-BCHRF method proposed in [7] is the BCH entry in the figures. Again, we take the simple BR approach as the baseline and show the improvement over the baseline in the figures. The result on 0/1 loss is shown in Fig. 7(a). From the figure, we can see that RREP has a little improvement over BR baseline, which is the benefit of aggregation. Both HAMR and BCH achieve great improvement and beat RREP and BR baseline, with BCH being a better choice. This shows that the improvement on the strength of ECC does improve the 0/1 loss. LDPC also performs better than BR baseline on four data sets, but not as good as HAMR and BCH overall. Thus, over-sophisticated ECC such as LDPC may not be necessary for multilabel classification. CLR only improves little over BR baseline, which means that the strength of error correction is the key that helps the 0/1 loss. In Fig. 7(b), we present the results on  H L . In contrast to the case when using the 3-powerset base learner, here both HAMR and BCH can achieve better improvement on  H L than RREP in most of the data sets. Both HAMR and BCH are better than BR baseline. Overall, HAMR wins on three data sets, whereas BCH wins on four. Thus, coupling stronger ECC with the BR base learner can improve both 0/1 and  H L . LDPC, however, performs worse than BR baseline in terms of  H L , which again shows that LDPC may not be suitable for multilabel classification. The performance of CLR has no significant difference from BR baseline, which again shows that the strength of error correction is important to the performance. The result on MPR loss is shown in Fig. 7(c), which is very similar to that on 0/1 . BCH is the best on six data sets and HAMR is the best on two. LDPC also performs better than BR baseline on six data sets, whereas RREP is better on four only. CLR is as good as BCH and HAMR on the enron data set, but is similar to the BR baseline on other four data sets.

TABLE IV. AVERAGE RANKING OF BR, CLR, AND ML-ECC USING BR BASE LEARNERS

Experiments with other base learners also support similar findings, as shown in Table IV. HAMR performs much better than BCH on ε_{HL}, and both are better than BR and RREP (aggregated BR) on most of the measures. Thus, extending BR by learning some more parity bits and decoding them suitably by the ECC is superior to the original BR.

E. Validity of the Geometric Decoder

The previous experiment shows that using either the HAMR or the BCH code improves the performance of BR learners. Now, we examine whether the proposed geometric decoder can make some further improvements. We empirically compare the off-the-shelf algebraic decoder with hard inputs, the proposed geometric decoder with hard inputs, and the proposed geometric decoder with soft inputs. Note that hard inputs are the direct binary bits in the codeword, and soft inputs contain the confidence that the bits are one. We include the hard-input geometric decoder in the comparison to see whether the geometric decoder is competitive with the algebraic ones when the base learner does not provide soft inputs. The algebraic

Fig. 7. Performance of ML-ECC using BR with random forests. (a) 0/1 loss. (b) Hamming loss. (c) MPR loss.

decoders are discussed in Section I-C for RREP and HAMR, and taken from [24] for BCH and [25] for LDPC. All the BR base learners that we consider (random forests from WEKA, the Gaussian SVM from LIBSVM, and logistic regression from LIBLINEAR) support calculating the confidence for binary classification. The confidence is naturally taken as the soft input. The experiment setting of these learners is the same as in the previous experiments. In addition to the discriminative models, we also consider the Gaussian mixture model (GMM) from MATLAB, which is a generative model that produces probabilistic predictions that can be taken as the soft input. The number of components in the GMM is chosen from {1, 2, 4, 8, 16} according to the Bayesian information criterion; the covariance matrix of each component is restricted to be diagonal; a small 10^{-7} is added to the diagonal of the covariance matrices to make them positive definite.

In the following figures, we take the algebraic decoder as the baseline and report the amount of improvement over the baseline. The results on the BCH code using the Gaussian SVM base learner are shown in Fig. 8. We can see from Fig. 8(a) that, in terms of ε_{0/1}, both geometric decoders are significantly better than the algebraic one, especially on the emotions and yeast data sets, and the two geometric decoders perform similarly except that the soft-input one is significantly better on the medical data set. This demonstrates the usefulness of the geometric decoders. Things are quite different on ε_{HL}. From Fig. 8(b), we can see that the soft-input geometric decoder is much better than the hard-input one in terms of ε_{HL}, but the algebraic decoder is usually the best here. The reason may be that the geometric decoder minimizes the distance between the approximated enc(ỹ) and b̃ in the codeword space. The BCH code, however, does not preserve the Hamming distance during encoding and decoding between {0, 1}^K and {0, 1}^M; hence the geometric decoder, which minimizes the distance in [0, 1]^M (and approximately in {0, 1}^M), may not be suitable for the Hamming loss (Hamming distance in {0, 1}^K). However, when taking confidence information as soft input, the geometric decoder can perform better on the Hamming loss and sometimes become comparable with the algebraic decoder. Regarding the MPR loss shown in Fig. 8(c), the soft-input geometric decoder is the best on six data sets, whereas the hard-input geometric decoder is within the error bar of the best

on four. This again shows that the geometric decoder is able to improve the result comparing with the algebraic decoder. The result shows that the proposed geometric decoder on BCH code is better than the ordinary algebraic decoder for 0/1 and MPR , but not for  H L . In addition, the soft input is useful for the geometric decoder in terms of both  H L and MPR . Next, we look at the HAMR code. On the 0/1 loss shown in Fig. 9(a), the soft-input geometric decoder is better than the hard-input one and also the algebraic one on three data sets, but worse than both of them on two data sets. The hardinput geometric decoder is better than the algebraic one on three data sets, and performs similar on the other four. Among the hard-input and soft-input geometric decoders, there is no clear winner. In terms of Hamming loss in Fig. 9(b), the softinput geometric decoder performs the best. Its Hamming loss is significantly lower than that of the hard-input geometric decoder and the algebraic decoder on emotions, scene, and tmc2007 data sets, and similar to them on other data sets. Comparing the hard-input geometric and algebraic decoders, there is no significant difference except that the hard-input geometric decoder is significantly worse on the emotions data set. The result on MPR loss, as shown in Fig. 9(c), is like that on 0/1 very much. The soft-input geometric decoder wins on three data sets, and the hard-input one wins on two. The result again shows that the proposed geometric decoder on HAMR code is better than the ordinary algebraic decoder for 0/1 , but not for  H L . However, with soft input, the geometric decoder can give lower  H L . Similar results show up when using other base learners, as shown in Table V. From this experiment, we see that using the hard-input geometric decoder instead of the off-the-shelf algebraic one leads to improvement on 0/1 and MPR , but pays for  H L . If using the soft-input geometric decoder, the harm of  H L is mitigated and the improvement on other measures is strengthened. Therefore, the soft-input geometric decoder is a better choice for the ML-ECC framework. F. Comparison with Real-Valued ECC Our ML-ECC framework only considers binary ECC. Here, we compare our ML-ECC framework with real-valued ECC methods: coding with canonical correlation analysis


Fig. 8. Performance of ML-ECC with hard-/soft-input geometric decoders and algebraic decoder on the BCH code using BR and SVM. (a) 0/1 loss. (b) Hamming loss. (c) MPR loss.

Fig. 9. Performance of ML-ECC with hard-/soft-input geometric decoders and algebraic decoder on the HAMR code using BR and SVM. (a) 0/1 loss. (b) Hamming loss. (c) MPR loss.

(CCA-OC) [16] and max-margin output coding (MaxMargin) [17], as discussed in Section II-B. The experiment setting is basically the same, but because of page limits, we only highlight the results on the scene and emotions data sets, with random forests or logistic regression base learners. Both real-valued ECC methods limit their codeword length to at most twice of the number of labels, and the codeword contains K binary bits for original labels and at most K real-valued bits. In the following experiment, we take all 2K binary and real-valued bits for the real-valued ECC methods. There is a parameter λ in decoding for balancing the two parts and we set it to 1 (equally weighted). For our MLECC framework, we consider HAMR and BCH codes with the proposed soft-input geometric decoder and use 127-bit binary codewords. In addition to the 0/1 loss and the Hamming loss, we also report micro- and macro-F1 score following [16] and [17]. The results on random forests learners are shown in Table VI, and the results on logistic regression learners are shown in Table VII. With a stronger base learner such as random forests, the HAMR and BCH codes are better than both real-valued ECC methods on the two data sets and on most of the measures. With the logistic regression learner, while BCH code performs the best on scene data set, it only wins on 0/1 loss on emotions data set. The real-valued ECC methods give higher micro- and macro-F1 score than HAMR and BCH on the emotions data set. The reason may be that

TABLE V. AVERAGE RANKING OF GEOMETRIC AND ALGEBRAIC DECODERS USING ML-ECC FRAMEWORK WITH BR LEARNERS

the power of logistic regression base learner is limited, and the parity bits of HAMR and BCH are too difficult for the base


TABLE VI. COMPARISON BETWEEN ML-ECC AND REAL-VALUED ECC METHODS USING RANDOM FORESTS BASE LEARNERS

TABLE VII. COMPARISON BETWEEN ML-ECC AND REAL-VALUED ECC METHODS USING LOGISTIC REGRESSION BASE LEARNERS

learner. On the other hand, the codes generated by the CCA-OC and MaxMargin methods are easier to learn for such a linear model. The results demonstrate that the proposed binary-ECC-based framework coupled with a sufficiently sophisticated base learner is a better choice for ML-ECC.

V. CONCLUSION

We presented a framework for applying ECCs on multilabel classification. We then studied the use of four classic ECC designs, namely RREP, HAMR, BCH, and LDPC. We showed that RREP can be used to give a new perspective of the RAKEL algorithm as a special instance of the framework with the k-powerset as the base learner. We conducted experiments with the four ECC designs on various real-world data sets. The experiments further clarified the tradeoff between the strength of the ECC and the hardness of the base learning tasks. Experimental results demonstrated that several ECC designs can lead to a better use of the tradeoff. For instance, HAMR was superior to RREP for the k-powerset base learners and led to a better algorithm than the original RAKEL in terms of the 0/1 loss while maintaining a comparable level of Hamming loss. BCH was another superior design, which could significantly improve RAKEL in terms of the 0/1 loss. When compared with the traditional BR algorithm and the CLR algorithm, we showed that using a stronger ECC such as HAMR or BCH can lead to better performance in terms of both the 0/1 and the Hamming loss.

In addition to the framework, we also presented a novel geometric decoder for general linear codes based on approximating the XOR operation by multiplication. This decoder was capable of not only considering hard input as algebraic


decoders do, but also considering soft input from the channel. The soft input may be gathered from the base learner channels as their confidence of an instance being in one class. The experimental results on this new decoder demonstrated that it can outshine the ordinary decoder in terms of the 0/1 loss and reach a comparable Hamming loss when exploiting the soft input from the BR learner. It remains an interesting research problem to appropriately gather soft input from the k-powerset learner.

Overall, the promising experimental results suggested ML-ECC with the HAMR and BCH codes to be useful first-hand choices for multilabel classification, whereas the RREP and LDPC codes were not worth considering. The HAMR code can reach better 0/1 and MPR losses while keeping the Hamming loss promising, whereas the BCH code led to the best 0/1 and MPR losses. When using a sufficiently sophisticated BR learner that supports confidence predictions, the geometric decoder is recommended; otherwise, an algebraic decoder can readily reach promising performance. The results justified the validity and usefulness of the framework when coupled with some classic ECC such as HAMR and BCH. Through learning some parity bits, the performance on many measures, including the 0/1 loss and Hamming loss, can be improved. A future direction is to consider adaptive ECC such as the ones studied for multiclass classification [5], [6].

ACKNOWLEDGMENT

The authors would like to thank T. Dietterich, S.-D. Lin, Y.-J. Lee, and the anonymous reviewers for their valuable suggestions.

REFERENCES

[1] K. Trohidis, G. Tsoumakas, G. Kalliris, and I. Vlahavas, "Multilabel classification of music into emotions," in Proc. 9th Int. Conf. Music Inf. Retr., 2008.
[2] M. R. Boutell, J. Luo, X. Shen, and C. M. Brown, "Learning multi-label scene classification," Pattern Recognit., vol. 37, no. 9, pp. 1757–1771, Sep. 2004.
[3] C. E. Shannon, "A mathematical theory of communication," Bell Syst. Tech. J., vol. 27, pp. 623–656, Oct. 1948.
[4] T. G. Dietterich and G. Bakiri, "Solving multiclass learning problems via error-correcting output codes," J. Artif. Intell. Res., vol. 2, pp. 263–286, Jan. 1995.
[5] R. E. Schapire, "Using output codes to boost multiclass learning problems," in Proc. 14th Int. Conf. Mach. Learn., 1997.
[6] L. Li, "Multiclass boosting with repartitioning," in Proc. 23rd Int. Conf. Mach. Learn., 2006.
[7] A. Z. Kouzani and G. Nasireding, "Multilabel classification by BCH code and random forests," Int. J. Recent Trends Eng., vol. 2, no. 1, pp. 113–116, 2009.
[8] G. Tsoumakas and I. Vlahavas, "Random k-labelsets: An ensemble method for multilabel classification," in Proc. 18th Eur. Conf. Mach. Learn., 2007.
[9] C.-S. Ferng and H.-T. Lin, "Multi-label classification with error-correcting codes," in Proc. 3rd Asian Conf. Mach. Learn., 2011.
[10] C.-S. Ferng, "Multi-label classification with hard-/soft-decoded error-correcting codes," M.S. thesis, Dept. Comput. Sci. Inf. Eng., Nat. Taiwan Univ., Taipei, Taiwan, 2012.
[11] F. Tai and H.-T. Lin, "Multi-label classification with principal label space transformation," Neural Comput., vol. 24, no. 9, pp. 2508–2542, Sep. 2012.
[12] K. Dembczyński, W. Waegeman, W. Cheng, and E. Hüllermeier, "On label dependence in multi-label classification," in Proc. 2nd Int. Workshop Learn. Multi-Label Data, 2010.


[13] D. Hsu, S. Kakade, J. Langford, and T. Zhang, "Multi-label prediction via compressed sensing," in Proc. 24th Neural Inform. Process. Syst. Conf., 2009.
[14] Y.-N. Chen and H.-T. Lin, "Feature-aware label space dimension reduction for multi-label classification," in Proc. 27th Neural Inform. Process. Syst. Conf., 2012.
[15] D. J. C. Mackay, Information Theory, Inference and Learning Algorithms, 1st ed. Cambridge, U.K.: Cambridge Univ. Press, 2003.
[16] Y. Zhang and J. Schneider, "Multi-label output codes using canonical correlation analysis," in Proc. 14th Int. Conf. Artif. Intell. Stat., 2011.
[17] Y. Zhang and J. Schneider, "Maximum margin output coding," in Proc. 29th Int. Conf. Mach. Learn., 2012.
[18] S. Xiang, F. Nie, G. Meng, C. Pan, and C. Zhang, "Discriminative least squares regression for multiclass classification and feature selection," IEEE Trans. Neural Netw. Learn. Syst., vol. 23, no. 11, pp. 1738–1754, Nov. 2012.
[19] R. W. Hamming, "Error detecting and error correcting codes," Bell Syst. Tech. J., vol. 26, no. 2, pp. 147–160, 1950.
[20] R. C. Bose and D. K. Ray-Chaudhuri, "On a class of error correcting binary group codes," Inf. Control, vol. 3, no. 1, pp. 68–79, Mar. 1960.
[21] A. Hocquenghem, "Codes correcteurs d'erreurs," Chiffres, vol. 2, no. 2, pp. 147–156, 1959.
[22] R. G. Gallager, Low Density Parity Check Codes, Monograph. Cambridge, MA, USA: MIT Press, 1963.
[23] G. Tsoumakas, J. Vilcek, and E. S. Xioufis, "MULAN: A Java library for multi-label learning," J. Mach. Learn. Res., vol. 12, pp. 2411–2414, Jul. 2010.
[24] R. H. Morelos-Zaragoza, The Art of Error Correcting Coding, 2nd ed. New York, NY, USA: Wiley, 2006.
[25] R. M. Neal. (2006). Software for Low Density Parity Check Codes. [Online]. Available: http://www.cs.toronto.edu/radford/ftp/LDPC-2006-02-08/
[26] M. Hall, E. Frank, G. Holmes, B. Pfahringer, P. Reutemann, and I. H. Witten, "The WEKA data mining software: An update," SIGKDD Explorations, vol. 11, no. 1, pp. 10–18, Jun. 2009.
[27] C.-C. Chang and C.-J. Lin. (2001). LIBSVM: A Library for Support Vector Machines. [Online]. Available: http://www.csie.ntu.edu.tw/~cjlin/libsvm
[28] R.-E. Fan, K.-W. Chang, C.-J. Hsieh, X.-R. Wang, and C.-J. Lin, "LIBLINEAR: A library for large linear classification," J. Mach. Learn. Res., vol. 9, pp. 1871–1874, Jun. 2008.
[29] A. Z. Kouzani, "Multilabel classification using error correction codes," in Proc. 5th Int. Conf. Adv. Comput. Intell., 2010.
[30] J. Fürnkranz, E. Hüllermeier, E. L. Mencia, and K. Brinker, "Multilabel classification via calibrated label ranking," Mach. Learn., vol. 73, no. 2, pp. 133–153, 2008.

Chun-Sung Ferng received the B.S. and M.S. degrees in computer science and information engineering from National Taiwan University, Taipei, Taiwan, in 2010 and 2012, respectively. His past research includes works on ranking, multilabel classification, and large-scale data mining. Mr. Ferng is a member of the NTU team that won the championship of both tracks in KDDCup in 2011, and a member of another NTU team that won the championship of track 2 in KDDCup in 2012. He is a member of the NTU team that won a gold medal in the 2010 World Finals of the ACM International Collegiate Programming Contest. He received two silver medals in the International Olympiad in Informatics in 2005 and 2006.

Hsuan-Tien Lin (M’11) received the B.S. degree in computer science and information engineering from National Taiwan University, Taipei, Taiwan, in 2001, the M.S. and Ph.D. degrees in computer science from the California Institute of Technology, Pasadena, CA, USA, in 2005 and 2008, respectively. He was an Assistant Professor with the Department of Computer Science and Information Engineering, National Taiwan University, in 2008. He has been an Associate Professor since August 2012. He has co-authored the introductory machine learning textbook Learning from Data. His current research interests include theoretical foundations of machine learning, studies on new learning problems, and improvements on learning algorithms. Prof. Lin received the Outstanding Teaching Award from the university in 2011 and the 2012 K.-T. Li Young Researcher Award by ACM Taipei Chapter. He has co-led the teams that won the third place of KDDCup in 2009 slow track, the champion of KDDCup in 2010, the double-champion of the two tracks in KDDCup in 2011, and the champion of track 2 in KDDCup in 2012. He is currently serving as the Secretary General of Taiwanese Association for Artificial Intelligence.
