IEEE TRANSACTIONS ON NEURAL NETWORKS AND LEARNING SYSTEMS, VOL. 24, NO. 11, NOVEMBER 2013


Transfer Ordinal Label Learning

Chun-Wei Seah, Ivor W. Tsang, and Yew-Soon Ong

Abstract—Designing a classifier in the absence of labeled data is becoming a common encounter, as the acquisition of informative labels is often difficult or expensive, particularly in new, uncharted target domains. The feasibility of attaining a reliable classifier for the task of interest is explored in transfer learning, where label information from relevant source domains is considered to complement the design process. The core challenge arising from such endeavors, however, is the induction of source sample selection bias, such that the trained classifier has the tendency of steering toward the distribution of the source domain. In addition, this bias is deemed to become more severe on data involving multiple classes. Considering this cue, our interest in this paper is to address such a challenge in the target domain, where ordinal labeled data are unavailable. In contrast to previous works, we propose a transfer ordinal label learning paradigm to predict the ordinal labels of target unlabeled data by spanning the feasible solution space with an ensemble of ordinal classifiers from the multiple relevant source domains. Specifically, the maximum margin criterion is considered here for the construction of the target classifier from an ensemble of source ordinal classifiers. Theoretical analysis and extensive empirical studies on real-world data sets are presented to study the benefits of the proposed method.

Index Terms—Classifier selection, domain adaptation, ordinal regression, sentiment analysis, source sample selection bias, transfer learning.

I. INTRODUCTION

TO DATE, many practical realizations of machine intelligence are making their way as important tools that assist humans in their decision-making processes. A motivating example is sentiment rating prediction on user reviews as a tool for crafting novel marketing strategies for newly launched products (referred to as the target domain). Each user review can be categorized into different star ratings (often represented as ordinal labels in machine classification), where a higher star rating indicates more positive feedback on the product. In practice, most newly launched products have many user comments posted on the Internet. It is usually the case, however, that few of such comments are readily tagged with sentiment star-rating labels. To address the absence of such label information, the field of domain adaptation (DA) learning has emerged.

Manuscript received December 3, 2012; revised March 21, 2013 and June 6, 2013; accepted June 10, 2013. Date of publication June 28, 2013; date of current version October 15, 2013. This work was supported in part by the Multiplatform Game Innovation Centre in Nanyang Technological University, the Interactive Digital Media Programme Office hosted by the Media Development Authority of Singapore, and the Singapore NTU A*SERC under Grant 112 172 0013. The authors are with the School of Computer Engineering, Nanyang Technological University, 639798 Singapore (e-mail [email protected]; [email protected]; [email protected]). Color versions of one or more of the figures in this paper are available online at http://ieeexplore.ieee.org. Digital Object Identifier 10.1109/TNNLS.2013.2268541

DA studies the feasibility of constructing a classifier for new target domains using the available label information of other related source domains. The initial work of DA, as proposed in [1], presented a study involving the use of a single related source domain that shared a common joint distribution with the target domain of interest. Subsequent works moved on to relax the strict common joint distribution assumption. In particular, dissimilarity in the marginal distributions among domains was established as covariate shift [2]–[5]. To date, a common remedy to address the covariate shift issue is instance reweighting [6]–[10], where the weight of each source sample is defined according to the density ratio of the target P^t(x) to source P^s(x) marginal distributions, i.e., P^t(x)/P^s(x). In this manner, the dissimilarities between the source and target domains are modeled with different marginal distributions P(x), whereas the similarity of the predictive distributions P(y|x) across domains is preserved. The kernel mean matching (KMM) method [7], for instance, first estimates the weight of each source sample by minimizing the maximum mean discrepancy (MMD) [11] between the source labeled samples and the target unlabeled samples. The reweighted source samples are subsequently used for training the target classifier. More recently, a unified framework of density-ratio estimation based on the Bregman divergence has been proposed [12], which includes KMM as one of its variants.

Another popular scheme in the DA field is to seek an appropriate feature representation of the source domain that corresponds well to the feature space of the target domain [13]–[19]. For instance, by minimizing the MMD between the source and target samples, transfer component analysis (TCA) [18] identifies a suitable latent space spanned by some basis vectors, referred to as the transfer components.

In recent years, many DA methods have broadened their scope to leverage the label information from multiple relevant source domains. For instance, by treating every source domain equally, multiple convex combinations (MCC) formulates the target classifier as a fusion of multiple support vector machine (SVM) classifiers that are learned from the individual relevant source domains [20], [21]. It is, however, worth noting that a simple and direct compilation of all data in the source domains to complement the target learning task can lead to adverse outcomes [22], especially when the classifier learned from the source data fails to be discriminative on the target data. As such, an extension of MCC, referred to as the DA machine in [23], was subsequently proposed, where prior knowledge on the source and target domains was incorporated to define the importance of each source classifier. More importantly, it is worth highlighting that, in general, all DA methods train the target classifier by minimizing the
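To make the MMD criterion used by KMM and TCA concrete, the following is a minimal Python sketch of its empirical (biased) estimate between source and target samples; the RBF kernel, its bandwidth gamma, and the function names are illustrative choices rather than the exact formulation of [7] or [18]:

    import numpy as np

    def rbf_kernel(A, B, gamma=1.0):
        # Pairwise RBF kernel k(a, b) = exp(-gamma * ||a - b||^2).
        sq = np.sum(A**2, 1)[:, None] + np.sum(B**2, 1)[None, :] - 2 * A @ B.T
        return np.exp(-gamma * sq)

    def mmd2(Xs, Xt, gamma=1.0):
        # Biased empirical estimate of the squared MMD between source Xs and target Xt.
        Kss = rbf_kernel(Xs, Xs, gamma)
        Ktt = rbf_kernel(Xt, Xt, gamma)
        Kst = rbf_kernel(Xs, Xt, gamma)
        return Kss.mean() + Ktt.mean() - 2 * Kst.mean()

A small MMD value indicates that the source samples (or their reweighted versions, as in KMM) are distributed similarly to the target samples in the chosen RKHS.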



empirical risk defined only on the source data or its weighted samples. With such a design process, the classifier is likely to exhibit properties that are steered toward the distribution of the source domain, which inevitably induces biases in the resultant prediction and thus potentially leads to poor accuracy on the prediction of the target unseen data. In particular, we consider the study of transfer ordinal label learning (TOLL), as the bias is expected to be more severe when multiple classes are involved [24], [25]. We refer to this phenomenon as the source sample selection bias.^1

To alleviate the source sample selection bias, it is generally advisable to directly minimize the expected risk functional defined only on the target data, for example, by leveraging any prior knowledge that may be available on the output label structure of the target domain. An intuitive solution is to group the target unlabeled samples via an unsupervised learning paradigm, subject to some imposed criteria, such as the maximum margin clustering (MMC) and maximum margin context described in [29] and [30], respectively. In particular, MMC maximizes the margin between opposite clusters by considering all possible combinations of labels on the target unlabeled samples. To be specific, MMC optimizes the labels of u unlabeled samples over c^u unique label combinations for a c-class problem. MMC, however, has its limits. By not considering the class structure, such as the abundance of label information readily available in the related source domains (for instance, ordinal class labels in the context of ordinal regression), it tends to underperform DA methods. Besides, the approach may sometimes lead to trivial solutions, such as the case where all samples are grouped with the same class label and hence deemed futile [29], [31].

In this paper, our interest lies in addressing the challenges pertaining to source sample selection bias in the absence of target labeled data. In contrast to existing DA works and MCC, we propose a novel TOLL approach, which imposes the maximum margin criterion on the target unlabeled data in the process of constructing the target classifier from an ensemble of source ordinal classifiers. Here, this paper assumes the source and target domains to share the same tasks.^2 In the absence of target labeled data, it is reasonable to assume that the feasible solution space of the target ordinal labels can be spanned by a series of source ordinal classifiers. The core contributions of this paper are summarized as follows.

1) Existing DA methods that seek instance reweighting or an appropriate feature representation have to date only taken the marginal distribution differences between the source and target domains into consideration. Furthermore, it is established that the effects of source sample selection bias become more severe and challenging in the context of ordinal problems. Despite the advancements in DA approaches, to date none considers making use of the ordinal information in their framework as

^1 Sample selection bias is well known in econometrics [26], [27] and in data set shift [28], and covariate shift is considered as one of its variants [9].
^2 In the event where the source and target domains originate from different tasks, the reader is referred to [32]–[34].

means to improve ordinal predictions, mainly because the transfer of output structures from source to target domains is a nontrivial task. To the best of our knowledge, this paper thus presents the first DA work that investigates the issues pertaining to source sample selection bias under the challenging context of ordinal regression. In particular, TOLL learns the ordinal labels of the target unlabeled data from a convex hull of the ordinal outputs predicted by multiple source classifiers, namely the label vectors.

2) We present the generalization absolute error bound for ordinal regression in the target domain. Our analysis shows that, when the target unlabeled data follow the cluster assumption [35], [36] well, a classifier with a large target margin can reduce this error bound. In the experimental study of the sentiment classification application, the results show that ensemble source ordinal classifiers with a larger target margin are associated with a smaller testing absolute error in the target domain. This verifies the appropriateness and effectiveness of choosing discriminative source classifiers for ordinal regression in the DA setting.

3) Furthermore, our extensive experimental studies highlight that TOLL emerged as superior to several state-of-the-art DA methods in most of the tasks considered, and is robust to various settings of differing class distribution ratios between the source and target domains.

The rest of this paper is organized as follows. Section II gives the preliminaries and a brief review of ordinal regression. Section III introduces the formulation of TOLL and implementation details. Extensive experiments on sentiment, newsgroup, and email data sets are then carried out in Section IV. The experimental results are analyzed and discussed in Section V. Finally, the concluding remarks of this paper are drawn in Section VI. A preliminary work of TOLL can be found in [37]; this paper extends it significantly, including, but not limited to, the extension to ordinal regression, the derivation of the generalization absolute error bound, and the experimental study on ordinal regression problems.

II. PRELIMINARIES AND REVIEW OF ORDINAL REGRESSION

In this section, the notation used in this paper and a brief review of the extended binary classification model for ordinal regression are presented.

A. Notations

Throughout the rest of this paper, a superscript ⊤ denotes the transpose of a vector or a matrix, ⊙ is the elementwise product operator, I[·] is an indicator function that returns 1 if the predicate holds and 0 otherwise, and sign(·) is a function that returns −1 if the input is negative and +1 otherwise. In addition, 1 is a vector of all ones. Given m source domains and one target domain X_u, which contains u unlabeled (testing) samples x_j ∈ ℝ^p, the task in DA is to leverage the available labeled data in the relevant source domains to predict the class label


Fig. 1. The TOLL framework. Precomputed classifiers from source domains 1, ..., m and the unlabeled data in the target domain are used to generate the target label space (Step 1, Algorithm 1); the ordinal labels of the target unlabeled data are then learned by transfer ordinal label learning (Step 2, Algorithm 2).

ŷ_j ∈ {1, 2, ..., K} of each unlabeled sample in the target domain involving a K ordinal class problem. In addition, a K ordinal class problem is represented by K − 1 ordered thresholds θ_1 ≤ θ_2 ≤ · · · ≤ θ_{K−1}, where θ_0 = −∞ and θ_K = ∞. A predictive output h(x) of a sample x that falls between θ_{k−1} ≤ h(x) ≤ θ_k is thus classified as class k.

B. Extended Binary Classification Model for Ordinal Regression

In this section, we briefly outline an extended binary classification model that showcases state-of-the-art performance for ordinal regression [38], [39].^3 An ordinal labeled sample (x, y) can be extended to K − 1 binary samples in the SVM algorithm via the following transformation:

\mathbf{x}^k = (\mathbf{x}, \mathbf{e}_k) \in \mathbb{R}^{p+K-1}, \qquad y^k = 1 - 2\,I[y \le k]   (1)

for k = 1, 2, ..., K − 1, where e_k ∈ ℝ^{K−1} is a vector with the kth element being one and the rest of the elements zero. As an extended binary sample has a dimension of (p + K − 1), the weight vector w of the SVM is also augmented to become (w, −θ), which gives the binary predictive value of x^k as follows:

f(\mathbf{x}^k) = \mathrm{sign}\big((\mathbf{w}, -\boldsymbol{\theta})^\top \mathbf{x}^k\big) = \mathrm{sign}\big(h(\mathbf{x}) - \theta_k\big)   (2)

where h(x) = w^⊤x. Using (2), the predictive class label of sample x is then given as follows:

\sum_{k=1}^{K-1} I[f(\mathbf{x}^k) = 1] + 1.   (3)
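To make the transformation in (1)–(3) concrete, a minimal Python sketch is given below; the function names and the dense augmented representation are illustrative choices, not the formulation of [38] or [39]:

    import numpy as np

    def extend_ordinal_sample(x, y, K):
        # Extend one ordinal sample (x, y) into K-1 binary samples, following (1).
        X_ext, y_ext = [], []
        for k in range(1, K):
            e_k = np.zeros(K - 1)
            e_k[k - 1] = 1.0
            X_ext.append(np.concatenate([x, e_k]))   # x^k = (x, e_k)
            y_ext.append(1 - 2 * int(y <= k))        # y^k = 1 - 2 I[y <= k]
        return np.array(X_ext), np.array(y_ext)

    def predict_ordinal_label(x, w_aug, K):
        # Decode the ordinal label of x from the augmented weight vector (w, -theta), following (2) and (3).
        label = 1
        for k in range(1, K):
            e_k = np.zeros(K - 1)
            e_k[k - 1] = 1.0
            x_k = np.concatenate([x, e_k])
            if np.sign(w_aug @ x_k) == 1:            # f(x^k) = sign((w, -theta)^T x^k)
                label += 1
        return label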

III. PROPOSED TOLL

The learning process of the proposed TOLL framework is shown in Fig. 1. Without loss of generality, source classifiers are first trained for each unique combination of source domains. The source classifiers can be trained using any DA method that is readily available. The source classifiers can even be precomputed so as to preserve the interests of a company, such as the privacy and security of customer data. In TOLL, the relevancy and specificity of each source classifier are then learned with respect to the target domain.

^3 A very similar idea was previously presented in [40].

In particular, TOLL alleviates the presence of any unwanted sample selection bias that may exist by learning the biases of each source classifier, based on prior knowledge available on the output label structure of the target domain. All these source classifiers with different biases are subsequently used to span the target label space (see Section III-A). Once the target label space is formed, TOLL proceeds to simultaneously learn the weight of each source classifier and the target classifier for the domain of interest, in a manner where the margin of separation in the target label space is maximized (see Section III-B).

A. Generating Target Label Space from Multiple Sources

Using the complementary labeled data from multiple relevant source domains, an appropriate target classifier can be derived from an ensemble of source classifiers for the purpose of target unlabeled data prediction. In the following, the procedure to generate the label space for a given set of target unlabeled data, referred to as the target label space, is discussed. An outline of the procedure is summarized in Algorithm 1.

Given the availability of m source domains, the design process begins with the construction of a classifier for each source domain and also a classifier for each combination of 2, 3, ..., (m − 1) source domains, until all S possible combinations of the m source domains are explored, i.e., S = \sum_{i=1}^{m} \frac{m!}{i!\,(m-i)!} classifiers are trained. Diverse forms of source classifiers can be trained, based on the SVM, Gaussian processes [41], the transductive SVM [35], or any other variant of supervised, semisupervised, or DA methods. Without loss of generality, we consider the supervised SVM in this paper. Like most models, each of the S source classifiers includes a bias term b such that the decision boundary is not restricted to intersect the origin.

TOLL leverages the biases of the source classifiers to generate label vectors \mathbf{y} = [y_1^1, ..., y_1^{(K-1)}, ..., y_u^1, ..., y_u^{(K-1)}]^\top for the target unlabeled data, where [y_i^1, ..., y_i^{(K-1)}] \in \{-1, 1\}^{K-1} are the extended class labels of the ith sample. As the source classifiers may be trained from source domains whose distributions differ from that of the target domain, it is more beneficial to determine the bias b based on the target data. Hence, we propose to define the bias b of the source classifiers in such a way that the label vector \mathbf{y} of the target unlabeled data satisfies the following balance constraint:

\frac{u}{K}(1-\beta) \;\le\; \sum_{i=1}^{u} I\Big[\Big(\sum_{j=1}^{K-1} I\big[y_i^j = 1\big]\Big) + 1 = k\Big] \;\le\; \frac{u}{K}(1+\beta), \quad \forall k \in \{1, ..., K\}

where β is the hyperparameter that restricts the imbalance of the class size q_k for the kth class in the label vector. This constraint can be implicitly imposed by sorting the classifier's decision outputs on the target unlabeled data, and it forms at most Z = (2\beta u/K)^{K-1} unique label vectors. Hence, the target label space is spanned by S × Z label vectors.
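The following is a minimal Python sketch of the label-vector generation in Algorithm 1 for a single source classifier; the variable names, the descending-score orientation, and the brute-force enumeration of admissible class sizes are simplifying choices for illustration, not the authors' implementation:

    import numpy as np
    from itertools import product

    def generate_label_vectors(scores, K, beta):
        # Sketch of the label-vector generation for one source classifier.
        # scores: decision values of the u target unlabeled samples under that classifier.
        u = len(scores)
        order = np.argsort(-scores)                    # rank samples by decision value (descending)
        lo = int(np.floor(u / K * (1.0 - beta)))
        hi = int(np.ceil(u / K * (1.0 + beta)))
        label_vectors = []
        # Brute-force enumeration of admissible class sizes; only practical for small K.
        for sizes in product(range(lo, hi + 1), repeat=K):
            if sum(sizes) != u:                        # class sizes q_1, ..., q_K must cover all u samples
                continue
            y = np.empty(u, dtype=int)
            start = 0
            for k, q in enumerate(sizes, start=1):
                y[order[start:start + q]] = K - k + 1  # highest scores receive the highest ordinal class
                start += q
            label_vectors.append(y)                    # each y can then be expanded via (1) into extended labels
        return label_vectors

In TOLL, this generation is repeated for each of the S source classifiers, giving the S × Z label vectors that span the target label space.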


Algorithm 1 Generation of the Target Label Space
1: Inputs: F (a set of precomputed source classifiers, one trained from each unique combination of source domains); β, which controls the imbalance of the label vectors
2: Output: Y (a set of generated label vectors for the target unlabeled data)
3: for all f_s ∈ F do
4:   indexes = sort(f_s(x_1), ..., f_s(x_u))
5:   z = 1, q_0 = 1
6:   for each unique set of {q_k | (u/K)(1 − β) ≤ q_k ≤ (u/K)(1 + β), \sum_{k=1}^{K} q_k = u, ∀k = 1, ..., K} do
7:     create a vector y_z^s ∈ ℝ^{u(K−1)}
8:     for C = 1, ..., K do
9:       assign y_z^s with the indexes \sum_{k=0}^{C-1} q_k to \sum_{k=1}^{C} q_k as extended class label C
10:    end for
11:    Y = Y ∪ {y_z^s}; z = z + 1
12:  end for
13: end for
14: return Y

With the S × Z label vectors, the target label space M is then defined as follows:

M = \Big\{ \hat{\mathbf{y}} = \sum_{s=1}^{S} \sum_{z=1}^{Z} g_z^s \mathbf{y}_z^s \;\Big|\; \sum_{s=1}^{S} \sum_{z=1}^{Z} g_z^s = 1,\; g_z^s \ge 0,\; \frac{u}{K}(1-\beta) \le \sum_{i=1}^{u} I\Big[\Big(\sum_{j=1}^{K-1} I\big[y_{zi}^{sj} = 1\big]\Big) + 1 = k\Big] \le \frac{u}{K}(1+\beta), \; \forall k = 1, ..., K,\; \forall z = 1, ..., Z,\; s = 1, ..., S \Big\}   (4)

where the importance of each source classifier y_z^s is weighted by g_z^s and, without loss of generality, the extended binary class labels of X_u are denoted by \hat{\mathbf{y}} = [\hat{y}_1^1, ..., \hat{y}_1^{(K-1)}, ..., \hat{y}_u^1, ..., \hat{y}_u^{(K-1)}]^\top. In addition, M forms the convex hull of the target output label space [42].

B. Proposed Formulation

To alleviate the source sample selection bias, we propose the minimization of the expected risk by taking only the target unlabeled samples into consideration. In particular, in TOLL, learning the labels of the unlabeled samples is conducted by minimizing the following structural risk using the hinge loss function of the SVM:

\min_{\hat{\mathbf{y}} \in M} \Big\{ \min_{\mathbf{w}, \rho, \boldsymbol{\xi}} \; \frac{1}{2}\|\mathbf{w}\|_2^2 + \frac{1}{2}\|\boldsymbol{\theta}\|_2^2 - \rho + C \sum_{k=1}^{K-1} \sum_{i=1}^{u} \xi_i^k
\quad \text{s.t.} \;\; \hat{y}_i^k (\mathbf{w}^\top \phi(\mathbf{x}_i) - \theta_k) \ge \rho - \xi_i^k, \;\; \xi_i^k \ge 0, \;\; \forall i = 1, ..., u, \; \forall k = 1, ..., K-1, \;\; \theta_k \le \theta_{k+1} \; \forall k = 1, ..., K-1 \Big\}   (5)

where \phi(\mathbf{x}) maps \mathbf{x} into a high-dimensional space, \hat{y}_i^k \in \{+1, -1\}, \mathbf{w}^\top\phi(\mathbf{x}) is the predictive function, \rho is the maximum error allowable before \xi_i^k (the slack variable) is penalized, and C is the regularization parameter that trades off between model complexity and empirical risk. As the hinge loss employed in the inner minimization [i.e., enclosed by {·} in (5)] is nonincreasing, the ordered constraints θ_1 ≤ θ_2 ≤ · · · ≤ θ_{K−1} are implicitly fulfilled (see the proof of Theorem 2 in [38]). With the outer minimization of (5) over \hat{\mathbf{y}}, the optimal decision function \mathbf{w}^\top\phi(\mathbf{x}) is essentially the solution whose decision boundaries lie in the low-density regions of the target unlabeled data [36]. Furthermore, TOLL learns the weight of each label vector \mathbf{y}_z^s (as predicted by a source classifier) in (5) by minimizing the structural risk involving the target samples only. In this manner, the kernel expansion of the target classifier is defined only by data samples in the target domain. In the event that some target labeled data do exist, such information can be easily incorporated into TOLL by simply imposing the labels of the target labeled data on the available \mathbf{y}_z^s.

C. Optimization in TOLL

In the following, the detailed steps to solve (5) in TOLL are presented. Initially, the Lagrangian of the inner minimization in (5), enclosed by {·}, can be written as follows:

L = \frac{1}{2}\|\mathbf{w}\|_2^2 + \frac{1}{2}\|\boldsymbol{\theta}\|_2^2 - \rho + C \sum_{i=1}^{u} \sum_{k=1}^{K-1} \xi_i^k - \sum_{i=1}^{u} \sum_{k=1}^{K-1} \alpha_i^k \big( \hat{y}_i^k (\mathbf{w}^\top \phi(\mathbf{x}_i) - \theta_k) - \rho + \xi_i^k \big) - \sum_{i=1}^{u} \sum_{k=1}^{K-1} \lambda_i^k \xi_i^k   (6)

where \alpha_i^k \ge 0 and \lambda_i^k \ge 0 are the Lagrangian multipliers of the inequality constraints. According to the KKT conditions, we have the following:

\mathbf{w} = \sum_{i=1}^{u} \sum_{k=1}^{K-1} \alpha_i^k \hat{y}_i^k \phi(\mathbf{x}_i)   (7)

\theta_k = - \sum_{i=1}^{u} \alpha_i^k \hat{y}_i^k   (8)

C = \alpha_i^k + \lambda_i^k   (9)

\sum_{i=1}^{u} \sum_{k=1}^{K-1} \alpha_i^k = 1.   (10)

Substituting (7)–(10) back into (6), we have the following:

\min_{\hat{\mathbf{y}} \in M} \max_{\boldsymbol{\alpha}} \; -\frac{1}{2} \sum_{i,j=1}^{u} \sum_{k,k'=1}^{K-1} \alpha_i^k \alpha_j^{k'} \hat{y}_i^k \hat{y}_j^{k'} \tilde{K}(\mathbf{x}_i^k, \mathbf{x}_j^{k'})

where \tilde{K}(\mathbf{x}_i^k, \mathbf{x}_j^{k'}) = \phi(\mathbf{x}_i)^\top \phi(\mathbf{x}_j) + I[k = k']. We further define \boldsymbol{\alpha} = \{\alpha_1^1, ..., \alpha_1^{K-1}, ..., \alpha_u^1, ..., \alpha_u^{K-1}\} and A = \{\boldsymbol{\alpha} \,|\, \sum_{i=1}^{u}\sum_{k=1}^{K-1} \alpha_i^k = 1, \; 0 \le \alpha_i^k \le C, \; \forall i = 1, ..., u, \; \forall k = 1, ..., K-1\}; then (5) is simplified as follows:

\min_{\hat{\mathbf{y}} \in M} \max_{\boldsymbol{\alpha} \in A} \; -\frac{1}{2} \boldsymbol{\alpha}^\top (\tilde{\mathbf{K}} \odot \hat{\mathbf{y}}\hat{\mathbf{y}}^\top) \boldsymbol{\alpha}.   (11)


Algorithm 2 TOLL
1: Inputs: M_2 (the set of source label vectors generated by Algorithm 1)
2: Set α = (1/u)1, then find the most violated y_t in (16) and let S = {y_t}
3: repeat
4:   Find the optimal d ∈ S and α in (15) via MKL
5:   Find the most violated y_t by (16) and set S = S ∪ {y_t}
6: until convergence
7: return d_t, y_t ∀t : y_t ∈ S

As A and M are both compact sets, according to the minimax theorem [43], swapping the order of the min and max in (11) is equivalent to

\max_{\boldsymbol{\alpha} \in A} \min_{\hat{\mathbf{y}} \in M} \; -\frac{1}{2} \boldsymbol{\alpha}^\top (\tilde{\mathbf{K}} \odot \hat{\mathbf{y}}\hat{\mathbf{y}}^\top) \boldsymbol{\alpha}.   (12)

In addition, (12) can be reformulated as follows:

\max_{\boldsymbol{\alpha} \in A} \; -\tau \quad \text{s.t.} \quad \tau \ge \frac{1}{2} \boldsymbol{\alpha}^\top (\tilde{\mathbf{K}} \odot \mathbf{y}_t\mathbf{y}_t^\top) \boldsymbol{\alpha} \quad \forall \mathbf{y}_t \in M.   (13)

In addition, the dual form of the inner maximization of (13) is

\max_{\boldsymbol{\alpha} \in A} \min_{\mathbf{d} \in D} \; -\frac{1}{2} \boldsymbol{\alpha}^\top \Big( \sum_{t: \mathbf{y}_t \in M} d_t\, \tilde{\mathbf{K}} \odot \mathbf{y}_t\mathbf{y}_t^\top \Big) \boldsymbol{\alpha}   (14)

where \mathbf{d} is a vector of Lagrangian multipliers d_t and D = \{\mathbf{d} \,|\, \sum_{t: \mathbf{y}_t \in M} d_t = 1, \; d_t \ge 0 \; \forall t : \mathbf{y}_t \in M\} is the domain of \mathbf{d}. As D and A are both compact sets, swapping the order of the max and min in (14) is equivalent to

\min_{\mathbf{d} \in D} \max_{\boldsymbol{\alpha} \in A} \; -\frac{1}{2} \boldsymbol{\alpha}^\top \Big( \sum_{t: \mathbf{y}_t \in M} d_t\, \tilde{\mathbf{K}} \odot \mathbf{y}_t\mathbf{y}_t^\top \Big) \boldsymbol{\alpha}.   (15)

The set M in (15) corresponds to the base kernels of a multiple kernel learning (MKL) problem [44]. Hence, (15) can be solved using efficient MKL solvers [45]. In the presence of a significant number of source classifiers, solving by MKL may not be efficient. Fortunately, as it is unlikely for all of the constraints in (15) to be active simultaneously at the optimal solution, the cutting plane method can be efficiently deployed [46] in solving (15) (see Algorithm 2). The algorithm begins with the initialization of α = (1/u)1 and then locates the most violated constraint of (16), which also fails the constraint in (13).

Theorem 1: The most violated constraint of (13) for a fixed \boldsymbol{\alpha} is

\arg\max_{\mathbf{y} \in M_2} \; \frac{1}{2} \boldsymbol{\alpha}^\top (\tilde{\mathbf{K}} \odot \mathbf{y}\mathbf{y}^\top) \boldsymbol{\alpha}   (16)

where M_2 = \{\mathbf{y}_1^1, ..., \mathbf{y}_Z^1, ..., \mathbf{y}_1^S, ..., \mathbf{y}_Z^S\}.

Proof: Let f(\mathbf{y}) = \frac{1}{2}\boldsymbol{\alpha}^\top (\tilde{\mathbf{K}} \odot \mathbf{y}\mathbf{y}^\top)\boldsymbol{\alpha}. Since f(·) is a convex function, f((1-\lambda)\mathbf{y}_i + \lambda\mathbf{y}_j) \le (1-\lambda) f(\mathbf{y}_i) + \lambda f(\mathbf{y}_j) for all \mathbf{y}_i, \mathbf{y}_j \in M_2 and \lambda \in [0,1] by the convexity property. If f(\mathbf{y}_i) > f(\mathbf{y}_j), then f((1-\lambda)\mathbf{y}_i + \lambda\mathbf{y}_j) \le f(\mathbf{y}_i). Similarly, if f(\mathbf{y}_i) < f(\mathbf{y}_j), then f((1-\lambda)\mathbf{y}_i + \lambda\mathbf{y}_j) \le f(\mathbf{y}_j). Therefore, f((1-\lambda)\mathbf{y}_i + \lambda\mathbf{y}_j) \le \max(f(\mathbf{y}_i), f(\mathbf{y}_j)) holds. Through induction [42], f(\lambda_1^1 \mathbf{y}_1^1 + \cdots + \lambda_Z^1 \mathbf{y}_Z^1 + \cdots + \lambda_1^S \mathbf{y}_1^S + \cdots + \lambda_Z^S \mathbf{y}_Z^S) \le \max_{\mathbf{y} \in M_2} f(\mathbf{y}), given \sum_{s=1}^{S}\sum_{z=1}^{Z} \lambda_z^s = 1 and \lambda_z^s \in [0,1].

To solve (16), no numerical optimization solver is needed: the maximum objective value is simply obtained by computing all the objective values over the set M_2, and the most violated \mathbf{y}_t is the one with the highest value among those computed. Hence, the first active constraint is chosen based on the most violated \mathbf{y}_t. Thereafter, the current set of selected constraints is solved via MKL before obtaining the next most violated constraint for inclusion into the set of constraints. The process of finding the next most violated constraint is repeated until convergence. Empirically, only a few iterations are needed for Algorithm 2 to converge. The overall time complexity of TOLL is O(T J((K − 1)u)^{2.3}), where J and T are the numbers of iterations incurred by the cutting plane method and MKL, respectively, and O(((K − 1)u)^{2.3}) is the empirical complexity of SVM training. From our experience in running the experiments, J is generally less than a dozen and T is usually small as it depends on J.

Upon convergence, the labels of X_u can be derived as follows. For a K-class problem with K > 2, replacing f(\mathbf{x}^k) in (3) by \mathrm{sign}(\sum_{t: \mathbf{y}_t \in S} d_t y_t^k), the class label of \mathbf{x} becomes (\sum_{k=1}^{K-1} I[\mathrm{sign}(\sum_{t: \mathbf{y}_t \in S} d_t y_t^k) = 1]) + 1. This type of labeling is based on weighted voting in which each vote carries a learned weight d_t. In addition, for a binary problem (i.e., K = 2), the labels of the target domain can be recovered using singular value decomposition on Y = \sum_{t: \mathbf{y}_t \in S} d_t \mathbf{y}_t\mathbf{y}_t^\top as \sqrt{D_1}V_1 [29], [31], where D_1 and V_1 are the largest eigenvalue and the corresponding eigenvector, respectively. Then, the polarity of the groups learned by V_1 can be determined with a majority vote by the source classifiers.
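For illustration, a minimal Python sketch of the two computational ingredients just described is given below: the most-violated-constraint selection of (16) by exhaustive evaluation over the candidate set, and the weighted-voting decoding of the ordinal labels. Here Ktilde stands for the extended kernel matrix on the target data, and the function and variable names are illustrative assumptions rather than the authors' implementation:

    import numpy as np

    def most_violated(alpha, Ktilde, candidates):
        # Select the label vector y in M2 maximizing 0.5 * alpha^T (Ktilde o y y^T) alpha, as in (16).
        # Uses the identity alpha^T (K o y y^T) alpha = (alpha o y)^T K (alpha o y).
        best_val, best_y = -np.inf, None
        for y in candidates:
            a = alpha * y
            val = 0.5 * a @ Ktilde @ a
            if val > best_val:
                best_val, best_y = val, y
        return best_y

    def decode_ordinal_labels(d, Y_active, u, K):
        # Weighted voting: aggregate sum_t d_t * y_t over the selected extended label vectors
        # (each of length u*(K-1), ordered sample-by-sample) and count positive signs per sample.
        agg = sum(dt * yt for dt, yt in zip(d, Y_active)).reshape(u, K - 1)
        return 1 + np.sum(np.sign(agg) == 1, axis=1)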

D. Generalization Error Bound of TOLL

In this section, we analyze the generalization absolute error bound of the proposed TOLL in the target domain. Initially, we define the joint distributions of the sth source domain and the target domain as P^s and P^t, respectively. Similarly, the marginal distributions of the sth source domain and the target domain are denoted by D^s and D^t, respectively. The expected errors of the sth source domain and the target domain are then given by

\epsilon^s(h) = E_{(\mathbf{x},y)\sim P^s}\, I[\mathrm{sign}(h(\mathbf{x})) \ne y] \quad \text{and} \quad \epsilon^t(h) = E_{(\mathbf{x},y)\sim P^t}\, I[\mathrm{sign}(h(\mathbf{x})) \ne y]

respectively. Note that I[\mathrm{sign}(h(\mathbf{x})) \ne y] is considered as the zero–one loss function. Similarly, the expected errors for the kth extended class of the sth source domain and the target domain are given by

\epsilon_k^s(h) = E_{(\mathbf{x},y)\sim P^s}\, I[\mathrm{sign}(h(\mathbf{x}^k)) \ne y^k] \quad \text{and} \quad \epsilon_k^t(h) = E_{(\mathbf{x},y)\sim P^t}\, I[\mathrm{sign}(h(\mathbf{x}^k)) \ne y^k]


respectively. In addition, given two hypotheses h_1 and h_2, we define \epsilon^t(h_1, h_2) = E_{\mathbf{x}\sim D^t}\, I[\mathrm{sign}(h_1(\mathbf{x})) \ne \mathrm{sign}(h_2(\mathbf{x}))]. In the following, we first derive the generalization absolute error bound for a target hypothesis of ordinal regression in Theorems 2 and 3. After that, the generalization absolute error bound on the target data for TOLL is derived in Theorem 4.

Theorem 2: A hypothesis h of ordinal regression has the following generalization absolute error bound in the target domain:

\sum_{k=1}^{K-1} \epsilon_k^t(h) \;\le\; \sum_{k=1}^{K-1} \big( \epsilon_k^s(h) + d_k^s(h) + \lambda_k^s \big)   (17)

where \lambda_k^s = \min_{h^* \in H} \epsilon_k^s(h^*) + \epsilon_k^t(h^*), d_k^s(h) = |\epsilon_k^t(h, h^*) - \epsilon_k^s(h, h^*)|, and |\cdot| is the absolute operator.

Proof: From [47], a hypothesis h has the following generalization error bound in the target domain:

\epsilon^t(h) \;\le\; \epsilon^s(h) + d^s(h) + \lambda^s   (18)

where \lambda^s = \min_{h^* \in H} \epsilon^s(h^*) + \epsilon^t(h^*) and d^s(h) = |\epsilon^t(h, h^*) - \epsilon^s(h, h^*)|. Using the extended binary classification model, the generalization error bound of the hypothesis h on the kth extended class is

\epsilon_k^t(h) \;\le\; \epsilon_k^s(h) + d_k^s(h) + \lambda_k^s.   (19)

Combining the error bounds over all ordinal labels, the proof is completed.

Theorem 3: For a margin \Delta > 0, with probability of at least 1 − \delta, a hypothesis h of ordinal regression has the following generalization absolute error bound in the target domain:

\sum_{k=1}^{K-1} \epsilon_k^t(h) \;\le\; \sum_{k=1}^{K-1} \big( \hat{\epsilon}_k^s(h) + \lambda_k^s + d_k^s(h) \big)   (20)

where the empirical risk \hat{\epsilon}_k^s(h) = \frac{1}{n^s}\sum_{i=1}^{n^s} I[y_i^{sk} h(\mathbf{x}_i^{sk}) \le \Delta] + \Delta^s, whereas the confidence term \Delta^s = O(\log n^s/\sqrt{n^s},\; R/\Delta,\; \sqrt{\log(1/\delta)}) is such that \tilde{K}(\mathbf{x},\mathbf{x}) + 1 \le R^2, \|\mathbf{w}\|^2 + \|\boldsymbol{\theta}\|^2 \le 1, and h(\mathbf{x}^k) = \mathbf{w}^\top \mathbf{x} - \theta_k.

Proof: From [48, Th. 6], a hypothesis h of ordinal regression has the following source generalization absolute error bound:

\sum_{k=1}^{K-1} \epsilon_k^s(h) \;\le\; \sum_{k=1}^{K-1} \hat{\epsilon}_k^s(h).   (21)

Next, by substituting (21) into (17), the proof is obtained.

Theorem 4: A hypothesis h of ordinal regression in the proposed framework, TOLL, with h = \sum_{s=1}^{S}\sum_{z=1}^{Z} g_z^s h_z^s and \sum_{s=1}^{S}\sum_{z=1}^{Z} g_z^s = 1, where h_z^s is derived from multiple source domains, has the following generalization absolute error bound in the target domain:

\sum_{k=1}^{K-1} \epsilon_k^t(h) \;\le\; \sum_{s=1}^{S}\sum_{z=1}^{Z}\sum_{k=1}^{K-1} g_z^s \big( \hat{\epsilon}_k^s(h) + \lambda_k^s + d_k^s(h) \big).   (22)

Proof: As TOLL imposes the inequality constraint g_z^s \ge 0, the following holds for any g_z^s applied to (20):

g_z^s \sum_{k=1}^{K-1} \epsilon_k^t(h) \;\le\; g_z^s \sum_{k=1}^{K-1} \big( \hat{\epsilon}_k^s(h) + \lambda_k^s + d_k^s(h) \big).   (23)

Then, summing (23) over s and z with \sum_{s=1}^{S}\sum_{z=1}^{Z} g_z^s = 1, the proof is completed.

Using the generalization bound derived in (22), we proceed to discuss the solution obtained by the strategy in TOLL. With Algorithm 1, TOLL trains a classifier that minimizes the structural risk for each source domain, and then attains numerous hypotheses from the multiple relevant source classifiers by projecting their bias parameters onto the target unlabeled data. Next, the weight g_z^s is obtained for each hypothesis via Algorithm 2. As the hypotheses are obtained from the source domains, it is reasonable for \hat{\epsilon}_k^s(h) to be small.^4 Furthermore, although \lambda_k^s is unknown, to be consistent with previously reported DA works, we shall assume \lambda_k^s to be small. As both \hat{\epsilon}_k^s(h) and \lambda_k^s in (22) are small, the remaining term to minimize reduces to \sum_{s=1}^{S}\sum_{z=1}^{Z}\sum_{k=1}^{K-1} g_z^s d_k^s(h), where d_k^s(h) = |\epsilon_k^t(h,h^*) - \epsilon_k^s(h,h^*)|. In what follows, we present the details to optimize this term. In particular, there are two cases to analyze d_k^s(h), namely, \epsilon_k^t(h,h^*) \le \epsilon_k^s(h,h^*) and \epsilon_k^t(h,h^*) \ge \epsilon_k^s(h,h^*).

Remark 1: When \epsilon_k^t(h,h^*) \le \epsilon_k^s(h,h^*), we have d_k^s(h) = \epsilon_k^s(h,h^*) - \epsilon_k^t(h,h^*). As \epsilon_k^s(h,h^*) \le \hat{\epsilon}_k^s(h) + \epsilon_k^s(h^*) (by the triangle inequality), in which \epsilon_k^s(h^*) is a part of \lambda_k^s and is assumed to be reasonably small, and \hat{\epsilon}_k^s(h) [defined in (20)] can be estimated and chosen to be small,^4 the bound for d_k^s(h) should also be reasonably small. Recall that minimizing (5) over \hat{\mathbf{y}} \in M is equivalent to choosing a label vector \hat{\mathbf{y}} that enforces a decision boundary lying in the lower density regions of the target unlabeled data. It is thus expected for \epsilon_k^t(h,h^*) to be small according to the cluster assumption [35], [36].

Remark 2: When \epsilon_k^t(h,h^*) \ge \epsilon_k^s(h,h^*), we have d_k^s(h) = \epsilon_k^t(h,h^*) - \epsilon_k^s(h,h^*). Hence, minimizing \epsilon_k^t(h,h^*) leads to the minimization of d_k^s(h) as well.

In summary, the ensemble strategy proposed in TOLL to minimize \sum_{s=1}^{S}\sum_{z=1}^{Z}\sum_{k=1}^{K-1} g_z^s d_k^s(h) alleviates the risk of choosing a poor source hypothesis.

^4 If the empirical risk of a source domain is high, this source domain can be removed from being considered to form the hypotheses of TOLL. For simplicity, we assume the empirical risks of all source domains are acceptable and hence no removal is needed.

IV. EXPERIMENTAL STUDY

In this section, we describe the setting of the class ratios of the source and target domains, the data sets (sentiment, newsgroup, and email) used for evaluation, the state-of-the-art algorithms considered in this paper, and the evaluation metric used to measure performance, in Sections IV-A–D, respectively.

A. Setup on the Class Ratios of the Source and Target Domains

In practice, the true class distribution of the target domain is usually unknown. Thus, we begin with the investigation of


the effects of various class ratios of the target data on the prediction accuracies in this paper. To carry out this research, the term target positive class ratio (TPCR) is introduced for the purpose of analyzing the impact of various class ratios in the target domain on the diverse learning algorithms considered. For a binary problem (K = 2), the TPCR defines the proportion of positive samples in the target domain. For example, in a set of 1000 target samples, a TPCR of 0.3 implies that 300 samples are positive and the remaining are negative. In the experimental study, TPCR values of 0.3, 0.5, and 0.7 are investigated. In the K = 4 ordinal regression problem, the samples with labels belonging to the first half of the K classes are treated as positive and the rest of the samples are treated as negative. In addition, each class in its respective positive/negative group has an equal number of samples. For example, a K = 4 problem with 1000 samples under the setting of TPCR = 0.3 implies that classes 1 and 2 have 150 samples each, whereas classes 3 and 4 have 350 samples each.

In addition to investigating the various class distributions of the target domain, we also study various class ratios of the source domains, as source sample selection bias is likely to be observed in a trained classifier that exhibits properties steered toward the distribution of the source domain. Specifically, an imbalanced class ratio between the source and target domains is expected to aggravate the degree of source sample selection bias [49], [50]. Hence, in this paper, the term source positive class ratio (SPCR) is introduced to denote the proportion of positive samples in the source domain. In the experimental study, the robustness of each state-of-the-art algorithm under different configurations, particularly at SPCR values of 0.2, 0.4, 0.6, and 0.8, is investigated for the different class ratios between the source and target domains.

B. Multidomain Sentiment, Newsgroup, and Email Data Sets

On the sentiment data set, we consider the cases where K = 2 and K = 4. The data set is prepared as reported in [14]. It comprises four categories of product reviews from Amazon.com: book, DVDs, electronics, and kitchen appliances. For each task, one category is posed as the target domain, whereas the rest serve as related source domains. Each review is marked on a five-star rating scale, where a higher star rating implies more positive feedback. In [14], the three-star rating data were removed to avoid ambiguities in the binary classification. In the binary (K = 2) problem, the negative samples are made up of one- and two-star ratings, whereas the rest of the ratings form the positive samples. Hence, the task is to categorize the target testing data into positive and negative reviews. In the (K = 4) problem, the task is to categorize the target testing data into star ratings 1, 2, 4, and 5. In each of the tasks for both the (K = 2) and (K = 4) problems, 2000 samples are randomly selected from each source domain to form the labeled data and 500 samples from the target domain serve as unlabeled data.

On the newsgroup and email data sets, (K = 2) is considered. The newsgroup data set consists of three main categories: comp, rec, and sci. Each main category is then separated into Source 1, Source 2, Source 3, and target (see Table I),


TABLE I: Grouping of Source and Target Domains in the Newsgroup Data Set

resulting in three tasks: comp versus rec, comp versus sci, and rec versus sci. In particular, each task is to categorize the target testing data into their respective categories. The email data set considered here is available from the ECML/PKDD 2006 discovery challenge.^5 The source and target domains consist of spam and nonspam emails from user and public inboxes, respectively. The task is then defined as categorizing the target testing data into spam and nonspam emails. In each of the tasks, 1000 samples are randomly selected from each source domain to form the labeled data, whereas 500 samples from the target domain serve as unlabeled data.

As the problems of interest are text data sets, they are preprocessed with single terms and biterms extracted, stopwords removed, and stemming and normalization of each feature performed. Each feature of a sample is therefore represented by its respective tf-idf value. Further, the linear kernel is employed in the experimental study.

C. State-of-the-Art Algorithms Considered

In this paper, several state-of-the-art algorithms are investigated for the diverse TPCR and SPCR settings considered on data sets involving three source domains (sentiment and newsgroup data sets) or one source domain (email data set) and a target domain, as follows.

1) 1S-SVM: Each source domain is trained using the SVM,^6 and the lowest balanced absolute error among the classifiers is reported.

2) 2S-SVM: Each unique pair of source domains is trained using the SVM and the lowest balanced absolute error among the classifiers is reported.

3) MCC: A representative DA method that linearly combines all source classifiers trained based on the SVM [20]. As this paper involves three source domains, MCC is equivalent to a 3S-SVM.

4) Label Generating (LG)-MMC^7 [31]: It maximizes the margin separating two opposite clusters of the target unlabeled data without the use of any label information available in the source domains. As LG-MMC does not use any class label information, we assume the class labels assigned to the respective clusters to be the true class labels that give the lowest balanced absolute error. As LG-MMC does not consider the ordinal constraint, it is only used on the binary problems (i.e., K = 2).

^5 http://www.ecmlpkdd2006.org/challenge.html
^6 The ordinal SVM code used is publicly available at http://www.work.caltech.edu/~htlin/program/libsvm/#ordinal
^7 The program is downloaded from http://lamda.nju.edu.cn/files/LGMMC_v2.rar


5) KMM: It addresses the marginal distribution differences between a single source domain and a target domain by reweighting each of the source samples in the reproducing kernel Hilbert space (RKHS) such that the MMD criterion defined on the source and target domains [7] is minimized.^8 A weighted SVM is then trained on the source domain using the derived weight of each sample. One KMM is trained for each source domain and the lowest balanced absolute error among the classifiers is reported.

6) TCA: It assumes that there exist feature maps with similar predictive distributions between a single source domain and a target domain, i.e., P^S(y|x) ≈ P^T(y|x), where the superscripts S and T refer to the source domain and target domain, respectively. Hence, it learns a set of transfer components in the RKHS based on the MMD criterion, and subsequently the SVM is trained on the source domain in this RKHS [18]. One TCA is trained for each source domain and the lowest balanced absolute error among the classifiers is reported.

7) TOLL: It learns the labels of the target unlabeled data by maximizing the margin of separation in the target data based on the label space spanned by a linear combination of source classifiers, as shown in Fig. 1.

The parameters of all methods are configured by means of the k-fold cross-source-domain validation scheme suggested in [51]. It is an extension of the standard k-fold cross validation for DA learning. Here, k is the number of source domains, i.e., k = m; specifically, each partition represents a source domain in k-fold cross-source-domain validation. In addition, β is fixed at 0.4 in LG-MMC and TOLL (as used in Algorithm 1).

^8 The weights of the source samples are learned using quadratic programming, as stated in (12) of [7].

D. Evaluated Performance Metric

For ordinal problems, the absolute error is commonly used as the criterion for defining accuracy; it gives the absolute difference between the predicted label and the ground truth label. In particular, the smaller the absolute error, the nearer the predicted label is to the ground truth label. In cases where the source and target class distributions differ, however, the balanced error can be considered [52], [53]. Taking this cue, the balanced absolute error is adopted as the evaluation criterion for the ordinal regression problem considered in this paper, which is defined as follows:

\frac{1}{K} \sum_{k=1}^{K} \left( \frac{\sum_{i=1}^{u} |\hat{y}_i - y_i|\, I[y_i = k]}{\sum_{i=1}^{u} I[y_i = k]} \right)   (24)

where |·| is the absolute function. For each of the tasks, 10 independent runs are conducted and the average results are reported.

V. EXPERIMENTAL RESULTS AND DISCUSSION

In this section, we first perform a study on the validity of the cluster assumption used in TOLL, before proceeding with the discussion and analysis of the experimental results for the algorithms considered. Finally, an experimental study is carried out to investigate the time complexities of the compared state-of-the-art methods.

A. Case Study on Cluster Assumption in TOLL

In this section, we analyze the validity of the cluster assumption criterion employed in TOLL. Recall that TOLL begins with a computation of the source classifiers for generating the set of potential class labels y_z^s for the target unlabeled data, as outlined in Algorithm 1. We plot the margin of separation 2/||w|| for each y_z^s obtained from solving min_{w,ρ,ξ} in (5), with kitchen appliances serving as the target domain. The plots of y_z^s generated by 1S-SVM as the source classifiers trained on the source domains book, DVDs, and electronics, respectively, are shown in Fig. 2(a)–(c). The plots of y_z^s generated by 2S-SVM as source classifiers from two combined source domains, namely book and DVDs, book and electronics, and DVDs and electronics, respectively, are shown in Fig. 2(d)–(f). Then, the plot of y_z^s generated by 3S-SVM (MCC) as the source classifier trained on all the source domains is shown in Fig. 2(g). In particular, the plots of y_z^s are from a particular run using kitchen appliances as the target domain with the settings K = 4, TPCR = 0.5, and SPCR = 0.4. The line in each of Fig. 2(a)–(g) regresses the linear trend of the plots of y_z^s. From the slope of the lines, it is indicative that an increasing margin (2/||w||) is associated with a general decrease in the balanced absolute error. These results imply that an appropriate choice of source classifiers based on the large target margin criterion can minimize the balanced absolute error in the target domain. Further, the cluster assumption made in TOLL to minimize the target generalization absolute error (Theorem 4) by means of maximizing the target margin is valid. In addition, although the plots of y_z^s in Fig. 2(a)–(g) share a similar range of margin values, their balanced absolute errors can be observed to differ considerably. This highlights that, when no a priori knowledge on choosing the most suitable source domain is available, the ensemble strategy over source domains proposed in TOLL is important for robust prediction accuracy.

B. Experimental Result and Discussion on Sentiment Data Set
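All results reported in this section use the balanced absolute error of (24). As a concrete reference, a minimal Python sketch of the metric is given below, assuming integer label arrays y_pred and y_true in {1, ..., K}; the function name and the handling of empty classes are illustrative choices:

    import numpy as np

    def balanced_absolute_error(y_pred, y_true, K):
        # Balanced absolute error of (24): the per-class mean absolute error, averaged over the K classes.
        y_pred, y_true = np.asarray(y_pred), np.asarray(y_true)
        per_class = []
        for k in range(1, K + 1):
            mask = (y_true == k)
            if mask.any():                               # classes absent from the test set are skipped
                per_class.append(np.abs(y_pred[mask] - y_true[mask]).mean())
        return float(np.mean(per_class))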

Figs. 3 and 4 summarize the balanced absolute error of the target unlabeled data obtained on the sentiment prediction data set for K = 2 and K = 4, respectively. The three subfigures at the left are the results obtained on the target domain for a positive class ratio (TPCR) of 0.3, whereas the remaining three subfigures at the right present the results for TPCR of 0.5. Figs. 3(a) and (d), and 4(a) and (d) summarize the balanced absolute error of the target domain on the DVDs data set for varying degrees of SPCR in the source domain. On the other hand, the results for the case where the electronics and kitchen


Fig. 2. Margin versus absolute error for different source domains while having the kitchen appliances as the target domain of interest, with a setting of K = 4, TPCR = 0.5, and SPCR = 0.4. The points in each subfigure are the class labels y_z^s obtained by Algorithm 1, and each y_z^s has a respective margin obtained from solving min_{w,ρ,ξ} in (5). The x-axis is the margin 2/||w||, whereas the y-axis is the balanced absolute error. B, D, and E symbolize book, DVDs, and electronics, respectively. Please refer to the text for more details.

appliances data sets serve as the target domain of interest are shown in Figs. 3(b), (e) and 4(b), (e), and in Figs. 3(c), (f) and 4(c), (f), respectively. For the sake of conciseness, the experimental results with the book data set as the target domain are omitted from this paper, as trends similar to those of the other data sets are observed for the considered algorithms. In addition, as the results for TPCR = 0.7 are symmetrical to those for TPCR = 0.3, all other target domains for TPCR of 0.7 are also omitted.

As observed from Fig. 3, LG-MMC exhibits the worst balanced absolute error among all the methods under investigation. This shows that an unsupervised approach based on maximal margin separation of the unlabeled data, without any use of label information, is less effective than the DA methods, because of the abundance of labeled data from other related source domains that can be appropriately used to complement class predictions on the target unlabeled data. The results obtained thus confirm the effectiveness of DA methods on sentiment data in the absence of target label information.^9

^9 LG-MMC does not appear in Fig. 4 (K = 4) as it does not consider ordinal class labels.

It can be observed from the results in Figs. 3 and 4 that 1S-SVM underperforms MCC (i.e., 3S-SVM) and 2S-SVM in general. Nevertheless, when source domains are used at equal weights (i.e., 2S-SVM and MCC, respectively), the results in Figs. 3 and 4 show significant degradations because of the imbalanced class ratios between the source and target domains. When the SPCR setting approaches either extreme (i.e., 0.2 and 0.8) of the experimental study considered, the performances of most of the methods, except TOLL, degrade significantly. As both KMM and TCA operate by minimizing the marginal distribution differences between the target and source domains according to the MMD criterion [7], the degradations in performance thus show that the necessary assumption of similar predictive distributions between

source and target domains for KMM and TCA does not hold on the sentiment data. At the same time, these results also show that imbalanced class ratios between the source and target domains lead to source sample selection bias.

From Fig. 4, KMM and TCA are noted to have attained lower performance than the others in general. The former method operates by reweighting the source labeled data so as to match the marginal distributions of the target data, while assuming a common class distribution shared by the source and target domains. Hence, poor prediction results are observed when their class distributions are dissimilar. The latter method remaps the kernel space so as to minimize the distance between the source and target domains, such that samples of star ratings 1 and 2 are reconfigured to be closer together, and similarly for the samples of star ratings 4 and 5. This explains why the performance obtained by TCA, in Fig. 4, is noted to be poor on ordinal problems, while exhibiting rewarding results on the binary sentiment problem, as observed in Fig. 3.

Although the performances of the DA methods are observed to suffer from source sample selection bias because of the differing class ratios between the source and target domains, TOLL performs robustly across the range of SPCR and TPCR settings considered. TOLL also attains the lowest balanced absolute error, relative to all the other methods, for the extreme configurations of SPCR, i.e., 0.2 and 0.8, as observed from all the subfigures. This implies that TOLL is capable of choosing a robust linear combination of source label vectors that represent the label space of the target unlabeled data. It maximizes the margin of separation solely based on the target unlabeled data in the target label space that is spanned by label vectors generated from multiple independent source classifiers (i.e., the bias parameter of each source classifier is projected onto the target unlabeled data).


Fig. 3. Balanced absolute error for K = 2 on the sentiment data set. Left: TPCR = 0.3. Right: TPCR = 0.5. The x-axis shows the SPCR settings of the source domains and the y-axis is the balanced absolute error. Please refer to the text for more details.

Fig. 4. Balanced absolute error for K = 4 on the sentiment data set. Left: TPCR = 0.3. Right: TPCR = 0.5. The x-axis shows the SPCR settings of the source domains and the y-axis is the balanced absolute error. Please refer to the text for more details.

It is also worth highlighting that, although TCA outperforms all other algorithms at TPCR of 0.3 with SPCR of 0.4 on the sentiment (K = 2) data set [see Fig. 3(a)–(c)], the reported TCA results are chosen as the best among three results, each of which is obtained by applying TCA to a different source domain. The three results on different source domains, each trained using TCA with DVDs as the target domain, are shown in Fig. 5. Furthermore, the balanced absolute error of the SVM trained on each source domain, denoted as 1S-SVM, is also shown in Fig. 5. In practice, it is nontrivial to determine in advance which source domain is the most suitable for the target domain, especially in the absence of prior knowledge on the target domain. TOLL thus fills this gap by providing an ensemble of suitable source classifiers to attain improved predictive

performance in the target domain of interest. Therefore, in general, the results in Fig. 5 show that TCA performs much worse than TOLL at SPCR settings of 0.2, 0.6, and 0.8.

C. Newsgroup and Email Experimental Result Discussions

The results for the newsgroup data set are shown in Fig. 6. We can observe that LG-MMC achieves decent performance on the newsgroup data. In particular, LG-MMC reports an improved balanced absolute error over 1S-SVM, 2S-SVM, MCC, and KMM for SPCR of 0.2 and 0.8 in most of the subfigures illustrated. This implies that solely learning from target unlabeled data can sometimes be more beneficial than enlisting additional labeled samples from other source domains, especially when the target data are well separated


Fig. 5. DVDs as target domain in sentiment experiments for K = 2. Comparisons among 1S-SVM, TCA, and TOLL. 1S-SVM-X or TCA-X where X is the source domain being used to classify DVDs’ test data.

(cluster assumption). 1S-SVM also operates based on maximizing the margin of separation, but training is concentrated on the source domain, where source sample selection bias creeps in. On the other hand, KMM improves the results of 1S-SVM by means of the MMD criterion but still fares poorer than LG-MMC. In contrast, TCA and TOLL achieve significantly lower balanced absolute error than LG-MMC, 1S-SVM, and KMM. Overall, TOLL emerged as superior to all other methods in all experimental settings considered, except on the rec versus sci task with TPCR = 0.3. The details of the rec versus sci task with TPCR = 0.3 are shown in Fig. 7, and it is observed that TCA performs worse than TOLL if source 2 or 3 is considered in the training process for classifying the target unlabeled data. Therefore, the selection of appropriate source domains in TCA is an essential task that bears great impact on its effectiveness. In practice, it is, however, difficult to determine the most appropriate source domain for TCA beforehand. In general, the results in Fig. 6 show that TOLL displays high robustness and superior prediction accuracy throughout the entire range of SPCR and TPCR settings considered.

The results on the email data set are shown in Fig. 8. MCC, KMM, and TCA report the best accuracies when the SPCR is 0.2. The accuracies of MCC, KMM, and TCA, however, exhibit declining trends as the SPCR approaches 0.8. This observation is due to the task of detecting spam emails (positive samples) being easier than identifying nonspam emails (negative samples). On the other hand, as LG-MMC learns only from the target unlabeled data, it does not suffer from source sample selection bias, as observed in the figure when far fewer negative samples than positive samples are available (i.e., SPCR of 0.8). Nevertheless, LG-MMC is observed to exhibit poor accuracy across the entire range of SPCRs. It is worth mentioning that TCA reports the best accuracy among all methods at SPCR of 0.2. Therefore, minimizing the marginal distribution differences between the target and source domains, according to the MMD criterion, through finding the transfer components in the RKHS does help. Nevertheless, the approach still suffers performance degradations when the class distributions of the source and target domains differ. On the other hand, TOLL achieves better performance than all the methods considered for SPCR in the range 0.4–0.8 and displays robust results across the entire range of SPCR settings. Last

Fig. 6. Newsgroup experimental results. Left: TPCR = 0.3. Right: TPCR = 0.5. The x-axis shows the SPCR settings of the source domains and the y-axis is the balanced absolute error. Please refer to the text for more details.

but not least, we also performed a Wilcoxon signed-ranks test [54] on all the results in Figs. 3, 4, 6, and 8, which reports that TOLL is significantly better than the other baselines with 99% confidence.

D. Comparison of the Time Complexities of the State-of-the-Art Methods

In the following, we discuss the theoretical time complexities of the compared methods. 1S-SVM, 2S-SVM, and MCC use the SVM as the classifier; hence, they exhibit a time complexity of O(((K − 1)n)^{2.3}), which is taken as the empirical complexity of SVM training, where n is the number of source labeled data. KMM is solved using quadratic programming with a time complexity of O(n^3). TCA, on the other hand, is solved with an eigendecomposition and has a time complexity of O((n + u)^3). For TOLL, the


Fig. 7. Rec versus Sci in newsgroup experiments. Comparisons among 1S-SVM, TCA, and TOLL. 1S-SVM-X or TCA-X, where X is the source domain (see Table I) being used to classify the target test data.

Fig. 8. Spam email experimental results. The x-axis shows the SPCR settings of the source domains and the y-axis is the balanced absolute error. Please refer to the text for more details.

TABLE II: Training Time (s) of Various Methods on the Kitchen Appliances (K = 4) Data Set With TPCR = 0.5 and SPCR = 0.4

computational complexity is O(T J((K − 1)u)^{2.3}), where K is the number of classes, u is the number of target unlabeled data, and J and T are the numbers of iterations incurred by the cutting plane method and MKL, respectively. Thus, TOLL takes a factor of T J(u/n)^{2.3}, T J((K − 1)u)^{2.3}/n^3, and T J((K − 1)u)^{2.3}/(n + u)^3 over MCC, KMM, and TCA, respectively. Hence, when the product of J, T, K, and u is much greater than n, TOLL displays a higher computational complexity than the other methods. On the other hand, when an abundance of source data is available such that n ≫ u, the proposed TOLL is faster.

To verify the theoretical analysis, an experimental study was carried out to investigate the training times of the methods with kitchen appliances (K = 4) as the target domain. The training size of both 1S-SVM and KMM is 2000, whereas the training sizes of 2S-SVM, MCC, TCA, and TOLL are 4000, 6000, 2500, and 500, respectively. The training times (in seconds) of the aforementioned methods are detailed in Table II. As 1S-SVM, 2S-SVM, and MCC share the same computational complexity, the method with the most training samples is expected to have

a longer training time, as observed in Table II. Furthermore, as the amount of source labeled data increases, leading to a smaller ratio of target unlabeled data to source labeled data, the ratio of the training time of TOLL to that of the other methods is also expected to be smaller. Accordingly, the observations from Table II show that the ratio of the training time of TOLL to that of MCC is smaller than the ratios of TOLL to 2S-SVM and of TOLL to 1S-SVM. It is also worth mentioning that the training times of TCA, KMM, and 1S-SVM are consistent with the aforementioned theoretical computational complexities of those methods. Last but not least, TOLL takes the longest time to train a classifier. Nevertheless, TOLL is observed to be robust across the 81 tasks, as shown in Figs. 3, 4, 6, and 8.

VI. CONCLUSION

A core challenge of transfer learning in attaining a reliable classifier from relevant source domains is the induction of source sample selection bias, such that the eventual classifier often steers toward the distribution of the source domain. In addition, this bias is deemed to become more severe on data involving multiple classes. Considering this cue, we proposed a TOLL paradigm that predicts the ordinal labels of target unlabeled data by spanning the feasible solution space with ordinal classifiers from multiple relevant source domains. In contrast to previous works, the maximum margins between two consecutive ordinal classes were employed as the criterion for the selection and fusion of appropriate source ordinal classifiers when designing the target classifier. In this manner, the proposed approach learns a target ordinal classifier that involves only the kernel expansion of the target data. Through comprehensive experimental studies, TOLL was shown to display superiority and robustness across the entire range of imbalanced source and target class ratio settings when pitted against several state-of-the-art methods, whose prediction accuracies suffered significantly. Last but not least, TOLL is significantly better than all the compared methods over all the data sets considered in the experimental study, based on the Wilcoxon signed-ranks test [54] with 99% confidence.

REFERENCES

[1] P. Wu and T. G. Dietterich, "Improving SVM accuracy by training on auxiliary data sources," in Proc. ICML, 2004, pp. 871–878.
[2] H. Shimodaira, "Improving predictive inference under covariate shift by weighting the log-likelihood function," J. Stat. Planning Infer., vol. 90, no. 2, pp. 227–244, Oct. 2000.
[3] M. Sugiyama and K.-R. Müller, "Input-dependent estimation of generalization error under covariate shift," Stat. Decision, vol. 23, no. 4, pp. 249–279, 2005.
[4] A. J. Storkey and M. Sugiyama, "Mixture regression for covariate shift," in Proc. NIPS, 2006, pp. 1337–1344.
[5] S. Bickel, M. Brückner, and T. Scheffer, "Discriminative learning under covariate shift," J. Mach. Learn. Res., vol. 10, no. 10, pp. 2137–2155, 2009.
[6] X. Liao, Y. Xue, and L. Carin, "Logistic regression with an auxiliary data source," in Proc. ICML, 2005, pp. 505–512.
[7] J. Huang, A. Smola, A. Gretton, K. M. Borgwardt, and B. Schölkopf, "Correcting sample selection bias by unlabeled data," in Proc. NIPS, 2006, pp. 601–608.
[8] S. Bickel, M. Brückner, and T. Scheffer, "Discriminative learning for differing training and test distributions," in Proc. ICML, 2007, pp. 81–88.

REFERENCES

[1] P. Wu and T. G. Dietterich, “Improving SVM accuracy by training on auxiliary data sources,” in Proc. ICML, 2004, pp. 871–878.
[2] H. Shimodaira, “Improving predictive inference under covariate shift by weighting the log-likelihood function,” J. Stat. Planning Infer., vol. 90, no. 2, pp. 227–244, Oct. 2000.
[3] M. Sugiyama and K.-R. Müller, “Input-dependent estimation of generalization error under covariate shift,” Stat. Decisions, vol. 23, no. 4, pp. 249–279, 2005.
[4] A. J. Storkey and M. Sugiyama, “Mixture regression for covariate shift,” in Proc. NIPS, 2006, pp. 1337–1344.
[5] S. Bickel, M. Brückner, and T. Scheffer, “Discriminative learning under covariate shift,” J. Mach. Learn. Res., vol. 10, no. 10, pp. 2137–2155, 2009.
[6] X. Liao, Y. Xue, and L. Carin, “Logistic regression with an auxiliary data source,” in Proc. ICML, 2005, pp. 505–512.
[7] J. Huang, A. Smola, A. Gretton, K. M. Borgwardt, and B. Schölkopf, “Correcting sample selection bias by unlabeled data,” in Proc. NIPS, 2006, pp. 601–608.
[8] S. Bickel, M. Brückner, and T. Scheffer, “Discriminative learning for differing training and test distributions,” in Proc. ICML, 2007, pp. 81–88.
[9] M. Sugiyama, M. Krauledat, and K.-R. Müller, “Covariate shift adaptation by importance weighted cross validation,” J. Mach. Learn. Res., vol. 8, no. 5, pp. 985–1005, 2007.
[10] J. Jiang and C. Zhai, “Instance weighting for domain adaptation in NLP,” in Proc. ACL, 2007, pp. 264–271.
[11] A. Gretton, K. Borgwardt, M. Rasch, B. Schölkopf, and A. Smola, “A kernel method for the two-sample-problem,” in Proc. NIPS, 2007, pp. 513–520.
[12] M. Sugiyama, T. Suzuki, and T. Kanamori, “Density ratio matching under the Bregman divergence: A unified framework of density ratio estimation,” Ann. Inst. Stat. Math., vol. 11, pp. 1–36, Nov. 2011.
[13] J. Blitzer, R. McDonald, and F. Pereira, “Domain adaptation with structural correspondence learning,” in Proc. EMNLP, 2006, pp. 120–128.
[14] J. Blitzer, M. Dredze, and F. Pereira, “Biographies, Bollywood, boom-boxes and blenders: Domain adaptation for sentiment classification,” in Proc. CEMNLP, 2007, pp. 187–205.
[15] H. Daumé, “Frustratingly easy domain adaptation,” in Proc. ACL, 2007, pp. 256–263.
[16] W. Dai, O. Jin, G.-R. Xue, Q. Yang, and Y. Yu, “Eigentransfer: A unified framework for transfer learning,” in Proc. ICML, 2009, pp. 193–200.
[17] E. Zhong, W. Fan, J. Peng, K. Zhang, J. Ren, D. Turaga, and O. Verscheure, “Cross domain distribution adaptation via kernel mapping,” in Proc. 15th Int. Conf. KDD, 2009, pp. 1027–1036.
[18] S. J. Pan, I. Tsang, J. Kwok, and Q. Yang, “Domain adaptation via transfer component analysis,” IEEE Trans. Neural Netw., vol. 22, no. 2, pp. 199–210, Feb. 2011.
[19] S. J. Pan, X. Ni, J.-T. Sun, Q. Yang, and Z. Chen, “Cross-domain sentiment classification via spectral feature alignment,” in Proc. 19th Int. Conf. WWW, 2010, pp. 751–760.
[20] G. Schweikert, C. Widmer, B. Schölkopf, and G. Rätsch, “An empirical analysis of domain adaptation algorithms for genomic sequence analysis,” in Proc. NIPS, 2009, pp. 1433–1440.
[21] J. Kittler, M. Hatef, R. P. W. Duin, and J. Matas, “On combining classifiers,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 20, no. 3, pp. 226–239, Mar. 1998.
[22] X. Shi, Q. Liu, W. Fan, Q. Yang, and P. S. Yu, “Predictive modeling with heterogeneous sources,” in Proc. SDM, 2010, pp. 814–825.
[23] L. Duan, D. Xu, and I. W.-H. Tsang, “Domain adaptation from multiple sources: A domain-dependent regularization approach,” IEEE Trans. Neural Netw. Learn. Syst., vol. 23, no. 3, pp. 504–518, Mar. 2012.
[24] R. Herbrich, T. Graepel, and K. Obermayer, “Support vector learning for ordinal regression,” in Proc. ICANN, Sep. 1999, pp. 97–102.
[25] W. Chu and S. S. Keerthi, “New approaches to support vector ordinal regression,” in Proc. ICML, 2005, pp. 145–152.
[26] J. J. Heckman, “Sample selection bias as a specification error,” Econometrica, vol. 47, no. 1, pp. 153–161, Jan. 1979.
[27] F. Vella, “Estimating models with sample selection bias: A survey,” J. Human Resour., vol. 33, no. 1, pp. 127–169, 1998.
[28] J. Quiñonero-Candela, M. Sugiyama, A. Schwaighofer, and N. D. Lawrence, Dataset Shift in Machine Learning. Cambridge, MA, USA: MIT Press, 2009.
[29] L. Xu, J. Neufeld, B. Larson, and D. Schuurmans, “Maximum margin clustering,” in Proc. NIPS, 2005, pp. 1537–1544.
[30] W.-S. Zheng, S. Gong, and T. Xiang, “Quantifying and transferring contextual information in object detection,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 34, no. 4, pp. 762–777, Apr. 2012.
[31] Y.-F. Li, I. W. Tsang, J. T. Kwok, and Z.-H. Zhou, “Tighter and convex maximum margin clustering,” in Proc. AISTATS, 2009, pp. 344–351.
[32] J. J. Lim, R. Salakhutdinov, and A. Torralba, “Transfer learning by borrowing examples for multiclass object detection,” in Proc. NIPS, 2011, pp. 118–126.
[33] A. Ahmed, K. Yu, W. Xu, Y. Gong, and E. Xing, “Training hierarchical feed-forward visual recognition models using transfer learning from pseudo-tasks,” in Proc. ECCV, 2008, pp. 69–82.
[34] A. Farhadi, D. A. Forsyth, and R. White, “Transfer learning in sign language,” in Proc. CVPR, Jun. 2007, pp. 1–8.
[35] T. Joachims, “Transductive inference for text classification using support vector machines,” in Proc. 16th ICML, 1999, pp. 200–209.
[36] C.-W. Seah, I. W. Tsang, and Y.-S. Ong, “Transductive ordinal regression,” IEEE Trans. Neural Netw. Learn. Syst., vol. 23, no. 7, pp. 1074–1086, Jul. 2012.
[37] C.-W. Seah, I. W. Tsang, and Y.-S. Ong, “Healing sample selection bias by source classifier selection,” in Proc. 11th ICDM, 2011, pp. 577–586.
[38] L. Li and H.-T. Lin, “Ordinal regression by extended binary classification,” in Proc. NIPS, 2006, pp. 865–872.
[39] J. S. Cardoso and J. F. Pinto da Costa, “Learning to classify ordinal data: The data replication method,” J. Mach. Learn. Res., vol. 8, no. 12, pp. 1393–1429, 2007.
[40] P. A. Gutiérrez, M. Pérez-Ortiz, F. Fernández-Navarro, J. Sánchez-Monedero, and C. Hervás-Martínez, “An experimental study of different ordinal regression methods and measures,” in Proc. 7th Int. Conf. HAIS, Mar. 2012, pp. 296–307.
[41] C. E. Rasmussen and C. Williams, Gaussian Processes for Machine Learning. Cambridge, MA, USA: MIT Press, 2006.
[42] S. Boyd and L. Vandenberghe, Convex Optimization. Cambridge, U.K.: Cambridge Univ. Press, 2004.
[43] S.-J. Kim and S. Boyd, “A minimax theorem with applications to machine learning, signal processing, and finance,” SIAM J. Optim., vol. 19, no. 3, pp. 1344–1367, Nov. 2008.
[44] G. Lanckriet, N. Cristianini, P. Bartlett, and L. E. Ghaoui, “Learning the kernel matrix with semidefinite programming,” J. Mach. Learn. Res., vol. 5, no. 1, pp. 27–72, Jan. 2004.
[45] Z. Xu, R. Jin, H. Yang, I. King, and M. R. Lyu, “Simple and efficient multiple kernel learning by group lasso,” in Proc. ICML, 2010, pp. 1175–1182.
[46] J. E. Kelley, “The cutting-plane method for solving convex programs,” J. Soc. Ind. Appl. Math., vol. 8, no. 4, pp. 703–712, Dec. 1960.
[47] J. Blitzer, K. Crammer, A. Kulesza, F. Pereira, and J. Wortman, “Learning bounds for domain adaptation,” in Proc. NIPS, 2007, pp. 129–136.
[48] H.-T. Lin and L. Li, “Reduction from cost-sensitive ordinal ranking to weighted binary classification,” Neural Comput., vol. 24, no. 5, pp. 1329–1367, 2012.
[49] C.-W. Seah, I. W. Tsang, Y.-S. Ong, and K.-K. Lee, “Predictive distribution matching SVM for multi-domain learning,” in Proc. ECML PKDD, Sep. 2010, pp. 231–247.
[50] L. Bruzzone and M. Marconcini, “Domain adaptation problems: A DASVM classification technique and a circular validation strategy,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 32, no. 5, pp. 770–787, May 2010.
[51] J. Jiang and C. Zhai, “A two-stage approach to domain adaptation for statistical classifiers,” in Proc. 16th CIKM, 2007, pp. 401–410.
[52] N. V. Chawla, N. Japkowicz, and A. Kotcz, “Editorial: Special issue on learning from imbalanced data sets,” SIGKDD Explorations Newslett., vol. 6, no. 1, pp. 1–6, Jun. 2004.
[53] M. Sokolova, N. Japkowicz, and S. Szpakowicz, “Beyond accuracy, F-score and ROC: A family of discriminant measures for performance evaluation,” in Proc. 19th Austral. Joint Conf. Artif. Intell., Dec. 2006, pp. 1015–1021.
[54] J. Demšar, “Statistical comparisons of classifiers over multiple data sets,” J. Mach. Learn. Res., vol. 7, no. 12, pp. 1–30, Jan. 2006.

Chun-Wei Seah received the B.Eng. (Hons.) and Ph.D. degrees in computer science from the School of Computer Engineering, Nanyang Technological University, Singapore, in 2009 and 2013, respectively. He is currently a Senior Member of Technical Staff with the Defence Science Organisation National Laboratories. His current research interests include transductive learning, transfer learning, rank learning, and sentiment prediction. Dr. Seah was a recipient of the Prestigious Nanyang President’s Graduate Scholarship in 2009.


Ivor W. Tsang received the Ph.D. degree in computer science from the Hong Kong University of Science and Technology, Kowloon, Hong Kong, in 2007. He is currently an Assistant Professor with the School of Computer Engineering, Nanyang Technological University (NTU), Singapore. He is the Deputy Director of the Center for Computational Intelligence, NTU. Dr. Tsang received the Prestigious IEEE TRANSACTIONS ON NEURAL NETWORKS Outstanding 2004 Paper Award in 2006 and the 2008 National Natural Science Award (Class II), China, in 2009. He received the Best Student Paper Award from the 23rd IEEE Conference on Computer Vision and Pattern Recognition in 2010, the Best Paper Award at the 23rd IEEE International Conference on Tools with Artificial Intelligence in 2011, the 2011 Best Student Paper Award from PREMIA, Singapore, in 2012, and the Best Paper Award from the IEEE Hong Kong Chapter of Signal Processing Postgraduate Forum in 2006. He was conferred with the Microsoft Fellowship in 2005.

Yew-Soon Ong received the B.S. and M.S. degrees in electrical and electronics engineering from Nanyang Technological University (NTU), Singapore, in 1998 and 1999, respectively, and the Ph.D. degree in artificial intelligence in complex design from the Computational Engineering and Design Center, University of Southampton, Southampton, U.K., in 2002. He is currently an Associate Professor and the Director of the Center for Computational Intelligence, School of Computer Engineering, NTU. His current research interests include computational intelligence spanning memetic computing, evolutionary design, machine learning, agent-based systems, and cloud computing. Dr. Ong is the founding Technical Editor-in-Chief of the Memetic Computing Journal, a Chief Editor of the Springer book series on Studies in Adaptation, Learning, and Optimization, and an Associate Editor of the IEEE COMPUTATIONAL INTELLIGENCE MAGAZINE, the IEEE TRANSACTIONS ON SYSTEMS, MAN AND CYBERNETICS PART B, Soft Computing, Information Sciences, and the International Journal of System Sciences. He also chairs the IEEE Computational Intelligence Society Emergent Technology Technical Committee and has served as a Guest Editor of several journals.
