IEEE TRANSACTIONS ON NEURAL NETWORKS AND LEARNING SYSTEMS, VOL. 26, NO. 12, DECEMBER 2015


A Divide-and-Conquer Method for Scalable Robust Multitask Learning

Yan Pan, Rongkai Xia, Jian Yin, and Ning Liu

Abstract— Multitask learning (MTL) aims at improving the generalization performance of multiple tasks by exploiting the shared factors among them. An important line of research in MTL is the robust MTL (RMTL) methods, which use trace-norm regularization to capture task relatedness via a low-rank structure. The existing algorithms for the RMTL optimization problems rely on the accelerated proximal gradient (APG) scheme, which needs repeated full singular value decomposition (SVD) operations. However, the time complexity of a full SVD is O(min(md^2, m^2 d)) for an RMTL problem with m tasks and d features, which becomes unaffordable in real-world MTL applications that often have a large number of tasks and high-dimensional features. In this paper, we propose a scalable solution for large-scale RMTL, with either the least squares loss or the squared hinge loss, by a divide-and-conquer method. The proposed method divides the original RMTL problem into several size-reduced subproblems, solves these cheaper subproblems in parallel by any base algorithm (e.g., APG) for RMTL, and then combines the results to obtain the final solution. Our theoretical analysis indicates that, with high probability, the recovery errors of the proposed divide-and-conquer algorithm are bounded by those of the base algorithm. Furthermore, in order to solve the subproblems with the least squares loss or the squared hinge loss, we propose two efficient base algorithms based on the linearized alternating direction method, respectively. Experimental results demonstrate that, with little loss of accuracy, our method is substantially faster than the state-of-the-art APG algorithms for RMTL.

Index Terms— Divide-and-conquer method, linearized alternating direction method (LADM), low-rank matrices, multitask learning (MTL).

I. INTRODUCTION

MULTITASK learning (MTL) aims at improving the generalization performance of multiple tasks by utilizing the shared information among them. MTL has proved its success in various applications, such as natural language processing [1], handwritten character recognition [2], [3], and medical diagnosis [4]. Many MTL methods have been proposed in [1], [2], and [5]–[14].

Manuscript received May 29, 2013; revised June 25, 2014, October 21, 2014, and February 13, 2015; accepted February 21, 2015. Date of publication March 10, 2015; date of current version November 16, 2015. This work was supported in part by the Natural Science Foundation of Guangdong Province, China, under Grant S2013010011905, and in part by the National Natural Science Foundation of China under Grant 61370021, Grant U1401256, Grant 61472453, and Grant 61472455. (Corresponding author: N. Liu.)
Y. Pan and N. Liu are with the School of Software, Sun Yat-sen University, Guangzhou 510275, China (e-mail: [email protected]; [email protected]).
R. Xia and J. Yin are with the School of Information Science and Technology, Sun Yat-sen University, Guangzhou 510275, China (e-mail: [email protected]; [email protected]).
Color versions of one or more of the figures in this paper are available online at http://ieeexplore.ieee.org.
Digital Object Identifier 10.1109/TNNLS.2015.2406759

A notable research line in MTL is the recently proposed robust MTL (RMTL) method [13], [14]. The model parameters in RMTL can be decomposed into a low-rank matrix and a sparse matrix, where the low-rank matrix captures the underlying shared factors among all tasks and the sparse matrix identifies task specificity (e.g., sparse discriminative features [13] or outlier tasks [14]). Trace-norm regularization is used in RMTL as a convex surrogate of the nonconvex rank constraint, resulting in tractable optimization problems. The accelerated proximal gradient (APG) scheme is the state-of-the-art algorithm to solve the optimization problems of RMTL due to its guaranteed O(T^{-2}) convergence rate, where T is the iteration number. However, APG solves the trace-norm regularized RMTL optimization problems via repeated full singular value decomposition (SVD) operations, which limits its power in real-world applications with a large number of tasks and high-dimensional features. More specifically, for an MTL problem with m tasks and d features, the time complexity of a full SVD operation is O(min(m^2 d, d^2 m)), which scales poorly when m and d are large.

In contrast to most of the existing RMTL methods, which focus on RMTL with the least squares loss, we investigate RMTL with either the least squares loss or the squared hinge loss, and we propose a scalable method for large-scale RMTL. Our method has two building blocks. First, inspired by the recent advances in divide-and-conquer matrix factorization [15], [16], we propose a divide-and-conquer solution to RMTL. As shown in Fig. 1, our method randomly divides the original RMTL problem into several size-reduced subproblems, solves the subproblems (at much cheaper cost) in parallel by any base algorithm for RMTL (e.g., APG), and then combines the results from the subproblems to obtain the final solution. We provide a theoretical result indicating that, with overwhelming probability, the proposed divide-and-conquer algorithm has recovery guarantees comparable with those of its base algorithm. Second, to further improve the scalability, we propose to solve each subproblem (with either the least squares loss or the squared hinge loss) via efficient base algorithms based on the linearized alternating direction method (LADM) [17] scheme. Extensive experiments on synthetic as well as real-world data sets show that, with little loss of accuracy (i.e., no more than 1.22% in absolute terms in all our experiments), the proposed divide-and-conquer method substantially improves the efficiency of solving RMTL problems. As an example, on the EUR-Lex data set with ~400 tasks and 5000 features, the proposed method with the least squares loss performs more than 77× faster than the state-of-the-art APG method for RMTL.



Fig. 1. Proposed divide-and-conquer framework for RMTL. The original problem is first divided into several size-reduced subproblems (the detailed procedure can be found in Algorithm 2), and then the subproblems are solved in parallel. After that, the results from the subproblems are combined to get a matrix $\breve{L}$ and a column-sparse matrix E. Finally, $\breve{L}$ is projected onto an orthogonal subspace to get its accurate low-rank approximation $L^*$.

II. RELATED WORK

The existing MTL methods can be roughly divided into three main categories: 1) parameter sharing; 2) common feature sharing; and 3) low-rank subspace sharing.

In the methods with parameter sharing, all tasks are assumed to explicitly share some common parameters. Representative methods in this category include shared weight vectors [6], hidden units in neural networks [5], and common priors in hierarchical Bayesian models [7], [8]. In the methods with common feature sharing, task relatedness is modeled by enforcing all tasks to share a common set of features [2], [9]–[12], [18], [19]. Representative examples are the methods that constrain the model parameters (e.g., a weight matrix) of all tasks to have certain sparsity patterns, for example, cardinality sparsity [2], group sparsity [9], [11], and clustered structure [10], [12]. The methods in the third category assume that all tasks lie in a shared low-rank subspace [1], [13], [14]. An important line of research in this category is the RMTL methods [13], [14]. The existing optimization algorithms in the RMTL methods rely on the APG [20] scheme, in which repeated full SVD operations are needed. This makes the existing RMTL methods scale poorly on large-scale data.

Our work is motivated by the recent advances in efficient algorithms for large-scale low-rank matrix learning [15], [16], [21], [22]. Mackey et al. [15] proposed a divide-and-conquer algorithm for robust principal component analysis (RPCA) [23] problems. Pan et al. [16] proposed a divide-and-conquer method for scalable low-rank latent matrix pursuit in robust late fusion. While our method shares similar features with [15], the RMTL problem is considerably more challenging than RPCA due to the fact that the low-rank matrix is multiplied by a data matrix [i.e., the matrix A in (4)].

Notations: For a matrix $M \in \mathbb{R}^{d \times m}$, we denote $M_{.j} \in \mathbb{R}^{d}$ as the $j$th column of $M$, and $M_{i,j}$ as the $(i,j)$th entry of $M$. Let $\|M\|_{1,2} = \sum_{j=1}^{m} \|M_{.j}\|_2$ be the $\ell_{1,2}$-norm of $M$, where $\|M_{.j}\|_2 = (\sum_{i=1}^{d} M_{i,j}^2)^{1/2}$ represents the $\ell_2$-norm of $M_{.j}$. The trace norm of $M$ is defined by $\|M\|_* = \sum_{k=1}^{\mathrm{rank}(M)} \sigma_k(M)$, where $\{\sigma_k(M)\}_{k=1}^{\mathrm{rank}(M)}$ are the nonzero singular values of $M$ and $\mathrm{rank}(M)$ is the rank of $M$.
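As a quick illustration of these two norms (a minimal sketch added for this edit, not part of the original paper; the function names are ours), they can be computed with NumPy as follows.

```python
import numpy as np

def l12_norm(M):
    # l_{1,2}-norm: sum of the l2-norms of the columns of M
    return float(np.sum(np.linalg.norm(M, axis=0)))

def trace_norm(M):
    # trace norm: sum of the singular values of M
    return float(np.sum(np.linalg.svd(M, compute_uv=False)))
```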

III. FORMULATIONS OF RMTL

In MTL, we are given $m$ learning tasks. The $i$th ($i = 1, \ldots, m$) task is associated with a training set $\{(X^{(i)}, y^{(i)})\}$, where $X^{(i)} \in \mathbb{R}^{n_i \times d}$ denotes the data matrix with each row being a sample, $y^{(i)} \in \mathbb{R}^{n_i}$ denotes the target column (i.e., labels), $d$ is the feature dimensionality, and $n_i$ is the number of samples for the $i$th task. The goal of (linear) MTL is to simultaneously learn $m$ linear predictors $w^{(i)}$ ($i = 1, \ldots, m$) to minimize some loss function $\ell(y^{(i)}, X^{(i)} w^{(i)})$, where $w^{(i)} \in \mathbb{R}^{d}$ is a column vector. In this paper, we focus on MTL with the least squares loss (i.e., $\|y^{(i)} - X^{(i)} w^{(i)}\|_F^2$) and MTL with the squared hinge loss¹ (i.e., $\|\max(0, 1 - y^{(i)} \odot X^{(i)} w^{(i)})\|_2^2$).

RMTL is an important line of research in MTL. In this paper, we focus on the RMTL formulation in [14],² which decomposes the model parameter matrix into a low-rank matrix and a column-sparse matrix. More specifically, if we denote the model parameter matrix of the $m$ tasks by $W = [w^{(1)}, \ldots, w^{(m)}] \in \mathbb{R}^{d \times m}$, then we have

$$W = L + E \quad (1)$$

where $L$ is a low-rank matrix that captures the shared factors among multiple tasks, and $E$ is a column-sparse matrix that identifies the outlier (irrelevant) tasks. With the low-rank and column-sparse assumptions, RMTL with the least squares loss is formulated as

$$\min_{L, E} \; \|L\|_* + \lambda \|E\|_{1,2} \quad \text{s.t.} \quad y^{(i)} = X^{(i)} (L_{.i} + E_{.i}), \; i = 1, \ldots, m \quad (2)$$

where the trace norm $\|L\|_*$ is a well-known convex surrogate of $\mathrm{rank}(L)$, which induces the desirable low-rank structure in $L$, and the $\ell_{1,2}$-norm $\|E\|_{1,2}$ encourages column-sparsity in $E$. Similarly, RMTL with the squared hinge loss is formulated as

$$\min_{L, E, B^{(i)}} \; \|L\|_* + \lambda \|E\|_{1,2} + \beta \sum_{i=1}^{m} \|B^{(i)}\|_+^2 \quad \text{s.t.} \quad B^{(i)} = 1_{n_i \times 1} - y^{(i)} \odot X^{(i)} (L_{.i} + E_{.i}), \; i = 1, \ldots, m \quad (3)$$

where $\|B^{(i)}\|_+^2 = \sum_{j=1}^{n_i} \max(0, B^{(i)}_j)^2$, and $1_{n_i \times 1}$ is a vector of $n_i$ ones.
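To make the two per-task losses concrete, the following is a minimal sketch (our own illustration, with hypothetical function names) of how each loss is evaluated for a single task, with labels in {-1, +1} in the hinge case; the squared hinge value matches the $\|B^{(i)}\|_+^2$ term above.

```python
import numpy as np

def least_squares_loss(X, y, w):
    # ||y - X w||^2 for one task
    r = y - X @ w
    return float(r @ r)

def squared_hinge_loss(X, y, w):
    # sum_j max(0, 1 - y_j * (X w)_j)^2 for one task, with y in {-1, +1}
    b = np.maximum(0.0, 1.0 - y * (X @ w))
    return float(b @ b)
```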

¹Our method can be easily extended to the case with the hinge loss, i.e., $\max(0, 1 - y^{(i)} \odot X^{(i)} w^{(i)})$. Note that the APG method, one of the representative baselines used in this paper, requires the objective function to be smooth. Hence, for a fair comparison, we only report the results of RMTL with the smooth squared hinge loss, and omit the results with the nonsmooth hinge loss.
²The proposed approach can be easily applied to other RMTL formulations such as [13].


In order to facilitate the subsequent analysis, we rewrite (2) in an equivalent compact form

$$\min_{L, E} \; \|L\|_* + \lambda \|E\|_{1,2} \quad \text{s.t.} \quad D \odot Y = D \odot (A(L + E)) \quad (4)$$

where $A = [(X^{(1)})^T, \ldots, (X^{(m)})^T]^T \in \mathbb{R}^{N \times d}$, $N = \sum_{i=1}^{m} n_i$, and $n_i$ is the number of samples for the $i$th task. $Y \in \mathbb{R}^{N \times m}$ is a matrix that has identical columns, i.e., $Y_{.i} = [(y^{(1)})^T, \ldots, (y^{(m)})^T]^T$ ($i = 1, \ldots, m$). $D$ is an indicator matrix of the same size as $Y$, namely, $D = [(D^{(1)})^T, \ldots, (D^{(m)})^T]^T$, where for $i = 1, \ldots, m$, $D^{(i)} \in \mathbb{R}^{n_i \times m}$, the $i$th column of $D^{(i)}$ is all ones and the other columns of $D^{(i)}$ are all zeros. The operator $\odot$ denotes element-wise (Hadamard) multiplication.

Similarly, we rewrite (3) in its compact form

$$\min_{L, E, M} \; \|L\|_* + \lambda \|E\|_{1,2} + \beta \|M\|_+^2 \quad \text{s.t.} \quad M = D - D \odot Y \odot (A(L + E)) \quad (5)$$

where $D$, $Y$, $A$, and $\odot$ are the same as in (4). $M \in \mathbb{R}^{N \times m}$ is a matrix of the same size as $Y$. Let $B \in \mathbb{R}^{N \times m}$ be the matrix with identical columns $B_{.i} = [(B^{(1)})^T, \ldots, (B^{(m)})^T]^T$ ($i = 1, \ldots, m$). Then, we have $M = D \odot B$ and $\|M\|_+^2 = \sum_{i=1}^{m} \sum_{j=1}^{N} \max(0, M_{j,i})^2$. With the definition of $M$, it is easy to verify that $\|M\|_+^2 = \sum_{i=1}^{m} \|B^{(i)}\|_+^2$.
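The stacked matrices $A$, $Y$, and $D$ in (4) and (5) can be assembled from the per-task data as in the following sketch (our own illustration under the notation above; the function name is hypothetical).

```python
import numpy as np

def build_compact_matrices(X_list, y_list):
    """X_list[i]: (n_i, d) data matrix of task i; y_list[i]: (n_i,) targets."""
    m = len(X_list)
    A = np.vstack(X_list)                    # (N, d) stacked data matrix
    y_stacked = np.concatenate(y_list)       # (N,) all targets stacked
    Y = np.tile(y_stacked[:, None], (1, m))  # (N, m), identical columns
    D = np.zeros_like(Y)                     # indicator matrix
    row = 0
    for i, X in enumerate(X_list):
        D[row:row + X.shape[0], i] = 1.0     # rows of task i mark column i
        row += X.shape[0]
    return A, Y, D
```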

Algorithm 1 Random Partitioning Process
Input: Y ∈ R^{N×m}, m = t × l
for i = 1 to t do
  for j = 1 to l do
    Randomly select a column x from Y (without replacement).
    Set x as the j-th column of the i-th submatrix Y_i.
  end for
end for
Output: the l-column submatrices Y_1, Y_2, . . . , Y_t

Algorithm 2 Divide-and-Conquer Algorithm for RMTL (DC-RMTL)
Input: A ∈ R^{N×d}, Y ∈ R^{N×m}, D ∈ R^{N×m}, M ∈ R^{N×m} (if applicable), the number t of partitions
1. Randomly partition Y, D, M (if applicable) into t l-column submatrices, respectively (l = m/t).
2. do in parallel
     Use a base algorithm to solve the 1st subproblem in (6) or (7) to get L_1 and E_1.
     ...
     Use a base algorithm to solve the t-th subproblem in (6) or (7) to get L_t and E_t.
   end do
3. L* = U U^T L̆, where L̆ is the concatenation of L_1, L_2, . . . , L_t.
Output: L* ∈ R^{d×m}, E ∈ R^{d×m} (E is the concatenation of E_1, E_2, . . . , E_t)

IV. DIVIDE-AND-CONQUER SOLUTION

The existing optimization algorithms for RMTL [13], [14] rely on the family of APG [20] methods, which are well known to have a fast convergence rate. However, when solving a trace-norm regularized problem like (4) or (5), the APG methods need a full SVD operation on a $d \times m$ matrix to solve the proximal operator of the trace norm in each iteration, where the time complexity of a full SVD is O(min(m^2 d, d^2 m)). Hence, the repeated full SVD operations in APG are computationally expensive, which makes APG scale poorly on real-world MTL applications with large feature dimensionality (d) and a large number of tasks (m).

Inspired by the recent advances in divide-and-conquer methods [15] for the RPCA problems [23], we propose a scalable divide-and-conquer method for large-scale RMTL problems. As shown in Fig. 1, the proposed method divides the optimization problem in (4) or (5) into several size-reduced subproblems, solves the much cheaper subproblems in parallel by any base algorithm for RMTL (e.g., the APG algorithm, or the LADM algorithms in Section V), and then combines the results from the subproblems to get the low-rank matrix L and the column-sparse matrix E. Hereafter, we call the proposed divide-and-conquer method DC-RMTL. In particular, for the method with LADM/APG being its base algorithm, we call it LADM-DC-RMTL/APG-DC-RMTL, respectively. The sketch of the proposed DC-RMTL method is summarized in Algorithm 2. It has the following three main steps.

A. Partitioning the Matrices

We randomly partition the matrix $Y \in \mathbb{R}^{N \times m}$ into $t$ $l$-column submatrices³ $Y_1, \ldots, Y_t$. The detailed random partitioning process can be found in Algorithm 1. Without loss of generality, we assume $Y = [Y_1, Y_2, \ldots, Y_t]$. Correspondingly, we partition $D$, $L$, $E$, and $M$ into $D = [D_1, D_2, \ldots, D_t]$, $L = [L_1, L_2, \ldots, L_t]$, $E = [E_1, E_2, \ldots, E_t]$, and $M = [M_1, M_2, \ldots, M_t]$, respectively.

B. Solving the Subproblems in Parallel

We solve all the subproblems in parallel. In the case of RMTL with the least squares loss, for each subproblem with size-reduced $Y_i$, $D_i$, $L_i$, and $E_i$, we solve the following optimization problem:

$$\min_{L_i, E_i} \; \|L_i\|_* + \lambda \|E_i\|_{1,2} \quad \text{s.t.} \quad D_i \odot Y_i = D_i \odot (A(L_i + E_i)). \quad (6)$$

Obviously, this is the same optimization problem as (4), only with size-reduced matrices. Hence, it can be solved by any base algorithm for RMTL (e.g., APG [14], or the LADM algorithms in Section V).

³For the case m = l × t with some integer t, we partition Y into t l-column submatrices. For the case m = l × t + c with 1 < c < l, we partition Y into t l-column submatrices and a c-column submatrix. The identical setting is used to partition the other matrices (D, L, E). Here, we simply assume m = l × t for ease of presentation.


Similarly, in the case of RMTL with the squared hinge loss, for each subproblem with $Y_i$, $D_i$, $L_i$, $E_i$, and $M_i$, we solve the following problem:

$$\min_{L_i, E_i, M_i} \; \|L_i\|_* + \lambda \|E_i\|_{1,2} + \beta \|M_i\|_+^2 \quad \text{s.t.} \quad M_i = D_i - D_i \odot Y_i \odot (A(L_i + E_i)). \quad (7)$$

More importantly, in contrast to directly solving the original problem in (4) [or (5)], solving each subproblem in (6) [or (7)] is much cheaper when $l \ll \min(d, m)$: the cost of a full SVD operation drops from O(min(d^2 m, d m^2)) (with respect to $L \in \mathbb{R}^{d \times m}$) to O(d l^2) (with respect to $L_i \in \mathbb{R}^{d \times l}$). For instance, with d = 5000 and m = 400 (roughly the size of the EUR-Lex data set) and t = 10 partitions (l = 40), this is approximately a drop from 8 × 10^8 to 8 × 10^6 operations per SVD. In Section V, we will present efficient algorithms, based on the LADM scheme [17], to solve the subproblems with the least squares loss or the squared hinge loss.

C. Recovering L and E From the Submatrices

After obtaining $L_i$ and $E_i$ from the subproblems, we combine them and get $\breve{L} = [L_1, L_2, \ldots, L_t]$ and $E = [E_1, E_2, \ldots, E_t]$. Although each subproblem encourages the learned $L_i$ to be a low-rank matrix, the combined matrix $\breve{L} = [L_1, L_2, \ldots, L_t]$ is not necessarily low rank, as desired. Hence, we need a further step that approximates $\breve{L}$ by projecting it onto a low-rank subspace. Here, we use the column-projection method [24]. More specifically, letting $L_1 = U \Sigma V^T$ be the compact SVD of $L_1$, we construct $L^*$ by

$$L^* = U U^T \breve{L} \quad (8)$$

where $U U^T$ is the orthogonal projection onto the subspace spanned by the columns of $L_1$, whose rank is no more than $l$ ($l \ll \min(d, m)$). Hence, $L^* = U U^T \breve{L}$ has rank no more than $l$. Moreover, existing theoretical results on low-rank matrix approximation indicate that, with overwhelming probability, $L^* = U U^T \breve{L}$ is an accurate low-rank approximation of $\breve{L}$ as long as $l$ is not too small (Lemma 1).
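The following is a minimal end-to-end sketch of Algorithm 2 (our own illustration, not the authors' MATLAB code): it randomly partitions the columns, solves the subproblems with a user-supplied base solver, and applies the column projection in (8). The function names and the `base_solver` interface are our assumptions, and the subproblem solves are written serially here although in practice they run in parallel.

```python
import numpy as np

def dc_rmtl(A, Y, D, base_solver, t, rng=None):
    """Divide-and-conquer RMTL (sketch of Algorithm 2).

    base_solver(A, Y_i, D_i) -> (L_i, E_i) solves the size-reduced
    subproblem (6) or (7), e.g., by APG or the LADM base algorithms.
    """
    rng = np.random.default_rng(rng)
    N, m = Y.shape
    d = A.shape[1]
    perm = rng.permutation(m)            # random column partition (Algorithm 1)
    blocks = np.array_split(perm, t)

    L_cat = np.zeros((d, m))
    E = np.zeros((d, m))
    for cols in blocks:                  # parallel in practice; serial for brevity
        L_i, E_i = base_solver(A, Y[:, cols], D[:, cols])
        L_cat[:, cols] = L_i
        E[:, cols] = E_i

    # Column projection (8): project the concatenation onto the column space of L_1.
    L_1 = L_cat[:, blocks[0]]
    U, _, _ = np.linalg.svd(L_1, full_matrices=False)   # compact SVD of L_1
    L_star = U @ (U.T @ L_cat)
    return L_star, E
```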

V. EFFICIENT BASE ALGORITHMS FOR SUBPROBLEMS

The subproblem in (6) [or (7)] can be solved by any base algorithm for RMTL. The existing methods for RMTL rely on the APG scheme. To solve (6) [or (7)], APG needs to convert (6) [or (7)] into an approximate unconstrained optimization problem by moving the equality constraint to the objective function as a penalty. In this section, we propose two efficient algorithms to solve the subproblems in (6) and (7), respectively, based on the recently proposed LADM [17], which has shown a good balance between efficiency and accuracy in many matrix learning problems. In contrast to APG, which solves an approximate optimization problem of (6) [or (7)], the proposed base algorithms solve (6) [or (7)] exactly (in the sense of the equality constraint).

A. Subproblem With the Least Squares Loss

For the optimization problem in (6), the corresponding augmented Lagrange function is

$$\mathcal{L}(L_i, E_i, C, \mu) = \|L_i\|_* + \lambda \|E_i\|_{1,2} + \langle C, D_i \odot (A(L_i + E_i) - Y_i) \rangle + \frac{\mu}{2} \|D_i \odot (A(L_i + E_i) - Y_i)\|_F^2 \quad (9)$$

where $C$ is the Lagrange multiplier, $\langle \cdot, \cdot \rangle$ denotes the inner product, and $\mu > 0$ is the penalty parameter. Our algorithm updates the variables $L_i$, $E_i$, and $C$ alternately, by minimizing $\mathcal{L}$ with the other variables being fixed. Let $L_i^k$, $E_i^k$, and $C^k$ be the solution at the $k$th iteration. For ease of presentation, we define

$$P^k = D_i \odot \left( A E_i^k - Y_i + \frac{C^k}{\mu} \right), \qquad Q^k = D_i \odot \left( A L_i^{k+1} - Y_i + \frac{C^k}{\mu} \right).$$

Then, the iterative update scheme in our algorithm is as follows:

$$L_i^{k+1} \leftarrow \arg\min_{L_i} \|L_i\|_* + \frac{\mu}{2} \|D_i \odot (A L_i) + P^k\|_F^2 \quad (10a)$$
$$E_i^{k+1} \leftarrow \arg\min_{E_i} \lambda \|E_i\|_{1,2} + \frac{\mu}{2} \|D_i \odot (A E_i) + Q^k\|_F^2 \quad (10b)$$
$$C^{k+1} \leftarrow C^k + \mu \left( D_i \odot \left( A \left( L_i^{k+1} + E_i^{k+1} \right) - Y_i \right) \right). \quad (10c)$$

Neither of the optimization problems for updating $L_i$ and $E_i$ in (10a) and (10b) has a closed-form solution. To address this difficulty, we use the linearization technique described in [17]. More specifically, we approximate the objectives in (10a) and (10b) by linearizing their quadratic terms, respectively

$$\frac{1}{2} \|D_i \odot (A L_i) + P^k\|_F^2 \approx \frac{1}{2} \|D_i \odot (A L_i^k) + P^k\|_F^2 + \langle G_1^k, L_i - L_i^k \rangle + \frac{1}{2\tau} \|L_i - L_i^k\|_F^2 \quad (11a)$$
$$\frac{1}{2} \|D_i \odot (A E_i) + Q^k\|_F^2 \approx \frac{1}{2} \|D_i \odot (A E_i^k) + Q^k\|_F^2 + \langle G_2^k, E_i - E_i^k \rangle + \frac{1}{2\tau} \|E_i - E_i^k\|_F^2 \quad (11b)$$

where $\tau > 0$ is the proximal parameter, and

$$G_1^k = A^T \left( D_i \odot \left( D_i \odot (A L_i^k) + P^k \right) \right), \qquad G_2^k = A^T \left( D_i \odot \left( D_i \odot (A E_i^k) + Q^k \right) \right)$$

which are the gradients of the quadratic terms in (10a) and (10b), respectively. Plugging (11a) into (10a) and with simple algebra, we obtain the following approximation to (10a):

$$L_i^{k+1} \leftarrow \arg\min_{L_i} \|L_i\|_* + \frac{\mu}{2\tau} \left\| L_i - \left( L_i^k - \tau G_1^k \right) \right\|_F^2. \quad (12)$$

The problem in (12) has a closed-form solution given by the singular value thresholding method [25]. Concretely, letting $\hat{U}_L \hat{\Sigma}_L \hat{V}_L^T$ be the SVD of $(L_i^k - \tau G_1^k)$, the update rule for $L_i$ is as follows:

$$L_i^{k+1} \leftarrow \hat{U}_L \, S_{\tau/\mu}(\hat{\Sigma}_L) \, \hat{V}_L^T \quad (13)$$

where $S_\delta(X) = \max(X - \delta, 0) + \min(X + \delta, 0)$ is the shrinkage operator [26].
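As an illustration of the proximal step in (12)-(13) (a minimal sketch under the notation above, not the authors' implementation; the function name is ours), singular value thresholding can be written as:

```python
import numpy as np

def svt(Z, thresh):
    # Singular value thresholding: minimizer of ||L||_* + (1/(2*thresh)) * ||L - Z||_F^2
    U, s, Vt = np.linalg.svd(Z, full_matrices=False)
    s = np.maximum(s - thresh, 0.0)      # shrink the singular values
    return (U * s) @ Vt

# Update (13): L_i <- svt(L_i - tau * G1, tau / mu)
```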


Algorithm 3 Base Algorithm (LADM-RMTL) for RMTL With the Least Squares Loss
Input: A ∈ R^{N×d}, Y ∈ R^{N×m}, τ
Initialize: C^0 = 0, L^0 = 0, E^0 = 0, μ = 10^{-6}, ρ = 1.9, max_μ = 10^{10}, ε = 10^{-5}
repeat
  1. Update L via (13)
  2. Update E via (15)
  3. C^{k+1} = C^k + μ(D ⊙ (A(L^{k+1} + E^{k+1}) − Y))
  4. μ = min(ρμ, max_μ)
  5. k = k + 1
until ‖L^k − L^{k−1}‖_F / max(1, ‖L^{k−1}‖_F) ≤ ε and ‖E^k − E^{k−1}‖_F / max(1, ‖E^{k−1}‖_F) ≤ ε
Output: L ∈ R^{d×m}, E ∈ R^{d×m}

Algorithm 4 Base Algorithm (LADM-RMTL) for RMTL With the Squared Hinge Loss
Input: A ∈ R^{N×d}, Y ∈ R^{N×m}, τ
Initialize: C^0 = 0, L^0 = 0, E^0 = 0, M^0 = 0, μ = 10^{-6}, ρ = 1.9, max_μ = 10^{10}, ε = 10^{-5}
repeat
  1. Update L via (20)
  2. Update E via (21)
  3. Update M via (18)
  4. Update C via (19c)
  5. μ = min(ρμ, max_μ)
  6. k = k + 1
until ‖L^k − L^{k−1}‖_F / max(1, ‖L^{k−1}‖_F) ≤ ε and ‖E^k − E^{k−1}‖_F / max(1, ‖E^{k−1}‖_F) ≤ ε
Output: L ∈ R^{d×m}, E ∈ R^{d×m}
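The following is a minimal, self-contained sketch of Algorithm 3 in NumPy (our own illustration under stated assumptions, not the authors' MATLAB code): it performs the linearized updates (13) and (15) together with the dual and penalty updates and the stopping check of Algorithm 3. The helper and parameter names are ours.

```python
import numpy as np

def ladm_rmtl_ls(A, Y, D, lam, max_iter=500, eps=1e-5):
    """Sketch of Algorithm 3 (LADM base algorithm, least squares loss)."""
    N, d = A.shape
    m = Y.shape[1]
    L = np.zeros((d, m)); E = np.zeros((d, m)); C = np.zeros((N, m))
    mu, rho, mu_max = 1e-6, 1.9, 1e10
    tau = 1.0 / (1.02 * np.linalg.norm(A, 2) ** 2)   # tau = 1 / (1.02 * sigma(A)^2)

    for _ in range(max_iter):
        L_old, E_old = L.copy(), E.copy()

        # L update (13): SVT on the linearized point L - tau * G1.
        P = D * (A @ E - Y + C / mu)
        G1 = A.T @ (D * (D * (A @ L) + P))
        U, s, Vt = np.linalg.svd(L - tau * G1, full_matrices=False)
        L = (U * np.maximum(s - tau / mu, 0.0)) @ Vt

        # E update (15): column-wise l2 shrinkage on H = E - tau * G2.
        Q = D * (A @ L - Y + C / mu)
        G2 = A.T @ (D * (D * (A @ E) + Q))
        H = E - tau * G2
        norms = np.linalg.norm(H, axis=0)
        scale = np.maximum(1.0 - (lam * tau / mu) / np.maximum(norms, 1e-12), 0.0)
        E = H * scale

        # Dual and penalty updates.
        C = C + mu * (D * (A @ (L + E) - Y))
        mu = min(rho * mu, mu_max)

        # Stopping criterion: relative change of L and E, as in Algorithm 3.
        dl = np.linalg.norm(L - L_old) / max(1.0, np.linalg.norm(L_old))
        de = np.linalg.norm(E - E_old) / max(1.0, np.linalg.norm(E_old))
        if dl <= eps and de <= eps:
            break
    return L, E
```

With the regularization parameter fixed (e.g., via `functools.partial(ladm_rmtl_ls, lam=1.0)` applied to the last argument), this routine can serve as the `base_solver` in the divide-and-conquer sketch given earlier.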

Similarly, plugging (11b) into (10b), we have

$$E_i^{k+1} \leftarrow \arg\min_{E_i} \lambda \|E_i\|_{1,2} + \frac{\mu}{2\tau} \left\| E_i - \left( E_i^k - \tau G_2^k \right) \right\|_F^2 \quad (14)$$

which is also known to have a closed-form solution [14]. Concretely, we denote $H^k = E_i^k - \tau G_2^k$, and $(E_i^{k+1})_{.j}$ and $H_{.j}^k$ as the $j$th columns of $E_i^{k+1}$ and $H^k$, respectively. Then, for $j = 1, 2, \ldots, l$, the $j$th column of $E_i^{k+1}$ can be updated by

$$\left( E_i^{k+1} \right)_{.j} = \begin{cases} H_{.j}^k \left( 1 - \dfrac{\lambda \tau}{\mu \|H_{.j}^k\|_2} \right), & \|H_{.j}^k\|_2 > \dfrac{\lambda \tau}{\mu} \\ 0, & 0 \le \|H_{.j}^k\|_2 \le \dfrac{\lambda \tau}{\mu}. \end{cases} \quad (15)$$

Algorithm 3 shows the sketch of the proposed base algorithm for RMTL with the least squares loss, in which we follow [17] to set the convergence conditions and set $\tau = 1/(1.02 [\sigma(A)]^2)$ in (13) and (15), where $\sigma(A)$ denotes the maximum singular value of $A$.

B. Subproblem With the Squared Hinge Loss

We propose a novel algorithm for the RMTL subproblem with the squared hinge loss (Algorithm 4), which is also based on the LADM scheme. For the subproblem in (7), the corresponding augmented Lagrange function is formulated as

$$\mathcal{L}(L_i, E_i, M_i, C, \mu) = \|L_i\|_* + \lambda \|E_i\|_{1,2} + \beta \|M_i\|_+^2 + \langle C, M_i - D_i + D_i \odot Y_i \odot (A(L_i + E_i)) \rangle + \frac{\mu}{2} \|M_i - D_i + D_i \odot Y_i \odot (A(L_i + E_i))\|_F^2 \quad (16)$$

where $C$ denotes the Lagrange multiplier. Based on the LADM scheme, the proposed optimization procedure alternately updates one of $L_i$, $E_i$, $M_i$, and $C$, while keeping the other variables fixed. We define $L_i^k$, $E_i^k$, $M_i^k$, and $C^k$ to be the solution at the $k$th iteration. We also define $F_i = D_i \odot Y_i$.

We update $M_i$ by solving the following optimization problem:

$$\min_{M_i} \; \beta \|M_i\|_+^2 + \frac{\mu}{2} \left\| M_i + \frac{C^k}{\mu} - D_i + F_i \odot \left( A \left( L_i^k + E_i^k \right) \right) \right\|_F^2 \quad (17)$$

which has a closed-form solution

$$\left( M_i^{k+1} \right)_{j,k} = \begin{cases} (B_i)_{j,k}, & (B_i)_{j,k} \le 0 \\ \dfrac{\mu (B_i)_{j,k}}{2\beta + \mu}, & (B_i)_{j,k} > 0 \end{cases} \quad (18)$$

where $B_i = D_i - F_i \odot (A(L_i^k + E_i^k)) - C^k / \mu$.

We also define

$$P^k = M_i^{k+1} - D_i + F_i \odot \left( A E_i^k \right) + \frac{C^k}{\mu}, \qquad Q^k = M_i^{k+1} - D_i + F_i \odot \left( A L_i^{k+1} \right) + \frac{C^k}{\mu}.$$

Then, similar to the update scheme in (10a), (10b), and (10c), we have the following update rules for $L_i$, $E_i$, and $C$:

$$L_i^{k+1} \leftarrow \arg\min_{L_i} \|L_i\|_* + \frac{\mu}{2} \|F_i \odot (A L_i) + P^k\|_F^2 \quad (19a)$$
$$E_i^{k+1} \leftarrow \arg\min_{E_i} \lambda \|E_i\|_{1,2} + \frac{\mu}{2} \|F_i \odot (A E_i) + Q^k\|_F^2 \quad (19b)$$
$$C^{k+1} \leftarrow C^k + \mu \left( M_i^{k+1} - D_i + F_i \odot \left( A \left( L_i^{k+1} + E_i^{k+1} \right) \right) \right). \quad (19c)$$

Note that the rule in (19a) has the same form as that in (10a). Hence, by applying the same linearization trick used in solving (10a), we can obtain the final update rule for $L_i$

$$L_i^{k+1} \leftarrow \hat{U}_L \, S_{\tau/\mu}(\hat{\Sigma}_L) \, \hat{V}_L^T \quad (20)$$

where $\hat{U}_L \hat{\Sigma}_L \hat{V}_L^T$ is the SVD of $\left( L_i^k - \tau \left( A^T \left( F_i \odot \left( F_i \odot (A L_i^k) + P^k \right) \right) \right) \right)$. Similarly, (19b) has the same form as (10b). We can also apply the same linearization trick as used in solving (10b)


and get the final solution to (19b)

$$\left( E_i^{k+1} \right)_{.j} = \begin{cases} H_{.j}^k \left( 1 - \dfrac{\lambda \tau}{\mu \|H_{.j}^k\|_2} \right), & \|H_{.j}^k\|_2 > \dfrac{\lambda \tau}{\mu} \\ 0, & 0 \le \|H_{.j}^k\|_2 \le \dfrac{\lambda \tau}{\mu} \end{cases} \quad (21)$$

where $H^k = E_i^k - \tau \left( A^T \left( F_i \odot \left( F_i \odot (A E_i^k) + Q^k \right) \right) \right)$.

VI. THEORETICAL ANALYSIS

In this section, we provide a theoretical analysis to show that, with high probability, the recovery error of DC-RMTL is bounded by the recovery errors of its base algorithm. Our analysis is based on the following lemma.

Lemma 1 ([15, Corollary 6]): Given a matrix $L \in \mathbb{R}^{d \times m}$ and $L^0 \in \mathbb{R}^{d \times m}$ a rank-$r$ approximation of $L$, choose $l \ge O(r \log(m) \log(1/\delta)/\epsilon^2)$ and let $C \in \mathbb{R}^{d \times l}$ be a submatrix of $l$ columns of $L$ sampled uniformly without replacement. Then

$$\|L - C C^+ L\|_F \le (1 + \epsilon) \|L - L^0\|_F \quad (22)$$

with probability at least $1 - \delta$, where $C^+$ denotes the pseudoinverse of $C$ (i.e., if $C = U_C \Sigma_C V_C^T$ is the SVD of $C$, then $C^+ = V_C \Sigma_C^{-1} U_C^T$).

Our main result is the following theorem.

Theorem 1: Let $\breve{L} = [L_1, L_2, \ldots, L_t]$ be the combined matrix of the results from the subproblems solved by Algorithm 3 (or Algorithm 4), and let $L^*$ be the output low-rank matrix of Algorithm 2 with Algorithm 3 (or Algorithm 4) as the base algorithm. Let $L^0$ be the rank-$r$ ground truth of $L$ in the RMTL problem in (4). Assume $L^0 = [L_1^0, L_2^0, \ldots, L_t^0]$, where $L_i^0$ has column indices in one-to-one correspondence with those of $L_i$ ($i = 1, 2, \ldots, t$). Choose $l \ge O(r \log(m) \log(1/\delta)/\epsilon^2)$. Then, with probability at least $1 - \delta$, $L^*$ satisfies

$$\|L^* - L^0\|_F \le (2 + \epsilon) \sqrt{\sum_{i=1}^{t} \|L_i - L_i^0\|_F^2}.$$

Proof: By the triangle inequality, we have

$$\|L^* - L^0\|_F \le \|L^* - \breve{L}\|_F + \|\breve{L} - L^0\|_F. \quad (23)$$

Recall that $L_1 = U \Sigma V^T$ is the compact SVD of $L_1$. Then, the pseudoinverse of $L_1$ is $L_1^+ = V \Sigma^{-1} U^T$ and $L_1 L_1^+ = U U^T$. Hence, we have $L^* - \breve{L} = U U^T \breve{L} - \breve{L} = L_1 L_1^+ \breve{L} - \breve{L}$. With the assumption on $l$, by applying Lemma 1 (with $C = L_1$), we have

$$\|L^* - \breve{L}\|_F = \|\breve{L} - L_1 L_1^+ \breve{L}\|_F \le (1 + \epsilon) \|\breve{L} - L^0\|_F \quad (24)$$

with probability at least $1 - \delta$. By plugging (24) into (23), we have

$$\|L^* - L^0\|_F \le (2 + \epsilon) \|\breve{L} - L^0\|_F = (2 + \epsilon) \sqrt{\sum_{i=1}^{t} \|L_i - L_i^0\|_F^2}$$

with probability at least $1 - \delta$.

The theorem implies that, if the base algorithm has small recovery errors (i.e., $\|L_i - L_i^0\|_F \to 0$), then with

overwhelming probability the divide-and-conquer DC-RMTL also has a small recovery error (i.e., $\|L^* - L^0\|_F \to 0$).

Remarks: It would be better if we had a guaranteed recovery error bound for the base algorithm (like the bounds for the base algorithms of matrix completion and matrix factorization in [15], or robust late fusion in [16]). However, the recovery analysis of (6) is difficult due to the concurrent appearance of the noise part E and the missing values (introduced by the Hadamard operator). To the best of our knowledge, there are no theoretical results of recovery guarantees on problems of this form. Hence, we leave the recovery analysis of the base algorithm to future work.

VII. EXPERIMENTS

In this section, we evaluate the accuracies and running time of the proposed DC-RMTL on both real-world and simulated data sets. For either the case with the least squares loss or the one with the squared hinge loss, we compare the performances of four algorithms for RMTL: 1) the proposed divide-and-conquer algorithms (APG-DC-RMTL with APG being the base algorithm, and LADM-DC-RMTL with LADM being the base algorithm); 2) the proposed base algorithm LADM-RMTL; and 3) the APG algorithm in [14] (APG-RMTL).⁴ It is worth noting that for APG-DC-RMTL or LADM-DC-RMTL, we report their results with two settings: 1) [algorithm-name]-25% represents the divide-and-conquer algorithm that divides the original problem into four subproblems (t = 4) and 2) [algorithm-name]-10% represents the divide-and-conquer algorithm with 10 subproblems (t = 10). All the experiments were conducted with MATLAB R2010b on a Dell PowerEdge R320 server with 16-GB memory and a 1.9-GHz E5-2420 CPU.

For LADM-RMTL, the stopping criterion is as follows (see the convergence condition in Algorithm 3 or 4):

$$\frac{\|L^k - L^{k-1}\|_F}{\max(1, \|L^{k-1}\|_F)} \le \epsilon \quad \text{and} \quad \frac{\|E^k - E^{k-1}\|_F}{\max(1, \|E^{k-1}\|_F)} \le \epsilon. \quad (25)$$

We set $\epsilon = 10^{-5}$ in our experiments. For APG-RMTL, the stopping criterion used in the original implementation in [14] is

$$\left| F(L^{k+1}, E^{k+1}) - F(L^k, E^k) \right| \le \epsilon' \, F(L^k, E^k) \quad (26)$$

where $\epsilon'$ is a small positive constant (e.g., $\epsilon' = 10^{-5}$ in our experiments), and $F(L, E)$ is the following objective function: $F(L, E) = \|L\|_* + \lambda \|E\|_{1,2} + \beta \|D \odot Y - D \odot (A(L + E))\|_F^2$. For a fair comparison with LADM-RMTL, we use a modified stopping criterion in APG-RMTL. More specifically, the APG-RMTL algorithm stops when either the condition in (25) or the condition in (26) is satisfied.⁵

⁴Our implementation of APG is based on the open source implementation of [14] at http://www.public.asu.edu/~jye02/Software/MALSAR/.
⁵Under these settings, the running time comparison between APG-RMTL and LADM-RMTL/DC-RMTL is advantageous to APG-RMTL, because it is easier to reach either one of two conditions than to be restricted to a single condition.


TABLE I STATISTICS OF THE DATA SETS USED IN OUR EXPERIMENTS

In the rest of this section, we report the results on both simulated and real-world data sets. In particular, we conduct six experiments.
1) The first two experiments evaluate and compare the accuracies and running time of the proposed DC-RMTL with either the least squares loss or the squared hinge loss, on real-world data sets.
2) The third experiment investigates the parameter sensitivity of DC-RMTL on real-world data sets.
3) The fourth experiment reports the Wilcoxon signed-rank test results of DC-RMTL versus its corresponding competitors, on real-world data sets.
4) The fifth experiment investigates the effects of different numbers of features/tasks on the proposed methods' performances, on simulated data sets.
5) The last experiment observes the effects of DC-RMTL with random partitioning and with different partitioning strategies, on a real-world data set.

A. Results on Real-World Data Sets

1) Results of DC-RMTL With the Least Squares Loss: We first evaluate the performances of DC-RMTL on four real-world data sets.⁶ The statistics of the data sets are summarized in Table I. The Enron, Delicious, and EUR-Lex data sets are multilabel data sets for text analysis. The Mediamill data set is for semantic concept detection in video. All of them are multilabel data sets, and each label column corresponds to a task in MTL. We use the area under the curve (AUC), a popular metric for classification, as the evaluation metric in our experiments. We randomly split each data set into a training set and a test set with a ratio of 4:6. We report the results averaged over five trials. The regularization parameters in APG-RMTL, LADM-RMTL, and the DC-RMTL algorithms are all chosen by cross validation within the set $P = \{10^{-3}, 10^{-2}, \ldots, 10^{2}, 10^{3}\} \cup \{10^{-3} \times 5, 10^{-2} \times 5, \ldots, 10^{2} \times 5, 10^{3} \times 5\}$.

To compare AUC, in addition to the algorithms for RMTL, we include the results of four other representative MTL algorithms.⁷
1) One Norm [27]: MTL with the least squares loss and the $\ell_{1,1}$-norm regularization. The regularization parameter is chosen by cross validation within P.
2) Trace Norm [28]: MTL with the least squares loss and the trace-norm regularization. The regularization parameter is chosen by cross validation within P.

⁶http://mulan.sourceforge.net/datasets.html. Note that for the Mediamill data set, in order to make a fair comparison, we follow [13] and use a subset obtained by randomly sampling 8000 data points.
⁷The results of One Norm and Trace Norm are obtained by the open source implementations at http://www.public.asu.edu/~jye02/Software/MALSAR/. The results of IndSVM and RidgeReg are obtained by our implementations.


3) IndSVM [29]: Each of the multiple tasks is solved independently as a linear support vector machine problem $\min_{w^{(i)}} \frac{\lambda}{2}\|w^{(i)}\|_2^2 + \sum_{j=1}^{n_i} \max(0, 1 - y_j^{(i)} (w^{(i)})^T X_j^{(i)})$. The regularization parameter is chosen by cross validation within $\{10^{-i}\}_{i=1}^{3} \cup \{2 \times i\}_{i=1}^{50} \cup \{200 \times i\}_{i=1}^{20}$.
4) RidgeReg [30]: Each of the multiple tasks is solved independently as a ridge regression problem $\min_{w^{(i)}} \frac{\lambda}{2}\|w^{(i)}\|_2^2 + \sum_{j=1}^{n_i} (y_j^{(i)} - (w^{(i)})^T X_j^{(i)})^2$. The regularization parameter is chosen by cross validation within P.

As shown in Table II, several observations can be made from the comparison results.
1) LADM-RMTL, LADM-DC-RMTL-25%, and LADM-DC-RMTL-10% have AUC performance comparable with APG-RMTL. The six algorithms for RMTL show superior AUC gains over the other baselines.
2) With nearly the same AUC values, LADM-RMTL shows a near-linear (i.e., less than 10 times) speedup over APG-RMTL. For example, LADM-RMTL performs more than 4.5 times and 6 times faster than APG-RMTL on the Delicious and EUR-Lex data sets, respectively.
3) With little loss of accuracy, LADM-DC-RMTL-25% and LADM-DC-RMTL-10% perform substantially faster than LADM-RMTL and APG-RMTL on all the data sets. For example, on EUR-Lex with 412 tasks and 5000 features, LADM-DC-RMTL-10% (LADM-DC-RMTL-25%) performs nearly 12 (8) times faster than its base algorithm LADM-RMTL, and more than 77 (49) times faster than APG-RMTL.
4) The divide-and-conquer variant with APG as its base algorithm, APG-DC-RMTL, performs faster than APG-RMTL. However, LADM-DC-RMTL still shows a clear speedup over APG-DC-RMTL.

2) Results of DC-RMTL With the Squared Hinge Loss: We also test and compare the performances of DC-RMTL with the squared hinge loss. The settings of data set splitting and parameter selection are the same as those in the case of DC-RMTL with the least squares loss. As can be seen in Table III, we obtain observations similar to those in the case with the least squares loss: 1) the proposed LADM-RMTL and LADM-DC-RMTL show AUC accuracies comparable with APG-RMTL; 2) with close AUC values, LADM-RMTL shows a near-linear speedup over APG-RMTL; and 3) with little loss of accuracy, LADM-DC-RMTL performs faster than APG-DC-RMTL and LADM-RMTL. In particular, LADM-DC-RMTL shows a substantial speedup over APG-RMTL on all the data sets. For instance, on EUR-Lex, LADM-DC-RMTL-10% (LADM-DC-RMTL-25%) performs more than 7 (4) times faster than its base algorithm LADM-RMTL, and more than 45 (27) times faster than the state-of-the-art APG-RMTL. These observations indicate that the proposed divide-and-conquer method is promising for scalable multitask classification problems.


TABLE II AVERAGED AUC (WITH STANDARD DEVIATION) AND RUNNING TIME (SECONDS) COMPARISON OF 10 ALGORITHMS ON FOUR DATA SETS. THE REGULARIZATION PARAMETERS (i.e., λ) IN ALL THE ALGORITHMS ARE CHOSEN BY CROSS VALIDATION. ALL THE REPORTED PERFORMANCES ARE AVERAGED OVER FIVE TRIALS

TABLE III AVERAGED AUC (WITH STANDARD DEVIATION) AND RUNNING TIME (SECONDS) WITH THE SQUARED HINGE LOSS ON FOUR DATA SETS. ALL THE REPORTED PERFORMANCES ARE AVERAGED OVER FIVE TRIALS

Fig. 2. Comparison results of APG-RMTL, LADM-RMTL, LADM-DC-RMTL-25%, and LADM-DC-RMTL-10% on EUR-Lex in terms of AUC (left) and computational time in log scale (right). Please see the explanations at the beginning of Section VII for the stopping criterion of each algorithm.

3) Results With Different Training Ratios: To have a comprehensive view of the DC-RMTL algorithms, we also investigate the accuracies and computational costs of LADM-DC-RMTL-10% and LADM-DC-RMTL-25% with different training ratios (the percentage of samples used for training from the whole data set). We use the EUR-Lex data set for this experiment. As shown in Fig. 2, two observations can be made from the results.
1) LADM-DC-RMTL-25% and LADM-DC-RMTL-10% have AUC accuracies comparable with APG-RMTL and LADM-RMTL under different training ratios.
2) Under comparable AUC values, both LADM-DC-RMTL-25% and LADM-DC-RMTL-10% show superior efficiency gains over APG-RMTL and LADM-RMTL.

4) Results of Significance Tests: We further conduct experiments to observe the significance test results of DC-RMTL versus its corresponding competitors. First, we conduct the Wilcoxon signed-rank test [32] of LADM-RMTL (APG-RMTL) versus its corresponding divide-and-conquer competitor on the four data sets. For t in the range of 2–10, all of the p-values for the Wilcoxon signed-rank test are larger than 0.3, which indicates that the difference between the predicted values of LADM-RMTL (APG-RMTL) and those of its corresponding divide-and-conquer competitor is not significant.⁸ Second, we also conduct the Wilcoxon signed-rank test of APG-RMTL versus LADM-RMTL on the four data sets. All of the p-values for the Wilcoxon signed-rank test are >0.2, which means that the difference between the predicted values of LADM-RMTL and those of APG-RMTL is not significant.

B. Parameter Sensitivity

The parameter t (number of partitions) in DC-RMTL is a key factor to speed up the algorithm. An implicit assumption

⁸Note that the difference would be regarded as significant only when the p-value is
