
Efficient Algorithms for Exact Inference in Sequence Labeling SVMs

Alexander Bauer, Nico Görnitz, Franziska Biegler, Klaus-Robert Müller, and Marius Kloft

Abstract: The task of structured output prediction deals with learning general functional dependencies between arbitrary input and output spaces. In this context, two loss-sensitive formulations for maximum-margin training have been proposed in the literature, which are referred to as margin and slack rescaling, respectively. The latter is believed to be more accurate and easier to handle. Nevertheless, it is not popular due to the lack of known efficient inference algorithms; therefore, margin rescaling, which requires a similar type of inference as normal structured prediction, is the most often used approach. Focusing on the task of label sequence learning, we here define a general framework that can handle a large class of inference problems based on Hamming-like loss functions and the concept of decomposability for the underlying joint feature map. In particular, we present an efficient generic algorithm that can handle both rescaling approaches and is guaranteed to find an optimal solution in polynomial time.

Index Terms: Dynamic programming, gene finding, hidden Markov SVM, inference, label sequence learning, margin rescaling, slack rescaling, structural support vector machines (SVMs), structured output.

NOMENCLATURE

n: Number of patterns.
X, Y: Input and output space of sequences.
x, y: A training pattern and the label sequence.
f: Prediction function f : X → Y.
w: Weight vector of the prediction function f.
T: Length of the label sequences y.
N: Number of possible states of a label sequence per site.
Φ: Joint feature mapping Φ : X × Y → R^d.
δΦ_i(ŷ): Abbreviation for Φ(x_i, y_i) − Φ(x_i, ŷ).
φ: Feature mapping.
Λ: Canonical binary vector representation of a state.
ℓ: Decomposability order of a mapping Φ.
π_i^ℓ: Local projection of order ℓ at position i.
j_1, ..., j_ℓ: Consecutive sequence of individual states in a label sequence.
Δ_H: Hamming loss.
g_i: Real-valued functions on subsequences.
h: Real-valued functions h : R → R.
⟨·, ·⟩: Inner product.
⊗: Direct tensor product, (a ⊗ b)_{i+(j−1)·dim(a)} = a_i · b_j.

Manuscript received April 3, 2013; accepted September 3, 2013. Date of publication October 2, 2013; date of current version April 10, 2014. This work was supported in part by the German Science Foundation under Grant DFG MU 987/6-1 and Grant RA 1894/1-1, in part by the World Class University Program of the Korea Research Foundation under Grant R31-10008, and in part by the BMBF project ALICE under Grant 01IB10003B. The work of M. Kloft was supported by the German Research Foundation. (Corresponding author: K.-R. Müller.) A. Bauer, N. Görnitz, and F. Biegler are with the Machine Learning Group, Technische Universität Berlin, Berlin 10587, Germany (e-mail: [email protected]; [email protected]; [email protected]). K.-R. Müller is with the Machine Learning Group, Technische Universität Berlin, Berlin 10587, Germany, and also with the Department of Brain and Cognitive Engineering, Korea University, Seoul 136-713, Korea (e-mail: [email protected]). M. Kloft is with the Courant Institute of Mathematical Sciences, New York, NY 10012 USA, and also with the Memorial Sloan-Kettering Cancer Center, New York, NY 10065 USA (e-mail: [email protected]). Color versions of one or more of the figures in this paper are available online at http://ieeexplore.ieee.org. Digital Object Identifier 10.1109/TNNLS.2013.2281761

I. INTRODUCTION

While many efficient and reliable algorithms have been designed for typical supervised tasks such as regression and classification, learning general functional dependencies between arbitrary input and output spaces is still a considerable challenge. The task of structured output prediction extends the well-studied classification and regression tasks to the most general level, where the outputs can be complex structured objects such as graphs, alignments, sequences, or sets [1]–[3]. One of the most difficult questions arising in this context is how to efficiently learn the prediction function. In this paper, we focus on a special task of structured output prediction called label sequence learning [4]–[7], which concerns the problem of predicting a state sequence for an observation sequence, where the states may encode any labeling information, e.g., an annotation or segmentation of the sequence. A proven methodology to this end is the structured support vector machine (SVM) [1]–[3], [8], [9], a generalized type of SVM with loss-sensitive extensions known as margin and slack rescaling. While margin rescaling has been studied extensively, since it can often be handled by a generalized version of Viterbi's algorithm [10]–[14], slack rescaling is conjectured to be more effective, generally providing a higher prediction accuracy. To date, the latter is, however, not popular due to the lack of efficient algorithms solving the corresponding inference task. Even margin rescaling is applicable only to a small fraction of existing learning problems, namely those where the loss function can be decomposed in the same way as the feature map. In this paper, we present a general framework for both rescaling approaches that can handle a very large class of inference problems, that is, the ones that are based on Hamming-type loss functions and whose joint feature maps fulfill a decomposability requirement. We develop a generic


algorithm and prove that it always finds an optimal solution in time polynomial in the size of the input data, which is a considerable improvement over more straightforward algorithms that have exponential runtime. The remainder of this paper is organized as follows. In Section II, we introduce the task of label sequence learning. Section III is concerned with the corresponding inference tasks. Here, we derive a general representation for different inference problems, leading to a generic formulation of algorithms that is independent of a specific learning task. In particular, we define the notion of decomposability for a joint feature map, which characterizes a class of inference problems that can be dealt with by means of dynamic programming. Finally, we show how such inference problems can be solved algorithmically. To assess the resulting behavior of the presented algorithm with respect to the learning process, in Section IV we perform experiments on synthetic data simulating the computational biology problem of gene finding, using a cutting-plane algorithm for training. In Section V, we summarize the achieved results, followed by a discussion.

II. LABEL SEQUENCE LEARNING IN A NUTSHELL

In label sequence learning, we aim at predicting a label sequence y for a given observation sequence x by learning a functional relationship f : X → Y between an input space X and an output space Y, based on a training sample of input–output pairs (x_1, y_1), ..., (x_n, y_n) ∈ X × Y, independently drawn from some fixed but unknown probability distribution over X × Y. To simplify the presentation, we assume that all the sequences are of the same length. In particular, we define the input space to be X = R^{r×T} with r, T ∈ N and the output space Y to consist of all possible label sequences over a finite alphabet Σ with |Σ| = N ∈ N, i.e., Y := {(y_1, ..., y_T) | ∀t ∈ {1, ..., T} : y_t ∈ Σ}, where for notational convenience, we denote the possible states for single sites y_t ∈ Σ in the sequence y by the numbers 1, ..., N. A common approach is to learn a joint discriminant function that assigns each pair (x, y) of sequences a real number encoding the degree of compatibility between input x and output y. Hence, by maximizing this function over all y ∈ Y, we can determine, for a given input x, the most compatible output

  f_w(x) = arg max_{y∈Y} ⟨w, Φ(x, y)⟩

with Φ being a vector-valued mapping called the joint feature map, whose concrete form depends on the nature of the underlying learning task. Apart from being able to capture important relationships in the data in order to make accurate predictions, the actual form of Φ has a strong influence on the efficiency of the learning and prediction processes. One popular choice of the joint feature map Φ is based on the concept of hidden Markov models, which restrict the possible dependencies between the states in a label sequence by treating it as a Markov chain. This results in a formulation that can be addressed by dynamic programming, enabling efficient inference. Following this chain of reasoning, we can specify a joint feature map Φ including the interactions between input features φ(x_t) (for some mapping φ) and labels y_t as well as the interactions


between nearby label variables y_{t−1}, y_t (see [1, p. 1475]):

  Φ(x, y) = ( ∑_{t=1}^{T} φ(x_t) ⊗ Λ(y_t) ,  ∑_{t=2}^{T} Λ(y_{t−1}) ⊗ Λ(y_t) )   (1)

where Λ(y_t) = (δ(y_t, 1), ..., δ(y_t, N)) ∈ {0, 1}^N and δ is the Kronecker delta function, yielding one if the arguments are equal and zero otherwise.

A. Structural SVMs

To learn the prediction function f_w, we can train a structural SVM [2], [8], [9] using one of its loss-sensitive extensions, either margin or slack rescaling. Following a prevalent line of research [1], we consider here loss functions

  Δ : Y × Y → R,  (y, ŷ) ↦ Δ(y, ŷ)   (2)

satisfying Δ(y, y) = 0 and Δ(y, ŷ) > 0 for y ≠ ŷ. Intuitively, Δ(y, ŷ) quantifies the discrepancy or distance between the two outputs y and ŷ. In the following, we briefly review both rescaling methods. In slack rescaling, we scale the slack variables ξ_i according to the loss incurred in each of the linear constraints, which gives rise to the following optimization problem:

  min_{w, ξ≥0}  (1/2) ||w||² + (C/n) ∑_{i=1}^{n} ξ_i
  s.t.  ∀ŷ ∈ Y \ {y_1} :  ⟨w, δΦ_1(ŷ)⟩ ≥ 1 − ξ_1 / Δ(y_1, ŷ)
        ...
        ∀ŷ ∈ Y \ {y_n} :  ⟨w, δΦ_n(ŷ)⟩ ≥ 1 − ξ_n / Δ(y_n, ŷ)   (3)

where, for notational convenience, we define δΦ_i(ŷ) = Φ(x_i, y_i) − Φ(x_i, ŷ). In margin rescaling, we scale the margin according to the incurred loss as follows:

  min_{w, ξ≥0}  (1/2) ||w||² + (C/n) ∑_{i=1}^{n} ξ_i
  s.t.  ∀ŷ ∈ Y \ {y_1} :  ⟨w, δΦ_1(ŷ)⟩ ≥ Δ(y_1, ŷ) − ξ_1
        ...
        ∀ŷ ∈ Y \ {y_n} :  ⟨w, δΦ_n(ŷ)⟩ ≥ Δ(y_n, ŷ) − ξ_n.   (4)
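To make the difference between the two formulations concrete, the following sketch (a minimal illustration, not part of the paper) evaluates, for a fixed weight vector and a few candidate outputs ŷ, the quantities that drive the two sets of constraints: (1 − ⟨w, δΦ_i(ŷ)⟩) · Δ(y_i, ŷ) for slack rescaling and Δ(y_i, ŷ) − ⟨w, δΦ_i(ŷ)⟩ for margin rescaling. All inputs here are placeholder arrays.

```python
import numpy as np

def violation_terms(w, delta_phi, losses):
    """Per-candidate violation terms under slack and margin rescaling.

    w         : weight vector, shape (d,)
    delta_phi : rows are delta Phi_i(y_hat) = Phi(x_i, y_i) - Phi(x_i, y_hat), shape (m, d)
    losses    : Delta(y_i, y_hat) for the same m candidate outputs, shape (m,)
    """
    margins = delta_phi @ w                    # <w, delta Phi_i(y_hat)>
    slack_rescaled = (1.0 - margins) * losses  # loss scales the violation of the unit margin
    margin_rescaled = losses - margins         # loss scales the required margin itself
    return slack_rescaled, margin_rescaled

# placeholder numbers, for illustration only
rng = np.random.default_rng(0)
w = rng.normal(size=5)
delta_phi = rng.normal(size=(3, 5))            # three candidate outputs y_hat
losses = np.array([1.0, 2.0, 4.0])             # their losses Delta(y_i, y_hat)
print(violation_terms(w, delta_phi, losses))
```

These are exactly the objectives that the separation oracles discussed below maximize over ŷ.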

Fig. 1 shows the effect of rescaling on the slack variables and the margin. In the case of slack rescaling in Fig. 1(a), we observe that the outputs y' and y'' violate the margin to the same extent, ξ_{y'} / Δ(y, y') = ξ_{y''} / Δ(y, y''), but, since their corresponding loss values are different, the output y' contributes a greater value ξ_{y'} to the objective function than the output y''. In the case of margin rescaling, we observe a similar effect. Although the outputs y' and y'' violate the margin by the same value ξ_{y'} = ξ_{y''} (before rescaling), because of Δ(y, y') > Δ(y, y''), the output y' contributes a greater value Δ(y, y') − 1 + ξ_{y'} to the objective function than the value Δ(y, y'') − 1 + ξ_{y''} of the output y''. However, a potential disadvantage of margin rescaling (when contrasted with slack rescaling) is that


it seems to give more importance to labelings having large errors, regardless of how well they are separated from the true output. To avoid the generation of exponentially many constraints while solving the QP in (3) or (4), we can use the cutting-plane algorithm [8] based on the concept of a separation oracle, which involves, in each iteration, the computation of a sequence y* = arg max_{ŷ∈Y} (1 − ⟨w, δΦ_i(ŷ)⟩) · Δ(y_i, ŷ) for slack rescaling and y* = arg max_{ŷ∈Y} Δ(y_i, ŷ) − ⟨w, δΦ_i(ŷ)⟩ for margin rescaling, respectively, corresponding to the most violated constraint for the i-th training example.

Fig. 1. Illustration of the behavior of the slack variables ξ_i due to slack and margin rescaling, respectively. Each vertical axis corresponds to a subset of constraints in the corresponding optimization problem with respect to the same data point (x, y). The score value of the true output y is marked by a red bar, whereas the blue bars mark the score values of two support sequences y' and y'' with Δ(y, y') > Δ(y, y''). (a) Green bars illustrate the effect of slack rescaling on the corresponding slack variables, where the distance between a red bar and the dotted line corresponds to the margin. (b) The distance between a dotted black line and a red bar corresponds to the margin of length one, whereas the effect on the slack variables due to margin rescaling is visualized by the green dotted line.

III. INFERENCE PROBLEMS

In this section, we consider a special class of inference problems in the task of label sequence learning, focusing our attention on two special forms of the general inference problem, which we refer to here as linear and loss augmented inference. For the case where the underlying problem has special factoring properties, we describe efficient algorithms that find an optimal solution in polynomial time.

A. Linear Inference

As already mentioned, in label sequence learning, we aim at learning a prediction function of the form

  y* = arg max_{ŷ∈Y} ⟨w, Φ(x, ŷ)⟩.   (5)

Such optimization problems, where we aim at determining an optimal assignment y* ∈ Y given an input x ∈ X, are typically referred to as inference or decoding problems. Here, we call problems as in (5) linear inference problems, suggesting that the objective function is linear in the joint feature vector. Since the cardinality |Y| of the output space grows exponentially in the description length of ŷ ∈ Y, exhaustive search is computationally infeasible. However, combinatorial optimization problems with optimal substructure, i.e., problems of which each optimal solution contains within it optimal solutions to the subproblems, can often be solved efficiently by dynamic programming [15, pp. 359–397]. For the case of the inference problems in (5), dynamic programming exploits certain factoring properties of Φ, which we define formally in a moment. For notational convenience, we first introduce the following projections: for each i ∈ {1, ..., T} and each ℓ ∈ N, we define a map

  π_i^ℓ(y) := (y_{max{i−ℓ,1}}, ..., y_{i−1}, y_i)   (6)

the local projection of order ℓ at position i. The main goal of these projection maps is to reduce the size of the formulas and to keep the notation compact. Therefore, for a given sequence y, the map π_i^ℓ cuts out a consecutive subsequence of length at most ℓ + 1, consisting of the term y_i and its ℓ predecessors if they exist.

Example 1: For projection order ℓ ∈ {0, 1, 2} and position i ∈ {1, 2, ..., T}, we obtain

  π_1^0(y) = y_1,    π_1^1(y) = y_1,               π_1^2(y) = y_1
  π_2^0(y) = y_2,    π_2^1(y) = (y_1, y_2),        π_2^2(y) = (y_1, y_2)
  π_3^0(y) = y_3,    π_3^1(y) = (y_2, y_3),        π_3^2(y) = (y_1, y_2, y_3)
  ...
  π_T^0(y) = y_T,    π_T^1(y) = (y_{T−1}, y_T),    π_T^2(y) = (y_{T−2}, y_{T−1}, y_T).
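A direct transcription of the projection (6), given here as a hedged Python sketch; sequences are plain lists and positions are 1-based, as in the paper.

```python
def local_projection(y, i, ell):
    """pi_i^ell(y) = (y_{max(i-ell,1)}, ..., y_{i-1}, y_i), with 1-based position i."""
    start = max(i - ell, 1)
    return tuple(y[start - 1:i])   # 0-based slicing of the 1-based window

y = [3, 2, 1, 3]
print(local_projection(y, 3, 0))   # (1,)
print(local_projection(y, 3, 1))   # (2, 1)
print(local_projection(y, 3, 2))   # (3, 2, 1)
```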

Now, building on the notation of local projections, we define the notion of decomposability for a joint feature map.

Definition 1 (Decomposability of Joint Feature Maps): We say a joint feature map

  Φ : X × Y → R^d,  (x, y) ↦ Φ(x, y)   (7)

for d ∈ N is additively decomposable of order ℓ ∈ N if for each i ∈ {1, ..., T} there is a map

  ψ_i : (x, π_i^ℓ(y)) ↦ ψ_i(x, π_i^ℓ(y)) ∈ R^d   (8)

such that for all x, y the following equality holds:

  Φ(x, y) = ∑_{i=1}^{T} ψ_i(x, π_i^ℓ(y)).   (9)

Next, we give a short example to clarify the above definition.

Example 2: Consider the joint feature map in (1). It is additively decomposable of order ℓ = 1, since we can write

  Φ(x, y) = ( ∑_{i=1}^{T} φ(x_i) ⊗ Λ(y_i) ,  ∑_{i=2}^{T} Λ(y_{i−1}) ⊗ Λ(y_i) )
          = ( φ(x_1) ⊗ Λ(y_1) , 0 )  +  ∑_{i=2}^{T} ( φ(x_i) ⊗ Λ(y_i) , Λ(y_{i−1}) ⊗ Λ(y_i) )

where the first term is ψ_1(x, y_1) = ψ_1(x, π_1^1(y)) and the i-th summand is ψ_i(x, y_{i−1}, y_i) = ψ_i(x, π_i^1(y)), so that

  Φ(x, y) = ∑_{i=1}^{T} ψ_i(x, π_i^1(y)).
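For concreteness, the following sketch (an illustration under assumptions, not the authors' code) builds the local maps ψ_i of Example 2 for the feature map in (1), using one-hot vectors Λ and a caller-supplied observation map phi. The ordering of the Kronecker-product coordinates is an implementation choice and may differ from the index convention in the Nomenclature.

```python
import numpy as np

def one_hot(state, N):
    """Lambda(y_t): canonical binary vector of a state in {1, ..., N}."""
    v = np.zeros(N)
    v[state - 1] = 1.0
    return v

def psi(x, window, i, N, phi):
    """psi_i(x, pi_i^1(y)): local feature vector of order ell = 1 for the map in (1).

    window is pi_i^1(y), i.e. (y_1,) for i = 1 and (y_{i-1}, y_i) otherwise;
    x has shape (r, T) and phi maps a column of x to a feature vector.
    """
    y_i = window[-1]
    emission = np.kron(phi(x[:, i - 1]), one_hot(y_i, N))      # phi(x_i) tensor Lambda(y_i)
    if i == 1:
        transition = np.zeros(N * N)                           # no predecessor at i = 1
    else:
        transition = np.kron(one_hot(window[0], N), one_hot(y_i, N))  # Lambda(y_{i-1}) tensor Lambda(y_i)
    return np.concatenate([emission, transition])

def joint_feature_map(x, y, N, phi):
    """Phi(x, y) = sum_i psi_i(x, pi_i^1(y)), cf. (1) and Example 2."""
    T = len(y)
    windows = [(y[0],)] + [(y[i - 1], y[i]) for i in range(1, T)]
    return sum(psi(x, windows[i], i + 1, N, phi) for i in range(T))
```

A weight vector w then scores a labeling via ⟨w, Φ(x, y)⟩ = ∑_i ⟨w, ψ_i(x, π_i^1(y))⟩, which is the decomposition exploited below.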

Therefore, if a joint feature map is additively decomposable of order ℓ, it can be written as a sum of vectors, where each vector depends only locally on the sequence y in the sense of the map π_i^ℓ(y). For such a Φ, we can reformulate the right-hand side of the inference problem (5) as follows:

  ⟨w, Φ(x, ŷ)⟩ = ⟨w, ∑_{i=1}^{T} ψ_i(x, π_i^ℓ(ŷ))⟩ = ∑_{i=1}^{T} ⟨w, ψ_i(x, π_i^ℓ(ŷ))⟩ =: ∑_{i=1}^{T} g_i(π_i^ℓ(ŷ))

for some real-valued functions g_i. Inference problems having a general representation

  y* = arg max_{ŷ∈Y} ∑_{i=1}^{T} g_i(π_i^ℓ(ŷ))   (10)

can be solved via dynamic programming in O(T · N^{ℓ+1}) time, similar to the well-known Viterbi's algorithm for hidden Markov models [10], [16]. To do so, we stepwise compute quantities V_t(j_1, ..., j_ℓ) for all ℓ ≤ t ≤ T and 1 ≤ j_1, ..., j_ℓ ≤ N according to the following equation:

  V_t(j_1, ..., j_ℓ) := max_{ŷ=(ŷ_1, ..., ŷ_{t−ℓ}, j_1, ..., j_ℓ)} ∑_{i=1}^{t} g_i(π_i^ℓ(ŷ)).   (11)

We interpret V_t(j_1, ..., j_ℓ) as the optimal value of a subproblem of (10) for shorter sequences of length t, where the last ℓ states j_1, ..., j_ℓ of the corresponding solution are fixed. Furthermore, we save the preceding state of the state sequence j_1, ..., j_ℓ in a structure S as follows: for all t ∈ {ℓ, ..., T} and all j_1, ..., j_ℓ ∈ {1, ..., N},

  S_t(j_1, ..., j_ℓ) := y^{t,j}_{t−ℓ},   where   (y^{t,j}_1, ..., y^{t,j}_t) := arg max_{ŷ=(ŷ_1, ..., ŷ_{t−ℓ}, j_1, ..., j_ℓ)} ∑_{i=1}^{t} g_i(π_i^ℓ(ŷ)).

If the solution y^{t,j} is not unique, so that there are several optimal predecessors of the sequence j_1, ..., j_ℓ, we choose the one with the smallest value for y^{t,j}_{t−ℓ} as a tiebreaker. In this notation, the last state y*_T of an optimal solution y* is given by

  y*_T = arg max_{1≤i≤N} V_T(i)

and the remaining states can be computed in a top-down manner by

  y*_t = S_{t+ℓ}(y*_{t+1}, ..., y*_{t+ℓ}),   t = T − ℓ, ..., 1.
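As an illustration, here is a minimal sketch of this dynamic program for the common case ℓ = 1; the interface (a callable g(t, prev, cur) returning the local scores g_t, with prev = None at t = 1) is an assumption of the sketch, and the authors' reference pseudocode is Algorithm 2 in Appendix A.

```python
def viterbi_order1(T, N, g):
    """Solve y* = argmax_y sum_t g_t(pi_t^1(y)) for ell = 1, cf. (10) and (11).

    g(t, prev, cur) returns g_t(prev, cur); prev is None for t = 1.
    States are 1, ..., N; runtime O(T * N^2).
    """
    V = [{j: g(1, None, j) for j in range(1, N + 1)}]   # V_1(j)
    S = [dict()]                                        # no predecessor stored for t = 1
    for t in range(2, T + 1):
        Vt, St = {}, {}
        for j in range(1, N + 1):
            # best predecessor i of state j at position t (smallest i breaks ties)
            best_i = max(range(1, N + 1), key=lambda i: (V[-1][i] + g(t, i, j), -i))
            Vt[j] = V[-1][best_i] + g(t, best_i, j)
            St[j] = best_i
        V.append(Vt)
        S.append(St)
    # backtracking: y*_T = argmax_j V_T(j), then y*_t = S_{t+1}(y*_{t+1})
    path = [max(range(1, N + 1), key=lambda j: V[-1][j])]
    for t in range(T - 1, 0, -1):
        path.append(S[t][path[-1]])
    return list(reversed(path)), V[-1][path[0]]
```

In the structural SVM setting, g can be obtained from a decomposable feature map via g_t(π) = ⟨w, ψ_t(x, π)⟩.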

The pseudocode of the above algorithm is given in Appendix A. Since we are initially interested in solving (5), we also have to consider the time needed to compute the values g_i(π_i^ℓ(ŷ)). The good news is that the vectors ψ_i(x, π_i^ℓ(ŷ)) are usually extremely sparse, so that we can often assume the time needed to compute g_i(π_i^ℓ(ŷ)) to be constant, i.e., O(1).

B. Loss Augmented Inference

Loss augmented inference is similar to the linear inference problem, but the objective function is extended by a loss mapping Δ as follows:

  y* = arg max_{ŷ∈Y} ⟨w, Φ(x, ŷ)⟩ ⊙ Δ(y, ŷ),   ⊙ ∈ {+, ·}   (12)

where y is a fixed (true) output for a given input x. To train a structural SVM with margin or slack rescaling using

the cutting-plane algorithm, we have to solve an appropriate loss augmented inference problem. For the general case of an arbitrary loss function Δ, it seems that we cannot do better than exhaustive search, even if the joint feature map is decomposable. However, for the special case of the Hamming loss, which can be decomposed additively over the sites of the sequences, we can develop algorithms that find an optimal solution for such inference problems in polynomial time. More precisely, we consider here loss functions of the form Δ = h ∘ Δ_H, where h : R → R is an arbitrary function and Δ_H is the Hamming loss, which is defined as

  Δ_H : Y × Y → N,   (y, ŷ) ↦ ∑_{i=1}^{T} 1[y_i ≠ ŷ_i]   (13)

where 1[y_i ≠ ŷ_i] is an indicator function yielding one if the expression [y_i ≠ ŷ_i] is true and zero otherwise. Furthermore, we assume the joint feature map Φ to be additively decomposable of order ℓ, i.e., there are real-valued functions g_1, ..., g_T with ⟨w, Φ(x, ŷ)⟩ = ∑_{i=1}^{T} g_i(π_i^ℓ(ŷ)). Using this terminology, (12) reads

  y* = arg max_{ŷ∈Y}  ( ∑_{i=1}^{T} g_i(π_i^ℓ(ŷ)) ) ⊙ h(Δ_H(y, ŷ)),   ⊙ ∈ {+, ·}.   (14)

In the following, we show how to solve inference problems in the general form (14) algorithmically. One important observation here is that the Hamming loss only takes a finite number of nonnegative integer values, so that we can define the following term:

  L_{t,k}(j_1, ..., j_ℓ) := max_{ŷ=(ŷ_1, ..., ŷ_{t−ℓ}, j_1, ..., j_ℓ): Δ_H(y, ŷ)=k}  ∑_{i=1}^{t} g_i(π_i^ℓ(ŷ))   (15)

if the maximum exists, i.e., the set over which we maximize is not empty. We interpret L_{t,k}(j_1, ..., j_ℓ) as the corresponding sum value of an optimal solution of a subproblem of (14), where t ∈ {ℓ, ..., T} denotes the length of the sequence ŷ, j_1, ..., j_ℓ ∈ {1, ..., N} are the last (fixed) states in that sequence, and k ∈ {0, ..., t} is the value of the Hamming loss Δ_H(y, ŷ) with y being reduced to the length of ŷ, i.e., y = (y_1, ..., y_t). In this notation, the following holds:

  max_{1≤j_1, ..., j_ℓ≤N, 0≤k≤T}  L_{T,k}(j_1, ..., j_ℓ) ⊙ h(k)  =  max_{ŷ∈Y}  ( ∑_{i=1}^{T} g_i(π_i^ℓ(ŷ)) ) ⊙ h(Δ_H(y, ŷ)).   (16)

That is, if we know the optimal sum values L_{T,k}(j_1, ..., j_ℓ) of the subproblems of (14) for all 1 ≤ j_1, ..., j_ℓ ≤ N and 0 ≤ k ≤ T, we can determine the optimal value of the base problem (14) too, by solving (16). The following theorem shows how to compute the optimal sum values L_{t,k}(j_1, ..., j_ℓ) of the subproblems recursively. We first present the formal statement for an arbitrary ℓ ∈ N and then briefly demonstrate how it can be interpreted for the special case ℓ = 1.

Theorem 1: Let t ∈ {ℓ+1, ..., T}, j_1, ..., j_ℓ ∈ {1, ..., N}, k ∈ {k̂, ..., t−ℓ+k̂} with k̂ = Δ_H((j_1, ..., j_ℓ), (y_{t−ℓ+1}, ..., y_t)), and a fixed sequence y = (y_1, ..., y_T) ∈ Y be given. Then, the following equality holds:

  L_{t,k}(j_1, ..., j_ℓ) =
    max_{1≤i≤N} L_{t−1,k−1}(i, j_1, ..., j_{ℓ−1}) + g_t(i, j_1, ..., j_ℓ),   if j_ℓ ≠ y_t
    max_{1≤i≤N} L_{t−1,k}(i, j_1, ..., j_{ℓ−1}) + g_t(i, j_1, ..., j_ℓ),     if j_ℓ = y_t.

Algorithm 1 (Loss Augmented Viterbi): Find an optimal solution y* of the problem (14).

Let us take a look at the case ℓ = 1. The calculation rule in the above formula then has the following shape:

  L_{t,k}(j) =
    max_{1≤i≤N} L_{t−1,k−1}(i) + g_t(i, j),   if j ≠ y_t
    max_{1≤i≤N} L_{t−1,k}(i) + g_t(i, j),     if j = y_t.
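The following is a minimal sketch of the resulting procedure for ℓ = 1 (the authors' Algorithm 1 itself is only referenced here, not reproduced). The interface, a callable g(t, prev, cur) for the local scores, an arbitrary wrapper h of the Hamming loss, and op selecting ⊙ ∈ {+, ·}, is an assumption of the sketch; undefined table entries are simply absent from the dictionaries, and ties are broken by the smallest predecessor state as in the text above.

```python
def loss_augmented_viterbi_order1(y, N, g, h, op="*"):
    """Solve (14) for ell = 1: maximize (sum_t g_t(y_hat_{t-1}, y_hat_t)) op h(Delta_H(y, y_hat)).

    y : true label sequence (states 1..N), length T
    g : g(t, prev, cur), prev is None for t = 1
    h : arbitrary function of the Hamming loss; op selects '+' or '*'
    """
    combine = (lambda s, v: s * v) if op == "*" else (lambda s, v: s + v)
    T = len(y)
    # L[t-1][(k, j)] = best score of a length-t prefix ending in state j with Hamming loss k
    L = [{(0 if j == y[0] else 1, j): g(1, None, j) for j in range(1, N + 1)}]
    S = [dict()]
    for t in range(2, T + 1):
        Lt, St = {}, {}
        for j in range(1, N + 1):
            step = 0 if j == y[t - 1] else 1          # loss contributed at position t
            for (k_prev, i), val in L[-1].items():
                k, cand = k_prev + step, val + g(t, i, j)
                if (k, j) not in Lt or cand > Lt[(k, j)] or (cand == Lt[(k, j)] and i < St[(k, j)]):
                    Lt[(k, j)], St[(k, j)] = cand, i
        L.append(Lt)
        S.append(St)
    # best combination of final state and total loss k, cf. (16)
    (k, j), best = max(L[-1].items(), key=lambda item: combine(item[1], h(item[0][0])))
    objective = combine(best, h(k))
    # backtracking through S, decreasing k whenever the chosen state differs from y_t
    path = [j]
    for t in range(T, 1, -1):
        i = S[t - 1][(k, path[-1])]
        if path[-1] != y[t - 1]:
            k -= 1
        path.append(i)
    return list(reversed(path)), objective
```

The multiplicative case arises for slack rescaling (after the reformulation in the proof of Proposition 2 below) and the additive case for margin rescaling.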

Finally, we can solve the inference problem (14) algorithmically as follows. After the initialization part, we compute, in a bottom-up manner (i.e., all combinations for t before t+1), the values L_{t,k}(j_1, ..., j_ℓ) using the formula in Theorem 1, so that we can determine the optimal value L* according to (16). Simultaneously, we store the preceding state of the state sequence j_1, ..., j_ℓ in the structure S, which we use later to retrieve a corresponding optimal solution, i.e., for all t ∈ {ℓ, ..., T}, k ∈ {0, ..., t}, and j_1, ..., j_ℓ ∈ {1, ..., N}, we set

  S_{t,k}(j_1, ..., j_ℓ) := y^{t,j,k}_{t−ℓ},   where   (y^{t,j,k}_1, ..., y^{t,j,k}_t) := arg max_{ŷ=(ŷ_1, ..., ŷ_{t−ℓ}, j_1, ..., j_ℓ): Δ_H(y, ŷ)=k}  ∑_{i=1}^{t} g_i(π_i^ℓ(ŷ))

if the set {ŷ = (ŷ_1, ..., ŷ_{t−ℓ}, j_1, ..., j_ℓ) : Δ_H(y, ŷ) = k} is not empty. If the solution y^{t,j,k} is not unique, so that there are several potential predecessors of the state sequence j_1, ..., j_ℓ, we can choose any of them. Here, we take as a tiebreaker the state having the smallest value for y^{t,j,k}_{t−ℓ}. In this notation, the last states y*_{T−ℓ+1}, ..., y*_T of the sequence y* are given by

  (y*_{T−ℓ+1}, ..., y*_T) = arg max_{1≤j_1, ..., j_ℓ≤N, 1≤k≤T}  L_{T,k}(j_1, ..., j_ℓ) ⊙ h(k)

and the remaining states can be computed in a top-down manner by y*_t = S_{t+ℓ,K}(y*_{t+1}, ..., y*_{t+ℓ}) for t = T − ℓ, ..., 1, where K = Δ_H((y_1, ..., y_{t+ℓ}), (y*_1, ..., y*_{t+ℓ})). Thus, we obtain an efficient algorithm for solving the optimization problem (14), which is summarized in Algorithm 1.

Fig. 2 shows a simulation of the above algorithm for the case ⊙ = ·, T = 4, N = 3, and y = (3, 2, 1, 3) with h being the identity. It illustrates how the variables L_{t,k}(j) and S_{t,k}(j) evolve over the iterations t = 1, ..., T. Each block corresponding to a value t ∈ {1, 2, 3, 4} in the recursion part represents one iteration step of the outer for-loop in lines 5–13 and shows the content of the variable S, represented by the edges between nodes, with red edges marking the decision in the current iteration step: the nodes (i, t−1) and (j, t) belonging to the same loss value k ∈ {1, 2, 3, 4} are connected by an edge if and only if S_{t,k}(j) = i. (We omit the value k = 0 in Fig. 2 for simplicity, since for that case the algorithm constructs only one path, reproducing the true sequence y.)

For example, after the second iteration (t = 2), we have: S_{2,1}(1) = 3, S_{2,1}(2) = 1, S_{2,1}(3) = 3, S_{2,2}(1) = 1, and S_{2,2}(3) = 2. The corresponding values g_t(i, j), represented by a matrix, are shown on the right: for example, g_2(1, 2) = 5 and g_2(2, 1) = 2. The circled numbers represent the current values L_{t,k}(j), where the values of the missing circles are not defined. The following example illustrates how the recursion rule from Theorem 1 is applied during the simulation in Fig. 2.

Example 3: After the initialization part, we have a situation where L_{1,1}(1) = 2 and L_{1,1}(2) = 1. In the second iteration of the outer for-loop (t = 2), we get the following values of the variables L_{t,k}(j) and S_{t,k}(j).


Fig. 2. Simulation of Algorithm 1 for the case ⊙ = ·, T = 4, N = 3, and y = (3, 2, 1, 3) with h being the identity. On the left, the values S_{t,k}(j) of the variable S are presented, where the edges between nodes denote the predecessor relationship. The red edges mark the decision in the current iteration step and the circled numbers correspond to the values L_{t,k}(j). On the right, the values g_t(i, j) are shown in matrix form.

Case 1: (t, k, j) = (2, 1, 1), i.e., j ≠ y_t and k = 1:
  L_{2,1}(1) = g_1(3) + g_2(3, 1) = 3 + 1 = 4,   S_{2,1}(1) = 3.
Case 2: (t, k, j) = (2, 2, 1), i.e., j ≠ y_t and k > 1:
  L_{2,2}(1) = max_{1≤i≤N} L_{1,1}(i) + g_2(i, 1) = L_{1,1}(1) + g_2(1, 1) = 2 + 4 = 6,   S_{2,2}(1) = 1.
Case 3: (t, k, j) = (2, 1, 2), i.e., j = y_t and t ≠ k:
  L_{2,1}(2) = max_{1≤i≤N} L_{1,1}(i) + g_2(i, 2) = L_{1,1}(1) + g_2(1, 2) = 2 + 5 = 7,   S_{2,1}(2) = 1.
Case 4: (t, k, j) = (2, 2, 2), i.e., j = y_t and t = k. The value is not defined.
Case 5: (t, k, j) = (2, 1, 3), i.e., j ≠ y_t and k = 1:
  L_{2,1}(3) = g_1(3) + g_2(3, 3) = 3 + 3 = 6,   S_{2,1}(3) = 3.
Case 6: (t, k, j) = (2, 2, 3), i.e., j ≠ y_t and k > 1:
  L_{2,2}(3) = max_{1≤i≤N} L_{1,1}(i) + g_2(i, 3) = L_{1,1}(2) + g_2(2, 3) = 1 + 4 = 5,   S_{2,2}(3) = 2.

Fig. 3. Backtracking part of the simulation of Algorithm 1 in Fig. 2, where the red nodes mark an optimal solution y* = (2, 3, 3, 1). The corresponding sum value is 13, yielding the optimal value L* = 13 · h(4) = 13 · 4 = 52.

The remaining iterations (t = 3 and t = 4) are handled in a similar way. The corresponding backtracking part, yielding an optimal solution y* = (2, 3, 3, 1) with the optimal sum value 13, is shown in Fig. 3. The runtime of Algorithm 1 is dominated by the recursion part in lines 5–13. Since each computation inside the loops is done in O(N) time, the overall time complexity is O(T² · N^{ℓ+1}). The correctness follows from Theorem 1 and (16). More precisely, Theorem 1 ensures the optimality of the values L_{T,k}(j_1, ..., j_ℓ), from which we can then compute the optimal value of the base problem (14) using (16). The actual solution y* is obtained by following the computation of the quantities L_{t,k}(j_1, ..., j_ℓ) in the reversed direction, which we do here by means of the structure S, saving the meta-information about preceding states in dependence on the state sequence j_1, ..., j_ℓ. We summarize these main results in the following proposition.

Proposition 1 (Correctness and Complexity of Algorithm 1): Algorithm 1 always finds an optimal solution of the problem (14) in O(T² · N^{ℓ+1}) time.

Finally, we can use the above results to solve the corresponding inference problems when training a structural SVM by either margin or slack rescaling.

Proposition 2 (Algorithmic Complexity of the Slack and Margin Rescaling Inference Problems): Let h : R → R be an arbitrary function and (x, y) ∈ X × Y a given data pair. If the joint feature map Φ is additively decomposable of order ℓ, then we can solve the corresponding inference problem (for Δ = h ∘ Δ_H) resulting from either slack or margin rescaling using Algorithm 1 in O(T² · N^{ℓ+1}) time.


Proof: During training via slack rescaling, the corresponding separation oracle has to solve the following loss augmented inference problem:

  y* = arg max_{ŷ∈Y} (1 − ⟨w, Φ(x, y) − Φ(x, ŷ)⟩) · Δ(y, ŷ)

which can be written in a more compact form as follows:

  (1 − ⟨w, Φ(x, y) − Φ(x, ŷ)⟩) · Δ(y, ŷ) = (1 − ⟨w, Φ(x, y)⟩ + ⟨w, Φ(x, ŷ)⟩) · Δ(y, ŷ)
    = ⟨ (w, 1), (Φ(x, ŷ), 1 − ⟨w, Φ(x, y)⟩) ⟩ · Δ(y, ŷ)
    =: ⟨w̃, Φ̃(x, ŷ)⟩ · Δ(y, ŷ).

In particular, we can further assume that Φ̃ is additively decomposable of order ℓ: if there are maps ψ_i as in Definition 1, we can easily construct corresponding maps ψ̃_i by setting

  Φ̃(x, ŷ) = ( Φ(x, ŷ), 1 − ⟨w, Φ(x, y)⟩ ) = ( ∑_{i=1}^{T} ψ_i(x, π_i^ℓ(ŷ)), 1 − ⟨w, Φ(x, y)⟩ )
           = ∑_{i=1}^{T} ( ψ_i(x, π_i^ℓ(ŷ)), (1 − ⟨w, Φ(x, y)⟩)/T )  =: ∑_{i=1}^{T} ψ̃_i(x, π_i^ℓ(ŷ)).

On the other hand, for margin rescaling, we have to solve the following inference problem:

  y* = arg max_{ŷ∈Y} Δ(y, ŷ) − ⟨w, Φ(x, y) − Φ(x, ŷ)⟩

which is equivalent to

  y* = arg max_{ŷ∈Y} Δ(y, ŷ) + ⟨w, Φ(x, ŷ)⟩

since ⟨w, Φ(x, y)⟩ is constant over all ŷ ∈ Y and therefore has no influence on the optimal value. Therefore, for both scaling methods, we can consider the more general problem

  y* = arg max_{ŷ∈Y} ⟨w, Φ(x, ŷ)⟩ ⊙ Δ(y, ŷ),   ⊙ ∈ {·, +}

which for Δ = h ∘ Δ_H can be solved (according to Proposition 1) using Algorithm 1 in O(T² · N^{ℓ+1}) time.

It is also worth mentioning that for the special case ⊙ = +, if h is an affine function, we can always find an optimal solution of (14) in linear time with respect to the length of the sequences. For that case, we can even use more general loss functions instead of the Hamming loss, provided these functions can be decomposed in the same fashion as the feature map.

Definition 2 (Decomposability of Loss Functions): We say a loss function

  Δ : Y × Y → R,  (y, ŷ) ↦ Δ(y, ŷ)   (17)

is additively decomposable of order ℓ if for each i ∈ {1, ..., T} there is a map

  δ_i : (y, π_i^ℓ(ŷ)) ↦ δ_i(y, π_i^ℓ(ŷ)) ∈ R   (18)

such that for all y, ŷ the following equality holds:

  Δ(y, ŷ) = ∑_{i=1}^{T} δ_i(y, π_i^ℓ(ŷ)).   (19)

An obvious example of such functions is the Hamming loss, where ℓ = 0 and δ_i(y, π_i^0(ŷ)) = 1[y_i ≠ ŷ_i]. A more general function, which cannot be handled by Algorithm 1, is the following weighted Hamming loss.

Example 4: The weighted Hamming loss on segments of length ℓ,

  W_H : Y × Y → R,  (y, ŷ) ↦ ∑_{i=1}^{T} w_i(π_i^ℓ(y), π_i^ℓ(ŷ))   (20)

where, for i = 1, ..., T, w_i : (π_i^ℓ(y), π_i^ℓ(ŷ)) ↦ w_i(π_i^ℓ(y), π_i^ℓ(ŷ)) are arbitrary functions, obviously is additively decomposable of order ℓ.

Similar to the decomposability of joint feature maps, if a loss function is additively decomposable of order ℓ, it can be written as a sum of further functions, where each of them depends only locally on the sequence y in the sense of the map π_i^ℓ(y). Using Definition 2, we can now formulate the following statement.

Proposition 3: Let h : R → R be an arbitrary affine function and (x, y) ∈ X × Y a given data pair. If the joint feature map Φ as well as the loss function Δ are additively decomposable of orders ℓ1 and ℓ2, respectively, then we can solve the problem

  y* = arg max_{ŷ∈Y} ⟨w, Φ(x, ŷ)⟩ + h(Δ(y, ŷ))   (21)

in O(T · N^{max{ℓ1,ℓ2}+1}) time using the generalized Viterbi's algorithm.

Proof: We note that since h is an affine function, there are a, b ∈ R so that h(x) = a · x + b. It suffices to show that we can write (21) in the form of (10) as follows:

  ⟨w, Φ(x, ŷ)⟩ + h(Δ(y, ŷ)) = ⟨w, ∑_{j=1}^{T} ψ_j(x, π_j^{ℓ1}(ŷ))⟩ + a · ∑_{i=1}^{T} δ_i(y, π_i^{ℓ2}(ŷ)) + b   (∗)
    = ∑_{i=1}^{T} ( ⟨w, ψ_i(x, π_i^{ℓ1}(ŷ))⟩ + a · δ_i(y, π_i^{ℓ2}(ŷ)) ) + b  =: ∑_{i=1}^{T} g_i(π_i^{max{ℓ1,ℓ2}}(ŷ)) + b

where in (∗), we used the corresponding assumption about the decomposability of Φ and Δ. Since b is a constant, i.e., it has no influence on the optimal value, we can use the generalized Viterbi's algorithm (see Appendix A) to find an optimal solution of the corresponding problem in O(T · N^{max{ℓ1,ℓ2}+1}) time.
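As an illustration of the proof above, the following hedged sketch folds an affine wrapper h(x) = a · x + b of the Hamming loss into the local scores g_t, so that the additive (margin-rescaling) case can be handled by the plain order-1 Viterbi recursion; viterbi_order1 refers to the earlier sketch in this document and is not part of the paper.

```python
def margin_rescaling_scores(g, y, a):
    """g'_t(i, j) = g_t(i, j) + a * 1[j != y_t]: fold an affine h(x) = a*x + b into the local scores.

    The constant b is dropped; it does not change the argmax (cf. the proof above).
    """
    def g_aug(t, prev, cur):
        return g(t, prev, cur) + (a if cur != y[t - 1] else 0.0)
    return g_aug

# usage sketch, reusing the earlier order-1 Viterbi:
# y_star, _ = viterbi_order1(T=len(y), N=N, g=margin_rescaling_scores(g, y, a=1.0))
```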


Fig. 4. Illustration of the running time (in seconds), averaged over 10 repetitions, of the loss augmented Viterbi's algorithm, the generalized Viterbi's algorithm, and a brute-force algorithm via exhaustive search, in dependence on the length of the involved sequences for N = 4. Exhaustive search already requires about 160 s for short sequences of length T = 10, while the computation time of the generalized Viterbi's algorithm for sequence length T = 2000 is less than 0.2 s.

IV. EXPERIMENTS AND EMPIRICAL ANALYSIS

Let us first take a moment to look at the inference problems (14) independently of structured output learning. To get an idea of how much computation time the presented algorithms consume in practice, we show the corresponding running time as a function of the sequence length in Fig. 4. (The experiment was performed on a Mac with a 3.06-GHz Intel Core 2 Duo processor.) As one can see, for the sequence length T = 2000, it takes about 1 min to compute a maximally violated constraint via Algorithm 1, restricting the number of effectively handleable training points to a few hundred examples. Furthermore, we note a difference between slack and margin rescaling because the corresponding loss function contributes differently to the objective value. Fig. 5 shows how the objective values are distributed over the length of the involved sequences and the values of the corresponding Hamming loss with respect to the true output. As one could expect, the objective value of both scaling methods grows with increasing sequence length and Hamming loss. The difference is that the highest objective values for slack rescaling (due to the multiplicative contribution of the loss) are more concentrated in the lower right corner of the heat map, corresponding to the longest sequences with the biggest Hamming loss. For margin rescaling, where the loss contributes additively, the distribution of the corresponding objective values is smoother. In addition, the range of the values taken differs significantly: for margin rescaling, the values vary between zero and 10^5, and for slack rescaling, between zero and 10^8, suggesting that the objective value for the latter grows much faster with increasing sequence length. In the following, we present the empirical results of experiments on synthetic data simulating the computational biology problem of gene finding. In particular, since we know the underlying distribution generating the data, we investigate the effects of the presented algorithm in dependence on different problem settings by using a cutting-plane algorithm for training.

Fig. 5. Illustration of the distribution of the highest objective values of (14) for margin (⊙ = +) and slack rescaling (⊙ = ·) in dependence on the sequence length (vertical axis) and the Hamming loss with respect to the true output (horizontal axis). The pictures were constructed by averaging over 10 randomly generated data sets, where the values g_t(i, j) were drawn uniformly at random from the interval [−50, 50], with N = 4 and h being the identity.

A. Data Generation

We generate artificial data sets with X = R^{r×T} and Y = {1, −1}^T, where r = 4 is the number of observation features and T = 100 is the length of the sequences. Each data point (x, y) ∈ X × Y was constructed in the following way. We first initialize a label sequence y ∈ Y with entries −1 and then introduce positive blocks, where we choose the number, the position, and the length of the latter uniformly at random from a finite set of values: the minimal and maximal number of positive blocks are zero and four, and the minimal and maximal length of positive blocks are 10 and 40, respectively. Furthermore, the example sequences generated in this way are extended by start and end states, resulting in a data model represented by the state diagram in Fig. 6 on the right. The corresponding observation sequences were generated by adding Gaussian noise with zero mean and standard deviation σ ∈ {1, 2, 5, 7, 10, 12} to the label sequence.

B. Experimental Setting

We performed a number of experiments by training via the cutting-plane algorithm on different data sets generated by varying the standard deviation σ of the Gaussian noise. As a joint feature map, we used one similar to that in (1), where we subdivided the continuous signals of the observation sequences by additionally using histogram features with 10 bins. For each value of σ, we first generated 10 different data sets, each consisting of 3000 examples, which we then partitioned into training, validation, and test sets of equal size (1000 examples in each set). During the training phase, we also systematically changed the size of the training set by monotonically increasing the number of training examples. The soft margin hyperparameter C was tuned by grid search over a range of values {10^i | i = −8, −7, ..., 6} on the validation set (optimal C values are attained in the interior of the grid).
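The following is a hedged sketch of the generation procedure described in Section IV-A, with the block counts, block lengths, and noise levels stated there; the start and end states of the state diagram in Fig. 6 are omitted, and overlapping blocks are not prevented.

```python
import numpy as np

def generate_example(T=100, r=4, sigma=1.0, rng=None):
    """One synthetic (x, y) pair: a {-1, +1} label sequence with random positive blocks plus noise."""
    rng = rng or np.random.default_rng()
    y = -np.ones(T, dtype=int)
    for _ in range(rng.integers(0, 5)):          # number of positive blocks in {0, ..., 4}
        length = rng.integers(10, 41)            # block length in {10, ..., 40}
        start = rng.integers(0, T - length + 1)  # block position
        y[start:start + length] = 1
    x = y + sigma * rng.standard_normal((r, T))  # r noisy observation rows, cf. Fig. 6
    return x, y

x, y = generate_example(sigma=2.0, rng=np.random.default_rng(0))
```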

Fig. 6. Illustration of the data model for the synthetic experiment. The graphic on the left shows a visualization of an example from the synthetic data. The first row (red) represents the label sequence consisting of two positive (+1) and three negative (−1) blocks. The remaining rows (blue) are the corresponding observation sequences generated by adding Gaussian noise with mean μ = 0 and standard deviation σ = 1 to the label sequence. The graph on the right provides a corresponding state diagram describing the data model. There can be arbitrarily many positive and negative states in a label sequence, while the first site always corresponds to the start state and the last site to the end state.

Fig. 7. (a) and (b) Averaged accuracy and averaged number of iterations taken by the cutting-plane algorithm for slack rescaling over 10 different synthetic data sets as a function of the number of training examples. Each curve corresponds to a fixed value of the standard deviation σ ∈ {1, 2, 5, 7, 10, 12} of the Gaussian noise in the data.

C. Results

Fig. 7 shows the experimental results for slack rescaling in terms of the averaged accuracy (over 10 different data sets) and the averaged number of iterations taken by the cutting-plane algorithm, in dependence on the number of training examples and the standard deviation σ of the noise. It shows a consistent behavior in the sense that the prediction accuracy increases with a growing number of training examples and decreases with an increasing amount of noise, while the number of iterations taken by the cutting-plane algorithm decreases with an increasing number of training examples. For margin rescaling, we obtain comparable results on our synthetic example.

V. CONCLUSION

Label sequence learning is a versatile but costly-to-optimize tool for learning when a joint feature map can define the dependencies between inputs and outputs. We have addressed the computational problems inherent to its inference algorithms. We contributed by defining a general framework (including Algorithm 1, the loss augmented Viterbi) based on the decomposability of the underlying joint feature map. Combined with a cutting-plane algorithm, it enables efficient training of structural SVMs for label sequence learning via slack rescaling in quadratic time with respect to the length of the sequences. While the more frequently used learning approach of margin rescaling is limited to the cases where the loss function can be decomposed in the same manner as the joint feature map, Algorithm 1 extends the set of manageable loss functions to include a Hamming loss rescaled in an arbitrary way, which in general yields a nondecomposable loss function and therefore cannot be handled by the previously used inference approach via the generalized Viterbi's algorithm. Our approach thus presents an abstract unified view on inference problems, which is decoupled from the underlying learning task. Once the decomposability property of the joint feature map holds true, we can handle different problem types independently of the concrete form of the joint feature map. In label sequence learning, the decomposability is naturally fulfilled; in addition, since we aim at predicting a sequence of labels, a natural choice for the loss function Δ is the number of incorrect labels or label segments. Therefore, the Hamming loss already seems to be appropriate for many generic tasks, for example, gene finding/annotation, named entity recognition, or part-of-speech tagging, just to mention a few. The use of an additional wrapper function h also extends the set of manageable loss functions, so that the Hamming loss can further be scaled in an arbitrary (especially nonlinear) way. The reason for the efficiency lies in the form of the dependencies between different labels in sequences modeled by a joint feature map. The concept of decomposability enables us to divide a given inference problem into smaller overlapping subproblems, which we need to compute only once and whose results we can save for reuse. This dramatically reduces the computational time. Finally, we can compute the actual


solution from the solutions to the subproblems based on the notion of optimal substructure of the underlying task. In other words, the concept of decomposability of joint feature maps makes the technique of dynamic programming applicable to the task of label sequence learning. The range of potential applications that can be handled by our approach is limited by the length of the involved sequences. Due to the quadratic running time (with respect to the sequence length) of the presented algorithm for the loss augmented inference, in practice we can only effectively address inference tasks where the corresponding sequences are no longer than 2000–3000 sites. Otherwise, longer computation and training times are to be expected. Future work will apply the novel techniques developed here to other structured learning scenarios and also study gene finding in computational biology, where slack rescaling is thought to be beneficial.

APPENDIX A
GENERALIZED VITERBI'S ALGORITHM

As mentioned in Section III, we can solve (10) by a generalized version of Viterbi's algorithm, given in Algorithm 2.

Algorithm 2 (Generalized Viterbi's Algorithm): Find an optimal solution y* of (10).

APPENDIX B
PROOF OF THEOREM 1

Proof: For this proof, we introduce the following short notation:

  max_{t,k | j_1, ..., j_ℓ}  :=  max_{ŷ=(ŷ_1, ..., ŷ_{t−ℓ}, j_1, ..., j_ℓ): Δ_H(y, ŷ)=k}

where we optimize over the variables ŷ = (ŷ_1, ..., ŷ_t) restricted to the subset with ŷ_{t−ℓ+1} = j_1, ..., ŷ_t = j_ℓ and Δ_H(y, ŷ) = k with respect to the given sequence y.

Case 1 (j_ℓ ≠ y_t): By the definition of the term L_{t−1,k−1}(i, j_1, ..., j_{ℓ−1}) in (15), we have

  max_{1≤i≤N} L_{t−1,k−1}(i, j_1, ..., j_{ℓ−1}) + g_t(i, j_1, ..., j_ℓ)
    = max_{1≤i≤N}  max_{t−1,k−1 | i, j_1, ..., j_{ℓ−1}}  ∑_{m=1}^{t−1} g_m(π_m^ℓ(ŷ)) + g_t(i, j_1, ..., j_ℓ)
    =(∗)  max_{t,k | j_1, ..., j_ℓ}  ∑_{i=1}^{t} g_i(π_i^ℓ(ŷ))  =  L_{t,k}(j_1, ..., j_ℓ)

where for (∗) we merged the two max terms using that by assumption j_ℓ ≠ y_t, which implies that the value of the Hamming loss increases from k − 1 to k.

Case 2 (j_ℓ = y_t): By the definition of the term L_{t−1,k}(i, j_1, ..., j_{ℓ−1}) in (15), we have

  max_{1≤i≤N} L_{t−1,k}(i, j_1, ..., j_{ℓ−1}) + g_t(i, j_1, ..., j_ℓ)
    = max_{1≤i≤N}  max_{t−1,k | i, j_1, ..., j_{ℓ−1}}  ∑_{m=1}^{t−1} g_m(π_m^ℓ(ŷ)) + g_t(i, j_1, ..., j_ℓ)
    =(∗)  max_{t,k | j_1, ..., j_ℓ}  ∑_{i=1}^{t} g_i(π_i^ℓ(ŷ))  =  L_{t,k}(j_1, ..., j_ℓ)

where for (∗), we merged the two max terms using that by assumption j_ℓ = y_t, which implies that the value of the Hamming loss does not change.

ACKNOWLEDGMENT

The authors would like to thank G. Rätsch for helpful discussions.

REFERENCES

[1] I. Tsochantaridis, T. Joachims, T. Hofmann, and Y. Altun, “Large margin methods for structured and interdependent output variables,” J. Mach. Learn. Res., vol. 6, pp. 1453–1484, Sep. 2005.
[2] T. Joachims, T. Hofmann, Y. Yue, and C.-N. Yu, “Predicting structured objects with support vector machines,” Commun. ACM, vol. 52, no. 11, pp. 97–104, Nov. 2009.
[3] S. Sarawagi and R. Gupta, “Accurate max-margin training for structured output spaces,” in Proc. 25th ICML, 2008, pp. 888–895.
[4] Y. Altun, I. Tsochantaridis, and T. Hofmann, “Hidden Markov support vector machines,” in Proc. ICML, 2003, pp. 1–8.
[5] Y. Altun and T. Hofmann, “Large margin methods for label sequence learning,” in Proc. 8th Eur. Conf. Speech Commun. Technol., 2003, pp. 1–4.


[6] Y. Altun, T. Hofmann, and M. Johnson, “Discriminative learning for label sequences via boosting,” in Advances in Neural Information Processing Systems, vol. 15. Cambridge, MA, USA: MIT Press, 2002, pp. 977–984.
[7] Y. Altun, M. Johnson, and T. Hofmann, “Investigating loss functions and optimization methods for discriminative learning of label sequences,” in Proc. EMNLP, 2003, pp. 145–152.
[8] T. Joachims, T. Finley, and C.-N. J. Yu, “Cutting-plane training of structural SVMs,” Mach. Learn., vol. 77, no. 1, pp. 27–59, Oct. 2009.
[9] H. Xue, S. Chen, and Q. Yang, “Structural support vector machine,” in Proc. 5th ISNN, 2008, pp. 501–511.
[10] L. R. Rabiner, “A tutorial on hidden Markov models and selected applications in speech recognition,” Proc. IEEE, vol. 77, no. 2, pp. 257–286, Feb. 1989.
[11] R. Esposito and D. P. Radicioni, “CarpeDiem: Optimizing the Viterbi algorithm and applications to supervised sequential learning,” J. Mach. Learn. Res., vol. 10, pp. 1851–1880, Jan. 2009.
[12] J. S. Reeve, “A parallel Viterbi decoding algorithm,” Concurrency Comput., Pract. Exper., vol. 13, no. 2, pp. 95–102, 2000.
[13] L. R. Rabiner and B. H. Juang, “An introduction to hidden Markov models,” IEEE ASSP Mag., vol. 3, no. 1, pp. 4–16, Jan. 1986.
[14] A. Krogh, M. Brown, I. S. Mian, K. Sjölander, and D. Haussler, “Hidden Markov models in computational biology: Applications to protein modeling,” J. Molecular Biol., vol. 235, no. 5, pp. 1501–1531, 1994.
[15] T. H. Cormen, C. E. Leiserson, R. L. Rivest, and C. Stein, Introduction to Algorithms, 3rd ed. Cambridge, MA, USA: MIT Press, 2009.
[16] G. D. Forney, “The Viterbi algorithm,” Proc. IEEE, vol. 61, no. 3, pp. 268–278, Mar. 1973.
[17] K. Crammer and Y. Singer, “On the algorithmic implementation of multiclass kernel-based vector machines,” J. Mach. Learn. Res., vol. 2, pp. 265–292, Jan. 2001.
[18] I. Tsochantaridis, T. Hofmann, T. Joachims, and Y. Altun, “Support vector machine learning for interdependent and structured output spaces,” in Proc. 21st ICML, 2004, pp. 1–8.
[19] Y. Altun, D. McAllester, and M. Belkin, “Maximum margin semi-supervised learning for structured variables,” in Advances in Neural Information Processing Systems, vol. 18. Cambridge, MA, USA: MIT Press, 2005.
[20] B. Taskar, C. Guestrin, and D. Koller, “Max-margin Markov networks,” in Advances in Neural Information Processing Systems. Cambridge, MA, USA: MIT Press, 2003.
[21] H. Daumé III and D. Marcu, “Learning as search optimization: Approximate large margin methods for structured prediction,” in Proc. ICML, 2005, pp. 169–176.
[22] J. Lafferty, A. McCallum, and F. Pereira, “Conditional random fields: Probabilistic models for segmenting and labeling sequence data,” in Proc. 18th ICML, 2001, pp. 282–289.
[23] C. H. Lampert, “Maximum margin multi-label structured prediction,” in Advances in Neural Information Processing Systems. Cambridge, MA, USA: MIT Press, 2011, pp. 289–297.
[24] P. Pletscher, C. S. Ong, and J. M. Buhmann, “Entropy and margin maximization for structured output learning,” in Proc. ECML PKDD, 2010, pp. 83–98.
[25] C.-N. J. Yu and T. Joachims, “Learning structural SVMs with latent variables,” in Proc. 26th Annu. Int. Conf. Mach. Learn., 2009, pp. 1169–1176.
[26] G. Rätsch and S. Sonnenburg, “Large scale hidden semi-Markov SVMs,” in Advances in Neural Information Processing Systems. Cambridge, MA, USA: MIT Press, 2006.
[27] W. Zhong, W. Pan, J. T. Kwok, and I. W. Tsang, “Incorporating the loss function into discriminative clustering of structured outputs,” IEEE Trans. Neural Netw., vol. 21, no. 10, pp. 1564–1575, Oct. 2010.
[28] B. Zhao, J. Kwok, and C. Zhang, “Maximum margin clustering with multivariate loss function,” in Proc. 9th IEEE ICDM, Dec. 2009, pp. 637–646.
[29] C. Cortes, M. Mohri, and J. Weston, “A general regression technique for learning transductions,” in Proc. ICML, 2005, pp. 153–160.
[30] N. Kaji, Y. Fujiwara, N. Yoshinaga, and M. Kitsuregawa, “Efficient staggered decoding for sequence labeling,” in Proc. 48th Annu. Meeting Assoc. Comput. Linguist., 2010, pp. 485–494.
[31] S. Chatterjee and S. Russell, “A temporally abstracted Viterbi algorithm,” in Proc. UAI, 2011, pp. 96–104.

Alexander Bauer received the B.Sc. degree in mathematics and the Diploma degree in computer science from Technische Universität Berlin, Berlin, Germany, in 2011 and 2012, respectively. He is currently pursuing the Doctoral degree with the Machine Learning Group, Technische Universität Berlin, headed by K.-R. Müller. His current research interests include different methods in the field of machine learning, in particular, learning with structured outputs.

Nico Görnitz received the Diploma degree in computer engineering from Technische Universität Berlin, Berlin, Germany. Currently, he is pursuing the Ph.D. degree with the Machine Learning Program, Berlin Institute of Technology, Berlin, headed by K.-R. Müller. He was with the Friedrich Miescher Laboratory, Max Planck Society, Tübingen, Germany, from 2010 to 2011, where he was co-advised by G. Rätsch. He is working on level-set estimation, semi-supervised learning, and anomaly detection as well as corresponding optimization methods. His current research interests include machine learning and its applications in computational biology and computer security, especially structured output prediction, specifically large-scale label sequence learning for transcript prediction.

Franziska Biegler received the Diploma degree in computer science from the University of Potsdam, Potsdam, Germany, in 2005, and the Ph.D. degree in computer science from the University of Western Ontario, London, ON, Canada, in 2009. Her Diploma and Ph.D. research was in the area of formal languages and combinatorics on words. She has been a Post-Doctoral Fellow with Klaus-Robert Müller's Research Group, Technische Universität Berlin, Berlin, Germany, since 2011. Her current research interests include accurate prediction of molecular properties as well as topics at the crossroads of machine learning and formal languages, e.g., automatic learning and analysis of sequences.

Klaus-Robert Müller received the Diploma degree in mathematical physics and the Ph.D. degree in computer science from Technische Universität Karlsruhe, Karlsruhe, Germany, in 1989 and 1992, respectively. He has been a Professor of computer science with Technische Universität Berlin, Berlin, Germany, since 2006, and he is directing the Bernstein Focus on Neurotechnology, Berlin. Since 2012, he has been a Distinguished Professor with Korea University, Seoul, Korea, within the WCU Program. After a post-doctoral position at GMD-FIRST, Berlin, he was a Research Fellow with the University of Tokyo, Tokyo, Japan, from 1994 to 1995. In 1995, he built up the Intelligent Data Analysis Group at GMD-FIRST (later Fraunhofer FIRST) and directed it until 2008. From 1999 to 2006, he was a Professor with the University of Potsdam, Potsdam, Germany. His current research interests include intelligent data analysis, machine learning, signal processing, and brain computer interfaces. Dr. Müller received the Olympus Prize of the German Pattern Recognition Society, DAGM, in 1999, and the SEL Alcatel Communication Award in 2006. In 2012, he was elected a member of the German National Academy of Sciences Leopoldina.


Marius Kloft received the Diploma degree in mathematics from Philipps-Universität Marburg, Marburg, Germany, with a thesis in algebraic geometry, and the Ph.D. degree in computer science from Technische Universität Berlin, Berlin, Germany, where he was co-advised by K.-R. Müller and G. Blanchard. He was with UC Berkeley, Berkeley, CA, USA, where he was advised by P. Bartlett. After research visits to the Friedrich Miescher Laboratory of the Max Planck Society, Tübingen, Germany, and Korea University, Seoul, Korea, he is currently a Post-Doctoral Fellow, jointly appointed with the Courant Institute of Mathematical Sciences and the Memorial Sloan-Kettering Cancer Center, New York, NY, USA, working with M. Mohri and G. Rätsch. His current research interests include machine learning with non-independently and identically distributed data and applications to genomic cancer data.
