jmlr jmlr2005 jmlr2005-42 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Ioannis Tsochantaridis, Thorsten Joachims, Thomas Hofmann, Yasemin Altun
Abstract: Learning general functional dependencies between arbitrary input and output spaces is one of the key challenges in computational intelligence. While recent progress in machine learning has mainly focused on designing flexible and powerful input representations, this paper addresses the complementary issue of designing classification algorithms that can deal with more complex outputs, such as trees, sequences, or sets. More generally, we consider problems involving multiple dependent output variables, structured output spaces, and classification problems with class attributes. In order to accomplish this, we propose to appropriately generalize the well-known notion of a separation margin and derive a corresponding maximum-margin formulation. While this leads to a quadratic program with a potentially prohibitive, i.e. exponential, number of constraints, we present a cutting plane algorithm that solves the optimization problem in polynomial time for a large class of problems. The proposed method has important applications in areas such as computational biology, natural language processing, information retrieval/extraction, and optical character recognition. Experiments from various domains involving different types of output spaces emphasize the breadth and generality of our approach.
Reference: text
sentIndex sentText sentNum sentScore
1 Unlike multiclass classification, where the output space consists of an arbitrary finite set of labels or class identifiers, Y = {1, . . . , K}. [sent-22, score-0.177]
2 Such problems arise in a variety of applications, ranging from multilabel classification and classification with class taxonomies, to label sequence learning, sequence alignment learning, and supervised grammar learning, to name just a few. [sent-27, score-0.462]
3 We approach these problems by generalizing large margin methods, more specifically multiclass support vector machines (SVMs) (Weston and Watkins, 1998; Crammer and Singer, 2001), to the broader problem of learning structured responses. [sent-34, score-0.244]
4 The structure of the output is modeled by a Markov network, and by employing a probabilistic interpretation of the dual formulation of the problem, Taskar et al. [sent-49, score-0.172]
5 The proposed reparameterization, however, assumes that the loss function can be decomposed in the same fashion as the feature map, and thus does not support arbitrary loss functions that may be appropriate for specific applications. [sent-51, score-0.234]
6 There, however, separate kernel functions are defined for the input and output space, with the idea of encoding a priori knowledge about the similarity or loss function in output space. [sent-54, score-0.201]
7 taxonomies, label sequence learning, sequence alignment, and natural language parsing. [sent-62, score-0.234]
8 The rest of the paper is organized as follows: Section 2 presents the general framework of large margin learning over structured output spaces using representations of input-output pairs via joint feature maps. [sent-65, score-0.257]
9 As an illustrative example, which we will continue to use as a prototypical application in the sequel, consider the case of natural language parsing, where the function f maps a given sentence x to a parse tree y. [sent-70, score-0.266]
10 Using again natural language parsing as an illustrative example, we can choose F such that we get a model that is isomorphic to a probabilistic context-free grammar (PCFG) (cf. [sent-81, score-0.283]
11 Each node in a parse tree y for a sentence x corresponds to a grammar rule g j , which in turn has a score w j . [sent-83, score-0.426]
12 This score can thus be written in the form of Equation (2), with Ψ(x, y) denoting a histogram vector of counts (how often each grammar rule g j occurs in the tree y). [sent-87, score-0.241]
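To make the histogram representation concrete, the following sketch (with a small hypothetical rule inventory and a hand-built parse, not taken from the paper) builds Ψ(x, y) as a vector of rule counts and evaluates the linear score ⟨w, Ψ(x, y)⟩:

```python
import numpy as np

# Hypothetical grammar with an indexed set of rules g_j.
RULES = ["S -> NP VP", "NP -> Det N", "VP -> V NP", "Det -> the", "N -> dog", "V -> saw"]
RULE_INDEX = {r: j for j, r in enumerate(RULES)}

def psi(tree_rules):
    """Joint feature map Psi(x, y): histogram of grammar-rule occurrences in parse tree y."""
    counts = np.zeros(len(RULES))
    for r in tree_rules:
        counts[RULE_INDEX[r]] += 1
    return counts

def score(w, tree_rules):
    """F(x, y; w) = <w, Psi(x, y)> -- linear in the rule-count histogram."""
    return float(np.dot(w, psi(tree_rules)))

# A toy parse of "the dog saw the dog", encoded as its multiset of rules.
tree = ["S -> NP VP", "NP -> Det N", "Det -> the", "N -> dog",
        "VP -> V NP", "V -> saw", "NP -> Det N", "Det -> the", "N -> dog"]
w = np.random.randn(len(RULES))
print(score(w, tree))
```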
13 For example, in natural language parsing, a parse tree that is almost correct and differs from the correct parse in only one or a few nodes should be treated differently from a parse tree that is completely different. [sent-91, score-0.503]
14 Typically, the correctness of a predicted parse tree is measured by its F1 score (see e. [sent-92, score-0.199]
15 Of course, P is unknown and, following the supervised learning paradigm, we assume that a finite training set of pairs S = {(xi , yi ) ∈ X × Y : i = 1, . . . , n} is given. [sent-102, score-0.186]
16 The condition of zero training error can then be compactly written as a set of nonlinear constraints: max_{y∈Y\yi} ⟨w, Ψ(xi , y)⟩ ≤ ⟨w, Ψ(xi , yi )⟩ . [sent-119, score-0.261]
17 ∀i ∈ {1, . . . , n}, ∀y ∈ Y \ yi : ⟨w, Ψ(xi , yi ) − Ψ(xi , y)⟩ ≥ 0 . (4) [sent-127, score-0.372]
18 As we will often encounter terms involving feature vector differences of the type appearing in Equation (4), we define δΨi (y) ≡ Ψ(xi , yi ) − Ψ(xi , y), so that the constraints can be more compactly written as ⟨w, δΨi (y)⟩ ≥ 0. [sent-128, score-0.309]
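A minimal sketch of checking these constraints, assuming Ψ is available as a callable feature map returning vectors and the output set is small enough to enumerate (both assumptions are for illustration only):

```python
import numpy as np

def delta_psi(psi, x_i, y_i, y):
    """delta Psi_i(y) = Psi(x_i, y_i) - Psi(x_i, y)."""
    return psi(x_i, y_i) - psi(x_i, y)

def separates(w, psi, x_i, y_i, outputs):
    """Check <w, delta Psi_i(y)> >= 0 for every y != y_i (zero training error on example i)."""
    return all(np.dot(w, delta_psi(psi, x_i, y_i, y)) >= 0
               for y in outputs if y != y_i)
```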
19 the minimal difference between the score of the correct label yi and the closest runner-up ŷ(w) = argmax_{y≠yi} ⟨w, Ψ(xi , y)⟩ , is maximal. [sent-132, score-0.262]
20 As in the case of multiclass SVMs, there are at least two ways of introducing slack variables. [sent-146, score-0.259]
21 One may introduce a single slack variable ξi for violations of the nonlinear constraints (i. [sent-147, score-0.242]
22 every instance xi and output y ≠ yi ) (Weston and Watkins, 1998; Har-Peled et al. [sent-151, score-0.24]
23 Adding a penalty term that is linear in the slack variables to the objective results in the quadratic program SVM1 : min_{w,ξ} (1/2) ‖w‖² + (C/n) ∑_{i=1}^{n} ξi (7) s.t. [sent-156, score-0.185]
24 3 General Loss Functions: Slack Re-scaling. The first approach we propose for the case of arbitrary loss functions is to re-scale the slack variables according to the loss incurred in each of the linear constraints. [sent-165, score-0.322]
25 Intuitively, violating a margin constraint involving a y ≠ yi with high loss Δ(yi , y) should be penalized more severely than a violation involving an output value with smaller loss. [sent-166, score-0.453]
26 This can be accomplished by multiplying the margin violation by the loss, or equivalently, by scaling the slack variable with the inverse loss, which yields SVM1 s : min_{w,ξ} (1/2) ‖w‖² + (C/n) ∑_{i=1}^{n} ξi s.t. [sent-167, score-0.206]
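The slack implied by these re-scaled constraints for a single example can be read off directly, as in the following sketch; the loss and feature-map callables are placeholders, not the paper's implementation:

```python
import numpy as np

def slack_rescaled_xi(w, psi, loss, x_i, y_i, outputs):
    """Smallest feasible slack for example i under slack re-scaling:
    xi_i = max_{y != y_i} loss(y_i, y) * max(0, 1 - <w, delta Psi_i(y)>)."""
    xi = 0.0
    for y in outputs:
        if y == y_i:
            continue
        margin = float(np.dot(w, psi(x_i, y_i) - psi(x_i, y)))
        xi = max(xi, loss(y_i, y) * max(0.0, 1.0 - margin))
    return xi
```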
27 Case 1: If f (xi ; w) = yi , then ξ∗i ≥ 0 = Δ(yi , f (xi ; w)) and the loss is trivially upper bounded. [sent-174, score-0.279]
28 Case 2: If ŷ ≡ f (xi ; w) ≠ yi , then ⟨w, δΨi (ŷ)⟩ ≤ 0 and thus ξ∗i / Δ(yi , ŷ) ≥ 1, which is equivalent to ξ∗i ≥ Δ(yi , ŷ). [sent-175, score-0.186]
29 4 General Loss Functions: Margin Re-scaling. In addition to this slack re-scaling approach, a second way to include loss functions is to re-scale the margin, as proposed by Taskar et al. [sent-180, score-0.299]
30 Proof First note that each w is feasible for SVM1 s and SVM1 m in the sense that we can find slack variables such that all the constraints are satisfied. [sent-198, score-0.211]
31 In contrast, the margin re-scaling formulation is not invariant under scaling of the loss function. [sent-203, score-0.163]
32 A second disadvantage of the margin scaling approach is that it potentially gives significant weight to output values y ∈ Y that are not even close to being confusable with the target values yi , because every increase in the loss increases the required margin. [sent-207, score-0.403]
33 If one interprets F(xi , yi ; w) − F(xi , y; w) as a log odds ratio of an exponential family model (Smola and Hofmann, 2003), then the margin constraints may be dominated by incorrect values y that are exponentially less likely than the target value. [sent-208, score-0.331]
34 Since the algorithm operates on the dual program, we will first derive the Wolfe dual for the various soft margin formulations. [sent-225, score-0.306]
35 1 Dual Programs. We will denote by α(iy) the Lagrange multiplier enforcing the margin constraint for label y ≠ yi and example (xi , yi ). [sent-227, score-0.538]
36 Using standard Lagrangian duality techniques, one arrives at the following dual quadratic program (QP). [sent-228, score-0.167]
37 Proposition 4 The objective of the dual problem of SVM0 from Equation (6) is given by Θ(α) ≡ −(1/2) ∑_{i, y≠yi} ∑_{j, ȳ≠yj} α(iy) α(jȳ) J(iy)(jȳ) + ∑_{i, y≠yi} α(iy) , where J(iy)(jȳ) = ⟨δΨi (y), δΨj (ȳ)⟩ . [sent-229, score-0.165]
38 The dual QP can be formulated as α∗ = argmax_α Θ(α), s.t. [sent-230, score-0.186]
39 Proof (sketch) Forming the Lagrangian function and eliminating the primal variables w by using the optimality condition w∗ (α) = ∑_i ∑_{y≠yi} α(iy) δΨi (y) directly leads to the above dual program. [sent-233, score-0.165]
40 Notice that the J function that generates the quadratic form in the dual objective can be computed from inner products involving values of Ψ, which is a simple consequence of the linearity of the inner product. [sent-234, score-0.165]
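A sketch of this expansion by bilinearity, assuming a joint kernel K((x, y), (x′, y′)) = ⟨Ψ(x, y), Ψ(x′, y′)⟩ is supplied (here built from an explicit feature map for illustration):

```python
import numpy as np

def linear_joint_kernel(psi):
    """K((x, y), (x', y')) = <Psi(x, y), Psi(x', y')> for an explicit feature map psi."""
    return lambda x, y, x2, y2: float(np.dot(psi(x, y), psi(x2, y2)))

def J(joint_kernel, x_i, y_i, y, x_j, y_j, ybar):
    """J_{(i,y)(j,ybar)} = <delta Psi_i(y), delta Psi_j(ybar)>, expanded by bilinearity
    into four evaluations of the joint kernel K."""
    return (joint_kernel(x_i, y_i, x_j, y_j) - joint_kernel(x_i, y_i, x_j, ybar)
            - joint_kernel(x_i, y, x_j, y_j) + joint_kernel(x_i, y, x_j, ybar))
```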
41 Proposition 5 The dual problem to SVM1 is given by the program in Proposition 4 with additional constraints ∑_{y≠yi} α(iy) ≤ C/n , ∀i = 1, . . . , n. [sent-237, score-0.242]
42 Proposition 6 The dual problem to SVM2 is given by the program in Proposition 4 with modified kernel function J(iy)(jȳ) ≡ ⟨δΨi (y), δΨj (ȳ)⟩ + δ(i, j) n/C . [sent-242, score-0.167]
43 In the non-separable case with slack re-scaling, the loss function is introduced in the constraints for linear penalties and in the kernel function for quadratic penalties. [sent-243, score-0.304]
44 Proposition 7 The dual problem to SVM1 s is given by the program in Proposition 4 with additional constraints ∑_{y≠yi} α(iy) / Δ(yi , y) ≤ C/n , ∀i = 1, . . . , n. [sent-244, score-0.242]
45 Proposition 8 The dual problem to SVM2 s is given by the program in Proposition 4 with modified kernel function J(iy)(jȳ) = ⟨δΨi (y), δΨj (ȳ)⟩ + δ(i, j) n / (C √(Δ(yi , y) Δ(yj , ȳ))) . [sent-248, score-0.167]
46 More precisely, we will construct a nested sequence of successively tighter relaxations of the original problem using a cutting plane method (Kelley, 1960), implemented as a variable selection approach in the dual formulation. [sent-252, score-0.282]
47 We will later show that this is a valid strategy, since there always exists a polynomially sized subset of constraints so that the solution of the relaxed problem defined by this subset fulfills all constraints from the full optimization problem up to a precision of ε. [sent-255, score-0.182]
48 We will base the optimization on the dual program formulation which has two important advantages over the primal QP. [sent-257, score-0.246]
49 Second, the constraint matrix of the dual program supports a natural problem decomposition. [sent-259, score-0.217]
50 Iterating through the training examples (xi , yi ), the algorithm proceeds by finding the (potentially) “most violated” constraint for xi , involving some output value ŷ. [sent-264, score-0.29]
51 If the (appropriately scaled) margin violation of this constraint exceeds the current value of ξi by more than ε, the dual variable corresponding to ŷ is added to the working set. [sent-265, score-0.272]
52 This variable selection process in the dual program corresponds to a successive strengthening of the primal problem by a cutting plane that cuts off the current primal solution from the feasible set. [sent-266, score-0.361]
53 To apply the algorithm, it is sufficient to implement the feature mapping Ψ(x, y) (either explicitly or via a joint kernel function), the loss function Δ(yi , y), as well as the maximization in step 6. [sent-293, score-0.175]
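A compact sketch of this working-set loop for the slack re-scaling variant, in the spirit of Algorithm 1; the separation oracle most_violated, the restricted QP solver solve_qp, and the feature map psi are stand-ins that a real implementation would specialize:

```python
import numpy as np

def cutting_plane(examples, psi, loss, solve_qp, most_violated, eps=1e-3, max_iter=100):
    """For each example, find the most violated constraint; if its violation exceeds the
    current slack xi_i by more than eps, add it to the working set S_i and re-solve the QP."""
    S = [[] for _ in examples]                       # working sets S_i of output values
    x0, y0 = examples[0]
    w = np.zeros_like(psi(x0, y0))                   # current primal solution
    for _ in range(max_iter):
        changed = False
        for i, (x_i, y_i) in enumerate(examples):
            y_hat = most_violated(w, x_i, y_i)       # separation oracle (step 6)
            cost = lambda y: loss(y_i, y) * (1.0 - float(np.dot(w, psi(x_i, y_i) - psi(x_i, y))))
            xi_i = max([0.0] + [cost(y) for y in S[i]])   # slack implied by the current S_i
            if cost(y_hat) > xi_i + eps:             # violated beyond the tolerance
                S[i].append(y_hat)
                w = solve_qp(examples, S)            # re-optimize over the union of working sets
                changed = True
        if not changed:
            break
    return w, S
```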
54 In the slack re-scaling setting, it turns out that for a given example (xi , yi ) we need to identify the maximum over ŷ ≡ argmax_{y∈Y} {(1 − ⟨w, δΨi (y)⟩) Δ(yi , y)} . [sent-296, score-0.322]
55 For example, in the case of grammar learning with the F1 score as the loss function, via Δ(yi , y) = 1 − F1 (yi , y), the maximum can be computed using a modified version of the CKY algorithm. [sent-299, score-0.27]
56 In cases where the loss function has an additive decomposition that is compatible with the feature map, one can fold the loss function contribution into the weight vector: ⟨w′, δΨi (y)⟩ = ⟨w, δΨi (y)⟩ − Δ(yi , y) for some w′. [sent-302, score-0.234]
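Under that margin re-scaling view, the separation oracle is loss-augmented decoding; a brute-force sketch over an explicitly enumerable output set (dynamic programming such as Viterbi or CKY replaces the enumeration in practice):

```python
import numpy as np

def most_violated_margin_rescaling(w, psi, loss, x_i, y_i, outputs):
    """Loss-augmented decoding for margin re-scaling:
    argmax_y { loss(y_i, y) - <w, delta Psi_i(y)> }."""
    best_y, best_val = None, -np.inf
    for y in outputs:
        val = loss(y_i, y) - float(np.dot(w, psi(x_i, y_i) - psi(x_i, y)))
        if val > best_val:
            best_y, best_val = y, val
    return best_y
```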
57 The second best solution is necessary to detect margin violations in cases where ŷ = yi , but ⟨w, δΨi (ỹ)⟩ < 1. [sent-307, score-0.287]
58 In the case of grammar learning, for example, we can use any existing parser that returns the two highest scoring parse trees. [sent-309, score-0.252]
59 3 Correctness and Complexity of the Algorithm. What we would like to accomplish first is to obtain a lower bound on the achievable improvement of the dual objective by selecting a single variable α(iŷ) and adding it to the dual problem (cf. [sent-313, score-0.283]
60 Proposition 14 (SVM2 s ) For SVM2 s step 10 in Algorithm 1 the improvement δΘ of the dual objective is lower bounded by δΘ ≥ (1/2) ε² / (Δi Ri² + n/C), where Δi ≡ max_y {Δ(yi , y)} and Ri ≡ max_y {‖δΨi (y)‖} . [sent-342, score-0.165]
61 Proposition 15 (SVM2 m ) For SVM2 m step 10 in Algorithm 1 the improvement δΘ of the dual objective is lower bounded by δΘ ≥ (1/2) ε² / (Ri² + n/C), where Ri = max_y ‖δΨi (y)‖ . [sent-348, score-0.165]
62 Proposition 16 (SVM1 s ) For SVM1 s step 10 in Algorithm 1 the improvement δΘ of the dual objective is lower bounded by δΘ ≥ min{ Cε/(2n) , ε² / (8 Δi² Ri²) }, where Δi = max_y {Δ(yi , y)} and Ri = max_y {‖δΨi (y)‖} . [sent-350, score-0.165]
63 Si = 0, then we may need to reduce dual variables α(iy) in order to get some slack for increasing the newly added α(iŷ). [sent-357, score-0.254]
64 Hence after t constraints, the dual objective will be at least t times these increments. [sent-376, score-0.165]
65 The result follows from the fact that the dual objective is upper bounded by the minimum of the primal, which in turn can be bounded by C Δ̄ and (1/2) C Δ̄² for SVM∗1 and SVM∗2 , respectively. [sent-377, score-0.165]
66 Now we can define a joint feature map for the multiclass problem by Ψ(x, y) ≡ Φ(x) ⊗ Λc (y) . [sent-408, score-0.205]
67 Proof For all yk ∈ Y : ⟨w, Ψ(x, yk )⟩ = ∑_{r=1}^{D·K} wr ψr (x, yk ) = ∑_{j=1}^{K} ∑_{d=1}^{D} v_{jd} φd (x) δ(j, k) = ∑_{d=1}^{D} v_{kd} φd (x) = ⟨vk , Φ(x)⟩ . [sent-411, score-0.186]
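A small sketch of this tensor-product construction with a one-of-K encoding Λc(y) (plain one-hot class attributes are assumed here; richer class-attribute vectors drop in the same way), including a numerical check of the stacking identity:

```python
import numpy as np

def joint_feature_map(phi_x, y, num_classes):
    """Psi(x, y) = Phi(x) (x) Lambda_c(y) with Lambda_c the one-of-K encoding of y.
    Stacking w = (v_1, ..., v_K) then gives <w, Psi(x, y_k)> = <v_k, Phi(x)>."""
    label_vec = np.zeros(num_classes)
    label_vec[y] = 1.0
    return np.kron(label_vec, phi_x)     # block k holds Phi(x); all other blocks are zero

phi_x = np.array([0.5, -1.0, 2.0])
K = 4
w = np.random.randn(K * phi_x.size)     # w = (v_0, ..., v_{K-1}) stacked
for k in range(K):
    v_k = w[k * phi_x.size:(k + 1) * phi_x.size]
    assert np.isclose(np.dot(w, joint_feature_map(phi_x, k, K)), np.dot(v_k, phi_x))
```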
68 2 Algorithms. It is usually assumed that the number of classes K in simple multiclass problems is small enough, so that an exhaustive search can be performed to maximize any objective over Y . [sent-414, score-0.17]
69 We can give this a simple interpretation: For each output feature λr a corresponding weight vector vr is introduced. [sent-440, score-0.167]
70 3 Label Sequence Learning. The next problem we would like to formulate in the joint feature map framework is the problem of label sequence learning, or sequence segmentation/annotation. [sent-486, score-0.256]
71 Hence each sequence of labels is considered to be a class of its own, resulting in a multiclass classification problem with |Σ|^T different classes. [sent-498, score-0.187]
72 To model label sequence learning in this manner would of course not be very useful if one were to apply standard multiclass classification methods. [sent-499, score-0.233]
73 1 Modeling. Inspired by hidden Markov model (HMM) type interactions, we propose to define Ψ to include interactions between input features and labels, via multiple copies of the input features, as well as features that model interactions between nearby label variables. [sent-503, score-0.229]
74 In particular, in order to find the best label sequence y ≠ yi , one can perform Viterbi decoding (Forney Jr., 1973). [sent-526, score-0.296]
75 Viterbi decoding can also be used with other loss functions by computing the maximization for all possible values of the loss function. [sent-528, score-0.186]
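A minimal Viterbi sketch for an HMM-style score with per-position emission weights and label-pair transition weights; this parameterization is a simplified stand-in for the paper's Ψ, not its exact feature set:

```python
import numpy as np

def viterbi(emission_scores, transition_scores):
    """Decode argmax_y sum_t emission[t, y_t] + sum_t transition[y_{t-1}, y_t].
    emission_scores: (T, L) array, transition_scores: (L, L) array."""
    T, L = emission_scores.shape
    delta = np.full((T, L), -np.inf)
    backptr = np.zeros((T, L), dtype=int)
    delta[0] = emission_scores[0]
    for t in range(1, T):
        for l in range(L):
            scores = delta[t - 1] + transition_scores[:, l] + emission_scores[t, l]
            backptr[t, l] = int(np.argmax(scores))
            delta[t, l] = scores[backptr[t, l]]
    # Backtrack the best label sequence.
    path = [int(np.argmax(delta[-1]))]
    for t in range(T - 1, 0, -1):
        path.append(backptr[t, path[-1]])
    return path[::-1], float(np.max(delta[-1]))
```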
76 For a given pair of sequences x ∈ Σ∗ and y ∈ Σ∗ , alignment methods like the Smith-Waterman algorithm select the sequence of operations (e. [sent-536, score-0.252]
77 For each native sequence xi there is a most similar homologous sequence yi along with the optimal alignment ai . [sent-546, score-0.593]
78 The goal is to learn a discriminant function f that recognizes the homologous sequence among the decoys. [sent-552, score-0.197]
79 In our approach, this corresponds to finding a weight vector w so that homologous sequences align to their native sequence with high score, and that the alignment scores for the decoy sequences are lower. [sent-553, score-0.525]
80 , yki } as the output space for the i-th example, we seek a w so that ⟨w, Ψ(xi , yi , ai )⟩ exceeds ⟨w, Ψ(xi , yti , a)⟩ for all t and a. [sent-557, score-0.302]
81 This implies a zero-one loss and hypotheses of the form f (xi ) = argmax_{y∈Yi} max_a ⟨w, Ψ(x, y, a)⟩ . (15) [sent-558, score-0.161]
82 The design of the feature map Ψ depends on the set of operations used in the sequence alignment algorithm. [sent-559, score-0.253]
83 2 Algorithms. In order to find the optimal alignment between a given native sequence x and a homologous/decoy sequence y as the solution of max_a ⟨w, Ψ(x, y, a)⟩ (16), we can use dynamic programming as e. [sent-562, score-0.332]
84 To solve the argmax in Equation (15), we assume that the number k of decoy sequences is small enough, so that we can select among the scores computed in Equation (16) via exhaustive search. [sent-565, score-0.165]
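A sketch of the inner maximization in Equation (16) as a standard Smith-Waterman recursion, with a substitution matrix and a linear gap penalty standing in for the learned operation scores w:

```python
import numpy as np

def smith_waterman_score(x, y, substitution, gap=-1.0):
    """Best-scoring local alignment of x and y, i.e. max_a <w, Psi(x, y, a)> when the
    operation scores are a substitution matrix plus a linear gap penalty."""
    n, m = len(x), len(y)
    H = np.zeros((n + 1, m + 1))
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            H[i, j] = max(0.0,
                          H[i - 1, j - 1] + substitution[(x[i - 1], y[j - 1])],
                          H[i - 1, j] + gap,      # deletion
                          H[i, j - 1] + gap)      # insertion
    return float(H.max())

subs = {(a, b): (2.0 if a == b else -1.0) for a in "ACGT" for b in "ACGT"}
print(smith_waterman_score("ACGTAC", "AGTAC", subs))
```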
85 We assume that each node in the tree corresponds to the application of a context-free grammar rule. [sent-576, score-0.254]
86 Grammar rules are of the form nl [Ci → C j ,Ck ] or nl [Ci → xt ], where Ci ,C j ,Ck ∈ N are non-terminal symbols, and xt ∈ T is a terminal symbol. [sent-584, score-0.188]
87 A particular kind of weighted context-free grammar is the probabilistic context-free grammar (PCFG), where this weight wl is the log-probability of expanding node Hi with rule nl . [sent-586, score-0.327]
88 In PCFGs, the individual node probabilities are assumed to be independent, so that the probability P(x, y) of sequence x and tree y is the product of the node probabilities in the tree. [sent-587, score-0.214]
89 The most likely parse tree to yield x from a designated start symbol is the predicted label h(x). [sent-588, score-0.215]
90 This leads to the following maximization problem, where we use rules(y) to denote the multi-set of nodes in y: h(x) = argmax_{y∈Y} P(y|x) = argmax_{y∈Y} ∑_{nl ∈ rules(y)} wl . [sent-589, score-0.178]
91 Ψ(x, y) contains one feature fi jk for each node of type ni jk [Ci → C j ,Ck ] and one feature fit for each node of type nit [Ci → xt ]. [sent-591, score-0.234]
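A sketch of the resulting maximization for a weighted grammar in Chomsky normal form, using a CKY-style chart; the toy grammar and weights below are hypothetical:

```python
import numpy as np

def weighted_cky(sentence, binary_rules, lexical_rules, start="S"):
    """Best parse score under a weighted CNF grammar: score(y) = sum of rule weights w_l
    over the nodes of y. binary_rules: {(A, B, C): w}, lexical_rules: {(A, word): w}."""
    n = len(sentence)
    chart = {}  # (i, j, A) -> best score of a constituent A spanning words i..j-1
    for i, word in enumerate(sentence):
        for (A, w_word), wt in lexical_rules.items():
            if w_word == word:
                chart[(i, i + 1, A)] = max(chart.get((i, i + 1, A), -np.inf), wt)
    for span in range(2, n + 1):
        for i in range(0, n - span + 1):
            j = i + span
            for k in range(i + 1, j):
                for (A, B, C), wt in binary_rules.items():
                    score = chart.get((i, k, B), -np.inf) + chart.get((k, j, C), -np.inf) + wt
                    if score > chart.get((i, j, A), -np.inf):
                        chart[(i, j, A)] = score
    return chart.get((0, n, start), -np.inf)

binary = {("S", "NP", "VP"): 0.5, ("NP", "Det", "N"): 0.2, ("VP", "V", "NP"): 0.3}
lexical = {("Det", "the"): 0.1, ("N", "dog"): 0.4, ("N", "cat"): 0.4, ("V", "saw"): 0.2}
print(weighted_cky("the dog saw the cat".split(), binary, lexical))
```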
92 As a suitable loss function Δ, we have used a tree loss function which defines the loss between two classes y and y′ as the height of the first common ancestor of y and y′ in the taxonomy. [sent-636, score-0.343]
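A sketch of this taxonomy loss under one reading of “height” (the number of steps from the deeper of the two classes up to their first common ancestor), assuming the taxonomy is given as a parent map:

```python
def tree_loss(y, y_prime, parent):
    """Loss between classes y and y_prime: steps from the deeper class up to their
    first common ancestor. `parent` maps each node to its parent (root is absent)."""
    if y == y_prime:
        return 0
    up_from_y = {}
    node, d = y, 0
    while node is not None:              # record distance from y to each ancestor
        up_from_y[node] = d
        node, d = parent.get(node), d + 1
    node, d = y_prime, 0
    while node is not None:              # climb from y_prime to the first shared ancestor
        if node in up_from_y:
            return max(d, up_from_y[node])
        node, d = parent.get(node), d + 1
    raise ValueError("y and y_prime do not share a common ancestor")

# Tiny hypothetical taxonomy: root -> {animals, vehicles}; animals -> {dog, cat}.
parent = {"dog": "animals", "cat": "animals", "animals": "root", "vehicles": "root"}
print(tree_loss("dog", "cat", parent))       # 1: first common ancestor is "animals"
print(tree_loss("dog", "vehicles", parent))  # 2: first common ancestor is "root"
```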
93 The results are summarized in Table 1 and show that the proposed hierarchical SVM learning architecture improves performance over the standard multiclass SVM in terms of classification accuracy as well as in terms of the tree loss. [sent-637, score-0.187]
94 Table 3 shows that all SVM formulations perform comparably, attributed to the fact that the vast majority of the support label sequences end up having Hamming distance 1 to the correct label sequence. [sent-679, score-0.197]
95 3 Sequence Alignment. To analyze the behavior of the algorithm for sequence alignment, we constructed a synthetic dataset according to the following sequence and local alignment model. [sent-682, score-0.269]
96 To generate the homologous sequence, we generate an alignment string of length 30 consisting of 4 characters: “match”, “substitute”, “insert”, “delete”. [sent-687, score-0.216]
97 The homologous sequence is created by applying the alignment string to a randomly selected substring of the native. [sent-694, score-0.28]
98 We model this problem using local sequence alignment with the Smith-Waterman algorithm. [sent-696, score-0.205]
99 sequence alignment model, where the substitution matrix is computed as Πij = log [ P(xi , zj ) / ( P(xi ) P(zj ) ) ] (see e. [sent-768, score-0.205]
100 We demonstrated that our approach is very general, covering problems from natural language parsing and label sequence learning to multilabel classification and classification with output features. [sent-865, score-0.3]
wordName wordTfidf (topN-words)
[('iy', 0.414), ('ltun', 0.2), ('oachims', 0.2), ('ofmann', 0.2), ('sochantaridis', 0.2), ('nterdependent', 0.188), ('tructured', 0.188), ('utput', 0.188), ('yi', 0.186), ('arge', 0.158), ('argin', 0.149), ('grammar', 0.147), ('alignment', 0.141), ('slack', 0.136), ('jy', 0.125), ('multiclass', 0.123), ('dual', 0.118), ('yt', 0.111), ('ethods', 0.109), ('parse', 0.105), ('maxy', 0.1), ('loss', 0.093), ('pcfg', 0.088), ('collins', 0.085), ('proposition', 0.084), ('parsing', 0.076), ('homologous', 0.075), ('constraints', 0.075), ('hofmann', 0.075), ('svm', 0.07), ('margin', 0.07), ('argmax', 0.068), ('ri', 0.067), ('taskar', 0.065), ('vr', 0.065), ('tree', 0.064), ('sequence', 0.064), ('argmaxy', 0.063), ('jrr', 0.063), ('native', 0.063), ('odeling', 0.063), ('parseness', 0.063), ('yk', 0.062), ('language', 0.06), ('formulations', 0.058), ('discriminant', 0.058), ('cutting', 0.056), ('output', 0.054), ('grammars', 0.053), ('viterbi', 0.053), ('altun', 0.053), ('xt', 0.052), ('structured', 0.051), ('tsochantaridis', 0.051), ('cky', 0.05), ('decoy', 0.05), ('manning', 0.05), ('constraint', 0.05), ('program', 0.049), ('feature', 0.048), ('sequences', 0.047), ('primal', 0.047), ('objective', 0.047), ('hmm', 0.047), ('label', 0.046), ('si', 0.045), ('plane', 0.044), ('node', 0.043), ('nl', 0.042), ('acc', 0.042), ('wl', 0.042), ('xti', 0.042), ('interactions', 0.039), ('weston', 0.039), ('align', 0.038), ('duffy', 0.038), ('eneral', 0.038), ('oss', 0.038), ('schuetze', 0.038), ('taxonomies', 0.038), ('sentence', 0.037), ('crammer', 0.036), ('perceptron', 0.035), ('features', 0.035), ('violated', 0.034), ('working', 0.034), ('entity', 0.034), ('qp', 0.034), ('ys', 0.034), ('joint', 0.034), ('notice', 0.034), ('generative', 0.033), ('optimization', 0.032), ('joachims', 0.032), ('cornell', 0.032), ('insertion', 0.032), ('wta', 0.032), ('violations', 0.031), ('lgorithms', 0.031), ('score', 0.03)]
simIndex simValue paperId paperTitle
same-paper 1 0.99999982 42 jmlr-2005-Large Margin Methods for Structured and Interdependent Output Variables
Author: Ioannis Tsochantaridis, Thorsten Joachims, Thomas Hofmann, Yasemin Altun
Abstract: Learning general functional dependencies between arbitrary input and output spaces is one of the key challenges in computational intelligence. While recent progress in machine learning has mainly focused on designing flexible and powerful input representations, this paper addresses the complementary issue of designing classification algorithms that can deal with more complex outputs, such as trees, sequences, or sets. More generally, we consider problems involving multiple dependent output variables, structured output spaces, and classification problems with class attributes. In order to accomplish this, we propose to appropriately generalize the well-known notion of a separation margin and derive a corresponding maximum-margin formulation. While this leads to a quadratic program with a potentially prohibitive, i.e. exponential, number of constraints, we present a cutting plane algorithm that solves the optimization problem in polynomial time for a large class of problems. The proposed method has important applications in areas such as computational biology, natural language processing, information retrieval/extraction, and optical character recognition. Experiments from various domains involving different types of output spaces emphasize the breadth and generality of our approach.
2 0.098982513 6 jmlr-2005-A Modified Finite Newton Method for Fast Solution of Large Scale Linear SVMs
Author: S. Sathiya Keerthi, Dennis DeCoste
Abstract: This paper develops a fast method for solving linear SVMs with L2 loss function that is suited for large scale data mining tasks such as text classification. This is done by modifying the finite Newton method of Mangasarian in several ways. Experiments indicate that the method is much faster than decomposition methods such as SVMlight , SMO and BSVM (e.g., 4-100 fold), especially when the number of examples is large. The paper also suggests ways of extending the method to other loss functions such as the modified Huber’s loss function and the L1 loss function, and also for solving ordinal regression. Keywords: linear SVMs, classification, conjugate gradient
3 0.096794434 58 jmlr-2005-Multiclass Classification with Multi-Prototype Support Vector Machines
Author: Fabio Aiolli, Alessandro Sperduti
Abstract: Winner-take-all multiclass classifiers are built on the top of a set of prototypes each representing one of the available classes. A pattern is then classified with the label associated to the most ‘similar’ prototype. Recent proposal of SVM extensions to multiclass can be considered instances of the same strategy with one prototype per class. The multi-prototype SVM proposed in this paper extends multiclass SVM to multiple prototypes per class. It allows to combine several vectors in a principled way to obtain large margin decision functions. For this problem, we give a compact constrained quadratic formulation and we propose a greedy optimization algorithm able to find locally optimal solutions for the non convex objective function. This algorithm proceeds by reducing the overall problem into a series of simpler convex problems. For the solution of these reduced problems an efficient optimization algorithm is proposed. A number of pattern selection strategies are then discussed to speed-up the optimization process. In addition, given the combinatorial nature of the overall problem, stochastic search strategies are suggested to escape from local minima which are not globally optimal. Finally, we report experiments on a number of datasets. The performance obtained using few simple linear prototypes is comparable to that obtained by state-of-the-art kernel-based methods but with a significant reduction (of one or two orders) in response time. Keywords: multiclass classification, multi-prototype support vector machines, kernel machines, stochastic search optimization, large margin classifiers
4 0.075920485 66 jmlr-2005-Smooth ε-Insensitive Regression by Loss Symmetrization
Author: Ofer Dekel, Shai Shalev-Shwartz, Yoram Singer
Abstract: We describe new loss functions for regression problems along with an accompanying algorithmic framework which utilizes these functions. These loss functions are derived by symmetrization of margin-based losses commonly used in boosting algorithms, namely, the logistic loss and the exponential loss. The resulting symmetric logistic loss can be viewed as a smooth approximation to the ε-insensitive hinge loss used in support vector regression. We describe and analyze two parametric families of batch learning algorithms for minimizing these symmetric losses. The first family employs an iterative log-additive update which can be viewed as a regression counterpart to recent boosting algorithms. The second family utilizes an iterative additive update step. We also describe and analyze online gradient descent (GD) and exponentiated gradient (EG) algorithms for the symmetric logistic loss. A byproduct of our work is a new simple form of regularization for boosting-based classification and regression algorithms. Our regression framework also has implications on classification algorithms, namely, a new additive update boosting algorithm for classification. We demonstrate the merits of our algorithms in a series of experiments.
5 0.070787363 56 jmlr-2005-Maximum Margin Algorithms with Boolean Kernels
Author: Roni Khardon, Rocco A. Servedio
Abstract: Recent work has introduced Boolean kernels with which one can learn linear threshold functions over a feature space containing all conjunctions of length up to k (for any 1 ≤ k ≤ n) over the original n Boolean features in the input space. This motivates the question of whether maximum margin algorithms such as Support Vector Machines can learn Disjunctive Normal Form expressions in the Probably Approximately Correct (PAC) learning model by using this kernel. We study this question, as well as a variant in which structural risk minimization (SRM) is performed where the class hierarchy is taken over the length of conjunctions. We show that maximum margin algorithms using the Boolean kernels do not PAC learn t(n)term DNF for any t(n) = ω(1), even when used with such a SRM scheme. We also consider PAC learning under the uniform distribution and show that if the kernel uses conjunctions of length √ ˜ ω( n) then the maximum margin hypothesis will fail on the uniform distribution as well. Our results concretely illustrate that margin based algorithms may overfit when learning simple target functions with natural kernels. Keywords: computational learning theory, kernel methods, PAC learning, Boolean functions
6 0.070394039 49 jmlr-2005-Learning the Kernel with Hyperkernels (Kernel Machines Section)
7 0.065119766 36 jmlr-2005-Gaussian Processes for Ordinal Regression
8 0.062921017 33 jmlr-2005-Fast Kernel Classifiers with Online and Active Learning
9 0.057562873 55 jmlr-2005-Matrix Exponentiated Gradient Updates for On-line Learning and Bregman Projection
10 0.056470644 4 jmlr-2005-A Framework for Learning Predictive Structures from Multiple Tasks and Unlabeled Data
11 0.056315556 45 jmlr-2005-Learning Multiple Tasks with Kernel Methods
12 0.052057534 34 jmlr-2005-Feature Selection for Unsupervised and Supervised Inference: The Emergence of Sparsity in a Weight-Based Approach
13 0.051095873 2 jmlr-2005-A Bayesian Model for Supervised Clustering with the Dirichlet Process Prior
14 0.047853094 13 jmlr-2005-Analysis of Variance of Cross-Validation Estimators of the Generalization Error
15 0.047377873 24 jmlr-2005-Core Vector Machines: Fast SVM Training on Very Large Data Sets
16 0.045812629 65 jmlr-2005-Separating a Real-Life Nonlinear Image Mixture
17 0.045295462 28 jmlr-2005-Efficient Computation of Gapped Substring Kernels on Large Alphabets
18 0.043950155 68 jmlr-2005-Tree-Based Batch Mode Reinforcement Learning
19 0.042955205 29 jmlr-2005-Efficient Margin Maximizing with Boosting
20 0.041404966 51 jmlr-2005-Local Propagation in Conditional Gaussian Bayesian Networks
topicId topicWeight
[(0, 0.268), (1, 0.095), (2, 0.217), (3, -0.062), (4, -0.078), (5, -0.137), (6, 0.059), (7, -0.048), (8, -0.134), (9, 0.032), (10, 0.112), (11, 0.075), (12, 0.091), (13, 0.076), (14, 0.138), (15, -0.045), (16, 0.019), (17, 0.179), (18, 0.118), (19, 0.122), (20, 0.187), (21, -0.051), (22, 0.043), (23, -0.09), (24, 0.029), (25, -0.028), (26, 0.008), (27, -0.034), (28, -0.071), (29, 0.094), (30, 0.032), (31, 0.018), (32, -0.046), (33, 0.116), (34, -0.114), (35, -0.067), (36, 0.138), (37, 0.218), (38, 0.06), (39, -0.181), (40, 0.131), (41, 0.106), (42, 0.032), (43, 0.117), (44, -0.026), (45, -0.017), (46, -0.03), (47, -0.227), (48, -0.245), (49, 0.025)]
simIndex simValue paperId paperTitle
same-paper 1 0.93089908 42 jmlr-2005-Large Margin Methods for Structured and Interdependent Output Variables
Author: Ioannis Tsochantaridis, Thorsten Joachims, Thomas Hofmann, Yasemin Altun
Abstract: Learning general functional dependencies between arbitrary input and output spaces is one of the key challenges in computational intelligence. While recent progress in machine learning has mainly focused on designing flexible and powerful input representations, this paper addresses the complementary issue of designing classification algorithms that can deal with more complex outputs, such as trees, sequences, or sets. More generally, we consider problems involving multiple dependent output variables, structured output spaces, and classification problems with class attributes. In order to accomplish this, we propose to appropriately generalize the well-known notion of a separation margin and derive a corresponding maximum-margin formulation. While this leads to a quadratic program with a potentially prohibitive, i.e. exponential, number of constraints, we present a cutting plane algorithm that solves the optimization problem in polynomial time for a large class of problems. The proposed method has important applications in areas such as computational biology, natural language processing, information retrieval/extraction, and optical character recognition. Experiments from various domains involving different types of output spaces emphasize the breadth and generality of our approach.
2 0.52759999 58 jmlr-2005-Multiclass Classification with Multi-Prototype Support Vector Machines
Author: Fabio Aiolli, Alessandro Sperduti
Abstract: Winner-take-all multiclass classifiers are built on the top of a set of prototypes each representing one of the available classes. A pattern is then classified with the label associated to the most ‘similar’ prototype. Recent proposal of SVM extensions to multiclass can be considered instances of the same strategy with one prototype per class. The multi-prototype SVM proposed in this paper extends multiclass SVM to multiple prototypes per class. It allows to combine several vectors in a principled way to obtain large margin decision functions. For this problem, we give a compact constrained quadratic formulation and we propose a greedy optimization algorithm able to find locally optimal solutions for the non convex objective function. This algorithm proceeds by reducing the overall problem into a series of simpler convex problems. For the solution of these reduced problems an efficient optimization algorithm is proposed. A number of pattern selection strategies are then discussed to speed-up the optimization process. In addition, given the combinatorial nature of the overall problem, stochastic search strategies are suggested to escape from local minima which are not globally optimal. Finally, we report experiments on a number of datasets. The performance obtained using few simple linear prototypes is comparable to that obtained by state-of-the-art kernel-based methods but with a significant reduction (of one or two orders) in response time. Keywords: multiclass classification, multi-prototype support vector machines, kernel machines, stochastic search optimization, large margin classifiers
3 0.40586308 66 jmlr-2005-Smooth ε-Insensitive Regression by Loss Symmetrization
Author: Ofer Dekel, Shai Shalev-Shwartz, Yoram Singer
Abstract: We describe new loss functions for regression problems along with an accompanying algorithmic framework which utilizes these functions. These loss functions are derived by symmetrization of margin-based losses commonly used in boosting algorithms, namely, the logistic loss and the exponential loss. The resulting symmetric logistic loss can be viewed as a smooth approximation to the ε-insensitive hinge loss used in support vector regression. We describe and analyze two parametric families of batch learning algorithms for minimizing these symmetric losses. The first family employs an iterative log-additive update which can be viewed as a regression counterpart to recent boosting algorithms. The second family utilizes an iterative additive update step. We also describe and analyze online gradient descent (GD) and exponentiated gradient (EG) algorithms for the symmetric logistic loss. A byproduct of our work is a new simple form of regularization for boosting-based classification and regression algorithms. Our regression framework also has implications on classification algorithms, namely, a new additive update boosting algorithm for classification. We demonstrate the merits of our algorithms in a series of experiments.
4 0.40130389 6 jmlr-2005-A Modified Finite Newton Method for Fast Solution of Large Scale Linear SVMs
Author: S. Sathiya Keerthi, Dennis DeCoste
Abstract: This paper develops a fast method for solving linear SVMs with L2 loss function that is suited for large scale data mining tasks such as text classification. This is done by modifying the finite Newton method of Mangasarian in several ways. Experiments indicate that the method is much faster than decomposition methods such as SVMlight , SMO and BSVM (e.g., 4-100 fold), especially when the number of examples is large. The paper also suggests ways of extending the method to other loss functions such as the modified Huber’s loss function and the L1 loss function, and also for solving ordinal regression. Keywords: linear SVMs, classification, conjugate gradient
5 0.34320039 49 jmlr-2005-Learning the Kernel with Hyperkernels (Kernel Machines Section)
Author: Cheng Soon Ong, Alexander J. Smola, Robert C. Williamson
Abstract: This paper addresses the problem of choosing a kernel suitable for estimation with a support vector machine, hence further automating machine learning. This goal is achieved by defining a reproducing kernel Hilbert space on the space of kernels itself. Such a formulation leads to a statistical estimation problem similar to the problem of minimizing a regularized risk functional. We state the equivalent representer theorem for the choice of kernels and present a semidefinite programming formulation of the resulting optimization problem. Several recipes for constructing hyperkernels are provided, as well as the details of common machine learning problems. Experimental results for classification, regression and novelty detection on UCI data show the feasibility of our approach. Keywords: learning the kernel, capacity control, kernel methods, support vector machines, representer theorem, semidefinite programming
6 0.3328377 56 jmlr-2005-Maximum Margin Algorithms with Boolean Kernels
7 0.28331721 68 jmlr-2005-Tree-Based Batch Mode Reinforcement Learning
8 0.27466527 36 jmlr-2005-Gaussian Processes for Ordinal Regression
9 0.26881197 13 jmlr-2005-Analysis of Variance of Cross-Validation Estimators of the Generalization Error
10 0.24087639 53 jmlr-2005-Machine Learning Methods for Predicting Failures in Hard Drives: A Multiple-Instance Application
11 0.2354297 51 jmlr-2005-Local Propagation in Conditional Gaussian Bayesian Networks
12 0.23104957 3 jmlr-2005-A Classification Framework for Anomaly Detection
13 0.22764972 33 jmlr-2005-Fast Kernel Classifiers with Online and Active Learning
14 0.22631247 2 jmlr-2005-A Bayesian Model for Supervised Clustering with the Dirichlet Process Prior
15 0.21701206 45 jmlr-2005-Learning Multiple Tasks with Kernel Methods
16 0.20884793 29 jmlr-2005-Efficient Margin Maximizing with Boosting
17 0.20840366 4 jmlr-2005-A Framework for Learning Predictive Structures from Multiple Tasks and Unlabeled Data
18 0.20473126 10 jmlr-2005-Adaptive Online Prediction by Following the Perturbed Leader
19 0.20356974 28 jmlr-2005-Efficient Computation of Gapped Substring Kernels on Large Alphabets
20 0.20315805 24 jmlr-2005-Core Vector Machines: Fast SVM Training on Very Large Data Sets
topicId topicWeight
[(13, 0.013), (17, 0.018), (19, 0.499), (36, 0.02), (37, 0.037), (42, 0.019), (43, 0.047), (47, 0.03), (52, 0.08), (59, 0.02), (70, 0.019), (88, 0.067), (90, 0.014), (94, 0.046)]
simIndex simValue paperId paperTitle
1 0.89905435 55 jmlr-2005-Matrix Exponentiated Gradient Updates for On-line Learning and Bregman Projection
Author: Koji Tsuda, Gunnar Rätsch, Manfred K. Warmuth
Abstract: We address the problem of learning a symmetric positive definite matrix. The central issue is to design parameter updates that preserve positive definiteness. Our updates are motivated with the von Neumann divergence. Rather than treating the most general case, we focus on two key applications that exemplify our methods: on-line learning with a simple square loss, and finding a symmetric positive definite matrix subject to linear constraints. The updates generalize the exponentiated gradient (EG) update and AdaBoost, respectively: the parameter is now a symmetric positive definite matrix of trace one instead of a probability vector (which in this context is a diagonal positive definite matrix with trace one). The generalized updates use matrix logarithms and exponentials to preserve positive definiteness. Most importantly, we show how the derivation and the analyses of the original EG update and AdaBoost generalize to the non-diagonal case. We apply the resulting matrix exponentiated gradient (MEG) update and DefiniteBoost to the problem of learning a kernel matrix from distance measurements.
same-paper 2 0.85954869 42 jmlr-2005-Large Margin Methods for Structured and Interdependent Output Variables
Author: Ioannis Tsochantaridis, Thorsten Joachims, Thomas Hofmann, Yasemin Altun
Abstract: Learning general functional dependencies between arbitrary input and output spaces is one of the key challenges in computational intelligence. While recent progress in machine learning has mainly focused on designing flexible and powerful input representations, this paper addresses the complementary issue of designing classification algorithms that can deal with more complex outputs, such as trees, sequences, or sets. More generally, we consider problems involving multiple dependent output variables, structured output spaces, and classification problems with class attributes. In order to accomplish this, we propose to appropriately generalize the well-known notion of a separation margin and derive a corresponding maximum-margin formulation. While this leads to a quadratic program with a potentially prohibitive, i.e. exponential, number of constraints, we present a cutting plane algorithm that solves the optimization problem in polynomial time for a large class of problems. The proposed method has important applications in areas such as computational biology, natural language processing, information retrieval/extraction, and optical character recognition. Experiments from various domains involving different types of output spaces emphasize the breadth and generality of our approach.
3 0.43748808 13 jmlr-2005-Analysis of Variance of Cross-Validation Estimators of the Generalization Error
Author: Marianthi Markatou, Hong Tian, Shameek Biswas, George Hripcsak
Abstract: This paper brings together methods from two different disciplines: statistics and machine learning. We address the problem of estimating the variance of cross-validation (CV) estimators of the generalization error. In particular, we approach the problem of variance estimation of the CV estimators of generalization error as a problem in approximating the moments of a statistic. The approximation illustrates the role of training and test sets in the performance of the algorithm. It provides a unifying approach to evaluation of various methods used in obtaining training and test sets and it takes into account the variability due to different training and test sets. For the simple problem of predicting the sample mean and in the case of smooth loss functions, we show that the variance of the CV estimator of the generalization error is a function of the moments of the random T T variables Y = Card(S j S j ) and Y ∗ = Card(Sc Sc ), where S j , S j are two training sets, and Sc , j j j c are the corresponding test sets. We prove that the distribution of Y and Y* is hypergeometric Sj and we compare our estimator with the one proposed by Nadeau and Bengio (2003). We extend these results in the regression case and the case of absolute error loss, and indicate how the methods can be extended to the classification case. We illustrate the results through simulation. Keywords: cross-validation, generalization error, moment approximation, prediction, variance estimation
4 0.41132805 20 jmlr-2005-Clustering with Bregman Divergences
Author: Arindam Banerjee, Srujana Merugu, Inderjit S. Dhillon, Joydeep Ghosh
Abstract: A wide variety of distortion functions, such as squared Euclidean distance, Mahalanobis distance, Itakura-Saito distance and relative entropy, have been used for clustering. In this paper, we propose and analyze parametric hard and soft clustering algorithms based on a large class of distortion functions known as Bregman divergences. The proposed algorithms unify centroid-based parametric clustering approaches, such as classical kmeans, the Linde-Buzo-Gray (LBG) algorithm and information-theoretic clustering, which arise by special choices of the Bregman divergence. The algorithms maintain the simplicity and scalability of the classical kmeans algorithm, while generalizing the method to a large class of clustering loss functions. This is achieved by first posing the hard clustering problem in terms of minimizing the loss in Bregman information, a quantity motivated by rate distortion theory, and then deriving an iterative algorithm that monotonically decreases this loss. In addition, we show that there is a bijection between regular exponential families and a large class of Bregman divergences, that we call regular Bregman divergences. This result enables the development of an alternative interpretation of an efficient EM scheme for learning mixtures of exponential family distributions, and leads to a simple soft clustering algorithm for regular Bregman divergences. Finally, we discuss the connection between rate distortion theory and Bregman clustering and present an information theoretic analysis of Bregman clustering algorithms in terms of a trade-off between compression and loss in Bregman information. Keywords: clustering, Bregman divergences, Bregman information, exponential families, expectation maximization, information theory
5 0.40499446 49 jmlr-2005-Learning the Kernel with Hyperkernels (Kernel Machines Section)
Author: Cheng Soon Ong, Alexander J. Smola, Robert C. Williamson
Abstract: This paper addresses the problem of choosing a kernel suitable for estimation with a support vector machine, hence further automating machine learning. This goal is achieved by defining a reproducing kernel Hilbert space on the space of kernels itself. Such a formulation leads to a statistical estimation problem similar to the problem of minimizing a regularized risk functional. We state the equivalent representer theorem for the choice of kernels and present a semidefinite programming formulation of the resulting optimization problem. Several recipes for constructing hyperkernels are provided, as well as the details of common machine learning problems. Experimental results for classification, regression and novelty detection on UCI data show the feasibility of our approach. Keywords: learning the kernel, capacity control, kernel methods, support vector machines, representer theorem, semidefinite programming
6 0.37827647 52 jmlr-2005-Loopy Belief Propagation: Convergence and Effects of Message Errors
7 0.37781319 33 jmlr-2005-Fast Kernel Classifiers with Online and Active Learning
8 0.37688881 39 jmlr-2005-Information Bottleneck for Gaussian Variables
9 0.37046552 64 jmlr-2005-Semigroup Kernels on Measures
10 0.36575812 5 jmlr-2005-A Generalization Error for Q-Learning
11 0.36220646 10 jmlr-2005-Adaptive Online Prediction by Following the Perturbed Leader
12 0.35499561 29 jmlr-2005-Efficient Margin Maximizing with Boosting
13 0.35426342 71 jmlr-2005-Variational Message Passing
14 0.34800237 48 jmlr-2005-Learning the Kernel Function via Regularization
15 0.34263512 57 jmlr-2005-Multiclass Boosting for Weak Classifiers
16 0.34084859 36 jmlr-2005-Gaussian Processes for Ordinal Regression
17 0.33859721 45 jmlr-2005-Learning Multiple Tasks with Kernel Methods
18 0.3341676 46 jmlr-2005-Learning a Mahalanobis Metric from Equivalence Constraints
19 0.33269498 63 jmlr-2005-Quasi-Geodesic Neural Learning Algorithms Over the Orthogonal Group: A Tutorial
20 0.33055705 35 jmlr-2005-Frames, Reproducing Kernels, Regularization and Learning