jmlr jmlr2008 jmlr2008-30 knowledge-graph by maker-knowledge-mining

30 jmlr-2008-Discriminative Learning of Max-Sum Classifiers


Source: pdf

Author: Vojtěch Franc, Bogdan Savchynskyy

Abstract: The max-sum classifier predicts an n-tuple of labels from an n-tuple of observable variables by maximizing a sum of quality functions defined over neighbouring pairs of labels and observable variables. Predicting labels as MAP assignments of a Markov random field is a particular example of the max-sum classifier. Learning parameters of the max-sum classifier is a challenging problem because even computing the response of such a classifier is NP-complete in general. Estimating parameters using the Maximum Likelihood approach is feasible only for a subclass of max-sum classifiers with an acyclic structure of neighbouring pairs. Recently, the discriminative methods represented by the perceptron and the Support Vector Machines, originally designed for binary linear classifiers, have been extended for learning some subclasses of the max-sum classifier. Besides the max-sum classifiers with the acyclic neighbouring structure, it has been shown that discriminative learning is possible even with an arbitrary neighbouring structure provided the quality functions fulfill some additional constraints. In this article, we extend the discriminative approach to three other classes of max-sum classifiers with an arbitrary neighbourhood structure. We derive learning algorithms for two subclasses of max-sum classifiers whose response can be computed in polynomial time: (i) the max-sum classifiers with supermodular quality functions and (ii) the max-sum classifiers whose response can be computed exactly by a linear programming relaxation. Moreover, we show that the learning problem can be approximately solved even for a general max-sum classifier. Keywords: max-sum classifier, hidden Markov networks, support vector machines

Reference: text


Summary: the most important sentences generated by the tfidf model

sentIndex sentText sentNum sentScore

1 We derive learning algorithms for two subclasses of max-sum classifiers whose response can be computed in polynomial time: (i) the max-sum classifiers with supermodular quality functions and (ii) the max-sum classifiers whose response can be computed exactly by a linear programming relaxation. [sent-14, score-0.457]

2 Let each object t ∈ T be characterized by an observation x_t and a label y_t which take values from the finite sets X and Y , respectively. [sent-19, score-0.384]

3 Each object t ∈ T is assigned a function q_t : Y × X → R which determines the quality of a label y_t given an observation x_t . [sent-21, score-0.649]

4 Each object pair {t,t'} ∈ E is assigned a function g_{tt'} : Y × Y → R. [sent-22, score-0.522]

5 which determines the quality of labels y_t and y_{t'} . [sent-25, score-0.805]

6 We adopt the convention g_{tt'}(y, y') = g_{t't}(y', y). [sent-26, score-0.499]

7 Let q ∈ R^{|T||X||Y|} and g ∈ R^{|E||Y|²} be ordered tuples which contain elements q_t(y, x), t ∈ T , x ∈ X , y ∈ Y and g_{tt'}(y, y'), {t,t'} ∈ E , y, y' ∈ Y , respectively. [sent-27, score-0.64]

8 We consider a class of structured classifiers f : X^T → Y^T parametrized by (q, g) that predict a labeling y from observations x by selecting the labeling with the maximal quality, that is, f(x; q, g) = argmax_{y∈Y^T} F(x, y; q, g) = argmax_{y∈Y^T} [ ∑_{t∈T} q_t(y_t, x_t) + ∑_{{t,t'}∈E} g_{tt'}(y_t, y_{t'}) ]. [sent-28, score-1.401]
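A minimal sketch of how the quality F(x, y; q, g) and the max-sum prediction could be evaluated. The dict-of-tables layout (q[t] keyed by a (label, observation) pair, g[(t, t')] keyed by a label pair) and the function names are illustrative assumptions, not the authors' implementation; exhaustive search is shown only because the general response computation is NP-complete, which is exactly why the paper focuses on tractable subclasses.

```python
import itertools

def quality(y, x, q, g, objects, edges):
    # F(x, y; q, g) = sum_t q_t(y_t, x_t) + sum_{t,t'} g_{tt'}(y_t, y_{t'})
    node_part = sum(q[t][y[t], x[t]] for t in objects)
    edge_part = sum(g[t, u][y[t], y[u]] for (t, u) in edges)
    return node_part + edge_part

def brute_force_classifier(x, q, g, objects, edges, labels):
    # Exhaustive argmax over all |Y|^|T| labelings; viable only for tiny toy
    # problems, since computing the max-sum response is NP-complete in general.
    candidates = itertools.product(labels, repeat=len(objects))
    best = max(candidates,
               key=lambda ys: quality(dict(zip(objects, ys)), x, q, g, objects, edges))
    return dict(zip(objects, best))
```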

9 (2006) showed that the projection step is tractable for the max-sum classifiers with supermodular functions g and binary labels |Y | = 2. [sent-90, score-0.376]

10 Their approach is based on solving the underlying SVM QP task by a cutting plane algorithm which requires as a subroutine an algorithm solving the LAC task for the particular classifier. [sent-93, score-0.406]

11 The authors suggested replacing the exact solution of the LAC task required in the cutting plane algorithm by its polynomially solvable LP relaxation. [sent-99, score-0.348]

12 Namely, the contributions of the article are as follows: • We formulate and solve the problem of learning the supermodular max-sum classifier with an arbitrary neighbourhood structure E and without any restriction on the number of labels |Y | (Taskar et al. [sent-103, score-0.496]

13 (2005) such that it maintains the quality functions g supermodular during the course of the algorithm thus making the LAC task efficiently solvable. [sent-107, score-0.492]

14 It is known that the class of problems with a trivial equivalent contains acyclic problems (Schlesinger, 1976) and supermodular problems (Schlesinger and Flach, 2000). [sent-113, score-0.492]

15 We will extend this result to show that problems with a strictly trivial equivalent which we can learn contain the acyclic and the supermodular max-sum problems with a unique solution. [sent-114, score-0.576]

16 We propose variants of the perceptron and of the cutting plane algorithm for learning from a consistent and an inconsistent training set, respectively. [sent-116, score-0.435]

17 We show that the learning problem can be solved approximately by a variant of the cutting plane algorithm proposed by Finley and Joachims (2005) which uses the LP relaxation to solve the LAC task approximately. [sent-120, score-0.415]

18 Node (t, y) is called maximal if q_t(y, x_t) = max_{y'∈Y} q_t(y', x_t) and edge {(t, y), (t', y')} is called maximal if g_{tt'}(y, y') = max_{ȳ,ȳ'∈Y} g_{tt'}(ȳ, ȳ'). [sent-170, score-1.57]

19 Let U(x; q, g) be a height (also called energy) of the max-sum problem P defined as a sum of qualities of the maximal nodes and the maximal edges, that is, U(x; q, g) = ∑_{t∈T} max_{y∈Y} q_t(y, x_t) + ∑_{{t,t'}∈E} max_{y,y'∈Y} g_{tt'}(y, y') . [sent-175, score-0.91]
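A small illustrative sketch of the height U(x; q, g), reusing the hypothetical dict-of-tables layout from the previous sketch; U sums the best node and edge qualities and therefore upper-bounds F(x, y; q, g) for every labeling y.

```python
def height(x, q, g, objects, edges, labels):
    # U(x; q, g): sum of maximal node qualities plus maximal edge qualities,
    # an upper bound on F(x, y; q, g) over all labelings y.
    node_part = sum(max(q[t][y, x[t]] for y in labels) for t in objects)
    edge_part = sum(max(g[t, u][y, yp] for y in labels for yp in labels)
                    for (t, u) in edges)
    return node_part + edge_part
```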

20 2 Supermodular Max-Sum Problems. Definition 1: A function g_{tt'} : Y × Y → R is supermodular if 1. [sent-193, score-0.846]

21 For each four-tuple (y_t, y_{t'}, ȳ_t, ȳ_{t'}) ∈ Y⁴ of labels such that y_t > ȳ_t and y_{t'} > ȳ_{t'} the following inequality holds: g_{tt'}(y_t, y_{t'}) + g_{tt'}(ȳ_t, ȳ_{t'}) ≥ g_{tt'}(y_t, ȳ_{t'}) + g_{tt'}(ȳ_t, y_{t'}). [sent-203, score-6.017]

22 (7) A max-sum problem in which the label set Y is totally ordered and all the functions g are supermodular is called a supermodular max-sum problem. [sent-204, score-0.694]

23 A naive approach to check whether a given max-sum problem is supermodular amounts to verifying all |E|·|Y|⁴ inequalities in (7). [sent-207, score-0.404]
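A direct sketch of that naive check. It assumes the labels are comparable values (e.g., integers) so that `>` realizes the total order on Y, and it reuses the hypothetical dict layout for g from the earlier sketches; it simply enumerates all |E|·|Y|^4 instances of inequality (7).

```python
def is_supermodular(g, edges, labels):
    # Naive O(|E| * |Y|^4) verification of inequality (7); labels are assumed
    # to be integers (or otherwise comparable) realizing the total order on Y.
    for e in edges:
        for ya in labels:
            for yb in labels:
                for yc in labels:
                    for yd in labels:
                        if ya > yc and yb > yd:
                            lhs = g[e][ya, yb] + g[e][yc, yd]
                            rhs = g[e][ya, yd] + g[e][yc, yb]
                            if lhs < rhs:
                                return False
    return True
```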

24 (9) In this paper, we exploit the fact that the optimal solution of supermodular problems can be found in polynomial time (Schlesinger and Flach, 2000). [sent-211, score-0.375]

25 Kolmogorov and Zabih (2002) proposed an efficient algorithm for the binary case |Y| = 2 which is based on transforming the supermodular max-sum problem to the max-flow problem from graph theory. [sent-212, score-0.347]

26 We will propose learning algorithms which require an arbitrary solver for the supermodular max-sum problem as a subroutine. [sent-214, score-0.396]

27 It was shown (Schlesinger, 1976; Schlesinger and Flach, 2000) that problems with a trivial equivalent contain two well-known polynomially solvable subclasses of max-sum problems: (i) the problems with an acyclic neighborhood structure E and (ii) the supermodular problems. [sent-219, score-0.545]

28 In particular, we will show that the class of problems with a strictly trivial equivalent contains all acyclic and supermodular problems which have a unique solution. [sent-221, score-0.576]

29 Recall that the max-sum problem P has a strictly trivial equivalent if there exists a strictly trivial problem which is equivalent to P. [sent-235, score-0.372]

30 Finally, we give a theorem which asserts that the class of problems with a strictly trivial equivalent includes at least the acyclic problems and the supermodular problems which have a unique solution. [sent-236, score-0.576]

31 If (T , E ) is an acyclic graph or quality functions g are supermodular then P is equivalent to some strictly trivial problem. [sent-238, score-0.63]

32 In Section 4 we will show, however, how to use the perceptron for learning the strictly supermodular max-sum classifiers and the max-sum classifiers with a strictly trivial equivalent. [sent-281, score-0.758]

33 (15) Similarly to the perceptron algorithm, the cutting plane algorithm is applicable for learning of an arbitrary linear classifier for which the LAC task (15) can be solved efficiently. [sent-310, score-0.462]

34 In case of a supermodular max-sum classifier we will augment the learning problem (13) in such a way that the obtained quality functions will be supermodular and thus the task (15) becomes solvable exactly in polynomial time. [sent-314, score-0.892]

35 Next, we will derive learning algorithms for strictly supermodular maxsum classifiers and max-sum classifiers with a strictly trivial equivalent. [sent-327, score-0.617]

36 2 Strictly Supermodular Max-Sum Classifier Problem 2 (Learning strictly supermodular max-sum classifier from a consistent training set). [sent-345, score-0.465]

37 Similarly to the general case, we reformulate Problem 2 as an equivalent problem of solving a set of strict linear inequalities F(x^j, y^j; q, g) > F(x^j, y; q, g) , j ∈ J , y ∈ Y^T \ {y^j} , and g_{tt'}(y, y') + g_{tt'}(y+1, y'+1) > g_{tt'}(y, y'+1) + g_{tt'}(y+1, y') , {t,t'} ∈ E , (y, y') ∈ {1, . . . [sent-347, score-2.077]

38 The inequalities (18b), if satisfied, guarantee that the quality functions g are strictly supermodular (c. [sent-353, score-0.542]

39 Thus we use the perceptron algorithm to repeatedly solve the inequalities (18b) and, as soon as (18b) are satisfied, we find a violated inequality in (18a) by solving a supermodular max-sum problem. [sent-361, score-0.668]

40 The proposed variant of the perceptron to solve (18) is as follows: Algorithm 2 Learning strictly supermodular max-sum classifier by perceptron 1: Set q := 0 and g := 0. [sent-362, score-0.735]

41 2: Find a violated inequality in (18b), that is, find a quadruple {t,t'} ∈ E , y, y' ∈ Y such that g_{tt'}(y, y') + g_{tt'}(y+1, y'+1) − g_{tt'}(y, y'+1) − g_{tt'}(y+1, y') ≤ 0 . [sent-363, score-2.073]

42 Otherwise use the found (t, t', y, y') to update the current g by g_{tt'}(y, y') := g_{tt'}(y, y') + 1 , g_{tt'}(y+1, y'+1) := g_{tt'}(y+1, y'+1) + 1 , g_{tt'}(y, y'+1) := g_{tt'}(y, y'+1) − 1 , g_{tt'}(y+1, y') := g_{tt'}(y+1, y') − 1 , and go to Step 2. [sent-365, score-3.992]
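A sketch of how Steps 2 and 3 of Algorithm 2 could be combined: scan the consecutive-label inequalities of type (18b) and, on the first violation, apply the +1/-1 perceptron repair above. It assumes labels are consecutive integers (so y + 1 is the next label) and reuses the hypothetical dict layout; the function name is illustrative.

```python
def repair_supermodularity(g, edges, labels):
    # Steps 2-3 of Algorithm 2 (sketch): find a violated inequality of type (18b)
    # over consecutive label pairs and repair it with the additive perceptron step.
    # `labels` are assumed to be consecutive integers sorted increasingly.
    for e in edges:
        for y in labels[:-1]:
            for yp in labels[:-1]:
                margin = (g[e][y, yp] + g[e][y + 1, yp + 1]
                          - g[e][y, yp + 1] - g[e][y + 1, yp])
                if margin <= 0:
                    g[e][y, yp] += 1
                    g[e][y + 1, yp + 1] += 1
                    g[e][y, yp + 1] -= 1
                    g[e][y + 1, yp] -= 1
                    return True   # a violated inequality was found and repaired
    return False                  # (18b) holds strictly for every edge
```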

43 Otherwise use the found (j, y) to update the current q and g by g_{tt'}(y^j_t, y^j_{t'}) := g_{tt'}(y^j_t, y^j_{t'}) + 1, g_{tt'}(y_t, y_{t'}) := g_{tt'}(y_t, y_{t'}) − 1 , {t,t'} ∈ E , and q_t(y^j_t, x^j_t) := q_t(y^j_t, x^j_t) + 1, q_t(y_t, x^j_t) := q_t(y_t, x^j_t) − 1 , t ∈ T . [sent-368, score-4.284]
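A sketch of that additive perceptron update: after the max-sum solver returns a labeling that beats the correct labeling y^j, quality mass is shifted toward y^j and away from the violating labeling. The data layout and names follow the earlier hypothetical sketches.

```python
def perceptron_update(x_j, y_j, y_found, q, g, objects, edges):
    # Shift quality toward the correct labeling y_j and away from the
    # violating labeling y_found returned by the max-sum solver.
    for (t, u) in edges:
        g[t, u][y_j[t], y_j[u]] += 1
        g[t, u][y_found[t], y_found[u]] -= 1
    for t in objects:
        q[t][y_j[t], x_j[t]] += 1
        q[t][y_found[t], x_j[t]] -= 1
```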

44 Thus Algorithm 2 terminates in a finite number of iterations provided (18) is satisfiable, that is, if there exists a supermodular max-sum classifier f (•; q, g) with zero empirical error risk on the training set L . [sent-371, score-0.415]

45 The proposed variant of the perceptron to solve (20) is as follows: Algorithm 3 Learning the max-sum classifier with a strictly trivial equivalent by perceptron. 1: Set g := 0, q := 0, ϕ^j := 0 , j ∈ J . [sent-387, score-0.388]

46 2: Find a violated inequality in (20a), that is, find a triplet j ∈ J , t ∈ T , y ∈ Y \ {y^j_t} such that q_t(y^j_t, x^j_t) − ∑_{t'∈N(t)} ϕ^j_{tt'}(y^j_t) ≤ q_t(y, x^j_t) − ∑_{t'∈N(t)} ϕ^j_{tt'}(y) . [sent-388, score-0.499]

47 Otherwise update q and ϕ^j by ϕ^j_{tt'}(y^j_t) := ϕ^j_{tt'}(y^j_t) − 1 , ϕ^j_{tt'}(y) := ϕ^j_{tt'}(y) + 1 , t' ∈ N(t) , and q_t(y^j_t, x^j_t) := q_t(y^j_t, x^j_t) + 1 , q_t(y, x^j_t) := q_t(y, x^j_t) − 1 . [sent-390, score-0.844]

48 4: Find a violated inequality in (20b), that is, find a quintuple j ∈ J , {t,t'} ∈ E , (y, y') ∈ Y² \ {(y^j_t, y^j_{t'})} such that g_{tt'}(y^j_t, y^j_{t'}) + ϕ^j_{tt'}(y^j_t) + ϕ^j_{t't}(y^j_{t'}) ≤ g_{tt'}(y, y') + ϕ^j_{tt'}(y) + ϕ^j_{t't}(y') . [sent-391, score-1.797]

49 Otherwise update g and ϕ^j by ϕ^j_{tt'}(y^j_t) := ϕ^j_{tt'}(y^j_t) + 1 , ϕ^j_{t't}(y^j_{t'}) := ϕ^j_{t't}(y^j_{t'}) + 1 , ϕ^j_{tt'}(y) := ϕ^j_{tt'}(y) − 1 , ϕ^j_{t't}(y') := ϕ^j_{t't}(y') − 1 , g_{tt'}(y^j_t, y^j_{t'}) := g_{tt'}(y^j_t, y^j_{t'}) + 1 , g_{tt'}(y, y') := g_{tt'}(y, y') − 1 . [sent-393, score-2.718]

50 We will formulate the learning problems for a general max-sum classifier, a supermodular max-sum classifier, and a max-sum classifier with a strictly trivial equivalent. [sent-399, score-0.533]

51 In case of a general max-sum classifier and a supermodular max-sum classifier, we will use the additive loss function L_∆(y, y') = ∑_{t∈T} L_t(y_t, y'_t) where L_t : Y × Y → R is any function which satisfies L_t(y, y') = 0 for y = y' and L_t(y, y') > 0 otherwise. [sent-409, score-0.708]
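A small sketch of such an additive loss, with the 0/1 per-object loss (giving the Hamming loss between labelings) as the default; the signature and names are illustrative assumptions.

```python
def additive_loss(y, y_prime, objects, per_object_loss=None):
    # L_Delta(y, y') = sum_t L_t(y_t, y'_t); by default L_t is the 0/1 loss,
    # which makes L_Delta the Hamming loss between the two labelings.
    if per_object_loss is None:
        per_object_loss = lambda a, b: 0.0 if a == b else 1.0
    return sum(per_object_loss(y[t], y_prime[t]) for t in objects)
```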

52 Further, we will prove that when the LAC task is solved exactly then the cutting plane algorithm finds an arbitrarily precise solution, QP(w, ξ) − QP(w*, ξ*) ≤ ε, ε > 0, after a finite number of iterations. [sent-433, score-0.349]

53 2 Supermodular Max-Sum Classifier Problem 5 (Learning a supermodular max-sum classifier from an inconsistent training set). [sent-437, score-0.465]

54 (23c) Compared to the task (21) of learning a general max-sum classifier, the task (23) defined for the supermodular max-sum classifier contains additional linear constraints (23c). [sent-442, score-0.557]

55 ⟨w, z_i⟩ ≥ b_i , i ∈ I_0 , and ⟨w, z_i⟩ ≥ b_i − ξ_j , j ∈ J , i ∈ I_j , where b_i = 0, i ∈ I_0 , and z_i , i ∈ I_0 , account for the added supermodular constraints (23c). [sent-447, score-0.648]

56 Thus the quality functions are always supermodular and the LAC task can be solved precisely by efficient polynomial-time algorithms. [sent-449, score-0.525]

57 This extension is necessary for learning a supermodular max-sum classifier. [sent-480, score-0.347]

58 The cutting plane algorithm tries to select the subsets I = I0 ∪ I 1 ∪ · · · ∪ I m such that (i) |I | is sufficiently small to make the reduced task (28) solvable by standard optimization packages and (ii) the obtained solution well approximates the original task. [sent-510, score-0.369]

59 The proposed extension of the cutting plane algorithm is as follows: Algorithm 4 A cutting plane algorithm for QP task (26). 1: Select arbitrarily Î_j := {u_j}, u_j ∈ I_j , j ∈ J . [sent-511, score-0.443]
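A hedged sketch of the overall cutting-plane loop in the spirit of Algorithm 4. The two oracles, a reduced-QP solver and a most-violated-constraint (LAC) search, are assumed to be supplied by the caller; all names and the exact bookkeeping are illustrative, not the authors' implementation.

```python
def cutting_plane(examples, find_most_violated, solve_reduced_qp, eps, max_iter=100):
    # Keep one working set of constraints per training example, repeatedly
    # re-solve the reduced QP, and add the most violated constraint returned by
    # the LAC oracle until no constraint is violated by more than eps.
    working_sets = {j: [] for j in range(len(examples))}
    w, xi = None, None
    for _ in range(max_iter):
        w, xi = solve_reduced_qp(working_sets)
        added = False
        for j, example in enumerate(examples):
            constraint, violation = find_most_violated(w, xi, j, example)
            if violation > eps:
                working_sets[j].append(constraint)
                added = True
        if not added:
            break   # all constraints satisfied up to eps
    return w, xi
```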

60 First, the task (29) can be solved precisely which applies to learning of supermodular max-sum classifiers and max-sum classifiers with a strictly trivial equivalent. [sent-522, score-0.657]

61 (33) By adding (33) to (32) and using w = ∑_{i∈I} α_i z_i and ξ_j = b_{u_j} − ⟨w, z_{u_j}⟩ we get (C/m) ∑_{j∈J} ( b_{u_j} − ⟨w, z_{u_j}⟩ ) − ∑_{i∈I} α_i (b_i − ⟨w, z_i⟩) ≤ ε , and hence (C/m) ∑_{j∈J} ξ_j − ∑_{i∈I} α_i b_i + ⟨w, w⟩ ≤ ε. [sent-559, score-0.499]

62 The LAC task (29) required in Step 2a of Algorithm 4 amounts to solving an instance of a general max-sum problem: ŷ^j = argmax_{y∈Y^T} [ L_∆(y, y^j) − F(x^j, y^j; q, g) + F(x^j, y; q, g) ] = argmax_{y∈Y^T} [ ∑_{t∈T} ( q_t(y_t, x^j_t) + [[y_t ≠ y^j_t]] ) + ∑_{{t,t'}∈E} g_{tt'}(y_t, y_{t'}) ] . [sent-570, score-1.619]
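A sketch of the loss-augmented trick for the 0/1 per-object loss: the node qualities are bumped by 1 wherever a label disagrees with the training labeling, and an ordinary max-sum solver run on the augmented problem then returns the most violated labeling. Data layout and names follow the earlier hypothetical sketches.

```python
def loss_augmented_qualities(x_j, y_j, q, objects, labels):
    # Build q'_t(y, x^j_t) = q_t(y, x^j_t) + [[y != y^j_t]] so that any max-sum
    # solver applied to (q_aug, g) solves the loss-augmented (LAC) task.
    q_aug = {}
    for t in objects:
        q_aug[t] = dict(q[t])
        for y in labels:
            bump = 0.0 if y == y_j[t] else 1.0
            q_aug[t][y, x_j[t]] = q[t][y, x_j[t]] + bump
    return q_aug
```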

63 Since the QP task (24) involves the primal constraints ⟨w, z_i⟩ ≥ b_i , i ∈ I_0 , without slack variables, the corresponding z_i , i ∈ I_0 , must be included in the reduced QP task during the entire progress of Algorithm 4. [sent-582, score-0.379]

64 We compare a general max-sum classifier, a supermodular max-sum classifier and a standard multi-class SVM classifier (Crammer and Singer, 2001) as a baseline approach. [sent-607, score-0.347]

65 In particular, we used the formulations given in Problem 4 and Problem 5 for learning the general and supermodular classifiers from the inconsistent training sets. [sent-608, score-0.465]

66 Each pixel t ∈ T is characterized by its observable state x_t and a label y_t . [sent-631, score-0.431]

67 For the supermodular max-sum classifier the required precision was always attained since the LAC can be solved exactly. [sent-661, score-0.405]

68 Though we have not fully exploited this possibility, the training time for the supermodular max-sum classifier can be considerably reduced by using min-cut/max-flow algorithms to solve the LAC task. [sent-665, score-0.424]

69 For this reason the supermodular max-sum classifier is favourable when the training time or the classification time is of importance. [sent-667, score-0.381]

70 Due to the different sizes of the images it is convenient to present errors in terms of the normalized additive loss function L(y, y') = (100/|T|) ∑_{t∈T} [[y_t ≠ y'_t]], corresponding to the percentage of misclassified pixels in the image. [sent-670, score-0.385]
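A one-line sketch of that normalized loss, assuming labelings are stored as dicts keyed by pixel; names are illustrative.

```python
def percent_misclassified(y, y_true, pixels):
    # L(y, y') = (100 / |T|) * sum_t [[y_t != y'_t]]: percentage of wrongly
    # labeled pixels, comparable across images of different sizes.
    wrong = sum(1 for t in pixels if y[t] != y_true[t])
    return 100.0 * wrong / len(pixels)
```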

71 Performances of the general and the supermodular max-sum classifiers are almost identical. [sent-673, score-0.347]

72 This shows a good potential of the learning algorithm to extend applicability of the polynomially solvable class of supermodular max-sum problems. [sent-674, score-0.4]

73 Note that selecting a proper supermodular function by hand is difficult due to an unintuitive form of the supermodularity conditions (7). [sent-675, score-0.394]

74 The quality functions q and g are assumed to be homogeneous, that is, q_t(x, y) = q(x, y) and g_{tt'}(y, y') = g(y, y'). [sent-716, score-0.694]

75 We have proposed variants of the perceptron and the cutting plane algorithm for learning supermodular max-sum classifiers whose response can be computed efficiently in polynomial time. [sent-796, score-0.692]

76 The perceptron and the cutting plane algorithm are modified such that the added supermodularity constraints remain satisfied during the course of optimization. [sent-798, score-0.392]

77 We have shown that this class covers at least the acyclic and supermodular max-sum classifiers with a unique solution. [sent-801, score-0.39]

78 As a result, the perceptron and the cutting plane algorithms do not require calling any max-sum solver during the course of optimization. [sent-803, score-0.345]

79 Prior to proving the theorem, we will introduce the concept of local consistency (Schlesinger, 1976) and the theorem asserting that the problems with a trivial equivalent contain the acyclic and supermodular problems (Schlesinger, 1976; Schlesinger and Flach, 2000). [sent-819, score-0.492]

80 If (T , E ) is an acyclic graph or quality functions g are supermodular then P is equivalent to a trivial problem. [sent-824, score-0.546]

81 Recall that P with the optimal solution y* ∈ Y^T has a trivial equivalent P^ϕ if there exist potentials ϕ such that the following set of linear inequalities holds: q^ϕ_t(y*_t, x_t) ≥ q^ϕ_t(y, x_t) , t ∈ T , y ∈ Y \ {y*_t} , and g^ϕ_{tt'}(y*_t, y*_{t'}) ≥ g^ϕ_{tt'}(y, y') , {t,t'} ∈ E , (y, y') ∈ Y² \ {(y*_t, y*_{t'})} . (36) [sent-825, score-0.662]

82 The system (36) differs from the definition of problems with a strictly trivial equivalent (19) just by using non-strict inequalities. [sent-826, score-0.807]

83 Then only two cases can occur: (i) P^ϕ is strictly trivial or (ii) there is at least one maximal locally inconsistent node or edge. [sent-828, score-0.345]

84 Lemma 3 Let P be a supermodular problem with a unique solution y* and let P^ϕ be a trivial equivalent of P. [sent-837, score-0.477]

85 Let Y⁰_t = {(t, y) | q_t(y, x_t) = max_{y'∈Y} q_t(y', x_t)} denote the set of all maximal nodes corresponding to the object t ∈ T . [sent-840, score-0.57]

86 Let us construct the labeling y^h = (y^h_t | t ∈ T) composed of the highest maximal nodes; (t, y^h_t) is the highest maximal node if (t, y^h_t) ∈ Y⁰_t and y^h_t > y for all (t, y) ∈ Y⁰_t \ {(t, y^h_t)}. [sent-841, score-0.652]
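A small sketch of that construction, assuming integer (totally ordered) labels and the dict layout used earlier: for each object, take the largest label among those attaining the maximal node quality.

```python
def highest_maximal_labeling(q, x, objects, labels):
    # y^h: for every object t pick the highest label whose node quality attains
    # max_y q_t(y, x_t); labels are assumed comparable under the total order on Y.
    y_h = {}
    for t in objects:
        best = max(q[t][y, x[t]] for y in labels)
        y_h[t] = max(y for y in labels if q[t][y, x[t]] == best)
    return y_h
```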

87 Recall that the labels are fully ordered for supermodular problems. [sent-842, score-0.376]

88 Now, we show that the labeling y^h is optimal since all its edges {(t, y^h_t), (t', y^h_{t'})} ∈ E_Y are also maximal. [sent-843, score-0.395]

89 Then, by the assumption of local consistency, there exist edges {(t, y^h_t), (t', y_{t'})} and {(t, y_t), (t', y^h_{t'})} which are maximal. [sent-845, score-1.038]

90 Note that y_t < y^h_t and y_{t'} < y^h_{t'} because (t, y^h_t) and (t', y^h_{t'}) are the highest nodes. [sent-846, score-1.286]

91 From the condition of supermodularity (7) and (4a) we have 0 ≤ g_{tt'}(y^h_t, y^h_{t'}) + g_{tt'}(y_t, y_{t'}) − g_{tt'}(y^h_t, y_{t'}) − g_{tt'}(y_t, y^h_{t'}) = g^ϕ_{tt'}(y^h_t, y^h_{t'}) + g^ϕ_{tt'}(y_t, y_{t'}) − g^ϕ_{tt'}(y^h_t, y_{t'}) − g^ϕ_{tt'}(y_t, y^h_{t'}) . [sent-847, score-6.047]

92 This condition, however, cannot be satisfied if the edge {(t, y^h_t), (t', y^h_{t'})} is not maximal, which is a contradiction. [sent-848, score-0.357]

93 Proof of Theorem 3: Let P be an acyclic or supermodular max-sum problem with the unique solution y* ∈ Y^T . [sent-858, score-0.418]

94 We will introduce a procedure which changes the potentials ϕ in such a way that the inconsistent maximal node (t, y⁰_t) or inconsistent maximal edge {(t, y⁰_t), (t', y⁰_{t'})}, respectively, becomes non-maximal while other maximal (non-maximal) nodes or edges remain maximal (non-maximal). [sent-862, score-0.582]

95 Repeating this procedure for all inconsistent maximal nodes and edges makes the inequalities (36) satisfied strictly, that is, the problem P^ϕ becomes strictly trivial, which is to be proven. [sent-863, score-0.463]

96 The procedures for elimination of inconsistent nodes and edges read: Elimination of the inconsistent maximal node (t, y⁰_t): Let t' ∈ N(t) be such a neighbor of t that the set of edges E_{tt'}(y⁰_t) = {{(t, y⁰_t), (t', y_{t'})} | y_{t'} ∈ Y} does not contain any maximal edge. [sent-864, score-1.135]

97 Let ε be a number computed as ε = (1/2) ( max_{y,y'∈Y} g^ϕ_{tt'}(y, y') − max_{y'∈Y} g^ϕ_{tt'}(y⁰_t, y') ) . [sent-865, score-0.499]

98 Adding ε to the potential, ϕ_{tt'}(y⁰_t) := ϕ_{tt'}(y⁰_t) + ε, decreases the quality q^ϕ_t(y⁰_t, x_t) = q_t(y⁰_t, x_t) − ∑_{t'∈N(t)} ϕ_{tt'}(y⁰_t) by ε and increases the qualities g^ϕ_{tt'}(y, y') = g_{tt'}(y, y') + ϕ_{tt'}(y) + ϕ_{t't}(y'), {(t, y), (t', y')} ∈ E_{tt'}(y⁰_t), by ε. [sent-867, score-1.497]

99 Notice that all edges from E_{t't}(y⁰_{t'}) = {{(t, y_t), (t', y⁰_{t'})} | y_t ∈ Y} are locally inconsistent and they cannot be a part of the optimal solution. [sent-875, score-0.84]

100 Subtracting ε from the potential, ϕ_{t't}(y⁰_{t'}) := ϕ_{t't}(y⁰_{t'}) − ε, decreases the qualities g^ϕ_{tt'}(y, y') = g_{tt'}(y, y') + ϕ_{tt'}(y) + ϕ_{t't}(y'), {(t, y), (t', y')} ∈ E_{t't}(y⁰_{t'}), by ε and increases the quality q^ϕ_{t'}(y⁰_{t'}, x_{t'}) = q_{t'}(y⁰_{t'}, x_{t'}) − ∑_{t''∈N(t')} ϕ_{t't''}(y⁰_{t'}) by ε. [sent-878, score-1.497]


similar papers computed by tfidf model

tfidf for this paper:

wordName wordTfidf (topN-words)

[('gtt', 0.499), ('yt', 0.361), ('supermodular', 0.347), ('lac', 0.223), ('qp', 0.211), ('lp', 0.169), ('schlesinger', 0.141), ('yth', 0.141), ('perceptron', 0.141), ('qt', 0.141), ('qd', 0.129), ('tt', 0.122), ('avchynskyy', 0.112), ('ranc', 0.112), ('iscriminative', 0.106), ('trivial', 0.102), ('sudoku', 0.1), ('cutting', 0.099), ('relaxation', 0.093), ('er', 0.092), ('task', 0.091), ('strictly', 0.084), ('inconsistent', 0.084), ('labeling', 0.079), ('plane', 0.077), ('puzzle', 0.076), ('maximal', 0.075), ('lassifiers', 0.074), ('um', 0.074), ('xt', 0.07), ('classi', 0.07), ('ax', 0.058), ('inequalities', 0.057), ('violated', 0.056), ('zi', 0.055), ('quality', 0.054), ('ers', 0.054), ('potentials', 0.053), ('solvable', 0.053), ('zu', 0.053), ('supermodularity', 0.047), ('neighbourhood', 0.045), ('acyclic', 0.043), ('neighbouring', 0.041), ('tsochantaridis', 0.04), ('ql', 0.04), ('bu', 0.04), ('taskar', 0.037), ('bi', 0.036), ('argmax', 0.036), ('earning', 0.035), ('halts', 0.035), ('landscape', 0.035), ('edges', 0.034), ('training', 0.034), ('risk', 0.034), ('solved', 0.033), ('discriminative', 0.032), ('article', 0.032), ('reads', 0.031), ('primal', 0.031), ('dual', 0.031), ('svm', 0.03), ('czech', 0.03), ('satisfaction', 0.029), ('labels', 0.029), ('solution', 0.028), ('solver', 0.028), ('response', 0.028), ('constraints', 0.028), ('nodes', 0.027), ('kolmogorov', 0.027), ('argmin', 0.027), ('slack', 0.026), ('structured', 0.025), ('precision', 0.025), ('ey', 0.025), ('images', 0.024), ('solving', 0.024), ('boykov', 0.023), ('ett', 0.023), ('finley', 0.023), ('novikoff', 0.023), ('werner', 0.023), ('svms', 0.023), ('qualities', 0.023), ('maxy', 0.023), ('object', 0.023), ('rd', 0.023), ('labelings', 0.022), ('solve', 0.022), ('stopping', 0.022), ('inequality', 0.021), ('reduced', 0.021), ('arbitrary', 0.021), ('flach', 0.02), ('subclass', 0.02), ('ordinary', 0.02), ('segmentation', 0.02), ('csp', 0.02)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 1.0000002 30 jmlr-2008-Discriminative Learning of Max-Sum Classifiers

Author: Vojtěch Franc, Bogdan Savchynskyy

Abstract: The max-sum classifier predicts an n-tuple of labels from an n-tuple of observable variables by maximizing a sum of quality functions defined over neighbouring pairs of labels and observable variables. Predicting labels as MAP assignments of a Markov random field is a particular example of the max-sum classifier. Learning parameters of the max-sum classifier is a challenging problem because even computing the response of such a classifier is NP-complete in general. Estimating parameters using the Maximum Likelihood approach is feasible only for a subclass of max-sum classifiers with an acyclic structure of neighbouring pairs. Recently, the discriminative methods represented by the perceptron and the Support Vector Machines, originally designed for binary linear classifiers, have been extended for learning some subclasses of the max-sum classifier. Besides the max-sum classifiers with the acyclic neighbouring structure, it has been shown that discriminative learning is possible even with an arbitrary neighbouring structure provided the quality functions fulfill some additional constraints. In this article, we extend the discriminative approach to three other classes of max-sum classifiers with an arbitrary neighbourhood structure. We derive learning algorithms for two subclasses of max-sum classifiers whose response can be computed in polynomial time: (i) the max-sum classifiers with supermodular quality functions and (ii) the max-sum classifiers whose response can be computed exactly by a linear programming relaxation. Moreover, we show that the learning problem can be approximately solved even for a general max-sum classifier. Keywords: max-sum classifier, hidden Markov networks, support vector machines

2 0.11412671 39 jmlr-2008-Gradient Tree Boosting for Training Conditional Random Fields

Author: Thomas G. Dietterich, Guohua Hao, Adam Ashenfelter

Abstract: Conditional random fields (CRFs) provide a flexible and powerful model for sequence labeling problems. However, existing learning algorithms are slow, particularly in problems with large numbers of potential input features and feature combinations. This paper describes a new algorithm for training CRFs via gradient tree boosting. In tree boosting, the CRF potential functions are represented as weighted sums of regression trees, which provide compact representations of feature interactions. So the algorithm does not explicitly consider the potentially large parameter space. As a result, gradient tree boosting scales linearly in the order of the Markov model and in the order of the feature interactions, rather than exponentially as in previous algorithms based on iterative scaling and gradient descent. Gradient tree boosting also makes it possible to use instance weighting (as in C4.5) and surrogate splitting (as in CART) to handle missing values. Experimental studies of the effectiveness of these two methods (as well as standard imputation and indicator feature methods) show that instance weighting is the best method in most cases when feature values are missing at random. Keywords: sequential supervised learning, conditional random fields, functional gradient, gradient tree boosting, missing values

3 0.081096873 74 jmlr-2008-Online Learning of Complex Prediction Problems Using Simultaneous Projections

Author: Yonatan Amit, Shai Shalev-Shwartz, Yoram Singer

Abstract: We describe and analyze an algorithmic framework for online classification where each online trial consists of multiple prediction tasks that are tied together. We tackle the problem of updating the online predictor by defining a projection problem in which each prediction task corresponds to a single linear constraint. These constraints are tied together through a single slack parameter. We then introduce a general method for approximately solving the problem by projecting simultaneously and independently on each constraint which corresponds to a prediction sub-problem, and then averaging the individual solutions. We show that this approach constitutes a feasible, albeit not necessarily optimal, solution of the original projection problem. We derive concrete simultaneous projection schemes and analyze them in the mistake bound model. We demonstrate the power of the proposed algorithm in experiments with synthetic data and with multiclass text categorization tasks. Keywords: online learning, parallel computation, mistake bounds, structured prediction

4 0.06362658 84 jmlr-2008-Search for Additive Nonlinear Time Series Causal Models

Author: Tianjiao Chu, Clark Glymour

Abstract: Pointwise consistent, feasible procedures for estimating contemporaneous linear causal structure from time series data have been developed using multiple conditional independence tests, but no such procedures are available for non-linear systems. We describe a feasible procedure for learning a class of non-linear time series structures, which we call additive non-linear time series. We show that for data generated from stationary models of this type, two classes of conditional independence relations among time series variables and their lags can be tested efficiently and consistently using tests based on additive model regression. Combining results of statistical tests for these two classes of conditional independence relations and the temporal structure of time series data, a new consistent model specification procedure is able to extract relatively detailed causal information. We investigate the finite sample behavior of the procedure through simulation, and illustrate the application of this method through analysis of the possible causal connections among four ocean indices. Several variants of the procedure are also discussed. Keywords: conditional independence test, contemporaneous causation, additive model regression, Granger causality, ocean indices

5 0.05953449 75 jmlr-2008-Optimal Solutions for Sparse Principal Component Analysis

Author: Alexandre d'Aspremont, Francis Bach, Laurent El Ghaoui

Abstract: Given a sample covariance matrix, we examine the problem of maximizing the variance explained by a linear combination of the input variables while constraining the number of nonzero coefficients in this combination. This is known as sparse principal component analysis and has a wide array of applications in machine learning and engineering. We formulate a new semidefinite relaxation to this problem and derive a greedy algorithm that computes a full set of good solutions for all target numbers of nonzero coefficients, with total complexity O(n^3), where n is the number of variables. We then use the same relaxation to derive sufficient conditions for global optimality of a solution, which can be tested in O(n^3) per pattern. We discuss applications in subset selection and sparse recovery and show on artificial examples and biological data that our algorithm does provide globally optimal solutions in many cases. Keywords: PCA, subset selection, sparse eigenvalues, sparse recovery, lasso

6 0.05726992 41 jmlr-2008-Graphical Models for Structured Classification, with an Application to Interpreting Images of Protein Subcellular Location Patterns

7 0.053267468 58 jmlr-2008-Max-margin Classification of Data with Absent Features

8 0.049461171 68 jmlr-2008-Nearly Uniform Validation Improves Compression-Based Error Bounds

9 0.049105663 56 jmlr-2008-Magic Moments for Structured Output Prediction

10 0.048995495 89 jmlr-2008-Support Vector Machinery for Infinite Ensemble Learning

11 0.048386376 51 jmlr-2008-Learning Similarity with Operator-valued Large-margin Classifiers

12 0.044086963 13 jmlr-2008-An Error Bound Based on a Worst Likely Assignment

13 0.042092834 11 jmlr-2008-Aggregation of SVM Classifiers Using Sobolev Spaces

14 0.040008668 54 jmlr-2008-Learning to Select Features using their Properties

15 0.039257489 9 jmlr-2008-Active Learning by Spherical Subdivision

16 0.038608506 37 jmlr-2008-Forecasting Web Page Views: Methods and Observations

17 0.037111878 62 jmlr-2008-Model Selection Through Sparse Maximum Likelihood Estimation for Multivariate Gaussian or Binary Data

18 0.036909446 95 jmlr-2008-Value Function Based Reinforcement Learning in Changing Markovian Environments

19 0.036546696 3 jmlr-2008-A Moment Bound for Multi-hinge Classifiers

20 0.03615943 34 jmlr-2008-Exponentiated Gradient Algorithms for Conditional Random Fields and Max-Margin Markov Networks


similar papers computed by lsi model

lsi for this paper:

topicId topicWeight

[(0, 0.187), (1, -0.024), (2, 0.068), (3, -0.011), (4, -0.117), (5, -0.028), (6, 0.064), (7, 0.055), (8, -0.105), (9, -0.206), (10, -0.093), (11, -0.122), (12, -0.056), (13, -0.058), (14, -0.109), (15, -0.056), (16, -0.169), (17, 0.131), (18, -0.032), (19, 0.095), (20, -0.162), (21, 0.013), (22, 0.026), (23, -0.067), (24, 0.261), (25, 0.007), (26, 0.194), (27, 0.002), (28, -0.013), (29, 0.028), (30, 0.037), (31, -0.167), (32, 0.06), (33, 0.078), (34, -0.056), (35, -0.153), (36, -0.002), (37, -0.092), (38, 0.06), (39, -0.04), (40, 0.096), (41, -0.034), (42, 0.039), (43, 0.161), (44, -0.077), (45, 0.055), (46, -0.048), (47, 0.074), (48, -0.116), (49, 0.19)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.92007768 30 jmlr-2008-Discriminative Learning of Max-Sum Classifiers

Author: Vojtěch Franc, Bogdan Savchynskyy

Abstract: The max-sum classifier predicts an n-tuple of labels from an n-tuple of observable variables by maximizing a sum of quality functions defined over neighbouring pairs of labels and observable variables. Predicting labels as MAP assignments of a Markov random field is a particular example of the max-sum classifier. Learning parameters of the max-sum classifier is a challenging problem because even computing the response of such a classifier is NP-complete in general. Estimating parameters using the Maximum Likelihood approach is feasible only for a subclass of max-sum classifiers with an acyclic structure of neighbouring pairs. Recently, the discriminative methods represented by the perceptron and the Support Vector Machines, originally designed for binary linear classifiers, have been extended for learning some subclasses of the max-sum classifier. Besides the max-sum classifiers with the acyclic neighbouring structure, it has been shown that discriminative learning is possible even with an arbitrary neighbouring structure provided the quality functions fulfill some additional constraints. In this article, we extend the discriminative approach to three other classes of max-sum classifiers with an arbitrary neighbourhood structure. We derive learning algorithms for two subclasses of max-sum classifiers whose response can be computed in polynomial time: (i) the max-sum classifiers with supermodular quality functions and (ii) the max-sum classifiers whose response can be computed exactly by a linear programming relaxation. Moreover, we show that the learning problem can be approximately solved even for a general max-sum classifier. Keywords: max-sum classifier, hidden Markov networks, support vector machines

2 0.57374656 39 jmlr-2008-Gradient Tree Boosting for Training Conditional Random Fields

Author: Thomas G. Dietterich, Guohua Hao, Adam Ashenfelter

Abstract: Conditional random fields (CRFs) provide a flexible and powerful model for sequence labeling problems. However, existing learning algorithms are slow, particularly in problems with large numbers of potential input features and feature combinations. This paper describes a new algorithm for training CRFs via gradient tree boosting. In tree boosting, the CRF potential functions are represented as weighted sums of regression trees, which provide compact representations of feature interactions. So the algorithm does not explicitly consider the potentially large parameter space. As a result, gradient tree boosting scales linearly in the order of the Markov model and in the order of the feature interactions, rather than exponentially as in previous algorithms based on iterative scaling and gradient descent. Gradient tree boosting also makes it possible to use instance weighting (as in C4.5) and surrogate splitting (as in CART) to handle missing values. Experimental studies of the effectiveness of these two methods (as well as standard imputation and indicator feature methods) show that instance weighting is the best method in most cases when feature values are missing at random. Keywords: sequential supervised learning, conditional random fields, functional gradient, gradient tree boosting, missing values

3 0.42380914 84 jmlr-2008-Search for Additive Nonlinear Time Series Causal Models

Author: Tianjiao Chu, Clark Glymour

Abstract: Pointwise consistent, feasible procedures for estimating contemporaneous linear causal structure from time series data have been developed using multiple conditional independence tests, but no such procedures are available for non-linear systems. We describe a feasible procedure for learning a class of non-linear time series structures, which we call additive non-linear time series. We show that for data generated from stationary models of this type, two classes of conditional independence relations among time series variables and their lags can be tested efficiently and consistently using tests based on additive model regression. Combining results of statistical tests for these two classes of conditional independence relations and the temporal structure of time series data, a new consistent model specification procedure is able to extract relatively detailed causal information. We investigate the finite sample behavior of the procedure through simulation, and illustrate the application of this method through analysis of the possible causal connections among four ocean indices. Several variants of the procedure are also discussed. Keywords: conditional independence test, contemporaneous causation, additive model regression, Granger causality, ocean indices

4 0.41180652 74 jmlr-2008-Online Learning of Complex Prediction Problems Using Simultaneous Projections

Author: Yonatan Amit, Shai Shalev-Shwartz, Yoram Singer

Abstract: We describe and analyze an algorithmic framework for online classification where each online trial consists of multiple prediction tasks that are tied together. We tackle the problem of updating the online predictor by defining a projection problem in which each prediction task corresponds to a single linear constraint. These constraints are tied together through a single slack parameter. We then introduce a general method for approximately solving the problem by projecting simultaneously and independently on each constraint which corresponds to a prediction sub-problem, and then averaging the individual solutions. We show that this approach constitutes a feasible, albeit not necessarily optimal, solution of the original projection problem. We derive concrete simultaneous projection schemes and analyze them in the mistake bound model. We demonstrate the power of the proposed algorithm in experiments with synthetic data and with multiclass text categorization tasks. Keywords: online learning, parallel computation, mistake bounds, structured prediction

5 0.32974935 41 jmlr-2008-Graphical Models for Structured Classification, with an Application to Interpreting Images of Protein Subcellular Location Patterns

Author: Shann-Ching Chen, Geoffrey J. Gordon, Robert F. Murphy

Abstract: In structured classification problems, there is a direct conflict between expressive models and efficient inference: while graphical models such as Markov random fields or factor graphs can represent arbitrary dependences among instance labels, the cost of inference via belief propagation in these models grows rapidly as the graph structure becomes more complicated. One important source of complexity in belief propagation is the need to marginalize large factors to compute messages. This operation takes time exponential in the number of variables in the factor, and can limit the expressiveness of the models we can use. In this paper, we study a new class of potential functions, which we call decomposable k-way potentials, and provide efficient algorithms for computing messages from these potentials during belief propagation. We believe these new potentials provide a good balance between expressive power and efficient inference in practical structured classification problems. We discuss three instances of decomposable potentials: the associative Markov network potential, the nested junction tree, and a new type of potential which we call the voting potential. We use these potentials to classify images of protein subcellular location patterns in groups of cells. Classifying subcellular location patterns can help us answer many important questions in computational biology, including questions about how various treatments affect the synthesis and behavior of proteins and networks of proteins within a cell. Our new representation and algorithm lead to substantial improvements in both inference speed and classification accuracy. Keywords: factor graphs, approximate inference algorithms, structured classification, protein subcellular location patterns, location proteomics

6 0.29886988 58 jmlr-2008-Max-margin Classification of Data with Absent Features

7 0.2730507 9 jmlr-2008-Active Learning by Spherical Subdivision

8 0.2662043 14 jmlr-2008-An Extension on "Statistical Comparisons of Classifiers over Multiple Data Sets" for all Pairwise Comparisons

9 0.24997409 51 jmlr-2008-Learning Similarity with Operator-valued Large-margin Classifiers

10 0.23043306 56 jmlr-2008-Magic Moments for Structured Output Prediction

11 0.22349633 75 jmlr-2008-Optimal Solutions for Sparse Principal Component Analysis

12 0.22150752 25 jmlr-2008-Consistency of Random Forests and Other Averaging Classifiers

13 0.21545869 68 jmlr-2008-Nearly Uniform Validation Improves Compression-Based Error Bounds

14 0.20090525 8 jmlr-2008-Accelerated Neural Evolution through Cooperatively Coevolved Synapses

15 0.20055182 11 jmlr-2008-Aggregation of SVM Classifiers Using Sobolev Spaces

16 0.19112289 43 jmlr-2008-Hit Miss Networks with Applications to Instance Selection

17 0.18836318 52 jmlr-2008-Learning from Multiple Sources

18 0.18327847 3 jmlr-2008-A Moment Bound for Multi-hinge Classifiers

19 0.18323974 32 jmlr-2008-Estimating the Confidence Interval for Prediction Errors of Support Vector Machine Classifiers

20 0.1790359 37 jmlr-2008-Forecasting Web Page Views: Methods and Observations


similar papers computed by lda model

lda for this paper:

topicId topicWeight

[(0, 0.025), (5, 0.022), (21, 0.01), (31, 0.012), (40, 0.038), (54, 0.527), (58, 0.043), (66, 0.033), (76, 0.029), (88, 0.068), (92, 0.028), (94, 0.041), (99, 0.018)]

similar papers list:

simIndex simValue paperId paperTitle

1 0.96669281 95 jmlr-2008-Value Function Based Reinforcement Learning in Changing Markovian Environments

Author: Balázs Csanád Csáji, László Monostori

Abstract: The paper investigates the possibility of applying value function based reinforcement learning (RL) methods in cases when the environment may change over time. First, theorems are presented which show that the optimal value function of a discounted Markov decision process (MDP) Lipschitz continuously depends on the immediate-cost function and the transition-probability function. Dependence on the discount factor is also analyzed and shown to be non-Lipschitz. Afterwards, the concept of (ε, δ)-MDPs is introduced, which is a generalization of MDPs and ε-MDPs. In this model the environment may change over time, more precisely, the transition function and the cost function may vary from time to time, but the changes must be bounded in the limit. Then, learning algorithms in changing environments are analyzed. A general relaxed convergence theorem for stochastic iterative algorithms is presented. We also demonstrate the results through three classical RL methods: asynchronous value iteration, Q-learning and temporal difference learning. Finally, some numerical experiments concerning changing environments are presented. Keywords: Markov decision processes, reinforcement learning, changing environments, (ε, δ)MDPs, value function bounds, stochastic iterative algorithms

same-paper 2 0.91679859 30 jmlr-2008-Discriminative Learning of Max-Sum Classifiers

Author: Vojtěch Franc, Bogdan Savchynskyy

Abstract: The max-sum classifier predicts an n-tuple of labels from an n-tuple of observable variables by maximizing a sum of quality functions defined over neighbouring pairs of labels and observable variables. Predicting labels as MAP assignments of a Markov random field is a particular example of the max-sum classifier. Learning parameters of the max-sum classifier is a challenging problem because even computing the response of such a classifier is NP-complete in general. Estimating parameters using the Maximum Likelihood approach is feasible only for a subclass of max-sum classifiers with an acyclic structure of neighbouring pairs. Recently, the discriminative methods represented by the perceptron and the Support Vector Machines, originally designed for binary linear classifiers, have been extended for learning some subclasses of the max-sum classifier. Besides the max-sum classifiers with the acyclic neighbouring structure, it has been shown that discriminative learning is possible even with an arbitrary neighbouring structure provided the quality functions fulfill some additional constraints. In this article, we extend the discriminative approach to three other classes of max-sum classifiers with an arbitrary neighbourhood structure. We derive learning algorithms for two subclasses of max-sum classifiers whose response can be computed in polynomial time: (i) the max-sum classifiers with supermodular quality functions and (ii) the max-sum classifiers whose response can be computed exactly by a linear programming relaxation. Moreover, we show that the learning problem can be approximately solved even for a general max-sum classifier. Keywords: max-sum classifier, hidden Markov networks, support vector machines

3 0.84670407 33 jmlr-2008-Evidence Contrary to the Statistical View of Boosting

Author: David Mease, Abraham Wyner

Abstract: The statistical perspective on boosting algorithms focuses on optimization, drawing parallels with maximum likelihood estimation for logistic regression. In this paper we present empirical evidence that raises questions about this view. Although the statistical perspective provides a theoretical framework within which it is possible to derive theorems and create new algorithms in general contexts, we show that there remain many unanswered important questions. Furthermore, we provide examples that reveal crucial flaws in the many practical suggestions and new methods that are derived from the statistical view. We perform carefully designed experiments using simple simulation models to illustrate some of these flaws and their practical consequences. Keywords: boosting algorithms, LogitBoost, AdaBoost

4 0.63948995 36 jmlr-2008-Finite-Time Bounds for Fitted Value Iteration

Author: Rémi Munos, Csaba Szepesvári

Abstract: In this paper we develop a theoretical analysis of the performance of sampling-based fitted value iteration (FVI) to solve infinite state-space, discounted-reward Markovian decision processes (MDPs) under the assumption that a generative model of the environment is available. Our main results come in the form of finite-time bounds on the performance of two versions of sampling-based FVI. The convergence rate results obtained allow us to show that both versions of FVI are well behaving in the sense that by using a sufficiently large number of samples for a large class of MDPs, arbitrary good performance can be achieved with high probability. An important feature of our proof technique is that it permits the study of weighted L p -norm performance bounds. As a result, our technique applies to a large class of function-approximation methods (e.g., neural networks, adaptive regression trees, kernel machines, locally weighted learning), and our bounds scale well with the effective horizon of the MDP. The bounds show a dependence on the stochastic stability properties of the MDP: they scale with the discounted-average concentrability of the future-state distributions. They also depend on a new measure of the approximation power of the function space, the inherent Bellman residual, which reflects how well the function space is “aligned” with the dynamics and rewards of the MDP. The conditions of the main result, as well as the concepts introduced in the analysis, are extensively discussed and compared to previous theoretical results. Numerical experiments are used to substantiate the theoretical findings. Keywords: fitted value iteration, discounted Markovian decision processes, generative model, reinforcement learning, supervised learning, regression, Pollard’s inequality, statistical learning theory, optimal control

5 0.54174852 74 jmlr-2008-Online Learning of Complex Prediction Problems Using Simultaneous Projections

Author: Yonatan Amit, Shai Shalev-Shwartz, Yoram Singer

Abstract: We describe and analyze an algorithmic framework for online classification where each online trial consists of multiple prediction tasks that are tied together. We tackle the problem of updating the online predictor by defining a projection problem in which each prediction task corresponds to a single linear constraint. These constraints are tied together through a single slack parameter. We then introduce a general method for approximately solving the problem by projecting simultaneously and independently on each constraint which corresponds to a prediction sub-problem, and then averaging the individual solutions. We show that this approach constitutes a feasible, albeit not necessarily optimal, solution of the original projection problem. We derive concrete simultaneous projection schemes and analyze them in the mistake bound model. We demonstrate the power of the proposed algorithm in experiments with synthetic data and with multiclass text categorization tasks. Keywords: online learning, parallel computation, mistake bounds, structured prediction

6 0.53247797 34 jmlr-2008-Exponentiated Gradient Algorithms for Conditional Random Fields and Max-Margin Markov Networks

7 0.48067108 19 jmlr-2008-Bouligand Derivatives and Robustness of Support Vector Machines for Regression

8 0.47748375 94 jmlr-2008-Value Function Approximation using Multiple Aggregation for Multiattribute Resource Management

9 0.45357102 66 jmlr-2008-Multi-class Discriminant Kernel Learning via Convex Programming    (Special Topic on Model Selection)

10 0.45302558 83 jmlr-2008-Robust Submodular Observation Selection

11 0.4520897 65 jmlr-2008-Multi-Agent Reinforcement Learning in Common Interest and Fixed Sum Stochastic Games: An Experimental Study

12 0.44711247 39 jmlr-2008-Gradient Tree Boosting for Training Conditional Random Fields

13 0.44302773 56 jmlr-2008-Magic Moments for Structured Output Prediction

14 0.44075948 21 jmlr-2008-Classification with a Reject Option using a Hinge Loss

15 0.4394879 86 jmlr-2008-SimpleMKL

16 0.43292797 89 jmlr-2008-Support Vector Machinery for Infinite Ensemble Learning

17 0.42701581 58 jmlr-2008-Max-margin Classification of Data with Absent Features

18 0.41624364 51 jmlr-2008-Learning Similarity with Operator-valued Large-margin Classifiers

19 0.41244856 28 jmlr-2008-Coordinate Descent Method for Large-scale L2-loss Linear Support Vector Machines

20 0.41226569 63 jmlr-2008-Model Selection for Regression with Continuous Kernel Functions Using the Modulus of Continuity    (Special Topic on Model Selection)