nips nips2009 nips2009-239 knowledge-graph by maker-knowledge-mining

239 nips-2009-Submodularity Cuts and Applications


Source: pdf

Author: Yoshinobu Kawahara, Kiyohito Nagano, Koji Tsuda, Jeff A. Bilmes

Abstract: Several key problems in machine learning, such as feature selection and active learning, can be formulated as submodular set function maximization. We present herein a novel algorithm for maximizing a submodular set function under a cardinality constraint — the algorithm is based on a cutting-plane method and is implemented as an iterative small-scale binary-integer linear programming procedure. It is well known that this problem is NP-hard, and the approximation factor achieved by the greedy algorithm is the theoretical limit for polynomial time. As for (non-polynomial time) exact algorithms that perform reasonably in practice, there has been very little work in the literature, although the problem is quite important for many applications. Our algorithm is guaranteed to find the exact solution in finitely many iterations, and it converges fast in practice due to the efficiency of the cutting-plane mechanism. Moreover, we also provide a method that produces successively decreasing upper bounds on the optimal value, while our algorithm provides successively increasing lower bounds. Thus, the accuracy of the current solution can be estimated at any point, and the algorithm can be stopped early once a desired degree of tolerance is met. We evaluate our algorithm on sensor placement and feature selection applications, showing good performance. 1

Reference: text


Summary: the most important sentences generated by the tfidf model

sentIndex sentText sentNum sentScore

1 Several key problems in machine learning, such as feature selection and active learning, can be formulated as submodular set function maximization. [sent-26, score-0.588]

2 We present herein a novel algorithm for maximizing a submodular set function under a cardinality constraint — the algorithm is based on a cutting-plane method and is implemented as an iterative small-scale binary-integer linear programming procedure. [sent-27, score-0.81]

3 Our algorithm is guaranteed to find the exact solution in finitely many iterations, and it converges fast in practice due to the efficiency of the cutting-plane mechanism. [sent-30, score-0.157]

4 Moreover, we also provide a method that produces successively decreasing upper-bounds of the optimal solution, while our algorithm provides successively increasing lower-bounds. [sent-31, score-0.111]

5 Thus, the accuracy of the current solution can be estimated at any point, and the algorithm can be stopped early once a desired degree of tolerance is met. [sent-32, score-0.126]

6 We evaluate our algorithm on sensor placement and feature selection applications showing good performance. [sent-33, score-0.261]

7 This type of function can be regarded as a discrete counterpart of convex functions, and includes entropy, symmetric mutual information, information gain, graph cut functions, and so on. [sent-38, score-0.288]
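
A quick illustrative sketch (ours, not from the paper): submodularity is equivalent to diminishing returns, f(A ∪ {e}) − f(A) ≥ f(B ∪ {e}) − f(B) for all A ⊆ B and e ∉ B, which can be checked by brute force on a small ground set.

from itertools import combinations

def is_submodular(f, V):
    """Brute-force check of diminishing returns:
    f(A + e) - f(A) >= f(B + e) - f(B) for all A <= B, e not in B."""
    subsets = [frozenset(c) for r in range(len(V) + 1)
               for c in combinations(sorted(V), r)]
    for A in subsets:
        for B in subsets:
            if not A <= B:
                continue
            for e in V - B:
                if f(A | {e}) - f(A) < f(B | {e}) - f(B) - 1e-12:
                    return False
    return True

# Example: a graph cut function (on nodes) is submodular but not nondecreasing.
V = frozenset({0, 1, 2})
edges = [(0, 1), (1, 2), (0, 2)]
cut = lambda S: sum(1 for (u, v) in edges if (u in S) != (v in S))
print(is_submodular(cut, V))  # True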

8 In recent years, treating machine learning problems as submodular set function maximization (usually under some constraint, such as limited cardinality) has been addressed in the community [10, 13, 22]. [sent-39, score-0.64]

9 In this paper, we address submodular function maximization under a cardinality constraint: s. [sent-40, score-0.684]

10 In combinatorial optimization, it is well known that submodular maximization is NP-hard and that the approximation factor of (1 − 1/e) (≈ 0.63) [sent-53, score-0.665]

11 achieved by the greedy algorithm [19] is the theoretical limit of a polynomial-time algorithm for positive and nondecreasing submodular functions [3]. [sent-54, score-0.855]

12 That is, in the worst case, no polynomial-time algorithm can guarantee a solution whose value exceeds (1 − 1/e) times the optimal value, unless P=NP. [sent-55, score-0.13]
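
For context on the (1 − 1/e) guarantee, here is a minimal sketch of the greedy algorithm of Nemhauser et al. [19] in its standard textbook form (not code from this paper):

def greedy_max(f, V, k):
    """Greedy maximization of f under |S| <= k; achieves
    (1 - 1/e) * OPT for nondecreasing submodular f [19]."""
    S = set()
    for _ in range(k):
        # Add the element with the largest marginal gain.
        best = max((e for e in V if e not in S),
                   key=lambda e: f(S | {e}) - f(S))
        S.add(best)
    return S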

13 In feature selection or sensor placement, for example, one may be willing to spend much more time in the selection phase, since once selected, items are used many times or for a long duration. [sent-58, score-0.164]

14 Unfortunately, there has been very little work in the literature on finding exact but still practical solutions to submodular maximization [17, 14, 8]. [sent-59, score-0.664]

15 To the best of our knowledge, the algorithm by Nemhauser and Wolsey [17] is the only method for exactly maximizing a general form of nondecreasing submodular functions (other than naive brute force). [sent-60, score-0.787]

16 In this paper, we present a novel algorithm for maximizing a submodular set function under a cardinality constraint based on a cutting-plane method, which is implemented as an iterative small-scale binary-integer linear programming (BILP) procedure. [sent-62, score-0.77]

17 To this end, we derive the submodularity cut, a cutting plane that cuts off the feasible sets on which the objective function values are guaranteed to be no better than the current best one; this is based on the submodularity of a function and its Lovász extension [15, 16]. [sent-63, score-1.601]

18 This cut assures convergence to the optimum in a finite number of iterations and allows searching for better subsets in an efficient manner, so that the algorithm can be applied to suitably sized problems. [sent-64, score-0.333]

19 The existing algorithm [17] is infeasible for such problems since, as originally presented, it has no criterion for improving the solution efficiently at each iteration (we compare these algorithms empirically in Sect. [sent-65, score-0.126]

20 This enables us to judge the accuracy of the current best solution and to calculate an ε-optimal solution for a predetermined ε > 0 (cf. [sent-69, score-0.178]

21 Note that BILP is a special case of MIP and is generally more efficient to solve, and the presented algorithm can be applied to any submodular function, while the existing one requires the nondecreasing property. [sent-74, score-0.75]

22 1 We evaluate the proposed algorithm on the applications of sensor placement and feature selection in text classification. [sent-75, score-0.286]

23 2, we present submodularity cuts and give a general description of the algorithm using this cutting plane. [sent-77, score-0.76]

24 Then, we describe a specific procedure for performing the submodularity cut algorithm in Sect. [sent-78, score-0.701]

25 3, and the way of updating an upper bound for calculating an ε-optimal solution in Sect. [sent-79, score-0.13]

26 Using this information, we cut off the feasible sets on which the objective function values are guaranteed to be no better than f (S0 ). [sent-85, score-0.295]

27 In this section, we address a method for solving the submodular maximization problem (1) based on this idea along the lines of cutting-plane methods, as described by Tuy [23] (see also [6, 7]) and often successfully used in algorithms for solving mathematical programming problems [18, 11, 20]. [sent-86, score-0.692]

28 1 Lovász extension. For dealing with the submodular maximization problem (1) in a way analogous to the continuous counterpart, i. [sent-88, score-0.921]

29 , convex maximization, we briefly describe a useful extension of submodular functions, called the Lovász extension [15, 16]. [sent-90, score-0.896]

30 1 A submodular function is called nondecreasing if f (A) ≤ f (B) whenever A ⊆ B. [sent-92, score-0.687]

31 For example, an entropy function is nondecreasing but a cut function on nodes is not. [sent-93, score-0.349]

32 [Figure 1: Illustration of the cutting plane H; figure annotations omitted.] [sent-98, score-0.226]

33 Then, the Lovász extension f̂ : Rn → R corresponding to a general set function f : 2V → R, which is not necessarily submodular, is defined as f̂(p) = Σ_{k=1}^{m−1} (p̂k − p̂k+1) f(Uk) + p̂m f(Um), (2) where Uk = {i ∈ V : pi ≥ p̂k}. [sent-102, score-0.344]
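
In code, evaluating Eq. (2) amounts to sorting the coordinates of p in decreasing order (the p̂k) and telescoping over the nested level sets Uk. A sketch, assuming f(∅) = 0 and a ground set indexed 0, …, n−1 (ours, not the authors' code):

def lovasz_extension(f, p):
    """Evaluate the Lovász extension f_hat(p) of Eq. (2): sort the
    coordinates of p in decreasing order and telescope over the
    nested level sets U_1 ⊂ U_2 ⊂ ... (ties contribute zero terms)."""
    n = len(p)
    order = sorted(range(n), key=lambda i: -p[i])
    value, U = 0.0, set()
    for k, i in enumerate(order):
        U.add(i)                                  # U_k: top-k coordinates
        next_p = p[order[k + 1]] if k + 1 < n else 0.0
        value += (p[i] - next_p) * f(U)
    return value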

34 However, the following relationship between the submodularity of f and the convexity of f̂ is given [15, 16]: Theorem 1. For a set function f : 2V → R and its Lovász extension f̂ : Rn → R, f is submodular if and only if f̂ is convex. [sent-107, score-1.271]

35 Then, the Lovász extension f̂ is a natural extension of f in the sense that it satisfies f̂(I_S) = f(S) (S ⊆ V). (3) [sent-111, score-0.307]

36 Now we introduce a continuous relaxation of the problem (1) using the Lovász extension f̂. [sent-113, score-0.281]

37 A polytope P ⊆ Rn is a bounded intersection of a finite set of half-spaces — that is, P is of the form P = {x ∈ Rn : Aj x ≤ bj , j = 1, · · · , m}, where Aj is a real vector and bj is a real scalar. [sent-114, score-0.196]

38 2 Submodularity cuts Here, we derive what we call the submodularity cut, a cutting plane that cuts off the feasible sets with optimality guarantees using the submodularity of f , and with the help of the relationship between submodularity and convexity described in Thm. [sent-120, score-1.973]

39 Note that the algorithm using this cutting plane, described later, converges to an optimal solution in a finite number of iterations (cf. [sent-122, score-0.262]

40 The presented technique is essentially a discrete analog of concavity cut techniques for continuous concave minimization, which rests on the following property (see, e. [sent-125, score-0.308]

41 Theorem 2 A convex function g : Rn → R attains its global maximum over a polytope P ⊂ Rn at a vertex of P . [sent-128, score-0.218]

42 2 For a submodular function, the Lovász extension (2) is known to be equal to f̂(p) = sup{pT x : x ∈ B(f)} (p ∈ Rn), where B(f) = {x ∈ Rn : x(S) ≤ f(S) (∀S ⊂ V), x(V) = f(V)} is the base polyhedron associated with f [15] and x(S) = Σ_{i∈S} xi. [sent-129, score-0.79]
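
The supremum above is attained by Edmonds' greedy algorithm, which visits coordinates in decreasing order of p and assigns marginal gains. A sketch of this standard construction, again assuming f(∅) = 0 (not specific to this paper):

def edmonds_greedy(f, p):
    """Greedy vertex x of the base polyhedron B(f) maximizing pT x:
    x_{sigma(k)} = f(U_k) - f(U_{k-1}) with sigma sorting p in
    decreasing order; then sum(p_i * x_i) equals f_hat(p)."""
    n = len(p)
    order = sorted(range(n), key=lambda i: -p[i])
    x, U, prev = [0.0] * n, set(), 0.0
    for i in order:
        U.add(i)
        val = f(U)
        x[i], prev = val - prev, val   # marginal gain
    return x

With this x, sum(p[i] * x[i] for i in range(n)) agrees with lovasz_extension(f, p) above, which is one way to sanity-check both sketches.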

43 Thus, the cutting plane for the problem max{f̂(x) : x ∈ D0} can be applied to our problem (1) through the relationship (4). [sent-140, score-0.183]

44 To derive the submodularity cut, we use the following definition: Definition 3 (γ-extension) Let g : Rn → R be a convex function, x ∈ Rn , γ be a real number satisfying γ ≥ g(x) and t > 0. [sent-141, score-0.496]

45 The γ-extension of x ∈ Rn can be defined with respect to the Lovász extension because it is a convex function. [sent-144, score-0.287]
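
The body of Definition 3 is truncated in this extract; under the standard concavity-cut convention of Tuy [23], the γ-extension of x in direction d is y = x + θd with θ = sup{t > 0 : g(x + td) ≤ γ}. A bisection sketch under that assumption:

import numpy as np

def gamma_extension(g, x, d, gamma, t_max=1e6, tol=1e-9):
    """Approximate theta = sup{t > 0 : g(x + t*d) <= gamma} by
    bisection. Convexity of g makes {t >= 0 : g(x + t*d) <= gamma}
    an interval containing 0 (since gamma >= g(x))."""
    x, d = np.asarray(x, float), np.asarray(d, float)
    if g(x + t_max * d) <= gamma:
        return t_max                      # effectively unbounded direction
    lo, hi = 0.0, t_max
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if g(x + mid * d) <= gamma:
            lo = mid
        else:
            hi = mid
    return lo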

46 The submodular cut algorithm is an iterative procedure. [sent-145, score-0.82]

47 At each iteration, the algorithm keeps a polytope P ⊆ D0 , the current best function value γ, and a set S ∗ ⊆ V satisfying f (S ∗ ) = γ. [sent-146, score-0.169]

48 Let K(v; d1, · · · , dn) be a convex polyhedral cone with vertex v generated by linearly independent vectors d1, · · · , dn. [sent-151, score-0.328]

49 For each l = 1, · · · , n, let yl = v + θl dl be the γ-extension of v in direction dl with respect to f̂. [sent-157, score-0.11]

50 These directions are not necessarily chosen tightly on P (in fact, the directions described in Sect. [sent-165, score-0.112]

51 Since the vectors dl are linearly independent, there exists a unique hyperplane H = H(y1, · · · , yn) that contains yl (l = 1, · · · , n), which we call a submodular cut. [sent-168, score-0.663]

52 Obviously the point v is in the halfspace H−, and moreover, we have: Lemma 4. Let P ⊆ D0 be a polytope, γ be the current best function value, v be a vertex of P such that v = I_S for some S ∈ S(P), and H− be the halfspace determined by the cutting plane, i. [sent-174, score-0.271]

53 Proof. Since P ⊂ K = K(I_S; d1, · · · , dn), it follows that P ∩ H− is contained in the simplex R = [I_S, y1, · · · , yn]. [sent-184, score-0.123]

54 Since the Lovász extension f̂ is convex and the maximum of a convex function over a compact convex set is attained at a vertex of the set (Thm. [sent-185, score-0.485]

55 The above lemma shows that we can cut off the feasible subsets S(P ∩ H− ) from S(P ) without loss of any feasible set whose objective function value is better than γ. [sent-190, score-0.372]

56 The submodular cut algorithm updates P ← P ∩ H+ until the global optimality of γ is guaranteed. [sent-196, score-0.824]

57 [Figure 2: Outline of the submodularity cuts algorithm; figure annotations omitted.]

58 Algorithm 1 General description of the submodularity cuts algorithm. [sent-208, score-0.611]

59 Construct a submodularity cut Hi with respect to Si−1, Pi−1 and γi−1. [sent-217, score-0.661]

60 stop ← true (S ∗ is an optimal solution and γi−1 the optimal value). [sent-220, score-0.172]

61 1 Construction of submodularity cuts. Given a vertex of a polytope P ⊆ D0, which is of the form I_S, we describe how to compute linearly independent directions d1, · · · , dn for the construction of the submodularity cut at each iteration of the algorithm (Line 4 in Alg. [sent-240, score-1.717]

62 The directions d1, · · · , dn can be chosen as −el (l ∈ S) and el (l ∈ V \ S). [sent-246, score-0.123]

63 Then the cone K(I_S; d1, · · · , dn) = {I_S + t1 d1 + · · · + tn dn : tl ≥ 0} contains the polytope D0 = {x ∈ Rn : 0 ≤ xl ≤ 1 (l = 1, · · · , n), Σ_{l=1}^{n} xl ≤ k}. [sent-282, score-0.458]
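
Combining the pieces: with v = I_S and the coordinate directions just described, the hyperplane through the γ-extension points yl = v + θl dl is Σl tl/θl = 1 in cone coordinates, which works out to Σ_{l∈S}(1 − xl)/θl + Σ_{l∉S} xl/θl = 1 in the original coordinates. A sketch of this construction (our own derivation, reusing the lovasz_extension and gamma_extension helpers above; ground set {0, …, n−1}):

import numpy as np

def submodularity_cut(f, S, n, gamma):
    """Build a submodularity cut at v = I_S with d_l = -e_l (l in S)
    and d_l = +e_l (l not in S). Returns (w, b) defining the kept
    halfspace H+ = {x : w @ x >= b}; v itself lies in H-."""
    v = np.array([1.0 if l in S else 0.0 for l in range(n)])
    f_hat = lambda p: lovasz_extension(f, list(p))
    w, b = np.zeros(n), 1.0
    for l in range(n):
        d = np.zeros(n)
        d[l] = -1.0 if l in S else 1.0
        theta = gamma_extension(f_hat, v, d, gamma)
        w[l] = d[l] / theta           # cone coordinate t_l = d_l (x_l - v_l)
        if l in S:
            b -= 1.0 / theta          # shift from v_l = 1
    return w, b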

64 Algorithm 2: Pseudo-code of the submodularity cuts algorithm using BILP. [sent-290, score-0.651]

65 Construct a submodularity cut H with respect to Si−1, Pi−1 and γi−1. [sent-299, score-0.661]

66 Solve the BILP problem (9) with respect to Aj and bj (j = 1, · · · , nk), and let the optimal solution and value be Si and c∗, respectively. [sent-301, score-0.135]

67 stop ← true (S ∗ is an optimal solution and γi−1 the optimal value). [sent-304, score-0.172]
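
Assembling the extracted fragments of Algorithm 2 into a toy end-to-end version: enumeration stands in for the BILP solver and f itself is used as the search objective (the paper's BILP uses a linear surrogate), so this is a small-n illustration of the cutting loop rather than the authors' implementation. It reuses greedy_max and submodularity_cut from the sketches above.

from itertools import combinations

def submodularity_cuts_maximize(f, n, k):
    """Toy Algorithm 2: start from the greedy solution S0, repeatedly
    add a cut at the incumbent, and search the subsets that survive
    all cuts so far. Terminates since each cut removes its vertex."""
    V = set(range(n))
    S_star = frozenset(greedy_max(f, V, k))
    gamma = f(S_star)
    cuts = []                         # list of (w, b); keep w @ I_S >= b
    candidates = [frozenset(c) for r in range(1, k + 1)
                  for c in combinations(range(n), r)]
    S_cur = S_star
    while True:
        cuts.append(submodularity_cut(f, S_cur, n, gamma))
        feasible = [S for S in candidates
                    if all(sum(w[l] for l in S) >= b for (w, b) in cuts)]
        if not feasible:              # no subset can beat gamma: optimal
            return set(S_star), gamma
        S_cur = max(feasible, key=f)
        if f(S_cur) > gamma:
            S_star, gamma = S_cur, f(S_cur)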

68 (6), we obtain: Proposition 7. Let c∗ be the optimal value of the binary integer program max{eT Y−1 x : Aj x ≥ bj, j = 1, · · · , mk}. [sent-324, score-0.139]

69 4 Upper bound and ε-optimal solution. Although our algorithm can find an exact solution in a finite number of iterations, the computational cost could be expensive in high-dimensional cases. [sent-329, score-0.219]

70 Therefore, we present here an iterative update of an upper bound of the current solution, and thus a way to allow us to obtain an ε-optimal solution. [sent-330, score-0.116]

71 To this end, we combine the idea of the algorithm by Nemhauser and Wolsey [17] with our cutting plane algorithm. [sent-331, score-0.223]

72 If the submodular function f : 2V → R is nondecreasing, the submodular maximization problem (1) can be reformulated [17] as a mixed integer program of the form shown below. [sent-333, score-1.215]
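
The constraint set of this reformulation does not survive in the extract. For reference, a reconstruction of the Nemhauser–Wolsey program [17], using the marginal gains ρj(S) = f(S ∪ {j}) − f(S); treat the exact indexing as our paraphrase of the standard formulation, not a quotation from the paper:

    max η
    s.t. η ≤ f(S) + Σ_{j ∈ V∖S} ρj(S) yj   for all S ⊆ V,
         Σ_{j ∈ V} yj ≤ k,  y ∈ {0, 1}^n.

Solving this program with only the constraints for the sets Si generated so far yields a relaxation whose optimal value can only decrease as constraints are added, which is the upper-bound behavior described next.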

73 The optimal value of a maximization problem with a subset of the constraints is at least as large as that with all constraints, so the good news is that this upper bound is guaranteed to improve (monotonically decreasing toward the true optimum) at each iteration. [sent-341, score-0.281]

74 Therefore, by adding the constraint corresponding to Si at each iteration of our algorithm and solving the reduced MIP above, we can evaluate an upper bound of the current solution. [sent-343, score-0.178]

75 Thus, we can assure the optimality of the current solution, or obtain a desired ε-optimal solution, using both the lower and upper bounds. [sent-344, score-0.154]

76 1, and then apply the algorithm to the real-world applications of sensor placement, and feature selection in text classification (Sect. [sent-349, score-0.229]

77 In the experiments, we used the solution from a greedy algorithm as the initial subset S0. [sent-353, score-0.199]

78 1 Artificial example. Here, we empirically evaluate and illustrate the submodularity cut algorithm (Alg. [sent-363, score-0.701]

79 2) with respect to (1) computational time for exact solutions compared with the existing algorithm and (2) how fast the algorithm can sandwich the true solution between the upper and lower bounds, using artificial datasets. [sent-364, score-0.205]

80 , the submodular maximization problem (1) with respect to the nondecreasing submodular function f(S) = Σ_{i=1}^{m} max_{j∈S} cij, where C = (cij) is an m × n nonnegative matrix and V = {1, · · · , n}. [sent-367, score-1.375]
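
A small sketch of this facility-location-type objective with random data in place of the paper's (so the printed numbers are purely illustrative; greedy_max is the sketch from earlier):

import numpy as np

rng = np.random.default_rng(0)
m, n, k = 20, 10, 3
C = rng.random((m, n))                  # m x n nonnegative matrix

def f(S):
    """f(S) = sum_i max_{j in S} c_ij: nondecreasing and submodular."""
    if not S:
        return 0.0
    return float(C[:, sorted(S)].max(axis=1).sum())

S0 = greedy_max(f, set(range(n)), k)    # greedy initial subset S0
print(sorted(S0), round(f(S0), 3))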

81 This could be because the cutting-plane algorithm searches feasible subsets efficiently, eliminating worse ones with the submodularity cuts. [sent-374, score-0.577]

82 The lower bound is updated rarely and converges to the optimal solution quickly while the upper bound decreases gradually. [sent-378, score-0.186]

83 2 Sensor placements. Our first example with real data is the sensor placement problem, where we try to select sensor locations to minimize the variance of observations. [sent-380, score-0.411]

84 The dataset we used here is temperature measurements at discretized finite locations V obtained using the NIMS sensor node deployed at a lake near the University of California, Merced [9, 12] (|V | = 86). [sent-381, score-0.147]

85 5 show the computation time of our algorithm, and the accuracy improvement of our calculated solution over that of the greedy algorithm (%), respectively, for ε = 0. [sent-385, score-0.168]

86 3 Feature selection in text classification. Our second real test case is feature selection in document classification using the Reuters-21578 dataset. [sent-412, score-0.109]

87 We applied the greedy and submodularity cuts algorithms to the training set that includes 7,770 documents with 5,180 words (features) and 90 categories, where we used the information gain as a criterion [4]. [sent-413, score-0.725]

88 The submodularity cuts algorithm selected several words different from those selected by the greedy algorithm. [sent-417, score-0.741]

89 6 Conclusions In this paper, we presented a cutting-plane algorithm for submodular maximization problems, which can be implemented as an iterative binary-integer linear programming procedure. [sent-419, score-0.758]

90 We derived a cutting-plane procedure, called the submodularity cut, based on the submodularity of a set function through the Lovász extension, and showed that this cut assures that the algorithm converges to the optimum in a finite number of iterations. [sent-420, score-1.553]

91 Moreover, we presented a way to evaluate an upper bound of the optimal value with the help of Nemhauser and Wolsey [17], which enables us to ensure the accuracy of the current best solution and to calculate an intended ε-optimal solution for a predetermined ε > 0. [sent-421, score-0.272]

92 Our new algorithm compared favorably in computation time against the existing algorithm on artificial datasets, and also showed improved performance on the real-world applications of sensor placements and feature selection in text classification. [sent-422, score-0.34]

93 The submodular maximization problem treated in this paper covers a broad range of applications in machine learning. [sent-423, score-0.664]

94 In future work, we will develop frameworks with ε-optimality guarantees for more general problem settings, such as knapsack constraints [21] and submodular functions that are not nondecreasing. [sent-424, score-0.727]

95 This will make the submodularity cuts framework applicable to a still wider variety of machine learning problems. [sent-425, score-0.611]

96 Near-optimal sensor placements in Gaussian processes: Theory, efficient algorithms and empirical studies. [sent-515, score-0.193]

97 Maximizing a submodular function by integer programming: Polyhedral results for the quadratic case. [sent-522, score-0.584]

98 Maximizing submodular set functions: formulations and analysis of algorithms. [sent-541, score-0.546]

99 An analysis of approximations for maximizing submodular set functions – I. [sent-559, score-0.606]

100 A note on maximizing a submodular set function subject to a knapsack constraint. [sent-567, score-0.623]


similar papers computed by tfidf model

tfidf for this paper:

wordName wordTfidf (topN-words)

[('submodular', 0.546), ('submodularity', 0.453), ('lov', 0.229), ('cut', 0.208), ('sz', 0.181), ('nemhauser', 0.178), ('cuts', 0.158), ('nondecreasing', 0.141), ('wolsey', 0.123), ('dn', 0.123), ('sensor', 0.122), ('cutting', 0.109), ('polytope', 0.106), ('rn', 0.102), ('bilp', 0.101), ('mip', 0.101), ('maximization', 0.094), ('pi', 0.077), ('plane', 0.074), ('placements', 0.071), ('vertex', 0.069), ('greedy', 0.065), ('polyhedral', 0.065), ('solution', 0.063), ('extension', 0.063), ('placement', 0.057), ('feasible', 0.057), ('directions', 0.056), ('stop', 0.055), ('dl', 0.055), ('si', 0.053), ('programming', 0.052), ('ej', 0.05), ('bj', 0.045), ('cardinality', 0.044), ('convex', 0.043), ('selection', 0.042), ('aj', 0.04), ('alse', 0.04), ('kawahara', 0.04), ('knapsack', 0.04), ('nagano', 0.04), ('popt', 0.04), ('algorithm', 0.04), ('upper', 0.038), ('integer', 0.038), ('discrete', 0.037), ('maximizing', 0.037), ('continuous', 0.037), ('japan', 0.035), ('assures', 0.035), ('halfspace', 0.035), ('hyperplane', 0.034), ('bilmes', 0.032), ('subset', 0.031), ('industrial', 0.031), ('tl', 0.03), ('operational', 0.03), ('characteristic', 0.03), ('et', 0.03), ('guaranteed', 0.03), ('optimality', 0.03), ('max', 0.029), ('bound', 0.029), ('correspondence', 0.029), ('jn', 0.029), ('krause', 0.029), ('predetermined', 0.029), ('ei', 0.028), ('convexity', 0.028), ('linearly', 0.028), ('fs', 0.027), ('tp', 0.027), ('optimal', 0.027), ('subsets', 0.027), ('neighbor', 0.026), ('concave', 0.026), ('yj', 0.026), ('iterative', 0.026), ('xl', 0.026), ('text', 0.025), ('combinatorial', 0.025), ('locations', 0.025), ('words', 0.025), ('constraint', 0.025), ('cij', 0.024), ('tn', 0.024), ('exact', 0.024), ('gain', 0.024), ('covers', 0.024), ('false', 0.023), ('lemma', 0.023), ('functions', 0.023), ('iteration', 0.023), ('pm', 0.023), ('checking', 0.023), ('iterations', 0.023), ('classi', 0.023), ('current', 0.023), ('successively', 0.022)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 1.0000001 239 nips-2009-Submodularity Cuts and Applications

Author: Yoshinobu Kawahara, Kiyohito Nagano, Koji Tsuda, Jeff A. Bilmes

Abstract: Several key problems in machine learning, such as feature selection and active learning, can be formulated as submodular set function maximization. We present herein a novel algorithm for maximizing a submodular set function under a cardinality constraint — the algorithm is based on a cutting-plane method and is implemented as an iterative small-scale binary-integer linear programming procedure. It is well known that this problem is NP-hard, and the approximation factor achieved by the greedy algorithm is the theoretical limit for polynomial time. As for (non-polynomial time) exact algorithms that perform reasonably in practice, there has been very little work in the literature, although the problem is quite important for many applications. Our algorithm is guaranteed to find the exact solution in finitely many iterations, and it converges fast in practice due to the efficiency of the cutting-plane mechanism. Moreover, we also provide a method that produces successively decreasing upper bounds on the optimal value, while our algorithm provides successively increasing lower bounds. Thus, the accuracy of the current solution can be estimated at any point, and the algorithm can be stopped early once a desired degree of tolerance is met. We evaluate our algorithm on sensor placement and feature selection applications, showing good performance. 1

2 0.35755974 45 nips-2009-Beyond Convexity: Online Submodular Minimization

Author: Elad Hazan, Satyen Kale

Abstract: We consider an online decision problem over a discrete space in which the loss function is submodular. We give algorithms which are computationally efficient and are Hannan-consistent in both the full information and bandit settings. 1

3 0.25570643 181 nips-2009-Online Learning of Assignments

Author: Matthew Streeter, Daniel Golovin, Andreas Krause

Abstract: Which ads should we display in sponsored search in order to maximize our revenue? How should we dynamically rank information sources to maximize the value of the ranking? These applications exhibit strong diminishing returns: Redundancy decreases the marginal utility of each ad or information source. We show that these and other problems can be formalized as repeatedly selecting an assignment of items to positions to maximize a sequence of monotone submodular functions that arrive one by one. We present an efficient algorithm for this general problem and analyze it in the no-regret model. Our algorithm possesses strong theoretical guarantees, such as a performance ratio that converges to the optimal constant of 1 − 1/e. We empirically evaluate our algorithm on two real-world online optimization problems on the web: ad allocation with submodular utilities, and dynamically ranking blogs to detect information cascades. 1

4 0.17920706 122 nips-2009-Label Selection on Graphs

Author: Andrew Guillory, Jeff A. Bilmes

Abstract: We investigate methods for selecting sets of labeled vertices for use in predicting the labels of vertices on a graph. We specifically study methods which choose a single batch of labeled vertices (i.e. offline, non sequential methods). In this setting, we find common graph smoothness assumptions directly motivate simple label selection methods with interesting theoretical guarantees. These methods bound prediction error in terms of the smoothness of the true labels with respect to the graph. Some of these bounds give new motivations for previously proposed algorithms, and some suggest new algorithms which we evaluate. We show improved performance over baseline methods on several real world data sets. 1

5 0.078137912 31 nips-2009-An LP View of the M-best MAP problem

Author: Menachem Fromer, Amir Globerson

Abstract: We consider the problem of finding the M assignments with maximum probability in a probabilistic graphical model. We show how this problem can be formulated as a linear program (LP) on a particular polytope. We prove that, for tree graphs (and junction trees in general), this polytope has a particularly simple form and differs from the marginal polytope in a single inequality constraint. We use this characterization to provide an approximation scheme for non-tree graphs, by using the set of spanning trees over such graphs. The method we present puts the M-best inference problem in the context of LP relaxations, which have recently received considerable attention and have proven useful in solving difficult inference problems. We show empirically that our method often finds the provably exact M best configurations for problems of high tree-width. 1

6 0.059461378 57 nips-2009-Conditional Random Fields with High-Order Features for Sequence Labeling

7 0.056444552 141 nips-2009-Local Rules for Global MAP: When Do They Work ?

8 0.055376738 166 nips-2009-Noisy Generalized Binary Search

9 0.05472235 207 nips-2009-Robust Nonparametric Regression with Metric-Space Valued Output

10 0.051187899 82 nips-2009-Entropic Graph Regularization in Non-Parametric Semi-Supervised Classification

11 0.049765099 187 nips-2009-Particle-based Variational Inference for Continuous Systems

12 0.048874106 144 nips-2009-Lower bounds on minimax rates for nonparametric regression with additive sparsity and smoothness

13 0.048186716 30 nips-2009-An Integer Projected Fixed Point Method for Graph Matching and MAP Inference

14 0.046936415 224 nips-2009-Sparse and Locally Constant Gaussian Graphical Models

15 0.046035424 61 nips-2009-Convex Relaxation of Mixture Regression with Efficient Algorithms

16 0.045369584 221 nips-2009-Solving Stochastic Games

17 0.044902217 33 nips-2009-Analysis of SVM with Indefinite Kernels

18 0.042098045 118 nips-2009-Kernel Choice and Classifiability for RKHS Embeddings of Probability Distributions

19 0.041672539 238 nips-2009-Submanifold density estimation

20 0.040791083 222 nips-2009-Sparse Estimation Using General Likelihoods and Non-Factorial Priors


similar papers computed by lsi model

lsi for this paper:

topicId topicWeight

[(0, -0.155), (1, 0.145), (2, 0.052), (3, -0.054), (4, 0.038), (5, 0.079), (6, -0.061), (7, 0.021), (8, -0.08), (9, 0.014), (10, -0.196), (11, -0.01), (12, 0.06), (13, 0.073), (14, -0.113), (15, 0.372), (16, 0.085), (17, -0.047), (18, -0.111), (19, -0.016), (20, -0.057), (21, 0.036), (22, -0.047), (23, -0.112), (24, -0.052), (25, -0.065), (26, -0.05), (27, 0.007), (28, -0.14), (29, 0.153), (30, 0.14), (31, 0.09), (32, -0.044), (33, 0.109), (34, 0.024), (35, 0.019), (36, 0.066), (37, 0.036), (38, 0.111), (39, -0.028), (40, -0.109), (41, 0.007), (42, -0.115), (43, 0.119), (44, 0.016), (45, -0.009), (46, -0.127), (47, 0.05), (48, 0.009), (49, -0.057)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.92408377 239 nips-2009-Submodularity Cuts and Applications

Author: Yoshinobu Kawahara, Kiyohito Nagano, Koji Tsuda, Jeff A. Bilmes

Abstract: Several key problems in machine learning, such as feature selection and active learning, can be formulated as submodular set function maximization. We present herein a novel algorithm for maximizing a submodular set function under a cardinality constraint: the algorithm is based on a cutting-plane method and is implemented as an iterative small-scale binary-integer linear programming procedure. It is well known that this problem is NP-hard, and the approximation factor achieved by the greedy algorithm is the theoretical limit for polynomial time. As for (non-polynomial-time) exact algorithms that perform reasonably in practice, there has been very little work in the literature, although the problem is quite important for many applications. Our algorithm is guaranteed to find the exact solution in finitely many iterations, and it converges fast in practice due to the efficiency of the cutting-plane mechanism. Moreover, we also provide a method that produces successively decreasing upper bounds on the optimal solution, while our algorithm provides successively increasing lower bounds. Thus, the accuracy of the current solution can be estimated at any point, and the algorithm can be stopped early once a desired degree of tolerance is met. We evaluate our algorithm on sensor placement and feature selection applications, showing good performance.
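For concreteness, the following is a minimal Python sketch of the greedy baseline whose (1 − 1/e) factor the abstract cites; it assumes a user-supplied set function f and is not the paper's cutting-plane procedure.

def greedy_submodular_max(f, ground_set, k):
    # Greedy (1 - 1/e)-approximate maximization of a monotone submodular
    # set function f under the cardinality constraint |S| <= k.
    # A sketch of the classical baseline, not the paper's exact algorithm.
    S = set()
    for _ in range(min(k, len(ground_set))):
        # add the element with the largest marginal gain f(S + e) - f(S)
        best = max((e for e in ground_set if e not in S),
                   key=lambda e: f(S | {e}) - f(S))
        S.add(best)
    return S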

2 0.77823478 181 nips-2009-Online Learning of Assignments

Author: Matthew Streeter, Daniel Golovin, Andreas Krause

Abstract: Which ads should we display in sponsored search in order to maximize our revenue? How should we dynamically rank information sources to maximize the value of the ranking? These applications exhibit strong diminishing returns: Redundancy decreases the marginal utility of each ad or information source. We show that these and other problems can be formalized as repeatedly selecting an assignment of items to positions to maximize a sequence of monotone submodular functions that arrive one by one. We present an efficient algorithm for this general problem and analyze it in the no-regret model. Our algorithm possesses strong theoretical guarantees, such as a performance ratio that converges to the optimal constant of 1 − 1/e. We empirically evaluate our algorithm on two real-world online optimization problems on the web: ad allocation with submodular utilities, and dynamically ranking blogs to detect information cascades.
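To make the diminishing-returns structure concrete, a coverage utility (each ad reaches a set of users) is monotone submodular; the dict cover and the helper below are hypothetical, for illustration only.

def coverage_utility(cover):
    # f(S) = number of users reached by at least one item in S.
    # Coverage functions are monotone submodular: adding an ad helps
    # less once the slate already reaches most of its users.
    def f(S):
        covered = set()
        for e in S:
            covered |= cover[e]
        return len(covered)
    return f

cover = {'a': {1, 2, 3}, 'b': {2, 3}, 'c': {3, 4}}   # hypothetical reach sets
f = coverage_utility(cover)
assert f({'a', 'b'}) - f({'a'}) <= f({'b'}) - f(set())   # diminishing returns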

3 0.70060921 45 nips-2009-Beyond Convexity: Online Submodular Minimization

Author: Elad Hazan, Satyen Kale

Abstract: We consider an online decision problem over a discrete space in which the loss function is submodular. We give algorithms which are computationally efficient and are Hannan-consistent in both the full information and bandit settings.
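A standard continuous handle on submodular minimization, and a plausible ingredient for such online algorithms, is a subgradient of the Lovász extension obtained from Edmonds' greedy ordering; the sketch below shows that classical construction and is not claimed to be the authors' exact update.

import numpy as np

def lovasz_subgradient(f, x):
    # Subgradient of the Lovász extension of a submodular set function f
    # (defined on subsets of {0, ..., n-1}) at a point x in [0, 1]^n:
    # sort coordinates in decreasing order of x and take marginal gains.
    x = np.asarray(x, dtype=float)
    order = np.argsort(-x)
    g = np.zeros(len(x))
    S, prev = set(), f(set())
    for i in order:
        S.add(int(i))
        cur = f(S)
        g[i] = cur - prev
        prev = cur
    return g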

4 0.51070637 122 nips-2009-Label Selection on Graphs

Author: Andrew Guillory, Jeff A. Bilmes

Abstract: We investigate methods for selecting sets of labeled vertices for use in predicting the labels of vertices on a graph. We specifically study methods which choose a single batch of labeled vertices (i.e., offline, non-sequential methods). In this setting, we find that common graph smoothness assumptions directly motivate simple label selection methods with interesting theoretical guarantees. These methods bound prediction error in terms of the smoothness of the true labels with respect to the graph. Some of these bounds give new motivations for previously proposed algorithms, and some suggest new algorithms which we evaluate. We show improved performance over baseline methods on several real-world data sets.
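The smoothness quantity behind such guarantees is directly computable; as a sketch (with illustrative helper names, not the paper's code), a labeling y is smooth on a weighted graph W when y^T L y is small, where L = D - W is the combinatorial Laplacian.

import numpy as np

def label_smoothness(W, y):
    # y^T L y with L = D - W; small values mean the labels y change
    # little across heavily weighted edges of the graph.
    W = np.asarray(W, dtype=float)
    y = np.asarray(y, dtype=float)
    L = np.diag(W.sum(axis=1)) - W
    return float(y @ L @ y)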

5 0.36869544 234 nips-2009-Streaming k-means approximation

Author: Nir Ailon, Ragesh Jaiswal, Claire Monteleoni

Abstract: We provide a clustering algorithm that approximately optimizes the k-means objective in the one-pass streaming setting. We make no assumptions about the data, and our algorithm is very lightweight in terms of memory and computation. This setting is applicable to unsupervised learning on massive data sets or on resource-constrained devices. The two main ingredients of our theoretical work are: a derivation of an extremely simple pseudo-approximation batch algorithm for k-means (based on the recent k-means++), in which the algorithm is allowed to output more than k centers, and a streaming clustering algorithm in which batch clustering algorithms are performed on small inputs (fitting in memory) and combined in a hierarchical manner. Empirical evaluations on real and simulated data reveal the practical utility of our method.
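As a sketch of the batch ingredient the abstract mentions, k-means++ seeding draws each new center with probability proportional to its squared distance from the centers chosen so far; the streaming, hierarchical composition is omitted here.

import numpy as np

def kmeanspp_seed(X, k, rng=None):
    # k-means++ seeding on a data matrix X (rows are points).
    if rng is None:
        rng = np.random.default_rng(0)
    centers = [X[rng.integers(len(X))]]
    for _ in range(k - 1):
        # squared distance of every point to its nearest chosen center
        d2 = np.min([((X - c) ** 2).sum(axis=1) for c in centers], axis=0)
        centers.append(X[rng.choice(len(X), p=d2 / d2.sum())])
    return np.array(centers)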

6 0.33863434 180 nips-2009-On the Convergence of the Concave-Convex Procedure

7 0.32001773 48 nips-2009-Bootstrapping from Game Tree Search

8 0.31261086 30 nips-2009-An Integer Projected Fixed Point Method for Graph Matching and MAP Inference

9 0.29421639 160 nips-2009-Multiple Incremental Decremental Learning of Support Vector Machines

10 0.29359135 166 nips-2009-Noisy Generalized Binary Search

11 0.28315142 82 nips-2009-Entropic Graph Regularization in Non-Parametric Semi-Supervised Classification

12 0.2767981 220 nips-2009-Slow Learners are Fast

13 0.27512145 73 nips-2009-Dual Averaging Method for Regularized Stochastic Learning and Online Optimization

14 0.26928303 141 nips-2009-Local Rules for Global MAP: When Do They Work ?

15 0.26511854 7 nips-2009-A Data-Driven Approach to Modeling Choice

16 0.25912783 24 nips-2009-Adapting to the Shifting Intent of Search Queries

17 0.25719643 238 nips-2009-Submanifold density estimation

18 0.25486854 79 nips-2009-Efficient Recovery of Jointly Sparse Vectors

19 0.2461713 10 nips-2009-A Gaussian Tree Approximation for Integer Least-Squares

20 0.24312922 105 nips-2009-Grouped Orthogonal Matching Pursuit for Variable Selection and Prediction


similar papers computed by lda model

lda for this paper:

topicId topicWeight

[(7, 0.011), (24, 0.126), (25, 0.048), (35, 0.054), (36, 0.12), (39, 0.058), (58, 0.073), (61, 0.023), (71, 0.046), (81, 0.019), (86, 0.071), (91, 0.02), (95, 0.245)]

similar papers list:

simIndex simValue paperId paperTitle

1 0.82922244 91 nips-2009-Fast, smooth and adaptive regression in metric spaces

Author: Samory Kpotufe

Abstract: It was recently shown that certain nonparametric regressors can escape the curse of dimensionality when the intrinsic dimension of data is low ([1, 2]). We prove some stronger results in more general settings. In particular, we consider a regressor which, by combining aspects of both tree-based regression and kernel regression, adapts to intrinsic dimension, operates on general metrics, yields a smooth function, and evaluates in time O(log n). We derive a tight convergence rate of the form n^{-2/(2+d)}, where d is the Assouad dimension of the input space.
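For orientation, a plain kernel smoother over an arbitrary metric is sketched below; the paper's estimator additionally adapts to the intrinsic dimension and evaluates in O(log n) time, neither of which this sketch attempts (dist, h, and the helper name are illustrative assumptions).

import numpy as np

def nw_regress(dist, X, y, x0, h):
    # Nadaraya-Watson estimate at query x0 with bandwidth h, using a
    # user-supplied metric dist(a, b) and Epanechnikov-style weights.
    y = np.asarray(y, dtype=float)
    w = np.array([max(0.0, 1.0 - (dist(x, x0) / h) ** 2) for x in X])
    return float(w @ y / w.sum()) if w.sum() > 0 else float(y.mean())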

same-paper 2 0.80728918 239 nips-2009-Submodularity Cuts and Applications

Author: Yoshinobu Kawahara, Kiyohito Nagano, Koji Tsuda, Jeff A. Bilmes

Abstract: Several key problems in machine learning, such as feature selection and active learning, can be formulated as submodular set function maximization. We present herein a novel algorithm for maximizing a submodular set function under a cardinality constraint: the algorithm is based on a cutting-plane method and is implemented as an iterative small-scale binary-integer linear programming procedure. It is well known that this problem is NP-hard, and the approximation factor achieved by the greedy algorithm is the theoretical limit for polynomial time. As for (non-polynomial-time) exact algorithms that perform reasonably in practice, there has been very little work in the literature, although the problem is quite important for many applications. Our algorithm is guaranteed to find the exact solution in finitely many iterations, and it converges fast in practice due to the efficiency of the cutting-plane mechanism. Moreover, we also provide a method that produces successively decreasing upper bounds on the optimal solution, while our algorithm provides successively increasing lower bounds. Thus, the accuracy of the current solution can be estimated at any point, and the algorithm can be stopped early once a desired degree of tolerance is met. We evaluate our algorithm on sensor placement and feature selection applications, showing good performance.
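The early-stopping rule described above reduces to a simple gap test on the two bound sequences; a minimal sketch of that criterion (the tolerance handling is an assumption, not the paper's exact test):

def should_stop(lower_bound, upper_bound, tol):
    # Stop once the successively increasing lower bounds and successively
    # decreasing upper bounds certify the current solution to within tol.
    gap = upper_bound - lower_bound
    return gap <= tol * max(abs(upper_bound), 1e-12)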

3 0.64681476 45 nips-2009-Beyond Convexity: Online Submodular Minimization

Author: Elad Hazan, Satyen Kale

Abstract: We consider an online decision problem over a discrete space in which the loss function is submodular. We give algorithms which are computationally efficient and are Hannan-consistent in both the full information and bandit settings.

4 0.64548647 116 nips-2009-Information-theoretic lower bounds on the oracle complexity of convex optimization

Author: Alekh Agarwal, Martin J. Wainwright, Peter L. Bartlett, Pradeep K. Ravikumar

Abstract: Despite a large literature on upper bounds on the complexity of convex optimization, relatively little attention has been paid to the fundamental hardness of these problems. Given the extensive use of convex optimization in machine learning and statistics, gaining an understanding of these complexity-theoretic issues is important. In this paper, we study the complexity of stochastic convex optimization in an oracle model of computation. We improve upon known results and obtain tight minimax complexity estimates for various function classes. We also discuss implications of these results for understanding the inherent complexity of large-scale learning and estimation problems.
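For reference, the classical minimax rates that such oracle lower bounds formalize can be stated as follows (standard results under Lipschitz and strong-convexity assumptions; the paper's function classes and constants differ):

\[
\mathbb{E}\, f(\hat{x}_T) - \min_x f(x) \;=\; \Omega\!\left(\frac{1}{\sqrt{T}}\right) \ \text{(convex, Lipschitz)}, \qquad \Omega\!\left(\frac{1}{T}\right) \ \text{(strongly convex)},
\]

where T is the number of stochastic first-order oracle queries.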

5 0.64043862 122 nips-2009-Label Selection on Graphs

Author: Andrew Guillory, Jeff A. Bilmes

Abstract: We investigate methods for selecting sets of labeled vertices for use in predicting the labels of vertices on a graph. We specifically study methods which choose a single batch of labeled vertices (i.e., offline, non-sequential methods). In this setting, we find that common graph smoothness assumptions directly motivate simple label selection methods with interesting theoretical guarantees. These methods bound prediction error in terms of the smoothness of the true labels with respect to the graph. Some of these bounds give new motivations for previously proposed algorithms, and some suggest new algorithms which we evaluate. We show improved performance over baseline methods on several real-world data sets.

6 0.63926548 15 nips-2009-A Rate Distortion Approach for Semi-Supervised Conditional Random Fields

7 0.6387409 12 nips-2009-A Generalized Natural Actor-Critic Algorithm

8 0.63639313 72 nips-2009-Distribution Matching for Transduction

9 0.63367683 221 nips-2009-Solving Stochastic Games

10 0.63350445 169 nips-2009-Nonlinear Learning using Local Coordinate Coding

11 0.63344985 220 nips-2009-Slow Learners are Fast

12 0.63278413 129 nips-2009-Learning a Small Mixture of Trees

13 0.63228536 118 nips-2009-Kernel Choice and Classifiability for RKHS Embeddings of Probability Distributions

14 0.63214594 161 nips-2009-Nash Equilibria of Static Prediction Games

15 0.63080966 207 nips-2009-Robust Nonparametric Regression with Metric-Space Valued Output

16 0.63029796 71 nips-2009-Distribution-Calibrated Hierarchical Classification

17 0.62879765 94 nips-2009-Fast Learning from Non-i.i.d. Observations

18 0.62597769 14 nips-2009-A Parameter-free Hedging Algorithm

19 0.62528944 180 nips-2009-On the Convergence of the Concave-Convex Procedure

20 0.62459207 250 nips-2009-Training Factor Graphs with Reinforcement Learning for Efficient MAP Inference