nips nips2007 nips2007-120 knowledge-graph by maker-knowledge-mining

120 nips-2007-Linear programming analysis of loopy belief propagation for weighted matching


Source: pdf

Author: Sujay Sanghavi, Dmitry Malioutov, Alan S. Willsky

Abstract: Loopy belief propagation has been employed in a wide variety of applications with great empirical success, but it comes with few theoretical guarantees. In this paper we investigate the use of the max-product form of belief propagation for weighted matching problems on general graphs. We show that max-product converges to the correct answer if the linear programming (LP) relaxation of the weighted matching problem is tight and does not converge if the LP relaxation is loose. This provides an exact characterization of max-product performance and reveals connections to the widely used optimization technique of LP relaxation. In addition, we demonstrate that max-product is effective in solving practical weighted matching problems in a distributed fashion by applying it to the problem of self-organization in sensor networks. 1

Reference: text


Summary: the most important sentenses genereted by tfidf model

sentIndex sentText sentNum sentScore

1 In this paper we investigate the use of the max-product form of belief propagation for weighted matching problems on general graphs. [sent-5, score-0.495]

2 We show that max-product converges to the correct answer if the linear programming (LP) relaxation of the weighted matching problem is tight and does not converge if the LP relaxation is loose. [sent-6, score-0.943]

3 In addition, we demonstrate that max-product is effective in solving practical weighted matching problems in a distributed fashion by applying it to the problem of self-organization in sensor networks. [sent-8, score-0.46]

4 In this paper we study the application of the widely used max-product form of LBP (or simply max-product (MP) algorithm), to the weighted matching problem. [sent-14, score-0.396]

5 Given a graph G = (V, E) with non-negative weights we on its edges e ∈ E, the weighted matching problem is to find the heaviest set of mutually disjoint edges (i. [sent-15, score-0.933]

6 a set of edges such that no two edges share a node). [sent-17, score-0.386]

7 Weighted matching is a classic problem that has played a central role in computer science and combinatorial optimization, with applications in resource allocation, scheduling in communications networks [10], and machine learning [5]. [sent-18, score-0.313]

8 Weighted matching thus naturally suggests itself as a good candidate for the study of convergence and correctness of algorithms like max-product. [sent-20, score-0.288]

9 Weighted matching can be naturally formulated as an integer program. [sent-21, score-0.318]

10 The technique of linear programming (LP) relaxation involves replacing the integer constraints with linear inequality constraints. [sent-22, score-0.248]

11 This relaxation is tight if the resulting linear program has an integral optimum. [sent-23, score-0.311]

12 LP relaxation is not always tight for the weighted matching problem. [sent-24, score-0.653]

13 The primary contribution of this paper is an exact characterization of max-product performance for the matching problem, which also establishes a link to LP relaxation. [sent-25, score-0.336]

14 We show that (i) if the LP relaxation is tight then max-product 1 converges to the correct answer, and (ii) if the LP relaxation is not tight then max-product does not converge. [sent-26, score-0.557]

15 Weighted matching is a special case of the weighted b-matching problem, where there can be up to bi edges touching node i (setting all bi = 1 reduces to simple matching). [sent-27, score-0.871]

16 However, in the interests of clarity, we provide proofs only for the conceptually easier case of simple matchings where bi = 1. [sent-29, score-0.199]

17 al [2] established that max-product converges for weighted matching in bipartite graphs, and [5] extended this result to b-matching. [sent-32, score-0.478]

18 These results are implied by our result1 , as for bipartite graphs, the LP relaxation is always tight. [sent-33, score-0.232]

19 In Section 2 we set up the weighted matching problem and its LP relaxation. [sent-34, score-0.396]

20 We describe the maxproduct algorithm for weighted matching in Section 3. [sent-35, score-0.396]

21 2 Weighted Matching and its LP Relaxation Suppose that we are given a graph G with weights we , we also positive integers bi for each node i ∈ V . [sent-38, score-0.271]

22 A b-matching is any set of edges such that the total number of edges in the set incident to any node i is at most bi . [sent-39, score-0.618]

23 Weighted b-matching can be naturally formulated as the following integer program (setting all bi = 1 gives an integer program for simple matching): IP : max we xe , e∈E xe ≤ bi for all i ∈ V, s. [sent-41, score-1.168]

24 e∈Ei xe ∈ {0, 1} for all e ∈ E Here Ei is the set of edges incident to node i. [sent-43, score-0.753]

25 The linear programming (LP) relaxation of the above problem is to replace the constraint xe ∈ {0, 1} with the constraint xe ∈ [0, 1], for each e ∈ E. [sent-44, score-1.074]

26 The LP relaxation is said to be tight if the unique optimum is integral (i. [sent-47, score-0.317]

27 Note that the LP relaxation is not tight in general. [sent-50, score-0.257]

28 Apart from the bipartite case, the tightness of LP relaxation is a function of both the weights and the graph structure2 . [sent-51, score-0.321]

29 Associate a binary variable xe ∈ {0, 1} with each edge e ∈ E, and consider the following probability distribution: p(x) ∝ ψ(xEi ) i∈V exp(we xe ), (1) e∈E which contains a factor ψ(xEi ) for each node i ∈ V , the value of which is ψ(xEi ) = 1 if e∈Ei xe ≤ bi , and 0 otherwise. [sent-54, score-1.61]

30 Note that we use i to refer both to the nodes of G and factors of p, and e to refer both to the edges of G and variables of p. [sent-55, score-0.268]

31 The factor ψ(xEi ) enforces the constraint that at most one edge incident to node i can be assigned the value “1”. [sent-56, score-0.276]

32 It is easy to see that, for any x, p(x) = exp( e we xe ) if the set of edges {e|xe = 1} constitute a b-matching in G, and p(x) = 0 otherwise. [sent-57, score-0.621]

33 If all we = 1, LP relaxation is loose: setting each xe = 1 is the optimum. [sent-61, score-0.621]

34 However, if instead the weights are {1, 1, 3}, then LP relaxation is tight. [sent-62, score-0.247]

35 However, for our problem the special form of ψ makes mt [1] this step easy even for large degrees: for each edge e ∈ Ei compute ae = max 1, me→i [0] . [sent-72, score-0.469]

36 Then, mt+1 [1] = i→e mt →i [0] e ae e ∈Ge , e ∈Ei −e mt+1 [0] = i→e mt →i [0] e ae e ∈Fe e ∈Ei −e These updates are linear in the degree |Ei |. [sent-74, score-0.702]

37 The Computation Tree for Weighted Matching Our proofs rely on the computation tree interpretation [12, 11] of loopy max-product beliefs, which we now describe for the special case of simple matching (bi = 1). [sent-75, score-0.449]

38 Recall the variables of p correspond to edges in G, and nodes in G correspond to factors. [sent-76, score-0.244]

39 For any edge e, the computation tree Te (1) at time 1 is just the edge e, the root of the tree. [sent-77, score-0.344]

40 The weights of the edges in Te are copied from the corresponding edges in G. [sent-80, score-0.44]

41 Suppose M is a matching on the original graph G, and Te is a computation tree. [sent-81, score-0.323]

42 Then, the image of M in Te is the set of edges in Te whose corresponding copy in G is a member of M . [sent-82, score-0.229]

43 G appears on the left, the numbers are the edge weights and the letters are node labels. [sent-85, score-0.254]

44 The max-weight matching M ∗ = {(a, b), (c, d)} is depicted in bold. [sent-86, score-0.288]

45 In the center plot we show T(a,b) (4), the computation tree at time t = 4 rooted at edge (a, b). [sent-87, score-0.207]

46 The bold edges in the middle tree depict the matching which is the image of M ∗ onto T(a,b) (4). [sent-89, score-0.593]

47 6, and it is easy to see that any matching on T(a,b) (4) that includes the root edge will have weight at most 6. [sent-91, score-0.454]

48 On the right we depict M , the max-weight matching on the tree T(a,b) (4). [sent-93, score-0.372]

49 In this example we see that even though (a, b) is in the unique optimal matching in G, the beliefs at the root are such that n4 [0] > n4 [1]. [sent-96, score-0.436]

50 Note also that the dotted edges are not an image of any matching in the (a,b) (a,b) original graph G. [sent-97, score-0.516]

51 Theorem 1 Let G = (V, E) be a graph with nonnegative real weights we on the edges e ∈ E. [sent-102, score-0.282]

52 Assume the linear programming relaxation LP has a unique optimal solution. [sent-103, score-0.277]

53 The above theorem implies that LP relaxation and Max-product will both succeed, or both fail, on the same problem instances, and thus are equally powerful for the weighted matching problem. [sent-112, score-0.589]

54 wij ≤ zi + zj for all (i, j) ∈ E, zi ≥ 0 for all i ∈ V The following lemma states that the standard linear programming properties of complimentary slackness hold in the strict sense for the weighted matching problem (this is a special case of [3, ex. [sent-120, score-0.75]

55 Lemma 1 (strict complimentary slackness) If the solution to LPis unique and integral, and M ∗ is the optimal matching, then there exists an optimal dual solution z to DUAL such that 1. [sent-123, score-0.178]

56 There exists > 0 such that for all (i, j) ∈ M ∗ we have wij ≤ zi + zj − / 3. [sent-125, score-0.193]

57 if no edge in M ∗ is incident on node i, then zi = 0 4. [sent-126, score-0.323]

58 zi ≤ maxe we for all i Let t ≥ 2wmax , where wmax = maxe we is the weight of the heaviest edge, and is as in part 2 of Lemma 1 above. [sent-127, score-0.176]

59 Suppose now that there exists an edge e ∈ M ∗ for which the belief at time t is / incorrect, i. [sent-128, score-0.197]

60 e e 4 Recall that nt [1] > nt [0] means that there is a matching M in Te (t) such that (a) the root e ∈ M , and e e ∗ (b) M is a max-weight matching on Te (t). [sent-131, score-0.792]

61 Note that this will be a path, ∗ because M and MT are matchings and so any node in Te (t) can have at most one edge adjacent to it in each of the two matchings. [sent-135, score-0.271]

62 It is easy to see that this path alternates between edges in M and MT . [sent-139, score-0.289]

63 Then, from parts 1 and 3 of Lemma 1 it follows that ∗ w(P ∩ MT ) = wij = zi + zj = zi . [sent-143, score-0.236]

64 ∗ (i,j)∈P ∩MT i∈P ∗ (i,j)∈P ∩MT On the other hand, from part 2 of Lemma 1 it follows that w(P ∩ M ) = wij ≤ (i,j)∈P ∩M zi + zj − zi = (i,j)∈P ∩M − |P ∩ M |. [sent-144, score-0.236]

65 We now show that M cannot be an optimal matching in Te (t). [sent-148, score-0.315]

66 We do so by “flipping” the edges in ∗ P to obtain a matching with higher weight. [sent-149, score-0.481]

67 Specifically, let M = M − (P ∩ M ) + (P ∩ MT ) be the matching containing all edges in M except the ones in P , which are replaced by the edges in ∗ P ∩ MT . [sent-150, score-0.674]

68 It is easy to see that M is a matching in Te (t), and that w(M ) > w(M ). [sent-151, score-0.288]

69 This contradicts the choice of M , and shows that for e ∈ M ∗ the beliefs satisfy nt [1] ≤ nt [0] for all t large enough. [sent-152, score-0.209]

70 For the weighted matching problem this local optimality implies global optimality, because the symmetric difference of any two matchings is a union of disjoint paths and cycles. [sent-161, score-0.56]

71 We first provide a combinatorial characterization of when the LP relaxation is loose. [sent-164, score-0.218]

72 An alternating path in G is a path in which every alternate edge is in M ∗ , and each node appears at most once. [sent-166, score-0.459]

73 A blossom is an alternating path that wraps onto itself, such that the result is a single odd cycle C and a path R leading out of that cycle3 . [sent-167, score-0.564]

74 The importance of blossoms for matching problems is well-known [4]. [sent-168, score-0.341]

75 A bad blossom is a blossom in which the edge weights satisfy w(C ∩ M ∗ ) + 2w(R ∩ M ∗ ) < w(C − M ∗ ) + 2w(R − M ∗ ). [sent-169, score-0.638]

76 3 The path may be of zero length, in which case the blossom is just the odd cycle. [sent-170, score-0.307]

77 5 b 3 Example: On the right is a bad blossom: bold edges are in M ∗ , numbers are edge weights and alphabets are node labels. [sent-171, score-0.557]

78 5 1 f 1 c g i 3 u 3 d A dumbbell is an alternating path that wraps onto itself twice, such that the result is two disjoint odd cycles C1 and C2 and an alternating path R connecting the two cycles. [sent-174, score-0.485]

79 In a bad dumbbell the edge weights satisfy w(C1 ∩ M ∗ ) + w(C2 ∩ M ∗ ) + 2w(R ∩ M ∗ ) < w(C1 − M ∗ ) + w(C2 − M ∗ ) + 2w(R − M ∗ ). [sent-175, score-0.318]

80 3 3 a 3 c 3 3 u 3 d 1 h f 3 3 j 3 i Proposition 1 If LP relaxation is loose, then in G there exists either a bad blossom, or a bad dumbbell. [sent-178, score-0.443]

81 Suppose now that max-product converges to M ∗ by iteration τ , and suppose also there exists a bad blossom B1 in G. [sent-182, score-0.392]

82 For an edge e ∈ B1 ∩ M ∗ consider the computation tree Te (τ + |V |) for e at time τ + |V |. [sent-183, score-0.178]

83 From the definition of convergence, it follows that near the root e, M will be the image of M ∗ onto Te : for any edge e in the tree at distance less than |V | from the root, e ∈ M if and only if its copy in G is in M ∗ . [sent-185, score-0.29]

84 This means that the copies in Te of the edges in B1 will contain an alternating path P in Te : every alternate edge of P will be in M . [sent-186, score-0.474]

85 For the bad blossom example above, the alternating path is ihgf cbaudcf ghi (it will go once around the cycle and twice around the path of the blossom). [sent-187, score-0.605]

86 Make a new matching M on Te (τ + |V |) by “switching” the edges in this path: M = M − (M ∩ P ) + (P − M ). [sent-188, score-0.481]

87 It follows from Proposition 1 that if LP relaxation is loose, then max-product cannot converge to M ∗ . [sent-193, score-0.193]

88 Hence, sparser networks with fewer edges are needed [7]. [sent-200, score-0.218]

89 The throughput of a link drops fast with distance, so the sparse network should mostly contain short edges. [sent-201, score-0.176]

90 An interesting set of sparse subgraph constructions with various tradeoffs addressing power efficiency in wireless networks is proposed in [7]. [sent-204, score-0.197]

91 984 0 0 10 20 30 40 50 40 60 60 80 100 120 140 160 180 200 (a) (b) Figure 2: (a) Histogram of node degrees versus node density. [sent-256, score-0.19]

92 (b) Average fraction of the LP upper bound on optimal cost obtained using LP relaxation and max-product. [sent-257, score-0.22]

93 We assign edge weights to be proportional to the throughput of the link. [sent-259, score-0.25]

94 odd-cycles where the weights on the edges are quite similar. [sent-272, score-0.247]

95 We show in simulations that bad blossoms appear rarely for the random graphs and weights in our construction, and LP-relaxation and MP produce nearly optimal b-matchings. [sent-273, score-0.278]

96 For the cases where LP relaxation has fractional edges, and MP has oscillating (or non-converged) edges, we erase them from the final matching and ensure that LP and MP solutions are valid matchings. [sent-274, score-0.551]

97 An important performance metric for sensor network self-organization is the power-stretch factor5 , which compares the weights of shortest paths in G to weights of shortest paths in the sparse subgraph. [sent-290, score-0.317]

98 We run MP for a fixed number of iterations, remove oscillating edges to get a valid matching, and plot the average fraction of the LP upper bound that the solution gets. [sent-299, score-0.25]

99 Sharma, “Maximum weight matching via max-product belief propagation,” in ISIT, Sept. [sent-316, score-0.337]

100 5 To compute the power-stretch the edges are weighted by d3 , i. [sent-387, score-0.301]


similar papers computed by tfidf model

tfidf for this paper:

wordName wordTfidf (topN-words)

[('xe', 0.428), ('lp', 0.413), ('matching', 0.288), ('te', 0.287), ('mt', 0.276), ('relaxation', 0.193), ('edges', 0.193), ('blossom', 0.178), ('mp', 0.136), ('ei', 0.129), ('xei', 0.125), ('edge', 0.118), ('bad', 0.11), ('weighted', 0.108), ('bi', 0.1), ('path', 0.096), ('nt', 0.084), ('node', 0.082), ('throughput', 0.078), ('ae', 0.075), ('zi', 0.073), ('loopy', 0.073), ('matchings', 0.071), ('alternating', 0.067), ('sensor', 0.064), ('tight', 0.064), ('tree', 0.06), ('subgraph', 0.059), ('weights', 0.054), ('blossoms', 0.053), ('stretch', 0.053), ('nodes', 0.051), ('propagation', 0.05), ('incident', 0.05), ('belief', 0.049), ('root', 0.048), ('wij', 0.048), ('power', 0.047), ('converges', 0.043), ('fractional', 0.042), ('zj', 0.042), ('sparse', 0.042), ('beliefs', 0.041), ('bipartite', 0.039), ('copy', 0.036), ('lemma', 0.036), ('abcdu', 0.036), ('disconnect', 0.036), ('dumbbell', 0.036), ('fe', 0.036), ('maxe', 0.036), ('cycle', 0.035), ('graph', 0.035), ('leaf', 0.035), ('paths', 0.035), ('graphs', 0.034), ('network', 0.033), ('odd', 0.033), ('unique', 0.032), ('loose', 0.032), ('dual', 0.031), ('complimentary', 0.031), ('heaviest', 0.031), ('wraps', 0.031), ('suppose', 0.031), ('disjoint', 0.031), ('bp', 0.031), ('exists', 0.03), ('integer', 0.03), ('gure', 0.03), ('plot', 0.029), ('answer', 0.029), ('oscillating', 0.028), ('onto', 0.028), ('integral', 0.028), ('proofs', 0.028), ('sensors', 0.028), ('optimal', 0.027), ('optimality', 0.027), ('ge', 0.026), ('endpoints', 0.026), ('disconnected', 0.026), ('slackness', 0.026), ('lbp', 0.026), ('bayati', 0.026), ('degrees', 0.026), ('factor', 0.026), ('connected', 0.026), ('program', 0.026), ('programming', 0.025), ('malioutov', 0.025), ('networks', 0.025), ('characterization', 0.025), ('factors', 0.024), ('depict', 0.024), ('communicate', 0.024), ('wireless', 0.024), ('adjacency', 0.023), ('twice', 0.023), ('link', 0.023)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 1.0000005 120 nips-2007-Linear programming analysis of loopy belief propagation for weighted matching

Author: Sujay Sanghavi, Dmitry Malioutov, Alan S. Willsky

Abstract: Loopy belief propagation has been employed in a wide variety of applications with great empirical success, but it comes with few theoretical guarantees. In this paper we investigate the use of the max-product form of belief propagation for weighted matching problems on general graphs. We show that max-product converges to the correct answer if the linear programming (LP) relaxation of the weighted matching problem is tight and does not converge if the LP relaxation is loose. This provides an exact characterization of max-product performance and reveals connections to the widely used optimization technique of LP relaxation. In addition, we demonstrate that max-product is effective in solving practical weighted matching problems in a distributed fashion by applying it to the problem of self-organization in sensor networks. 1

2 0.24920774 128 nips-2007-Message Passing for Max-weight Independent Set

Author: Sujay Sanghavi, Devavrat Shah, Alan S. Willsky

Abstract: We investigate the use of message-passing algorithms for the problem of finding the max-weight independent set (MWIS) in a graph. First, we study the performance of loopy max-product belief propagation. We show that, if it converges, the quality of the estimate is closely related to the tightness of an LP relaxation of the MWIS problem. We use this relationship to obtain sufficient conditions for correctness of the estimate. We then develop a modification of max-product – one that converges to an optimal solution of the dual of the MWIS problem. We also develop a simple iterative algorithm for estimating the max-weight independent set from this dual solution. We show that the MWIS estimate obtained using these two algorithms in conjunction is correct when the graph is bipartite and the MWIS is unique. Finally, we show that any problem of MAP estimation for probability distributions over finite domains can be reduced to an MWIS problem. We believe this reduction will yield new insights and algorithms for MAP estimation. 1

3 0.19663608 92 nips-2007-Fixing Max-Product: Convergent Message Passing Algorithms for MAP LP-Relaxations

Author: Amir Globerson, Tommi S. Jaakkola

Abstract: We present a novel message passing algorithm for approximating the MAP problem in graphical models. The algorithm is similar in structure to max-product but unlike max-product it always converges, and can be proven to find the exact MAP solution in various settings. The algorithm is derived via block coordinate descent in a dual of the LP relaxation of MAP, but does not require any tunable parameters such as step size or tree weights. We also describe a generalization of the method to cluster based potentials. The new method is tested on synthetic and real-world problems, and compares favorably with previous approaches. Graphical models are an effective approach for modeling complex objects via local interactions. In such models, a distribution over a set of variables is assumed to factor according to cliques of a graph with potentials assigned to each clique. Finding the assignment with highest probability in these models is key to using them in practice, and is often referred to as the MAP (maximum aposteriori) assignment problem. In the general case the problem is NP hard, with complexity exponential in the tree-width of the underlying graph. Linear programming (LP) relaxations have proven very useful in approximating the MAP problem, and often yield satisfactory empirical results. These approaches relax the constraint that the solution is integral, and generally yield non-integral solutions. However, when the LP solution is integral, it is guaranteed to be the exact MAP. For some classes of problems the LP relaxation is provably correct. These include the minimum cut problem and maximum weight matching in bi-partite graphs [8]. Although LP relaxations can be solved using standard LP solvers, this may be computationally intensive for large problems [13]. The key problem with generic LP solvers is that they do not use the graph structure explicitly and thus may be sub-optimal in terms of computational efficiency. The max-product method [7] is a message passing algorithm that is often used to approximate the MAP problem. In contrast to generic LP solvers, it makes direct use of the graph structure in constructing and passing messages, and is also very simple to implement. The relation between max-product and the LP relaxation has remained largely elusive, although there are some notable exceptions: For tree-structured graphs, max-product and LP both yield the exact MAP. A recent result [1] showed that for maximum weight matching on bi-partite graphs max-product and LP also yield the exact MAP [1]. Finally, Tree-Reweighted max-product (TRMP) algorithms [5, 10] were shown to converge to the LP solution for binary xi variables, as shown in [6]. In this work, we propose the Max Product Linear Programming algorithm (MPLP) - a very simple variation on max-product that is guaranteed to converge, and has several advantageous properties. MPLP is derived from the dual of the LP relaxation, and is equivalent to block coordinate descent in the dual. Although this results in monotone improvement of the dual objective, global convergence is not always guaranteed since coordinate descent may get stuck in suboptimal points. This can be remedied using various approaches, but in practice we have found MPLP to converge to the LP 1 solution in a majority of the cases we studied. To derive MPLP we use a special form of the dual LP, which involves the introduction of redundant primal variables and constraints. We show how the dual variables corresponding to these constraints turn out to be the messages in the algorithm. We evaluate the method on Potts models and protein design problems, and show that it compares favorably with max-product (which often does not converge for these problems) and TRMP. 1 The Max-Product and MPLP Algorithms The max-product algorithm [7] is one of the most often used methods for solving MAP problems. Although it is neither guaranteed to converge to the correct solution, or in fact converge at all, it provides satisfactory results in some cases. Here we present two algorithms: EMPLP (edge based MPLP) and NMPLP (node based MPLP), which are structurally very similar to max-product, but have several key advantages: • After each iteration, the messages yield an upper bound on the MAP value, and the sequence of bounds is monotone decreasing and convergent. The messages also have a limit point that is a fixed point of the update rule. • No additional parameters (e.g., tree weights as in [6]) are required. • If the fixed point beliefs have a unique maximizer then they correspond to the exact MAP. • For binary variables, MPLP can be used to obtain the solution to an LP relaxation of the MAP problem. Thus, when this LP relaxation is exact and variables are binary, MPLP will find the MAP solution. Moreover, for any variable whose beliefs are not tied, the MAP assignment can be found (i.e., the solution is partially decodable). Pseudo code for the algorithms (and for max-product) is given in Fig. 1. As we show in the next sections, MPLP is essentially a block coordinate descent algorithm in the dual of a MAP LP relaxation. Every update of the MPLP messages corresponds to exact minimization of a set of dual variables. For EMPLP minimization is over the set of variables corresponding to an edge, and for NMPLP it is over the set of variables corresponding to all the edges a given node appears in (i.e., a star). The properties of MPLP result from its relation to the LP dual. In what follows we describe the derivation of the MPLP algorithms and prove their properties. 2 The MAP Problem and its LP Relaxation We consider functions over n variables x = {x1 , . . . , xn } defined as follows. Given a graph G = (V, E) with n vertices, and potentials θij (xi , xj ) for all edges ij ∈ E, define the function1 f (x; θ) = θij (xi , xj ) . (1) ij∈E The MAP problem is defined as finding an assignment xM that maximizes the function f (x; θ). Below we describe the standard LP relaxation for this problem. Denote by {µij (xi , xj )}ij∈E distributions over variables corresponding to edges ij ∈ E and {µi (xi )}i∈V distributions corresponding to nodes i ∈ V . We will use µ to denote a given set of distributions over all edges and nodes. The set ML (G) is defined as the set of µ where pairwise and singleton distributions are consistent x ˆ xi µij (ˆi , xj ) = µj (xj ) , ˆ xj µij (xi , xj ) = µi (xi ) ∀ij ∈ E, xi , xj ˆ ML (G) = µ ≥ 0 ∀i ∈ V xi µi (xi ) = 1 Now consider the following linear program: MAPLPR : µL∗ = arg max µ∈ML (G) µ·θ. (2) where µ·θ is shorthand for µ·θ = ij∈E xi ,xj θij (xi , xj )µij (xi , xj ). It is easy to show (see e.g., [10]) that the optimum of MAPLPR yields an upper bound on the MAP value, i.e. µL∗ ·θ ≥ f (xM ). Furthermore, when the optimal µi (xi ) have only integral values, the assignment that maximizes µi (xi ) yields the correct MAP assignment. In what follows we show how the MPLP algorithms can be derived from the dual of MAPLPR. 1 P We note that some authors also add a term i∈V θi (xi ) to f (x; θ). However, these terms can be included in the pairwise functions θij (xi , xj ), so we ignore them for simplicity. 2 3 The LP Relaxation Dual Since MAPLPR is an LP, it has an equivalent convex dual. In App. A we derive a special dual of MAPLPR using a different representation of ML (G) with redundant variables. The advantage of this dual is that it allows the derivation of simple message passing algorithms. The dual is described in the following proposition. Proposition 1 The following optimization problem is a convex dual of MAPLPR DMAPLPR : min max xi i s.t. max βki (xk , xi ) (3) k∈N (i) xk βji (xj , xi ) + βij (xi , xj ) = θij (xi , xj ) , where the dual variables are βij (xi , xj ) for all ij, ji ∈ E and values of xi and xj . The dual has an intuitive interpretation in terms of re-parameterizations. Consider the star shaped graph Gi consisting of node i and all its neighbors N (i). Assume the potential on edge ki (for k ∈ N (i)) is βki (xk , xi ). The value of the MAP assignment for this model is max max βki (xk , xi ). This is exactly the term in the objective of DMAPLPR. Thus the dual xi k∈N (i) xk corresponds to individually decoding star graphs around all nodes i ∈ V where the potentials on the graph edges should sum to the original potential. It is easy to see that this will always result in an upper bound on the MAP value. The somewhat surprising result of the duality is that there exists a β assignment such that star decoding yields the optimal value of MAPLPR. 4 Block Coordinate Descent in the Dual To obtain a convergent algorithm we use a simple block coordinate descent strategy. At every iteration, fix all variables except a subset, and optimize over this subset. It turns out that this can be done in closed form for the cases we consider. We begin by deriving the EMPLP algorithm. Consider fixing all the β variables except those corresponding to some edge ij ∈ E (i.e., βij and βji ), and minimizing DMAPLPR over the non-fixed variables. Only two terms in the DMAPLPR objective depend on βij and βji . We can write those as f (βij , βji ) = max λ−j (xi ) + max βji (xj , xi ) + max λ−i (xj ) + max βij (xi , xj ) i j xi where we defined λ−j (xi ) = i xj k∈N (i)\j xi xi (4) λki (xi ) and λki (xi ) = maxxk βki (xk , xi ) as in App. A. Note that the function f (βij , βji ) depends on the other β values only through λ−i (xj ) and λ−j (xi ). j i This implies that the optimization can be done solely in terms of λij (xj ) and there is no need to store the β values explicitly. The optimal βij , βji are obtained by minimizing f (βij , βji ) subject to the re-parameterization constraint βji (xj , xi ) + βij (xi , xj ) = θij (xi , xj ). The following proposition characterizes the minimum of f (βij , βji ). In fact, as mentioned above, we do not need to characterize the optimal βij (xi , xj ) itself, but only the new λ values. Proposition 2 Maximizing the function f (βij , βji ) yields the following λji (xi ) (and the equivalent expression for λij (xj )) 1 −j 1 λji (xi ) = − λi (xi ) + max λ−i (xj ) + θij (xi , xj ) j 2 2 xj The proposition is proved in App. B. The λ updates above result in the EMPLP algorithm, described in Fig. 1. Note that since the β optimization affects both λji (xi ) and λij (xj ), both these messages need to be updated simultaneously. We proceed to derive the NMPLP algorithm. For a given node i ∈ V , we consider all its neighbors j ∈ N (i), and wish to optimize over the variables βji (xj , xi ) for ji, ij ∈ E (i.e., all the edges in a star centered on i), while the other variables are fixed. One way of doing so is to use the EMPLP algorithm for the edges in the star, and iterate it until convergence. We now show that the result of 3 Inputs: A graph G = (V, E), potential functions θij (xi , xj ) for each edge ij ∈ E. Initialization: Initialize messages to any value. Algorithm: • Iterate until a stopping criterion is satisfied: – Max-product: Iterate over messages and update (cji shifts the max to zero) h i mji (xi )← max m−i (xj ) + θij (xi , xj ) − cji j xj – EMPLP: For each ij ∈ E, update λji (xi ) and λij (xj ) simultaneously (the update for λij (xj ) is the same with i and j exchanged) h i 1 1 λji (xi )← − λ−j (xi ) + max λ−i (xj ) + θij (xi , xj ) j i 2 2 xj – NMPLP: Iterate over nodes i ∈ V and update all γij (xj ) where j ∈ N (i) 2 3 X 2 γij (xj )← max 4θij (xi , xj ) − γji (xi ) + γki (xi )5 xi |N (i)| + 1 k∈N(i) • Calculate node “beliefs”: Set biP i ) to be the sum of incoming messages into node i ∈ V (x (e.g., for NMPLP set bi (xi ) = k∈N(i) γki (xi )). Output: Return assignment x defined as xi = arg maxxi b(ˆi ). x ˆ Figure 1: The max-product, EMPLP and NMPLP algorithms. Max-product, EMPLP and NMPLP use mesP sages mij , λij and γij respectively. We use the notation m−i (xj ) = k∈N(j)\i mkj (xj ). j this optimization can be found in closed form. The assumption about β being fixed outside the star implies that λ−i (xj ) is fixed. Define: γji (xi ) = maxxj θij (xi , xj ) + λ−i (xj ) . Simple algebra j j yields the following relation between λ−j (xi ) and γki (xi ) for k ∈ N (i) i 2 λ−j (xi ) = −γji (xi ) + γki (xi ) (5) i |N (i)| + 1 k∈N (i) Plugging this into the definition of γji (xi ) we obtain the NMPLP update in Fig. 1. The messages for both algorithms can be initialized to any value since it can be shown that after one iteration they will correspond to valid β values. 5 Convergence Properties The MPLP algorithm decreases the dual objective (i.e., an upper bound on the MAP value) at every iteration, and thus its dual objective values form a convergent sequence. Using arguments similar to [5] it can be shown that MPLP has a limit point that is a fixed point of its updates. This in itself does not guarantee convergence to the dual optimum since coordinate descent algorithms may get stuck at a point that is not a global optimum. There are ways of overcoming this difficulty, for example by smoothing the objective [4] or using techniques as in [2] (see p. 636). We leave such extensions for further work. In this section we provide several results about the properties of the MPLP fixed points and their relation to the corresponding LP. First, we claim that if all beliefs have unique maxima then the exact MAP assignment is obtained. Proposition 3 If the fixed point of MPLP has bi (xi ) such that for all i the function bi (xi ) has a unique maximizer x∗ , then x∗ is the solution to the MAP problem and the LP relaxation is exact. i Since the dual objective is always greater than or equal to the MAP value, it suffices to show that there exists a dual feasible point whose objective value is f (x∗ ). Denote by β ∗ , λ∗ the value of the corresponding dual parameters at the fixed point of MPLP. Then the dual objective satisfies λ∗ (xi ) = ki max i xi k∈N (i) ∗ max βki (xk , x∗ ) = i i k∈N (i) xk ∗ βki (x∗ , x∗ ) = f (x∗ ) k i i 4 k∈N (i) To see why the second equality holds, note that bi (x∗ ) = maxxi ,xj λ−j (xi ) + βji (xj , xi ) and i i bj (x∗ ) = maxxi ,xj λ−i (xj ) + βij (xi , xj ). By the equalization property in Eq. 9 the arguments of j j the two max operations are equal. From the unique maximum assumption it follows that x∗ , x∗ are i j the unique maximizers of the above. It follows that βji , βij are also maximized by x∗ , x∗ . i j In the general case, the MPLP fixed point may not correspond to a primal optimum because of the local optima problem with coordinate descent. However, when the variables are binary, fixed points do correspond to primal solutions, as the following proposition states. Proposition 4 When xi are binary, the MPLP fixed point can be used to obtain the primal optimum. The claim can be shown by constructing a primal optimal solution µ∗ . For tied bi , set µ∗ (xi ) to 0.5 i and for untied bi , set µ∗ (x∗ ) to 1. If bi , bj are not tied we set µ∗ (x∗ , x∗ ) = 1. If bi is not tied but bj i i ij i j is, we set µ∗ (x∗ , xj ) = 0.5. If bi , bj are tied then βji , βij can be shown to be maximized at either ij i x∗ , x∗ = (0, 0), (1, 1) or x∗ , x∗ = (0, 1), (1, 0). We then set µ∗ to be 0.5 at one of these assignment i j i j ij ∗ pairs. The resulting µ∗ is clearly primal feasible. Setting δi = b∗ we obtain that the dual variables i (δ ∗ , λ∗ , β ∗ ) and primal µ∗ satisfy complementary slackness for the LP in Eq. 7 and therefore µ∗ is primal optimal. The binary optimality result implies partial decodability, since [6] shows that the LP is partially decodable for binary variables. 6 Beyond pairwise potentials: Generalized MPLP In the previous sections we considered maximizing functions which factor according to the edges of the graph. A more general setting considers clusters c1 , . . . , ck ⊂ {1, . . . , n} (the set of clusters is denoted by C), and a function f (x; θ) = c θc (xc ) defined via potentials over clusters θc (xc ). The MAP problem in this case also has an LP relaxation (see e.g. [11]). To define the LP we introduce the following definitions: S = {c ∩ c : c, c ∈ C, c ∩ c = ∅} is the set of intersection between clusters ˆ ˆ ˆ and S(c) = {s ∈ S : s ⊆ c} is the set of overlap sets for cluster c.We now consider marginals over the variables in c ∈ C and s ∈ S and require that cluster marginals agree on their overlap. Denote this set by ML (C). The LP relaxation is then to maximize µ · θ subject to µ ∈ ML (C). As in Sec. 4, we can derive message passing updates that result in monotone decrease of the dual LP of the above relaxation. The derivation is similar and we omit the details. The key observation is that one needs to introduce |S(c)| copies of each marginal µc (xc ) (instead of the two copies in the pairwise case). Next, as in the EMPLP derivation we assume all β are fixed except those corresponding to some cluster c. The resulting messages are λc→s (xs ) from a cluster c to all of its intersection sets s ∈ S(c). The update on these messages turns out to be:   1 1 λ−c (xs ) + max  λ−c (xs ) + θc (xc ) λc→s (xs ) = − 1 − ˆ s s ˆ |S(c)| |S(c)| xc\s s∈S(c)\s ˆ where for a given c ∈ C all λc→s should be updated simultaneously for s ∈ S(c), and λ−c (xs ) is s defined as the sum of messages into s that are not from c. We refer to this algorithm as Generalized EMPLP (GEMPLP). It is possible to derive an algorithm similar to NMPLP that updates several clusters simultaneously, but its structure is more involved and we do not address it here. 7 Related Work Weiss et al. [11] recently studied the fixed points of a class of max-product like algorithms. Their analysis focused on properties of fixed points rather than convergence guarantees. Specifically, they showed that if the counting numbers used in a generalized max-product algorithm satisfy certain properties, then its fixed points will be the exact MAP if the beliefs have unique maxima, and for binary variables the solution can be partially decodable. Both these properties are obtained for the MPLP fixed points, and in fact we can show that MPLP satisfies the conditions in [11], so that we obtain these properties as corollaries of [11]. We stress however, that [11] does not address convergence of algorithms, but rather properties of their fixed points, if they converge. MPLP is similar in some aspects to Kolmogorov’s TRW-S algorithm [5]. TRW-S is also a monotone coordinate descent method in a dual of the LP relaxation and its fixed points also have similar 5 guarantees to those of MPLP [6]. Furthermore, convergence to a local optimum may occur, as it does for MPLP. One advantage of MPLP lies in the simplicity of its updates and the fact that it is parameter free. The other is its simple generalization to potentials over clusters of nodes (Sec. 6). Recently, several new dual LP algorithms have been introduced, which are more closely related to our formalism. Werner [12] presented a class of algorithms which also improve the dual LP at every iteration. The simplest of those is the max-sum-diffusion algorithm, which is similar to our EMPLP algorithm, although the updates are different from ours. Independently, Johnson et al. [4] presented a class of algorithms that improve duals of the MAP-LP using coordinate descent. They decompose the model into tractable parts by replicating variables and enforce replication constraints within the Lagrangian dual. Our basic formulation in Eq. 3 could be derived from their perspective. However, the updates in the algorithm and the analysis differ. Johnson et al. also presented a method for overcoming the local optimum problem, by smoothing the objective so that it is strictly convex. Such an approach could also be used within our algorithms. Vontobel and Koetter [9] recently introduced a coordinate descent algorithm for decoding LDPC codes. Their method is specifically tailored for this case, and uses updates that are similar to our edge based updates. Finally, the concept of dual coordinate descent may be used in approximating marginals as well. In [3] we use such an approach to optimize a variational bound on the partition function. The derivation uses some of the ideas used in the MPLP dual, but importantly does not find the minimum for each coordinate. Instead, a gradient like step is taken at every iteration to decrease the dual objective. 8 Experiments We compared NMPLP to three other message passing algorithms:2 Tree-Reweighted max-product (TRMP) [10],3 standard max-product (MP), and GEMPLP. For MP and TRMP we used the standard approach of damping messages using a factor of α = 0.5. We ran all algorithms for a maximum of 2000 iterations, and used the hit-time measure to compare their speed of convergence. This measure is defined as follows: At every iteration the beliefs can be used to obtain an assignment x with value f (x). We define the hit-time as the first iteration at which the maximum value of f (x) is achieved.4 We first experimented with a 10 × 10 grid graph, with 5 values per state. The function f (x) was 5 a Potts model: f (x) = The values for θij and θi (xi ) ij∈E θij I(xi = xj ) + i∈V θi (xi ). were randomly drawn from [−cI , cI ] and [−cF , cF ] respectively, and we used values of cI and cF in the range range [0.1, 2.35] (with intervals of 0.25), resulting in 100 different models. The clusters for GEMPLP were the faces of the graph [14]. To see if NMPLP converges to the LP solution we also used an LP solver to solve the LP relaxation. We found that the the normalized difference between NMPLP and LP objective was at most 10−3 (median 10−7 ), suggesting that NMPLP typically converged to the LP solution. Fig. 2 (top row) shows the results for the three algorithms. It can be seen that while all non-cluster based algorithms obtain similar f (x) values, NMPLP has better hit-time (in the median) than TRMP and MP, and MP does not converge in many cases (see caption). GEMPLP converges more slowly than NMPLP, but obtains much better f (x) values. In fact, in 99% of the cases the normalized difference between the GEMPLP objective and the f (x) value was less than 10−5 , suggesting that the exact MAP solution was found. We next applied the algorithms to the real world problems of protein design. In [13], Yanover et al. show how these problems can be formalized in terms of finding a MAP in an appropriately constructed graphical model.6 We used all algorithms except GNMPLP (since there is no natural choice for clusters in this case) to approximate the MAP solution on the 97 models used in [13]. In these models the number of states per variable is 2 − 158, and there are up to 180 variables per model. Fig. 2 (bottom) shows results for all the design problems. In this case only 11% of the MP runs converged, and NMPLP was better than TRMP in terms of hit-time and comparable in f (x) value. The performance of MP was good on the runs where it converged. 2 As expected, NMPLP was faster than EMPLP so only NMPLP results are given. The edge weights for TRMP corresponded to a uniform distribution over all spanning trees. 4 This is clearly a post-hoc measure since it can only be obtained after the algorithm has exceeded its maximum number of iterations. However, it is a reasonable algorithm-independent measure of convergence. 5 The potential θi (xi ) may be folded into the pairwise potential to yield a model as in Eq. 1. 6 Data available from http://jmlr.csail.mit.edu/papers/volume7/yanover06a/Rosetta Design Dataset.tgz 3 6 (a) (b) (c) 100 (d) 0.6 2000 0.04 0.4 0.02 −50 0 −0.02 −0.04 ∆(Value) 0 1000 ∆(Hit Time) ∆(Value) ∆(Hit Time) 50 0 MP TRMP GMPLP 0 −0.2 −1000 −0.4 −0.06 −100 0.2 MP TRMP GMPLP MP TRMP MP TRMP Figure 2: Evaluation of message passing algorithms on Potts models and protein design problems. (a,c): Convergence time results for the Potts models (a) and protein design problems (c). The box-plots (horiz. red line indicates median) show the difference between the hit-time for the other algorithms and NMPLP. (b,d): Value of integer solutions for the Potts models (b) and protein design problems (d). The box-plots show the normalized difference between the value of f (x) for NMPLP and the other algorithms. All figures are such that better MPLP performance yields positive Y axis values. Max-product converged on 58% of the cases for the Potts models, and on 11% of the protein problems. Only convergent max-product runs are shown. 9 Conclusion We have presented a convergent algorithm for MAP approximation that is based on block coordinate descent of the MAP-LP relaxation dual. The algorithm can also be extended to cluster based functions, which result empirically in improved MAP estimates. This is in line with the observations in [14] that generalized belief propagation algorithms can result in significant performance improvements. However generalized max-product algorithms [14] are not guaranteed to converge whereas GMPLP is. Furthermore, the GMPLP algorithm does not require a region graph and only involves intersection between pairs of clusters. In conclusion, MPLP has the advantage of resolving the convergence problems of max-product while retaining its simplicity, and offering the theoretical guarantees of LP relaxations. We thus believe it should be useful in a wide array of applications. A Derivation of the dual Before deriving the dual, we first express the constraint set ML (G) in a slightly different way. The definition of ML (G) in Sec. 2 uses a single distribution µij (xi , xj ) for every ij ∈ E. In what follows, we use two copies of this pairwise distribution for every edge, which we denote µij (xi , xj ) ¯ and µji (xj , xi ), and we add the constraint that these two copies both equal the original µij (xi , xj ). ¯ For this extended set of pairwise marginals, we consider the following set of constraints which is clearly equivalent to ML (G). On the rightmost column we give the dual variables that will correspond to each constraint (we omit non-negativity constraints). µij (xi , xj ) = µij (xi , xj ) ¯ µji (xj , xi ) = µij (xi , xj ) ¯ x xi µij (ˆi , xj ) = µj (xj ) ˆ ¯ µji (ˆj , xi ) = µi (xi ) ¯ x xj ˆ xi µi (xi ) = 1 ∀ij ∈ E, xi , xj ∀ij ∈ E, xi , xj ∀ij ∈ E, xj ∀ji ∈ E, xi ∀i ∈ V βij (xi , xj ) βji (xj , xi ) λij (xj ) λji (xi ) δi (6) ¯ We denote the set of (µ, µ) satisfying these constraints by ML (G). We can now state an LP that ¯ is equivalent to MAPLPR, only with an extended set of variables and constraints. The equivalent ¯ problem is to maximize µ · θ subject to (µ, µ) ∈ ML (G) (note that the objective uses the original ¯ µ copy). LP duality transformation of the extended problem yields the following LP min i δi s.t. λij (xj ) − βij (xi , xj ) ≥ 0 βij (xi , xj ) + βji (xj , xi ) = θij (xi , xj ) − k∈N (i) λki (xi ) + δi ≥ 0 ∀ij, ji ∈ E, xi , xj ∀ij ∈ E, xi , xj ∀i ∈ V, xi (7) We next simplify the above LP by eliminating some of its constraints and variables. Since each variable δi appears in only one constraint, and the objective minimizes δi it follows that δi = maxxi k∈N (i) λki (xi ) and the constraints with δi can be discarded. Similarly, since λij (xj ) appears in a single constraint, we have that for all ij ∈ E, ji ∈ E, xi , xj λij (xj ) = maxxi βij (xi , xj ) and the constraints with λij (xj ), λji (xi ) can also be discarded. Using the eliminated δi and λji (xi ) 7 variables, we obtain that the LP in Eq. 7 is equivalent to that in Eq. 3. Note that the objective in Eq. 3 is convex since it is a sum of point-wise maxima of convex functions. B Proof of Proposition 2 We wish to minimize f in Eq. 4 subject to the constraint that βij + βji = θij . Rewrite f as f (βij , βji ) = max λ−j (xi ) + βji (xj , xi ) + max λ−i (xj ) + βij (xi , xj ) j i xi ,xj xi ,xj (8) The sum of the two arguments in the max is λ−j (xi ) + λ−i (xj ) + θij (xi , xj ) i j (because of the constraints on β). Thus the minimum must be greater than −j −i 1 2 maxxi ,xj λi (xi ) + λj (xj ) + θij (xi , xj ) . One assignment to β that achieves this minimum is obtained by requiring an equalization condition:7 λ−i (xj ) + βij (xi , xj ) = λ−j (xi ) + βji (xj , xi ) = j i 1 θij (xi , xj ) + λ−j (xi ) + λ−i (xj ) i j 2 (9) which implies βij (xi , xj ) = 1 θij (xi , xj ) + λ−j (xi ) − λ−i (xj ) and a similar expression for βji . i j 2 The resulting λij (xj ) = maxxi βij (xi , xj ) are then the ones in Prop. 2. Acknowledgments The authors acknowledge support from the Defense Advanced Research Projects Agency (Transfer Learning program). Amir Globerson was also supported by the Rothschild Yad-Hanadiv fellowship. References [1] M. Bayati, D. Shah, and M. Sharma. Maximum weight matching via max-product belief propagation. IEEE Trans. on Information Theory (to appear), 2007. [2] D. P. Bertsekas, editor. Nonlinear Programming. Athena Scientific, Belmont, MA, 1995. [3] A. Globerson and T. Jaakkola. Convergent propagation algorithms via oriented trees. In UAI. 2007. [4] J.K. Johnson, D.M. Malioutov, and A.S. Willsky. Lagrangian relaxation for map estimation in graphical models. In Allerton Conf. Communication, Control and Computing, 2007. [5] V. Kolmogorov. Convergent tree-reweighted message passing for energy minimization. IEEE Transactions on Pattern Analysis and Machine Intelligence, 28(10):1568–1583, 2006. [6] V. Kolmogorov and M. Wainwright. On the optimality of tree-reweighted max-product message passing. In 21st Conference on Uncertainty in Artificial Intelligence (UAI). 2005. [7] J. Pearl. Probabilistic Reasoning in Intelligent Systems. Morgan Kaufmann, 1988. [8] B. Taskar, S. Lacoste-Julien, and M. Jordan. Structured prediction, dual extragradient and bregman projections. Journal of Machine Learning Research, pages 1627–1653, 2006. [9] P.O. Vontobel and R. Koetter. Towards low-complexity linear-programming decoding. In Proc. 4th Int. Symposium on Turbo Codes and Related Topics, 2006. [10] M. J. Wainwright, T. Jaakkola, and A. S. Willsky. Map estimation via agreement on trees: messagepassing and linear programming. IEEE Trans. on Information Theory, 51(11):1120–1146, 2005. [11] Y. Weiss, C. Yanover, and T. Meltzer. Map estimation, linear programming and belief propagation with convex free energies. In UAI. 2007. [12] T. Werner. A linear programming approach to max-sum, a review. IEEE Trans. on PAMI, 2007. [13] C. Yanover, T. Meltzer, and Y. Weiss. Linear programming relaxations and belief propagation – an empirical study. Jourmal of Machine Learning Research, 7:1887–1907, 2006. [14] J.S. Yedidia, W.T. W.T. Freeman, and Y. Weiss. Constructing free-energy approximations and generalized belief propagation algorithms. IEEE Trans. on Information Theory, 51(7):2282–2312, 2005. 7 Other solutions are possible but may not yield some of the properties of MPLP. 8

4 0.16798699 23 nips-2007-An Analysis of Convex Relaxations for MAP Estimation

Author: Pawan Mudigonda, Vladimir Kolmogorov, Philip Torr

Abstract: The problem of obtaining the maximum a posteriori estimate of a general discrete random field (i.e. a random field defined using a finite and discrete set of labels) is known to be NP-hard. However, due to its central importance in many applications, several approximate algorithms have been proposed in the literature. In this paper, we present an analysis of three such algorithms based on convex relaxations: (i) LP - S: the linear programming (LP) relaxation proposed by Schlesinger [20] for a special case and independently in [4, 12, 23] for the general case; (ii) QP - RL: the quadratic programming (QP) relaxation by Ravikumar and Lafferty [18]; and (iii) SOCP - MS: the second order cone programming (SOCP) relaxation first proposed by Muramatsu and Suzuki [16] for two label problems and later extended in [14] for a general label set. We show that the SOCP - MS and the QP - RL relaxations are equivalent. Furthermore, we prove that despite the flexibility in the form of the constraints/objective function offered by QP and SOCP, the LP - S relaxation strictly dominates (i.e. provides a better approximation than) QP - RL and SOCP - MS. We generalize these results by defining a large class of SOCP (and equivalent QP) relaxations which is dominated by the LP - S relaxation. Based on these results we propose some novel SOCP relaxations which strictly dominate the previous approaches.

5 0.11323982 48 nips-2007-Collective Inference on Markov Models for Modeling Bird Migration

Author: M.a. S. Elmohamed, Dexter Kozen, Daniel R. Sheldon

Abstract: We investigate a family of inference problems on Markov models, where many sample paths are drawn from a Markov chain and partial information is revealed to an observer who attempts to reconstruct the sample paths. We present algorithms and hardness results for several variants of this problem which arise by revealing different information to the observer and imposing different requirements for the reconstruction of sample paths. Our algorithms are analogous to the classical Viterbi algorithm for Hidden Markov Models, which finds the single most probable sample path given a sequence of observations. Our work is motivated by an important application in ecology: inferring bird migration paths from a large database of observations. 1

6 0.10336293 187 nips-2007-Structured Learning with Approximate Inference

7 0.096820094 20 nips-2007-Adaptive Embedded Subgraph Algorithms using Walk-Sum Analysis

8 0.094294816 75 nips-2007-Efficient Bayesian Inference for Dynamically Changing Graphs

9 0.090168215 141 nips-2007-New Outer Bounds on the Marginal Polytope

10 0.088808641 193 nips-2007-The Distribution Family of Similarity Distances

11 0.081892304 121 nips-2007-Local Algorithms for Approximate Inference in Minor-Excluded Graphs

12 0.080365978 110 nips-2007-Learning Bounds for Domain Adaptation

13 0.079150252 12 nips-2007-A Spectral Regularization Framework for Multi-Task Structure Learning

14 0.074622132 151 nips-2007-Optimistic Linear Programming gives Logarithmic Regret for Irreducible MDPs

15 0.069420144 63 nips-2007-Convex Relaxations of Latent Variable Training

16 0.068429336 112 nips-2007-Learning Monotonic Transformations for Classification

17 0.065240316 76 nips-2007-Efficient Convex Relaxation for Transductive Support Vector Machine

18 0.064594999 94 nips-2007-Gaussian Process Models for Link Analysis and Transfer Learning

19 0.064329788 123 nips-2007-Loop Series and Bethe Variational Bounds in Attractive Graphical Models

20 0.063495338 197 nips-2007-The Infinite Markov Model


similar papers computed by lsi model

lsi for this paper:

topicId topicWeight

[(0, -0.175), (1, -0.028), (2, -0.106), (3, 0.035), (4, -0.077), (5, -0.28), (6, 0.057), (7, 0.199), (8, 0.125), (9, -0.089), (10, -0.043), (11, -0.075), (12, -0.099), (13, 0.098), (14, 0.064), (15, -0.062), (16, 0.01), (17, -0.014), (18, 0.04), (19, -0.057), (20, -0.176), (21, 0.04), (22, -0.064), (23, -0.121), (24, 0.024), (25, -0.107), (26, 0.045), (27, -0.078), (28, -0.001), (29, -0.12), (30, -0.077), (31, -0.055), (32, -0.019), (33, 0.107), (34, 0.106), (35, -0.046), (36, -0.119), (37, 0.055), (38, -0.004), (39, -0.097), (40, 0.021), (41, -0.039), (42, 0.003), (43, 0.02), (44, 0.157), (45, -0.015), (46, -0.049), (47, 0.068), (48, 0.016), (49, 0.068)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.97961724 120 nips-2007-Linear programming analysis of loopy belief propagation for weighted matching

Author: Sujay Sanghavi, Dmitry Malioutov, Alan S. Willsky

Abstract: Loopy belief propagation has been employed in a wide variety of applications with great empirical success, but it comes with few theoretical guarantees. In this paper we investigate the use of the max-product form of belief propagation for weighted matching problems on general graphs. We show that max-product converges to the correct answer if the linear programming (LP) relaxation of the weighted matching problem is tight and does not converge if the LP relaxation is loose. This provides an exact characterization of max-product performance and reveals connections to the widely used optimization technique of LP relaxation. In addition, we demonstrate that max-product is effective in solving practical weighted matching problems in a distributed fashion by applying it to the problem of self-organization in sensor networks. 1

2 0.84739798 128 nips-2007-Message Passing for Max-weight Independent Set

Author: Sujay Sanghavi, Devavrat Shah, Alan S. Willsky

Abstract: We investigate the use of message-passing algorithms for the problem of finding the max-weight independent set (MWIS) in a graph. First, we study the performance of loopy max-product belief propagation. We show that, if it converges, the quality of the estimate is closely related to the tightness of an LP relaxation of the MWIS problem. We use this relationship to obtain sufficient conditions for correctness of the estimate. We then develop a modification of max-product – one that converges to an optimal solution of the dual of the MWIS problem. We also develop a simple iterative algorithm for estimating the max-weight independent set from this dual solution. We show that the MWIS estimate obtained using these two algorithms in conjunction is correct when the graph is bipartite and the MWIS is unique. Finally, we show that any problem of MAP estimation for probability distributions over finite domains can be reduced to an MWIS problem. We believe this reduction will yield new insights and algorithms for MAP estimation. 1

3 0.71742076 23 nips-2007-An Analysis of Convex Relaxations for MAP Estimation

Author: Pawan Mudigonda, Vladimir Kolmogorov, Philip Torr

Abstract: The problem of obtaining the maximum a posteriori estimate of a general discrete random field (i.e. a random field defined using a finite and discrete set of labels) is known to be NP-hard. However, due to its central importance in many applications, several approximate algorithms have been proposed in the literature. In this paper, we present an analysis of three such algorithms based on convex relaxations: (i) LP - S: the linear programming (LP) relaxation proposed by Schlesinger [20] for a special case and independently in [4, 12, 23] for the general case; (ii) QP - RL: the quadratic programming (QP) relaxation by Ravikumar and Lafferty [18]; and (iii) SOCP - MS: the second order cone programming (SOCP) relaxation first proposed by Muramatsu and Suzuki [16] for two label problems and later extended in [14] for a general label set. We show that the SOCP - MS and the QP - RL relaxations are equivalent. Furthermore, we prove that despite the flexibility in the form of the constraints/objective function offered by QP and SOCP, the LP - S relaxation strictly dominates (i.e. provides a better approximation than) QP - RL and SOCP - MS. We generalize these results by defining a large class of SOCP (and equivalent QP) relaxations which is dominated by the LP - S relaxation. Based on these results we propose some novel SOCP relaxations which strictly dominate the previous approaches.

4 0.58598357 92 nips-2007-Fixing Max-Product: Convergent Message Passing Algorithms for MAP LP-Relaxations

Author: Amir Globerson, Tommi S. Jaakkola

Abstract: We present a novel message passing algorithm for approximating the MAP problem in graphical models. The algorithm is similar in structure to max-product but unlike max-product it always converges, and can be proven to find the exact MAP solution in various settings. The algorithm is derived via block coordinate descent in a dual of the LP relaxation of MAP, but does not require any tunable parameters such as step size or tree weights. We also describe a generalization of the method to cluster based potentials. The new method is tested on synthetic and real-world problems, and compares favorably with previous approaches. Graphical models are an effective approach for modeling complex objects via local interactions. In such models, a distribution over a set of variables is assumed to factor according to cliques of a graph with potentials assigned to each clique. Finding the assignment with highest probability in these models is key to using them in practice, and is often referred to as the MAP (maximum aposteriori) assignment problem. In the general case the problem is NP hard, with complexity exponential in the tree-width of the underlying graph. Linear programming (LP) relaxations have proven very useful in approximating the MAP problem, and often yield satisfactory empirical results. These approaches relax the constraint that the solution is integral, and generally yield non-integral solutions. However, when the LP solution is integral, it is guaranteed to be the exact MAP. For some classes of problems the LP relaxation is provably correct. These include the minimum cut problem and maximum weight matching in bi-partite graphs [8]. Although LP relaxations can be solved using standard LP solvers, this may be computationally intensive for large problems [13]. The key problem with generic LP solvers is that they do not use the graph structure explicitly and thus may be sub-optimal in terms of computational efficiency. The max-product method [7] is a message passing algorithm that is often used to approximate the MAP problem. In contrast to generic LP solvers, it makes direct use of the graph structure in constructing and passing messages, and is also very simple to implement. The relation between max-product and the LP relaxation has remained largely elusive, although there are some notable exceptions: For tree-structured graphs, max-product and LP both yield the exact MAP. A recent result [1] showed that for maximum weight matching on bi-partite graphs max-product and LP also yield the exact MAP [1]. Finally, Tree-Reweighted max-product (TRMP) algorithms [5, 10] were shown to converge to the LP solution for binary xi variables, as shown in [6]. In this work, we propose the Max Product Linear Programming algorithm (MPLP) - a very simple variation on max-product that is guaranteed to converge, and has several advantageous properties. MPLP is derived from the dual of the LP relaxation, and is equivalent to block coordinate descent in the dual. Although this results in monotone improvement of the dual objective, global convergence is not always guaranteed since coordinate descent may get stuck in suboptimal points. This can be remedied using various approaches, but in practice we have found MPLP to converge to the LP 1 solution in a majority of the cases we studied. To derive MPLP we use a special form of the dual LP, which involves the introduction of redundant primal variables and constraints. We show how the dual variables corresponding to these constraints turn out to be the messages in the algorithm. We evaluate the method on Potts models and protein design problems, and show that it compares favorably with max-product (which often does not converge for these problems) and TRMP. 1 The Max-Product and MPLP Algorithms The max-product algorithm [7] is one of the most often used methods for solving MAP problems. Although it is neither guaranteed to converge to the correct solution, or in fact converge at all, it provides satisfactory results in some cases. Here we present two algorithms: EMPLP (edge based MPLP) and NMPLP (node based MPLP), which are structurally very similar to max-product, but have several key advantages: • After each iteration, the messages yield an upper bound on the MAP value, and the sequence of bounds is monotone decreasing and convergent. The messages also have a limit point that is a fixed point of the update rule. • No additional parameters (e.g., tree weights as in [6]) are required. • If the fixed point beliefs have a unique maximizer then they correspond to the exact MAP. • For binary variables, MPLP can be used to obtain the solution to an LP relaxation of the MAP problem. Thus, when this LP relaxation is exact and variables are binary, MPLP will find the MAP solution. Moreover, for any variable whose beliefs are not tied, the MAP assignment can be found (i.e., the solution is partially decodable). Pseudo code for the algorithms (and for max-product) is given in Fig. 1. As we show in the next sections, MPLP is essentially a block coordinate descent algorithm in the dual of a MAP LP relaxation. Every update of the MPLP messages corresponds to exact minimization of a set of dual variables. For EMPLP minimization is over the set of variables corresponding to an edge, and for NMPLP it is over the set of variables corresponding to all the edges a given node appears in (i.e., a star). The properties of MPLP result from its relation to the LP dual. In what follows we describe the derivation of the MPLP algorithms and prove their properties. 2 The MAP Problem and its LP Relaxation We consider functions over n variables x = {x1 , . . . , xn } defined as follows. Given a graph G = (V, E) with n vertices, and potentials θij (xi , xj ) for all edges ij ∈ E, define the function1 f (x; θ) = θij (xi , xj ) . (1) ij∈E The MAP problem is defined as finding an assignment xM that maximizes the function f (x; θ). Below we describe the standard LP relaxation for this problem. Denote by {µij (xi , xj )}ij∈E distributions over variables corresponding to edges ij ∈ E and {µi (xi )}i∈V distributions corresponding to nodes i ∈ V . We will use µ to denote a given set of distributions over all edges and nodes. The set ML (G) is defined as the set of µ where pairwise and singleton distributions are consistent x ˆ xi µij (ˆi , xj ) = µj (xj ) , ˆ xj µij (xi , xj ) = µi (xi ) ∀ij ∈ E, xi , xj ˆ ML (G) = µ ≥ 0 ∀i ∈ V xi µi (xi ) = 1 Now consider the following linear program: MAPLPR : µL∗ = arg max µ∈ML (G) µ·θ. (2) where µ·θ is shorthand for µ·θ = ij∈E xi ,xj θij (xi , xj )µij (xi , xj ). It is easy to show (see e.g., [10]) that the optimum of MAPLPR yields an upper bound on the MAP value, i.e. µL∗ ·θ ≥ f (xM ). Furthermore, when the optimal µi (xi ) have only integral values, the assignment that maximizes µi (xi ) yields the correct MAP assignment. In what follows we show how the MPLP algorithms can be derived from the dual of MAPLPR. 1 P We note that some authors also add a term i∈V θi (xi ) to f (x; θ). However, these terms can be included in the pairwise functions θij (xi , xj ), so we ignore them for simplicity. 2 3 The LP Relaxation Dual Since MAPLPR is an LP, it has an equivalent convex dual. In App. A we derive a special dual of MAPLPR using a different representation of ML (G) with redundant variables. The advantage of this dual is that it allows the derivation of simple message passing algorithms. The dual is described in the following proposition. Proposition 1 The following optimization problem is a convex dual of MAPLPR DMAPLPR : min max xi i s.t. max βki (xk , xi ) (3) k∈N (i) xk βji (xj , xi ) + βij (xi , xj ) = θij (xi , xj ) , where the dual variables are βij (xi , xj ) for all ij, ji ∈ E and values of xi and xj . The dual has an intuitive interpretation in terms of re-parameterizations. Consider the star shaped graph Gi consisting of node i and all its neighbors N (i). Assume the potential on edge ki (for k ∈ N (i)) is βki (xk , xi ). The value of the MAP assignment for this model is max max βki (xk , xi ). This is exactly the term in the objective of DMAPLPR. Thus the dual xi k∈N (i) xk corresponds to individually decoding star graphs around all nodes i ∈ V where the potentials on the graph edges should sum to the original potential. It is easy to see that this will always result in an upper bound on the MAP value. The somewhat surprising result of the duality is that there exists a β assignment such that star decoding yields the optimal value of MAPLPR. 4 Block Coordinate Descent in the Dual To obtain a convergent algorithm we use a simple block coordinate descent strategy. At every iteration, fix all variables except a subset, and optimize over this subset. It turns out that this can be done in closed form for the cases we consider. We begin by deriving the EMPLP algorithm. Consider fixing all the β variables except those corresponding to some edge ij ∈ E (i.e., βij and βji ), and minimizing DMAPLPR over the non-fixed variables. Only two terms in the DMAPLPR objective depend on βij and βji . We can write those as f (βij , βji ) = max λ−j (xi ) + max βji (xj , xi ) + max λ−i (xj ) + max βij (xi , xj ) i j xi where we defined λ−j (xi ) = i xj k∈N (i)\j xi xi (4) λki (xi ) and λki (xi ) = maxxk βki (xk , xi ) as in App. A. Note that the function f (βij , βji ) depends on the other β values only through λ−i (xj ) and λ−j (xi ). j i This implies that the optimization can be done solely in terms of λij (xj ) and there is no need to store the β values explicitly. The optimal βij , βji are obtained by minimizing f (βij , βji ) subject to the re-parameterization constraint βji (xj , xi ) + βij (xi , xj ) = θij (xi , xj ). The following proposition characterizes the minimum of f (βij , βji ). In fact, as mentioned above, we do not need to characterize the optimal βij (xi , xj ) itself, but only the new λ values. Proposition 2 Maximizing the function f (βij , βji ) yields the following λji (xi ) (and the equivalent expression for λij (xj )) 1 −j 1 λji (xi ) = − λi (xi ) + max λ−i (xj ) + θij (xi , xj ) j 2 2 xj The proposition is proved in App. B. The λ updates above result in the EMPLP algorithm, described in Fig. 1. Note that since the β optimization affects both λji (xi ) and λij (xj ), both these messages need to be updated simultaneously. We proceed to derive the NMPLP algorithm. For a given node i ∈ V , we consider all its neighbors j ∈ N (i), and wish to optimize over the variables βji (xj , xi ) for ji, ij ∈ E (i.e., all the edges in a star centered on i), while the other variables are fixed. One way of doing so is to use the EMPLP algorithm for the edges in the star, and iterate it until convergence. We now show that the result of 3 Inputs: A graph G = (V, E), potential functions θij (xi , xj ) for each edge ij ∈ E. Initialization: Initialize messages to any value. Algorithm: • Iterate until a stopping criterion is satisfied: – Max-product: Iterate over messages and update (cji shifts the max to zero) h i mji (xi )← max m−i (xj ) + θij (xi , xj ) − cji j xj – EMPLP: For each ij ∈ E, update λji (xi ) and λij (xj ) simultaneously (the update for λij (xj ) is the same with i and j exchanged) h i 1 1 λji (xi )← − λ−j (xi ) + max λ−i (xj ) + θij (xi , xj ) j i 2 2 xj – NMPLP: Iterate over nodes i ∈ V and update all γij (xj ) where j ∈ N (i) 2 3 X 2 γij (xj )← max 4θij (xi , xj ) − γji (xi ) + γki (xi )5 xi |N (i)| + 1 k∈N(i) • Calculate node “beliefs”: Set biP i ) to be the sum of incoming messages into node i ∈ V (x (e.g., for NMPLP set bi (xi ) = k∈N(i) γki (xi )). Output: Return assignment x defined as xi = arg maxxi b(ˆi ). x ˆ Figure 1: The max-product, EMPLP and NMPLP algorithms. Max-product, EMPLP and NMPLP use mesP sages mij , λij and γij respectively. We use the notation m−i (xj ) = k∈N(j)\i mkj (xj ). j this optimization can be found in closed form. The assumption about β being fixed outside the star implies that λ−i (xj ) is fixed. Define: γji (xi ) = maxxj θij (xi , xj ) + λ−i (xj ) . Simple algebra j j yields the following relation between λ−j (xi ) and γki (xi ) for k ∈ N (i) i 2 λ−j (xi ) = −γji (xi ) + γki (xi ) (5) i |N (i)| + 1 k∈N (i) Plugging this into the definition of γji (xi ) we obtain the NMPLP update in Fig. 1. The messages for both algorithms can be initialized to any value since it can be shown that after one iteration they will correspond to valid β values. 5 Convergence Properties The MPLP algorithm decreases the dual objective (i.e., an upper bound on the MAP value) at every iteration, and thus its dual objective values form a convergent sequence. Using arguments similar to [5] it can be shown that MPLP has a limit point that is a fixed point of its updates. This in itself does not guarantee convergence to the dual optimum since coordinate descent algorithms may get stuck at a point that is not a global optimum. There are ways of overcoming this difficulty, for example by smoothing the objective [4] or using techniques as in [2] (see p. 636). We leave such extensions for further work. In this section we provide several results about the properties of the MPLP fixed points and their relation to the corresponding LP. First, we claim that if all beliefs have unique maxima then the exact MAP assignment is obtained. Proposition 3 If the fixed point of MPLP has bi (xi ) such that for all i the function bi (xi ) has a unique maximizer x∗ , then x∗ is the solution to the MAP problem and the LP relaxation is exact. i Since the dual objective is always greater than or equal to the MAP value, it suffices to show that there exists a dual feasible point whose objective value is f (x∗ ). Denote by β ∗ , λ∗ the value of the corresponding dual parameters at the fixed point of MPLP. Then the dual objective satisfies λ∗ (xi ) = ki max i xi k∈N (i) ∗ max βki (xk , x∗ ) = i i k∈N (i) xk ∗ βki (x∗ , x∗ ) = f (x∗ ) k i i 4 k∈N (i) To see why the second equality holds, note that bi (x∗ ) = maxxi ,xj λ−j (xi ) + βji (xj , xi ) and i i bj (x∗ ) = maxxi ,xj λ−i (xj ) + βij (xi , xj ). By the equalization property in Eq. 9 the arguments of j j the two max operations are equal. From the unique maximum assumption it follows that x∗ , x∗ are i j the unique maximizers of the above. It follows that βji , βij are also maximized by x∗ , x∗ . i j In the general case, the MPLP fixed point may not correspond to a primal optimum because of the local optima problem with coordinate descent. However, when the variables are binary, fixed points do correspond to primal solutions, as the following proposition states. Proposition 4 When xi are binary, the MPLP fixed point can be used to obtain the primal optimum. The claim can be shown by constructing a primal optimal solution µ∗ . For tied bi , set µ∗ (xi ) to 0.5 i and for untied bi , set µ∗ (x∗ ) to 1. If bi , bj are not tied we set µ∗ (x∗ , x∗ ) = 1. If bi is not tied but bj i i ij i j is, we set µ∗ (x∗ , xj ) = 0.5. If bi , bj are tied then βji , βij can be shown to be maximized at either ij i x∗ , x∗ = (0, 0), (1, 1) or x∗ , x∗ = (0, 1), (1, 0). We then set µ∗ to be 0.5 at one of these assignment i j i j ij ∗ pairs. The resulting µ∗ is clearly primal feasible. Setting δi = b∗ we obtain that the dual variables i (δ ∗ , λ∗ , β ∗ ) and primal µ∗ satisfy complementary slackness for the LP in Eq. 7 and therefore µ∗ is primal optimal. The binary optimality result implies partial decodability, since [6] shows that the LP is partially decodable for binary variables. 6 Beyond pairwise potentials: Generalized MPLP In the previous sections we considered maximizing functions which factor according to the edges of the graph. A more general setting considers clusters c1 , . . . , ck ⊂ {1, . . . , n} (the set of clusters is denoted by C), and a function f (x; θ) = c θc (xc ) defined via potentials over clusters θc (xc ). The MAP problem in this case also has an LP relaxation (see e.g. [11]). To define the LP we introduce the following definitions: S = {c ∩ c : c, c ∈ C, c ∩ c = ∅} is the set of intersection between clusters ˆ ˆ ˆ and S(c) = {s ∈ S : s ⊆ c} is the set of overlap sets for cluster c.We now consider marginals over the variables in c ∈ C and s ∈ S and require that cluster marginals agree on their overlap. Denote this set by ML (C). The LP relaxation is then to maximize µ · θ subject to µ ∈ ML (C). As in Sec. 4, we can derive message passing updates that result in monotone decrease of the dual LP of the above relaxation. The derivation is similar and we omit the details. The key observation is that one needs to introduce |S(c)| copies of each marginal µc (xc ) (instead of the two copies in the pairwise case). Next, as in the EMPLP derivation we assume all β are fixed except those corresponding to some cluster c. The resulting messages are λc→s (xs ) from a cluster c to all of its intersection sets s ∈ S(c). The update on these messages turns out to be:   1 1 λ−c (xs ) + max  λ−c (xs ) + θc (xc ) λc→s (xs ) = − 1 − ˆ s s ˆ |S(c)| |S(c)| xc\s s∈S(c)\s ˆ where for a given c ∈ C all λc→s should be updated simultaneously for s ∈ S(c), and λ−c (xs ) is s defined as the sum of messages into s that are not from c. We refer to this algorithm as Generalized EMPLP (GEMPLP). It is possible to derive an algorithm similar to NMPLP that updates several clusters simultaneously, but its structure is more involved and we do not address it here. 7 Related Work Weiss et al. [11] recently studied the fixed points of a class of max-product like algorithms. Their analysis focused on properties of fixed points rather than convergence guarantees. Specifically, they showed that if the counting numbers used in a generalized max-product algorithm satisfy certain properties, then its fixed points will be the exact MAP if the beliefs have unique maxima, and for binary variables the solution can be partially decodable. Both these properties are obtained for the MPLP fixed points, and in fact we can show that MPLP satisfies the conditions in [11], so that we obtain these properties as corollaries of [11]. We stress however, that [11] does not address convergence of algorithms, but rather properties of their fixed points, if they converge. MPLP is similar in some aspects to Kolmogorov’s TRW-S algorithm [5]. TRW-S is also a monotone coordinate descent method in a dual of the LP relaxation and its fixed points also have similar 5 guarantees to those of MPLP [6]. Furthermore, convergence to a local optimum may occur, as it does for MPLP. One advantage of MPLP lies in the simplicity of its updates and the fact that it is parameter free. The other is its simple generalization to potentials over clusters of nodes (Sec. 6). Recently, several new dual LP algorithms have been introduced, which are more closely related to our formalism. Werner [12] presented a class of algorithms which also improve the dual LP at every iteration. The simplest of those is the max-sum-diffusion algorithm, which is similar to our EMPLP algorithm, although the updates are different from ours. Independently, Johnson et al. [4] presented a class of algorithms that improve duals of the MAP-LP using coordinate descent. They decompose the model into tractable parts by replicating variables and enforce replication constraints within the Lagrangian dual. Our basic formulation in Eq. 3 could be derived from their perspective. However, the updates in the algorithm and the analysis differ. Johnson et al. also presented a method for overcoming the local optimum problem, by smoothing the objective so that it is strictly convex. Such an approach could also be used within our algorithms. Vontobel and Koetter [9] recently introduced a coordinate descent algorithm for decoding LDPC codes. Their method is specifically tailored for this case, and uses updates that are similar to our edge based updates. Finally, the concept of dual coordinate descent may be used in approximating marginals as well. In [3] we use such an approach to optimize a variational bound on the partition function. The derivation uses some of the ideas used in the MPLP dual, but importantly does not find the minimum for each coordinate. Instead, a gradient like step is taken at every iteration to decrease the dual objective. 8 Experiments We compared NMPLP to three other message passing algorithms:2 Tree-Reweighted max-product (TRMP) [10],3 standard max-product (MP), and GEMPLP. For MP and TRMP we used the standard approach of damping messages using a factor of α = 0.5. We ran all algorithms for a maximum of 2000 iterations, and used the hit-time measure to compare their speed of convergence. This measure is defined as follows: At every iteration the beliefs can be used to obtain an assignment x with value f (x). We define the hit-time as the first iteration at which the maximum value of f (x) is achieved.4 We first experimented with a 10 × 10 grid graph, with 5 values per state. The function f (x) was 5 a Potts model: f (x) = The values for θij and θi (xi ) ij∈E θij I(xi = xj ) + i∈V θi (xi ). were randomly drawn from [−cI , cI ] and [−cF , cF ] respectively, and we used values of cI and cF in the range range [0.1, 2.35] (with intervals of 0.25), resulting in 100 different models. The clusters for GEMPLP were the faces of the graph [14]. To see if NMPLP converges to the LP solution we also used an LP solver to solve the LP relaxation. We found that the the normalized difference between NMPLP and LP objective was at most 10−3 (median 10−7 ), suggesting that NMPLP typically converged to the LP solution. Fig. 2 (top row) shows the results for the three algorithms. It can be seen that while all non-cluster based algorithms obtain similar f (x) values, NMPLP has better hit-time (in the median) than TRMP and MP, and MP does not converge in many cases (see caption). GEMPLP converges more slowly than NMPLP, but obtains much better f (x) values. In fact, in 99% of the cases the normalized difference between the GEMPLP objective and the f (x) value was less than 10−5 , suggesting that the exact MAP solution was found. We next applied the algorithms to the real world problems of protein design. In [13], Yanover et al. show how these problems can be formalized in terms of finding a MAP in an appropriately constructed graphical model.6 We used all algorithms except GNMPLP (since there is no natural choice for clusters in this case) to approximate the MAP solution on the 97 models used in [13]. In these models the number of states per variable is 2 − 158, and there are up to 180 variables per model. Fig. 2 (bottom) shows results for all the design problems. In this case only 11% of the MP runs converged, and NMPLP was better than TRMP in terms of hit-time and comparable in f (x) value. The performance of MP was good on the runs where it converged. 2 As expected, NMPLP was faster than EMPLP so only NMPLP results are given. The edge weights for TRMP corresponded to a uniform distribution over all spanning trees. 4 This is clearly a post-hoc measure since it can only be obtained after the algorithm has exceeded its maximum number of iterations. However, it is a reasonable algorithm-independent measure of convergence. 5 The potential θi (xi ) may be folded into the pairwise potential to yield a model as in Eq. 1. 6 Data available from http://jmlr.csail.mit.edu/papers/volume7/yanover06a/Rosetta Design Dataset.tgz 3 6 (a) (b) (c) 100 (d) 0.6 2000 0.04 0.4 0.02 −50 0 −0.02 −0.04 ∆(Value) 0 1000 ∆(Hit Time) ∆(Value) ∆(Hit Time) 50 0 MP TRMP GMPLP 0 −0.2 −1000 −0.4 −0.06 −100 0.2 MP TRMP GMPLP MP TRMP MP TRMP Figure 2: Evaluation of message passing algorithms on Potts models and protein design problems. (a,c): Convergence time results for the Potts models (a) and protein design problems (c). The box-plots (horiz. red line indicates median) show the difference between the hit-time for the other algorithms and NMPLP. (b,d): Value of integer solutions for the Potts models (b) and protein design problems (d). The box-plots show the normalized difference between the value of f (x) for NMPLP and the other algorithms. All figures are such that better MPLP performance yields positive Y axis values. Max-product converged on 58% of the cases for the Potts models, and on 11% of the protein problems. Only convergent max-product runs are shown. 9 Conclusion We have presented a convergent algorithm for MAP approximation that is based on block coordinate descent of the MAP-LP relaxation dual. The algorithm can also be extended to cluster based functions, which result empirically in improved MAP estimates. This is in line with the observations in [14] that generalized belief propagation algorithms can result in significant performance improvements. However generalized max-product algorithms [14] are not guaranteed to converge whereas GMPLP is. Furthermore, the GMPLP algorithm does not require a region graph and only involves intersection between pairs of clusters. In conclusion, MPLP has the advantage of resolving the convergence problems of max-product while retaining its simplicity, and offering the theoretical guarantees of LP relaxations. We thus believe it should be useful in a wide array of applications. A Derivation of the dual Before deriving the dual, we first express the constraint set ML (G) in a slightly different way. The definition of ML (G) in Sec. 2 uses a single distribution µij (xi , xj ) for every ij ∈ E. In what follows, we use two copies of this pairwise distribution for every edge, which we denote µij (xi , xj ) ¯ and µji (xj , xi ), and we add the constraint that these two copies both equal the original µij (xi , xj ). ¯ For this extended set of pairwise marginals, we consider the following set of constraints which is clearly equivalent to ML (G). On the rightmost column we give the dual variables that will correspond to each constraint (we omit non-negativity constraints). µij (xi , xj ) = µij (xi , xj ) ¯ µji (xj , xi ) = µij (xi , xj ) ¯ x xi µij (ˆi , xj ) = µj (xj ) ˆ ¯ µji (ˆj , xi ) = µi (xi ) ¯ x xj ˆ xi µi (xi ) = 1 ∀ij ∈ E, xi , xj ∀ij ∈ E, xi , xj ∀ij ∈ E, xj ∀ji ∈ E, xi ∀i ∈ V βij (xi , xj ) βji (xj , xi ) λij (xj ) λji (xi ) δi (6) ¯ We denote the set of (µ, µ) satisfying these constraints by ML (G). We can now state an LP that ¯ is equivalent to MAPLPR, only with an extended set of variables and constraints. The equivalent ¯ problem is to maximize µ · θ subject to (µ, µ) ∈ ML (G) (note that the objective uses the original ¯ µ copy). LP duality transformation of the extended problem yields the following LP min i δi s.t. λij (xj ) − βij (xi , xj ) ≥ 0 βij (xi , xj ) + βji (xj , xi ) = θij (xi , xj ) − k∈N (i) λki (xi ) + δi ≥ 0 ∀ij, ji ∈ E, xi , xj ∀ij ∈ E, xi , xj ∀i ∈ V, xi (7) We next simplify the above LP by eliminating some of its constraints and variables. Since each variable δi appears in only one constraint, and the objective minimizes δi it follows that δi = maxxi k∈N (i) λki (xi ) and the constraints with δi can be discarded. Similarly, since λij (xj ) appears in a single constraint, we have that for all ij ∈ E, ji ∈ E, xi , xj λij (xj ) = maxxi βij (xi , xj ) and the constraints with λij (xj ), λji (xi ) can also be discarded. Using the eliminated δi and λji (xi ) 7 variables, we obtain that the LP in Eq. 7 is equivalent to that in Eq. 3. Note that the objective in Eq. 3 is convex since it is a sum of point-wise maxima of convex functions. B Proof of Proposition 2 We wish to minimize f in Eq. 4 subject to the constraint that βij + βji = θij . Rewrite f as f (βij , βji ) = max λ−j (xi ) + βji (xj , xi ) + max λ−i (xj ) + βij (xi , xj ) j i xi ,xj xi ,xj (8) The sum of the two arguments in the max is λ−j (xi ) + λ−i (xj ) + θij (xi , xj ) i j (because of the constraints on β). Thus the minimum must be greater than −j −i 1 2 maxxi ,xj λi (xi ) + λj (xj ) + θij (xi , xj ) . One assignment to β that achieves this minimum is obtained by requiring an equalization condition:7 λ−i (xj ) + βij (xi , xj ) = λ−j (xi ) + βji (xj , xi ) = j i 1 θij (xi , xj ) + λ−j (xi ) + λ−i (xj ) i j 2 (9) which implies βij (xi , xj ) = 1 θij (xi , xj ) + λ−j (xi ) − λ−i (xj ) and a similar expression for βji . i j 2 The resulting λij (xj ) = maxxi βij (xi , xj ) are then the ones in Prop. 2. Acknowledgments The authors acknowledge support from the Defense Advanced Research Projects Agency (Transfer Learning program). Amir Globerson was also supported by the Rothschild Yad-Hanadiv fellowship. References [1] M. Bayati, D. Shah, and M. Sharma. Maximum weight matching via max-product belief propagation. IEEE Trans. on Information Theory (to appear), 2007. [2] D. P. Bertsekas, editor. Nonlinear Programming. Athena Scientific, Belmont, MA, 1995. [3] A. Globerson and T. Jaakkola. Convergent propagation algorithms via oriented trees. In UAI. 2007. [4] J.K. Johnson, D.M. Malioutov, and A.S. Willsky. Lagrangian relaxation for map estimation in graphical models. In Allerton Conf. Communication, Control and Computing, 2007. [5] V. Kolmogorov. Convergent tree-reweighted message passing for energy minimization. IEEE Transactions on Pattern Analysis and Machine Intelligence, 28(10):1568–1583, 2006. [6] V. Kolmogorov and M. Wainwright. On the optimality of tree-reweighted max-product message passing. In 21st Conference on Uncertainty in Artificial Intelligence (UAI). 2005. [7] J. Pearl. Probabilistic Reasoning in Intelligent Systems. Morgan Kaufmann, 1988. [8] B. Taskar, S. Lacoste-Julien, and M. Jordan. Structured prediction, dual extragradient and bregman projections. Journal of Machine Learning Research, pages 1627–1653, 2006. [9] P.O. Vontobel and R. Koetter. Towards low-complexity linear-programming decoding. In Proc. 4th Int. Symposium on Turbo Codes and Related Topics, 2006. [10] M. J. Wainwright, T. Jaakkola, and A. S. Willsky. Map estimation via agreement on trees: messagepassing and linear programming. IEEE Trans. on Information Theory, 51(11):1120–1146, 2005. [11] Y. Weiss, C. Yanover, and T. Meltzer. Map estimation, linear programming and belief propagation with convex free energies. In UAI. 2007. [12] T. Werner. A linear programming approach to max-sum, a review. IEEE Trans. on PAMI, 2007. [13] C. Yanover, T. Meltzer, and Y. Weiss. Linear programming relaxations and belief propagation – an empirical study. Jourmal of Machine Learning Research, 7:1887–1907, 2006. [14] J.S. Yedidia, W.T. W.T. Freeman, and Y. Weiss. Constructing free-energy approximations and generalized belief propagation algorithms. IEEE Trans. on Information Theory, 51(7):2282–2312, 2005. 7 Other solutions are possible but may not yield some of the properties of MPLP. 8

5 0.46069762 141 nips-2007-New Outer Bounds on the Marginal Polytope

Author: David Sontag, Tommi S. Jaakkola

Abstract: We give a new class of outer bounds on the marginal polytope, and propose a cutting-plane algorithm for efficiently optimizing over these constraints. When combined with a concave upper bound on the entropy, this gives a new variational inference algorithm for probabilistic inference in discrete Markov Random Fields (MRFs). Valid constraints on the marginal polytope are derived through a series of projections onto the cut polytope. As a result, we obtain tighter upper bounds on the log-partition function. We also show empirically that the approximations of the marginals are significantly more accurate when using the tighter outer bounds. Finally, we demonstrate the advantage of the new constraints for finding the MAP assignment in protein structure prediction. 1

6 0.45488662 121 nips-2007-Local Algorithms for Approximate Inference in Minor-Excluded Graphs

7 0.40213948 48 nips-2007-Collective Inference on Markov Models for Modeling Bird Migration

8 0.3895483 187 nips-2007-Structured Learning with Approximate Inference

9 0.34020096 75 nips-2007-Efficient Bayesian Inference for Dynamically Changing Graphs

10 0.32681251 20 nips-2007-Adaptive Embedded Subgraph Algorithms using Walk-Sum Analysis

11 0.31614566 193 nips-2007-The Distribution Family of Similarity Distances

12 0.29881442 101 nips-2007-How SVMs can estimate quantiles and the median

13 0.29734078 63 nips-2007-Convex Relaxations of Latent Variable Training

14 0.28045183 185 nips-2007-Stable Dual Dynamic Programming

15 0.27881292 78 nips-2007-Efficient Principled Learning of Thin Junction Trees

16 0.26941481 110 nips-2007-Learning Bounds for Domain Adaptation

17 0.26878676 150 nips-2007-Optimal models of sound localization by barn owls

18 0.26242191 13 nips-2007-A Unified Near-Optimal Estimator For Dimension Reduction in $l \alpha$ ($0<\alpha\leq 2$) Using Stable Random Projections

19 0.25871834 112 nips-2007-Learning Monotonic Transformations for Classification

20 0.25759238 204 nips-2007-Theoretical Analysis of Heuristic Search Methods for Online POMDPs


similar papers computed by lda model

lda for this paper:

topicId topicWeight

[(5, 0.035), (13, 0.043), (18, 0.021), (19, 0.011), (21, 0.053), (31, 0.02), (34, 0.028), (35, 0.038), (42, 0.02), (47, 0.075), (49, 0.011), (78, 0.195), (83, 0.232), (85, 0.063), (87, 0.011), (90, 0.054)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.8940587 120 nips-2007-Linear programming analysis of loopy belief propagation for weighted matching

Author: Sujay Sanghavi, Dmitry Malioutov, Alan S. Willsky

Abstract: Loopy belief propagation has been employed in a wide variety of applications with great empirical success, but it comes with few theoretical guarantees. In this paper we investigate the use of the max-product form of belief propagation for weighted matching problems on general graphs. We show that max-product converges to the correct answer if the linear programming (LP) relaxation of the weighted matching problem is tight and does not converge if the LP relaxation is loose. This provides an exact characterization of max-product performance and reveals connections to the widely used optimization technique of LP relaxation. In addition, we demonstrate that max-product is effective in solving practical weighted matching problems in a distributed fashion by applying it to the problem of self-organization in sensor networks. 1

2 0.88712031 92 nips-2007-Fixing Max-Product: Convergent Message Passing Algorithms for MAP LP-Relaxations

Author: Amir Globerson, Tommi S. Jaakkola

Abstract: We present a novel message passing algorithm for approximating the MAP problem in graphical models. The algorithm is similar in structure to max-product but unlike max-product it always converges, and can be proven to find the exact MAP solution in various settings. The algorithm is derived via block coordinate descent in a dual of the LP relaxation of MAP, but does not require any tunable parameters such as step size or tree weights. We also describe a generalization of the method to cluster based potentials. The new method is tested on synthetic and real-world problems, and compares favorably with previous approaches. Graphical models are an effective approach for modeling complex objects via local interactions. In such models, a distribution over a set of variables is assumed to factor according to cliques of a graph with potentials assigned to each clique. Finding the assignment with highest probability in these models is key to using them in practice, and is often referred to as the MAP (maximum aposteriori) assignment problem. In the general case the problem is NP hard, with complexity exponential in the tree-width of the underlying graph. Linear programming (LP) relaxations have proven very useful in approximating the MAP problem, and often yield satisfactory empirical results. These approaches relax the constraint that the solution is integral, and generally yield non-integral solutions. However, when the LP solution is integral, it is guaranteed to be the exact MAP. For some classes of problems the LP relaxation is provably correct. These include the minimum cut problem and maximum weight matching in bi-partite graphs [8]. Although LP relaxations can be solved using standard LP solvers, this may be computationally intensive for large problems [13]. The key problem with generic LP solvers is that they do not use the graph structure explicitly and thus may be sub-optimal in terms of computational efficiency. The max-product method [7] is a message passing algorithm that is often used to approximate the MAP problem. In contrast to generic LP solvers, it makes direct use of the graph structure in constructing and passing messages, and is also very simple to implement. The relation between max-product and the LP relaxation has remained largely elusive, although there are some notable exceptions: For tree-structured graphs, max-product and LP both yield the exact MAP. A recent result [1] showed that for maximum weight matching on bi-partite graphs max-product and LP also yield the exact MAP [1]. Finally, Tree-Reweighted max-product (TRMP) algorithms [5, 10] were shown to converge to the LP solution for binary xi variables, as shown in [6]. In this work, we propose the Max Product Linear Programming algorithm (MPLP) - a very simple variation on max-product that is guaranteed to converge, and has several advantageous properties. MPLP is derived from the dual of the LP relaxation, and is equivalent to block coordinate descent in the dual. Although this results in monotone improvement of the dual objective, global convergence is not always guaranteed since coordinate descent may get stuck in suboptimal points. This can be remedied using various approaches, but in practice we have found MPLP to converge to the LP 1 solution in a majority of the cases we studied. To derive MPLP we use a special form of the dual LP, which involves the introduction of redundant primal variables and constraints. We show how the dual variables corresponding to these constraints turn out to be the messages in the algorithm. We evaluate the method on Potts models and protein design problems, and show that it compares favorably with max-product (which often does not converge for these problems) and TRMP. 1 The Max-Product and MPLP Algorithms The max-product algorithm [7] is one of the most often used methods for solving MAP problems. Although it is neither guaranteed to converge to the correct solution, or in fact converge at all, it provides satisfactory results in some cases. Here we present two algorithms: EMPLP (edge based MPLP) and NMPLP (node based MPLP), which are structurally very similar to max-product, but have several key advantages: • After each iteration, the messages yield an upper bound on the MAP value, and the sequence of bounds is monotone decreasing and convergent. The messages also have a limit point that is a fixed point of the update rule. • No additional parameters (e.g., tree weights as in [6]) are required. • If the fixed point beliefs have a unique maximizer then they correspond to the exact MAP. • For binary variables, MPLP can be used to obtain the solution to an LP relaxation of the MAP problem. Thus, when this LP relaxation is exact and variables are binary, MPLP will find the MAP solution. Moreover, for any variable whose beliefs are not tied, the MAP assignment can be found (i.e., the solution is partially decodable). Pseudo code for the algorithms (and for max-product) is given in Fig. 1. As we show in the next sections, MPLP is essentially a block coordinate descent algorithm in the dual of a MAP LP relaxation. Every update of the MPLP messages corresponds to exact minimization of a set of dual variables. For EMPLP minimization is over the set of variables corresponding to an edge, and for NMPLP it is over the set of variables corresponding to all the edges a given node appears in (i.e., a star). The properties of MPLP result from its relation to the LP dual. In what follows we describe the derivation of the MPLP algorithms and prove their properties. 2 The MAP Problem and its LP Relaxation We consider functions over n variables x = {x1 , . . . , xn } defined as follows. Given a graph G = (V, E) with n vertices, and potentials θij (xi , xj ) for all edges ij ∈ E, define the function1 f (x; θ) = θij (xi , xj ) . (1) ij∈E The MAP problem is defined as finding an assignment xM that maximizes the function f (x; θ). Below we describe the standard LP relaxation for this problem. Denote by {µij (xi , xj )}ij∈E distributions over variables corresponding to edges ij ∈ E and {µi (xi )}i∈V distributions corresponding to nodes i ∈ V . We will use µ to denote a given set of distributions over all edges and nodes. The set ML (G) is defined as the set of µ where pairwise and singleton distributions are consistent x ˆ xi µij (ˆi , xj ) = µj (xj ) , ˆ xj µij (xi , xj ) = µi (xi ) ∀ij ∈ E, xi , xj ˆ ML (G) = µ ≥ 0 ∀i ∈ V xi µi (xi ) = 1 Now consider the following linear program: MAPLPR : µL∗ = arg max µ∈ML (G) µ·θ. (2) where µ·θ is shorthand for µ·θ = ij∈E xi ,xj θij (xi , xj )µij (xi , xj ). It is easy to show (see e.g., [10]) that the optimum of MAPLPR yields an upper bound on the MAP value, i.e. µL∗ ·θ ≥ f (xM ). Furthermore, when the optimal µi (xi ) have only integral values, the assignment that maximizes µi (xi ) yields the correct MAP assignment. In what follows we show how the MPLP algorithms can be derived from the dual of MAPLPR. 1 P We note that some authors also add a term i∈V θi (xi ) to f (x; θ). However, these terms can be included in the pairwise functions θij (xi , xj ), so we ignore them for simplicity. 2 3 The LP Relaxation Dual Since MAPLPR is an LP, it has an equivalent convex dual. In App. A we derive a special dual of MAPLPR using a different representation of ML (G) with redundant variables. The advantage of this dual is that it allows the derivation of simple message passing algorithms. The dual is described in the following proposition. Proposition 1 The following optimization problem is a convex dual of MAPLPR DMAPLPR : min max xi i s.t. max βki (xk , xi ) (3) k∈N (i) xk βji (xj , xi ) + βij (xi , xj ) = θij (xi , xj ) , where the dual variables are βij (xi , xj ) for all ij, ji ∈ E and values of xi and xj . The dual has an intuitive interpretation in terms of re-parameterizations. Consider the star shaped graph Gi consisting of node i and all its neighbors N (i). Assume the potential on edge ki (for k ∈ N (i)) is βki (xk , xi ). The value of the MAP assignment for this model is max max βki (xk , xi ). This is exactly the term in the objective of DMAPLPR. Thus the dual xi k∈N (i) xk corresponds to individually decoding star graphs around all nodes i ∈ V where the potentials on the graph edges should sum to the original potential. It is easy to see that this will always result in an upper bound on the MAP value. The somewhat surprising result of the duality is that there exists a β assignment such that star decoding yields the optimal value of MAPLPR. 4 Block Coordinate Descent in the Dual To obtain a convergent algorithm we use a simple block coordinate descent strategy. At every iteration, fix all variables except a subset, and optimize over this subset. It turns out that this can be done in closed form for the cases we consider. We begin by deriving the EMPLP algorithm. Consider fixing all the β variables except those corresponding to some edge ij ∈ E (i.e., βij and βji ), and minimizing DMAPLPR over the non-fixed variables. Only two terms in the DMAPLPR objective depend on βij and βji . We can write those as f (βij , βji ) = max λ−j (xi ) + max βji (xj , xi ) + max λ−i (xj ) + max βij (xi , xj ) i j xi where we defined λ−j (xi ) = i xj k∈N (i)\j xi xi (4) λki (xi ) and λki (xi ) = maxxk βki (xk , xi ) as in App. A. Note that the function f (βij , βji ) depends on the other β values only through λ−i (xj ) and λ−j (xi ). j i This implies that the optimization can be done solely in terms of λij (xj ) and there is no need to store the β values explicitly. The optimal βij , βji are obtained by minimizing f (βij , βji ) subject to the re-parameterization constraint βji (xj , xi ) + βij (xi , xj ) = θij (xi , xj ). The following proposition characterizes the minimum of f (βij , βji ). In fact, as mentioned above, we do not need to characterize the optimal βij (xi , xj ) itself, but only the new λ values. Proposition 2 Maximizing the function f (βij , βji ) yields the following λji (xi ) (and the equivalent expression for λij (xj )) 1 −j 1 λji (xi ) = − λi (xi ) + max λ−i (xj ) + θij (xi , xj ) j 2 2 xj The proposition is proved in App. B. The λ updates above result in the EMPLP algorithm, described in Fig. 1. Note that since the β optimization affects both λji (xi ) and λij (xj ), both these messages need to be updated simultaneously. We proceed to derive the NMPLP algorithm. For a given node i ∈ V , we consider all its neighbors j ∈ N (i), and wish to optimize over the variables βji (xj , xi ) for ji, ij ∈ E (i.e., all the edges in a star centered on i), while the other variables are fixed. One way of doing so is to use the EMPLP algorithm for the edges in the star, and iterate it until convergence. We now show that the result of 3 Inputs: A graph G = (V, E), potential functions θij (xi , xj ) for each edge ij ∈ E. Initialization: Initialize messages to any value. Algorithm: • Iterate until a stopping criterion is satisfied: – Max-product: Iterate over messages and update (cji shifts the max to zero) h i mji (xi )← max m−i (xj ) + θij (xi , xj ) − cji j xj – EMPLP: For each ij ∈ E, update λji (xi ) and λij (xj ) simultaneously (the update for λij (xj ) is the same with i and j exchanged) h i 1 1 λji (xi )← − λ−j (xi ) + max λ−i (xj ) + θij (xi , xj ) j i 2 2 xj – NMPLP: Iterate over nodes i ∈ V and update all γij (xj ) where j ∈ N (i) 2 3 X 2 γij (xj )← max 4θij (xi , xj ) − γji (xi ) + γki (xi )5 xi |N (i)| + 1 k∈N(i) • Calculate node “beliefs”: Set biP i ) to be the sum of incoming messages into node i ∈ V (x (e.g., for NMPLP set bi (xi ) = k∈N(i) γki (xi )). Output: Return assignment x defined as xi = arg maxxi b(ˆi ). x ˆ Figure 1: The max-product, EMPLP and NMPLP algorithms. Max-product, EMPLP and NMPLP use mesP sages mij , λij and γij respectively. We use the notation m−i (xj ) = k∈N(j)\i mkj (xj ). j this optimization can be found in closed form. The assumption about β being fixed outside the star implies that λ−i (xj ) is fixed. Define: γji (xi ) = maxxj θij (xi , xj ) + λ−i (xj ) . Simple algebra j j yields the following relation between λ−j (xi ) and γki (xi ) for k ∈ N (i) i 2 λ−j (xi ) = −γji (xi ) + γki (xi ) (5) i |N (i)| + 1 k∈N (i) Plugging this into the definition of γji (xi ) we obtain the NMPLP update in Fig. 1. The messages for both algorithms can be initialized to any value since it can be shown that after one iteration they will correspond to valid β values. 5 Convergence Properties The MPLP algorithm decreases the dual objective (i.e., an upper bound on the MAP value) at every iteration, and thus its dual objective values form a convergent sequence. Using arguments similar to [5] it can be shown that MPLP has a limit point that is a fixed point of its updates. This in itself does not guarantee convergence to the dual optimum since coordinate descent algorithms may get stuck at a point that is not a global optimum. There are ways of overcoming this difficulty, for example by smoothing the objective [4] or using techniques as in [2] (see p. 636). We leave such extensions for further work. In this section we provide several results about the properties of the MPLP fixed points and their relation to the corresponding LP. First, we claim that if all beliefs have unique maxima then the exact MAP assignment is obtained. Proposition 3 If the fixed point of MPLP has bi (xi ) such that for all i the function bi (xi ) has a unique maximizer x∗ , then x∗ is the solution to the MAP problem and the LP relaxation is exact. i Since the dual objective is always greater than or equal to the MAP value, it suffices to show that there exists a dual feasible point whose objective value is f (x∗ ). Denote by β ∗ , λ∗ the value of the corresponding dual parameters at the fixed point of MPLP. Then the dual objective satisfies λ∗ (xi ) = ki max i xi k∈N (i) ∗ max βki (xk , x∗ ) = i i k∈N (i) xk ∗ βki (x∗ , x∗ ) = f (x∗ ) k i i 4 k∈N (i) To see why the second equality holds, note that bi (x∗ ) = maxxi ,xj λ−j (xi ) + βji (xj , xi ) and i i bj (x∗ ) = maxxi ,xj λ−i (xj ) + βij (xi , xj ). By the equalization property in Eq. 9 the arguments of j j the two max operations are equal. From the unique maximum assumption it follows that x∗ , x∗ are i j the unique maximizers of the above. It follows that βji , βij are also maximized by x∗ , x∗ . i j In the general case, the MPLP fixed point may not correspond to a primal optimum because of the local optima problem with coordinate descent. However, when the variables are binary, fixed points do correspond to primal solutions, as the following proposition states. Proposition 4 When xi are binary, the MPLP fixed point can be used to obtain the primal optimum. The claim can be shown by constructing a primal optimal solution µ∗ . For tied bi , set µ∗ (xi ) to 0.5 i and for untied bi , set µ∗ (x∗ ) to 1. If bi , bj are not tied we set µ∗ (x∗ , x∗ ) = 1. If bi is not tied but bj i i ij i j is, we set µ∗ (x∗ , xj ) = 0.5. If bi , bj are tied then βji , βij can be shown to be maximized at either ij i x∗ , x∗ = (0, 0), (1, 1) or x∗ , x∗ = (0, 1), (1, 0). We then set µ∗ to be 0.5 at one of these assignment i j i j ij ∗ pairs. The resulting µ∗ is clearly primal feasible. Setting δi = b∗ we obtain that the dual variables i (δ ∗ , λ∗ , β ∗ ) and primal µ∗ satisfy complementary slackness for the LP in Eq. 7 and therefore µ∗ is primal optimal. The binary optimality result implies partial decodability, since [6] shows that the LP is partially decodable for binary variables. 6 Beyond pairwise potentials: Generalized MPLP In the previous sections we considered maximizing functions which factor according to the edges of the graph. A more general setting considers clusters c1 , . . . , ck ⊂ {1, . . . , n} (the set of clusters is denoted by C), and a function f (x; θ) = c θc (xc ) defined via potentials over clusters θc (xc ). The MAP problem in this case also has an LP relaxation (see e.g. [11]). To define the LP we introduce the following definitions: S = {c ∩ c : c, c ∈ C, c ∩ c = ∅} is the set of intersection between clusters ˆ ˆ ˆ and S(c) = {s ∈ S : s ⊆ c} is the set of overlap sets for cluster c.We now consider marginals over the variables in c ∈ C and s ∈ S and require that cluster marginals agree on their overlap. Denote this set by ML (C). The LP relaxation is then to maximize µ · θ subject to µ ∈ ML (C). As in Sec. 4, we can derive message passing updates that result in monotone decrease of the dual LP of the above relaxation. The derivation is similar and we omit the details. The key observation is that one needs to introduce |S(c)| copies of each marginal µc (xc ) (instead of the two copies in the pairwise case). Next, as in the EMPLP derivation we assume all β are fixed except those corresponding to some cluster c. The resulting messages are λc→s (xs ) from a cluster c to all of its intersection sets s ∈ S(c). The update on these messages turns out to be:   1 1 λ−c (xs ) + max  λ−c (xs ) + θc (xc ) λc→s (xs ) = − 1 − ˆ s s ˆ |S(c)| |S(c)| xc\s s∈S(c)\s ˆ where for a given c ∈ C all λc→s should be updated simultaneously for s ∈ S(c), and λ−c (xs ) is s defined as the sum of messages into s that are not from c. We refer to this algorithm as Generalized EMPLP (GEMPLP). It is possible to derive an algorithm similar to NMPLP that updates several clusters simultaneously, but its structure is more involved and we do not address it here. 7 Related Work Weiss et al. [11] recently studied the fixed points of a class of max-product like algorithms. Their analysis focused on properties of fixed points rather than convergence guarantees. Specifically, they showed that if the counting numbers used in a generalized max-product algorithm satisfy certain properties, then its fixed points will be the exact MAP if the beliefs have unique maxima, and for binary variables the solution can be partially decodable. Both these properties are obtained for the MPLP fixed points, and in fact we can show that MPLP satisfies the conditions in [11], so that we obtain these properties as corollaries of [11]. We stress however, that [11] does not address convergence of algorithms, but rather properties of their fixed points, if they converge. MPLP is similar in some aspects to Kolmogorov’s TRW-S algorithm [5]. TRW-S is also a monotone coordinate descent method in a dual of the LP relaxation and its fixed points also have similar 5 guarantees to those of MPLP [6]. Furthermore, convergence to a local optimum may occur, as it does for MPLP. One advantage of MPLP lies in the simplicity of its updates and the fact that it is parameter free. The other is its simple generalization to potentials over clusters of nodes (Sec. 6). Recently, several new dual LP algorithms have been introduced, which are more closely related to our formalism. Werner [12] presented a class of algorithms which also improve the dual LP at every iteration. The simplest of those is the max-sum-diffusion algorithm, which is similar to our EMPLP algorithm, although the updates are different from ours. Independently, Johnson et al. [4] presented a class of algorithms that improve duals of the MAP-LP using coordinate descent. They decompose the model into tractable parts by replicating variables and enforce replication constraints within the Lagrangian dual. Our basic formulation in Eq. 3 could be derived from their perspective. However, the updates in the algorithm and the analysis differ. Johnson et al. also presented a method for overcoming the local optimum problem, by smoothing the objective so that it is strictly convex. Such an approach could also be used within our algorithms. Vontobel and Koetter [9] recently introduced a coordinate descent algorithm for decoding LDPC codes. Their method is specifically tailored for this case, and uses updates that are similar to our edge based updates. Finally, the concept of dual coordinate descent may be used in approximating marginals as well. In [3] we use such an approach to optimize a variational bound on the partition function. The derivation uses some of the ideas used in the MPLP dual, but importantly does not find the minimum for each coordinate. Instead, a gradient like step is taken at every iteration to decrease the dual objective. 8 Experiments We compared NMPLP to three other message passing algorithms:2 Tree-Reweighted max-product (TRMP) [10],3 standard max-product (MP), and GEMPLP. For MP and TRMP we used the standard approach of damping messages using a factor of α = 0.5. We ran all algorithms for a maximum of 2000 iterations, and used the hit-time measure to compare their speed of convergence. This measure is defined as follows: At every iteration the beliefs can be used to obtain an assignment x with value f (x). We define the hit-time as the first iteration at which the maximum value of f (x) is achieved.4 We first experimented with a 10 × 10 grid graph, with 5 values per state. The function f (x) was 5 a Potts model: f (x) = The values for θij and θi (xi ) ij∈E θij I(xi = xj ) + i∈V θi (xi ). were randomly drawn from [−cI , cI ] and [−cF , cF ] respectively, and we used values of cI and cF in the range range [0.1, 2.35] (with intervals of 0.25), resulting in 100 different models. The clusters for GEMPLP were the faces of the graph [14]. To see if NMPLP converges to the LP solution we also used an LP solver to solve the LP relaxation. We found that the the normalized difference between NMPLP and LP objective was at most 10−3 (median 10−7 ), suggesting that NMPLP typically converged to the LP solution. Fig. 2 (top row) shows the results for the three algorithms. It can be seen that while all non-cluster based algorithms obtain similar f (x) values, NMPLP has better hit-time (in the median) than TRMP and MP, and MP does not converge in many cases (see caption). GEMPLP converges more slowly than NMPLP, but obtains much better f (x) values. In fact, in 99% of the cases the normalized difference between the GEMPLP objective and the f (x) value was less than 10−5 , suggesting that the exact MAP solution was found. We next applied the algorithms to the real world problems of protein design. In [13], Yanover et al. show how these problems can be formalized in terms of finding a MAP in an appropriately constructed graphical model.6 We used all algorithms except GNMPLP (since there is no natural choice for clusters in this case) to approximate the MAP solution on the 97 models used in [13]. In these models the number of states per variable is 2 − 158, and there are up to 180 variables per model. Fig. 2 (bottom) shows results for all the design problems. In this case only 11% of the MP runs converged, and NMPLP was better than TRMP in terms of hit-time and comparable in f (x) value. The performance of MP was good on the runs where it converged. 2 As expected, NMPLP was faster than EMPLP so only NMPLP results are given. The edge weights for TRMP corresponded to a uniform distribution over all spanning trees. 4 This is clearly a post-hoc measure since it can only be obtained after the algorithm has exceeded its maximum number of iterations. However, it is a reasonable algorithm-independent measure of convergence. 5 The potential θi (xi ) may be folded into the pairwise potential to yield a model as in Eq. 1. 6 Data available from http://jmlr.csail.mit.edu/papers/volume7/yanover06a/Rosetta Design Dataset.tgz 3 6 (a) (b) (c) 100 (d) 0.6 2000 0.04 0.4 0.02 −50 0 −0.02 −0.04 ∆(Value) 0 1000 ∆(Hit Time) ∆(Value) ∆(Hit Time) 50 0 MP TRMP GMPLP 0 −0.2 −1000 −0.4 −0.06 −100 0.2 MP TRMP GMPLP MP TRMP MP TRMP Figure 2: Evaluation of message passing algorithms on Potts models and protein design problems. (a,c): Convergence time results for the Potts models (a) and protein design problems (c). The box-plots (horiz. red line indicates median) show the difference between the hit-time for the other algorithms and NMPLP. (b,d): Value of integer solutions for the Potts models (b) and protein design problems (d). The box-plots show the normalized difference between the value of f (x) for NMPLP and the other algorithms. All figures are such that better MPLP performance yields positive Y axis values. Max-product converged on 58% of the cases for the Potts models, and on 11% of the protein problems. Only convergent max-product runs are shown. 9 Conclusion We have presented a convergent algorithm for MAP approximation that is based on block coordinate descent of the MAP-LP relaxation dual. The algorithm can also be extended to cluster based functions, which result empirically in improved MAP estimates. This is in line with the observations in [14] that generalized belief propagation algorithms can result in significant performance improvements. However generalized max-product algorithms [14] are not guaranteed to converge whereas GMPLP is. Furthermore, the GMPLP algorithm does not require a region graph and only involves intersection between pairs of clusters. In conclusion, MPLP has the advantage of resolving the convergence problems of max-product while retaining its simplicity, and offering the theoretical guarantees of LP relaxations. We thus believe it should be useful in a wide array of applications. A Derivation of the dual Before deriving the dual, we first express the constraint set ML (G) in a slightly different way. The definition of ML (G) in Sec. 2 uses a single distribution µij (xi , xj ) for every ij ∈ E. In what follows, we use two copies of this pairwise distribution for every edge, which we denote µij (xi , xj ) ¯ and µji (xj , xi ), and we add the constraint that these two copies both equal the original µij (xi , xj ). ¯ For this extended set of pairwise marginals, we consider the following set of constraints which is clearly equivalent to ML (G). On the rightmost column we give the dual variables that will correspond to each constraint (we omit non-negativity constraints). µij (xi , xj ) = µij (xi , xj ) ¯ µji (xj , xi ) = µij (xi , xj ) ¯ x xi µij (ˆi , xj ) = µj (xj ) ˆ ¯ µji (ˆj , xi ) = µi (xi ) ¯ x xj ˆ xi µi (xi ) = 1 ∀ij ∈ E, xi , xj ∀ij ∈ E, xi , xj ∀ij ∈ E, xj ∀ji ∈ E, xi ∀i ∈ V βij (xi , xj ) βji (xj , xi ) λij (xj ) λji (xi ) δi (6) ¯ We denote the set of (µ, µ) satisfying these constraints by ML (G). We can now state an LP that ¯ is equivalent to MAPLPR, only with an extended set of variables and constraints. The equivalent ¯ problem is to maximize µ · θ subject to (µ, µ) ∈ ML (G) (note that the objective uses the original ¯ µ copy). LP duality transformation of the extended problem yields the following LP min i δi s.t. λij (xj ) − βij (xi , xj ) ≥ 0 βij (xi , xj ) + βji (xj , xi ) = θij (xi , xj ) − k∈N (i) λki (xi ) + δi ≥ 0 ∀ij, ji ∈ E, xi , xj ∀ij ∈ E, xi , xj ∀i ∈ V, xi (7) We next simplify the above LP by eliminating some of its constraints and variables. Since each variable δi appears in only one constraint, and the objective minimizes δi it follows that δi = maxxi k∈N (i) λki (xi ) and the constraints with δi can be discarded. Similarly, since λij (xj ) appears in a single constraint, we have that for all ij ∈ E, ji ∈ E, xi , xj λij (xj ) = maxxi βij (xi , xj ) and the constraints with λij (xj ), λji (xi ) can also be discarded. Using the eliminated δi and λji (xi ) 7 variables, we obtain that the LP in Eq. 7 is equivalent to that in Eq. 3. Note that the objective in Eq. 3 is convex since it is a sum of point-wise maxima of convex functions. B Proof of Proposition 2 We wish to minimize f in Eq. 4 subject to the constraint that βij + βji = θij . Rewrite f as f (βij , βji ) = max λ−j (xi ) + βji (xj , xi ) + max λ−i (xj ) + βij (xi , xj ) j i xi ,xj xi ,xj (8) The sum of the two arguments in the max is λ−j (xi ) + λ−i (xj ) + θij (xi , xj ) i j (because of the constraints on β). Thus the minimum must be greater than −j −i 1 2 maxxi ,xj λi (xi ) + λj (xj ) + θij (xi , xj ) . One assignment to β that achieves this minimum is obtained by requiring an equalization condition:7 λ−i (xj ) + βij (xi , xj ) = λ−j (xi ) + βji (xj , xi ) = j i 1 θij (xi , xj ) + λ−j (xi ) + λ−i (xj ) i j 2 (9) which implies βij (xi , xj ) = 1 θij (xi , xj ) + λ−j (xi ) − λ−i (xj ) and a similar expression for βji . i j 2 The resulting λij (xj ) = maxxi βij (xi , xj ) are then the ones in Prop. 2. Acknowledgments The authors acknowledge support from the Defense Advanced Research Projects Agency (Transfer Learning program). Amir Globerson was also supported by the Rothschild Yad-Hanadiv fellowship. References [1] M. Bayati, D. Shah, and M. Sharma. Maximum weight matching via max-product belief propagation. IEEE Trans. on Information Theory (to appear), 2007. [2] D. P. Bertsekas, editor. Nonlinear Programming. Athena Scientific, Belmont, MA, 1995. [3] A. Globerson and T. Jaakkola. Convergent propagation algorithms via oriented trees. In UAI. 2007. [4] J.K. Johnson, D.M. Malioutov, and A.S. Willsky. Lagrangian relaxation for map estimation in graphical models. In Allerton Conf. Communication, Control and Computing, 2007. [5] V. Kolmogorov. Convergent tree-reweighted message passing for energy minimization. IEEE Transactions on Pattern Analysis and Machine Intelligence, 28(10):1568–1583, 2006. [6] V. Kolmogorov and M. Wainwright. On the optimality of tree-reweighted max-product message passing. In 21st Conference on Uncertainty in Artificial Intelligence (UAI). 2005. [7] J. Pearl. Probabilistic Reasoning in Intelligent Systems. Morgan Kaufmann, 1988. [8] B. Taskar, S. Lacoste-Julien, and M. Jordan. Structured prediction, dual extragradient and bregman projections. Journal of Machine Learning Research, pages 1627–1653, 2006. [9] P.O. Vontobel and R. Koetter. Towards low-complexity linear-programming decoding. In Proc. 4th Int. Symposium on Turbo Codes and Related Topics, 2006. [10] M. J. Wainwright, T. Jaakkola, and A. S. Willsky. Map estimation via agreement on trees: messagepassing and linear programming. IEEE Trans. on Information Theory, 51(11):1120–1146, 2005. [11] Y. Weiss, C. Yanover, and T. Meltzer. Map estimation, linear programming and belief propagation with convex free energies. In UAI. 2007. [12] T. Werner. A linear programming approach to max-sum, a review. IEEE Trans. on PAMI, 2007. [13] C. Yanover, T. Meltzer, and Y. Weiss. Linear programming relaxations and belief propagation – an empirical study. Jourmal of Machine Learning Research, 7:1887–1907, 2006. [14] J.S. Yedidia, W.T. W.T. Freeman, and Y. Weiss. Constructing free-energy approximations and generalized belief propagation algorithms. IEEE Trans. on Information Theory, 51(7):2282–2312, 2005. 7 Other solutions are possible but may not yield some of the properties of MPLP. 8

3 0.88449609 139 nips-2007-Nearest-Neighbor-Based Active Learning for Rare Category Detection

Author: Jingrui He, Jaime G. Carbonell

Abstract: Rare category detection is an open challenge for active learning, especially in the de-novo case (no labeled examples), but of significant practical importance for data mining - e.g. detecting new financial transaction fraud patterns, where normal legitimate transactions dominate. This paper develops a new method for detecting an instance of each minority class via an unsupervised local-density-differential sampling strategy. Essentially a variable-scale nearest neighbor process is used to optimize the probability of sampling tightly-grouped minority classes, subject to a local smoothness assumption of the majority class. Results on both synthetic and real data sets are very positive, detecting each minority class with only a fraction of the actively sampled points required by random sampling and by Pelleg’s Interleave method, the prior best technique in the sparse literature on this topic. 1

4 0.79425186 190 nips-2007-Support Vector Machine Classification with Indefinite Kernels

Author: Ronny Luss, Alexandre D'aspremont

Abstract: In this paper, we propose a method for support vector machine classification using indefinite kernels. Instead of directly minimizing or stabilizing a nonconvex loss function, our method simultaneously finds the support vectors and a proxy kernel matrix used in computing the loss. This can be interpreted as a robust classification problem where the indefinite kernel matrix is treated as a noisy observation of the true positive semidefinite kernel. Our formulation keeps the problem convex and relatively large problems can be solved efficiently using the analytic center cutting plane method. We compare the performance of our technique with other methods on several data sets.

5 0.78748357 75 nips-2007-Efficient Bayesian Inference for Dynamically Changing Graphs

Author: Ozgur Sumer, Umut Acar, Alexander T. Ihler, Ramgopal R. Mettu

Abstract: Motivated by stochastic systems in which observed evidence and conditional dependencies between states of the network change over time, and certain quantities of interest (marginal distributions, likelihood estimates etc.) must be updated, we study the problem of adaptive inference in tree-structured Bayesian networks. We describe an algorithm for adaptive inference that handles a broad range of changes to the network and is able to maintain marginal distributions, MAP estimates, and data likelihoods in all expected logarithmic time. We give an implementation of our algorithm and provide experiments that show that the algorithm can yield up to two orders of magnitude speedups on answering queries and responding to dynamic changes over the sum-product algorithm. 1

6 0.78430998 65 nips-2007-DIFFRAC: a discriminative and flexible framework for clustering

7 0.78375155 63 nips-2007-Convex Relaxations of Latent Variable Training

8 0.78197032 128 nips-2007-Message Passing for Max-weight Independent Set

9 0.78176486 116 nips-2007-Learning the structure of manifolds using random projections

10 0.78167713 141 nips-2007-New Outer Bounds on the Marginal Polytope

11 0.78163213 38 nips-2007-Boosting Algorithms for Maximizing the Soft Margin

12 0.78008687 204 nips-2007-Theoretical Analysis of Heuristic Search Methods for Online POMDPs

13 0.77978808 40 nips-2007-Bundle Methods for Machine Learning

14 0.77780336 76 nips-2007-Efficient Convex Relaxation for Transductive Support Vector Machine

15 0.77669674 187 nips-2007-Structured Learning with Approximate Inference

16 0.77388906 94 nips-2007-Gaussian Process Models for Link Analysis and Transfer Learning

17 0.77281076 10 nips-2007-A Randomized Algorithm for Large Scale Support Vector Learning

18 0.77212965 186 nips-2007-Statistical Analysis of Semi-Supervised Regression

19 0.77030301 178 nips-2007-Simulated Annealing: Rigorous finite-time guarantees for optimization on continuous domains

20 0.76896018 134 nips-2007-Multi-Task Learning via Conic Programming