nips nips2009 nips2009-103 knowledge-graph by maker-knowledge-mining

103 nips-2009-Graph Zeta Function in the Bethe Free Energy and Loopy Belief Propagation

Source: pdf

Author: Yusuke Watanabe, Kenji Fukumizu

Abstract: We propose a new approach to the analysis of Loopy Belief Propagation (LBP) by establishing a formula that connects the Hessian of the Bethe free energy with the edge zeta function. The formula has a number of theoretical implications for LBP. It is applied to give a sufficient condition that the Hessian of the Bethe free energy is positive definite, which shows non-convexity for graphs with multiple cycles. The formula clarifies the relation between the local stability of a fixed point of LBP and local minima of the Bethe free energy. We also propose a new approach to the uniqueness of the LBP fixed point, and show various conditions of uniqueness.

Reference: text

Summary: the most important sentences generated by the tfidf model

sentIndex sentText sentNum sentScore

1 We propose a new approach to the analysis of Loopy Belief Propagation (LBP) by establishing a formula that connects the Hessian of the Bethe free energy with the edge zeta function. [sent-5, score-0.703]

2 It is applied to give a sufﬁcient condition that the Hessian of the Bethe free energy is positive deﬁnite, which shows non-convexity for graphs with multiple cycles. [sent-7, score-0.327]

3 The formula clariﬁes the relation between the local stability of a ﬁxed point of LBP and local minima of the Bethe free energy. [sent-8, score-0.41]

4 1 Introduction. Pearl’s belief propagation [1] provides an efficient method for exact inference in probabilistic models associated with trees. [sent-10, score-0.127]

5 One of the interesting theoretical aspects of LBP is its connection with the Bethe free energy [3]. [sent-12, score-0.276]

6 It is known, for example, the ﬁxed points of LBP correspond to the stationary points of the Bethe free energy. [sent-13, score-0.176]

7 This paper theoretically analyzes LBP by establishing a formula asserting that the determinant of the Hessian of the Bethe free energy equals the reciprocal of the edge zeta function up to a positive factor. [sent-15, score-0.767]

8 This formula derives a variety of results on the properties of LBP such as stability and uniqueness, since the zeta function has a direct link with the dynamics of LBP as we show. [sent-16, score-0.445]

9 The ﬁrst application of the formula is the condition for the positive deﬁniteness of the Hessian of the Bethe free energy. [sent-17, score-0.267]

10 The Bethe free energy is not necessarily convex, which causes unfavorable behaviors of LBP such as oscillation and multiple ﬁxed points. [sent-18, score-0.276]

11 Unlike the previous approaches, such as [4, 5], which consider the global structure of the Bethe free energy, we focus on the local structure. [sent-20, score-0.311]

12 Second, we clarify a relation between the local stability of an LBP fixed point and the local structure of the Bethe free energy. [sent-23, score-0.285]

13 Such a relation is not necessarily obvious, since LBP is not the gradient descent of the Bethe free energy. [sent-24, score-0.144]

14 In this line of studies, Heskes [6] shows that a locally stable ﬁxed point of LBP is a local minimum of the Bethe free energy. [sent-25, score-0.302]

15 It is thus interesting to ask which local minima of the Bethe free energy are stable or unstable fixed points of LBP. [sent-26, score-0.431]

16 We answer this question by elucidating the conditions of the local stability of LBP and the positive deﬁniteness of the Bethe free energy in terms of the eigenvalues of a matrix, which appears in the graph zeta function. [sent-27, score-0.726]

17 Finally, we discuss the uniqueness of the LBP fixed point by developing a differential topological result on the Bethe free energy. [sent-28, score-0.264]

18 The result shows that the determinant of the Hessian at the ﬁxed points, which appears in the formula of zeta function, must satisfy a strong constraint. [sent-29, score-0.423]

19 As a consequence, in addition to the known result on the one-cycle case, we show that the LBP ﬁxed point is unique for any unattractive connected graph with two cycles without restricting the strength of interactions. [sent-30, score-0.132]

20 2 Loopy belief propagation algorithm and the Bethe free energy. Throughout this paper, G = (V, E) is a connected undirected graph with V the vertices and E the undirected edges. [sent-31, score-0.475]

21 In various applications, the computation of marginal distributions $p_i(x_i) := \sum_{x \setminus \{x_i\}} p(x)$ and $p_{ij}(x_i, x_j) := \sum_{x \setminus \{x_i, x_j\}} p(x)$ is required, though the exact computation is intractable for large graphs. [sent-37, score-0.312]

22 If the graph is a tree, they are efﬁciently computed by Pearl’s belief propagation algorithm [1]. [sent-38, score-0.165]

23 The update rule of messages is given by $\mu^{\mathrm{new}}_{i \to j}(x_j) \propto \sum_{x_i} \psi_{ji}(x_j, x_i)\, \psi_i(x_i) \prod_{k \in N_i \setminus j} \mu_{k \to i}(x_i)$, (2) where $N_i$ is the neighborhood of $i \in V$. [sent-42, score-0.282]
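The update rule (2) can be sketched as follows. This is a minimal illustration, not the authors' code: it assumes a binary pairwise model with spins x_i = ±1, potentials ψ_ij(x_i, x_j) = exp(J_ij x_i x_j) and ψ_i(x_i) = exp(h_i x_i), synchronous updates, and the standard belief formula b_i ∝ ψ_i ∏_k μ_{k→i}. The names `loopy_bp` and `exact_marginals` and the `damping` parameter are hypothetical.

```python
import itertools
import numpy as np

def loopy_bp(n, edges, J, h, iters=200, damping=0.0):
    """Synchronous LBP for p(x) ∝ exp(sum_ij J_ij x_i x_j + sum_i h_i x_i), x_i = ±1."""
    states = np.array([-1.0, 1.0])
    nbrs = {i: [] for i in range(n)}
    psi = {}
    for (i, j), J_ij in zip(edges, J):
        nbrs[i].append(j)
        nbrs[j].append(i)
        table = np.exp(J_ij * np.outer(states, states))  # symmetric 2x2 table
        psi[(i, j)] = table
        psi[(j, i)] = table
    psi_node = [np.exp(h[i] * states) for i in range(n)]
    # msgs[(i, j)] stores the message mu_{i->j} as a vector over x_j
    msgs = {(i, j): np.full(2, 0.5) for i in range(n) for j in nbrs[i]}
    for _ in range(iters):
        new = {}
        for (i, j) in msgs:
            incoming = psi_node[i].copy()
            for k in nbrs[i]:
                if k != j:
                    incoming *= msgs[(k, i)]
            # sum over x_i of psi_ji(x_j, x_i) psi_i(x_i) prod_k mu_{k->i}(x_i)
            m = psi[(j, i)] @ incoming
            m /= m.sum()
            new[(i, j)] = damping * msgs[(i, j)] + (1.0 - damping) * m
        msgs = new
    beliefs = []
    for i in range(n):
        b = psi_node[i].copy()
        for k in nbrs[i]:
            b *= msgs[(k, i)]
        beliefs.append(b / b.sum())
    return np.array(beliefs)

def exact_marginals(n, edges, J, h):
    """Brute-force single-node marginals, feasible only for small n."""
    probs = np.zeros((n, 2))
    for x in itertools.product([-1, 1], repeat=n):
        energy = sum(J_ij * x[i] * x[j] for (i, j), J_ij in zip(edges, J))
        energy += sum(h[i] * x[i] for i in range(n))
        for i in range(n):
            probs[i, (x[i] + 1) // 2] += np.exp(energy)
    return probs / probs.sum(axis=1, keepdims=True)
```

On a tree the beliefs should agree with the brute-force marginals, which gives a quick sanity check; on graphs with cycles the two generally differ.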

24 From (2) and (3), the constraints $b_{ij}(x_i, x_j) > 0$ and $\sum_{x_j} b_{ij}(x_i, x_j) = b_i(x_i)$ are automatically satisfied. [sent-46, score-0.807]

25 We introduce the Bethe free energy as a tractable approximation of the Gibbs free energy. [sent-47, score-0.42]

26 The exact distribution (1) is characterized by a variational problem $p(x) = \operatorname{argmin}_{\hat{p}} F_{\mathrm{Gibbs}}(\hat{p})$, where the minimum is taken over all probability distributions on $(x_i)_{i \in V}$ and $F_{\mathrm{Gibbs}}(\hat{p})$ is the Gibbs free energy defined by $F_{\mathrm{Gibbs}}(\hat{p}) = \mathrm{KL}(\hat{p} \| p) - \log Z$. [sent-48, score-0.295]

27 In the Bethe approximation, we confine the above minimization to the distribution of the form $b(x) \propto \prod_{ij \in E} b_{ij}(x_i, x_j) \prod_{i \in V} b_i(x_i)^{1 - d_i}$, where $d_i := |N_i|$ is the degree and the constraints $b_{ij}(x_i, x_j) > 0$, $\sum_{x_i, x_j} b_{ij}(x_i, x_j) = 1$ and $\sum_{x_j} b_{ij}(x_i, x_j) = b_i(x_i)$ are satisfied. [sent-51, score-1.631]

28 A set {bi (xi ), bij (xi , xj )} satisfying these constraints is called pseudomarginals. [sent-52, score-0.295]

29 To put it more precisely, there is a one-to-one correspondence between the set of stationary points of the Bethe free energy and the set of fixed points of LBP. [sent-57, score-0.308]

30 It is more convenient if we work with minimal parameters, mean $m_i = \mathrm{E}_{b_i}[x_i]$ and correlation $\chi_{ij} = \mathrm{E}_{b_{ij}}[x_i x_j]$. [sent-58, score-0.32]

31 Then we have an effective parametrization of pseudomarginals: $b_{ij}(x_i, x_j) = \frac{1}{4}(1 + m_i x_i + m_j x_j + \chi_{ij} x_i x_j)$, $b_i(x_i) = \frac{1}{2}(1 + m_i x_i)$. (5) [sent-59, score-1.356]

32 The Bethe free energy (4) is rewritten as $F(\{m_i, \chi_{ij}\}) = -\sum_{ij \in E} J_{ij} \chi_{ij} - \sum_{i \in V} h_i m_i + \sum_{ij \in E} \sum_{x_i, x_j} \eta\big(\frac{1 + m_i x_i + m_j x_j + \chi_{ij} x_i x_j}{4}\big) + \sum_{i \in V} (1 - d_i) \sum_{x_i} \eta\big(\frac{1 + m_i x_i}{2}\big)$, (6) where $\eta(x) := x \log x$. [sent-60, score-1.488]
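The parametrization (5) and the free energy (6) translate almost directly into code. The sketch below is illustrative only: it assumes spins x_i = ±1, reads b_i(x_i) as (1 + m_i x_i)/2 (consistent with the argument of η in (6)), and uses hypothetical names `pseudomarginals` and `bethe_free_energy`.

```python
import itertools
import math

def pseudomarginals(m_i, m_j, chi_ij):
    """Return (b_ij, b_i, b_j) from parametrization (5), keyed by spin values."""
    b_ij = {(xi, xj): (1 + m_i * xi + m_j * xj + chi_ij * xi * xj) / 4
            for xi, xj in itertools.product((-1, 1), repeat=2)}
    b_i = {xi: (1 + m_i * xi) / 2 for xi in (-1, 1)}
    b_j = {xj: (1 + m_j * xj) / 2 for xj in (-1, 1)}
    return b_ij, b_i, b_j

def eta(x):
    """eta(x) = x log x, defined here for x > 0 only."""
    return x * math.log(x)

def bethe_free_energy(n, edges, J, h, m, chi):
    """Evaluate (6); J and chi are lists aligned with the list of edges."""
    degree = [0] * n
    F = 0.0
    for (i, j), J_ij, chi_ij in zip(edges, J, chi):
        degree[i] += 1
        degree[j] += 1
        b_ij, _, _ = pseudomarginals(m[i], m[j], chi_ij)
        F += -J_ij * chi_ij + sum(eta(b) for b in b_ij.values())
    for i in range(n):
        F -= h[i] * m[i]
        F += (1 - degree[i]) * sum(eta((1 + m[i] * xi) / 2) for xi in (-1, 1))
    return F
```

For a single edge with h = 0, plugging in the exact moments m = 0 and χ = tanh J should reproduce −log Z = −log(4 cosh J), since the Bethe approximation is exact on trees.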

33 The domain of $F$ is written as $L(G) := \{ \{m_i, \chi_{ij}\} \in \mathbb{R}^{N+M} \mid 1 + m_i x_i + m_j x_j + \chi_{ij} x_i x_j > 0 \text{ for all } ij \in E \text{ and } x_i, x_j = \pm 1 \}$. [sent-61, score-1.276]

34 Zeta function and Hessian of Bethe free energy. Zeta function and Ihara’s formula. For each undirected edge of $G$, we make a pair of oppositely directed edges, which form a set of directed edges $\vec{E}$. [sent-66, score-0.504]

35 An equivalence class of closed geodesics is called a prime cycle if it is not a repeated concatenation of a shorter closed geodesic. [sent-75, score-0.132]

36 For given weights $u = (u_e)_{e \in \vec{E}}$, the edge zeta function [7, 8] is defined by $\zeta_G(u) := \prod_{p \in P} (1 - g(p))^{-1}$, $g(p) := u_{e_1} \cdots u_{e_k}$ for $p = (e_1, \ldots$ [sent-77, score-0.329]

37 $\ldots, e_k)$, where $u_e \in \mathbb{C}$ is assumed to be sufficiently small for convergence. [sent-80, score-0.176]

38 This is an analogue of the Riemann zeta function which is represented by the product over all the prime numbers. [sent-81, score-0.343]

39 For 1-cycle graph CN of length N , the prime cycles are (e1 , e2 , . [sent-84, score-0.163]

40 Except for these two types of graphs, the number of prime cycles is infinite. [sent-91, score-0.125]

41 It is known that the edge zeta function has the following simple determinant formula, which gives analytic continuation to the whole $\mathbb{C}^{2M}$. [sent-92, score-0.368]

42 $\zeta_G(u) = \det(I - UM)^{-1}$, (8) where $U$ is a diagonal matrix defined by $U_{e,e'} := u_e \delta_{e,e'}$. [sent-96, score-0.16]

43 We need to show another determinant formula of the edge zeta function, which is used in the proof of theorem 3. [sent-97, score-0.517]
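The determinant expression det(I − UM) for the reciprocal of the edge zeta function can be evaluated numerically on small graphs. The sketch below is a hedged illustration: the definition of the directed edge matrix is not preserved in this extract, so the code assumes the usual non-backtracking convention M_{e,e'} = 1 when t(e') = o(e) and e' is not the reverse of e. It also assumes a simple graph (no multi-edges or self-loops). The function names are hypothetical.

```python
import numpy as np

def directed_edge_matrix(edges):
    """Return the directed edges (o, t) and the assumed non-backtracking matrix M."""
    directed = [(i, j) for (i, j) in edges] + [(j, i) for (i, j) in edges]
    size = len(directed)
    M = np.zeros((size, size))
    for a, (oa, ta) in enumerate(directed):
        for b, (ob, tb) in enumerate(directed):
            # e' = (ob -> tb) feeds into e = (oa -> ta) without backtracking
            if tb == oa and ob != ta:
                M[a, b] = 1.0
    return directed, M

def inverse_edge_zeta(edges, u):
    """Compute det(I - U M), where u maps each directed edge (o, t) to its weight."""
    directed, M = directed_edge_matrix(edges)
    U = np.diag([u[e] for e in directed])
    return np.linalg.det(np.eye(len(directed)) - U @ M)
```

For a triangle with all weights equal to u there are two prime cycles (one per orientation), so the value should be (1 − u³)²; for a tree there are no prime cycles and the value should be 1.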

44 We define two linear operators on $C(V)$ by $(\hat{D} f)(i) := \Big( \sum_{e \in \vec{E},\, t(e) = i} \frac{u_e u_{\bar{e}}}{1 - u_e u_{\bar{e}}} \Big) f(i)$, $(\hat{A} f)(i) := \sum_{e \in \vec{E},\, t(e) = i} \frac{u_e}{1 - u_e u_{\bar{e}}} f(o(e))$, where $f \in C(V)$. (9) [sent-101, score-0.48]

45 Then we have $\zeta_G(u)^{-1} = \det(I - UM) = \det(I + \hat{D} - \hat{A}) \prod_{[e] \in E} (1 - u_e u_{\bar{e}})$. (10) [sent-102, score-0.64]

46 If we set $u_e = u$ for all $e \in \vec{E}$, the edge zeta function is called the Ihara zeta function [9] and denoted by $\zeta_G(u)$. [sent-103, score-0.775]

47 In this single variable case, theorem 2 is reduced to Ihara’s formula [10]: $\zeta_G(u)^{-1} = \det(I - uM) = (1 - u^2)^M \det\big(I + \frac{u^2}{1 - u^2} D - \frac{u}{1 - u^2} A\big)$, (11) where $D$ is the degree matrix and $A$ is the adjacency matrix defined by $(Df)(i) := d_i f(i)$, $(Af)(i) := \sum_{e \in \vec{E},\, t(e) = i} f(o(e))$, $f \in C(V)$. [sent-104, score-0.185]
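Ihara's formula (11) can be checked numerically by evaluating both sides on a small graph. This is a sketch under assumptions: the directed edge matrix is taken to be the usual non-backtracking matrix (its definition is not preserved in this extract), the exponent M in (1 − u²)^M is read as the number of undirected edges, and the graph is simple. The name `ihara_both_sides` is hypothetical.

```python
import numpy as np

def ihara_both_sides(n, edges, u):
    """Return (det(I - u M), (1 - u^2)^|E| det(I + u^2/(1-u^2) D - u/(1-u^2) A))."""
    directed = list(edges) + [(j, i) for (i, j) in edges]
    # Assumed non-backtracking matrix: e' feeds into e, and e' is not the reverse of e
    nb = np.array([[1.0 if (tb == oa and ob != ta) else 0.0
                    for (ob, tb) in directed]
                   for (oa, ta) in directed])
    lhs = np.linalg.det(np.eye(len(directed)) - u * nb)

    A = np.zeros((n, n))
    for i, j in edges:
        A[i, j] = A[j, i] = 1.0
    D = np.diag(A.sum(axis=1))
    scale = 1.0 - u ** 2
    rhs = scale ** len(edges) * np.linalg.det(
        np.eye(n) + (u ** 2 / scale) * D - (u / scale) * A)
    return lhs, rhs
```

The left side is a 2|E| × 2|E| determinant while the right side is only |V| × |V|, which is the practical appeal of the formula.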

48 The following equality holds at any point of $L(G)$: $\zeta_G(u)^{-1} = \det(I - UM) = \det(\nabla^2 F) \prod_{ij \in E} \prod_{x_i, x_j = \pm 1} b_{ij}(x_i, x_j) \prod_{i \in V} \prod_{x_i = \pm 1} b_i(x_i)^{1 - d_i} \, 2^{2N + 4M}$, (12) where $b_{ij}$ and $b_i$ are given by (5) and $u_{i \to j} := \frac{\chi_{ij} - m_i m_j}{1 - m_j^2}$. [sent-107, score-1.13]
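The identity (12) lends itself to a numerical spot check. The sketch below rests on several assumptions: the Hessian is approximated by central finite differences (so agreement is only up to a tolerance), the edge weights are read as u_{i→j} = (χ_ij − m_i m_j)/(1 − m_j²), the directed edge matrix is the usual non-backtracking matrix, and only the entropy terms of F are differentiated because the J and h terms are linear. All function names are hypothetical.

```python
import itertools
import numpy as np

def eta(x):
    return x * np.log(x)

def entropy_part(theta, n, edges):
    """Entropy terms of (6); theta = (m_1..m_N, chi_1..chi_M)."""
    m, chi = theta[:n], theta[n:]
    deg = np.zeros(n)
    total = 0.0
    for (i, j), c in zip(edges, chi):
        deg[i] += 1
        deg[j] += 1
        for xi, xj in itertools.product((-1, 1), repeat=2):
            total += eta((1 + m[i] * xi + m[j] * xj + c * xi * xj) / 4)
    for i in range(n):
        for xi in (-1, 1):
            total += (1 - deg[i]) * eta((1 + m[i] * xi) / 2)
    return total

def numerical_hessian(f, theta, step=1e-4):
    """Central-difference approximation of the Hessian of f at theta."""
    d = len(theta)
    H = np.zeros((d, d))
    for a in range(d):
        for b in range(d):
            ea = np.zeros(d)
            eb = np.zeros(d)
            ea[a] = step
            eb[b] = step
            H[a, b] = (f(theta + ea + eb) - f(theta + ea - eb)
                       - f(theta - ea + eb) + f(theta - ea - eb)) / (4 * step ** 2)
    return H

def theorem3_sides(n, edges, m, chi):
    """Return (det(I - U M), right-hand side of (12)) at the point (m, chi)."""
    theta = np.concatenate([m, chi])
    H = numerical_hessian(lambda t: entropy_part(t, n, edges), theta)
    deg = np.zeros(n)
    prod_b = 1.0
    u = {}
    for (i, j), c in zip(edges, chi):
        deg[i] += 1
        deg[j] += 1
        for xi, xj in itertools.product((-1, 1), repeat=2):
            prod_b *= (1 + m[i] * xi + m[j] * xj + c * xi * xj) / 4
        u[(i, j)] = (c - m[i] * m[j]) / (1 - m[j] ** 2)  # u_{i->j}
        u[(j, i)] = (c - m[i] * m[j]) / (1 - m[i] ** 2)  # u_{j->i}
    for i in range(n):
        for xi in (-1, 1):
            prod_b *= ((1 + m[i] * xi) / 2) ** (1 - deg[i])
    rhs = np.linalg.det(H) * prod_b * 2.0 ** (2 * n + 4 * len(edges))
    directed = list(u)
    nb = np.array([[1.0 if (tb == oa and ob != ta) else 0.0
                    for (ob, tb) in directed]
                   for (oa, ta) in directed])
    U = np.diag([u[e] for e in directed])
    lhs = np.linalg.det(np.eye(len(directed)) - U @ nb)
    return lhs, rhs
```

At the symmetric point m = 0, χ_ij = c on a triangle, both sides can also be worked out by hand and equal (1 − c³)², which gives a second, independent check.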

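49 $\frac{\partial^2 F}{\partial \chi_{ij} \partial \chi_{kl}} = \delta_{ij,kl} \frac{1}{4} \Big( \frac{1}{1 + m_i + m_j + \chi_{ij}} + \frac{1}{1 - m_i + m_j - \chi_{ij}} + \frac{1}{1 + m_i - m_j - \chi_{ij}} + \frac{1}{1 - m_i - m_j + \chi_{ij}} \Big)$. Using this diagonal block, we erase (V,E)-block and (E,V)-block of the Hessian. [sent-111, score-1.064]

50 In other words, we choose a square matrix $X$ such that $\det X = 1$ and $X^T (\nabla^2 F) X = \begin{bmatrix} Y & 0 \\ 0 & \big( \frac{\partial^2 F}{\partial \chi_{ij} \partial \chi_{kl}} \big) \end{bmatrix}$. [sent-112, score-0.19]

51 After the computation given in the supplementary material, we see that $(Y)_{i,j} = \frac{1}{1 - m_i^2} + \sum_{k \in N_i} \frac{(\chi_{ik} - m_i m_k)^2}{(1 - m_i^2)(1 - m_i^2 - m_k^2 + 2 m_i m_k \chi_{ik} - \chi_{ik}^2)}$ if $i = j$, and $(Y)_{i,j} = -A_{i,j} \frac{\chi_{ij} - m_i m_j}{1 - m_i^2 - m_j^2 + 2 m_i m_j \chi_{ij} - \chi_{ij}^2}$ otherwise. (14) [sent-113, score-0.417]

52 From $u_{j \to i} = \frac{\chi_{ij} - m_i m_j}{1 - m_i^2}$, it is easy to check that $I_N + \hat{D} - \hat{A} = Y W$, where $\hat{A}$ and $\hat{D}$ are defined in (9) and $W$ is a diagonal matrix defined by $W_{i,j} := \delta_{i,j} (1 - m_i^2)$. [sent-114, score-0.179]

53 Therefore, $\det(I - UM) = \det(Y) \prod_{i \in V} (1 - m_i^2) \prod_{[e] \in E} (1 - u_e u_{\bar{e}}) =$ R. [sent-115, score-0.32]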

54 Theorem 3 shows that the determinant of the Hessian of the Bethe free energy is essentially equal to det(I −UM), the reciprocal of the edge zeta function. [sent-119, score-0.644]

55 4 Application to positive definiteness conditions. The convexity of the Bethe free energy is an important issue, as it guarantees uniqueness of the fixed point. [sent-121, score-0.395]

56 Pakzad et al [11] and Heskes [5] derive sufﬁcient conditions of convexity and show that the Bethe free energy is convex for trees and graphs with one cycle. [sent-122, score-0.319]

57 In this section, instead of such global structure, we shall focus on the local structure of the Bethe free energy as an application of the main formula. [sent-123, score-0.311]

58 We deﬁne mi (t) := mi and χij (t) := tχij + (1 − t)mi mj . [sent-132, score-0.438]

59 Using (14) and χij (0) = mi (0)mj (0), we can check that ∇2 F (0) is positive deﬁnite. [sent-138, score-0.197]

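60 We define the symmetrization of $u_{i \to j}$ and $u_{j \to i}$ by $\beta_{i \to j} = \beta_{j \to i} := \frac{\chi_{ij} - m_i m_j}{\{(1 - m_i^2)(1 - m_j^2)\}^{1/2}} = \frac{\mathrm{Cov}_{b_{ij}}[x_i, x_j]}{\{\mathrm{Var}_{b_i}[x_i] \mathrm{Var}_{b_j}[x_j]\}^{1/2}}$. (15) [sent-140, score-0.414]

61 Thus, $u_{i \to j} u_{j \to i} = \beta_{i \to j} \beta_{j \to i}$. [sent-141, score-0.172]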
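The relation between the directed weights and their symmetrization is easy to confirm numerically. A small sketch, assuming the weights are u_{i→j} = (χ_ij − m_i m_j)/(1 − m_j²) and u_{j→i} = (χ_ij − m_i m_j)/(1 − m_i²), which is how the garbled definitions in this extract are read here; the function name is hypothetical.

```python
import math

def edge_weights(m_i, m_j, chi_ij):
    """Return (u_{i->j}, u_{j->i}, beta) for one edge of a binary pairwise model."""
    cov = chi_ij - m_i * m_j              # Cov[x_i, x_j] under b_ij
    var_i = 1.0 - m_i ** 2                # Var[x_i] for a ±1 variable with mean m_i
    var_j = 1.0 - m_j ** 2
    u_ij = cov / var_j
    u_ji = cov / var_i
    beta = cov / math.sqrt(var_i * var_j) # correlation coefficient, as in (15)
    return u_ij, u_ji, beta
```

Because β is a correlation coefficient it lies in [−1, 1] for any valid pseudomarginal, whereas the individual u values need not be symmetric in i and j.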

62 Then we have $Z U M Z^{-1} = B M$, because $(Z U M Z^{-1})_{e,e'} = (1 - m_{t(e)}^2)^{1/2} u_e (M)_{e,e'} (1 - m_{t(e')}^2)^{-1/2} = \beta_e (M)_{e,e'}$. [sent-145, score-0.16]

63 As is seen from (11), $\alpha^{-1}$ is the distance from the origin to the nearest pole of Ihara’s zeta $\zeta_G(u)$. [sent-156, score-0.286]

64 The equation (16) is obtained by Hashimoto’s theorem [13], which gives the u → 1 limit of the Ihara zeta function. [sent-172, score-0.337]

65 5 Application to stability analysis. In this section we discuss the local stability of LBP and the local structure of the Bethe free energy around an LBP fixed point. [sent-179, score-0.436]

66 Heskes [6] shows that a locally stable fixed point of sufficiently damped LBP is a local minimum of the Bethe free energy. [sent-180, score-0.35]

67 A ﬁxed point η ∞ is called locally stable if LBP starting with a point sufﬁciently close to η ∞ converges to η ∞ . [sent-186, score-0.13]

68 A ﬁxed point is locally stable with some damping if and only if Spec(T ′ (η ∞ )) ⊂ {λ ∈ C|Reλ < 1}. [sent-190, score-0.138]

69 In section 4 of [16], they transform messages as $\mu_{i \to j} \to \mu_{i \to j} / \mu^{\infty}_{i \to j}$ and functions as $\psi_{ij} \to b_{ij} / (b_i b_j)$ and $\psi_i \to b_i$, where $\mu^{\infty}_{i \to j}$ is the message of the fixed point. [sent-192, score-0.273]

70 Since $\det(I - T'(\eta^{\infty})) = \det(I - UM)$, the formula in theorem 3 implies a direct link between the linearization $T'(\eta^{\infty})$ and the local structure of the Bethe free energy. [sent-203, score-0.345]

71 From theorem 4, we have that a ﬁxed point of LBP is a local minimum of the Bethe free energy if Spec(T ′ (η ∞ )) ⊂ C \ R≥1 . [sent-204, score-0.407]

72 It is now clear that the condition for positive deﬁniteness, local stability of damped LBP and local stability of undamped LBP are given in terms of the set of eigenvalues, C \ R≥1 , {λ ∈ C|Reλ < 1} and {λ ∈ C||λ| < 1} respectively. [sent-205, score-0.225]
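The three spectral regions named above can be tested mechanically once the eigenvalues of the linearization are available. The sketch below is illustrative and assumes the eigenvalues are supplied by the caller (for example from `numpy.linalg.eigvals`); the dictionary keys, the function name, and the tolerance `tol` are hypothetical. The first entry corresponds to a sufficient condition only.

```python
import numpy as np

def classify_spectrum(eigenvalues, tol=1e-12):
    """Check which of the three spectral conditions hold for the given eigenvalues."""
    lam = np.asarray(eigenvalues, dtype=complex)
    # Eigenvalues on the real ray [1, infinity) violate Spec ⊂ C \ R_{>=1}
    on_ray = (np.abs(lam.imag) <= tol) & (lam.real >= 1.0 - tol)
    return {
        "positive_definite_condition": bool(not on_ray.any()),
        "stable_with_damping": bool((lam.real < 1.0).all()),
        "stable_without_damping": bool((np.abs(lam) < 1.0).all()),
    }
```

The nesting of the regions is visible in the output: whenever "stable_without_damping" holds so does "stable_with_damping", and whenever that holds so does "positive_definite_condition".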

73 A locally stable ﬁxed point of sufﬁciently damped LBP is a local minimum of the Bethe free energy, because {λ ∈ C|Reλ < 1} is included in C \ R≥1 . [sent-206, score-0.342]

74 It is interesting to ask under which condition a local minimum of the Bethe free energy is a stable ﬁxed point of (damped) LBP. [sent-209, score-0.406]

75 While we do not know a complete answer, for an attractive model, which is deﬁned by Jij ≥ 0, the following theorem implies that if a stable ﬁxed point becomes unstable by changing Jij and hi , the corresponding local minimum also disappears. [sent-210, score-0.282]

76 $t$ is a temperature: $\psi_{ij}(t) = \exp(t^{-1} J_{ij} x_i x_j)$ and $\psi_i(t) = \exp(t^{-1} h_i x_i)$. [sent-214, score-0.461]

77 If we continuously change t and see the LBP ﬁxed point becomes unstable across t = t0 , then the corresponding local minimum of the Bethe free energy becomes a saddle point across t = t0 . [sent-216, score-0.409]

78 From (3), we see bij (xi , xj ) ∝ exp(Jij xi xj + θi xi + θj xj ) for some θi and θj . [sent-218, score-0.849]

79 From Jij ≥ 0, we have Covbij [xi , xj ] = χij − mi mj ≥ 0, and thus ui→j ≥ 0. [sent-219, score-0.414]

80 Theorem 6 extends theorem 2 of [14], which discusses only the case of vanishing local ﬁelds hi = 0 and the trivial ﬁxed point (i. [sent-222, score-0.167]

81 6 Application to uniqueness of LBP fixed point. The uniqueness of the LBP fixed point is a concern of many studies, because the property guarantees that LBP finds the global minimum of the Bethe free energy if it converges. [sent-225, score-0.535]

82 If $\det \nabla^2 F(q) \neq 0$ for all $q \in (\nabla F)^{-1}(0)$ then $\sum_{q \in (\nabla F)^{-1}(0)} \mathrm{sgn}\big(\det \nabla^2 F(q)\big) = 1$, where $\mathrm{sgn}(x) := 1$ if $x > 0$ and $-1$ if $x < 0$. [sent-230, score-0.428]

83 Note that the set (∇F )−1 (0), which is the stationary points of the Bethe free energy, coincides with the ﬁxed points of LBP. [sent-232, score-0.176]

84 Define a sequence of manifolds $\{C_n\}$ by $C_n := \{ q \in L(G) \mid \sum_{ij \in E} \sum_{x_i, x_j} -\log b_{ij} \le n \}$, which increasingly converges to $L(G)$. [sent-256, score-0.296]

85 From (3), we see that bij (xi , xj ) ∝ exp(Jij xi xj + θi xi + θj xj ) for some θi and θj . [sent-268, score-0.849]

86 From theorem 7 and lemma 3, we can immediately obtain the uniqueness condition in [18], though the stronger contractive property is proved under the same condition in [18]. [sent-271, score-0.166]

87 The interactions $\{J_{ij}, h_i\}$ and $\{J'_{ij}, h'_i\}$ are said to be equivalent if there exists $(s_i) \in \{\pm 1\}^V$ such that $J'_{ij} = J_{ij} s_i s_j$ and $h'_i = h_i s_i$. [sent-286, score-0.182]

88 Since an equivalent model is obtained by gauge transformation xi → xi si , the uniqueness property of LBP for equivalent models is unchanged. [sent-287, score-0.352]

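89 Since the prime cycles of $G$ bijectively correspond to those of $\hat{G}$ (in figure 2), we have $\det(I - BM) = \det(I - \hat{B} \hat{M})$, where the edge weights $\hat{\beta}_e$ of $\hat{G}$ take the values $\beta_{12} \beta_{23}$, $\beta_{13}$, and $\beta_{34}$. [sent-299, score-0.125]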

90 For graphs with multiple cycles, all the existing results on uniqueness make assumptions that essentially upper-bound |Jij |. [sent-303, score-0.12]

91 In contrast, corollary 4 applies to arbitrary strength of interactions if the graph has two cycles and the interactions are not attractive. [sent-304, score-0.181]

92 It is noteworthy that, from corollary 2, the Bethe free energy is non-convex in the situation of corollary 4, while the ﬁxed point is unique. [sent-305, score-0.384]

93 7 Concluding remarks. For binary pairwise models, we show the connection between the edge zeta function and the Bethe free energy in theorem 3, in the proof of which the multi-variable version of Ihara’s formula (theorem 2) is essential. [sent-306, score-0.754]

94 Some recent research on LBP has suggested the importance of the zeta function. [sent-309, score-0.286]

95 In the context of the LDPC code, which is an important application of LBP, Koetter et al [21, 22] show the connection between pseudo-codewords and the edge zeta function. [sent-310, score-0.346]

96 On the LBP for the Gaussian graphical model, Johnson et al [23] give a zeta-like product formula of the partition function. [sent-311, score-0.115]

97 Loopy belief propagation for approximate inference: An empirical study. [sent-323, score-0.127]

98 On the uniqueness of loopy belief propagation ﬁxed points. [sent-341, score-0.291]

99 Stable ﬁxed points of loopy belief propagation are minima of the Bethe free energy. [sent-345, score-0.384]

100 On the properties of the Bethe approximation and loopy belief propagation on binary networks. [sent-388, score-0.197]

similar papers computed by tfidf model

tfidf for this paper:

wordName wordTfidf (topN-words)

[('lbp', 0.629), ('bethe', 0.333), ('zeta', 0.286), ('det', 0.19), ('ij', 0.179), ('mi', 0.172), ('ue', 0.16), ('xj', 0.148), ('bij', 0.147), ('free', 0.144), ('jij', 0.14), ('energy', 0.132), ('xi', 0.129), ('um', 0.104), ('spec', 0.1), ('formula', 0.098), ('uniqueness', 0.094), ('mj', 0.094), ('ihara', 0.08), ('hessian', 0.072), ('loopy', 0.07), ('bi', 0.069), ('cycles', 0.068), ('propagation', 0.067), ('bm', 0.065), ('belief', 0.06), ('prime', 0.057), ('hi', 0.055), ('theorem', 0.051), ('stable', 0.05), ('sgn', 0.048), ('cosh', 0.046), ('fgibbs', 0.046), ('stability', 0.045), ('edge', 0.043), ('corollary', 0.041), ('cn', 0.04), ('damped', 0.04), ('heskes', 0.04), ('determinant', 0.039), ('graph', 0.038), ('xed', 0.037), ('di', 0.036), ('local', 0.035), ('damping', 0.034), ('koetter', 0.034), ('ni', 0.034), ('supplementary', 0.034), ('message', 0.033), ('locally', 0.028), ('minima', 0.027), ('unstable', 0.027), ('directed', 0.027), ('point', 0.026), ('graphs', 0.026), ('niteness', 0.025), ('positive', 0.025), ('tanh', 0.024), ('messages', 0.024), ('covbij', 0.023), ('furtlehner', 0.023), ('pakzad', 0.023), ('perron', 0.023), ('tachikawa', 0.023), ('uel', 0.023), ('umz', 0.023), ('vontobel', 0.023), ('ui', 0.022), ('ei', 0.022), ('uj', 0.021), ('lemma', 0.021), ('eigenvalues', 0.021), ('cycle', 0.021), ('japan', 0.02), ('manifolds', 0.02), ('mooij', 0.02), ('pseudomarginals', 0.02), ('minimum', 0.019), ('closed', 0.019), ('attractive', 0.019), ('pearl', 0.018), ('gibbs', 0.018), ('geometry', 0.018), ('interactions', 0.017), ('al', 0.017), ('reproduces', 0.017), ('linearization', 0.017), ('undirected', 0.017), ('kl', 0.017), ('points', 0.016), ('geodesics', 0.016), ('ek', 0.016), ('derives', 0.016), ('tree', 0.016), ('edges', 0.016), ('ik', 0.016), ('fukumizu', 0.016), ('tokyo', 0.016), ('pij', 0.016), ('material', 0.015)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 1.0000001 103 nips-2009-Graph Zeta Function in the Bethe Free Energy and Loopy Belief Propagation

Author: Yusuke Watanabe, Kenji Fukumizu

2 0.11378017 214 nips-2009-Semi-supervised Regression using Hessian energy with an application to semi-supervised dimensionality reduction

Author: Kwang I. Kim, Florian Steinke, Matthias Hein

Abstract: Semi-supervised regression based on the graph Laplacian suffers from the fact that the solution is biased towards a constant and the lack of extrapolating power. Based on these observations, we propose to use the second-order Hessian energy for semi-supervised regression which overcomes both these problems. If the data lies on or close to a low-dimensional submanifold in feature space, the Hessian energy prefers functions whose values vary linearly with respect to geodesic distance. We first derive the Hessian energy for smooth manifolds and continue to give a stable estimation procedure for the common case where only samples of the underlying manifold are given. The preference of “linear” functions on manifolds renders the Hessian energy particularly suited for the task of semi-supervised dimensionality reduction, where the goal is to find a user-defined embedding function given some labeled points which varies smoothly (and ideally linearly) along the manifold. The experimental results suggest superior performance of our method compared with semi-supervised regression using Laplacian regularization or standard supervised regression techniques applied to this task.

3 0.10255559 97 nips-2009-Free energy score space

Author: Alessandro Perina, Marco Cristani, Umberto Castellani, Vittorio Murino, Nebojsa Jojic

Abstract: A score function induced by a generative model of the data can provide a feature vector of a ﬁxed dimension for each data sample. Data samples themselves may be of differing lengths (e.g., speech segments, or other sequence data), but as a score function is based on the properties of the data generation process, it produces a ﬁxed-length vector in a highly informative space, typically referred to as a “score space”. Discriminative classiﬁers have been shown to achieve higher performance in appropriately chosen score spaces than is achievable by either the corresponding generative likelihood-based classiﬁers, or the discriminative classiﬁers using standard feature extractors. In this paper, we present a novel score space that exploits the free energy associated with a generative model. The resulting free energy score space (FESS) takes into account latent structure of the data at various levels, and can be trivially shown to lead to classiﬁcation performance that at least matches the performance of the free energy classiﬁer based on the same generative model, and the same factorization of the posterior. We also show that in several typical vision and computational biology applications the classiﬁers optimized in FESS outperform the corresponding pure generative approaches, as well as a number of previous approaches to combining discriminating and generative models.

4 0.092209771 35 nips-2009-Approximating MAP by Compensating for Structural Relaxations

Author: Arthur Choi, Adnan Darwiche

Abstract: We introduce a new perspective on approximations to the maximum a posteriori (MAP) task in probabilistic graphical models, that is based on simplifying a given instance, and then tightening the approximation. First, we start with a structural relaxation of the original model. We then infer from the relaxation its deficiencies, and compensate for them. This perspective allows us to identify two distinct classes of approximations. First, we find that max-product belief propagation can be viewed as a way to compensate for a relaxation, based on a particular idealized case for exactness. We identify a second approach to compensation that is based on a more refined idealized case, resulting in a new approximation with distinct properties. We go on to propose a new class of algorithms that, starting with a relaxation, iteratively seeks tighter approximations.

5 0.091222048 31 nips-2009-An LP View of the M-best MAP problem

Author: Menachem Fromer, Amir Globerson

Abstract: We consider the problem of ﬁnding the M assignments with maximum probability in a probabilistic graphical model. We show how this problem can be formulated as a linear program (LP) on a particular polytope. We prove that, for tree graphs (and junction trees in general), this polytope has a particularly simple form and diﬀers from the marginal polytope in a single inequality constraint. We use this characterization to provide an approximation scheme for non-tree graphs, by using the set of spanning trees over such graphs. The method we present puts the M -best inference problem in the context of LP relaxations, which have recently received considerable attention and have proven useful in solving diﬃcult inference problems. We show empirically that our method often ﬁnds the provably exact M best conﬁgurations for problems of high tree-width. A common task in probabilistic modeling is ﬁnding the assignment with maximum probability given a model. This is often referred to as the MAP (maximum a-posteriori) problem. Of particular interest is the case of MAP in graphical models, i.e., models where the probability factors into a product over small subsets of variables. For general models, this is an NP-hard problem [11], and thus approximation algorithms are required. Of those, the class of LP based relaxations has recently received considerable attention [3, 5, 18]. In fact, it has been shown that some problems (e.g., ﬁxed backbone protein design) can be solved exactly via sequences of increasingly tighter LP relaxations [13]. In many applications, one is interested not only in the MAP assignment but also in the M maximum probability assignments [19]. For example, in a protein design problem, we might be interested in the M amino acid sequences that are most stable on a given backbone structure [2]. In cases where the MAP problem is tractable, one can devise tractable algorithms for the M best problem [8, 19]. 
Speciﬁcally, for low tree-width graphs, this can be done via a variant of max-product [19]. However, when ﬁnding MAPs is not tractable, it is much less clear how to approximate the M best case. One possible approach is to use loopy max-product to obtain approximate max-marginals and use those to approximate the M best solutions [19]. However, this is largely a heuristic and does not provide any guarantees in terms of optimality certiﬁcates or bounds on the optimal values. LP approximations to MAP do enjoy such guarantees. Speciﬁcally, they provide upper bounds on the MAP value and optimality certiﬁcates. Furthermore, they often work for graphs with large tree-width [13]. The goal of the current work is to leverage the power of LP relaxations to the M best case. We begin by focusing on the problem of ﬁnding the second best solution. We show how it can be formulated as an LP over a polytope we call the “assignment-excluding marginal polytope”. In the general case, this polytope may require an exponential number of inequalities, but we prove that when the graph is a tree it has a very compact representation. We proceed to use this result to obtain approximations to the second best problem, and show how these can be tightened in various ways. Next, we show how M best assignments can be found by relying on algorithms for 1 second best assignments, and thus our results for the second best case can be used to devise an approximation algorithm for the M best problem. We conclude by applying our method to several models, showing that it often ﬁnds the exact M best assignments. 1 The M-best MAP problem and its LP formulation Consider a function on n variables deﬁned as: f (x1 , . . . , xn ; θ) = θij (xi , xj ) + ij∈E θi (xi ) (1) i∈V where V and E are the vertices and nodes of a graph G with n nodes. We shall be interested in the M assignments with largest f (x; θ) value.1 Denote these by x(1) , . . . 
, x(M) , so that x(1) is the assignment that maximizes f (x; θ), x(2) is the 2nd best assignment, etc. The MAP problem (i.e., ﬁnding x(1) ) can be formulated as an LP as follows [15]. Let µ be a vector of distributions that includes {µij (xi , xj )}ij∈E over edge variables and {µi (xi )}i∈V over nodes. The set of µ that arise from some joint distribution is known as the marginal polytope [15] and is denoted by M(G). Formally: M(G) = {µ | ∃p(x) ∈ ∆ s.t. p(xi , xj ) = µij (xi , xj ) , p(xi ) = µi (xi )} . where ∆ is the set of distributions on x. The MAP problem can then be shown to be equivalent to the following LP:2 max f (x; θ) = max µ · θ , (2) x µ∈M(G) It can be shown that this LP always has a maximizing µ that is a vertex of M(G) and is integral. Furthermore, this µ corresponds to the MAP assignment x(1) . Although the number of variables in this LP is only O(|E| + |V |), the diﬃculty comes from an exponential number of linear inequalities generally required to describe the marginal polytope M(G). We shall ﬁnd it useful to deﬁne a mapping between assignments x and integral vertices of the polytope. Given an integral vertex v ∈ M(G), deﬁne x(v) to be the assignment that maximizes vi (xi ). And, given an assignment z deﬁne v(z) to be the integral vertex in M(G) corresponding to the assignment z. Thus the LP in Eq. 2 will be maximized by v(x(1) ). One simple outer bound of the marginal polytope is the local polytope ML (G), which only enforces pairwise constraints between variables:     µi (xi ) = 1 (3) µij (xi , xj ) = µj (xj ), µij (xi , xj ) = µi (xi ), ML (G) = µ ≥ 0   x x x i j i The LP relaxation is then to maximize µ · θ where µ ∈ ML (G). For tree structured graphs, ML (G) = M(G) [15] and thus the LP relaxation yields the exact MAP x(1) . An LP Formulation for the 2nd -best MAP 2 Assume we found the MAP assignment x(1) and are now interested in ﬁnding x(2) . Is there a simple LP whose solution yields x(2) ? 
We begin by focusing on the case where G is a tree so that the local LP relaxation is exact. We ﬁrst treat the case of a connected tree. To construct an LP whose solution is x(2) , a natural approach is to use the LP for x(1) (i.e., the LP in Eq. 2) but somehow eliminate the solution x(1) using additional constraints. This, however, is somewhat trickier than it sounds. The key diﬃculty is that the new constraints should not generate fractional vertices, so that the resulting LP is still exact. We begin by deﬁning the polytope over which we need to optimize in order to obtain x(2) . 1 2 This is equivalent to ﬁnding P maximum probability assignments for a model p(x) ∝ ef (x;θ) . the P P P We use the notation µ · θ = ij∈E xi ,xj µij (xi , xj )θij (xi , xj ) + i xi µi (xi )θi (xi ) 2 Deﬁnition 1. The assignment-excluding marginal polytope is deﬁned as: ˆ M(G, z) = {µ | ∃p(x) ∈ ∆ s.t. p(z) = 0, p(xi , xj ) = µij (xi , xj ), p(xi ) = µi (xi )} . ˆ M(G, z) is simply the convex hull of all (integral) vectors v(x) for x = z. (4) ˆ The following result shows that optimizing over M(G, x(1) ) will yield the second best soluˆ tion x(2) , so that we refer to M(G, x(1) ) as the second-best marginal polytope. Lemma 1. The 2nd best solution is obtained via the following LP: maxx=x(1) f (x; θ) = maxµ∈M(G,x(1) ) µ · θ. Furthermore, the µ that maximizes the LP on ˆ the right is integral and corresponds to the second-best MAP assignment x(2) . The proof is similar to that of Eq. 2: instead of optimizing over x, we optimize over distributions p(x), while enforcing that p(x(1) ) = 0 so that x(1) is excluded from the maximization. The key question which we now address is how to obtain a simple characterization of ˆ ˆ M(G, z). Intuitively, it would seems that M(G, z) should be “similar” to M(G), such that it can be described as M(G) plus some constraints that “block” the assignment z. 
To illustrate the diﬃculty in ﬁnding such “blocking” constraints, consider the following constraint, originally suggested by Santos [10]: i µi (zi ) ≤ n − 1. This inequality is not satisﬁed by µ = v(z) since v(z) attains the value n for the LHS of the above. Furthermore, for any x = z and µ = v(x), the LHS would be n − 1 or less. Thus, this inequality separates ˆ v(z) from all other integral vertices. One might conclude that we can deﬁne M(G, z) by adding this inequality to M(G). The diﬃculty is that the resulting polytope has fractional vertices,3 and maximizing over it won’t generally yield an integral solution. It turns out that there is a diﬀerent inequality that does yield an exact characterization of ˆ M(G, z) when G is a tree. We now deﬁne this inequality and state our main theorem. Deﬁnition 2. Consider the functional I(µ, z) (which is linear in µ): (1 − di )µi (zi ) + I(µ, z) = i µij (zi , zj ) (5) ij∈E where di is the degree of node i in the tree graph G. ˆ Theorem 1. Adding the single inequality I(µ, z) ≤ 0 to M(G) yields M(G, z). ˆ M(G, z) = {µ | µ ∈ M(G), I(µ, z) ≤ 0 } (6) The theorem is proved in the appendix. Taken together with Lemma 1, it implies that x(2) may be obtained via an LP that is very similar to the MAP-LP, but has an additional constraint. We note the interesting similarity between I(µ, z) and the Bethe entropy [20]. The only diﬀerence is that in Bethe, µi , µij are replaced by H(Xi ), H(Xi , Xj ) respectively.4 The theorem also generalizes to the case where G is not a tree, but we have a junction tree for G. In this case, the theorem still holds if we deﬁne a generalized I(µ, z) inequality as: (1 − dS )µS (zS ) + S∈S µC (zC ) ≤ 0 (7) C∈C where C and S are the junction tree cliques and their separators, respectively, and dS is the number of cliques that intersect on separator S. In this case, the marginal polytope should enforce consistency between marginals µC (zC ) and their separators µS (zS ). 
However, such a characterization requires variables whose cardinality is exponential in the tree-width and is thus tractable only for graphs of low tree-width. In the next section, we address approximations for general graphs.

A corresponding result exists for the case when $G$ is a forest. In this case, the inequality in Eq. 6 is modified to $I(\mu, z) \le |P| - 1$, where $|P|$ denotes the number of connected components of $G$. Interestingly, for a graph without edges, this gives the Santos inequality.

Footnote 3: Consider the case of a single edge between 2 nodes where the MAP assignment is (0, 0). Adding the inequality $\mu_1(0) + \mu_2(0) \le 1$ produces the fractional vertex (0.5, 0.5).

Footnote 4: The connection to Bethe can be more clearly understood from a duality-based proof of Theorem 1. We will cover this in an extended version of the manuscript.

3 2nd best LPs for general graphs - Spanning tree inequalities

When the graph $G$ is not a tree, the marginal polytope $\mathcal{M}(G)$ generally requires an exponential number of inequalities. However, as mentioned above, it does have an exact description in terms of marginals over cliques and separators of a junction tree. Given such marginals on junction tree cliques, we also have an exact characterization of $\hat{\mathcal{M}}(G, z)$ via the constraint in Eq. 7. However, in general, we cannot afford to be exponential in tree-width. Thus a common strategy [15] is to replace $\mathcal{M}(G)$ with an outer bound that enforces consistency between marginals on overlapping sets of variables. The simplest example is $\mathcal{M}_L(G)$ in Eq. 3.

In what follows, we describe an outer-bound approximation scheme for $\hat{\mathcal{M}}(G, z)$. We use $\mathcal{M}_L(G)$ as the approximation for $\mathcal{M}(G)$ (more generally, $\mathcal{M}_L(G)$ can enforce consistency between any set of small regions, e.g., triplets). When $G$ is not a tree, the linear constraint in Eq. 6 will no longer suffice to derive $\hat{\mathcal{M}}(G, z)$. Moreover, direct application of the inequality will incorrectly remove some integral vertices.
An alternative approach is to add inequalities that separate $v(z)$ from the other integral vertices. This will serve to eliminate more and more fractional vertices, and if enough constraints are added, this may result in an integral solution. One obvious family of such constraints are those corresponding to spanning trees in $G$, which have the form of Eq. 5.

Definition 3. Consider any $T$ that is a spanning tree of $G$. Define the functional $I^T(\mu, z)$:

$$I^T(\mu, z) = \sum_i (1 - d_i^T)\,\mu_i(z_i) + \sum_{ij \in T} \mu_{ij}(z_i, z_j) \qquad (8)$$

where $d_i^T$ is the degree of $i$ in $T$. We refer to $I^T(\mu, z) \le 0$ as a spanning tree inequality.

For any sub-tree $T$ of $G$, the corresponding spanning tree inequality separates the vertex $v(z)$ from the other vertices. This can be shown via similar arguments as in the proof of Theorem 1. Note, however, that the resulting polytope may still have fractional vertices.

The above argument shows that any spanning tree provides a separating inequality for $\hat{\mathcal{M}}(G, z)$. In principle, we would like to use as many such inequalities as possible.

Definition 4. The spanning tree assignment-excluding marginal polytope is defined as:

$$\hat{\mathcal{M}}^{ST}_L(G, z) = \{\mu \mid \mu \in \mathcal{M}_L(G),\ \forall \text{ tree } T \subseteq E:\ I^T(\mu, z) \le 0\} \qquad (9)$$

where the ST notation indicates the inclusion of all spanning tree inequalities for $G$ (footnote 5).

Thus, we would actually like to perform the following optimization problem:

$$\max_{\mu \in \hat{\mathcal{M}}^{ST}_L(G, z)} \mu \cdot \theta$$

as an approximation to optimization over $\hat{\mathcal{M}}(G, z)$; i.e., we seek the optimal $\mu$ subject to all spanning tree inequalities for $G$, with the ambition that this $\mu$ be integral and thus provide the non-$z$ MAP assignment, with a certificate of optimality.

Although the number of spanning trees is exponential in $n$, it turns out that all spanning tree inequalities can be used in practice. One way to achieve this is via a cutting plane algorithm [12] that finds the most violated spanning tree inequality and adds it to the LP.
To implement this efficiently, we note that for a particular $\mu$ and a spanning tree $T$, the value of $I^T(\mu, z)$ can be decomposed into a sum over the edges in $T$ (and a $T$-independent constant):

$$I^T(\mu, z) = \sum_i \mu_i(z_i) + \sum_{ij \in T} \left[\mu_{ij}(z_i, z_j) - \mu_i(z_i) - \mu_j(z_j)\right] \qquad (10)$$

The tree maximizing the above is the maximum-weight spanning tree with edge-weights $w_{ij} = \mu_{ij}(z_i, z_j) - \mu_i(z_i) - \mu_j(z_j)$. It can thus be found efficiently.

The cutting plane algorithm proceeds as follows. We start by adding an arbitrary spanning tree. Then, as long as the optimal $\mu$ is fractional, we find the spanning tree inequality that $\mu$ most violates (where this is implemented via the maximum-weight spanning tree). This constraint will necessarily remove $\mu$ from the polytope. If there are no violated inequalities but $\mu$ is still fractional, then spanning tree inequalities do not suffice to find an integral solution (but see below on hypertree constraints to add in this case). In practice, we found that only a relatively small number of inequalities are needed to successfully yield an integral solution, or determine that all such inequalities are already satisfied.

Footnote 5: Note that $\hat{\mathcal{M}}(G, z) \subseteq \hat{\mathcal{M}}^{ST}_L(G, z) \subset \mathcal{M}_L(G)$.

An alternative approach for solving the all spanning-tree problem is to work via the dual. The dual variables roughly correspond to points in the spanning tree polytope [16], optimization over which can be done in polynomial time, e.g., via the ellipsoid algorithm. We do not pursue this here since the cutting plane algorithm performed well in our experiments.

As mentioned earlier, we can exactly characterize $\hat{\mathcal{M}}(G, z)$ using Eq. 7, albeit at a cost exponential in the tree-width of the graph. A practical compromise would be to use inequalities over clique trees of $G$, where the cliques are relatively small, e.g., triplets. The corresponding constraint (Eq. 7 with the small cliques and their separators) will necessarily separate $v(z)$ from the other integral vertices.
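The separation step for spanning tree inequalities described above (Eq. 10) reduces to a maximum-weight spanning tree computation. Below is a minimal sketch of that step using Kruskal's algorithm with a union-find structure; the function name, argument layout, and the triangle example are assumptions for illustration, and the LP solve that would produce $\mu$ is not shown.

```python
def most_violated_spanning_tree(n, edges, node_val, edge_val):
    """Find the spanning tree T maximizing I^T(mu, z) via Eq. 10.

    node_val[i] holds mu_i(z_i); edge_val[(i, j)] holds mu_ij(z_i, z_j).
    Assumes the graph is connected. Returns (I^T value, list of tree edges).
    """
    parent = list(range(n))

    def find(i):
        # Union-find root lookup with path halving.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    # Edge weights w_ij = mu_ij(z_i, z_j) - mu_i(z_i) - mu_j(z_j).
    weight = {e: edge_val[e] - node_val[e[0]] - node_val[e[1]] for e in edges}

    tree = []
    # Kruskal: scan edges from heaviest to lightest, skipping those that close a cycle.
    for i, j in sorted(edges, key=weight.get, reverse=True):
        ri, rj = find(i), find(j)
        if ri != rj:
            parent[ri] = rj
            tree.append((i, j))

    value = sum(node_val) + sum(weight[e] for e in tree)
    return value, tree

# Hypothetical fractional point on a triangle, with z = (0, 0, 0).
edges = [(0, 1), (1, 2), (0, 2)]
node_val = [0.5, 0.5, 0.5]
edge_val = {(0, 1): 0.5, (1, 2): 0.5, (0, 2): 0.0}
value, tree = most_violated_spanning_tree(3, edges, node_val, edge_val)
violated = value > 0  # I^T(mu, z) <= 0 fails, so this tree gives a cut
```

In a cutting plane loop, a positive `value` would mean the inequality for `tree` is added to the LP and the LP is re-solved; a non-positive value would mean no spanning tree inequality is violated at the current $\mu$.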
Finding the maximally violated such inequality is an NP-hard problem, equivalent to a prize collecting Steiner tree problem, but recent work has found that such problems are often exactly solvable in practice [7]. It thus might be practical to include all such trees as constraints using a cutting plane algorithm.

4 From 2nd-best to M-best

Thus far, we only dealt with the 2nd best case. As we show now, it turns out that the 2nd-best formalism can be used to devise an algorithm for M best. We begin by describing an algorithm for the exact M best and then show how it can be used to approximate those via the approximations for 2nd best described above. Fig. 1 describes our scheme, which we call Partitioning for Enumerating Solutions (or PES), for solving the M best problem. The scheme is general and only assumes that MAP-"like" problems can be solved. It is inspired by several pre-existing M best solution schemes [4, 6, 8, 19] but differs from them in highlighting the role of finding a second best solution within a given subspace.

for m ← 1 to M do
    if m = 1 then
        Run MAP solver to obtain the best assignment: x^(1) ≡ arg max f(x; θ)
        CONSTRAINTS_1 ← ∅
    else
        k ← arg max over k' ∈ {1, ..., m−1} of f(y^(k'); θ)   // sub-space containing mth best assignment
        x^(m) ← y^(k)   // mth best assignment
        // A variable choice that distinguishes x^(m) from x^(k):
        (v, a) ← any member of the set {(i, x_i^(m)) : x_i^(m) ≠ x_i^(k)}
        CONSTRAINTS_m ← CONSTRAINTS_k ∪ {x_v = a}   // Eliminate x^(k) (as MAP) from subspace m
        CONSTRAINTS_k ← CONSTRAINTS_k ∪ {x_v ≠ a}   // Eliminate x^(m) (as 2nd-best) from subspace k
        y^(k) ← CalcNextBestSolution(CONSTRAINTS_k, x^(k))
    end
    y^(m) ← CalcNextBestSolution(CONSTRAINTS_m, x^(m))
end
return {x^(m)} for m = 1, ..., M

/* Find next best solution in sub-space defined by CONSTRAINTS */
Function CalcNextBestSolution(CONSTRAINTS, x^(*))
    // x^(*) is the MAP in the sub-space defined by CONSTRAINTS:
    Run MAP solver to obtain the second-best solution: y ≡ arg max of f(x; θ) over x ≠ x^(*) satisfying CONSTRAINTS, and return y.
end

Figure 1: Pseudocode for the PES algorithm.

[Figure 2 plot omitted. Panels: Attractive Grids, Mixed Grids, Hard Protein SCP; each panel reports Ranks and Run-times per algorithm.]

Figure 2: Number of best ranks and normalized run-times for the attractive and mixed grids, and the more difficult protein SCP problems. S, N, and B denote the STRIPES, Nilsson, and BMMF algorithms. Algorithms marked with +R denote that regions of variables were added for those runs.

The modus operandi of the PES algorithm is to efficiently partition the search space while systematically excluding all previously determined assignments. Significantly, any MAP solver can be plugged into it, on the condition that it is capable of solving the arg max in the CalcNextBestSolution subroutine. The correctness of PES can be shown by observing that at the Mth stage, all previous best solutions are excluded from the optimization and no other assignment is excluded. Of note, this simple partitioning scheme is possible due to the observation that the first-best and second-best MAP assignments must differ in the assignment of at least one variable in the graph.

The main computational step of the PES algorithm is to maximize $f(x; \theta)$ subject to $x \neq x^{(*)}$ and $x \in$ CONSTRAINTS (see the CalcNextBestSolution subroutine). The CONSTRAINTS set merely enforces that some of the coordinates of $x$ are either equal to or different from specified values (footnote 6). Within the LP, these can be enforced by setting $\mu_i(x_i = a) = 1$ or $\mu_i(x_i = a) = 0$. It can be shown that if one optimizes $\mu \cdot \theta$ with these constraints and $\mu \in \hat{\mathcal{M}}(G, x^{(*)})$, the solution is integral. Thus, the only element requiring approximation in the general case is the description of $\hat{\mathcal{M}}(G, x^{(*)})$. We choose as this approximation the polytope $\hat{\mathcal{M}}^{ST}_L(G, x^{(*)})$ in Eq. 9.
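The partitioning scheme of Fig. 1 can be sketched in a few lines when exhaustive enumeration stands in for the MAP solver. This is an illustrative sketch only: `pes_m_best`, the constraint encoding as `(variable, value, must_equal)` triples, and the toy score function are assumptions made here, and the brute-force `next_best` replaces the LP-based second-best computation of the text.

```python
from itertools import product

def pes_m_best(score, domains, M):
    """Enumerate the M highest-scoring assignments by PES-style partitioning.

    score: function mapping an assignment tuple to a number.
    domains: list of per-variable state lists.
    """
    def feasible(x, cons):
        # Each constraint is (v, a, must_equal): x_v = a or x_v != a.
        return all((x[v] == a) == must_equal for v, a, must_equal in cons)

    def next_best(cons, exclude):
        # Brute-force stand-in for CalcNextBestSolution.
        cands = [x for x in product(*domains)
                 if feasible(x, cons) and x != exclude]
        return max(cands, key=score) if cands else None

    xs, ys, constraints = [], [], []
    for m in range(M):
        if m == 0:
            xs.append(next_best([], None))      # plain MAP
            constraints.append([])
        else:
            live = [k for k in range(m) if ys[k] is not None]
            if not live:
                break                           # fewer than M assignments exist
            k = max(live, key=lambda j: score(ys[j]))
            x_m = ys[k]
            # A variable on which x^(m) differs from x^(k).
            v = next(i for i in range(len(x_m)) if x_m[i] != xs[k][i])
            a = x_m[v]
            xs.append(x_m)
            constraints.append(constraints[k] + [(v, a, True)])
            constraints[k] = constraints[k] + [(v, a, False)]
            ys[k] = next_best(constraints[k], xs[k])
        ys.append(next_best(constraints[m], xs[m]))
    return xs

# Toy score with distinct values on three binary variables.
toy_score = lambda x: 1 * x[0] + 2 * x[1] + 4 * x[2]
top5 = pes_m_best(toy_score, [[0, 1]] * 3, 5)
```

Note that each subspace keeps exactly one pending second-best candidate, so every iteration after the first makes two calls to the second-best routine, matching the structure of the pseudocode.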
We call the resulting approximation algorithm Spanning TRee Inequalities and Partitioning for Enumerating Solutions, or STRIPES. In the next section, we evaluate this scheme experimentally.

5 Experiments

We compared the performance of STRIPES to the BMMF algorithm [19] and the Lawler/Nilsson algorithm [6, 8]. Nilsson's algorithm is equivalent to PES where the 2nd best assignment is obtained from maximizations within $O(n)$ partitions, so that its runtime is $O(n)$ times the cost of finding a single MAP. Here we approximated each MAP with its LP relaxation (as in STRIPES), so that both STRIPES and Nilsson come with certificates of optimality when their LP solutions are integral. BMMF relies on loopy BP to approximate the M best solutions (footnote 7). We used M = 50 in all experiments. To compare the algorithms, we pooled all their solutions, noting the 50 top probabilities, and then counted the fraction of these that any particular algorithm found (its solution rank). For run-time comparisons, we normalized the times by the longest-running algorithm for each example.

We begin by considering pairwise MRFs on binary grid graphs of size 10 × 10. In the first experiment, we used an Ising model with attractive (submodular) potentials, a setting in which the pairwise LP relaxation is exact [14]. For each grid edge $ij$, we randomly chose $J_{ij} \in [0, 0.5]$, and local potentials were randomized in the range ±0.5. The results for 25 graphs are shown in Fig. 2. Both the STRIPES and Nilsson algorithms obtained the 50 optimal solutions (as learned from their optimality certificates), while BMMF clearly fared less well for some of the graphs. While the STRIPES algorithm took from under 0.5 to 2 minutes to run, the Nilsson algorithm took around 13 minutes. On the other hand, BMMF was quicker, taking around 10 seconds per run, while failing to find a significant portion of the top solutions.
Overall, the STRIPES algorithm was required to employ up to 19 spanning tree inequalities per calculation of second-best solution.

Footnote 6: This is very different from the second best constraint, since setting $x_1 \neq 1$ blocks all assignments with this value, as opposed to setting $x \neq \mathbf{1}$, which blocks only the assignment with all ones.

Footnote 7: For BMMF, we used the C implementation at http://www.cs.huji.ac.il/~talyam/inference.html. The LPs for STRIPES and Nilsson were solved using CPLEX.

Next, we studied Ising models with mixed interaction potentials (with $J_{ij}$ and the local potentials randomly chosen in [−0.5, 0.5]). For almost all of the 25 models, none of the three algorithms was able to find the top solutions. Thus, we added regions of triplets (two for every grid face) to tighten the LP relaxation (for STRIPES and Nilsson) and to perform GBP instead of BP (for BMMF). This resulted in STRIPES and Nilsson always provably finding the optimal solutions, and BMMF mostly finding these solutions (Fig. 2). For these more difficult grids, however, STRIPES was the fastest of the algorithms, taking 0.5 to 5 minutes. On the other hand, the Nilsson and BMMF algorithms took 18 minutes and 2.5 to 7 minutes, respectively. STRIPES added up to 23 spanning tree inequalities per iteration.

The protein side-chain prediction (SCP) problem is to predict the placement of amino acid side-chains given a protein backbone [2, 18]. Minimization of a protein energy function corresponds to finding a MAP assignment for a pairwise MRF [19]. We employed the dataset of [18] (up to 45 states per variable, mean approximate tree-width 50), running all algorithms to calculate the optimal side-chain configurations. For 315 of 370 problems in the dataset, the first MAP solution was obtained directly as a result of the LP relaxation having an integral solution ("easy" problems).
STRIPES provably found the subsequent top 50 solutions within 4.5 hours for all but one of these cases (up to 8 spanning trees per calculation), and BMMF found the same 50 solutions for each case within 0.5 hours; note that only STRIPES provides a certificate of optimality for these solutions. On the other hand, only for 146 of the 315 problems was the Nilsson method able to complete within five days; thus, we do not compare its performance here. For the remaining 55 ("hard") problems (Fig. 2), we added problem-specific triplet regions using the MPLP algorithm [13]. We then ran the STRIPES algorithm to find the optimal solutions. Surprisingly, it was able to exactly find the 50 top solutions for all cases, using up to 4 standard spanning tree inequalities per second-best calculation. The STRIPES run-times for these problems ranged from 6 minutes to 23 hours. On the other hand, whether running BMMF without these regions (BP) or with the regions (GBP), it did not perform as well as STRIPES in terms of the number of high-ranking solutions or its speed. To summarize, STRIPES provably found the top 50 solutions for 369 of the 370 protein SCP problems.

6 Conclusion

In this work, we present a novel combinatorial object $\hat{\mathcal{M}}(G, z)$ and show its utility in obtaining the M best MAP assignments. We provide a simple characterization of it for tree structured graphs, and show how it can be used for approximations in non-tree graphs. As with the marginal polytope, many interesting questions arise about the properties of $\hat{\mathcal{M}}(G, z)$. For example, in which non-tree cases can we provide a compact characterization (e.g., as for the cut-polytope for planar graphs [1])? Another compelling question is in which problems the spanning tree inequalities are provably optimal.

An interesting generalization of our method is to predict diverse solutions satisfying some local measure of "distance" from each other, e.g., as in [2].

Here we studied the polytope that results from excluding one assignment.
An intriguing question is to characterize the polytope that excludes M assignments. We have found that it does not simply correspond to adding M constraints $I(\mu, z^i) \le 0$ for $i = 1, \ldots, M$, so its geometry is apparently more complicated than that of $\hat{\mathcal{M}}(G, z)$.

Here we used LP solvers to solve for $\mu$. Such generic solvers could be slow for large-scale problems. However, in recent years, specialized algorithms have been suggested for solving MAP-LP relaxations [3, 5, 9, 17]. These use the special form of the constraints to obtain local updates and more scalable algorithms. We intend to apply these schemes to our method.

Finally, our empirical results show that our method indeed leverages the power of LP relaxations and yields exact M best optimal solutions for problems with large tree-width.

Acknowledgements

We thank Nati Linial for his helpful discussions and Chen Yanover and Talya Meltzer for their insight and help in running BMMF. We also thank the anonymous reviewers for their useful advice.

A Proof of Theorem 1

Recall that for any $\mu \in \mathcal{M}(G)$, there exists a probability density $p(x)$ s.t. $\mu = \sum_x p(x)\,v(x)$. Denote $p_\mu(z)$ as the minimal value of $p(z)$ among all $p(x)$ that give $\mu$. We prove that $p_\mu(z) = \max(0, I(\mu, z))$, from which the theorem follows (since $p_\mu(z) = 0$ iff $\mu \in \hat{\mathcal{M}}(G, z)$).

The proof is by induction on $n$. For $n = 1$, the node has degree 0, so $I(\mu, z) = \mu_1(z_1)$. Clearly, $p_\mu(z) = \mu_1(z_1)$, so $p_\mu(z) = I(\mu, z)$. For $n > 1$, there must exist a leaf in $G$ (assume that its index is $n$ and its neighbor's is $n - 1$). Denote $\hat{G}$ as the tree obtained by removing node $n$ and its edge with $n - 1$. For any assignment $x$, denote $\hat{x}$ as the corresponding sub-assignment for the first $n - 1$ variables. Also, any $\mu$ can be derived by adding appropriate coordinates to a unique $\hat{\mu} \in \mathcal{M}(\hat{G})$. For an integral vertex $\mu = v(x)$, denote its projected $\hat{\mu}$ as $\hat{v}(\hat{x})$. Denote by $\hat{I}(\hat{\mu}, \hat{z})$ the functional in Eq. 5 applied to $\hat{G}$.
For any $\mu$ and its projected $\hat{\mu}$, it can be seen that:

$$I(\mu, z) = \hat{I}(\hat{\mu}, \hat{z}) - \alpha \qquad (11)$$

where we define $\alpha = \sum_{x_n \neq z_n} \mu_{n-1,n}(z_{n-1}, x_n)$ (so $0 \le \alpha \le 1$). The inductive assumption gives a $\hat{p}(\hat{x})$ that has marginals $\hat{\mu}$ and also $\hat{p}(\hat{z}) = \max(0, \hat{I}(\hat{\mu}, \hat{z}))$. We next use $\hat{p}(\hat{x})$ to construct a $p(x)$ that has marginals $\mu$ and the desired minimal $p_\mu(z)$. Consider three cases:

I. $I(\mu, z) \le 0$ and $\hat{I}(\hat{\mu}, \hat{z}) \le 0$. From the inductive assumption, $\hat{p}_{\hat{\mu}}(\hat{z}) = 0$, so we define:

$$p(x) = \hat{p}(\hat{x})\,\frac{\mu_{n-1,n}(x_{n-1}, x_n)}{\mu_{n-1}(x_{n-1})} \qquad (12)$$

which indeed marginalizes to $\mu$, and $p(z) = 0$ so that $p_\mu(z) = 0$ as required. If $\mu_{n-1}(x_{n-1}) = 0$, then $\hat{p}(\hat{x})$ is necessarily 0, in which case we define $p(x) = 0$. Note that this construction is identical to that used in proving that $\mathcal{M}_L(G) = \mathcal{M}(G)$ for a tree graph $G$.

II. $I(\mu, z) > 0$. Based on Eq. 11 and $\alpha \ge 0$, we have $\hat{I}(\hat{\mu}, \hat{z}) > 0$. Applying the inductive assumption to $\hat{\mu}$, we obtain $\hat{I}(\hat{\mu}, \hat{z}) = \hat{p}_{\hat{\mu}}(\hat{z}) > 0$. Now, define $p(x)$ so that $p(z) = I(\mu, z)$:

$$p(x) = \begin{cases}
\text{as in Eq. 12} & x_{n-1} \neq z_{n-1} \\
0 & \exists\, l \le n-2:\ x_l \neq z_l,\quad x_{n-1} = z_{n-1},\quad x_n \neq z_n \\
\hat{p}(\hat{x}) & \exists\, l \le n-2:\ x_l \neq z_l,\quad x_{n-1} = z_{n-1},\quad x_n = z_n \\
\mu_{n-1,n}(z_{n-1}, x_n) & \forall\, l \le n-2:\ x_l = z_l,\quad x_{n-1} = z_{n-1},\quad x_n \neq z_n \\
I(\mu, z) & \forall\, l \le n-2:\ x_l = z_l,\quad x_{n-1} = z_{n-1},\quad x_n = z_n
\end{cases}$$

Simple algebra shows that $p(x)$ is non-negative and has $\mu$ as marginals. We now show that $p(z)$ is minimal. Based on the inductive assumption and Eq. 11, it can easily be shown that $I(v(z), z) = 1$ and $I(v(x), z) \le 0$ for $x \neq z$. For any $p(x)$ s.t. $\mu = \sum_x p(x)\,v(x)$, from linearity, $I(\mu, z) = p(z) + \sum_{x \neq z} p(x)\,I(v(x), z) \le p(z)$ (since $I(v(x), z) \le 0$ for $x \neq z$). Since the $p(z)$ we define achieves this lower bound, it is clearly minimal.

III. $I(\mu, z) \le 0$ but $\hat{I}(\hat{\mu}, \hat{z}) > 0$. Applying the inductive assumption to $\hat{\mu}$, we see that $\hat{p}_{\hat{\mu}}(\hat{z}) = \hat{I}(\hat{\mu}, \hat{z}) > 0$; Eq. 11 implies $\alpha - \hat{I}(\hat{\mu}, \hat{z}) \ge 0$. Define $\beta = \mu_{n-1}(z_{n-1}) - \hat{p}_{\hat{\mu}}(\hat{z})$, which is non-negative since $\mu_{n-1}(z_{n-1}) = \hat{\mu}_{n-1}(\hat{z}_{n-1})$ and $\hat{p}$ marginalizes to $\hat{\mu}$. Define $p(x)$ as:

$$p(x) = \begin{cases}
\text{as in Eq. 12} & x_{n-1} \neq z_{n-1} \\
\hat{p}(\hat{x})\,\dfrac{\mu_{n-1,n}(z_{n-1}, x_n)}{\beta}\,\dfrac{\alpha - \hat{I}(\hat{\mu}, \hat{z})}{\alpha} & \exists\, l \le n-2:\ x_l \neq z_l,\quad x_{n-1} = z_{n-1},\quad x_n \neq z_n \\
\hat{p}(\hat{x})\,\dfrac{\mu_{n-1,n}(z_{n-1}, z_n)}{\beta} & \exists\, l \le n-2:\ x_l \neq z_l,\quad x_{n-1} = z_{n-1},\quad x_n = z_n \\
\hat{I}(\hat{\mu}, \hat{z})\,\dfrac{\mu_{n-1,n}(z_{n-1}, x_n)}{\alpha} & \forall\, l \le n-2:\ x_l = z_l,\quad x_{n-1} = z_{n-1},\quad x_n \neq z_n \\
0 & \forall\, l \le n-2:\ x_l = z_l,\quad x_{n-1} = z_{n-1},\quad x_n = z_n
\end{cases}$$

which indeed marginalizes to $\mu$, and $p(z) = 0$ so that $p_\mu(z) = 0$, as required.

References

[1] F. Barahona. On cuts and matchings in planar graphs. Math. Program., 60(1):53–68, 1993.
[2] M. Fromer and C. Yanover. Accurate prediction for atomic-level protein design and its application in diversifying the near-optimal sequence space. Proteins: Structure, Function, and Bioinformatics, 75:682–705, 2009.
[3] A. Globerson and T. Jaakkola. Fixing max-product: Convergent message passing algorithms for MAP LP-relaxations. In J. Platt, D. Koller, Y. Singer, and S. Roweis, editors, Advances in Neural Information Processing Systems 21. MIT Press, Cambridge, MA, 2007.
[4] E. Kloppmann, G. M. Ullmann, and T. Becker. An extended dead-end elimination algorithm to determine gap-free lists of low energy states. Journal of Comp. Chem., 28:2325–2335, 2007.
[5] N. Komodakis and N. Paragios. Beyond loose LP-relaxations: Optimizing MRFs by repairing cycles. In D. Forsyth, P. Torr, and A. Zisserman, editors, ECCV, pages 806–820, Heidelberg, Germany, 2008. Springer.
[6] E. L. Lawler. A procedure for computing the K best solutions to discrete optimization problems and its application to the shortest path problem. Management Science, 18(7):401–405, 1972.
[7] I. Ljubic, R. Weiskircher, U. Pferschy, G. W. Klau, P. Mutzel, and M. Fischetti. An algorithmic framework for the exact solution of the prize-collecting Steiner tree problem. Mathematical Programming, 105:427–449, Feb 2006.
[8] D. Nilsson. An efficient algorithm for finding the M most probable configurations in probabilistic expert systems. Statistics and Computing, 8:159–173, Jun 1998.
[9] P. Ravikumar, A. Agarwal, and M. Wainwright. Message-passing for graph-structured linear programs: proximal projections, convergence and rounding schemes. In Proc.
of the 25th international conference on Machine learning, pages 800–807, New York, NY, USA, 2008. ACM.
[10] E. Santos. On the generation of alternative explanations with implications for belief revision. In Proc. of the 7th Annual Conference on Uncertainty in Artificial Intelligence, 1991.
[11] Y. Shimony. Finding the MAPs for belief networks is NP-hard. Artificial Intelligence, 68(2):399–410, 1994.
[12] D. Sontag and T. Jaakkola. New outer bounds on the marginal polytope. In J. Platt, D. Koller, Y. Singer, and S. Roweis, editors, Advances in Neural Information Processing Systems 20, pages 1393–1400. MIT Press, Cambridge, MA, 2007.
[13] D. Sontag, T. Meltzer, A. Globerson, T. Jaakkola, and Y. Weiss. Tightening LP relaxations for MAP using message passing. In Proc. of the 24th Annual Conference on Uncertainty in Artificial Intelligence, pages 503–510, 2008.
[14] B. Taskar, S. Lacoste-Julien, and M. I. Jordan. Structured prediction, dual extragradient and Bregman projections. J. Mach. Learn. Res., 7:1627–1653, 2006.
[15] M. Wainwright and M. Jordan. Graphical models, exponential families, and variational inference. Found. Trends Mach. Learn., 1(1-2):1–305, 2008.
[16] M. J. Wainwright, T. Jaakkola, and A. S. Willsky. A new class of upper bounds on the log partition function. IEEE Transactions on Information Theory, 51(7):2313–2335, 2005.
[17] T. Werner. A linear programming approach to max-sum problem: A review. IEEE Trans. Pattern Anal. Mach. Intell., 29(7):1165–1179, 2007.
[18] C. Yanover, T. Meltzer, and Y. Weiss. Linear programming relaxations and belief propagation – an empirical study. Journal of Machine Learning Research, 7:1887–1907, 2006.
[19] C. Yanover and Y. Weiss. Finding the M most probable configurations using loopy belief propagation. In Advances in Neural Information Processing Systems 16. MIT Press, Cambridge, MA, 2004.
[20] J. Yedidia, W. T. Freeman, and Y. Weiss. Constructing free-energy approximations and generalized belief propagation algorithms. IEEE Trans. on Information Theory, 51(7):2282–2312, 2005.

6 0.085129589 141 nips-2009-Local Rules for Global MAP: When Do They Work ?

7 0.077826828 229 nips-2009-Statistical Analysis of Semi-Supervised Learning: The Limit of Infinite Unlabelled Data

8 0.068784669 187 nips-2009-Particle-based Variational Inference for Continuous Systems

9 0.057679985 8 nips-2009-A Fast, Consistent Kernel Two-Sample Test

10 0.054451741 1 nips-2009-$L 1$-Penalized Robust Estimation for a Class of Inverse Problems Arising in Multiview Geometry

11 0.05402983 256 nips-2009-Which graphical models are difficult to learn?

12 0.048652478 129 nips-2009-Learning a Small Mixture of Trees

13 0.044124689 224 nips-2009-Sparse and Locally Constant Gaussian Graphical Models

14 0.042165171 10 nips-2009-A Gaussian Tree Approximation for Integer Least-Squares

15 0.041721381 120 nips-2009-Kernels and learning curves for Gaussian process regression on random graphs

16 0.041412938 146 nips-2009-Manifold Regularization for SIR with Rate Root-n Convergence

17 0.040193588 118 nips-2009-Kernel Choice and Classifiability for RKHS Embeddings of Probability Distributions

18 0.040148616 173 nips-2009-Nonparametric Greedy Algorithms for the Sparse Learning Problem

19 0.039638415 246 nips-2009-Time-Varying Dynamic Bayesian Networks

20 0.037345558 111 nips-2009-Hierarchical Modeling of Local Image Features through $L p$-Nested Symmetric Distributions

similar papers computed by lsi model

lsi for this paper:

topicId topicWeight

[(0, -0.13), (1, 0.047), (2, -0.02), (3, 0.039), (4, -0.028), (5, -0.045), (6, 0.053), (7, 0.03), (8, -0.029), (9, -0.077), (10, -0.102), (11, 0.066), (12, 0.056), (13, -0.008), (14, 0.044), (15, -0.02), (16, -0.033), (17, -0.043), (18, 0.033), (19, -0.007), (20, -0.038), (21, -0.045), (22, 0.002), (23, -0.047), (24, 0.011), (25, 0.061), (26, 0.222), (27, 0.007), (28, -0.106), (29, 0.098), (30, 0.001), (31, 0.041), (32, 0.133), (33, 0.018), (34, -0.039), (35, 0.017), (36, -0.049), (37, 0.01), (38, -0.059), (39, -0.051), (40, 0.098), (41, 0.053), (42, 0.087), (43, -0.021), (44, 0.061), (45, 0.038), (46, 0.01), (47, 0.036), (48, -0.056), (49, 0.014)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.96087605 103 nips-2009-Graph Zeta Function in the Bethe Free Energy and Loopy Belief Propagation

Author: Yusuke Watanabe, Kenji Fukumizu

2 0.7742877 35 nips-2009-Approximating MAP by Compensating for Structural Relaxations

Author: Arthur Choi, Adnan Darwiche

3 0.61629027 146 nips-2009-Manifold Regularization for SIR with Rate Root-n Convergence

Author: Wei Bian, Dacheng Tao

Abstract: In this paper, we study the manifold regularization for the Sliced Inverse Regression (SIR). The manifold regularization improves the standard SIR in two aspects: 1) it encodes the local geometry for SIR and 2) it enables SIR to deal with transductive and semi-supervised learning problems. We prove that the proposed graph Laplacian based regularization is convergent at rate root-n. The projection directions of the regularized SIR are optimized by using a conjugate gradient method on the Grassmann manifold. Experimental results support our theory.

4 0.60373938 31 nips-2009-An LP View of the M-best MAP problem

Author: Menachem Fromer, Amir Globerson

5 0.59035701 214 nips-2009-Semi-supervised Regression using Hessian energy with an application to semi-supervised dimensionality reduction

Author: Kwang I. Kim, Florian Steinke, Matthias Hein

6 0.53005451 141 nips-2009-Local Rules for Global MAP: When Do They Work ?

7 0.52041739 10 nips-2009-A Gaussian Tree Approximation for Integer Least-Squares

8 0.45040968 129 nips-2009-Learning a Small Mixture of Trees

9 0.44578466 229 nips-2009-Statistical Analysis of Semi-Supervised Learning: The Limit of Infinite Unlabelled Data

10 0.43001771 34 nips-2009-Anomaly Detection with Score functions based on Nearest Neighbor Graphs

11 0.42947185 187 nips-2009-Particle-based Variational Inference for Continuous Systems

12 0.39398512 256 nips-2009-Which graphical models are difficult to learn?

13 0.38799644 97 nips-2009-Free energy score space

14 0.38344851 82 nips-2009-Entropic Graph Regularization in Non-Parametric Semi-Supervised Classification

15 0.37902877 131 nips-2009-Learning from Neighboring Strokes: Combining Appearance and Context for Multi-Domain Sketch Recognition

16 0.37649417 106 nips-2009-Heavy-Tailed Symmetric Stochastic Neighbor Embedding

17 0.37134206 7 nips-2009-A Data-Driven Approach to Modeling Choice

18 0.36487257 39 nips-2009-Bayesian Belief Polarization

19 0.36116764 36 nips-2009-Asymptotic Analysis of MAP Estimation via the Replica Method and Compressed Sensing

20 0.35964042 1 nips-2009-$L 1$-Penalized Robust Estimation for a Class of Inverse Problems Arising in Multiview Geometry

similar papers computed by lda model

lda for this paper:

topicId topicWeight

[(7, 0.016), (24, 0.055), (25, 0.068), (35, 0.059), (36, 0.083), (39, 0.045), (56, 0.304), (58, 0.08), (61, 0.023), (66, 0.013), (71, 0.047), (81, 0.017), (86, 0.057), (91, 0.021)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.77279258 103 nips-2009-Graph Zeta Function in the Bethe Free Energy and Loopy Belief Propagation

Author: Yusuke Watanabe, Kenji Fukumizu

2 0.70462996 9 nips-2009-A Game-Theoretic Approach to Hypergraph Clustering

Author: Samuel R. Bulò, Marcello Pelillo

Abstract: Hypergraph clustering refers to the process of extracting maximally coherent groups from a set of objects using high-order (rather than pairwise) similarities. Traditional approaches to this problem are based on the idea of partitioning the input data into a user-deﬁned number of classes, thereby obtaining the clusters as a by-product of the partitioning process. In this paper, we provide a radically different perspective to the problem. In contrast to the classical approach, we attempt to provide a meaningful formalization of the very notion of a cluster and we show that game theory offers an attractive and unexplored perspective that serves well our purpose. Speciﬁcally, we show that the hypergraph clustering problem can be naturally cast into a non-cooperative multi-player “clustering game”, whereby the notion of a cluster is equivalent to a classical game-theoretic equilibrium concept. From the computational viewpoint, we show that the problem of ﬁnding the equilibria of our clustering game is equivalent to locally optimizing a polynomial function over the standard simplex, and we provide a discrete-time dynamics to perform this optimization. Experiments are presented which show the superiority of our approach over state-of-the-art hypergraph clustering techniques.

3 0.49807537 174 nips-2009-Nonparametric Latent Feature Models for Link Prediction

Author: Kurt Miller, Michael I. Jordan, Thomas L. Griffiths

Abstract: As the availability and importance of relational data, such as the friendships summarized on a social networking website, increases, it becomes increasingly important to have good models for such data. The kinds of latent structure that have been considered for use in predicting links in such networks have been relatively limited. In particular, the machine learning community has focused on latent class models, adapting Bayesian nonparametric methods to jointly infer how many latent classes there are while learning which entities belong to each class. We pursue a similar approach with a richer kind of latent variable, latent features, using a Bayesian nonparametric approach to simultaneously infer the number of features at the same time we learn which entities have each feature. Our model combines these inferred features with known covariates in order to perform link prediction. We demonstrate that the greater expressiveness of this approach allows us to improve performance on three datasets.

4 0.49718642 162 nips-2009-Neural Implementation of Hierarchical Bayesian Inference by Importance Sampling

Author: Lei Shi, Thomas L. Griffiths

Abstract: The goal of perception is to infer the hidden states in the hierarchical process by which sensory data are generated. Human behavior is consistent with the optimal statistical solution to this problem in many tasks, including cue combination and orientation detection. Understanding the neural mechanisms underlying this behavior is of particular importance, since probabilistic computations are notoriously challenging. Here we propose a simple mechanism for Bayesian inference which involves averaging over a few feature detection neurons which fire at a rate determined by their similarity to a sensory stimulus. This mechanism is based on a Monte Carlo method known as importance sampling, commonly used in computer science and statistics. Moreover, a simple extension to recursive importance sampling can be used to perform hierarchical Bayesian inference. We identify a scheme for implementing importance sampling with spiking neurons, and show that this scheme can account for human behavior in cue combination and the oblique effect.

5 0.49608454 158 nips-2009-Multi-Label Prediction via Sparse Infinite CCA

Author: Piyush Rai, Hal Daume

Abstract: Canonical Correlation Analysis (CCA) is a useful technique for modeling dependencies between two (or more) sets of variables. Building upon the recently suggested probabilistic interpretation of CCA, we propose a nonparametric, fully Bayesian framework that can automatically select the number of correlation components, and effectively capture the sparsity underlying the projections. In addition, given (partially) labeled data, our algorithm can also be used as a (semi)supervised dimensionality reduction technique, and can be applied to learn useful predictive features in the context of learning a set of related tasks. Experimental results demonstrate the efficacy of the proposed approach for both CCA as a stand-alone problem, and when applied to multi-label prediction.

6 0.49603042 169 nips-2009-Nonlinear Learning using Local Coordinate Coding

7 0.49559864 113 nips-2009-Improving Existing Fault Recovery Policies

8 0.49351847 97 nips-2009-Free energy score space

9 0.4927288 187 nips-2009-Particle-based Variational Inference for Continuous Systems

10 0.49223149 254 nips-2009-Variational Gaussian-process factor analysis for modeling spatio-temporal data

11 0.49173012 18 nips-2009-A Stochastic approximation method for inference in probabilistic graphical models

12 0.49049711 41 nips-2009-Bayesian Source Localization with the Multivariate Laplace Prior

13 0.49041039 215 nips-2009-Sensitivity analysis in HMMs with application to likelihood maximization

14 0.4898017 3 nips-2009-AUC optimization and the two-sample problem

15 0.488877 59 nips-2009-Construction of Nonparametric Bayesian Models from Parametric Bayes Equations

16 0.48878863 250 nips-2009-Training Factor Graphs with Reinforcement Learning for Efficient MAP Inference

17 0.48860773 131 nips-2009-Learning from Neighboring Strokes: Combining Appearance and Context for Multi-Domain Sketch Recognition

18 0.48832884 100 nips-2009-Gaussian process regression with Student-t likelihood

19 0.48780739 155 nips-2009-Modelling Relational Data using Bayesian Clustered Tensor Factorization

20 0.48740661 132 nips-2009-Learning in Markov Random Fields using Tempered Transitions