jmlr jmlr2013 jmlr2013-120 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Qiang Liu, Alexander Ihler
Abstract: The marginal maximum a posteriori probability (MAP) estimation problem, which calculates the mode of the marginal posterior distribution of a subset of variables with the remaining variables marginalized, is an important inference problem in many models, such as those with hidden variables or uncertain parameters. Unfortunately, marginal MAP can be NP-hard even on trees, and has attracted less attention in the literature compared to the joint MAP (maximization) and marginalization problems. We derive a general dual representation for marginal MAP that naturally integrates the marginalization and maximization operations into a joint variational optimization problem, making it possible to easily extend most or all variational-based algorithms to marginal MAP. In particular, we derive a set of “mixed-product” message passing algorithms for marginal MAP, whose form is a hybrid of max-product, sum-product and a novel “argmax-product” message updates. We also derive a class of convergent algorithms based on proximal point methods, including one that transforms the marginal MAP problem into a sequence of standard marginalization problems. Theoretically, we provide guarantees under which our algorithms give globally or locally optimal solutions, and provide novel upper bounds on the optimal objectives. Empirically, we demonstrate that our algorithms significantly outperform the existing approaches, including a state-of-the-art algorithm based on local search methods. Keywords: graphical models, message passing, belief propagation, variational methods, maximum a posteriori, marginal-MAP, hidden variable models
Reference: text
sentIndex sentText sentNum sentScore
1 Unfortunately, marginal MAP can be NP-hard even on trees, and has attracted less attention in the literature compared to the joint MAP (maximization) and marginalization problems. [sent-5, score-0.265]
2 We derive a general dual representation for marginal MAP that naturally integrates the marginalization and maximization operations into a joint variational optimization problem, making it possible to easily extend most or all variational-based algorithms to marginal MAP. [sent-6, score-0.605]
3 In particular, we derive a set of “mixed-product” message passing algorithms for marginal MAP, whose form is a hybrid of max-product, sum-product and a novel “argmax-product” message updates. [sent-7, score-0.501]
4 We also derive a class of convergent algorithms based on proximal point methods, including one that transforms the marginal MAP problem into a sequence of standard marginalization problems. [sent-8, score-0.503]
5 Keywords: graphical models, message passing, belief propagation, variational methods, maximum a posteriori, marginal-MAP, hidden variable models. [sent-11, score-0.306]
6 Finally, the main focus of this work is on marginal MAP, a type of mixed-inference problem that seeks a partial configuration of variables that maximizes those variables’ marginal probability, with the remaining variables summed out. [sent-19, score-0.292]
7 In practice, it is common to over-use the simpler joint MAP or marginalization even when marginal MAP would be more appropriate. [sent-37, score-0.265]
8 1 Contributions. We reformulate the mixed-inference problem as a joint maximization of a free energy objective that extends the well-known log-partition function duality, making it possible to easily extend essentially arbitrary variational algorithms to marginal MAP. [sent-40, score-0.464]
9 In particular, we propose a novel “mixed-product” BP algorithm that is a hybrid of max-product, sum-product, and a special “argmax-product” message updates, as well as a convergent proximal point algorithm that works by iteratively solving pure (or annealed) marginalization tasks. [sent-41, score-0.521]
10 2 Related Work Expectation-maximization (EM) or variational EM provide one straightforward approach for marginal MAP, by viewing the sum nodes as hidden variables and the max nodes as parameters to be estimated; however, EM is prone to getting stuck at sub-optimal configurations. [sent-53, score-0.525]
11 Jiang et al. (2011) can be viewed as an approximation of the marginal MAP problem that exchanges the order of sum and max operators. [sent-67, score-0.244]
12 To the best of our knowledge, our work is the first general variational framework for marginal MAP, and provides the first strong optimality guarantees. [sent-72, score-0.297]
13 We then introduce a novel variational dual representation for marginal MAP in Section 3, and propose analogues of the Bethe and tree-reweighted approximations for marginal MAP in Section 4. [sent-74, score-0.443]
14 A class of “mixed-product” message passing algorithms is proposed and analyzed in Section 5 and convergent alternatives are proposed in Section 6 based on proximal point methods. [sent-75, score-0.457]
15 The factorization structure of p(x) can be represented by an undirected graph G = (V, E), where each node i ∈ V maps to a variable xi, and each edge (i j) ∈ E corresponds to two variables xi and xj that co-appear in some factor function ψα, that is, {i, j} ⊆ α. [sent-88, score-0.263]
16 1 Marginal Polytope. The marginal polytope is a key concept in variational inference. [sent-98, score-0.33]
17 We define the marginal polytope M to be the set of local marginal probabilities τ = {τα(xα) : α ∈ I} that are extensible to a valid joint distribution, that is, M = {τ : ∃ a joint distribution q(x) s.t. τα(xα) = ∑_{x\α} q(x) for all α ∈ I}. [sent-99, score-0.451]
18 However, (1) provides a framework for deriving efficient approximate inference algorithms by approximating both the marginal polytope and the entropy (Wainwright and Jordan, 2008). [sent-112, score-0.284]
19 The well-known loopy belief propagation (BP) algorithm of Pearl (1988) can be interpreted as a fixed point algorithm to optimize the Bethe free energy in (3) on the locally consistent polytope L (Yedidia et al. [sent-123, score-0.328]
20 The tree reweighted (TRW) free energy is a convex surrogate of the Bethe free energy (Wainwright et al. [sent-126, score-0.332]
21 A message passing algorithm similar to loopy BP, called tree reweighted BP, can be derived as a fixed point algorithm for solving the convex optimization in (4). [sent-131, score-0.328]
22 4 Mean-Field-Based Methods. Mean-field-based methods are another set of approximate inference algorithms, which work by restricting M to a set of tractable distributions, on which both the marginal polytope and the joint entropy are tractable. [sent-134, score-0.335]
23 Note that the joint entropy H(τ) for any τ ∈ M_mf decomposes into the sum of singleton entropies Hi(τ) of the marginal distributions τi(xi). [sent-136, score-0.301]
24 That is, x∗ = arg max_x θ(x) and Φ∞(θ) = max_x θ(x), where x∗ is a MAP configuration and Φ∞(θ) the optimal energy value. [sent-142, score-0.262]
25 The marginalization over xA destroys the conditional dependency structure in the marginal distribution p(xB ), causing an intractable maximization problem over xB . [sent-156, score-0.28]
26 The marginal MAP problem seeks a partial configuration x∗_B that has the maximum marginal probability p(xB) = ∑_{xA} p(x), where A is the set of sum nodes to be marginalized out, and B the set of max nodes to be optimized. [sent-161, score-0.534]
27 To facilitate developing our duality results, we formulate marginal MAP in terms of the exponential family representation, ΦAB(θ) = max_{xB} Q(xB; θ), where Q(xB; θ) = log ∑_{xA} exp[θ(x)], (7) and the maximum point x∗_B of Q(xB; θ) is the marginal MAP solution. [sent-163, score-0.363]
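The formulation in (7) can be checked directly by enumeration on small models. Below is a minimal brute-force sketch (not from the paper) that evaluates Q(xB; θ) for every assignment of the max variables; the function names, the callable θ, and the binary-domain assumption are illustrative only.

```python
import itertools
import math

def marginal_map_bruteforce(theta, sum_vars, max_vars):
    """Enumerate x_B (max_vars), summing out x_A (sum_vars); tiny models only."""
    best_xB, best_Q = None, -math.inf
    for xB in itertools.product([0, 1], repeat=len(max_vars)):
        total = 0.0
        for xA in itertools.product([0, 1], repeat=len(sum_vars)):
            x = dict(zip(max_vars, xB))
            x.update(zip(sum_vars, xA))
            total += math.exp(theta(x))   # sum out the x_A variables
        Q = math.log(total)               # Q(x_B; theta) in (7)
        if Q > best_Q:
            best_xB, best_Q = xB, Q
    return best_xB, best_Q                # argmax x_B^* and Phi_AB(theta)
```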
28 A classic example is shown in Figure 1, where marginal MAP is NP-hard even on a tree structured graph (Koller and Friedman, 2009). [sent-165, score-0.271]
29 The main difficulty arises because the max and sum operators do not commute, which restricts feasible elimination orders to those with all the sum nodes eliminated before any max nodes. [sent-166, score-0.293]
30 In the worst case, marginalizing the sum nodes xA may destroy any conditional independence among the max nodes xB , making it difficult to represent or optimize Q(xB ; θ), even when the sum part alone is tractable (such as when the nodes in A form a tree). [sent-167, score-0.364]
31 Denote by xb ∈ {rainy, sunny} the weather condition of Irvine, and xa ∈ {walk, drive} whether Alice drives or walks to school depending on the weather condition. [sent-175, score-0.765]
32 Assume the probabilities of xb and xa are p(xb ) : rainy sunny p(xa |xb ) : 0. [sent-176, score-0.779]
33 The marginal MAP, x∗_b = arg max_{xb} p(xb) = sunny, gives the correct answer. [sent-179, score-0.667]
34 However, the full MAP estimator, [x∗_a, x∗_b] = arg max p(xa, xb) = [drive, rainy], gives the answer x∗_b = rainy (after dropping the x∗_a component), which is obviously wrong. [sent-180, score-0.394]
35 In the above example, since no evidence on xa is observed, the conditional probability p(xa |xb ) does not provide useful information for xb , but instead provides misleading information when it is incorporated in the full MAP estimator. [sent-182, score-0.709]
36 The marginal MAP, on the other hand, eliminates the influence of the irrelevant p(xa |xb ) by marginalizing (or averaging) xa . [sent-183, score-0.384]
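A small numeric sketch makes the example concrete. The probability tables are truncated in the text above, so the numbers below are hypothetical stand-ins chosen only to reproduce the described behavior (marginal MAP returns sunny, full MAP returns rainy):

```python
p_b = {"rainy": 0.4, "sunny": 0.6}                    # p(x_b), illustrative values
p_a_given_b = {"rainy": {"walk": 0.1, "drive": 0.9},  # p(x_a | x_b), illustrative
               "sunny": {"walk": 0.5, "drive": 0.5}}

# Marginal MAP over x_b: x_a is summed out, so p(x_b) itself is the objective.
xb_marginal = max(p_b, key=p_b.get)                   # -> "sunny" (correct)

# Full (joint) MAP, then dropping the x_a component.
joint = {(xa, xb): p_b[xb] * p_a_given_b[xb][xa]
         for xb in p_b for xa in ("walk", "drive")}
xa_map, xb_map = max(joint, key=joint.get)            # -> ("drive", "rainy")

print(xb_marginal, xb_map)  # joint MAP is misled by the peaked p(x_a | x_b)
```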
37 The marginal MAP energy ΦAB(θ) in (7) has a dual representation, ΦAB(θ) = max_{τ∈M} {⟨θ, τ⟩ + HA|B(τ)}, (8) where HA|B(τ) is a conditional entropy, HA|B(τ) = −∑_x τ(x) log τ(xA|xB). [sent-189, score-0.262]
38 Theorem 2 naturally integrates the marginalization and maximization sub-problems into one joint optimization problem, providing a novel and efficient treatment for marginal MAP beyond the traditional approaches that treat the marginalization sub-problem as a sub-routine of the maximization problem. [sent-205, score-0.442]
39 As we show in Section 5, this enables us to derive efficient “mixed-product” message passing algorithms that simultaneously take marginalization and maximization steps, avoiding expensive and possibly wasteful inner loop steps in the marginalization sub-routine. [sent-206, score-0.416]
40 Intuitively, since the entropy HB(τ) is removed from the objective, the optimal marginal τ∗(xB) tends to have lower entropy and its probability mass concentrates on the optimal configurations {x∗_B}. [sent-211, score-0.264]
41 First, HB (τ) (and hence Fmix (τ, θ)) may be intractable to calculate even when the joint entropy H(τ) is tractable, because the marginal distribution p(xB ) = ∑xA p(x) does not necessarily inherit the conditional dependency structure of the joint distribution. [sent-215, score-0.261]
42 Variational Approximations for Marginal MAP Theorem 2 transforms the marginal MAP problem into a variational form, but obviously does not decrease its computational hardness. [sent-233, score-0.251]
43 Fortunately, many well-established variational techniques for sum- and max-inference can be extended to apply to (8), opening a new door for deriving novel approximate algorithms for marginal MAP. [sent-234, score-0.251]
44 Similar to the regular Bethe approximation, (11) leads to a nonconvex optimization, and we will derive both message passing algorithms and provably convergent algorithms to solve it. [sent-273, score-0.253]
45 (ii) Suppose τ∗ is a global maximum of (14), and the {τ∗_i(xi) : i ∈ B} take integral values, that is, τ∗_i(xi) = 0 or 1; then {x∗_i = arg max_{xi} τ∗_i(xi) : i ∈ B} is a globally optimal solution of the marginal MAP problem (7). [sent-303, score-0.277]
46 This yields a generic pairwise free energy optimization problem, max_{τ∈L} {⟨θ, τ⟩ + ∑_{i∈V} wi Hi(τ) − ∑_{(i j)∈E} wij Iij(τ)}, (16) where the weights {wi, wij} are determined by the temperature ε and {ρij} via wi = 1 for i ∈ A and wi = ε for i ∈ B, and wij = ρij for (i j) ∈ EA ∪ ∂AB and wij = ερij for (i j) ∈ EB. [sent-320, score-0.565]
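As a quick sketch, the weight assignment in (16) is mechanical once ε and {ρij} are fixed. The snippet below spells it out; the graph containers and edge-set names are assumptions of this sketch, not the paper's code:

```python
def mixed_weights(V, E, A, B, rho, eps):
    """Node and edge weights for the annealed objective (16)."""
    w_node = {i: (1.0 if i in A else eps) for i in V}   # w_i = 1 on A, eps on B
    w_edge = {}
    for (i, j) in E:
        if i in B and j in B:                 # (ij) in E_B: both ends are max nodes
            w_edge[(i, j)] = eps * rho[(i, j)]
        else:                                 # (ij) in E_A or crossing dAB
            w_edge[(i, j)] = rho[(i, j)]
    return w_node, w_edge
# As eps -> 0+, all weights on the max part vanish, recovering marginal MAP.
```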
47 If wi = 1 for all i ∈ A and wi = 0 for all i ∈ B (and the corresponding wij → 0+), Equation (16) corresponds to the marginal MAP problem; in the sequel, we derive “mixed-product” BP algorithms. [sent-327, score-0.326]
48 end for. Calculate the singleton beliefs bi(xi) and decode the solution x∗_B: x∗_i = arg max_{xi} bi(xi), ∀i ∈ B, where bi(xi) ∝ ψi(xi) m∼i(xi). [sent-342, score-0.706]
49 Here m∼i(xi) := ∏_{k∈∂i} mk→i(xi) is the product of messages sent into node i, and ∂i is the set of neighboring nodes of i. [sent-343, score-0.252]
50 1 Mixed-Product Belief Propagation. Directly taking ε → 0+ in message update (18), we obtain an interesting “mixed-product” BP algorithm that is a hybrid of the max-product and sum-product message updates, with a novel “argmax-product” message update that is specific to marginal MAP problems. [sent-353, score-0.608]
51 end for; end for. Calculate the singleton beliefs bi(xi) and decode the solution x∗_B: x∗_i = arg max_{xi} bi(xi), ∀i ∈ B, where bi(xi) ∝ ψi(xi) m∼i(xi). [sent-365, score-0.706]
52 Algorithm 2 has an intuitive interpretation: the sum-product and max-product messages in (20) and (21) correspond to the marginalization and maximization steps, respectively. [sent-366, score-0.314]
53 The special “argmax-product” messages in (22) serve to synchronize the sum-product and max-product messages: they restrict the max nodes to the currently decoded local marginal MAP solutions X∗_i = arg max_{xi} ψi(xi) m∼i(xi), and pass the posterior beliefs back to the sum part. [sent-367, score-0.555]
54 This advantage is inherited from our general variational framework, which naturally integrates the marginalization and maximization sub-problems into a joint optimization problem. [sent-370, score-0.267]
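To make the three update types concrete, here is a hedged sketch of a single message computation for a pairwise model with Bethe weights (ρij = 1). The container names (psi_i, psi_ij, msgs, nbrs) are assumptions of this sketch, not the paper's notation; damping and normalization are omitted:

```python
import numpy as np

def send_message(i, j, psi_i, psi_ij, msgs, nbrs, node_type):
    """Compute m_{i->j}(x_j); node_type marks each node 'sum' (A) or 'max' (B)."""
    incoming = psi_i[i].copy()            # psi_i(x_i) times all m_{k->i}, k != j
    for k in nbrs[i]:
        if k != j:
            incoming = incoming * msgs[(k, i)]
    if node_type[i] == 'sum':                          # sum-product update (20)
        return psi_ij[(i, j)].T @ incoming
    if node_type[j] == 'max':                          # max-product update (21)
        return (psi_ij[(i, j)] * incoming[:, None]).max(axis=0)
    # argmax-product update (22): a max node i sending into the sum part.
    belief = incoming * msgs[(j, i)]                   # b_i(x_i) up to a constant
    mask = (belief == belief.max()).astype(float)      # indicator of X_i^*
    return (psi_ij[(i, j)] * (incoming * mask)[:, None]).sum(axis=0)
```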
55 To start, we define a set of “mixed-beliefs” as bi(xi) ∝ ψi(xi) m∼i(xi), bij(xi, xj) ∝ bi(xi) bj(xj) [ψij(xi, xj) / (mi→j(xj) mj→i(xi))]^{1/ρij}. (23) [sent-381, score-0.513]
56 The marginal MAP solution should be decoded from x∗_i ∈ arg max_{xi} bi(xi), ∀i ∈ B, as is typical in max-product BP. [sent-382, score-0.423]
57 The {τi, τij} in (19) and the {bi, bij} in (23) are associated via bi ∝ τi for i ∈ A, bi ∝ (τi)^ε for i ∈ B, bij ∝ bi bj (τij / (τi τj)) for (i j) ∈ EA ∪ ∂AB, and bij ∝ bi bj (τij / (τi τj))^ε for (i j) ∈ EB. [sent-386, score-1.026]
58 Therefore, as ε → 0+, the τi (= bi^{1/ε}) for i ∈ B should concentrate their mass on a deterministic configuration, but bi may continue to have soft values. [sent-389, score-0.342]
59 At the fixed point of mixed-product BP in Algorithm 2, the mixed-beliefs defined in (23) satisfy the following. [sent-392, score-0.342]
60 Reparameterization: p(x) ∝ ∏_{i∈V} bi(xi) ∏_{(i j)∈E} [bij(xi, xj) / (bi(xi) bj(xj))]^{ρij}. (24) [sent-393, score-0.584]
61 Mixed-consistency: (a) ∑_{xi} bij(xi, xj) = bj(xj), ∀i ∈ A, j ∈ A ∪ B; (25) (b) max_{xi} bij(xi, xj) = bj(xj), ∀i ∈ B, j ∈ B; (26) (c) ∑_{xi ∈ arg max bi} bij(xi, xj) = bj(xj), ∀i ∈ B, j ∈ A. (27) Proof. [sent-394, score-0.764]
62 The three mixed-consistency constraints exactly map to the three types of message updates in Algorithm 2. [sent-396, score-0.309]
63 In other words, G_{C∪A} is a semi A-B tree if it is an A-B tree when ignoring any edges entirely within the max set B. [sent-404, score-0.263]
64 Jiang et al. (2011) proposed a similar hybrid message passing algorithm, repeated here as Algorithm 3, which differs from our mixed-product BP only in replacing our argmax-product message update (22) with the usual max-product message update (21). [sent-436, score-0.532]
65 x∗_i = arg max_{xi} bi(xi) for all i ∈ B, where bi(xi) ∝ ψi(xi) m∼i(xi). [sent-442, score-0.448]
66 By tracking the messages, one can write its final decoded solution in closed form: x∗_4 = arg max_{x4} ∑_{x2} ∑_{x1} max_{x3} exp(θ(x)), x∗_3 = arg max_{x3} ∑_{x2} ∑_{x1} max_{x4} exp(θ(x)). On the other hand, the true marginal MAP solution is given by x∗_4 = arg max_{x4} max_{x3} ∑_{x2} ∑_{x1} exp(θ(x)) [sent-459, score-0.58]
67 and x∗_3 = arg max_{x3} max_{x4} ∑_{x2} ∑_{x1} exp(θ(x)). Here, Algorithm 3 approximates the exact marginal MAP problem by rearranging the max and sum operators into an elimination order that makes the calculation easier. [sent-460, score-0.461]
68 After a message sequence is passed from x3 to x4, one can show that b4(x4) and x∗_4 update to x∗_4 = arg max_{x4} b4(x4), b4(x4) = ∑_{x2} ∑_{x1} exp(θ([x∗_3, x¬3])) = exp(Q([x∗_3, x4]; θ)), where we maximize the exact objective function Q([x∗_3, x4]; θ) with fixed x3 = x∗_3. [sent-471, score-0.294]
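The ordering issue above is easy to verify by brute force. The following sketch builds a hypothetical 4-variable chain with random potentials and compares the exchanged-order decoding with the true marginal MAP decoding; since max and sum do not commute, the two can disagree:

```python
import numpy as np

rng = np.random.default_rng(0)
theta = rng.normal(size=(2, 2, 2, 2))    # theta(x1, x2, x3, x4), binary variables
w = np.exp(theta)

# True marginal MAP: max over (x3, x4) of the sum over (x1, x2).
Q = w.sum(axis=(0, 1))                                  # Q[x3, x4]
x34_true = np.unravel_index(Q.argmax(), Q.shape)

# Exchanged order, mirroring Algorithm 3's closed form: each max variable
# is decoded after maxing out the other max variable *before* summing.
x3_swap = w.max(axis=3).sum(axis=(0, 1)).argmax()       # max_x4, then sum_x1,x2
x4_swap = w.max(axis=2).sum(axis=(0, 1)).argmax()       # max_x3, then sum_x1,x2

print(x34_true, (x3_swap, x4_swap))   # may differ for some draws of theta
```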
69 We use proximal point methods (e.g., Martinet, 1970; Rockafellar, 1976) to derive convergent algorithms that directly optimize our free energy objectives, which take the form of transforming marginal MAP into a sequence of pure (or annealed) sum-inference tasks. [sent-481, score-0.303]
70 The proximal point algorithm works by iteratively optimizing a smoothed problem, τ^{t+1} = arg min_{τ∈M} {−Fmix(τ, θ) + λt D(τ||τ^t)}, where τ^t is the solution at iteration t, and λt is a positive coefficient. [sent-485, score-0.277]
71 In this case, the proximal point algorithm reduces to Algorithm 4, which iteratively solves a smoothed free energy objective, with the natural parameter θ^t updated at each iteration. [sent-493, score-0.345]
72 Intuitively, the proximal inner loop (28)-(29) essentially “adds back” the truncated entropy term HB(τ), while canceling its effect by adjusting θ in the opposite direction. [sent-494, score-0.286]
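A compact, brute-force sketch of this proximal scheme for a tiny model where exact sum-inference is feasible: θ^t is shifted toward the current B-marginal at each outer step, so every iteration is a standard marginalization task. The array layout and the constant λ are assumptions of this sketch, and the exact inner sum-inference stands in for the approximate updates (28)-(29):

```python
import numpy as np

def proximal_marginal_map(theta, lam=1.0, num_iters=50):
    """theta[x_A, x_B] on a tiny model; each step is a pure sum-inference call."""
    theta_t = theta.copy()
    tau_B = None
    for _ in range(num_iters):
        # Exact sum-inference on the smoothed model.
        p = np.exp(theta_t - theta_t.max())
        p /= p.sum()
        tau_B = p.sum(axis=0)                            # marginal over x_B
        # Shift the natural parameter toward log tau_B, which "adds back"
        # the truncated entropy term H_B and anneals tau_B toward a vertex.
        theta_t = theta + lam * np.log(tau_B + 1e-300)[None, :]
    return int(tau_B.argmax())                           # decoded x_B^*

rng = np.random.default_rng(1)
print(proximal_marginal_map(rng.normal(size=(8, 4))))    # 8 sum states, 4 max states
```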
73 Then the resulting approximate algorithm can be interpreted as a proximal algorithm that maximizes F̂mix(τ, θ) with proximal function D_pair(τ||τ^t) = ∑_{i∈B} κi KL[τi(xi)||τ^t_i(xi)] + ∑ κi→j KL[τij(xi|xj)||τ^t_ij(xi|xj)]. [sent-502, score-0.454]
74 The global convergence guarantees of the proximal point algorithm may also fail if the inner update (29) is not solved exactly. [sent-508, score-0.28]
75 Then the dual optimization (8) remains exact if the marginal polytope M is replaced by any N satisfying Mo ⊆ N ⊆ M, that is, ΦAB = max_{τ∈N} {⟨θ, τ⟩ + HA|B(τ)}. [sent-521, score-0.271]
76 A junction graph is a junction tree if G is a tree. [sent-553, score-0.263]
77 Further, we approximate HB(τ) by a slightly more restrictive entropy decomposition, HB(τ) ≈ ∑_{k∈V} H_{πk}(τ), where {πk : k ∈ V} is a non-overlapping partition of the max nodes B satisfying πk ⊆ ck for all k ∈ V. [sent-557, score-0.249]
78 In other words, π represents an assignment of each max node xb ∈ B into a cluster k with xb ∈ πk . [sent-558, score-1.013]
79 Overall, the marginal MAP dual form in (8) is approximated by max_{τ∈L(G)} {⟨θ, τ⟩ + ∑_{k∈A} H_{ck}(τ) + ∑_{k∈B} H_{ck|πk}(τ) − ∑_{(kl)∈E} H_{skl}(τ)}, (32) where H_{ck|πk}(τ) = H_{ck}(τ) − H_{πk}(τ). [sent-561, score-0.263]
80 Algorithm 5 can be seen as … Figure 3: (a) An example of a marginal MAP problem, where d, c, e are sum nodes (shaded) and a, b, f are max nodes. [sent-564, score-0.414]
81 Jiang et al. (2011)’s hybrid message passing and a state-of-the-art local search algorithm (Park and Darwiche, 2004). [sent-576, score-0.258]
82 Jiang et al. (2011)’s hybrid message passing (with Bethe weights) in Algorithm 3 (Jiang’s method), where the solutions are all extracted by maximizing the singleton marginals of the max nodes. [sent-578, score-0.377]
83 We also run the proximal point version of mixed-product BP with Bethe weights (Proximal (Bethe)), which is Algorithm 4 with both HA|B(τ) and HB(τ) approximated by Bethe approximations. [sent-583, score-0.26]
84 We also implement the TRW approximation, but only using the convergent proximal point algorithm, because the TRW upper bounds are valid only when the algorithms converge. [sent-584, score-0.266]
85 We run all the proximal point algorithms for a maximum of 100 iterations, with a maximum of 5 iterations of weighted message passing updates (18)-(19) for the inner loops (with additional damping of 0.5). [sent-586, score-0.418]
86 2 Diagnostic Bayesian Networks. We also test our algorithms on two diagnostic Bayesian networks taken from the UAI08 Inference Challenge, where we construct marginal MAP problems by randomly selecting varying percentages of nodes to be max nodes. [sent-610, score-0.324]
87 Since these models are not pairwise, we implement the junction graph versions of mix-product (Bethe) and proximal (Bethe) shown in Section 8. [sent-611, score-0.325]
88 3 Insights. Across all the experiments, we find that mix-product (Bethe), proximal (Bethe) and local search (SamIam) significantly outperform all the other algorithms, while proximal (Bethe) outperforms the two others in some circumstances. [sent-614, score-0.478]
89 However, the performance of SamIam tends to degenerate when the max part has loopy dependency structures (see Figure 7), or when the number of max nodes is large (see Figure 8), both of which make it difficult to explore the solution space by local search. [sent-616, score-0.279]
90 The method of Jiang et al. (2011) is significantly worse than mix-product (Bethe), proximal (Bethe) and local search (SamIam), but is otherwise the best among the remaining algorithms. [sent-621, score-0.251]
91 Figure 5: (a) A typical latent tree model, whose leaf nodes are taken to be max nodes (white) and non-leaf nodes to be sum nodes (shaded). [sent-663, score-0.482]
92 Conclusion and Further Directions. We have presented a general variational framework for solving marginal MAP problems approximately, opening new doors for developing efficient algorithms. [sent-666, score-0.251]
93 Figure 6: (a) A marginal MAP problem defined on a 10 × 10 Ising grid, with shaded sum nodes and unshaded max nodes; note that the sum part is a loopy graph, while the max part is fully disconnected. [sent-676, score-0.455]
94 Figure 7: (a) A marginal MAP problem defined on a 10 × 10 Ising grid, but with the max/sum parts exactly opposite to those in Figure 6; note that the max part is loopy, while the sum part is fully disconnected in this case. [sent-686, score-0.342]
95 By Theorem 9, the beliefs {bi , bi j } should satisfy the reparameterization property in (24) and the consistency conditions in (25)-(27). [sent-730, score-0.272]
96 Without loss of generality, we assume {bi, bij} are normalized such that ∑_{xi} bi(xi) = 1 for i ∈ A and max_{xi} bi(xi) = 1 for i ∈ B. [sent-731, score-0.569]
97 Because the max part of the beliefs, {b̂i, b̂ij : (i j) ∈ EB}, satisfies the standard max-consistency conditions, and the corresponding TRW weights {ρij : (i j) ∈ EB} are provably convex by assumption, we establish that x∗_B is the MAP solution of p̂B(xB) by Theorem 1 of Weiss et al. [sent-737, score-0.298]
98 Therefore, ∑_{xA} p̂A|B([xA, x∗_B]) = ∏_{i∈∂A} ∑_{xi} [b_{i,πi}(xi, x∗_{πi}) / b_{πi}(x∗_{πi})]^{ρi,πi} bi(xi)^{1−ρi,πi} = 1. [sent-747, score-0.576]
99 The term r̂AD(x) is defined as r̂AD([xA, xD]) = ∏_{(i j)∈∂AD} [bij(xi, xj) / (bi(xi) bj(xj))]^{ρij}, where similarly ∂AD is the set of edges across A and D. [sent-752, score-0.342]
100 Because x∗_j = arg max_{xj} bj(xj) for j ∈ D, we have bij(xi, x∗_j) = bi(xi) for (i j) ∈ ∂AD, j ∈ D, by the mixed-consistency condition in (27). [sent-753, score-0.392]
wordName wordTfidf (topN-words)
[('xb', 0.471), ('bethe', 0.257), ('trw', 0.25), ('xa', 0.238), ('bp', 0.231), ('proximal', 0.227), ('map', 0.188), ('ha', 0.187), ('bi', 0.171), ('ab', 0.147), ('marginal', 0.146), ('hler', 0.125), ('hb', 0.123), ('message', 0.121), ('xi', 0.117), ('arginal', 0.107), ('variational', 0.105), ('xck', 0.097), ('tree', 0.096), ('marginalization', 0.091), ('xc', 0.089), ('hi', 0.084), ('polytope', 0.079), ('ea', 0.079), ('fmix', 0.076), ('nodes', 0.072), ('jiang', 0.072), ('max', 0.071), ('energy', 0.07), ('reparameterization', 0.07), ('passing', 0.07), ('samiam', 0.069), ('eb', 0.069), ('junction', 0.069), ('em', 0.069), ('iu', 0.068), ('wainwright', 0.067), ('mo', 0.064), ('messages', 0.063), ('wi', 0.06), ('entropy', 0.059), ('maxxi', 0.056), ('arg', 0.05), ('skl', 0.049), ('pa', 0.048), ('belief', 0.048), ('free', 0.048), ('ck', 0.047), ('optimality', 0.046), ('dual', 0.046), ('coupling', 0.045), ('pairwise', 0.043), ('hybrid', 0.043), ('pb', 0.043), ('maximization', 0.043), ('propagation', 0.042), ('tab', 0.042), ('loopy', 0.041), ('singleton', 0.041), ('convergent', 0.039), ('yedidia', 0.037), ('semi', 0.037), ('ihler', 0.036), ('darwiche', 0.036), ('diagnostic', 0.035), ('rainy', 0.035), ('sunny', 0.035), ('weights', 0.033), ('guration', 0.032), ('gc', 0.032), ('hidden', 0.032), ('beliefs', 0.031), ('kl', 0.031), ('ga', 0.031), ('marginals', 0.031), ('mi', 0.03), ('ii', 0.03), ('strength', 0.03), ('annealed', 0.03), ('graph', 0.029), ('mm', 0.029), ('update', 0.028), ('posteriori', 0.028), ('elati', 0.028), ('weather', 0.028), ('joint', 0.028), ('sum', 0.027), ('weiss', 0.027), ('globally', 0.025), ('guarantees', 0.025), ('elimination', 0.025), ('hazan', 0.025), ('park', 0.024), ('koller', 0.024), ('objective', 0.024), ('rror', 0.024), ('local', 0.024), ('provably', 0.023), ('tractable', 0.023), ('trees', 0.023)]
simIndex simValue paperId paperTitle
same-paper 1 0.99999982 120 jmlr-2013-Variational Algorithms for Marginal MAP
2 0.14479707 13 jmlr-2013-Approximating the Permanent with Fractional Belief Propagation
Author: Michael Chertkov, Adam B. Yedidia
Abstract: We discuss schemes for exact and approximate computations of permanents, and compare them with each other. Specifically, we analyze the belief propagation (BP) approach and its fractional belief propagation (FBP) generalization for computing the permanent of a non-negative matrix. Known bounds and Conjectures are verified in experiments, and some new theoretical relations, bounds and Conjectures are proposed. The fractional free energy (FFE) function is parameterized by a scalar parameter γ ∈ [−1; 1], where γ = −1 corresponds to the BP limit and γ = 1 corresponds to the exclusion principle (but ignoring perfect matching constraints) mean-field (MF) limit. FFE shows monotonicity and continuity with respect to γ. For every non-negative matrix, we define its special value γ∗ ∈ [−1; 0] to be the γ for which the minimum of the γ-parameterized FFE function is equal to the permanent of the matrix, where the lower and upper bounds of the γ-interval corresponds to respective bounds for the permanent. Our experimental analysis suggests that the distribution of γ∗ varies for different ensembles but γ∗ always lies within the [−1; −1/2] interval. Moreover, for all ensembles considered, the behavior of γ∗ is highly distinctive, offering an empirical practical guidance for estimating permanents of non-negative matrices via the FFE approach. Keywords: permanent, graphical models, belief propagation, exact and approximate algorithms, learning flows
3 0.14168324 17 jmlr-2013-Belief Propagation for Continuous State Spaces: Stochastic Message-Passing with Quantitative Guarantees
Author: Nima Noorshams, Martin J. Wainwright
Abstract: The sum-product or belief propagation (BP) algorithm is a widely used message-passing technique for computing approximate marginals in graphical models. We introduce a new technique, called stochastic orthogonal series message-passing (SOSMP), for computing the BP fixed point in models with continuous random variables. It is based on a deterministic approximation of the messages via orthogonal series basis expansion, and a stochastic estimation of the basis coefficients via Monte Carlo techniques and damped updates. We prove that the SOSMP iterates converge to a δ-neighborhood of the unique BP fixed point for any tree-structured graph, and for any graphs with cycles in which the BP updates satisfy a contractivity condition. In addition, we demonstrate how to choose the number of basis coefficients as a function of the desired approximation accuracy δ and smoothness of the compatibility functions. We illustrate our theory with both simulated examples and in application to optical flow estimation. Keywords: graphical models, sum-product for continuous state spaces, low-complexity belief propagation, stochastic approximation, Monte Carlo methods, orthogonal basis expansion
4 0.11719876 71 jmlr-2013-Message-Passing Algorithms for Quadratic Minimization
Author: Nicholas Ruozzi, Sekhar Tatikonda
Abstract: Gaussian belief propagation (GaBP) is an iterative algorithm for computing the mean (and variances) of a multivariate Gaussian distribution, or equivalently, the minimum of a multivariate positive definite quadratic function. Sufficient conditions, such as walk-summability, that guarantee the convergence and correctness of GaBP are known, but GaBP may fail to converge to the correct solution given an arbitrary positive definite covariance matrix. As was observed by Malioutov et al. (2006), the GaBP algorithm fails to converge if the computation trees produced by the algorithm are not positive definite. In this work, we will show that the failure modes of the GaBP algorithm can be understood via graph covers, and we prove that a parameterized generalization of the min-sum algorithm can be used to ensure that the computation trees remain positive definite whenever the input matrix is positive definite. We demonstrate that the resulting algorithm is closely related to other iterative schemes for quadratic minimization such as the Gauss-Seidel and Jacobi algorithms. Finally, we observe, empirically, that there always exists a choice of parameters such that the above generalization of the GaBP algorithm converges. Keywords: belief propagation, Gaussian graphical models, graph covers
5 0.10862884 68 jmlr-2013-Machine Learning with Operational Costs
Author: Theja Tulabandhula, Cynthia Rudin
Abstract: This work proposes a way to align statistical modeling with decision making. We provide a method that propagates the uncertainty in predictive modeling to the uncertainty in operational cost, where operational cost is the amount spent by the practitioner in solving the problem. The method allows us to explore the range of operational costs associated with the set of reasonable statistical models, so as to provide a useful way for practitioners to understand uncertainty. To do this, the operational cost is cast as a regularization term in a learning algorithm’s objective function, allowing either an optimistic or pessimistic view of possible costs, depending on the regularization parameter. From another perspective, if we have prior knowledge about the operational cost, for instance that it should be low, this knowledge can help to restrict the hypothesis space, and can help with generalization. We provide a theoretical generalization bound for this scenario. We also show that learning with operational costs is related to robust optimization. Keywords: statistical learning theory, optimization, covering numbers, decision theory
6 0.096031733 108 jmlr-2013-Stochastic Variational Inference
7 0.095222235 88 jmlr-2013-Perturbative Corrections for Approximate Inference in Gaussian Latent Variable Models
8 0.093503654 76 jmlr-2013-Nonparametric Sparsity and Regularization
9 0.090806037 121 jmlr-2013-Variational Inference in Nonconjugate Models
10 0.07836616 90 jmlr-2013-Quasi-Newton Method: A New Direction
11 0.076680057 111 jmlr-2013-Supervised Feature Selection in Graphs with Path Coding Penalties and Network Flows
12 0.06079771 26 jmlr-2013-Conjugate Relation between Loss Functions and Uncertainty Sets in Classification Problems
13 0.056474876 92 jmlr-2013-Random Spanning Trees and the Prediction of Weighted Graphs
14 0.055732016 47 jmlr-2013-Gaussian Kullback-Leibler Approximate Inference
15 0.054401025 84 jmlr-2013-PC Algorithm for Nonparanormal Graphical Models
16 0.051878244 102 jmlr-2013-Sparse Matrix Inversion with Scaled Lasso
17 0.046337757 115 jmlr-2013-Training Energy-Based Models for Time-Series Imputation
18 0.045668449 93 jmlr-2013-Random Walk Kernels and Learning Curves for Gaussian Process Regression on Random Graphs
19 0.042681769 99 jmlr-2013-Semi-Supervised Learning Using Greedy Max-Cut
20 0.042338733 31 jmlr-2013-Derivative Estimation with Local Polynomial Fitting
topicId topicWeight
[(0, -0.249), (1, -0.127), (2, 0.062), (3, 0.06), (4, 0.086), (5, 0.222), (6, 0.093), (7, 0.033), (8, -0.051), (9, 0.047), (10, -0.093), (11, 0.162), (12, 0.226), (13, 0.162), (14, 0.08), (15, -0.212), (16, -0.238), (17, 0.116), (18, -0.028), (19, 0.122), (20, 0.043), (21, 0.05), (22, 0.06), (23, 0.007), (24, 0.055), (25, 0.069), (26, -0.031), (27, -0.094), (28, -0.144), (29, 0.071), (30, -0.04), (31, -0.079), (32, -0.056), (33, -0.047), (34, 0.057), (35, 0.057), (36, 0.042), (37, 0.055), (38, -0.066), (39, 0.07), (40, -0.052), (41, 0.033), (42, 0.026), (43, -0.03), (44, -0.03), (45, -0.077), (46, -0.033), (47, 0.037), (48, -0.003), (49, -0.078)]
simIndex simValue paperId paperTitle
same-paper 1 0.94857132 120 jmlr-2013-Variational Algorithms for Marginal MAP
2 0.72170681 13 jmlr-2013-Approximating the Permanent with Fractional Belief Propagation
3 0.49161181 71 jmlr-2013-Message-Passing Algorithms for Quadratic Minimization
4 0.49152377 17 jmlr-2013-Belief Propagation for Continuous State Spaces: Stochastic Message-Passing with Quantitative Guarantees
5 0.42782319 68 jmlr-2013-Machine Learning with Operational Costs
6 0.39591709 76 jmlr-2013-Nonparametric Sparsity and Regularization
7 0.37801722 88 jmlr-2013-Perturbative Corrections for Approximate Inference in Gaussian Latent Variable Models
8 0.34471282 90 jmlr-2013-Quasi-Newton Method: A New Direction
9 0.30902702 26 jmlr-2013-Conjugate Relation between Loss Functions and Uncertainty Sets in Classification Problems
10 0.3009502 92 jmlr-2013-Random Spanning Trees and the Prediction of Weighted Graphs
11 0.28123215 111 jmlr-2013-Supervised Feature Selection in Graphs with Path Coding Penalties and Network Flows
12 0.27679712 108 jmlr-2013-Stochastic Variational Inference
13 0.25735259 121 jmlr-2013-Variational Inference in Nonconjugate Models
14 0.25671023 31 jmlr-2013-Derivative Estimation with Local Polynomial Fitting
15 0.25221369 106 jmlr-2013-Stationary-Sparse Causality Network Learning
16 0.2283852 99 jmlr-2013-Semi-Supervised Learning Using Greedy Max-Cut
17 0.22582023 74 jmlr-2013-Multivariate Convex Regression with Adaptive Partitioning
18 0.22394899 18 jmlr-2013-Beyond Fano's Inequality: Bounds on the Optimal F-Score, BER, and Cost-Sensitive Risk and Their Implications
19 0.22350891 93 jmlr-2013-Random Walk Kernels and Learning Curves for Gaussian Process Regression on Random Graphs
20 0.21629331 47 jmlr-2013-Gaussian Kullback-Leibler Approximate Inference
topicId topicWeight
[(0, 0.042), (5, 0.11), (6, 0.108), (10, 0.066), (14, 0.011), (20, 0.016), (23, 0.017), (53, 0.011), (68, 0.032), (70, 0.02), (75, 0.078), (85, 0.022), (87, 0.017), (88, 0.012), (93, 0.018), (96, 0.322)]
simIndex simValue paperId paperTitle
same-paper 1 0.74934149 120 jmlr-2013-Variational Algorithms for Marginal MAP
2 0.48236033 106 jmlr-2013-Stationary-Sparse Causality Network Learning
Author: Yuejia He, Yiyuan She, Dapeng Wu
Abstract: Recently, researchers have proposed penalized maximum likelihood to identify network topology underlying a dynamical system modeled by multivariate time series. The time series of interest are assumed to be stationary, but this restriction is never taken into consideration by existing estimation methods. Moreover, practical problems of interest may have ultra-high dimensionality and obvious node collinearity. In addition, none of the available algorithms provides a probabilistic measure of the uncertainty for the obtained network topology which is informative in reliable network identification. The main purpose of this paper is to tackle these challenging issues. We propose the S2 learning framework, which stands for stationary-sparse network learning. We propose a novel algorithm referred to as the Berhu iterative sparsity pursuit with stationarity (BISPS), where the Berhu regularization can improve the Lasso in detection and estimation. The algorithm is extremely easy to implement, efficient in computation and has a theoretical guarantee to converge to a global optimum. We also incorporate a screening technique into BISPS to tackle ultra-high dimensional problems and enhance computational efficiency. Furthermore, a stationary bootstrap technique is applied to provide connection occurring frequency for reliable topology learning. Experiments show that our method can achieve stationary and sparse causality network learning and is scalable for high-dimensional problems. Keywords: stationarity, sparsity, Berhu, screening, bootstrap
3 0.47284672 98 jmlr-2013-Segregating Event Streams and Noise with a Markov Renewal Process Model
Author: Dan Stowell, Mark D. Plumbley
Abstract: We describe an inference task in which a set of timestamped event observations must be clustered into an unknown number of temporal sequences with independent and varying rates of observations. Various existing approaches to multi-object tracking assume a fixed number of sources and/or a fixed observation rate; we develop an approach to inferring structure in timestamped data produced by a mixture of an unknown and varying number of similar Markov renewal processes, plus independent clutter noise. The inference simultaneously distinguishes signal from noise as well as clustering signal observations into separate source streams. We illustrate the technique via synthetic experiments as well as an experiment to track a mixture of singing birds. Source code is available. Keywords: multi-target tracking, clustering, point processes, flow network, sound
4 0.45426887 88 jmlr-2013-Perturbative Corrections for Approximate Inference in Gaussian Latent Variable Models
Author: Manfred Opper, Ulrich Paquet, Ole Winther
Abstract: Expectation Propagation (EP) provides a framework for approximate inference. When the model under consideration is over a latent Gaussian field, with the approximation being Gaussian, we show how these approximations can systematically be corrected. A perturbative expansion is made of the exact but intractable correction, and can be applied to the model’s partition function and other moments of interest. The correction is expressed over the higher-order cumulants which are neglected by EP’s local matching of moments. Through the expansion, we see that EP is correct to first order. By considering higher orders, corrections of increasing polynomial complexity can be applied to the approximation. The second order provides a correction in quadratic time, which we apply to an array of Gaussian process and Ising models. The corrections generalize to arbitrarily complex approximating families, which we illustrate on tree-structured Ising model approximations. Furthermore, they provide a polynomial-time assessment of the approximation error. We also provide both theoretical and practical insights on the exactness of the EP solution. Keywords: expectation consistent inference, expectation propagation, perturbation correction, Wick expansions, Ising model, Gaussian process
5 0.44927418 75 jmlr-2013-Nested Expectation Propagation for Gaussian Process Classification with a Multinomial Probit Likelihood
Author: Jaakko Riihimäki, Pasi Jylänki, Aki Vehtari
Abstract: This paper considers probabilistic multinomial probit classification using Gaussian process (GP) priors. Challenges with multiclass GP classification are the integration over the non-Gaussian posterior distribution, and the increase of the number of unknown latent variables as the number of target classes grows. Expectation propagation (EP) has proven to be a very accurate method for approximate inference but the existing EP approaches for the multinomial probit GP classification rely on numerical quadratures, or independence assumptions between the latent values associated with different classes, to facilitate the computations. In this paper we propose a novel nested EP approach which does not require numerical quadratures, and approximates accurately all betweenclass posterior dependencies of the latent values, but still scales linearly in the number of classes. The predictive accuracy of the nested EP approach is compared to Laplace, variational Bayes, and Markov chain Monte Carlo (MCMC) approximations with various benchmark data sets. In the experiments nested EP was the most consistent method compared to MCMC sampling, but in terms of classification accuracy the differences between all the methods were small from a practical point of view. Keywords: Gaussian process, multiclass classification, multinomial probit, approximate inference, expectation propagation
6 0.44791558 71 jmlr-2013-Message-Passing Algorithms for Quadratic Minimization
7 0.44753227 82 jmlr-2013-Optimally Fuzzy Temporal Memory
8 0.44518092 3 jmlr-2013-A Framework for Evaluating Approximation Methods for Gaussian Process Regression
9 0.44506317 47 jmlr-2013-Gaussian Kullback-Leibler Approximate Inference
10 0.4436768 17 jmlr-2013-Belief Propagation for Continuous State Spaces: Stochastic Message-Passing with Quantitative Guarantees
12 0.44237447 118 jmlr-2013-Using Symmetry and Evolutionary Search to Minimize Sorting Networks
13 0.44230059 99 jmlr-2013-Semi-Supervised Learning Using Greedy Max-Cut
14 0.44114256 23 jmlr-2013-Cluster Analysis: Unsupervised Learning via Supervised Learning with a Non-convex Penalty
15 0.44100484 22 jmlr-2013-Classifying With Confidence From Incomplete Information
16 0.44062197 25 jmlr-2013-Communication-Efficient Algorithms for Statistical Optimization
18 0.4384267 93 jmlr-2013-Random Walk Kernels and Learning Curves for Gaussian Process Regression on Random Graphs
19 0.43807122 48 jmlr-2013-Generalized Spike-and-Slab Priors for Bayesian Group Feature Selection Using Expectation Propagation
20 0.4376474 28 jmlr-2013-Construction of Approximation Spaces for Reinforcement Learning