nips nips2007 nips2007-157 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Michael Kearns, Jinsong Tan, Jennifer Wortman
Abstract: We provide provably privacy-preserving versions of belief propagation, Gibbs sampling, and other local algorithms — distributed multiparty protocols in which each party or vertex learns only its final local value, and absolutely nothing else. 1
Reference: text
sentIndex sentText sentNum sentScore
1 Consider a network of human social contacts, and imagine that each party would like to compute or estimate their probability of having contracted a contagious disease, which depends on the exposures to the disease of their immediate neighbors in the network. [sent-3, score-0.543]
2 If network participants (or their proxy algorithms) engage in standard belief propagation, each party would learn their probability of exposure conditioned on any evidence — and a great deal more, including information about the exposure probabilities of their neighbors. [sent-4, score-0.646]
3 Obviously such leakage of non-local information is highly undesirable in settings where we regard each party in the network as a self-interested agent, and privacy is paramount. [sent-5, score-0.536]
4 By a privacy-preserving version of inference (for example), we informally mean a protocol in which each party learns their conditional probability of exposure to the disease and absolutely nothing else. [sent-7, score-0.928]
5 More precisely, anything a party can efficiently compute after having participated in the protocol, they could have efficiently computed alone given only the value of their conditional probability — thus, the protocol leaked no additional information beyond its desired outputs. [sent-8, score-0.745]
6 Put another way, the straightforward solution from cryptography requires every party in the network to have the ability to broadcast to all others, and the resulting algorithm may bear little resemblance to standard belief propagation. [sent-10, score-0.64]
7 While there has been work on minimizing the number of messages exchanged in centralized privacy-preserving protocols [9], ours are the first results that preserve the local communication structure of distributed algorithms like belief propagation. [sent-12, score-0.45]
8 Our protocols are faithful to the network topology, requiring only the passing of messages between parties separated by one or two hops in the network. [sent-13, score-0.518]
9 1 All results in this paper apply to the “semi-honest” or “honest but curious” model in the cryptography literature, in which participants obediently execute the protocol but may attempt to infer non-private information from it. [sent-16, score-0.44]
10 We expect that via the use of zero-knowledge proof techniques, our protocols may be strengthened to models in which participants who deviate from the protocol are detected. [sent-17, score-0.44]
11 Imagine a scenario in which there are k distinct parties, each in possession of exactly one of these inputs (that is, party i initially knows x i ) and the k parties would like to jointly compute the value of f (x1 , . [sent-23, score-0.773]
12 Perhaps the simplest protocol would have all parties share their private inputs and then individually compute the value of f . [sent-27, score-0.711]
13 However, in many natural settings, we would like the parties to be able to perform this joint computation in a privacy-preserving fashion, with each party revealing as little as possible about their private input. [sent-28, score-0.746]
14 Simple examples include voting — we would all like to learn the results of the election without having to broadcast our private votes — and the so-called “Millionaire’s Problem” in which two individuals would like to learn who is wealthier, without revealing their precise wealth to each other. [sent-29, score-0.314]
15 If a trusted “third party” is available, one solution would be to provide the private inputs to them, and have them perform the computation in secrecy, only announcing the final result. [sent-30, score-0.213]
16 The purpose of the theory of secure multiparty function computation [6] is to show that under extremely general circumstances, a third party is surprisingly unnecessary. [sent-31, score-0.697]
17 For example, in the Millionaire’s Problem, there is no avoiding the poorer party learning a lower bound on the richer’s wealth (namely, the poorer party’s wealth). [sent-33, score-0.398]
18 To formalize this notion in a complexity-theoretic framework, let us assume without loss of generality that each input xi is n bits in length. [sent-35, score-0.379]
19 We (informally) define a protocol Π for the k parties to compute f to be a specific mechanism by which the parties exchange messages and perform computations, ending with every party learning the value y = f (x1 , . [sent-37, score-1.25]
20 One (uninteresting) protocol is the one in which each party sends their private inputs to all others, and every party computes y alone. [sent-41, score-1.231]
21 Definition 1 1 Let Π be any protocol for the k parties to jointly compute the value y = f (x1 , . [sent-42, score-0.575]
22 We say that Π is privacy-preserving if for every 1 ≤ i ≤ k, anything that party i can compute in time polynomial in n following the execution of Π, they could also compute in polynomial time given only their private input x i and the value y. [sent-46, score-0.747]
23 In other words, whatever information party i is able to obtain from their view of the execution of protocol Π, it does not let them efficiently compute anything they couldn’t efficiently compute just from being told the final output y of Π (and their private input xi ). [sent-47, score-1.28]
24 This captures the notion that while y itself may “leak” some information about the other private inputs xj , the protocol Π yields nothing further. [sent-48, score-0.799]
25 Then under standard cryptographic assumptions, 3 there exists a polynomial time privacy-preserving protocol Π for f (that is, a protocol in which party i learns nothing not already implied by their private input xi and private output yi ). [sent-59, score-1.93]
26 It involves both formalizing the notion of a multiparty computation protocol, as well as defining the “view” of an individual party of a specific execution of the protocol. [sent-61, score-0.512]
27 2 Our definition of privacy does not imply that coalitions of parties cannot together compute additional information. [sent-63, score-0.388]
28 In the extended version of this paper, we discuss the difficulty of achieving this stronger notion of privacy with any protocol that uses a truly distributed method of computation. [sent-64, score-0.489]
29 2 This remarkable and important theorem essentially says that whatever a population can jointly compute, it can jointly compute with arbitrary restrictions on who learns what. [sent-66, score-0.234]
30 For instance, in the Millionaire’s Problem, by defining one player’s output to always be nil, we can ensure that this player learns absolutely nothing from the protocol, while the other learns which player is wealthier. [sent-68, score-0.344]
31 The proof of Theorem 1 is constructive, providing an algorithm to transform any polynomial circuit into a polynomial-time privacy-preserving protocol for k parties. [sent-69, score-0.344]
32 A standard public-key cryptosystem allows any party to generate a pair of keys (P, S), which we can think of as k-bit strings; k is often called the security parameter. [sent-73, score-0.459]
33 Associated with the public key P there is a (possibly probabilistic) encryption function EP and associated with the secret or private key S there is a (deterministic) decryption function D S . [sent-74, score-0.562]
34 • For any n-bit x, it is hard for a party knowing only the public key P and the encryption EP (x) to compute x. [sent-79, score-0.738]
35 4 Thus, in such a scheme, anyone knowing the public key of Alice can efficiently compute and send encrypted messages to Alice, but only Alice, who is the sole party knowing her private key, can decrypt those messages. [sent-80, score-1.285]
36 As one specific example that allows probabilistic encryption of individual bits, let the public key consist of an integer N = p · q that is the product of two k/2-bit randomly generated prime numbers p and q, as well as a number z that has the property that z is not equal to x2 mod N for any x. [sent-82, score-0.412]
37 In order to encrypt a 0, one simply chooses x at random and lets the encryption be y = x 2 mod N , known as a quadratic residue. [sent-84, score-0.311]
38 In order to encrypt a 1, one instead sends y = x2 · z mod N , which is guaranteed to not be a quadratic residue. [sent-85, score-0.226]
39 It is not difficult to show that given the factors p and q (which constitute the secret key), one can efficiently compute whether y is a quadratic residue and thus learn the decrypted bit. [sent-86, score-0.217]
40 Thus, a party knowing only Alice’s public key can nevertheless take any bit encrypted for Alice and generate random re-encryptions of that bit without needing to actually know the decryption. [sent-90, score-0.853]
41 We refer to this operation as blinding an encrypted bit. [sent-91, score-0.31]
42 In standard belief propagation, we are given an undirected graphical model with vertex set X for which the underlying network is a tree. [sent-93, score-0.262]
43 For each Xi ∈ X , we are given a non-negative potential function ψi over possible values xi ∈ V(Xi ). [sent-96, score-0.382]
44 The main inductive phase of the belief propagation algorithm is the message-passing phase. [sent-98, score-0.252]
45 This message is indexed by all possible assignments xj ∈ V(Xj ), and is defined inductively by µi→j (xj ) = ψi (xi )ψi,j (xi , xj ) xi ∈V(Xi ) µk→i (xi ). [sent-100, score-0.924]
46 (1) Xk ∈N (Xi )\Xj Belief propagation follows the so-called message-passing protocol, in which any vertex of degree d that has received the incoming messages from any d−1 of its neighbors can perform the computation above in order to send an outgoing message to its remaining neighbor. [sent-101, score-0.723]
47 Eventually, the vertex will receive a message back from this last neighbor, at which point it will be able to calculate messages to send to its remaining d − 1 neighbors. [sent-102, score-0.385]
48 The protocol begins at the leaves of the tree, and it follows from standard arguments that until all nodes have received incoming messages from all of their neighbors, there must be some vertex that is ready to compute and send a new message. [sent-103, score-0.76]
49 The message-passing phase ends when all vertices have received messages from all of their neighbors. [sent-104, score-0.211]
50 Once vertex Xi has received all of its incoming messages, the marginal distribution P is proportional to their product. [sent-105, score-0.222]
51 That is, if xi is any setting to Xi , then P[Xi = xi ] ∝ ψi (xi ) µj→i (xi ). [sent-106, score-0.696]
52 In this case it is well-known that the algorithm computes the conditional marginals P[X i = xi |E = e]. [sent-108, score-0.414]
53 Recall that our fundamental privacy goal is to allow each vertex Xi to compute its own marginal distribution P[Xi = xi ] (or P[Xi = xi |E = e] if there is evidence), but absolutely nothing else. [sent-113, score-1.146]
54 In particular, Xi should not be able to compute the values of any of the incoming messages from its neighbors. [sent-114, score-0.248]
55 Knowledge of µj→i (xi ), for example, along with µi→j and ψi,j , may give Xi information about the marginals over Xj , a clear privacy violation. [sent-115, score-0.207]
56 We thus must somehow prevent Xi from being able to “read” any of its incoming messages — nor even its own outgoing messages — yet still allow each variable to learn its own set of marginals at the end. [sent-116, score-0.463]
57 To accomplish this we combine tools from secure multiparty function computation with a method we call “mask propagation”, in which messages remain “masked” (that is, provably unreadable) to the vertices at all times. [sent-117, score-0.514]
58 The keys required to unmask the messages are generated locally as the computation propagates through the tree, thus preserving the original communication pattern of the standard (non-private) algorithm. [sent-118, score-0.28]
59 We can think of the random value r as “masking” or hiding the value of x to a party that does not know r, while leaving it readable to a party that does. [sent-127, score-0.728]
60 Let us now return to the message-passing phase of the algorithm described by Equation (1), and let us focus on the computation of µi→j for a fixed setting xj of Xj . [sent-128, score-0.287]
61 Now under these informational assumptions, vertex X i knows the set Ii = {µ →i (xi ) + βj, (xi ) mod N : X ∈ N (Xi )\Xj , xi ∈ V(Xi )} while vertex Xj knows the set Ij = {βj, (xi ) mod N : X ∈ N (Xi )\Xj , xi ∈ V(Xi )}. [sent-133, score-1.332]
62 In order to complete the inductive step, it will be necessary for each Xk ∈ N (Xj )\Xi to provide a set of masking values βk,i (xj ) so that Xj can obtain a set of masked messages of the form µi→j (xj ) + βk,i (xj ). [sent-135, score-0.362]
63 It is clear that, ignoring privacy concerns, Xi and Xj together could compute ψi (xi )ψi,j (xi , xj ) X ∈N (Xi )\Xj µ →i (xi ) for each fixed pair xi and xj . [sent-138, score-1.029]
64 Xj will obtain masked messages in a similar manner from all but one of its other neighbors in turn, and for all of its other values, until the inductive assumptions are fully satisfied at Xj . [sent-144, score-0.377]
65 Every value received by Xi , Xj , and Xk during the above protocol is distributed uniformly at random in Zn from the perspective of its recipient, and thus conveys no information. [sent-145, score-0.378]
66 Furthermore, it is acceptable for Xj to learn its incoming messages directly, since these messages will be implied by its final marginal. [sent-148, score-0.454]
67 Thus by Theorem 1, we can construct a protocol for them to efficiently compute this value in such a way that Xj learns only µi→j (xj ) and Xi learns nothing. [sent-150, score-0.487]
68 At the end of the message-passing phase, each internal (non-leaf) node Xi will know a set of masked messages from each of its neighbors. [sent-151, score-0.27]
69 In particular, for each pair Xj , X ∈ N (Xi ), for each xi ∈ V(Xi ), Xi will know the values of µj→i (xi ) + β ,j (xi ). [sent-152, score-0.348]
70 Ignoring privacy concerns, it is the case that Xi and any pair of its neighbors could compute the marginal of Xi in Equation (2). [sent-153, score-0.289]
71 Invoking Theorem 1 again, we can construct an efficient protocol for Xi and this pair of neighbors to together compute the marginals such that Xi learns only the marginals and the neighbors learn nothing. [sent-154, score-0.796]
72 5 Theorem 2 Under standard cryptographic assumptions, PrivateBeliefProp(T ) allows every variable Xi to compute its own marginal distribution P[Xi ] and nothing else (that is, nothing not already computable in polynomial time from only P[Xi ] and the initial potential functions). [sent-158, score-0.39]
73 Direct communication occurs only between variables who are immediate neighbors or two steps away in T , and secure function computation is never invoked on sets of more than three variables. [sent-159, score-0.371]
74 Loopy Belief Propagation: Theorem 2 can be extended to privacy-preserving loopy belief propagation on graphs that contain cycles. [sent-161, score-0.236]
75 Because of the protocol’s faithfulness to the original algorithm, the same convergence and correctness claims hold as in standard loopy belief propagation [7]. [sent-162, score-0.236]
76 Privacy-Preserving Junction Tree: The protocol can also be modified to perform privacypreserving belief propagation on a junction tree [11]. [sent-167, score-0.527]
77 Here it is necessary to take intra-clique privacy into account in order to enforce that variables can learn only their own marginals and not, for example, the marginals of other nodes within the same clique. [sent-168, score-0.321]
78 4 Privacy-Preserving Gibbs Sampling We now move on to the problem of secure Gibbs sampling on an undirected graphical model G. [sent-171, score-0.223]
79 Again, we would like to accomplish this with limited communication so that no vertex is required to communicate with a vertex more than two hops away. [sent-183, score-0.349]
80 5 Since the application of standard secure function computation requires broadcast among all participants, it is a feature of the algorithm that it limits such invocations to three parties at a time. [sent-184, score-0.469]
81 Using such an encoding, there is a simple but relatively inefficient construction for privacypreserving Gibbs sampling that uses only secure multiparty function computation, but that invokes Theorem 1 on entire neighborhoods of the graph. [sent-190, score-0.362]
82 Here we describe a much more communication-efficient protocol using blinded encryption. [sent-192, score-0.291]
83 First each neighbor Vj of Xi encrypts column j of Ti using its own public key and passes the encrypted column to Xi . [sent-203, score-0.591]
84 Next Xi encrypts column d + 1 using its own public key. [sent-204, score-0.223]
85 Xi then concatenates the d + 1 encrypted columns together to form an encrypted version of Ti in which column j is encrypted using the public key of Vj for 1 ≤ j ≤ d and column d + 1 is encrypted using the public key of Xi . [sent-205, score-1.324]
86 Xi then takes the resulting table, randomly permutes the rows, and blinds (randomly re-encrypts) each entry using the appropriate public keys (i. [sent-206, score-0.213]
87 the key of Vj for column j where 1 ≤ j ≤ d and its own public key for column d + 1). [sent-208, score-0.323]
88 The purpose of the blinding steps here is to prevent parties from tracking correspondences between cleartext and encrypted table entries. [sent-210, score-0.511]
89 For instance, without blinding above, N ∗ (Xi ) could reconstruct the permutation chosen by Xi by seeing how its own encrypted values have been rearranged. [sent-211, score-0.31]
90 It then randomly permutes the rows of the table, blinds each entry using the appropriate public keys (those of Vj for columns 1 ≤ j ≤ d and its own for column d + 1), and sends the updated table back to Xi . [sent-215, score-0.367]
91 Each column j will be encrypted by the public key of Vj , with the exception of the final column, which will be encrypted by both Xi and N ∗ (Xi ). [sent-217, score-0.662]
92 Once these encrypted tables have been computed for each node, we begin the main Gibbs sampling protocol. [sent-219, score-0.255]
93 Vj can decrypt column j of Ti in order to learn which rows correspond to its value being 0 and which rows correspond to its values being 1. [sent-223, score-0.296]
94 After this secure computation of partitions has been completed for all neighbors of X i , Xi will be able to compute the intersection of the subsets of rows it has received from each neighbor. [sent-226, score-0.459]
95 Calling upon Theorem 1 once again, this means that we can construct an efficient protocol for Xi and N ∗ (Xi ) to together complete these computations in such a way that they only learn the new bits b i and bi respectively. [sent-230, score-0.401]
96 After the value of each node has been privately resampled sufficiently many times, we can use one final application of secure multi-party computation between each node Xi and its distinguished neighbor N ∗ (Xi ) to allow Xi to learn its final value. [sent-233, score-0.515]
97 Since our interests are in privacy considerations only, let us use PrivateGibbs to refer to the protocol described above when applied to any fixed Markov network, combined with some fixed updating schedule (such as random or a fixed ordering) and some number r of iterations. [sent-235, score-0.473]
98 Direct communication occurs only between variables who are immediate neighbors or two steps away, and secure function computation is never invoked on sets of more than three variables. [sent-237, score-0.371]
99 We note that PrivateGibbs enjoys an even stronger privacy property — even if any subset of parties collude by combining their post-protocol views, they can learn nothing not implied by their combined sampled values. [sent-239, score-0.535]
100 Furthermore, any convergence guarantees that hold for standard Gibbs sampling [4, 5] with the same updating schedule will also hold for the secure version. [sent-240, score-0.264]
wordName wordTfidf (topN-words)
[('party', 0.364), ('xi', 0.348), ('protocol', 0.291), ('xj', 0.247), ('encrypted', 0.217), ('parties', 0.201), ('secure', 0.185), ('vj', 0.158), ('messages', 0.147), ('privacy', 0.141), ('private', 0.141), ('vertex', 0.137), ('public', 0.133), ('encryption', 0.124), ('mod', 0.119), ('propagation', 0.111), ('cryptography', 0.108), ('protocols', 0.108), ('multiparty', 0.108), ('neighbors', 0.102), ('belief', 0.094), ('blinding', 0.093), ('nothing', 0.088), ('gibbs', 0.087), ('masking', 0.087), ('xk', 0.086), ('cryptographic', 0.081), ('masked', 0.081), ('decrypt', 0.077), ('learns', 0.075), ('alice', 0.067), ('ti', 0.066), ('marginals', 0.066), ('knows', 0.062), ('column', 0.059), ('distributed', 0.057), ('implied', 0.057), ('neighbor', 0.056), ('rows', 0.056), ('incoming', 0.055), ('send', 0.054), ('polynomial', 0.053), ('zn', 0.052), ('keys', 0.049), ('learn', 0.048), ('inductive', 0.047), ('message', 0.047), ('cryptosystem', 0.046), ('decryption', 0.046), ('millionaire', 0.046), ('privategibbs', 0.046), ('secret', 0.046), ('compute', 0.046), ('communication', 0.044), ('anything', 0.044), ('broadcast', 0.043), ('node', 0.042), ('participants', 0.041), ('schedule', 0.041), ('residue', 0.04), ('computation', 0.04), ('sends', 0.039), ('theorem', 0.039), ('sampling', 0.038), ('informally', 0.038), ('absolutely', 0.038), ('nal', 0.038), ('jointly', 0.037), ('resampled', 0.037), ('quadratic', 0.037), ('key', 0.036), ('knowing', 0.035), ('assignments', 0.035), ('exposure', 0.034), ('distinguished', 0.034), ('wealth', 0.034), ('player', 0.034), ('bit', 0.034), ('vertices', 0.034), ('potential', 0.034), ('inputs', 0.032), ('loopy', 0.031), ('bits', 0.031), ('network', 0.031), ('bi', 0.031), ('blindable', 0.031), ('blinds', 0.031), ('dodis', 0.031), ('encrypt', 0.031), ('encrypts', 0.031), ('hops', 0.031), ('nashprop', 0.031), ('nil', 0.031), ('possession', 0.031), ('privacypreserving', 0.031), ('privatebeliefprop', 0.031), ('privately', 0.031), ('received', 0.03), ('game', 0.03), ('leaf', 0.03)]
simIndex simValue paperId paperTitle
same-paper 1 1.0 157 nips-2007-Privacy-Preserving Belief Propagation and Sampling
Author: Michael Kearns, Jinsong Tan, Jennifer Wortman
Abstract: We provide provably privacy-preserving versions of belief propagation, Gibbs sampling, and other local algorithms — distributed multiparty protocols in which each party or vertex learns only its final local value, and absolutely nothing else. 1
2 0.22439347 92 nips-2007-Fixing Max-Product: Convergent Message Passing Algorithms for MAP LP-Relaxations
Author: Amir Globerson, Tommi S. Jaakkola
Abstract: We present a novel message passing algorithm for approximating the MAP problem in graphical models. The algorithm is similar in structure to max-product but unlike max-product it always converges, and can be proven to find the exact MAP solution in various settings. The algorithm is derived via block coordinate descent in a dual of the LP relaxation of MAP, but does not require any tunable parameters such as step size or tree weights. We also describe a generalization of the method to cluster based potentials. The new method is tested on synthetic and real-world problems, and compares favorably with previous approaches. Graphical models are an effective approach for modeling complex objects via local interactions. In such models, a distribution over a set of variables is assumed to factor according to cliques of a graph with potentials assigned to each clique. Finding the assignment with highest probability in these models is key to using them in practice, and is often referred to as the MAP (maximum aposteriori) assignment problem. In the general case the problem is NP hard, with complexity exponential in the tree-width of the underlying graph. Linear programming (LP) relaxations have proven very useful in approximating the MAP problem, and often yield satisfactory empirical results. These approaches relax the constraint that the solution is integral, and generally yield non-integral solutions. However, when the LP solution is integral, it is guaranteed to be the exact MAP. For some classes of problems the LP relaxation is provably correct. These include the minimum cut problem and maximum weight matching in bi-partite graphs [8]. Although LP relaxations can be solved using standard LP solvers, this may be computationally intensive for large problems [13]. The key problem with generic LP solvers is that they do not use the graph structure explicitly and thus may be sub-optimal in terms of computational efficiency. The max-product method [7] is a message passing algorithm that is often used to approximate the MAP problem. In contrast to generic LP solvers, it makes direct use of the graph structure in constructing and passing messages, and is also very simple to implement. The relation between max-product and the LP relaxation has remained largely elusive, although there are some notable exceptions: For tree-structured graphs, max-product and LP both yield the exact MAP. A recent result [1] showed that for maximum weight matching on bi-partite graphs max-product and LP also yield the exact MAP [1]. Finally, Tree-Reweighted max-product (TRMP) algorithms [5, 10] were shown to converge to the LP solution for binary xi variables, as shown in [6]. In this work, we propose the Max Product Linear Programming algorithm (MPLP) - a very simple variation on max-product that is guaranteed to converge, and has several advantageous properties. MPLP is derived from the dual of the LP relaxation, and is equivalent to block coordinate descent in the dual. Although this results in monotone improvement of the dual objective, global convergence is not always guaranteed since coordinate descent may get stuck in suboptimal points. This can be remedied using various approaches, but in practice we have found MPLP to converge to the LP 1 solution in a majority of the cases we studied. To derive MPLP we use a special form of the dual LP, which involves the introduction of redundant primal variables and constraints. We show how the dual variables corresponding to these constraints turn out to be the messages in the algorithm. We evaluate the method on Potts models and protein design problems, and show that it compares favorably with max-product (which often does not converge for these problems) and TRMP. 1 The Max-Product and MPLP Algorithms The max-product algorithm [7] is one of the most often used methods for solving MAP problems. Although it is neither guaranteed to converge to the correct solution, or in fact converge at all, it provides satisfactory results in some cases. Here we present two algorithms: EMPLP (edge based MPLP) and NMPLP (node based MPLP), which are structurally very similar to max-product, but have several key advantages: • After each iteration, the messages yield an upper bound on the MAP value, and the sequence of bounds is monotone decreasing and convergent. The messages also have a limit point that is a fixed point of the update rule. • No additional parameters (e.g., tree weights as in [6]) are required. • If the fixed point beliefs have a unique maximizer then they correspond to the exact MAP. • For binary variables, MPLP can be used to obtain the solution to an LP relaxation of the MAP problem. Thus, when this LP relaxation is exact and variables are binary, MPLP will find the MAP solution. Moreover, for any variable whose beliefs are not tied, the MAP assignment can be found (i.e., the solution is partially decodable). Pseudo code for the algorithms (and for max-product) is given in Fig. 1. As we show in the next sections, MPLP is essentially a block coordinate descent algorithm in the dual of a MAP LP relaxation. Every update of the MPLP messages corresponds to exact minimization of a set of dual variables. For EMPLP minimization is over the set of variables corresponding to an edge, and for NMPLP it is over the set of variables corresponding to all the edges a given node appears in (i.e., a star). The properties of MPLP result from its relation to the LP dual. In what follows we describe the derivation of the MPLP algorithms and prove their properties. 2 The MAP Problem and its LP Relaxation We consider functions over n variables x = {x1 , . . . , xn } defined as follows. Given a graph G = (V, E) with n vertices, and potentials θij (xi , xj ) for all edges ij ∈ E, define the function1 f (x; θ) = θij (xi , xj ) . (1) ij∈E The MAP problem is defined as finding an assignment xM that maximizes the function f (x; θ). Below we describe the standard LP relaxation for this problem. Denote by {µij (xi , xj )}ij∈E distributions over variables corresponding to edges ij ∈ E and {µi (xi )}i∈V distributions corresponding to nodes i ∈ V . We will use µ to denote a given set of distributions over all edges and nodes. The set ML (G) is defined as the set of µ where pairwise and singleton distributions are consistent x ˆ xi µij (ˆi , xj ) = µj (xj ) , ˆ xj µij (xi , xj ) = µi (xi ) ∀ij ∈ E, xi , xj ˆ ML (G) = µ ≥ 0 ∀i ∈ V xi µi (xi ) = 1 Now consider the following linear program: MAPLPR : µL∗ = arg max µ∈ML (G) µ·θ. (2) where µ·θ is shorthand for µ·θ = ij∈E xi ,xj θij (xi , xj )µij (xi , xj ). It is easy to show (see e.g., [10]) that the optimum of MAPLPR yields an upper bound on the MAP value, i.e. µL∗ ·θ ≥ f (xM ). Furthermore, when the optimal µi (xi ) have only integral values, the assignment that maximizes µi (xi ) yields the correct MAP assignment. In what follows we show how the MPLP algorithms can be derived from the dual of MAPLPR. 1 P We note that some authors also add a term i∈V θi (xi ) to f (x; θ). However, these terms can be included in the pairwise functions θij (xi , xj ), so we ignore them for simplicity. 2 3 The LP Relaxation Dual Since MAPLPR is an LP, it has an equivalent convex dual. In App. A we derive a special dual of MAPLPR using a different representation of ML (G) with redundant variables. The advantage of this dual is that it allows the derivation of simple message passing algorithms. The dual is described in the following proposition. Proposition 1 The following optimization problem is a convex dual of MAPLPR DMAPLPR : min max xi i s.t. max βki (xk , xi ) (3) k∈N (i) xk βji (xj , xi ) + βij (xi , xj ) = θij (xi , xj ) , where the dual variables are βij (xi , xj ) for all ij, ji ∈ E and values of xi and xj . The dual has an intuitive interpretation in terms of re-parameterizations. Consider the star shaped graph Gi consisting of node i and all its neighbors N (i). Assume the potential on edge ki (for k ∈ N (i)) is βki (xk , xi ). The value of the MAP assignment for this model is max max βki (xk , xi ). This is exactly the term in the objective of DMAPLPR. Thus the dual xi k∈N (i) xk corresponds to individually decoding star graphs around all nodes i ∈ V where the potentials on the graph edges should sum to the original potential. It is easy to see that this will always result in an upper bound on the MAP value. The somewhat surprising result of the duality is that there exists a β assignment such that star decoding yields the optimal value of MAPLPR. 4 Block Coordinate Descent in the Dual To obtain a convergent algorithm we use a simple block coordinate descent strategy. At every iteration, fix all variables except a subset, and optimize over this subset. It turns out that this can be done in closed form for the cases we consider. We begin by deriving the EMPLP algorithm. Consider fixing all the β variables except those corresponding to some edge ij ∈ E (i.e., βij and βji ), and minimizing DMAPLPR over the non-fixed variables. Only two terms in the DMAPLPR objective depend on βij and βji . We can write those as f (βij , βji ) = max λ−j (xi ) + max βji (xj , xi ) + max λ−i (xj ) + max βij (xi , xj ) i j xi where we defined λ−j (xi ) = i xj k∈N (i)\j xi xi (4) λki (xi ) and λki (xi ) = maxxk βki (xk , xi ) as in App. A. Note that the function f (βij , βji ) depends on the other β values only through λ−i (xj ) and λ−j (xi ). j i This implies that the optimization can be done solely in terms of λij (xj ) and there is no need to store the β values explicitly. The optimal βij , βji are obtained by minimizing f (βij , βji ) subject to the re-parameterization constraint βji (xj , xi ) + βij (xi , xj ) = θij (xi , xj ). The following proposition characterizes the minimum of f (βij , βji ). In fact, as mentioned above, we do not need to characterize the optimal βij (xi , xj ) itself, but only the new λ values. Proposition 2 Maximizing the function f (βij , βji ) yields the following λji (xi ) (and the equivalent expression for λij (xj )) 1 −j 1 λji (xi ) = − λi (xi ) + max λ−i (xj ) + θij (xi , xj ) j 2 2 xj The proposition is proved in App. B. The λ updates above result in the EMPLP algorithm, described in Fig. 1. Note that since the β optimization affects both λji (xi ) and λij (xj ), both these messages need to be updated simultaneously. We proceed to derive the NMPLP algorithm. For a given node i ∈ V , we consider all its neighbors j ∈ N (i), and wish to optimize over the variables βji (xj , xi ) for ji, ij ∈ E (i.e., all the edges in a star centered on i), while the other variables are fixed. One way of doing so is to use the EMPLP algorithm for the edges in the star, and iterate it until convergence. We now show that the result of 3 Inputs: A graph G = (V, E), potential functions θij (xi , xj ) for each edge ij ∈ E. Initialization: Initialize messages to any value. Algorithm: • Iterate until a stopping criterion is satisfied: – Max-product: Iterate over messages and update (cji shifts the max to zero) h i mji (xi )← max m−i (xj ) + θij (xi , xj ) − cji j xj – EMPLP: For each ij ∈ E, update λji (xi ) and λij (xj ) simultaneously (the update for λij (xj ) is the same with i and j exchanged) h i 1 1 λji (xi )← − λ−j (xi ) + max λ−i (xj ) + θij (xi , xj ) j i 2 2 xj – NMPLP: Iterate over nodes i ∈ V and update all γij (xj ) where j ∈ N (i) 2 3 X 2 γij (xj )← max 4θij (xi , xj ) − γji (xi ) + γki (xi )5 xi |N (i)| + 1 k∈N(i) • Calculate node “beliefs”: Set biP i ) to be the sum of incoming messages into node i ∈ V (x (e.g., for NMPLP set bi (xi ) = k∈N(i) γki (xi )). Output: Return assignment x defined as xi = arg maxxi b(ˆi ). x ˆ Figure 1: The max-product, EMPLP and NMPLP algorithms. Max-product, EMPLP and NMPLP use mesP sages mij , λij and γij respectively. We use the notation m−i (xj ) = k∈N(j)\i mkj (xj ). j this optimization can be found in closed form. The assumption about β being fixed outside the star implies that λ−i (xj ) is fixed. Define: γji (xi ) = maxxj θij (xi , xj ) + λ−i (xj ) . Simple algebra j j yields the following relation between λ−j (xi ) and γki (xi ) for k ∈ N (i) i 2 λ−j (xi ) = −γji (xi ) + γki (xi ) (5) i |N (i)| + 1 k∈N (i) Plugging this into the definition of γji (xi ) we obtain the NMPLP update in Fig. 1. The messages for both algorithms can be initialized to any value since it can be shown that after one iteration they will correspond to valid β values. 5 Convergence Properties The MPLP algorithm decreases the dual objective (i.e., an upper bound on the MAP value) at every iteration, and thus its dual objective values form a convergent sequence. Using arguments similar to [5] it can be shown that MPLP has a limit point that is a fixed point of its updates. This in itself does not guarantee convergence to the dual optimum since coordinate descent algorithms may get stuck at a point that is not a global optimum. There are ways of overcoming this difficulty, for example by smoothing the objective [4] or using techniques as in [2] (see p. 636). We leave such extensions for further work. In this section we provide several results about the properties of the MPLP fixed points and their relation to the corresponding LP. First, we claim that if all beliefs have unique maxima then the exact MAP assignment is obtained. Proposition 3 If the fixed point of MPLP has bi (xi ) such that for all i the function bi (xi ) has a unique maximizer x∗ , then x∗ is the solution to the MAP problem and the LP relaxation is exact. i Since the dual objective is always greater than or equal to the MAP value, it suffices to show that there exists a dual feasible point whose objective value is f (x∗ ). Denote by β ∗ , λ∗ the value of the corresponding dual parameters at the fixed point of MPLP. Then the dual objective satisfies λ∗ (xi ) = ki max i xi k∈N (i) ∗ max βki (xk , x∗ ) = i i k∈N (i) xk ∗ βki (x∗ , x∗ ) = f (x∗ ) k i i 4 k∈N (i) To see why the second equality holds, note that bi (x∗ ) = maxxi ,xj λ−j (xi ) + βji (xj , xi ) and i i bj (x∗ ) = maxxi ,xj λ−i (xj ) + βij (xi , xj ). By the equalization property in Eq. 9 the arguments of j j the two max operations are equal. From the unique maximum assumption it follows that x∗ , x∗ are i j the unique maximizers of the above. It follows that βji , βij are also maximized by x∗ , x∗ . i j In the general case, the MPLP fixed point may not correspond to a primal optimum because of the local optima problem with coordinate descent. However, when the variables are binary, fixed points do correspond to primal solutions, as the following proposition states. Proposition 4 When xi are binary, the MPLP fixed point can be used to obtain the primal optimum. The claim can be shown by constructing a primal optimal solution µ∗ . For tied bi , set µ∗ (xi ) to 0.5 i and for untied bi , set µ∗ (x∗ ) to 1. If bi , bj are not tied we set µ∗ (x∗ , x∗ ) = 1. If bi is not tied but bj i i ij i j is, we set µ∗ (x∗ , xj ) = 0.5. If bi , bj are tied then βji , βij can be shown to be maximized at either ij i x∗ , x∗ = (0, 0), (1, 1) or x∗ , x∗ = (0, 1), (1, 0). We then set µ∗ to be 0.5 at one of these assignment i j i j ij ∗ pairs. The resulting µ∗ is clearly primal feasible. Setting δi = b∗ we obtain that the dual variables i (δ ∗ , λ∗ , β ∗ ) and primal µ∗ satisfy complementary slackness for the LP in Eq. 7 and therefore µ∗ is primal optimal. The binary optimality result implies partial decodability, since [6] shows that the LP is partially decodable for binary variables. 6 Beyond pairwise potentials: Generalized MPLP In the previous sections we considered maximizing functions which factor according to the edges of the graph. A more general setting considers clusters c1 , . . . , ck ⊂ {1, . . . , n} (the set of clusters is denoted by C), and a function f (x; θ) = c θc (xc ) defined via potentials over clusters θc (xc ). The MAP problem in this case also has an LP relaxation (see e.g. [11]). To define the LP we introduce the following definitions: S = {c ∩ c : c, c ∈ C, c ∩ c = ∅} is the set of intersection between clusters ˆ ˆ ˆ and S(c) = {s ∈ S : s ⊆ c} is the set of overlap sets for cluster c.We now consider marginals over the variables in c ∈ C and s ∈ S and require that cluster marginals agree on their overlap. Denote this set by ML (C). The LP relaxation is then to maximize µ · θ subject to µ ∈ ML (C). As in Sec. 4, we can derive message passing updates that result in monotone decrease of the dual LP of the above relaxation. The derivation is similar and we omit the details. The key observation is that one needs to introduce |S(c)| copies of each marginal µc (xc ) (instead of the two copies in the pairwise case). Next, as in the EMPLP derivation we assume all β are fixed except those corresponding to some cluster c. The resulting messages are λc→s (xs ) from a cluster c to all of its intersection sets s ∈ S(c). The update on these messages turns out to be: 1 1 λ−c (xs ) + max λ−c (xs ) + θc (xc ) λc→s (xs ) = − 1 − ˆ s s ˆ |S(c)| |S(c)| xc\s s∈S(c)\s ˆ where for a given c ∈ C all λc→s should be updated simultaneously for s ∈ S(c), and λ−c (xs ) is s defined as the sum of messages into s that are not from c. We refer to this algorithm as Generalized EMPLP (GEMPLP). It is possible to derive an algorithm similar to NMPLP that updates several clusters simultaneously, but its structure is more involved and we do not address it here. 7 Related Work Weiss et al. [11] recently studied the fixed points of a class of max-product like algorithms. Their analysis focused on properties of fixed points rather than convergence guarantees. Specifically, they showed that if the counting numbers used in a generalized max-product algorithm satisfy certain properties, then its fixed points will be the exact MAP if the beliefs have unique maxima, and for binary variables the solution can be partially decodable. Both these properties are obtained for the MPLP fixed points, and in fact we can show that MPLP satisfies the conditions in [11], so that we obtain these properties as corollaries of [11]. We stress however, that [11] does not address convergence of algorithms, but rather properties of their fixed points, if they converge. MPLP is similar in some aspects to Kolmogorov’s TRW-S algorithm [5]. TRW-S is also a monotone coordinate descent method in a dual of the LP relaxation and its fixed points also have similar 5 guarantees to those of MPLP [6]. Furthermore, convergence to a local optimum may occur, as it does for MPLP. One advantage of MPLP lies in the simplicity of its updates and the fact that it is parameter free. The other is its simple generalization to potentials over clusters of nodes (Sec. 6). Recently, several new dual LP algorithms have been introduced, which are more closely related to our formalism. Werner [12] presented a class of algorithms which also improve the dual LP at every iteration. The simplest of those is the max-sum-diffusion algorithm, which is similar to our EMPLP algorithm, although the updates are different from ours. Independently, Johnson et al. [4] presented a class of algorithms that improve duals of the MAP-LP using coordinate descent. They decompose the model into tractable parts by replicating variables and enforce replication constraints within the Lagrangian dual. Our basic formulation in Eq. 3 could be derived from their perspective. However, the updates in the algorithm and the analysis differ. Johnson et al. also presented a method for overcoming the local optimum problem, by smoothing the objective so that it is strictly convex. Such an approach could also be used within our algorithms. Vontobel and Koetter [9] recently introduced a coordinate descent algorithm for decoding LDPC codes. Their method is specifically tailored for this case, and uses updates that are similar to our edge based updates. Finally, the concept of dual coordinate descent may be used in approximating marginals as well. In [3] we use such an approach to optimize a variational bound on the partition function. The derivation uses some of the ideas used in the MPLP dual, but importantly does not find the minimum for each coordinate. Instead, a gradient like step is taken at every iteration to decrease the dual objective. 8 Experiments We compared NMPLP to three other message passing algorithms:2 Tree-Reweighted max-product (TRMP) [10],3 standard max-product (MP), and GEMPLP. For MP and TRMP we used the standard approach of damping messages using a factor of α = 0.5. We ran all algorithms for a maximum of 2000 iterations, and used the hit-time measure to compare their speed of convergence. This measure is defined as follows: At every iteration the beliefs can be used to obtain an assignment x with value f (x). We define the hit-time as the first iteration at which the maximum value of f (x) is achieved.4 We first experimented with a 10 × 10 grid graph, with 5 values per state. The function f (x) was 5 a Potts model: f (x) = The values for θij and θi (xi ) ij∈E θij I(xi = xj ) + i∈V θi (xi ). were randomly drawn from [−cI , cI ] and [−cF , cF ] respectively, and we used values of cI and cF in the range range [0.1, 2.35] (with intervals of 0.25), resulting in 100 different models. The clusters for GEMPLP were the faces of the graph [14]. To see if NMPLP converges to the LP solution we also used an LP solver to solve the LP relaxation. We found that the the normalized difference between NMPLP and LP objective was at most 10−3 (median 10−7 ), suggesting that NMPLP typically converged to the LP solution. Fig. 2 (top row) shows the results for the three algorithms. It can be seen that while all non-cluster based algorithms obtain similar f (x) values, NMPLP has better hit-time (in the median) than TRMP and MP, and MP does not converge in many cases (see caption). GEMPLP converges more slowly than NMPLP, but obtains much better f (x) values. In fact, in 99% of the cases the normalized difference between the GEMPLP objective and the f (x) value was less than 10−5 , suggesting that the exact MAP solution was found. We next applied the algorithms to the real world problems of protein design. In [13], Yanover et al. show how these problems can be formalized in terms of finding a MAP in an appropriately constructed graphical model.6 We used all algorithms except GNMPLP (since there is no natural choice for clusters in this case) to approximate the MAP solution on the 97 models used in [13]. In these models the number of states per variable is 2 − 158, and there are up to 180 variables per model. Fig. 2 (bottom) shows results for all the design problems. In this case only 11% of the MP runs converged, and NMPLP was better than TRMP in terms of hit-time and comparable in f (x) value. The performance of MP was good on the runs where it converged. 2 As expected, NMPLP was faster than EMPLP so only NMPLP results are given. The edge weights for TRMP corresponded to a uniform distribution over all spanning trees. 4 This is clearly a post-hoc measure since it can only be obtained after the algorithm has exceeded its maximum number of iterations. However, it is a reasonable algorithm-independent measure of convergence. 5 The potential θi (xi ) may be folded into the pairwise potential to yield a model as in Eq. 1. 6 Data available from http://jmlr.csail.mit.edu/papers/volume7/yanover06a/Rosetta Design Dataset.tgz 3 6 (a) (b) (c) 100 (d) 0.6 2000 0.04 0.4 0.02 −50 0 −0.02 −0.04 ∆(Value) 0 1000 ∆(Hit Time) ∆(Value) ∆(Hit Time) 50 0 MP TRMP GMPLP 0 −0.2 −1000 −0.4 −0.06 −100 0.2 MP TRMP GMPLP MP TRMP MP TRMP Figure 2: Evaluation of message passing algorithms on Potts models and protein design problems. (a,c): Convergence time results for the Potts models (a) and protein design problems (c). The box-plots (horiz. red line indicates median) show the difference between the hit-time for the other algorithms and NMPLP. (b,d): Value of integer solutions for the Potts models (b) and protein design problems (d). The box-plots show the normalized difference between the value of f (x) for NMPLP and the other algorithms. All figures are such that better MPLP performance yields positive Y axis values. Max-product converged on 58% of the cases for the Potts models, and on 11% of the protein problems. Only convergent max-product runs are shown. 9 Conclusion We have presented a convergent algorithm for MAP approximation that is based on block coordinate descent of the MAP-LP relaxation dual. The algorithm can also be extended to cluster based functions, which result empirically in improved MAP estimates. This is in line with the observations in [14] that generalized belief propagation algorithms can result in significant performance improvements. However generalized max-product algorithms [14] are not guaranteed to converge whereas GMPLP is. Furthermore, the GMPLP algorithm does not require a region graph and only involves intersection between pairs of clusters. In conclusion, MPLP has the advantage of resolving the convergence problems of max-product while retaining its simplicity, and offering the theoretical guarantees of LP relaxations. We thus believe it should be useful in a wide array of applications. A Derivation of the dual Before deriving the dual, we first express the constraint set ML (G) in a slightly different way. The definition of ML (G) in Sec. 2 uses a single distribution µij (xi , xj ) for every ij ∈ E. In what follows, we use two copies of this pairwise distribution for every edge, which we denote µij (xi , xj ) ¯ and µji (xj , xi ), and we add the constraint that these two copies both equal the original µij (xi , xj ). ¯ For this extended set of pairwise marginals, we consider the following set of constraints which is clearly equivalent to ML (G). On the rightmost column we give the dual variables that will correspond to each constraint (we omit non-negativity constraints). µij (xi , xj ) = µij (xi , xj ) ¯ µji (xj , xi ) = µij (xi , xj ) ¯ x xi µij (ˆi , xj ) = µj (xj ) ˆ ¯ µji (ˆj , xi ) = µi (xi ) ¯ x xj ˆ xi µi (xi ) = 1 ∀ij ∈ E, xi , xj ∀ij ∈ E, xi , xj ∀ij ∈ E, xj ∀ji ∈ E, xi ∀i ∈ V βij (xi , xj ) βji (xj , xi ) λij (xj ) λji (xi ) δi (6) ¯ We denote the set of (µ, µ) satisfying these constraints by ML (G). We can now state an LP that ¯ is equivalent to MAPLPR, only with an extended set of variables and constraints. The equivalent ¯ problem is to maximize µ · θ subject to (µ, µ) ∈ ML (G) (note that the objective uses the original ¯ µ copy). LP duality transformation of the extended problem yields the following LP min i δi s.t. λij (xj ) − βij (xi , xj ) ≥ 0 βij (xi , xj ) + βji (xj , xi ) = θij (xi , xj ) − k∈N (i) λki (xi ) + δi ≥ 0 ∀ij, ji ∈ E, xi , xj ∀ij ∈ E, xi , xj ∀i ∈ V, xi (7) We next simplify the above LP by eliminating some of its constraints and variables. Since each variable δi appears in only one constraint, and the objective minimizes δi it follows that δi = maxxi k∈N (i) λki (xi ) and the constraints with δi can be discarded. Similarly, since λij (xj ) appears in a single constraint, we have that for all ij ∈ E, ji ∈ E, xi , xj λij (xj ) = maxxi βij (xi , xj ) and the constraints with λij (xj ), λji (xi ) can also be discarded. Using the eliminated δi and λji (xi ) 7 variables, we obtain that the LP in Eq. 7 is equivalent to that in Eq. 3. Note that the objective in Eq. 3 is convex since it is a sum of point-wise maxima of convex functions. B Proof of Proposition 2 We wish to minimize f in Eq. 4 subject to the constraint that βij + βji = θij . Rewrite f as f (βij , βji ) = max λ−j (xi ) + βji (xj , xi ) + max λ−i (xj ) + βij (xi , xj ) j i xi ,xj xi ,xj (8) The sum of the two arguments in the max is λ−j (xi ) + λ−i (xj ) + θij (xi , xj ) i j (because of the constraints on β). Thus the minimum must be greater than −j −i 1 2 maxxi ,xj λi (xi ) + λj (xj ) + θij (xi , xj ) . One assignment to β that achieves this minimum is obtained by requiring an equalization condition:7 λ−i (xj ) + βij (xi , xj ) = λ−j (xi ) + βji (xj , xi ) = j i 1 θij (xi , xj ) + λ−j (xi ) + λ−i (xj ) i j 2 (9) which implies βij (xi , xj ) = 1 θij (xi , xj ) + λ−j (xi ) − λ−i (xj ) and a similar expression for βji . i j 2 The resulting λij (xj ) = maxxi βij (xi , xj ) are then the ones in Prop. 2. Acknowledgments The authors acknowledge support from the Defense Advanced Research Projects Agency (Transfer Learning program). Amir Globerson was also supported by the Rothschild Yad-Hanadiv fellowship. References [1] M. Bayati, D. Shah, and M. Sharma. Maximum weight matching via max-product belief propagation. IEEE Trans. on Information Theory (to appear), 2007. [2] D. P. Bertsekas, editor. Nonlinear Programming. Athena Scientific, Belmont, MA, 1995. [3] A. Globerson and T. Jaakkola. Convergent propagation algorithms via oriented trees. In UAI. 2007. [4] J.K. Johnson, D.M. Malioutov, and A.S. Willsky. Lagrangian relaxation for map estimation in graphical models. In Allerton Conf. Communication, Control and Computing, 2007. [5] V. Kolmogorov. Convergent tree-reweighted message passing for energy minimization. IEEE Transactions on Pattern Analysis and Machine Intelligence, 28(10):1568–1583, 2006. [6] V. Kolmogorov and M. Wainwright. On the optimality of tree-reweighted max-product message passing. In 21st Conference on Uncertainty in Artificial Intelligence (UAI). 2005. [7] J. Pearl. Probabilistic Reasoning in Intelligent Systems. Morgan Kaufmann, 1988. [8] B. Taskar, S. Lacoste-Julien, and M. Jordan. Structured prediction, dual extragradient and bregman projections. Journal of Machine Learning Research, pages 1627–1653, 2006. [9] P.O. Vontobel and R. Koetter. Towards low-complexity linear-programming decoding. In Proc. 4th Int. Symposium on Turbo Codes and Related Topics, 2006. [10] M. J. Wainwright, T. Jaakkola, and A. S. Willsky. Map estimation via agreement on trees: messagepassing and linear programming. IEEE Trans. on Information Theory, 51(11):1120–1146, 2005. [11] Y. Weiss, C. Yanover, and T. Meltzer. Map estimation, linear programming and belief propagation with convex free energies. In UAI. 2007. [12] T. Werner. A linear programming approach to max-sum, a review. IEEE Trans. on PAMI, 2007. [13] C. Yanover, T. Meltzer, and Y. Weiss. Linear programming relaxations and belief propagation – an empirical study. Jourmal of Machine Learning Research, 7:1887–1907, 2006. [14] J.S. Yedidia, W.T. W.T. Freeman, and Y. Weiss. Constructing free-energy approximations and generalized belief propagation algorithms. IEEE Trans. on Information Theory, 51(7):2282–2312, 2005. 7 Other solutions are possible but may not yield some of the properties of MPLP. 8
3 0.13705847 75 nips-2007-Efficient Bayesian Inference for Dynamically Changing Graphs
Author: Ozgur Sumer, Umut Acar, Alexander T. Ihler, Ramgopal R. Mettu
Abstract: Motivated by stochastic systems in which observed evidence and conditional dependencies between states of the network change over time, and certain quantities of interest (marginal distributions, likelihood estimates etc.) must be updated, we study the problem of adaptive inference in tree-structured Bayesian networks. We describe an algorithm for adaptive inference that handles a broad range of changes to the network and is able to maintain marginal distributions, MAP estimates, and data likelihoods in all expected logarithmic time. We give an implementation of our algorithm and provide experiments that show that the algorithm can yield up to two orders of magnitude speedups on answering queries and responding to dynamic changes over the sum-product algorithm. 1
4 0.10448773 94 nips-2007-Gaussian Process Models for Link Analysis and Transfer Learning
Author: Kai Yu, Wei Chu
Abstract: This paper aims to model relational data on edges of networks. We describe appropriate Gaussian Processes (GPs) for directed, undirected, and bipartite networks. The inter-dependencies of edges can be effectively modeled by adapting the GP hyper-parameters. The framework suggests an intimate connection between link prediction and transfer learning, which were traditionally two separate research topics. We develop an efficient learning algorithm that can handle a large number of observations. The experimental results on several real-world data sets verify superior learning capacity. 1
5 0.095571391 128 nips-2007-Message Passing for Max-weight Independent Set
Author: Sujay Sanghavi, Devavrat Shah, Alan S. Willsky
Abstract: We investigate the use of message-passing algorithms for the problem of finding the max-weight independent set (MWIS) in a graph. First, we study the performance of loopy max-product belief propagation. We show that, if it converges, the quality of the estimate is closely related to the tightness of an LP relaxation of the MWIS problem. We use this relationship to obtain sufficient conditions for correctness of the estimate. We then develop a modification of max-product – one that converges to an optimal solution of the dual of the MWIS problem. We also develop a simple iterative algorithm for estimating the max-weight independent set from this dual solution. We show that the MWIS estimate obtained using these two algorithms in conjunction is correct when the graph is bipartite and the MWIS is unique. Finally, we show that any problem of MAP estimation for probability distributions over finite domains can be reduced to an MWIS problem. We believe this reduction will yield new insights and algorithms for MAP estimation. 1
6 0.094946221 42 nips-2007-CPR for CSPs: A Probabilistic Relaxation of Constraint Propagation
7 0.09451817 34 nips-2007-Bayesian Policy Learning with Trans-Dimensional MCMC
8 0.08225935 32 nips-2007-Bayesian Co-Training
9 0.080142379 10 nips-2007-A Randomized Algorithm for Large Scale Support Vector Learning
10 0.079252169 64 nips-2007-Cooled and Relaxed Survey Propagation for MRFs
11 0.076031774 187 nips-2007-Structured Learning with Approximate Inference
12 0.071563721 144 nips-2007-On Ranking in Survival Analysis: Bounds on the Concordance Index
13 0.068570085 148 nips-2007-Online Linear Regression and Its Application to Model-Based Reinforcement Learning
14 0.067957416 141 nips-2007-New Outer Bounds on the Marginal Polytope
15 0.067071132 212 nips-2007-Using Deep Belief Nets to Learn Covariance Kernels for Gaussian Processes
16 0.06699191 193 nips-2007-The Distribution Family of Similarity Distances
17 0.066787831 118 nips-2007-Learning with Transformation Invariant Kernels
18 0.065494187 166 nips-2007-Regularized Boost for Semi-Supervised Learning
19 0.062869996 53 nips-2007-Compressed Regression
20 0.062189557 121 nips-2007-Local Algorithms for Approximate Inference in Minor-Excluded Graphs
topicId topicWeight
[(0, -0.193), (1, -0.017), (2, -0.108), (3, 0.046), (4, -0.074), (5, -0.154), (6, 0.001), (7, 0.084), (8, 0.127), (9, -0.036), (10, -0.145), (11, -0.074), (12, -0.061), (13, 0.084), (14, 0.067), (15, 0.015), (16, 0.04), (17, 0.018), (18, -0.123), (19, -0.063), (20, 0.216), (21, -0.095), (22, 0.003), (23, 0.024), (24, -0.035), (25, -0.011), (26, -0.061), (27, 0.001), (28, -0.191), (29, 0.089), (30, 0.064), (31, -0.003), (32, -0.027), (33, -0.103), (34, 0.078), (35, 0.026), (36, 0.015), (37, -0.031), (38, -0.068), (39, 0.161), (40, 0.063), (41, -0.001), (42, 0.023), (43, 0.125), (44, -0.043), (45, 0.028), (46, 0.049), (47, 0.024), (48, -0.063), (49, -0.049)]
simIndex simValue paperId paperTitle
same-paper 1 0.97860771 157 nips-2007-Privacy-Preserving Belief Propagation and Sampling
Author: Michael Kearns, Jinsong Tan, Jennifer Wortman
Abstract: We provide provably privacy-preserving versions of belief propagation, Gibbs sampling, and other local algorithms — distributed multiparty protocols in which each party or vertex learns only its final local value, and absolutely nothing else. 1
2 0.80474687 92 nips-2007-Fixing Max-Product: Convergent Message Passing Algorithms for MAP LP-Relaxations
Author: Amir Globerson, Tommi S. Jaakkola
Abstract: We present a novel message passing algorithm for approximating the MAP problem in graphical models. The algorithm is similar in structure to max-product but unlike max-product it always converges, and can be proven to find the exact MAP solution in various settings. The algorithm is derived via block coordinate descent in a dual of the LP relaxation of MAP, but does not require any tunable parameters such as step size or tree weights. We also describe a generalization of the method to cluster based potentials. The new method is tested on synthetic and real-world problems, and compares favorably with previous approaches. Graphical models are an effective approach for modeling complex objects via local interactions. In such models, a distribution over a set of variables is assumed to factor according to cliques of a graph with potentials assigned to each clique. Finding the assignment with highest probability in these models is key to using them in practice, and is often referred to as the MAP (maximum aposteriori) assignment problem. In the general case the problem is NP hard, with complexity exponential in the tree-width of the underlying graph. Linear programming (LP) relaxations have proven very useful in approximating the MAP problem, and often yield satisfactory empirical results. These approaches relax the constraint that the solution is integral, and generally yield non-integral solutions. However, when the LP solution is integral, it is guaranteed to be the exact MAP. For some classes of problems the LP relaxation is provably correct. These include the minimum cut problem and maximum weight matching in bi-partite graphs [8]. Although LP relaxations can be solved using standard LP solvers, this may be computationally intensive for large problems [13]. The key problem with generic LP solvers is that they do not use the graph structure explicitly and thus may be sub-optimal in terms of computational efficiency. The max-product method [7] is a message passing algorithm that is often used to approximate the MAP problem. In contrast to generic LP solvers, it makes direct use of the graph structure in constructing and passing messages, and is also very simple to implement. The relation between max-product and the LP relaxation has remained largely elusive, although there are some notable exceptions: For tree-structured graphs, max-product and LP both yield the exact MAP. A recent result [1] showed that for maximum weight matching on bi-partite graphs max-product and LP also yield the exact MAP [1]. Finally, Tree-Reweighted max-product (TRMP) algorithms [5, 10] were shown to converge to the LP solution for binary xi variables, as shown in [6]. In this work, we propose the Max Product Linear Programming algorithm (MPLP) - a very simple variation on max-product that is guaranteed to converge, and has several advantageous properties. MPLP is derived from the dual of the LP relaxation, and is equivalent to block coordinate descent in the dual. Although this results in monotone improvement of the dual objective, global convergence is not always guaranteed since coordinate descent may get stuck in suboptimal points. This can be remedied using various approaches, but in practice we have found MPLP to converge to the LP 1 solution in a majority of the cases we studied. To derive MPLP we use a special form of the dual LP, which involves the introduction of redundant primal variables and constraints. We show how the dual variables corresponding to these constraints turn out to be the messages in the algorithm. We evaluate the method on Potts models and protein design problems, and show that it compares favorably with max-product (which often does not converge for these problems) and TRMP. 1 The Max-Product and MPLP Algorithms The max-product algorithm [7] is one of the most often used methods for solving MAP problems. Although it is neither guaranteed to converge to the correct solution, or in fact converge at all, it provides satisfactory results in some cases. Here we present two algorithms: EMPLP (edge based MPLP) and NMPLP (node based MPLP), which are structurally very similar to max-product, but have several key advantages: • After each iteration, the messages yield an upper bound on the MAP value, and the sequence of bounds is monotone decreasing and convergent. The messages also have a limit point that is a fixed point of the update rule. • No additional parameters (e.g., tree weights as in [6]) are required. • If the fixed point beliefs have a unique maximizer then they correspond to the exact MAP. • For binary variables, MPLP can be used to obtain the solution to an LP relaxation of the MAP problem. Thus, when this LP relaxation is exact and variables are binary, MPLP will find the MAP solution. Moreover, for any variable whose beliefs are not tied, the MAP assignment can be found (i.e., the solution is partially decodable). Pseudo code for the algorithms (and for max-product) is given in Fig. 1. As we show in the next sections, MPLP is essentially a block coordinate descent algorithm in the dual of a MAP LP relaxation. Every update of the MPLP messages corresponds to exact minimization of a set of dual variables. For EMPLP minimization is over the set of variables corresponding to an edge, and for NMPLP it is over the set of variables corresponding to all the edges a given node appears in (i.e., a star). The properties of MPLP result from its relation to the LP dual. In what follows we describe the derivation of the MPLP algorithms and prove their properties. 2 The MAP Problem and its LP Relaxation We consider functions over n variables x = {x1 , . . . , xn } defined as follows. Given a graph G = (V, E) with n vertices, and potentials θij (xi , xj ) for all edges ij ∈ E, define the function1 f (x; θ) = θij (xi , xj ) . (1) ij∈E The MAP problem is defined as finding an assignment xM that maximizes the function f (x; θ). Below we describe the standard LP relaxation for this problem. Denote by {µij (xi , xj )}ij∈E distributions over variables corresponding to edges ij ∈ E and {µi (xi )}i∈V distributions corresponding to nodes i ∈ V . We will use µ to denote a given set of distributions over all edges and nodes. The set ML (G) is defined as the set of µ where pairwise and singleton distributions are consistent x ˆ xi µij (ˆi , xj ) = µj (xj ) , ˆ xj µij (xi , xj ) = µi (xi ) ∀ij ∈ E, xi , xj ˆ ML (G) = µ ≥ 0 ∀i ∈ V xi µi (xi ) = 1 Now consider the following linear program: MAPLPR : µL∗ = arg max µ∈ML (G) µ·θ. (2) where µ·θ is shorthand for µ·θ = ij∈E xi ,xj θij (xi , xj )µij (xi , xj ). It is easy to show (see e.g., [10]) that the optimum of MAPLPR yields an upper bound on the MAP value, i.e. µL∗ ·θ ≥ f (xM ). Furthermore, when the optimal µi (xi ) have only integral values, the assignment that maximizes µi (xi ) yields the correct MAP assignment. In what follows we show how the MPLP algorithms can be derived from the dual of MAPLPR. 1 P We note that some authors also add a term i∈V θi (xi ) to f (x; θ). However, these terms can be included in the pairwise functions θij (xi , xj ), so we ignore them for simplicity. 2 3 The LP Relaxation Dual Since MAPLPR is an LP, it has an equivalent convex dual. In App. A we derive a special dual of MAPLPR using a different representation of ML (G) with redundant variables. The advantage of this dual is that it allows the derivation of simple message passing algorithms. The dual is described in the following proposition. Proposition 1 The following optimization problem is a convex dual of MAPLPR DMAPLPR : min max xi i s.t. max βki (xk , xi ) (3) k∈N (i) xk βji (xj , xi ) + βij (xi , xj ) = θij (xi , xj ) , where the dual variables are βij (xi , xj ) for all ij, ji ∈ E and values of xi and xj . The dual has an intuitive interpretation in terms of re-parameterizations. Consider the star shaped graph Gi consisting of node i and all its neighbors N (i). Assume the potential on edge ki (for k ∈ N (i)) is βki (xk , xi ). The value of the MAP assignment for this model is max max βki (xk , xi ). This is exactly the term in the objective of DMAPLPR. Thus the dual xi k∈N (i) xk corresponds to individually decoding star graphs around all nodes i ∈ V where the potentials on the graph edges should sum to the original potential. It is easy to see that this will always result in an upper bound on the MAP value. The somewhat surprising result of the duality is that there exists a β assignment such that star decoding yields the optimal value of MAPLPR. 4 Block Coordinate Descent in the Dual To obtain a convergent algorithm we use a simple block coordinate descent strategy. At every iteration, fix all variables except a subset, and optimize over this subset. It turns out that this can be done in closed form for the cases we consider. We begin by deriving the EMPLP algorithm. Consider fixing all the β variables except those corresponding to some edge ij ∈ E (i.e., βij and βji ), and minimizing DMAPLPR over the non-fixed variables. Only two terms in the DMAPLPR objective depend on βij and βji . We can write those as f (βij , βji ) = max λ−j (xi ) + max βji (xj , xi ) + max λ−i (xj ) + max βij (xi , xj ) i j xi where we defined λ−j (xi ) = i xj k∈N (i)\j xi xi (4) λki (xi ) and λki (xi ) = maxxk βki (xk , xi ) as in App. A. Note that the function f (βij , βji ) depends on the other β values only through λ−i (xj ) and λ−j (xi ). j i This implies that the optimization can be done solely in terms of λij (xj ) and there is no need to store the β values explicitly. The optimal βij , βji are obtained by minimizing f (βij , βji ) subject to the re-parameterization constraint βji (xj , xi ) + βij (xi , xj ) = θij (xi , xj ). The following proposition characterizes the minimum of f (βij , βji ). In fact, as mentioned above, we do not need to characterize the optimal βij (xi , xj ) itself, but only the new λ values. Proposition 2 Maximizing the function f (βij , βji ) yields the following λji (xi ) (and the equivalent expression for λij (xj )) 1 −j 1 λji (xi ) = − λi (xi ) + max λ−i (xj ) + θij (xi , xj ) j 2 2 xj The proposition is proved in App. B. The λ updates above result in the EMPLP algorithm, described in Fig. 1. Note that since the β optimization affects both λji (xi ) and λij (xj ), both these messages need to be updated simultaneously. We proceed to derive the NMPLP algorithm. For a given node i ∈ V , we consider all its neighbors j ∈ N (i), and wish to optimize over the variables βji (xj , xi ) for ji, ij ∈ E (i.e., all the edges in a star centered on i), while the other variables are fixed. One way of doing so is to use the EMPLP algorithm for the edges in the star, and iterate it until convergence. We now show that the result of 3 Inputs: A graph G = (V, E), potential functions θij (xi , xj ) for each edge ij ∈ E. Initialization: Initialize messages to any value. Algorithm: • Iterate until a stopping criterion is satisfied: – Max-product: Iterate over messages and update (cji shifts the max to zero) h i mji (xi )← max m−i (xj ) + θij (xi , xj ) − cji j xj – EMPLP: For each ij ∈ E, update λji (xi ) and λij (xj ) simultaneously (the update for λij (xj ) is the same with i and j exchanged) h i 1 1 λji (xi )← − λ−j (xi ) + max λ−i (xj ) + θij (xi , xj ) j i 2 2 xj – NMPLP: Iterate over nodes i ∈ V and update all γij (xj ) where j ∈ N (i) 2 3 X 2 γij (xj )← max 4θij (xi , xj ) − γji (xi ) + γki (xi )5 xi |N (i)| + 1 k∈N(i) • Calculate node “beliefs”: Set biP i ) to be the sum of incoming messages into node i ∈ V (x (e.g., for NMPLP set bi (xi ) = k∈N(i) γki (xi )). Output: Return assignment x defined as xi = arg maxxi b(ˆi ). x ˆ Figure 1: The max-product, EMPLP and NMPLP algorithms. Max-product, EMPLP and NMPLP use mesP sages mij , λij and γij respectively. We use the notation m−i (xj ) = k∈N(j)\i mkj (xj ). j this optimization can be found in closed form. The assumption about β being fixed outside the star implies that λ−i (xj ) is fixed. Define: γji (xi ) = maxxj θij (xi , xj ) + λ−i (xj ) . Simple algebra j j yields the following relation between λ−j (xi ) and γki (xi ) for k ∈ N (i) i 2 λ−j (xi ) = −γji (xi ) + γki (xi ) (5) i |N (i)| + 1 k∈N (i) Plugging this into the definition of γji (xi ) we obtain the NMPLP update in Fig. 1. The messages for both algorithms can be initialized to any value since it can be shown that after one iteration they will correspond to valid β values. 5 Convergence Properties The MPLP algorithm decreases the dual objective (i.e., an upper bound on the MAP value) at every iteration, and thus its dual objective values form a convergent sequence. Using arguments similar to [5] it can be shown that MPLP has a limit point that is a fixed point of its updates. This in itself does not guarantee convergence to the dual optimum since coordinate descent algorithms may get stuck at a point that is not a global optimum. There are ways of overcoming this difficulty, for example by smoothing the objective [4] or using techniques as in [2] (see p. 636). We leave such extensions for further work. In this section we provide several results about the properties of the MPLP fixed points and their relation to the corresponding LP. First, we claim that if all beliefs have unique maxima then the exact MAP assignment is obtained. Proposition 3 If the fixed point of MPLP has bi (xi ) such that for all i the function bi (xi ) has a unique maximizer x∗ , then x∗ is the solution to the MAP problem and the LP relaxation is exact. i Since the dual objective is always greater than or equal to the MAP value, it suffices to show that there exists a dual feasible point whose objective value is f (x∗ ). Denote by β ∗ , λ∗ the value of the corresponding dual parameters at the fixed point of MPLP. Then the dual objective satisfies λ∗ (xi ) = ki max i xi k∈N (i) ∗ max βki (xk , x∗ ) = i i k∈N (i) xk ∗ βki (x∗ , x∗ ) = f (x∗ ) k i i 4 k∈N (i) To see why the second equality holds, note that bi (x∗ ) = maxxi ,xj λ−j (xi ) + βji (xj , xi ) and i i bj (x∗ ) = maxxi ,xj λ−i (xj ) + βij (xi , xj ). By the equalization property in Eq. 9 the arguments of j j the two max operations are equal. From the unique maximum assumption it follows that x∗ , x∗ are i j the unique maximizers of the above. It follows that βji , βij are also maximized by x∗ , x∗ . i j In the general case, the MPLP fixed point may not correspond to a primal optimum because of the local optima problem with coordinate descent. However, when the variables are binary, fixed points do correspond to primal solutions, as the following proposition states. Proposition 4 When xi are binary, the MPLP fixed point can be used to obtain the primal optimum. The claim can be shown by constructing a primal optimal solution µ∗ . For tied bi , set µ∗ (xi ) to 0.5 i and for untied bi , set µ∗ (x∗ ) to 1. If bi , bj are not tied we set µ∗ (x∗ , x∗ ) = 1. If bi is not tied but bj i i ij i j is, we set µ∗ (x∗ , xj ) = 0.5. If bi , bj are tied then βji , βij can be shown to be maximized at either ij i x∗ , x∗ = (0, 0), (1, 1) or x∗ , x∗ = (0, 1), (1, 0). We then set µ∗ to be 0.5 at one of these assignment i j i j ij ∗ pairs. The resulting µ∗ is clearly primal feasible. Setting δi = b∗ we obtain that the dual variables i (δ ∗ , λ∗ , β ∗ ) and primal µ∗ satisfy complementary slackness for the LP in Eq. 7 and therefore µ∗ is primal optimal. The binary optimality result implies partial decodability, since [6] shows that the LP is partially decodable for binary variables. 6 Beyond pairwise potentials: Generalized MPLP In the previous sections we considered maximizing functions which factor according to the edges of the graph. A more general setting considers clusters c1 , . . . , ck ⊂ {1, . . . , n} (the set of clusters is denoted by C), and a function f (x; θ) = c θc (xc ) defined via potentials over clusters θc (xc ). The MAP problem in this case also has an LP relaxation (see e.g. [11]). To define the LP we introduce the following definitions: S = {c ∩ c : c, c ∈ C, c ∩ c = ∅} is the set of intersection between clusters ˆ ˆ ˆ and S(c) = {s ∈ S : s ⊆ c} is the set of overlap sets for cluster c.We now consider marginals over the variables in c ∈ C and s ∈ S and require that cluster marginals agree on their overlap. Denote this set by ML (C). The LP relaxation is then to maximize µ · θ subject to µ ∈ ML (C). As in Sec. 4, we can derive message passing updates that result in monotone decrease of the dual LP of the above relaxation. The derivation is similar and we omit the details. The key observation is that one needs to introduce |S(c)| copies of each marginal µc (xc ) (instead of the two copies in the pairwise case). Next, as in the EMPLP derivation we assume all β are fixed except those corresponding to some cluster c. The resulting messages are λc→s (xs ) from a cluster c to all of its intersection sets s ∈ S(c). The update on these messages turns out to be: 1 1 λ−c (xs ) + max λ−c (xs ) + θc (xc ) λc→s (xs ) = − 1 − ˆ s s ˆ |S(c)| |S(c)| xc\s s∈S(c)\s ˆ where for a given c ∈ C all λc→s should be updated simultaneously for s ∈ S(c), and λ−c (xs ) is s defined as the sum of messages into s that are not from c. We refer to this algorithm as Generalized EMPLP (GEMPLP). It is possible to derive an algorithm similar to NMPLP that updates several clusters simultaneously, but its structure is more involved and we do not address it here. 7 Related Work Weiss et al. [11] recently studied the fixed points of a class of max-product like algorithms. Their analysis focused on properties of fixed points rather than convergence guarantees. Specifically, they showed that if the counting numbers used in a generalized max-product algorithm satisfy certain properties, then its fixed points will be the exact MAP if the beliefs have unique maxima, and for binary variables the solution can be partially decodable. Both these properties are obtained for the MPLP fixed points, and in fact we can show that MPLP satisfies the conditions in [11], so that we obtain these properties as corollaries of [11]. We stress however, that [11] does not address convergence of algorithms, but rather properties of their fixed points, if they converge. MPLP is similar in some aspects to Kolmogorov’s TRW-S algorithm [5]. TRW-S is also a monotone coordinate descent method in a dual of the LP relaxation and its fixed points also have similar 5 guarantees to those of MPLP [6]. Furthermore, convergence to a local optimum may occur, as it does for MPLP. One advantage of MPLP lies in the simplicity of its updates and the fact that it is parameter free. The other is its simple generalization to potentials over clusters of nodes (Sec. 6). Recently, several new dual LP algorithms have been introduced, which are more closely related to our formalism. Werner [12] presented a class of algorithms which also improve the dual LP at every iteration. The simplest of those is the max-sum-diffusion algorithm, which is similar to our EMPLP algorithm, although the updates are different from ours. Independently, Johnson et al. [4] presented a class of algorithms that improve duals of the MAP-LP using coordinate descent. They decompose the model into tractable parts by replicating variables and enforce replication constraints within the Lagrangian dual. Our basic formulation in Eq. 3 could be derived from their perspective. However, the updates in the algorithm and the analysis differ. Johnson et al. also presented a method for overcoming the local optimum problem, by smoothing the objective so that it is strictly convex. Such an approach could also be used within our algorithms. Vontobel and Koetter [9] recently introduced a coordinate descent algorithm for decoding LDPC codes. Their method is specifically tailored for this case, and uses updates that are similar to our edge based updates. Finally, the concept of dual coordinate descent may be used in approximating marginals as well. In [3] we use such an approach to optimize a variational bound on the partition function. The derivation uses some of the ideas used in the MPLP dual, but importantly does not find the minimum for each coordinate. Instead, a gradient like step is taken at every iteration to decrease the dual objective. 8 Experiments We compared NMPLP to three other message passing algorithms:2 Tree-Reweighted max-product (TRMP) [10],3 standard max-product (MP), and GEMPLP. For MP and TRMP we used the standard approach of damping messages using a factor of α = 0.5. We ran all algorithms for a maximum of 2000 iterations, and used the hit-time measure to compare their speed of convergence. This measure is defined as follows: At every iteration the beliefs can be used to obtain an assignment x with value f (x). We define the hit-time as the first iteration at which the maximum value of f (x) is achieved.4 We first experimented with a 10 × 10 grid graph, with 5 values per state. The function f (x) was 5 a Potts model: f (x) = The values for θij and θi (xi ) ij∈E θij I(xi = xj ) + i∈V θi (xi ). were randomly drawn from [−cI , cI ] and [−cF , cF ] respectively, and we used values of cI and cF in the range range [0.1, 2.35] (with intervals of 0.25), resulting in 100 different models. The clusters for GEMPLP were the faces of the graph [14]. To see if NMPLP converges to the LP solution we also used an LP solver to solve the LP relaxation. We found that the the normalized difference between NMPLP and LP objective was at most 10−3 (median 10−7 ), suggesting that NMPLP typically converged to the LP solution. Fig. 2 (top row) shows the results for the three algorithms. It can be seen that while all non-cluster based algorithms obtain similar f (x) values, NMPLP has better hit-time (in the median) than TRMP and MP, and MP does not converge in many cases (see caption). GEMPLP converges more slowly than NMPLP, but obtains much better f (x) values. In fact, in 99% of the cases the normalized difference between the GEMPLP objective and the f (x) value was less than 10−5 , suggesting that the exact MAP solution was found. We next applied the algorithms to the real world problems of protein design. In [13], Yanover et al. show how these problems can be formalized in terms of finding a MAP in an appropriately constructed graphical model.6 We used all algorithms except GNMPLP (since there is no natural choice for clusters in this case) to approximate the MAP solution on the 97 models used in [13]. In these models the number of states per variable is 2 − 158, and there are up to 180 variables per model. Fig. 2 (bottom) shows results for all the design problems. In this case only 11% of the MP runs converged, and NMPLP was better than TRMP in terms of hit-time and comparable in f (x) value. The performance of MP was good on the runs where it converged. 2 As expected, NMPLP was faster than EMPLP so only NMPLP results are given. The edge weights for TRMP corresponded to a uniform distribution over all spanning trees. 4 This is clearly a post-hoc measure since it can only be obtained after the algorithm has exceeded its maximum number of iterations. However, it is a reasonable algorithm-independent measure of convergence. 5 The potential θi (xi ) may be folded into the pairwise potential to yield a model as in Eq. 1. 6 Data available from http://jmlr.csail.mit.edu/papers/volume7/yanover06a/Rosetta Design Dataset.tgz 3 6 (a) (b) (c) 100 (d) 0.6 2000 0.04 0.4 0.02 −50 0 −0.02 −0.04 ∆(Value) 0 1000 ∆(Hit Time) ∆(Value) ∆(Hit Time) 50 0 MP TRMP GMPLP 0 −0.2 −1000 −0.4 −0.06 −100 0.2 MP TRMP GMPLP MP TRMP MP TRMP Figure 2: Evaluation of message passing algorithms on Potts models and protein design problems. (a,c): Convergence time results for the Potts models (a) and protein design problems (c). The box-plots (horiz. red line indicates median) show the difference between the hit-time for the other algorithms and NMPLP. (b,d): Value of integer solutions for the Potts models (b) and protein design problems (d). The box-plots show the normalized difference between the value of f (x) for NMPLP and the other algorithms. All figures are such that better MPLP performance yields positive Y axis values. Max-product converged on 58% of the cases for the Potts models, and on 11% of the protein problems. Only convergent max-product runs are shown. 9 Conclusion We have presented a convergent algorithm for MAP approximation that is based on block coordinate descent of the MAP-LP relaxation dual. The algorithm can also be extended to cluster based functions, which result empirically in improved MAP estimates. This is in line with the observations in [14] that generalized belief propagation algorithms can result in significant performance improvements. However generalized max-product algorithms [14] are not guaranteed to converge whereas GMPLP is. Furthermore, the GMPLP algorithm does not require a region graph and only involves intersection between pairs of clusters. In conclusion, MPLP has the advantage of resolving the convergence problems of max-product while retaining its simplicity, and offering the theoretical guarantees of LP relaxations. We thus believe it should be useful in a wide array of applications. A Derivation of the dual Before deriving the dual, we first express the constraint set ML (G) in a slightly different way. The definition of ML (G) in Sec. 2 uses a single distribution µij (xi , xj ) for every ij ∈ E. In what follows, we use two copies of this pairwise distribution for every edge, which we denote µij (xi , xj ) ¯ and µji (xj , xi ), and we add the constraint that these two copies both equal the original µij (xi , xj ). ¯ For this extended set of pairwise marginals, we consider the following set of constraints which is clearly equivalent to ML (G). On the rightmost column we give the dual variables that will correspond to each constraint (we omit non-negativity constraints). µij (xi , xj ) = µij (xi , xj ) ¯ µji (xj , xi ) = µij (xi , xj ) ¯ x xi µij (ˆi , xj ) = µj (xj ) ˆ ¯ µji (ˆj , xi ) = µi (xi ) ¯ x xj ˆ xi µi (xi ) = 1 ∀ij ∈ E, xi , xj ∀ij ∈ E, xi , xj ∀ij ∈ E, xj ∀ji ∈ E, xi ∀i ∈ V βij (xi , xj ) βji (xj , xi ) λij (xj ) λji (xi ) δi (6) ¯ We denote the set of (µ, µ) satisfying these constraints by ML (G). We can now state an LP that ¯ is equivalent to MAPLPR, only with an extended set of variables and constraints. The equivalent ¯ problem is to maximize µ · θ subject to (µ, µ) ∈ ML (G) (note that the objective uses the original ¯ µ copy). LP duality transformation of the extended problem yields the following LP min i δi s.t. λij (xj ) − βij (xi , xj ) ≥ 0 βij (xi , xj ) + βji (xj , xi ) = θij (xi , xj ) − k∈N (i) λki (xi ) + δi ≥ 0 ∀ij, ji ∈ E, xi , xj ∀ij ∈ E, xi , xj ∀i ∈ V, xi (7) We next simplify the above LP by eliminating some of its constraints and variables. Since each variable δi appears in only one constraint, and the objective minimizes δi it follows that δi = maxxi k∈N (i) λki (xi ) and the constraints with δi can be discarded. Similarly, since λij (xj ) appears in a single constraint, we have that for all ij ∈ E, ji ∈ E, xi , xj λij (xj ) = maxxi βij (xi , xj ) and the constraints with λij (xj ), λji (xi ) can also be discarded. Using the eliminated δi and λji (xi ) 7 variables, we obtain that the LP in Eq. 7 is equivalent to that in Eq. 3. Note that the objective in Eq. 3 is convex since it is a sum of point-wise maxima of convex functions. B Proof of Proposition 2 We wish to minimize f in Eq. 4 subject to the constraint that βij + βji = θij . Rewrite f as f (βij , βji ) = max λ−j (xi ) + βji (xj , xi ) + max λ−i (xj ) + βij (xi , xj ) j i xi ,xj xi ,xj (8) The sum of the two arguments in the max is λ−j (xi ) + λ−i (xj ) + θij (xi , xj ) i j (because of the constraints on β). Thus the minimum must be greater than −j −i 1 2 maxxi ,xj λi (xi ) + λj (xj ) + θij (xi , xj ) . One assignment to β that achieves this minimum is obtained by requiring an equalization condition:7 λ−i (xj ) + βij (xi , xj ) = λ−j (xi ) + βji (xj , xi ) = j i 1 θij (xi , xj ) + λ−j (xi ) + λ−i (xj ) i j 2 (9) which implies βij (xi , xj ) = 1 θij (xi , xj ) + λ−j (xi ) − λ−i (xj ) and a similar expression for βji . i j 2 The resulting λij (xj ) = maxxi βij (xi , xj ) are then the ones in Prop. 2. Acknowledgments The authors acknowledge support from the Defense Advanced Research Projects Agency (Transfer Learning program). Amir Globerson was also supported by the Rothschild Yad-Hanadiv fellowship. References [1] M. Bayati, D. Shah, and M. Sharma. Maximum weight matching via max-product belief propagation. IEEE Trans. on Information Theory (to appear), 2007. [2] D. P. Bertsekas, editor. Nonlinear Programming. Athena Scientific, Belmont, MA, 1995. [3] A. Globerson and T. Jaakkola. Convergent propagation algorithms via oriented trees. In UAI. 2007. [4] J.K. Johnson, D.M. Malioutov, and A.S. Willsky. Lagrangian relaxation for map estimation in graphical models. In Allerton Conf. Communication, Control and Computing, 2007. [5] V. Kolmogorov. Convergent tree-reweighted message passing for energy minimization. IEEE Transactions on Pattern Analysis and Machine Intelligence, 28(10):1568–1583, 2006. [6] V. Kolmogorov and M. Wainwright. On the optimality of tree-reweighted max-product message passing. In 21st Conference on Uncertainty in Artificial Intelligence (UAI). 2005. [7] J. Pearl. Probabilistic Reasoning in Intelligent Systems. Morgan Kaufmann, 1988. [8] B. Taskar, S. Lacoste-Julien, and M. Jordan. Structured prediction, dual extragradient and bregman projections. Journal of Machine Learning Research, pages 1627–1653, 2006. [9] P.O. Vontobel and R. Koetter. Towards low-complexity linear-programming decoding. In Proc. 4th Int. Symposium on Turbo Codes and Related Topics, 2006. [10] M. J. Wainwright, T. Jaakkola, and A. S. Willsky. Map estimation via agreement on trees: messagepassing and linear programming. IEEE Trans. on Information Theory, 51(11):1120–1146, 2005. [11] Y. Weiss, C. Yanover, and T. Meltzer. Map estimation, linear programming and belief propagation with convex free energies. In UAI. 2007. [12] T. Werner. A linear programming approach to max-sum, a review. IEEE Trans. on PAMI, 2007. [13] C. Yanover, T. Meltzer, and Y. Weiss. Linear programming relaxations and belief propagation – an empirical study. Jourmal of Machine Learning Research, 7:1887–1907, 2006. [14] J.S. Yedidia, W.T. W.T. Freeman, and Y. Weiss. Constructing free-energy approximations and generalized belief propagation algorithms. IEEE Trans. on Information Theory, 51(7):2282–2312, 2005. 7 Other solutions are possible but may not yield some of the properties of MPLP. 8
3 0.65200233 144 nips-2007-On Ranking in Survival Analysis: Bounds on the Concordance Index
Author: Harald Steck, Balaji Krishnapuram, Cary Dehing-oberije, Philippe Lambin, Vikas C. Raykar
Abstract: In this paper, we show that classical survival analysis involving censored data can naturally be cast as a ranking problem. The concordance index (CI), which quantifies the quality of rankings, is the standard performance measure for model assessment in survival analysis. In contrast, the standard approach to learning the popular proportional hazard (PH) model is based on Cox’s partial likelihood. We devise two bounds on CI–one of which emerges directly from the properties of PH models–and optimize them directly. Our experimental results suggest that all three methods perform about equally well, with our new approach giving slightly better results. We also explain why a method designed to maximize the Cox’s partial likelihood also ends up (approximately) maximizing the CI. 1
4 0.59809411 139 nips-2007-Nearest-Neighbor-Based Active Learning for Rare Category Detection
Author: Jingrui He, Jaime G. Carbonell
Abstract: Rare category detection is an open challenge for active learning, especially in the de-novo case (no labeled examples), but of significant practical importance for data mining - e.g. detecting new financial transaction fraud patterns, where normal legitimate transactions dominate. This paper develops a new method for detecting an instance of each minority class via an unsupervised local-density-differential sampling strategy. Essentially a variable-scale nearest neighbor process is used to optimize the probability of sampling tightly-grouped minority classes, subject to a local smoothness assumption of the majority class. Results on both synthetic and real data sets are very positive, detecting each minority class with only a fraction of the actively sampled points required by random sampling and by Pelleg’s Interleave method, the prior best technique in the sparse literature on this topic. 1
5 0.57265395 64 nips-2007-Cooled and Relaxed Survey Propagation for MRFs
Author: Hai L. Chieu, Wee S. Lee, Yee W. Teh
Abstract: We describe a new algorithm, Relaxed Survey Propagation (RSP), for finding MAP configurations in Markov random fields. We compare its performance with state-of-the-art algorithms including the max-product belief propagation, its sequential tree-reweighted variant, residual (sum-product) belief propagation, and tree-structured expectation propagation. We show that it outperforms all approaches for Ising models with mixed couplings, as well as on a web person disambiguation task formulated as a supervised clustering problem. 1
6 0.54893941 128 nips-2007-Message Passing for Max-weight Independent Set
7 0.50752258 75 nips-2007-Efficient Bayesian Inference for Dynamically Changing Graphs
8 0.4759495 42 nips-2007-CPR for CSPs: A Probabilistic Relaxation of Constraint Propagation
9 0.4720034 193 nips-2007-The Distribution Family of Similarity Distances
10 0.46057779 10 nips-2007-A Randomized Algorithm for Large Scale Support Vector Learning
11 0.44847107 187 nips-2007-Structured Learning with Approximate Inference
12 0.42440078 141 nips-2007-New Outer Bounds on the Marginal Polytope
13 0.4159444 121 nips-2007-Local Algorithms for Approximate Inference in Minor-Excluded Graphs
14 0.40399444 209 nips-2007-Ultrafast Monte Carlo for Statistical Summations
15 0.40257195 32 nips-2007-Bayesian Co-Training
16 0.39903042 34 nips-2007-Bayesian Policy Learning with Trans-Dimensional MCMC
17 0.39379865 167 nips-2007-Regulator Discovery from Gene Expression Time Series of Malaria Parasites: a Hierachical Approach
18 0.3858296 94 nips-2007-Gaussian Process Models for Link Analysis and Transfer Learning
19 0.3774941 118 nips-2007-Learning with Transformation Invariant Kernels
20 0.37266999 8 nips-2007-A New View of Automatic Relevance Determination
topicId topicWeight
[(5, 0.033), (13, 0.029), (16, 0.015), (18, 0.013), (21, 0.075), (29, 0.012), (31, 0.031), (34, 0.029), (35, 0.029), (47, 0.08), (49, 0.012), (83, 0.138), (85, 0.045), (87, 0.013), (90, 0.046), (94, 0.335)]
simIndex simValue paperId paperTitle
same-paper 1 0.75365186 157 nips-2007-Privacy-Preserving Belief Propagation and Sampling
Author: Michael Kearns, Jinsong Tan, Jennifer Wortman
Abstract: We provide provably privacy-preserving versions of belief propagation, Gibbs sampling, and other local algorithms — distributed multiparty protocols in which each party or vertex learns only its final local value, and absolutely nothing else. 1
2 0.50222939 94 nips-2007-Gaussian Process Models for Link Analysis and Transfer Learning
Author: Kai Yu, Wei Chu
Abstract: This paper aims to model relational data on edges of networks. We describe appropriate Gaussian Processes (GPs) for directed, undirected, and bipartite networks. The inter-dependencies of edges can be effectively modeled by adapting the GP hyper-parameters. The framework suggests an intimate connection between link prediction and transfer learning, which were traditionally two separate research topics. We develop an efficient learning algorithm that can handle a large number of observations. The experimental results on several real-world data sets verify superior learning capacity. 1
3 0.49690264 63 nips-2007-Convex Relaxations of Latent Variable Training
Author: Yuhong Guo, Dale Schuurmans
Abstract: We investigate a new, convex relaxation of an expectation-maximization (EM) variant that approximates a standard objective while eliminating local minima. First, a cautionary result is presented, showing that any convex relaxation of EM over hidden variables must give trivial results if any dependence on the missing values is retained. Although this appears to be a strong negative outcome, we then demonstrate how the problem can be bypassed by using equivalence relations instead of value assignments over hidden variables. In particular, we develop new algorithms for estimating exponential conditional models that only require equivalence relation information over the variable values. This reformulation leads to an exact expression for EM variants in a wide range of problems. We then develop a semidefinite relaxation that yields global training by eliminating local minima. 1
4 0.49508941 45 nips-2007-Classification via Minimum Incremental Coding Length (MICL)
Author: John Wright, Yangyu Tao, Zhouchen Lin, Yi Ma, Heung-yeung Shum
Abstract: We present a simple new criterion for classification, based on principles from lossy data compression. The criterion assigns a test sample to the class that uses the minimum number of additional bits to code the test sample, subject to an allowable distortion. We prove asymptotic optimality of this criterion for Gaussian data and analyze its relationships to classical classifiers. Theoretical results provide new insights into relationships among popular classifiers such as MAP and RDA, as well as unsupervised clustering methods based on lossy compression [13]. Minimizing the lossy coding length induces a regularization effect which stabilizes the (implicit) density estimate in a small-sample setting. Compression also provides a uniform means of handling classes of varying dimension. This simple classification criterion and its kernel and local versions perform competitively against existing classifiers on both synthetic examples and real imagery data such as handwritten digits and human faces, without requiring domain-specific information. 1
5 0.49238661 34 nips-2007-Bayesian Policy Learning with Trans-Dimensional MCMC
Author: Matthew Hoffman, Arnaud Doucet, Nando D. Freitas, Ajay Jasra
Abstract: A recently proposed formulation of the stochastic planning and control problem as one of parameter estimation for suitable artificial statistical models has led to the adoption of inference algorithms for this notoriously hard problem. At the algorithmic level, the focus has been on developing Expectation-Maximization (EM) algorithms. In this paper, we begin by making the crucial observation that the stochastic control problem can be reinterpreted as one of trans-dimensional inference. With this new interpretation, we are able to propose a novel reversible jump Markov chain Monte Carlo (MCMC) algorithm that is more efficient than its EM counterparts. Moreover, it enables us to implement full Bayesian policy search, without the need for gradients and with one single Markov chain. The new approach involves sampling directly from a distribution that is proportional to the reward and, consequently, performs better than classic simulations methods in situations where the reward is a rare event.
6 0.49229968 138 nips-2007-Near-Maximum Entropy Models for Binary Neural Representations of Natural Images
7 0.49185279 209 nips-2007-Ultrafast Monte Carlo for Statistical Summations
8 0.49135101 18 nips-2007-A probabilistic model for generating realistic lip movements from speech
9 0.4899452 69 nips-2007-Discriminative Batch Mode Active Learning
10 0.4894982 128 nips-2007-Message Passing for Max-weight Independent Set
11 0.4893862 156 nips-2007-Predictive Matrix-Variate t Models
12 0.4890047 76 nips-2007-Efficient Convex Relaxation for Transductive Support Vector Machine
13 0.48835012 86 nips-2007-Exponential Family Predictive Representations of State
14 0.48821679 187 nips-2007-Structured Learning with Approximate Inference
15 0.48812962 134 nips-2007-Multi-Task Learning via Conic Programming
16 0.48807654 186 nips-2007-Statistical Analysis of Semi-Supervised Regression
17 0.48752537 201 nips-2007-The Value of Labeled and Unlabeled Examples when the Model is Imperfect
18 0.4874289 97 nips-2007-Hidden Common Cause Relations in Relational Learning
19 0.48736179 141 nips-2007-New Outer Bounds on the Marginal Polytope
20 0.48705211 75 nips-2007-Efficient Bayesian Inference for Dynamically Changing Graphs