nips nips2006 nips2006-43 knowledge-graph by maker-knowledge-mining

43 nips-2006-Bayesian Model Scoring in Markov Random Fields


Source: pdf

Author: Sridevi Parise, Max Welling

Abstract: Scoring structures of undirected graphical models by means of evaluating the marginal likelihood is very hard. The main reason is the presence of the partition function which is intractable to evaluate, let alone integrate over. We propose to approximate the marginal likelihood by employing two levels of approximation: we assume normality of the posterior (the Laplace approximation) and approximate all remaining intractable quantities using belief propagation and the linear response approximation. This results in a fast procedure for model scoring. Empirically, we find that our procedure has about two orders of magnitude better accuracy than standard BIC methods for small datasets, but deteriorates when the size of the dataset grows. 1

Reference: text


Summary: the most important sentences generated by the tfidf model

sentIndex sentText sentNum sentScore

1 Abstract: Scoring structures of undirected graphical models by means of evaluating the marginal likelihood is very hard. [sent-5, score-0.245]

2 The main reason is the presence of the partition function which is intractable to evaluate, let alone integrate over. [sent-6, score-0.094]

3 We propose to approximate the marginal likelihood by employing two levels of approximation: we assume normality of the posterior (the Laplace approximation) and approximate all remaining intractable quantities using belief propagation and the linear response approximation. [sent-7, score-0.579]

4 Empirically, we find that our procedure has about two orders of magnitude better accuracy than standard BIC methods for small datasets, but deteriorates when the size of the dataset grows. [sent-9, score-0.105]

5 They offer a very natural setting in which to address issues such as overfitting, which plague standard maximum likelihood approaches. [sent-11, score-0.039]

6 A full Bayesian approach has its computational challenges as it often involves intractable integrals. [sent-12, score-0.092]

7 The main reason for this discrepancy is the fact that MRF models have a normalization constant that depends on the parameters but is in itself intractable to compute, let alone integrate over. [sent-15, score-0.103]

8 In fact, the presence of this term even prevents one from drawing samples from the posterior distribution in most situations, except for some special cases¹. [sent-16, score-0.076]

9 In terms of approximating the posterior some new methods have become available recently. [sent-17, score-0.05]

10 In [7] a number of approximate MCMC samplers are proposed. [sent-18, score-0.046]

11 Two of them were reported to be most successful: one based on Langevin sampling with approximate gradients given by contrastive divergence and one where the acceptance probability is approximated by replacing the log partition function with the Bethe free energy. [sent-19, score-0.337]

12 To compute acceptance ratios for dimension-changing moves they need to estimate the partition function. ¹If one can compute the normalization term exactly (e. [sent-22, score-0.109]

13 g. graphs with small treewidth) or if one can draw perfect samples from the MRF [8] (e. [sent-24, score-0.063]

14 g. positive interactions only) then one can construct a Markov chain for the posterior. [sent-26, score-0.06]

15 In [6] and [8] MCMC methods are proposed that use perfect samples to circumvent the calculation of the partition function altogether. [sent-28, score-0.064]

16 This method is elegant but limited in its application due to the need to draw perfect samples. [sent-29, score-0.063]

17 Moreover, two approaches that approximate the posterior by a Gaussian distribution are proposed in [11] (based on expectation propagation) and [13] (based on the Bethe-Laplace approximation). [sent-30, score-0.096]

18 In this paper we focus on a different problem, namely that of approximating the marginal likelihood. [sent-31, score-0.076]

19 This quantity is at the heart of Bayesian analysis because it allows one to compare models of different structure. [sent-32, score-0.036]

20 Even if one has an approximation to the posterior distribution it is not at all obvious how to use it to compute a good estimate for the marginal likelihood. [sent-34, score-0.189]

21 The most direct approach is to use samples from the posterior and compute importance weights, $p(D) \approx \frac{1}{N}\sum_{n=1}^{N} p(D|\theta_n)\,p(\theta_n)/Q(\theta_n|D)$ with $\theta_n \sim Q(\theta_n|D)$ (1), where $Q(\theta_n|D)$ denotes the approximate posterior. [sent-35, score-0.156]
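
As a concrete illustration of eq. (1), here is a minimal numpy sketch of the importance-weighted estimator of the marginal likelihood. The model, prior, and approximate posterior Q are illustrative stand-ins (a conjugate one-dimensional Gaussian, chosen so the estimate can be checked against the exact answer), not the MRFs considered in the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

def log_joint(theta, data):
    # Illustrative stand-in for log p(D|theta) + log p(theta): a 1-d Gaussian
    # likelihood with unit observation variance and a N(0,1) prior.
    loglik = -0.5 * np.sum((data - theta) ** 2) - 0.5 * len(data) * np.log(2 * np.pi)
    logprior = -0.5 * theta ** 2 - 0.5 * np.log(2 * np.pi)
    return loglik + logprior

def log_marginal_is(data, q_mean, q_std, n_samples=10000):
    # Importance-weighted estimate of log p(D), eq. (1), using the approximate
    # posterior Q(theta|D) = N(q_mean, q_std^2) as the proposal.
    thetas = rng.normal(q_mean, q_std, size=n_samples)
    log_q = (-0.5 * ((thetas - q_mean) / q_std) ** 2
             - np.log(q_std) - 0.5 * np.log(2 * np.pi))
    log_w = np.array([log_joint(t, data) for t in thetas]) - log_q
    m = log_w.max()  # log-mean-exp for numerical stability
    return m + np.log(np.mean(np.exp(log_w - m)))

data = rng.normal(1.0, 1.0, size=20)
post_var = 1.0 / (1.0 + len(data))   # exact posterior of this conjugate model
post_mean = post_var * data.sum()
print("log p(D) estimate:", log_marginal_is(data, post_mean, np.sqrt(post_var)))
```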

22 We propose to use the Laplace approximation, including all O(1) terms, where the intractable quantities of interest are approximated by either belief propagation (BP) or the linear response theorem based on the solution of BP. [sent-38, score-0.356]

23 Their inclusion can improve accuracy by up to two orders of magnitude. [sent-40, score-0.069]

24 At the same time we observe that as a function of N , the O(1)-term based on the covariance between features deteriorates and should be omitted for large N . [sent-41, score-0.106]

25 For any biased estimate of the parameters this phenomenon is therefore bound to happen as we increase N because the variance of the posterior distribution is expected to decrease with N . [sent-43, score-0.128]

26 In summary, we present a very accurate estimate for the marginal likelihood where it is most needed, i. [sent-44, score-0.115]

27 This work seems to be the first practical method for estimating the marginal evidence in undirected graphical models. [sent-47, score-0.17]

28 2 The Bethe-Laplace Approximation for log p(D): Without loss of generality we represent an MRF as a log-linear model, $p(x|\lambda) = \frac{1}{Z(\lambda)} \exp\left(\lambda^T f(x)\right)$ (2), where $f(x)$ represents the features. [sent-48, score-0.064]

29 To score a structure we will follow the Bayesian paradigm and aim to compute the log-marginal likelihood $\log p(D)$, where D represents a dataset of size N: $\log p(D) = \log \int d\lambda\; p(D|\lambda)\, p(\lambda)$ (3), where $p(\lambda)$ is some arbitrary prior on the parameters $\lambda$. [sent-51, score-0.336]

30 In order to approximate this quantity we employ two approximations. [sent-52, score-0.046]

31 For the log-likelihood this boils down to expanding the log-partition function, $\log Z(\lambda) \approx \log Z(\lambda_{MP}) + \kappa^T \delta\lambda + \frac{1}{2}\,\delta\lambda^T C\, \delta\lambda$ (4), with $\delta\lambda = \lambda - \lambda_{MP}$, $C = E[f(x)f(x)^T]_{p(x)} - E[f(x)]_{p(x)}\, E[f(x)]_{p(x)}^T$ and $\kappa = E[f(x)]_{p(x)}$ (5), where all averages are taken over $p(x|\lambda_{MP})$. [sent-54, score-0.128]
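
For toy models the moments in eq. (5) can be computed exactly by enumeration, which is useful later for checking the BP/linear-response approximations. A sketch with hypothetical features on a three-node binary model:

```python
import itertools
import numpy as np

def features(x):
    # Hypothetical features for a 3-node binary MRF with x_i in {-1, +1}:
    # three node features plus edge features on (0,1) and (1,2).
    return np.array([x[0], x[1], x[2], x[0] * x[1], x[1] * x[2]], dtype=float)

def moments(lam):
    # Exact kappa = E[f(x)] and C = Cov[f(x)] under p(x|lam), eq. (5).
    xs = np.array(list(itertools.product([-1, 1], repeat=3)))
    fs = np.array([features(x) for x in xs])
    logp = fs @ lam
    p = np.exp(logp - logp.max())
    p /= p.sum()
    kappa = p @ fs
    C = (fs * p[:, None]).T @ fs - np.outer(kappa, kappa)
    return kappa, C

kappa, C = moments(np.array([0.1, -0.2, 0.3, 0.5, 0.5]))
print(kappa, "\n", C)
```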

32 Similarly for the prior we find $\log p(\lambda) = \log p(\lambda_{MP}) + g^T \delta\lambda + \frac{1}{2}\,\delta\lambda^T H \delta\lambda$ (6), where g is the first derivative of $\log p$ evaluated at $\lambda_{MP}$ and H is the second derivative (or Hessian). [sent-55, score-0.256]

33 The marginal likelihood can now be approximated by integrating out the fluctuations $\delta\lambda$, considering $\lambda_{MP}$ as a hyper-parameter, $\log p(D) = \log \int d\delta\lambda\; p(D|\delta\lambda, \lambda_{MP})\, p(\delta\lambda|\lambda_{MP})$ (7). Inserting the expansions eqns. [sent-57, score-0.277]

34 7 we arrive at the standard expression for the Laplace approximation applied to MRFs, $\log p(D) \approx \sum_n \lambda_{MP}^T f(x_n) - N \log Z(\lambda_{MP}) + \log p(\lambda_{MP}) + \frac{F}{2}\log(2\pi) - \frac{F}{2}\log(N) - \frac{1}{2}\log\det\!\left(C - \frac{H}{N}\right)$ (8), with F the number of features. [sent-59, score-0.321]
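
Given the individual pieces, assembling the score in eq. (8) is mechanical. The sketch below assumes every ingredient (λMP, log Z(λMP), the prior terms, the feature covariance C, and the prior Hessian H) has already been computed by whichever approximation one chooses; the function name and signature are hypothetical.

```python
import numpy as np

def bethe_laplace_score(feat_sum, lam_mp, log_Z, log_prior, C, H, N):
    # Eq. (8): feat_sum is sum_n f(x_n); log_Z = log Z(lam_mp), e.g. from the
    # negative Bethe free energy; C e.g. from linear response; H is the
    # Hessian of the log-prior at lam_mp. Assumes C - H/N is positive definite.
    F = len(lam_mp)
    _, logdet = np.linalg.slogdet(C - H / N)
    return (lam_mp @ feat_sum - N * log_Z + log_prior
            + 0.5 * F * np.log(2 * np.pi) - 0.5 * F * np.log(N) - 0.5 * logdet)
```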

35 The difference relative to Laplace approximations for Bayesian networks is the fact that many terms in the expression above cannot be evaluated. [sent-60, score-0.06]

36 Secondly, the expression contains the log-partition function $\log Z(\lambda_{MP})$ and the covariance matrix C, which are both intractable quantities. [sent-62, score-0.134]

37 1 The BP-Linear Response Approximation To make further progress, we introduce a second layer of approximations based on belief propagation. [sent-64, score-0.105]

38 In particular, we approximate the required marginals in the gradient for λMP with the ones obtained with BP. [sent-65, score-0.046]

39 The approximation error incurred by PMM is not always small [10], in which case other approximations, such as contrastive divergence, may be substituted. [sent-68, score-0.143]

40 The term − log Z(λMP ) will be approximated with the Bethe free energy. [sent-69, score-0.13]

41 This will involve running belief propagation on a model with parameters λMP and inserting the beliefs at their fixed points into the expression for the Bethe free energy [16]. [sent-70, score-0.288]
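
A minimal sketch of evaluating the Bethe free energy from a set of beliefs for a pairwise MRF, in the standard F = U − H form of [16]; at a BP fixed point −F approximates log Z. The data structures (lists of arrays, dicts keyed by edge) are illustrative.

```python
import numpy as np

def bethe_free_energy(node_pot, edge_pot, b_i, b_ij, edges):
    # node_pot[i] and b_i[i] are arrays over the states of node i;
    # edge_pot[(i, j)] and b_ij[(i, j)] are tables over joint states.
    degree = {i: 0 for i in range(len(node_pot))}
    for i, j in edges:
        degree[i] += 1
        degree[j] += 1
    U, H = 0.0, 0.0
    for (i, j) in edges:
        b = b_ij[(i, j)]
        U -= np.sum(b * np.log(edge_pot[(i, j)]))       # average edge energy
        H -= np.sum(b * np.log(b))                      # edge entropies
    for i, bi in enumerate(b_i):
        U -= np.sum(bi * np.log(node_pot[i]))           # average node energy
        H += (degree[i] - 1) * np.sum(bi * np.log(bi))  # overcounting correction
    return U - H   # F_Bethe; -F approximates log Z at a BP fixed point
```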

42 To compute the covariance matrix between the features C (eqn. [sent-71, score-0.096]

43 This approximation is based on the observation that C is the Hessian of the log-partition function w. [sent-73, score-0.037]

44 This is approximated by the Hessian of the Bethe free energy w. [sent-77, score-0.066]

45 the parameters, which in turn depends on the partial derivatives of the beliefs from BP w. [sent-80, score-0.039]

46 $C_{\alpha\beta} = \frac{\partial^2 \log Z(\lambda)}{\partial\lambda_\alpha\,\partial\lambda_\beta} \approx -\frac{\partial^2 F_{Bethe}(\lambda)}{\partial\lambda_\alpha\,\partial\lambda_\beta} = \sum_{x_\alpha} f_\alpha(x_\alpha)\,\frac{\partial p^{BP}_\alpha(x_\alpha|\lambda)}{\partial\lambda_\beta}$ (9), where $\lambda = \lambda_{MP}$, $p^{BP}_\alpha$ is the marginal computed using belief propagation and $x_\alpha$ is the collection of variables in the argument of feature $f_\alpha$ (e. [sent-84, score-0.366]
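
The identity in eq. (9) — C as the derivative of the feature expectations with respect to the parameters — can be sanity-checked numerically. The sketch below uses exact enumeration for the marginals of a tiny model and central finite differences; in the paper the derivatives instead come from BP beliefs via the linear response propagation of [15].

```python
import itertools
import numpy as np

def toy_features(x):
    # Same hypothetical 3-node binary features as in the earlier sketch.
    return np.array([x[0], x[1], x[2], x[0] * x[1], x[1] * x[2]], dtype=float)

def feature_expectations(lam):
    # E[f(x)] under p(x|lam) by enumeration; the paper substitutes the BP
    # marginals p^BP for the exact ones.
    xs = np.array(list(itertools.product([-1, 1], repeat=3)))
    fs = np.array([toy_features(x) for x in xs])
    logp = fs @ lam
    p = np.exp(logp - logp.max())
    p /= p.sum()
    return p @ fs

def linear_response_C(lam, eps=1e-5):
    # Eq. (9): column beta of C is dE[f]/d(lambda_beta), obtained here via
    # central finite differences rather than the local propagation algorithm.
    F = len(lam)
    C = np.zeros((F, F))
    for b in range(F):
        d = np.zeros(F)
        d[b] = eps
        C[:, b] = (feature_expectations(lam + d) - feature_expectations(lam - d)) / (2 * eps)
    return C

print(np.round(linear_response_C(np.array([0.1, -0.2, 0.3, 0.5, 0.5])), 4))
```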

47 This approximate C is also guaranteed to be symmetric and positive semi-definite. [sent-87, score-0.046]

48 In [15] two algorithms were discussed to compute C in the linear response approximation, one based on a matrix inverse, the other on a local propagation algorithm. [sent-88, score-0.179]

49 The main idea is to perform a Taylor expansion of the beliefs and messages in the parameters δλ = λ − λMP and keep track of first order terms in the belief propagation equations. [sent-89, score-0.201]

50 One can show that the first order terms carry the information to compute the covariance matrix. [sent-90, score-0.065]

51 Figure 1: Comparison of various scores on synthetic data (panels (a) and (b); x-axis: #edges, nested models). 3 Conditional Random Fields: Perhaps the most practical class of undirected graphical models is the conditional random field (CRF). [sent-110, score-0.283]

52 The probability of the labels given the input is given as $p(t|x, \lambda) = \frac{1}{Z(\lambda, x)} \exp\left(\lambda^T f(t, x)\right)$ (10). To approximate the log marginal evidence we obtain an expression very similar to eqn. [sent-113, score-0.214]
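
For short chains and a small label set, eq. (10) can be evaluated directly, enumerating label sequences to obtain the input-dependent partition function Z(λ, x). A sketch; the feature function `feats` is assumed given, and binary labels t_i ∈ {−1, +1} are an assumption made for illustration.

```python
import itertools
import numpy as np

def crf_log_prob(t, x, lam, feats):
    # log p(t|x, lam) of eq. (10); Z(lam, x) is computed by brute-force
    # enumeration over label sequences (feasible only for short chains).
    L = len(t)
    scores = np.array([lam @ feats(np.array(tt), x)
                       for tt in itertools.product([-1, 1], repeat=L)])
    m = scores.max()
    log_Z = m + np.log(np.sum(np.exp(scores - m)))
    return lam @ feats(np.asarray(t), x) - log_Z
```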

53 In these experiments we have focussed on comparing the value of the estimated log marginal likelihood with “annealed importance sampling” (AIS), which we treat as ground truth [9, 1]. [sent-116, score-0.253]

54 We have focussed on this performance measure because the marginal likelihood is the relevant quantity for both Bayesian model averaging and model selection. [sent-117, score-0.155]

55 We perform experiments on synthetic data as well as a real-world dataset. [sent-118, score-0.038]

56 05 0 −2 1 0 2 4 6 8 #edges (nested models) (a) 10 12 0 −2 0 2 4 6 8 #edges (nested models) 10 12 (b) Figure 2: Mean difference in scores with AIS (synthetic data). [sent-131, score-0.115]

57 Scores computed using the proposed method (BP-LR) were compared against MAP scores (or penalized log-likelihood), where we retain only the first three terms in equation (8), and the commonly used BIC-ML scores, where we ignore all O(1) terms (i. [sent-133, score-0.253]

58 BIC-ML uses the maximum likelihood value λML instead of λMP . [sent-135, score-0.039]

59 We also evaluate two other scores: BP-LR-ExactGrad, where we use exact gradients to compute λMP, and Laplace-Exact, which is the same as BP-LR-ExactGrad but with C computed exactly as well. [sent-136, score-0.221]

60 Nevertheless they are useful here to illustrate the effect of the bias from BP. [sent-138, score-0.041]

61 We then generated N = 10000 samples from each of these (50 × 6) models using exact sampling by exhaustive enumeration. [sent-149, score-0.086]

62 5 (the true structure had 6 edges) and studied the variation of different scores with model complexity. [sent-151, score-0.144]

63 We define an ordering on models based on complexity by using nested model sequences. [sent-152, score-0.114]

64 These are such that a model appearing later in the sequence contains all edges from models appearing earlier. [sent-153, score-0.147]

65 Figure (1) shows the results for two such random nested sequences around the true model, for N = 50 and N = 10000 datacases, respectively. [sent-154, score-0.078]

66 Figure (2) shows the average absolute difference between each score and the AIS score over 50 sequences. [sent-157, score-0.192]

67 In order to better understand the performance of various scores with N , we took the datasets at d = 0. [sent-161, score-0.115]

68 At each value, we find the absolute difference between the score assigned to the true structure and the corresponding AIS score. [sent-163, score-0.113]

69 We note that all BP-LR methods are about two orders of magnitude more accurate than methods that ignore the O(1) term based on C. [sent-166, score-0.039]

70 This does not happen for the two BP-LR methods based on λMP computed using exact gradients (i. [sent-168, score-0.115]

71 Since the latter two methods perform identically, we conclude that it is not the approximation of C by linear response that breaks down, but rather that the bias in λMP is the reason that the estimate of C becomes unreliable. [sent-171, score-0.142]

72 We conjecture that this happens when the bias becomes of the order of the standard deviation of the posterior distribution. [sent-172, score-0.126]

73 Figure 3: Variation of score accuracy with N (absolute difference with AIS vs. N, # samples; legend: BIC-ML, MAP, BP-LR, BP-LR-ExactGrad, Laplace-Exact). [sent-176, score-0.109]

74 Figure 4: Variation of score accuracy with d. Since the bias stays constant but the variance in the posterior decreases as O(1/N), this phenomenon is bound to happen for some value of N. [sent-179, score-0.237]

75 Finally, since our BP-LR method relies on the BP approximation, which is known to break down at strong interactions, we investigated the performance of various scores with d. [sent-180, score-0.152]

76 Again, at each value of d, we compute the average absolute difference between the scores assigned to the true structure by each method and by AIS. [sent-181, score-0.175]

77 The exact methods show that one can improve performance by having a more accurate estimate of λMP . [sent-185, score-0.05]

78 These are used to define state and transition features via $f_{ia}(t_i, x_i) = t_i\, g^a(x_i)$ and $f_{ia}(t_i, t_{i+1}, x_i) = t_i\, t_{i+1}\, g^a(x_i)$, where i denotes the line in a document and a indexes the 24 features. [sent-193, score-0.287]
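
A sketch of the summed sufficient statistics implied by these feature definitions, assuming parameters are shared across positions i and that labels take values t_i ∈ {−1, +1} (both assumptions for illustration; `g`, returning the 24 binary line features, is taken as given). With g fixed, `lambda t, x: crf_features(t, x, g)` can serve as the `feats` argument of the earlier CRF sketch.

```python
import numpy as np

def crf_features(t, x, g):
    # State statistics sum_i t_i g_a(x_i) and transition statistics
    # sum_i t_i t_{i+1} g_a(x_i), stacked into one vector of length 48.
    G = np.array([g(xi) for xi in x])   # shape (L, 24): per-line features
    t = np.asarray(t, dtype=float)
    state = t @ G                        # 24 state statistics
    trans = (t[:-1] * t[1:]) @ G[:-1]    # 24 transition statistics
    return np.concatenate([state, trans])
```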

79 We generated a random sequence of models by incrementally adding some state features and then some transition features. [sent-194, score-0.067]

80 We then score each model using MAP, BIC-MAP (which is the same as BIC-ML but with λMP), AIS, and Laplace-Exact. [sent-195, score-0.079]

81 (Another less relevant observation is that the scores flatten out around the point where we stop adding the state features, showing their importance relative to the transition features.) [sent-200, score-0.18]

82 5 Discussion: The main conclusion from this study is that the Bethe-Laplace approximation can give an excellent approximation to the marginal likelihood for small datasets. [sent-201, score-0.189]

83 We discovered an interesting phenomenon, namely that as N grows the error in the O(1) term based on the covariance between features increases. [sent-202, score-0.07]

84 We found that this term can give an enormous boost in accuracy for small N (up to two orders of magnitude), but its effect can be detrimental for large N . [sent-203, score-0.069]

85 We conjecture that this switch-over point takes place when the bias in λMP becomes of the order of the standard deviation in the posterior (which decreases as 1/N ). [sent-204, score-0.126]

86 Linear response results are also available for this case [12]. [sent-208, score-0.064]

87 ²Downloaded from: http://www.cs.umass.edu/~mccallum/data/faqdata/. Figure 5: Comparison of various scores on the real-world dataset (panel: CRF, N=2, sequence length 100; y-axis: score/N, x-axis: # features; legend: MAP, BIC_MAP, Laplace-Exact, AIS). [sent-211, score-0.146]

88 A second improvement could come from improving the estimate of λMP using alternative learning techniques such as contrastive divergence or other sample-based approaches. [sent-212, score-0.074]

89 As discussed above, less bias in λMP will make the covariance term useful for larger N . [sent-213, score-0.08]

90 A Computation of C for Boltzmann Machines: For binary variables and pairwise interactions we define the variables as $\lambda = \{\theta_i, w_{ij}\}$, where $\theta_i$ is the parameter multiplying the node feature $x_i$ and $w_{ij}$ the parameter multiplying the edge feature $x_i x_j$. [sent-217, score-0.26]

91 Moreover, we’ll define the following independent quantities $q_i = p(x_i = 1)$ and $\xi_{ij} = p(x_i = 1, x_j = 1)$. [sent-218, score-0.309]
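
For a small Boltzmann machine these quantities can be computed exactly by enumeration, which is handy for checking the BP-based values used below; a sketch assuming x ∈ {0, 1}^n:

```python
import itertools
import numpy as np

def boltzmann_marginals(theta, W, edges):
    # Exact q_i = p(x_i = 1) and xi_ij = p(x_i = 1, x_j = 1) for
    # p(x) ∝ exp(sum_i theta_i x_i + sum_(ij) w_ij x_i x_j); the paper
    # obtains these quantities from belief propagation instead.
    n = len(theta)
    xs = np.array(list(itertools.product([0, 1], repeat=n)), dtype=float)
    pair = np.array([sum(W[i, j] * x[i] * x[j] for (i, j) in edges) for x in xs])
    logp = xs @ theta + pair
    p = np.exp(logp - logp.max())
    p /= p.sum()
    q = p @ xs
    xi = {(i, j): float(p @ (xs[:, i] * xs[:, j])) for (i, j) in edges}
    return q, xi

theta = np.array([0.2, -0.1, 0.4])
W = np.zeros((3, 3)); W[0, 1] = 0.5; W[1, 2] = -0.3
print(boltzmann_marginals(theta, W, edges=[(0, 1), (1, 2)]))
```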

92 In the following we will assume that {qi , ξij } are computed using belief propagation (BP). [sent-222, score-0.162]

93 To compute the covariance matrix we first compute its inverse from eqns. [sent-224, score-0.091]

94 The variational Bayesian EM algorithm for incomplete data: with application to scoring graphical model structures. [sent-232, score-0.138]

95 Tree-reweighted belief propagation algorithms and approximate ML estimation via pseudo-moment matching. [sent-259, score-0.08]

96 An efficient Markov chain Monte Carlo method for distributions with intractable normalisation constants. [sent-266, score-0.093]

97 Bayesian learning in undirected graphical models: approximate MCMC algorithms. [sent-272, score-0.14]

98 Probabilistic inference by means of cluster variation method and linear response theory. [sent-302, score-0.093]

99 Linear response algorithms for approximate inference in graphical models. [sent-319, score-0.154]

100 Constructing free energy approximations and generalized belief propagation algorithms. [sent-326, score-0.226]


similar papers computed by tfidf model

tfidf for this paper:

wordName wordTfidf (topN-words)

[('mp', 0.591), ('ais', 0.321), ('bp', 0.303), ('qi', 0.28), ('lr', 0.167), ('ij', 0.154), ('qj', 0.152), ('laplace', 0.138), ('exactgrad', 0.137), ('scores', 0.115), ('bic', 0.112), ('bethe', 0.091), ('propagation', 0.089), ('ml', 0.08), ('score', 0.079), ('nested', 0.078), ('marginal', 0.076), ('belief', 0.073), ('xn', 0.068), ('intractable', 0.067), ('edges', 0.065), ('log', 0.064), ('response', 0.064), ('ik', 0.06), ('boltzman', 0.06), ('pmm', 0.06), ('welling', 0.059), ('wij', 0.059), ('bayesian', 0.057), ('mrf', 0.056), ('map', 0.055), ('ti', 0.052), ('mcmc', 0.051), ('exact', 0.05), ('posterior', 0.05), ('undirected', 0.05), ('irvine', 0.049), ('qk', 0.048), ('contrastive', 0.048), ('mrfs', 0.048), ('approximate', 0.046), ('cxn', 0.046), ('fia', 0.046), ('parise', 0.046), ('pbp', 0.046), ('crf', 0.044), ('graphical', 0.044), ('phenomenon', 0.043), ('bias', 0.041), ('focussed', 0.04), ('comparision', 0.04), ('covariance', 0.039), ('orders', 0.039), ('beliefs', 0.039), ('likelihood', 0.039), ('synthetic', 0.038), ('approximation', 0.037), ('scoring', 0.037), ('perfect', 0.037), ('models', 0.036), ('deteriorates', 0.036), ('conjecture', 0.035), ('happen', 0.035), ('importance', 0.034), ('interactions', 0.034), ('approximated', 0.034), ('absolute', 0.034), ('bren', 0.034), ('approximations', 0.032), ('derivative', 0.032), ('markov', 0.032), ('free', 0.032), ('features', 0.031), ('accuracy', 0.03), ('acceptance', 0.03), ('annealed', 0.03), ('xi', 0.03), ('gradients', 0.03), ('hessian', 0.029), ('jk', 0.029), ('murray', 0.029), ('variation', 0.029), ('quantities', 0.029), ('expression', 0.028), ('uctuations', 0.028), ('inef', 0.028), ('inserting', 0.027), ('partition', 0.027), ('compute', 0.026), ('draw', 0.026), ('divergence', 0.026), ('chain', 0.026), ('challenges', 0.025), ('zi', 0.024), ('les', 0.024), ('multiplying', 0.024), ('appearing', 0.023), ('francisco', 0.023), ('retain', 0.023), ('tn', 0.023)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 1.0000001 43 nips-2006-Bayesian Model Scoring in Markov Random Fields

Author: Sridevi Parise, Max Welling

Abstract: Scoring structures of undirected graphical models by means of evaluating the marginal likelihood is very hard. The main reason is the presence of the partition function which is intractable to evaluate, let alone integrate over. We propose to approximate the marginal likelihood by employing two levels of approximation: we assume normality of the posterior (the Laplace approximation) and approximate all remaining intractable quantities using belief propagation and the linear response approximation. This results in a fast procedure for model scoring. Empirically, we find that our procedure has about two orders of magnitude better accuracy than standard BIC methods for small datasets, but deteriorates when the size of the dataset grows. 1

2 0.21573989 47 nips-2006-Boosting Structured Prediction for Imitation Learning

Author: J. A. Bagnell, Joel Chestnutt, David M. Bradley, Nathan D. Ratliff

Abstract: The Maximum Margin Planning (MMP) (Ratliff et al., 2006) algorithm solves imitation learning problems by learning linear mappings from features to cost functions in a planning domain. The learned policy is the result of minimum-cost planning using these cost functions. These mappings are chosen so that example policies (or trajectories) given by a teacher appear to be lower cost (with a lossscaled margin) than any other policy for a given planning domain. We provide a novel approach, M MP B OOST , based on the functional gradient descent view of boosting (Mason et al., 1999; Friedman, 1999a) that extends MMP by “boosting” in new features. This approach uses simple binary classification or regression to improve performance of MMP imitation learning, and naturally extends to the class of structured maximum margin prediction problems. (Taskar et al., 2005) Our technique is applied to navigation and planning problems for outdoor mobile robots and robotic legged locomotion. 1

3 0.13486966 74 nips-2006-Efficient Structure Learning of Markov Networks using $L 1$-Regularization

Author: Su-in Lee, Varun Ganapathi, Daphne Koller

Abstract: Markov networks are commonly used in a wide variety of applications, ranging from computer vision, to natural language, to computational biology. In most current applications, even those that rely heavily on learned models, the structure of the Markov network is constructed by hand, due to the lack of effective algorithms for learning Markov network structure from data. In this paper, we provide a computationally efficient method for learning Markov network structure from data. Our method is based on the use of L1 regularization on the weights of the log-linear model, which has the effect of biasing the model towards solutions where many of the parameters are zero. This formulation converts the Markov network learning problem into a convex optimization problem in a continuous space, which can be solved using efficient gradient methods. A key issue in this setting is the (unavoidable) use of approximate inference, which can lead to errors in the gradient computation when the network structure is dense. Thus, we explore the use of different feature introduction schemes and compare their performance. We provide results for our method on synthetic data, and on two real world data sets: pixel values in the MNIST data, and genetic sequence variations in the human HapMap data. We show that our L1 -based method achieves considerably higher generalization performance than the more standard L2 -based method (a Gaussian parameter prior) or pure maximum-likelihood learning. We also show that we can learn MRF network structure at a computational cost that is not much greater than learning parameters alone, demonstrating the existence of a feasible method for this important problem.

4 0.12368354 57 nips-2006-Conditional mean field

Author: Peter Carbonetto, Nando D. Freitas

Abstract: Despite all the attention paid to variational methods based on sum-product message passing (loopy belief propagation, tree-reweighted sum-product), these methods are still bound to inference on a small set of probabilistic models. Mean field approximations have been applied to a broader set of problems, but the solutions are often poor. We propose a new class of conditionally-specified variational approximations based on mean field theory. While not usable on their own, combined with sequential Monte Carlo they produce guaranteed improvements over conventional mean field. Moreover, experiments on a well-studied problem— inferring the stable configurations of the Ising spin glass—show that the solutions can be significantly better than those obtained using sum-product-based methods. 1

5 0.11891888 35 nips-2006-Approximate inference using planar graph decomposition

Author: Amir Globerson, Tommi S. Jaakkola

Abstract: A number of exact and approximate methods are available for inference calculations in graphical models. Many recent approximate methods for graphs with cycles are based on tractable algorithms for tree structured graphs. Here we base the approximation on a different tractable model, planar graphs with binary variables and pure interaction potentials (no external field). The partition function for such models can be calculated exactly using an algorithm introduced by Fisher and Kasteleyn in the 1960s. We show how such tractable planar models can be used in a decomposition to derive upper bounds on the partition function of non-planar models. The resulting algorithm also allows for the estimation of marginals. We compare our planar decomposition to the tree decomposition method of Wainwright et. al., showing that it results in a much tighter bound on the partition function, improved pairwise marginals, and comparable singleton marginals.

6 0.10612841 201 nips-2006-Using Combinatorial Optimization within Max-Product Belief Propagation

7 0.095130175 155 nips-2006-Optimal Single-Class Classification Strategies

8 0.088296615 64 nips-2006-Data Integration for Classification Problems Employing Gaussian Process Priors

9 0.086746059 2 nips-2006-A Collapsed Variational Bayesian Inference Algorithm for Latent Dirichlet Allocation

10 0.086150512 178 nips-2006-Sparse Multinomial Logistic Regression via Bayesian L1 Regularisation

11 0.084711209 3 nips-2006-A Complexity-Distortion Approach to Joint Pattern Alignment

12 0.070370592 169 nips-2006-Relational Learning with Gaussian Processes

13 0.064632997 199 nips-2006-Unsupervised Learning of a Probabilistic Grammar for Object Detection and Parsing

14 0.058067251 159 nips-2006-Parameter Expanded Variational Bayesian Methods

15 0.057663921 63 nips-2006-Cross-Validation Optimization for Large Scale Hierarchical Classification Kernel Methods

16 0.054955121 1 nips-2006-A Bayesian Approach to Diffusion Models of Decision-Making and Response Time

17 0.054745115 9 nips-2006-A Nonparametric Bayesian Method for Inferring Features From Similarity Judgments

18 0.054307058 20 nips-2006-Active learning for misspecified generalized linear models

19 0.052217443 112 nips-2006-Learning Nonparametric Models for Probabilistic Imitation

20 0.050830293 118 nips-2006-Learning to Model Spatial Dependency: Semi-Supervised Discriminative Random Fields


similar papers computed by lsi model

lsi for this paper:

topicId topicWeight

[(0, -0.187), (1, 0.031), (2, 0.033), (3, -0.097), (4, 0.096), (5, 0.03), (6, 0.163), (7, 0.033), (8, -0.077), (9, -0.036), (10, -0.003), (11, -0.079), (12, 0.064), (13, 0.062), (14, 0.0), (15, 0.095), (16, -0.089), (17, 0.085), (18, 0.091), (19, 0.051), (20, 0.172), (21, 0.153), (22, -0.059), (23, 0.119), (24, 0.111), (25, 0.215), (26, 0.092), (27, -0.01), (28, -0.05), (29, -0.124), (30, -0.087), (31, -0.076), (32, -0.013), (33, 0.173), (34, -0.025), (35, -0.081), (36, 0.139), (37, 0.06), (38, 0.016), (39, -0.059), (40, 0.035), (41, -0.093), (42, -0.132), (43, -0.048), (44, 0.092), (45, 0.075), (46, 0.009), (47, -0.091), (48, -0.08), (49, 0.195)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.93751222 43 nips-2006-Bayesian Model Scoring in Markov Random Fields

Author: Sridevi Parise, Max Welling

Abstract: Scoring structures of undirected graphical models by means of evaluating the marginal likelihood is very hard. The main reason is the presence of the partition function which is intractable to evaluate, let alone integrate over. We propose to approximate the marginal likelihood by employing two levels of approximation: we assume normality of the posterior (the Laplace approximation) and approximate all remaining intractable quantities using belief propagation and the linear response approximation. This results in a fast procedure for model scoring. Empirically, we find that our procedure has about two orders of magnitude better accuracy than standard BIC methods for small datasets, but deteriorates when the size of the dataset grows. 1
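To make the scoring recipe above concrete, here is a minimal sketch of Laplace-approximation evidence computation for a toy conjugate model where everything is tractable; the Gaussian model and `log_joint` are illustrative assumptions, not the paper's MRF setting (there the log partition function inside the joint would itself need belief propagation and linear response approximations).

```python
import numpy as np

# Toy model: x_n ~ N(mu, 1) with prior mu ~ N(0, 10).
# The Laplace approximation scores a model by
#   log p(D) ~= log p(D, mu_hat) + (d/2) log(2*pi) - 0.5 * log|H|,
# where mu_hat is the MAP estimate and H = -Hessian of log p(D, mu) at mu_hat.

def log_joint(mu, data, prior_var=10.0):
    loglik = -0.5 * np.sum((data - mu) ** 2) - 0.5 * len(data) * np.log(2 * np.pi)
    logprior = -0.5 * mu ** 2 / prior_var - 0.5 * np.log(2 * np.pi * prior_var)
    return loglik + logprior

def laplace_score(data, eps=1e-4):
    # MAP for this conjugate toy model is available in closed form.
    n, prior_var = len(data), 10.0
    mu_hat = np.sum(data) / (n + 1.0 / prior_var)
    # Numerical second derivative of the log-joint (d = 1 here).
    h = (log_joint(mu_hat + eps, data) - 2 * log_joint(mu_hat, data)
         + log_joint(mu_hat - eps, data)) / eps ** 2
    H = -h  # negative Hessian, positive at a maximum
    return log_joint(mu_hat, data) + 0.5 * np.log(2 * np.pi) - 0.5 * np.log(H)

data = np.random.default_rng(0).normal(1.0, 1.0, size=20)
print("Laplace log-evidence:", laplace_score(data))
# BIC would drop the Hessian term: log p(D | mu_hat) - 0.5 * d * log(n).
```

The BIC score mentioned in the abstract corresponds to dropping the Hessian term in favor of a $\frac{d}{2}\log n$ penalty, which is consistent with the abstract's finding that the Laplace-based score is more accurate on small datasets.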

2 0.64845371 47 nips-2006-Boosting Structured Prediction for Imitation Learning

Author: J. A. Bagnell, Joel Chestnutt, David M. Bradley, Nathan D. Ratliff

Abstract: The Maximum Margin Planning (MMP) (Ratliff et al., 2006) algorithm solves imitation learning problems by learning linear mappings from features to cost functions in a planning domain. The learned policy is the result of minimum-cost planning using these cost functions. These mappings are chosen so that example policies (or trajectories) given by a teacher appear to be lower cost (with a loss-scaled margin) than any other policy for a given planning domain. We provide a novel approach, MMPBOOST, based on the functional gradient descent view of boosting (Mason et al., 1999; Friedman, 1999a) that extends MMP by "boosting" in new features. This approach uses simple binary classification or regression to improve performance of MMP imitation learning, and naturally extends to the class of structured maximum margin prediction problems (Taskar et al., 2005). Our technique is applied to navigation and planning problems for outdoor mobile robots and robotic legged locomotion. 1
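The functional gradient view mentioned here is easy to see in miniature: fit a weak learner to the negative gradient of the loss and add it to the ensemble. The sketch below uses squared loss and threshold stumps as illustrative stand-ins; MMPBoost instantiates the same template with the maximum margin planning loss and a planner in the inner loop.

```python
import numpy as np

# Generic functional gradient descent ("boosting") for least squares:
# repeatedly fit a weak regressor to the negative gradient (the residual)
# and add it to the ensemble.

def fit_stump(x, r):
    # Weak learner: best single threshold split minimizing squared error.
    best = None
    for t in np.unique(x):
        left, right = r[x <= t], r[x > t]
        if len(left) == 0 or len(right) == 0:
            continue
        err = left.var() * len(left) + right.var() * len(right)
        if best is None or err < best[0]:
            best = (err, t, left.mean(), right.mean())
    _, t, lv, rv = best
    return lambda q: np.where(q <= t, lv, rv)

def boost(x, y, rounds=50, lr=0.1):
    ensemble, pred = [], np.zeros_like(y)
    for _ in range(rounds):
        residual = y - pred  # negative functional gradient of squared loss
        h = fit_stump(x, residual)
        ensemble.append(h)
        pred = pred + lr * h(x)
    return lambda q: lr * sum(h(q) for h in ensemble)

x = np.linspace(0, 1, 100)
y = np.sin(4 * x)
f = boost(x, y)
print("train MSE:", np.mean((f(x) - y) ** 2))
```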

3 0.50960362 201 nips-2006-Using Combinatorial Optimization within Max-Product Belief Propagation

Author: Daniel Tarlow, Gal Elidan, Daphne Koller, John C. Duchi

Abstract: In general, the problem of computing a maximum a posteriori (MAP) assignment in a Markov random field (MRF) is computationally intractable. However, in certain subclasses of MRF, an optimal or close-to-optimal assignment can be found very efficiently using combinatorial optimization algorithms: certain MRFs with mutual exclusion constraints can be solved using bipartite matching, and MRFs with regular potentials can be solved using minimum cut methods. However, these solutions do not apply to the many MRFs that contain such tractable components as sub-networks, but also other non-complying potentials. In this paper, we present a new method, called COMPOSE, for exploiting combinatorial optimization for sub-networks within the context of a max-product belief propagation algorithm. COMPOSE uses combinatorial optimization for computing exact max-marginals for an entire sub-network; these can then be used for inference in the context of the network as a whole. We describe highly efficient methods for computing max-marginals for sub-networks corresponding both to bipartite matchings and to regular networks. We present results on both synthetic and real networks encoding correspondence problems between images, which involve both matching constraints and pairwise geometric constraints. We compare to a range of current methods, showing that the ability of COMPOSE to transmit information globally across the network leads to improved convergence, decreased running time, and higher-scoring assignments.
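The quantity COMPOSE computes for a sub-network, a max-marginal, is the value of the best global assignment consistent with fixing one variable to one state. As a point of reference, the sketch below computes exact max-marginals on a chain with plain max-product; the random potentials are placeholders, and COMPOSE's contribution is obtaining the same quantities for matching and min-cut sub-networks where message passing alone would not suffice.

```python
import numpy as np

# Max-marginals on a chain MRF via max-product (forward/backward) passes.
# mm[i, a] = best total score over all assignments with x_i fixed to a.

def chain_max_marginals(unary, pair):
    n, k = unary.shape
    fwd = np.zeros((n, k))  # fwd[i] = best score of x_0..x_i ending at x_i
    bwd = np.zeros((n, k))
    fwd[0] = unary[0]
    for i in range(1, n):
        fwd[i] = unary[i] + np.max(fwd[i - 1][:, None] + pair[i - 1], axis=0)
    bwd[n - 1] = 0.0
    for i in range(n - 2, -1, -1):
        bwd[i] = np.max(pair[i] + (unary[i + 1] + bwd[i + 1])[None, :], axis=1)
    return fwd + bwd  # max-marginal of each (variable, state) pair

rng = np.random.default_rng(0)
n, k = 5, 3
unary = rng.normal(size=(n, k))
pair = rng.normal(size=(n - 1, k, k))
mm = chain_max_marginals(unary, pair)
print("MAP value (same for every row):", mm.max(axis=1))
```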

4 0.47317058 74 nips-2006-Efficient Structure Learning of Markov Networks using $L_1$-Regularization

Author: Su-in Lee, Varun Ganapathi, Daphne Koller

Abstract: Markov networks are commonly used in a wide variety of applications, ranging from computer vision, to natural language, to computational biology. In most current applications, even those that rely heavily on learned models, the structure of the Markov network is constructed by hand, due to the lack of effective algorithms for learning Markov network structure from data. In this paper, we provide a computationally efficient method for learning Markov network structure from data. Our method is based on the use of $L_1$ regularization on the weights of the log-linear model, which has the effect of biasing the model towards solutions where many of the parameters are zero. This formulation converts the Markov network learning problem into a convex optimization problem in a continuous space, which can be solved using efficient gradient methods. A key issue in this setting is the (unavoidable) use of approximate inference, which can lead to errors in the gradient computation when the network structure is dense. Thus, we explore the use of different feature introduction schemes and compare their performance. We provide results for our method on synthetic data, and on two real world data sets: pixel values in the MNIST data, and genetic sequence variations in the human HapMap data. We show that our $L_1$-based method achieves considerably higher generalization performance than the more standard $L_2$-based method (a Gaussian parameter prior) or pure maximum-likelihood learning. We also show that we can learn MRF network structure at a computational cost that is not much greater than learning parameters alone, demonstrating the existence of a feasible method for this important problem.
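A minimal sketch of the $L_1$ mechanics: proximal (soft-thresholding) gradient steps drive many weights exactly to zero. The example below applies ISTA to $L_1$-regularized logistic regression, which can be read as one conditional in a pseudo-likelihood surrogate; it is an assumption-laden stand-in, since the paper's actual gradient requires (approximate) inference over the Markov network.

```python
import numpy as np

# ISTA: gradient step on the smooth loss, then soft-thresholding for the
# L1 penalty. Exact zeros in w correspond to features (edges) pruned away.

def soft_threshold(w, t):
    return np.sign(w) * np.maximum(np.abs(w) - t, 0.0)

def ista_logreg(X, y, lam=0.1, lr=0.1, iters=500):
    w = np.zeros(X.shape[1])
    for _ in range(iters):
        p = 1.0 / (1.0 + np.exp(-X @ w))
        grad = X.T @ (p - y) / len(y)  # gradient of the mean logistic loss
        w = soft_threshold(w - lr * grad, lr * lam)
    return w

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10))
w_true = np.zeros(10)
w_true[:3] = [2.0, -1.5, 1.0]  # sparse ground truth
y = (rng.uniform(size=200) < 1 / (1 + np.exp(-X @ w_true))).astype(float)
w = ista_logreg(X, y)
print("nonzeros recovered:", np.nonzero(w)[0])
```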

5 0.43448511 35 nips-2006-Approximate inference using planar graph decomposition

Author: Amir Globerson, Tommi S. Jaakkola

Abstract: A number of exact and approximate methods are available for inference calculations in graphical models. Many recent approximate methods for graphs with cycles are based on tractable algorithms for tree structured graphs. Here we base the approximation on a different tractable model, planar graphs with binary variables and pure interaction potentials (no external field). The partition function for such models can be calculated exactly using an algorithm introduced by Fisher and Kasteleyn in the 1960s. We show how such tractable planar models can be used in a decomposition to derive upper bounds on the partition function of non-planar models. The resulting algorithm also allows for the estimation of marginals. We compare our planar decomposition to the tree decomposition method of Wainwright et al., showing that it results in a much tighter bound on the partition function, improved pairwise marginals, and comparable singleton marginals. Graphical models are a powerful tool for modeling multivariate distributions, and have been successfully applied in various fields such as coding theory and image processing. Applications of graphical models typically involve calculating two types of quantities, namely marginal distributions and MAP assignments. The evaluation of the model partition function is closely related to calculating marginals [12]. These three problems can rarely be solved exactly in polynomial time, and are provably computationally hard in the general case [1]. When the model conforms to a tree structure, however, all these problems can be solved in polynomial time. This has prompted extensive research into tree based methods. For example, the junction tree method [6] converts a graphical model into a tree by clustering nodes into cliques, such that the graph over cliques is a tree. The resulting maximal clique size (cf. tree width) may nevertheless be prohibitively large. Wainwright et al. [9, 11] proposed an approximate method based on trees known as tree reweighting (TRW). The TRW approach decomposes the potential vector of a graphical model into a mixture over spanning trees of the model, and then uses convexity arguments to bound various quantities, such as the partition function. One key advantage of this approach is that it provides bounds on the partition function value, a property which is not shared by approximations based on Bethe free energies [13]. In this paper we focus on a different class of tractable models: planar graphs. A graph is called planar if it can be drawn in the plane without crossing edges. Works in the 1960s by physicists Fisher [5] and Kasteleyn [7], among others, have shown that the partition function for planar graphs may be calculated in polynomial time. This, however, is true under two key restrictions. One is that the variables $x_i$ are binary. The other is that the interaction potential depends only on $x_i x_j$ (where $x_i \in \{\pm 1\}$), and not on their individual values (i.e., the zero external field case). Here we show how the above method can be used to obtain upper bounds on the partition function for non-planar graphs. As in TRW, we decompose the potential of a non-planar graph into a sum over spanning planar models, and then use a convexity argument to obtain an upper bound on the log partition function. The bound optimization is a convex problem, and can be solved in polynomial time.
We compare our method with TRW on a planar graph with an external field, and show that it performs favorably with respect to both pairwise marginals and the bound on the partition function, and the two methods give similar results for singleton marginals.

1 Definitions and Notations

Given a graph G with n vertices and a set of edges E, we are interested in pairwise Markov Random Fields (MRF) over the graph G. A pairwise MRF [13] is a multivariate distribution over variables $x = \{x_1, \ldots, x_n\}$ defined as

$p(x) = \frac{1}{Z} e^{\sum_{ij \in E} f_{ij}(x_i, x_j)}$   (1)

where $f_{ij}$ are a set of $|E|$ functions, or interaction potentials, defined over pairs of variables. The partition function is defined as $Z = \sum_x e^{\sum_{ij \in E} f_{ij}(x_i, x_j)}$. Here we will focus on the case where $x_i \in \{\pm 1\}$. Furthermore, we will be interested in interaction potentials which only depend on agreement or disagreement between the signs of their variables. We define those by

$f_{ij}(x_i, x_j) = \tfrac{1}{2}\theta_{ij}(1 + x_i x_j) = \theta_{ij}\, I(x_i = x_j)$   (2)

so that $f_{ij}(x_i, x_j)$ is zero if $x_i \neq x_j$ and $\theta_{ij}$ if $x_i = x_j$. The model is then defined via the set of parameters $\theta_{ij}$. We use $\theta$ to denote the vector of parameters $\theta_{ij}$, and denote the partition function by $Z(\theta)$ to highlight its dependence on these parameters.

A graph G is defined as planar if it can be drawn in the plane without any intersection of edges [4]. With some abuse of notation, we define E as the set of line segments in $\mathbb{R}^2$ corresponding to the edges in the graph. The regions of $\mathbb{R}^2 \setminus E$ are defined as the faces of the graph. The face which corresponds to an unbounded region is called the external face. Given a planar graph G, its dual graph $G^*$ is defined in the following way: the vertices of $G^*$ correspond to faces of G, and there is an edge between two vertices in $G^*$ iff the two corresponding faces in G share an edge. If the graph G is weighted, the weight on an edge in $G^*$ is the weight on the edge shared by the corresponding faces in G. A plane triangulation of a planar graph G is obtained from G by adding edges such that all the faces of the resulting graph have exactly three vertices. Thus a plane triangulated graph has a dual where all vertices have degree three. It can be shown that every plane graph can be plane triangulated [4]. We shall also need the notion of a perfect matching on a graph. A perfect matching on a graph G is defined as a set of edges $H \subseteq E$ such that every vertex in G has exactly one edge in H incident on it. If the graph is weighted, the weight of the matching is defined as the product of the weights of the edges in the matching.

Finally, we recall the definition of a marginal polytope of a graph [12]. Consider an MRF over a graph G where $f_{ij}$ are given by Equation 2. Denote the probability of the event $I(x_i = x_j)$ under $p(x)$ by $\tau_{ij}$. The marginal polytope of G, denoted by $\mathcal{M}(G)$, is defined as the set of values $\tau_{ij}$ that can be obtained under some assignment to the parameters $\theta_{ij}$. For a general graph G the polytope $\mathcal{M}(G)$ cannot be described using a polynomial number of inequalities. However, for planar graphs, it turns out that a set of $O(n^3)$ constraints, commonly referred to as triangle inequalities, suffice to describe $\mathcal{M}(G)$ (see [3] page 434). (The definition here is slightly different from that in [3], since here we refer to agreement probabilities, whereas [3] refers to disagreement probabilities; this polytope is also referred to as the cut polytope.) The triangle inequalities are defined by

$\mathrm{TRI}(n) = \{\tau_{ij} : \tau_{ij} + \tau_{jk} - \tau_{ik} \leq 1,\ \tau_{ij} + \tau_{jk} + \tau_{ik} \geq 1,\ \forall i, j, k \in \{1, \ldots, n\}\}$   (3)

Note that the above inequalities actually contain variables $\tau_{ij}$ which do not correspond to edges in the original graph G.
Thus the equality $\mathcal{M}(G) = \mathrm{TRI}(n)$ should be understood as referring only to the values of $\tau_{ij}$ that correspond to edges in the graph. Importantly, the values of $\tau_{ij}$ for edges not in the graph need not be valid marginals for any MRF. In other words, $\mathcal{M}(G)$ is a projection of $\mathrm{TRI}(n)$ on the set of edges of G. It is well known that the marginal polytope for trees is described via pairwise constraints. It is thus interesting that for planar graphs, it is triplets, rather than pairwise constraints, that characterize the polytope. In this sense, planar graphs and trees may be viewed as a hierarchy of polytope complexity classes. It remains an interesting problem to characterize other structures in this hierarchy and their related inference algorithms.

2 Exact calculation of partition function using perfect matching

The seminal works of Kasteleyn [7] and Fisher [5] have shown how one can calculate the partition function for a binary MRF over a planar graph with pure interaction potentials. We briefly review Fisher's construction, which we will use in what follows. Our interpretation of the method differs somewhat from that of Fisher, but we believe it is more straightforward. The key idea in calculating the partition function is to convert the summation over values of x to the problem of calculating the sum of weights of all perfect matchings in a graph constructed from G, as shown below. In this section, we consider weighted graphs (graphs with numbers assigned to their edges). For the graph G associated with the pairwise MRF, we assign weights $w_{ij} = e^{2\theta_{ij}}$ to the edges. The first step in the construction is to plane triangulate the graph G. Let us call the resulting graph $G_T$. We define an MRF on $G_T$ by assigning a parameter $\theta_{ij} = 0$ to the edges that have been added to G, and the corresponding weight $w_{ij} = 1$. Thus $G_T$ essentially describes the same distribution as G, and therefore has the same partition function. We can thus restrict our attention to calculating the partition function for the MRF on $G_T$.

As a first step in calculating a partition function over $G_T$, we introduce the following definition: a set of edges $\hat{E}$ in $G_T$ is an agreement edge set (or AES) if for every triangle face F in $G_T$ one of the following holds: the edges in F are all in $\hat{E}$, or exactly one of the edges in F is in $\hat{E}$. The weight of a set $\hat{E}$ is defined as the product of the weights of the edges in $\hat{E}$.

It can be shown that there exists a bijection between pairs of assignments $\{x, -x\}$ and agreement edge sets. The mapping from x to an edge set is simply the set of edges such that $x_i = x_j$. It is easy to see that this is an agreement edge set. The reverse mapping is obtained by finding an assignment x such that $x_i = x_j$ iff the corresponding edge is in the agreement edge set. The existence of this mapping can be shown by induction on the number of (triangle) faces.

The contribution of a given assignment x to the partition function is $e^{\sum_{ij \in E} \theta_{ij} I(x_i = x_j)}$. If x corresponds to an AES denoted by $\hat{E}$, it is easy to see that

$e^{\sum_{ij \in E} \theta_{ij} I(x_i = x_j)} = e^{-\sum_{ij \in E} \theta_{ij}}\, e^{\sum_{ij \in \hat{E}} 2\theta_{ij}} = c \prod_{ij \in \hat{E}} w_{ij}$   (4)

where $c = e^{-\sum_{ij \in E} \theta_{ij}}$. Define the superset $\Lambda$ as the set of agreement edge sets. The above then implies that $Z(\theta) = 2c \sum_{\hat{E} \in \Lambda} \prod_{ij \in \hat{E}} w_{ij}$, and Z is thus proportional to the sum of AES weights.
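Equations 1 and 2 are easy to sanity-check by brute force on a graph small enough to enumerate, and such exhaustive sums are also the natural ground truth when testing any approximate partition function. The sketch below (a hypothetical toy, not code from the paper) evaluates Z for the agreement-potential model on a triangle.

```python
import itertools
import numpy as np

# Brute-force partition function for the agreement-potential MRF of
# Eqs. (1)-(2): f_ij = theta_ij * I(x_i == x_j), with x_i in {-1, +1}.
# Exponential in n, so only usable as ground truth on tiny graphs.

def partition_function(n, edges, theta):
    Z = 0.0
    for x in itertools.product([-1, 1], repeat=n):
        score = sum(t for (i, j), t in zip(edges, theta) if x[i] == x[j])
        Z += np.exp(score)
    return Z

edges = [(0, 1), (1, 2), (0, 2)]  # a triangle, already plane triangulated
theta = [0.5, -0.3, 1.0]
print("Z =", partition_function(3, edges, theta))
```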
To sum over agreement edge sets, we use the following elegant trick introduced by Fisher [5]. Construct a new graph $G_{PM}$ from the dual of $G_T$ by introducing new vertices and edges according to the following rule: replace each original vertex with three vertices that are connected to each other, and assign a weight of one to the new edges. Next, consider the three neighbors of the original vertex. (Note that in the dual of $G_T$ all vertices have degree three, since $G_T$ is plane triangulated.) Connect each of the three new vertices to one of these three neighbors, keeping the original weights on these edges. The transformation is illustrated in Figure 1. The new graph $G_{PM}$ has $O(3n)$ vertices, and is also planar. It can be seen that there is a one to one correspondence between perfect matchings in $G_{PM}$ and agreement edge sets in $G_T$. Define $\Omega$ to be the set of perfect matchings in $G_{PM}$. Then $Z(\theta) = 2c \sum_{M \in \Omega} \prod_{ij \in M} w_{ij}$, where we have used the fact that all the new weights have a value of one. Thus, the partition function is a sum over the weights of perfect matchings in $G_{PM}$.

[Figure 1: Illustration of the graph transformations in Section 2 for a complete graph with four vertices. Left panel shows the original weighted graph (dotted edges and grey vertices) and its dual (solid edges and black vertices). Right panel shows the dual graph with each vertex replaced by a triangle (the graph $G_{PM}$ in the text). Weights for dual graph edges correspond to the weights on the original graph.]

Finally, we need a way of summing over the weights of the set of perfect matchings in a graph. Kasteleyn [7] proved that for a planar graph $G_{PM}$, this sum may be obtained using the following sequence of steps:

- Direct the edges of the graph $G_{PM}$ such that for every face (except possibly the external face), the number of edges on its perimeter oriented in a clockwise manner is odd. Kasteleyn showed that such a so called Pfaffian orientation may be constructed in polynomial time for a planar graph (see also [8] page 322).
- Define the matrix $P(G_{PM})$ to be a skew symmetric matrix such that $P_{ij} = 0$ if ij is not an edge, $P_{ij} = w_{ij}$ if the arrow on edge ij runs from i to j, and $P_{ij} = -w_{ij}$ otherwise.
- The sum over weighted matchings can then be shown to equal $\sqrt{|P(G_{PM})|}$.

The partition function is thus given by $Z(\theta) = 2c\,\sqrt{|P(G_{PM})|}$. To conclude this section we reiterate the following two key points: the partition function of a binary MRF over a planar graph with interaction potentials as in Equation 2 may be calculated in polynomial time by calculating the determinant of a matrix of size $O(3n)$. An important outcome of this result is that the functional relation between $Z(\theta)$ and the parameters $\theta_{ij}$ is known, a fact we shall use in what follows.

3 Partition function bounds via planar decomposition

Given a non-planar graph G over binary variables with a vector of interaction potentials $\theta$, we wish to use the exact planar computation to obtain a bound on the partition function of the MRF on G. We assume for simplicity that the potentials on the MRF for G are given in the form of Equation 2. Thus, G violates the assumptions of the previous section only in its non-planarity. Define $\{G^{(r)}\}$ as a set of spanning planar subgraphs of G, i.e., each graph $G^{(r)}$ is planar and contains all the vertices of G and some of its edges. Denote by m the number of such graphs.
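Before developing the bound, the matching-determinant identity from Section 2 can be checked numerically on the smallest interesting case. The sketch below hand-builds a Pfaffian orientation of a 4-cycle (orienting 1→2, 2→3, 3→4, 1→4 gives the inner face an odd number of clockwise edges) rather than running the general construction; with unit weights the two perfect matchings have total weight 2, which $\sqrt{\det P}$ recovers since $\mathrm{Pf}(P)^2 = \det(P)$ for a skew-symmetric matrix.

```python
import numpy as np

# Oriented, skew-symmetric edge-weight matrix of a 4-cycle under a
# hand-built Pfaffian orientation. The two perfect matchings {12, 34}
# and {23, 14} have unit weights, so the matching sum should be 2.

P = np.array([[ 0.0,  1.0,  0.0,  1.0],
              [-1.0,  0.0,  1.0,  0.0],
              [ 0.0, -1.0,  0.0,  1.0],
              [-1.0,  0.0, -1.0,  0.0]])

matching_sum = np.sqrt(np.linalg.det(P))
print("sum of perfect matching weights:", matching_sum)  # -> 2.0
```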
Introduce the following definitions:

- $\theta^{(r)}$ is a set of parameters on the edges of $G^{(r)}$, and $\theta^{(r)}_{ij}$ is an element in this set. $Z(\theta^{(r)})$ is the partition function of the MRF on $G^{(r)}$ with parameters $\theta^{(r)}$.
- $\hat{\theta}^{(r)}$ is a set of parameters on the edges of G such that if edge (ij) is in $G^{(r)}$ then $\hat{\theta}^{(r)}_{ij} = \theta^{(r)}_{ij}$, and otherwise $\hat{\theta}^{(r)}_{ij} = 0$.

Given a distribution $\rho(r)$ on the graphs $G^{(r)}$ (i.e., $\rho(r) \geq 0$ for $r = 1, \ldots, m$ and $\sum_r \rho(r) = 1$), assume that the parameters for $G^{(r)}$ are such that

$\theta = \sum_r \rho(r)\, \hat{\theta}^{(r)}$   (5)

Then, by the convexity of the log partition function as a function of the model parameters, we have

$\log Z(\theta) \leq \sum_r \rho(r) \log Z(\theta^{(r)}) \equiv f(\theta, \rho, \theta^{(r)})$   (6)

Since by assumption the graphs $G^{(r)}$ are planar, this bound can be calculated in polynomial time. Since this bound is true for any set of parameters $\theta^{(r)}$ which satisfies the condition in Equation 5 and for any distribution $\rho(r)$, we may optimize over these two variables to obtain the tightest bound possible. Define the optimal bound for a fixed value of $\rho(r)$ by $g(\rho, \theta)$ (optimization is w.r.t. $\theta^{(r)}$):

$g(\rho, \theta) = \min_{\theta^{(r)}:\ \sum_r \rho(r) \hat{\theta}^{(r)} = \theta} f(\theta, \rho, \theta^{(r)})$   (7)

Also, define the optimum of the above w.r.t. $\rho$ by $h(\theta)$:

$h(\theta) = \min_{\rho:\ \rho(r) \geq 0,\ \sum_r \rho(r) = 1} g(\rho, \theta)$   (8)

Thus, $h(\theta)$ is the optimal upper bound for the given parameter vector $\theta$. In the following section we argue that we can in fact find the global optimum of the above problem.

4 Globally Optimal Bound Optimization

First consider calculating $g(\rho, \theta)$ from Equation 7. Note that since $\log Z(\theta^{(r)})$ is a convex function of $\theta^{(r)}$, and the constraints are linear, the overall optimization is convex and can be solved efficiently. In the current implementation, we use a projected gradient algorithm [2]. The gradient of $f(\theta, \rho, \theta^{(r)})$ w.r.t. $\theta^{(r)}$ is given by

$\frac{\partial f(\theta, \rho, \theta^{(r)})}{\partial \theta^{(r)}_{ij}} = \rho(r)\left(1 + e^{\theta^{(r)}_{ij}}\, P^{-1}(G^{(r)}_{PM})_{k(i,j)}\, \mathrm{Sign}\big(P_{k(i,j)}(G^{(r)}_{PM})\big)\right)$   (9)

where $k(i,j)$ returns the row and column indices of the element in the upper triangular matrix of $P(G^{(r)}_{PM})$ which contains the element $e^{2\theta^{(r)}_{ij}}$.

Since the optimization in Equation 7 is convex, it has an equivalent convex dual. Although we do not use this dual for optimization (because of the difficulty of expressing the entropy of planar models solely in terms of triplet marginals), it nevertheless allows some insight into the structure of the problem. The dual in this case is closely linked to the notion of the marginal polytope defined in Section 1. Using a derivation similar to [11], we arrive at the following characterization of the dual:

$g(\rho, \theta) = \max_{\tau \in \mathrm{TRI}(n)} \left( \theta \cdot \tau + \sum_r \rho(r)\, H(\theta^{(r)}(\tau)) \right)$   (10)

where $\theta^{(r)}(\tau)$ denotes the parameters of an MRF on $G^{(r)}$ such that its marginals are given by the restriction of $\tau$ to the edges of $G^{(r)}$, and $H(\theta^{(r)}(\tau))$ denotes the entropy of the MRF over $G^{(r)}$ with parameters $\theta^{(r)}(\tau)$. The maximized function in Equation 10 is linear in $\rho$, and thus $g(\rho, \theta)$ is a pointwise maximum over linear functions of $\rho$ and is therefore convex in $\rho$. It therefore has no local minima. Denote by $\theta^{(r)}_{\min}(\rho)$ the set of parameters that minimizes Equation 7 for a given value of $\rho$. Using a derivation similar to that in [11], the gradient of $g(\rho, \theta)$ can be shown to be

$\frac{\partial g(\rho, \theta)}{\partial \rho(r)} = H(\theta^{(r)}_{\min}(\rho))$   (11)

Since the partition function for $G^{(r)}$ can be calculated efficiently, so can the entropy. We can now summarize the algorithm for calculating $h(\theta)$:

- Initialize $\rho_0$. Iterate:
  - For $\rho_t$, find the $\theta^{(r)}$ which solve the minimization in Equation 7.
  - Calculate the gradient of $g(\rho, \theta)$ at $\rho_t$ using the expression in Equation 11.
  - Update $\rho_{t+1} = \rho_t + \alpha v$, where v is a feasible search direction calculated from the gradient of $g(\rho, \theta)$ and the simplex constraints on $\rho$. The step size $\alpha$ is calculated via an Armijo line search.
  - Halt when the change in $g(\rho, \theta)$ is smaller than some threshold.

Note that the minimization w.r.t. $\theta^{(r)}$ is not very time consuming, since we can initialize it with the minimum from the previous step, and thus only a few iterations are needed to find the new optimum, provided the change in $\rho$ is not too big. The above algorithm is guaranteed to converge to a global optimum of $\rho$ [2], and thus we obtain the tightest possible upper bound on $Z(\theta)$ given our planar graph decomposition.

The procedure described here is asymmetric w.r.t. $\rho$ and $\theta^{(r)}$. In a symmetric formulation the minimizing gradient steps could be carried out jointly or in an alternating sequence. The symmetric formulation can be obtained by decoupling $\rho$ and $\theta^{(r)}$ in the bi-linear constraint $\sum_r \rho(r) \hat{\theta}^{(r)} = \theta$. Specifically, we introduce $\tilde{\theta}^{(r)} = \theta^{(r)} \rho(r)$ and perform the optimization w.r.t. $\rho$ and $\tilde{\theta}^{(r)}$. It can be shown that a stationary point of $f(\theta, \rho, \tilde{\theta}^{(r)})$ with the relevant (de-coupled) constraint is equivalent to the procedure described above. The advantage of this approach is that the exact minimization w.r.t. $\theta^{(r)}$ is not required before modifying $\rho$. Our experiments have shown, however, that the methods take comparable times to converge, although this may be a property of the implementation.

[Figure 2: Illustration of planar subgraph construction for a rectangular lattice with external field. Original graph is shown on the left. The field vertex is connected to all vertices (edges not shown). The graph on the right results from isolating the 4th and 5th columns of the original graph (shown in grey), and connecting the field vertex to the external vertices of the three disconnected components. Note that the resulting graph is planar.]

5 Estimating Marginals

The optimization problem as defined above minimizes an upper bound on the partition function. However, it may also be of interest to obtain estimates of the marginals of the MRF over G. To obtain marginal estimates, we follow the approach in [11]. We first characterize the optimum of Equation 7 for a fixed value of $\rho$. Deriving the Lagrangian of Equation 7 w.r.t. $\theta^{(r)}$, we obtain the following characterization of $\theta^{(r)}_{\min}(\rho)$:

Marginal Optimality Criterion: For any two graphs $G^{(r)}$, $G^{(s)}$ such that the edge (ij) is in both graphs, the optimal parameter vector satisfies $\tau_{ij}(\theta^{(r)}_{\min}(\rho)) = \tau_{ij}(\theta^{(s)}_{\min}(\rho))$.

Thus, the optimal set of parameters for the graphs $G^{(r)}$ is such that every two graphs agree on the marginals of all the edges they share. This implies that at the optimum, there is a well defined set of marginals over all the edges. We use this set as an approximation to the true marginals.

A different method for estimating marginals uses the partition function bound directly. We first calculate partition function bounds on the sums $\alpha_i(1) = \sum_{x: x_i = 1} e^{\sum_{ij \in E} f_{ij}(x_i, x_j)}$ and $\alpha_i(-1) = \sum_{x: x_i = -1} e^{\sum_{ij \in E} f_{ij}(x_i, x_j)}$, and then normalize $\frac{\alpha_i(1)}{\alpha_i(1) + \alpha_i(-1)}$ to obtain an estimate for $p(x_i = 1)$. This method has the advantage of being more numerically stable (since it does not depend on derivatives of $\log Z$). However, it needs to be calculated separately for each variable, so it may be time consuming if one is interested in marginals for a large set of variables.
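A skeleton of the outer loop of Section 4 is sketched below: a gradient step on $\rho$ followed by Euclidean projection back onto the simplex, a simple feasible-direction stand-in for the Armijo search described above. The entropy gradients of Equation 11 are replaced by random placeholders; in the real algorithm each component comes from solving Equation 7 on a planar subgraph.

```python
import numpy as np

# Projected gradient over the mixture weights rho. project_simplex is the
# standard sort-based Euclidean projection onto {x : x >= 0, sum x = 1}.

def project_simplex(v):
    u = np.sort(v)[::-1]
    css = np.cumsum(u)
    idx = np.nonzero(u * np.arange(1, len(v) + 1) > (css - 1))[0][-1]
    tau = (css[idx] - 1) / (idx + 1)
    return np.maximum(v - tau, 0.0)

rng = np.random.default_rng(0)
m = 5
rho = np.full(m, 1.0 / m)
for t in range(100):
    grad = rng.normal(size=m)  # placeholder for H(theta_min^(r)(rho))
    rho = project_simplex(rho - 0.1 * grad)
assert abs(rho.sum() - 1.0) < 1e-9 and (rho >= 0).all()
print("rho:", rho)
```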
6 Experimental Evaluation

We study the application of our Planar Decomposition (PDC) method to a binary MRF on a square lattice with an external field. The MRF is given by $p(x) \propto e^{\sum_{ij \in E} \theta_{ij} x_i x_j + \sum_{i \in V} \theta_i x_i}$, where V are the lattice vertices, and $\theta_i$ and $\theta_{ij}$ are parameters. Note that this interaction does not satisfy the conditions for exact calculation of the partition function, even though the graph is planar. This problem is in fact NP hard [1]. However, it is possible to obtain the desired interaction form by introducing an additional variable $x_{n+1}$ that is connected to all the original variables. Denote the corresponding graph by $G_f$. Consider the distribution $p(x, x_{n+1}) \propto e^{\sum_{ij \in E} \theta_{ij} x_i x_j + \sum_{i \in V} \theta_{i,n+1} x_i x_{n+1}}$, where $\theta_{i,n+1} = \theta_i$. It is easy to see that any property of $p(x)$ (e.g., partition function, marginals) may be calculated from the corresponding property of $p(x, x_{n+1})$. The advantage of the latter distribution is that it has the desired interaction form. We can thus apply PDC by choosing planar subgraphs of the non-planar graph $G_f$.

[Figure 3: Comparison of the TRW and Planar Decomposition (PDC) algorithms on a 7×7 square lattice. TRW results shown in red squares, and PDC in blue circles. Left column shows the error in the log partition bound. Middle column is the mean error for pairwise marginals, and right column is the error for the singleton marginal of the variable at the lattice center. Results in the upper row are for field parameters drawn from U[−0.05, 0.05] and various interaction parameters. Results in the lower row are for interaction parameters drawn from U[−0.5, 0.5] and various field parameters. Error bars are standard errors calculated from 40 random trials.]

There are clearly many ways to choose spanning planar subgraphs of $G_f$. Spanning subtrees are one option, and were used in [11]. Since our optimization is polynomial in the number of subgraphs, we preferred to use a number of subgraphs that is linear in $\sqrt{n}$. The key idea in generating these planar subgraphs is to generate disconnected components of the lattice and connect $x_{n+1}$ only to the external vertices of these components. Here we generate three disconnected components by isolating two neighboring columns (or rows) from the rest of the graph, resulting in three components. This is illustrated in Figure 2. To this set of $2\sqrt{n}$ graphs, we add the independent variables graph consisting only of edges from the field node to all the other nodes.

We compared the performance of the PDC and TRW methods on a 7×7 lattice. (TRW and PDC bounds were optimized over both the subgraph parameters and the mixture parameters $\rho$. In terms of running time, PDC optimization for a fixed value of $\rho$ took about 30 seconds, which is still slower than the TRW message passing implementation.) Since the exact partition function and marginals can be calculated for this case, we could compare both algorithms to the true values. The MRF parameters were set according to the two following scenarios: 1) Varying Interaction: the field parameters $\theta_i$ were drawn uniformly from U[−0.05, 0.05], and the interaction $\theta_{ij}$ from U[−α, α] where $\alpha \in \{0.2, 0.4, \ldots, 2\}$. This is the setting tested in [11]. 2) Varying Field: $\theta_i$ was drawn uniformly from U[−α, α], where $\alpha \in \{0.2, 0.4, \ldots, 2\}$, and $\theta_{ij}$ from U[−0.5, 0.5].
For each scenario, we calculated the following measures: 1) Normalized log partition error, $\frac{1}{49}(\log Z^{alg} - \log Z^{true})$. 2) Error in pairwise marginals, $\frac{1}{|E|} \sum_{ij \in E} |p^{alg}(x_i = 1, x_j = 1) - p^{true}(x_i = 1, x_j = 1)|$. Pairwise marginals were calculated jointly using the marginal optimality criterion of Section 5. 3) Error in singleton marginals. We calculated the singleton marginals for the innermost node in the lattice (i.e., coordinate [3, 3]), which intuitively should be the most difficult for the planar based algorithm. This marginal was calculated using two partition functions, as explained in Section 5. (Results using the marginal optimality criterion were worse for PDC, possibly due to its reduced numerical precision.) The same method was used for TRW. The reported error measure is $|p^{alg}(x_i = 1) - p^{true}(x_i = 1)|$. Results were averaged over 40 random trials.

Results for the two scenarios and the different evaluation measures are given in Figure 3. It can be seen that the partition function bound for PDC is significantly better than TRW for almost all parameter settings, although the difference becomes smaller for large field values. Errors for the PDC pairwise marginals are smaller than those of TRW for all parameter settings. For the singleton parameters, TRW slightly outperforms PDC. This is not surprising, since the field is modeled by every spanning tree in the TRW decomposition, whereas in PDC not all the structures model a given field.

7 Discussion

We have presented a method for using planar graphs as the basis for approximating non-planar graphs such as planar graphs with external fields. While the restriction to binary variables limits the applicability of our approach, it remains relevant in many important applications, such as coding theory and combinatorial optimization. Moreover, it is always possible to convert a non-binary graphical model to a binary one by introducing additional variables. The resulting graph will typically not be planar, even when the original graph over $k$-ary variables is. However, the planar decomposition method can then be applied to this non-planar graph.

The optimization of the decomposition is carried out explicitly over the planar subgraphs, thus limiting the number of subgraphs that can be used in the approximation. In the TRW method this problem is circumvented, since it is possible to implicitly optimize over all spanning trees. The reason this can be done for trees is that the entropy of an MRF over a tree may be written as a function of its marginal variables. We do not know of an equivalent result for planar graphs, and it remains a challenge to find one. It is however possible to combine the planar and tree decompositions into one single bound, which is guaranteed to outperform the tree or planar approximations alone.

The planar decomposition idea may in principle be applied to bounding the value of the MAP assignment. However, as in TRW, it can be shown that the solution is not dependent on the decomposition (as long as each edge appears in some structure), and the problem is equivalent to maximizing a linear function over the marginal polytope (which can be done in polynomial time for planar graphs). However, such a decomposition may suggest new message passing algorithms, as in [10].
Acknowledgments

The authors acknowledge support from the Defense Advanced Research Projects Agency (Transfer Learning program). Amir Globerson is also supported by the Rothschild Yad-Hanadiv fellowship. The authors also wish to thank Martin Wainwright for providing his TRW code.

References

[1] F. Barahona. On the computational complexity of Ising spin glass models. J. Phys. A, 15(10):3241–3253, 1982.
[2] D. P. Bertsekas, editor. Nonlinear Programming. Athena Scientific, Belmont, MA, 1995.
[3] M. M. Deza and M. Laurent. Geometry of Cuts and Metrics. Springer-Verlag, 1997.
[4] R. Diestel. Graph Theory. Springer-Verlag, 1997.
[5] M. E. Fisher. On the dimer solution of planar Ising models. J. Math. Phys., 7:1776–1781, 1966.
[6] M. I. Jordan, editor. Learning in Graphical Models. MIT Press, Cambridge, MA, 1998.
[7] P. W. Kasteleyn. Dimer statistics and phase transitions. Journal of Math. Physics, 4:287–293, 1963.
[8] L. Lovasz and M. D. Plummer. Matching Theory, volume 29 of Annals of Discrete Mathematics. North-Holland, New York, 1986.
[9] M. J. Wainwright, T. Jaakkola, and A. S. Willsky. Tree-based reparameterization framework for analysis of sum-product and related algorithms. IEEE Trans. on Information Theory, 49(5):1120–1146, 2003.
[10] M. J. Wainwright, T. Jaakkola, and A. S. Willsky. MAP estimation via agreement on trees: message-passing and linear programming. IEEE Trans. on Information Theory, 51(11):1120–1146, 2005.
[11] M. J. Wainwright, T. Jaakkola, and A. S. Willsky. A new class of upper bounds on the log partition function. IEEE Trans. on Information Theory, 51(7):2313–2335, 2005.
[12] M. J. Wainwright and M. I. Jordan. Graphical models, exponential families, and variational inference. Technical report, UC Berkeley Dept. of Statistics, 2003.
[13] J. S. Yedidia, W. T. Freeman, and Y. Weiss. Constructing free-energy approximations and generalized belief propagation algorithms. IEEE Trans. on Information Theory, 51(7):2282–2312, 2005.

6 0.4303149 57 nips-2006-Conditional mean field

7 0.40085459 155 nips-2006-Optimal Single-Class Classification Strategies

8 0.38428068 178 nips-2006-Sparse Multinomial Logistic Regression via Bayesian L1 Regularisation

9 0.38087252 190 nips-2006-The Neurodynamics of Belief Propagation on Binary Markov Random Fields

10 0.37220466 2 nips-2006-A Collapsed Variational Bayesian Inference Algorithm for Latent Dirichlet Allocation

11 0.3705126 182 nips-2006-Statistical Modeling of Images with Fields of Gaussian Scale Mixtures

12 0.35269585 169 nips-2006-Relational Learning with Gaussian Processes

13 0.35135996 112 nips-2006-Learning Nonparametric Models for Probabilistic Imitation

14 0.30768487 1 nips-2006-A Bayesian Approach to Diffusion Models of Decision-Making and Response Time

15 0.30685112 174 nips-2006-Similarity by Composition

16 0.30555212 135 nips-2006-Modelling transcriptional regulation using Gaussian Processes

17 0.30026412 41 nips-2006-Bayesian Ensemble Learning

18 0.29923883 64 nips-2006-Data Integration for Classification Problems Employing Gaussian Process Priors

19 0.28581929 98 nips-2006-Inferring Network Structure from Co-Occurrences

20 0.28498977 139 nips-2006-Multi-dynamic Bayesian Networks


similar papers computed by lda model

lda for this paper:

topicId topicWeight

[(1, 0.086), (3, 0.016), (7, 0.109), (9, 0.034), (20, 0.031), (22, 0.06), (35, 0.245), (44, 0.088), (57, 0.089), (65, 0.048), (69, 0.08), (90, 0.021)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.81781131 43 nips-2006-Bayesian Model Scoring in Markov Random Fields

Author: Sridevi Parise, Max Welling

Abstract: Scoring structures of undirected graphical models by means of evaluating the marginal likelihood is very hard. The main reason is the presence of the partition function which is intractable to evaluate, let alone integrate over. We propose to approximate the marginal likelihood by employing two levels of approximation: we assume normality of the posterior (the Laplace approximation) and approximate all remaining intractable quantities using belief propagation and the linear response approximation. This results in a fast procedure for model scoring. Empirically, we find that our procedure has about two orders of magnitude better accuracy than standard BIC methods for small datasets, but deteriorates when the size of the dataset grows. 1

2 0.72686654 168 nips-2006-Reducing Calibration Time For Brain-Computer Interfaces: A Clustering Approach

Author: Matthias Krauledat, Michael Schröder, Benjamin Blankertz, Klaus-Robert Müller

Abstract: Up to now, even subjects who are experts in the use of machine learning based BCI systems still have to undergo a calibration session of about 20-30 min. From this data their (movement) intentions are so far inferred. We now propose a new paradigm that allows one to completely omit such calibration and instead transfer knowledge from prior sessions. To achieve this goal we first define normalized CSP features and distances in-between. Second, we derive prototypical features across sessions: (a) by clustering or (b) by feature concatenation methods. Finally, we construct a classifier based on these individualized prototypes and show that, indeed, classifiers can be successfully transferred to a new session for a number of subjects.
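For context on the feature step, CSP filters are the generalized eigenvectors of the two class-conditional covariance matrices; the sketch below shows that step alone, on synthetic data with assumed shapes. Real pipelines band-pass filter the EEG first, and the paper's contribution (normalizing CSP features and clustering or concatenating them across sessions) sits on top of this computation.

```python
import numpy as np
from scipy.linalg import eigh

# Common Spatial Patterns: generalized eigendecomposition of the two
# class-averaged, trace-normalized channel covariances.

def csp_filters(trials_a, trials_b):
    # trials_*: arrays of shape (n_trials, n_channels, n_samples)
    cov = lambda T: np.mean([X @ X.T / np.trace(X @ X.T) for X in T], axis=0)
    Sa, Sb = cov(trials_a), cov(trials_b)
    # Solve Sa w = lambda (Sa + Sb) w; extreme eigenvectors discriminate best.
    vals, vecs = eigh(Sa, Sa + Sb)
    return vals, vecs

rng = np.random.default_rng(0)
a = rng.normal(size=(30, 8, 200)) * np.linspace(1, 2, 8)[None, :, None]
b = rng.normal(size=(30, 8, 200)) * np.linspace(2, 1, 8)[None, :, None]
vals, W = csp_filters(a, b)
print("eigenvalue spread:", vals.min(), vals.max())
```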

3 0.63029563 160 nips-2006-Part-based Probabilistic Point Matching using Equivalence Constraints

Author: Graham Mcneill, Sethu Vijayakumar

Abstract: Correspondence algorithms typically struggle with shapes that display part-based variation. We present a probabilistic approach that matches shapes using independent part transformations, where the parts themselves are learnt during matching. Ideas from semi-supervised learning are used to bias the algorithm towards finding ‘perceptually valid’ part structures. Shapes are represented by unlabeled point sets of arbitrary size and a background component is used to handle occlusion, local dissimilarity and clutter. Thus, unlike many shape matching techniques, our approach can be applied to shapes extracted from real images. Model parameters are estimated using an EM algorithm that alternates between finding a soft correspondence and computing the optimal part transformations using Procrustes analysis.
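The Procrustes step inside the EM loop has a closed form via the SVD; the sketch below shows it for a single rigid part with known point-to-point correspondence, which is an assumption: in the paper the correspondence is soft, so the update would use responsibility-weighted means and cross-covariances instead.

```python
import numpy as np

# Orthogonal Procrustes: optimal rotation R and translation t mapping
# matched point set X onto Y, via the SVD of the centered cross-covariance.

def procrustes(X, Y):
    # X, Y: (n_points, dim) arrays with rows in correspondence.
    Xc, Yc = X - X.mean(0), Y - Y.mean(0)
    U, _, Vt = np.linalg.svd(Xc.T @ Yc)
    R = U @ Vt
    if np.linalg.det(R) < 0:  # keep a proper rotation, no reflection
        U[:, -1] *= -1
        R = U @ Vt
    t = Y.mean(0) - X.mean(0) @ R
    return R, t

rng = np.random.default_rng(0)
X = rng.normal(size=(10, 2))
theta = 0.7
R_true = np.array([[np.cos(theta), -np.sin(theta)],
                   [np.sin(theta),  np.cos(theta)]])
Y = X @ R_true + np.array([1.0, -2.0])
R, t = procrustes(X, Y)
print("recovered rotation error:", np.abs(R - R_true).max())
```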

4 0.62660497 112 nips-2006-Learning Nonparametric Models for Probabilistic Imitation

Author: David B. Grimes, Daniel R. Rashid, Rajesh P. Rao

Abstract: Learning by imitation represents an important mechanism for rapid acquisition of new behaviors in humans and robots. A critical requirement for learning by imitation is the ability to handle uncertainty arising from the observation process as well as the imitator’s own dynamics and interactions with the environment. In this paper, we present a new probabilistic method for inferring imitative actions that takes into account both the observations of the teacher as well as the imitator’s dynamics. Our key contribution is a nonparametric learning method which generalizes to systems with very different dynamics. Rather than relying on a known forward model of the dynamics, our approach learns a nonparametric forward model via exploration. Leveraging advances in approximate inference in graphical models, we show how the learned forward model can be directly used to plan an imitating sequence. We provide experimental results for two systems: a biomechanical model of the human arm and a 25-degrees-of-freedom humanoid robot. We demonstrate that the proposed method can be used to learn appropriate motor inputs to the model arm which imitates the desired movements. A second set of results demonstrates dynamically stable full-body imitation of a human teacher by the humanoid robot. 1

5 0.62623852 34 nips-2006-Approximate Correspondences in High Dimensions

Author: Kristen Grauman, Trevor Darrell

Abstract: Pyramid intersection is an efficient method for computing an approximate partial matching between two sets of feature vectors. We introduce a novel pyramid embedding based on a hierarchy of non-uniformly shaped bins that takes advantage of the underlying structure of the feature space and remains accurate even for sets with high-dimensional feature vectors. The matching similarity is computed in linear time and forms a Mercer kernel. Whereas previous matching approximation algorithms suffer from distortion factors that increase linearly with the feature dimension, we demonstrate that our approach can maintain constant accuracy even as the feature dimension increases. When used as a kernel in a discriminative classifier, our approach achieves improved object recognition results over a state-of-the-art set kernel. 1
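For orientation, the sketch below scores two point sets with the uniform-bin pyramid intersection that this paper improves on: histogram intersections at progressively coarser grids, with new matches found at level i discounted by $1/2^i$. The non-uniform, data-dependent bins that constitute the paper's actual contribution are not implemented here; this is only an illustration of the matching score.

```python
import numpy as np

# Uniform-bin pyramid match over point sets in [0, 1)^d. At each level the
# grid side halves; the increase in histogram intersection counts new,
# coarser matches, weighted by 1/2^i.

def pyramid_match(X, Y, levels=4):
    score, prev = 0.0, 0.0
    for i in range(levels):
        side = 2 ** (levels - 1 - i)  # finest grid first
        hist = lambda P: np.bincount(
            np.ravel_multi_index(np.floor(P * side).astype(int).T,
                                 (side,) * P.shape[1]),
            minlength=side ** P.shape[1])
        inter = np.minimum(hist(X), hist(Y)).sum()
        score += (inter - prev) / 2 ** i  # new matches at this level
        prev = inter
    return score

rng = np.random.default_rng(0)
X, Y = rng.uniform(size=(40, 2)), rng.uniform(size=(40, 2))
print("pyramid match score:", pyramid_match(X, Y))
```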

6 0.62512594 119 nips-2006-Learning to Rank with Nonsmooth Cost Functions

7 0.62367696 118 nips-2006-Learning to Model Spatial Dependency: Semi-Supervised Discriminative Random Fields

8 0.62282383 201 nips-2006-Using Combinatorial Optimization within Max-Product Belief Propagation

9 0.62252158 97 nips-2006-Inducing Metric Violations in Human Similarity Judgements

10 0.62133127 184 nips-2006-Stratification Learning: Detecting Mixed Density and Dimensionality in High Dimensional Point Clouds

11 0.62126106 158 nips-2006-PG-means: learning the number of clusters in data

12 0.61853778 80 nips-2006-Fundamental Limitations of Spectral Clustering

13 0.61848509 134 nips-2006-Modeling Human Motion Using Binary Latent Variables

14 0.61802787 32 nips-2006-Analysis of Empirical Bayesian Methods for Neuroelectromagnetic Source Localization

15 0.61779499 19 nips-2006-Accelerated Variational Dirichlet Process Mixtures

16 0.61771321 51 nips-2006-Clustering Under Prior Knowledge with Application to Image Segmentation

17 0.6166659 174 nips-2006-Similarity by Composition

18 0.61622447 167 nips-2006-Recursive ICA

19 0.6151439 87 nips-2006-Graph Laplacian Regularization for Large-Scale Semidefinite Programming

20 0.61407447 3 nips-2006-A Complexity-Distortion Approach to Joint Pattern Alignment