nips nips2011 nips2011-17 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Stefano Ermon, Carla P. Gomes, Ashish Sabharwal, Bart Selman
Abstract: We propose a novel Adaptive Markov Chain Monte Carlo algorithm to compute the partition function. In particular, we show how to accelerate a flat histogram sampling technique by significantly reducing the number of “null moves” in the chain, while maintaining asymptotic convergence properties. Our experiments show that our method converges quickly to highly accurate solutions on a range of benchmark instances, outperforming other state-of-the-art methods such as IJGP, TRW, and Gibbs sampling both in run-time and accuracy. We also show how obtaining a so-called density of states distribution allows for efficient weight learning in Markov Logic theories. 1
Reference: text
sentIndex sentText sentNum sentScore
1 Abstract We propose a novel Adaptive Markov Chain Monte Carlo algorithm to compute the partition function. [sent-14, score-0.208]
2 We also show how obtaining a so-called density of states distribution allows for efficient weight learning in Markov Logic theories. [sent-17, score-0.301]
3 1 Introduction We propose a novel and general method to approximate the partition function of intricate probability distributions defined over combinatorial spaces. [sent-18, score-0.281]
4 Computing the partition function is a notoriously hard computational problem. [sent-19, score-0.264]
5 The partition function for planar graphs with binary variables and no external field can also be computed in polynomial time [2]. [sent-22, score-0.296]
6 We will consider an adaptive MCMC sampling strategy, inspired by the Wang-Landau method [3], which is a so-called flat histogram sampling strategy from statistical physics. [sent-23, score-0.184]
7 We propose two key improvements to the Wang-Landau method, namely energy saturation and a focused-random walk component, leading to a new and more efficient algorithm called FocusedFlatSAT. [sent-25, score-0.55]
8 Energy saturation allows the chain to visit fewer energy levels, and the random walk style moves reduce the number of “null moves” in the Markov chain. [sent-26, score-0.802]
9 We demonstrate the effectiveness of our approach by a comparison with state-of-the-art methods that approximate or bound the partition function, such as Tree Reweighted Belief Propagation [4], IJGP-SampleSearch [5], and Gibbs sampling [6]. [sent-28, score-0.251]
10 The density of states serves as a rich description of the underlying probabilistic model. [sent-30, score-0.227]
11 Once computed, it can be used to efficiently evaluate the partition function for all parameter settings without the need for further inference steps, in stark contrast with competing methods for partition function computation. (Footnote: Supported by NSF Expeditions in Computing award for Computational Sustainability, grant 0832782.) [sent-31, score-0.208] [sent-32, score-0.208]
13 For instance, in statistical physics applications, we can use it to evaluate the partition function Z(T ) for all values of the temperature T . [sent-33, score-0.208]
14 Similarly, given the weights w1, ..., wK of the K first order formulas of a Markov Logic theory, we can parameterize the partition function as Z(w1, ..., wK). [sent-38, score-0.235]
15 2 Probabilistic model and the partition function. We focus on intricate probability distributions defined over a set of configurations, i.e., the set {0, 1}N of possible assignments to N Boolean variables. [sent-43, score-0.237]
16 The distribution is defined through a set of constraints, which can be either hard or soft, with the i-th soft constraint Ci being associated with a weight wi. [sent-50, score-0.288]
17 The probability Pw(x) of x is defined as 0 if x violates any hard constraint, and otherwise as Pw(x) = (1/Z(w)) exp( − Σ_{Ci ∈ Csoft} wi χi(x) )  (1), where Csoft is the set of soft constraints and χi(x) indicates whether x violates Ci. [sent-52, score-0.198]
18 The partition function, Z(w), is simply the normalization constant for this probability distribution, and is given by Z(w) = Σ_{x ∈ Xhard} exp( − Σ_{Ci ∈ Csoft} wi χi(x) )  (2), where Xhard ⊆ {0, 1}N is the set of configurations satisfying all hard constraints. [sent-53, score-0.301]
19 Note that as wi → ∞, the soft constraint Ci effectively becomes a hard constraint. [sent-54, score-0.156]
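To make Equations (1) and (2) concrete, here is a minimal brute-force sketch; the toy constraints and weights are hypothetical, this is not the authors' code, and explicit enumeration is only feasible for very small N:

```python
import itertools
import math

# Hypothetical toy encoding: a constraint is a function over a configuration x
# (a tuple of 0/1 values). chi(x) = 1 if the soft constraint is violated by x.
N = 3
hard = [lambda x: x[0] or x[1]]                 # hard constraint: x1 OR x2 must hold
soft = [(2.0, lambda x: 1 - x[2]),              # weight 2.0, violated if x3 = 0
        (0.5, lambda x: x[0] != x[1])]          # weight 0.5, violated if x1 != x2

def weight(x):
    """Unnormalized weight of configuration x (0 if any hard constraint is violated)."""
    if any(not c(x) for c in hard):
        return 0.0
    return math.exp(-sum(w * chi(x) for w, chi in soft))

# Z(w): sum of weights over all configurations satisfying the hard constraints (Eq. 2).
Z = sum(weight(x) for x in itertools.product((0, 1), repeat=N))

# P_w(x) for one particular configuration (Eq. 1).
x = (1, 0, 1)
print(Z, weight(x) / Z)
```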
20 The partition function is a very important quantity but computing it is a well-known computational challenge, which we propose to address by employing the “density of states” method to be discussed shortly. [sent-57, score-0.236]
21 We will compare our approach against several state-of-the-art methods available for computing the partition function or obtaining bounds on it. [sent-58, score-0.265]
22 Wainwright et al. [4], for example, proposed a variational method known as tree re-weighting (TRW) to obtain bounds on the partition function of graphical models. [sent-60, score-0.264]
23 Unlike standard Belief Propagation schemes which are based on Bethe free energies [10], the TRW approach uses a tree-reweighted (TRW) free energy which consists of a linear combination of free energies defined on spanning trees of the model. [sent-61, score-0.437]
24 Using convexity arguments it is then possible to obtain upper bounds on various quantities, such as the partition function. [sent-62, score-0.237]
25 IJGP-SampleSearch [5] computes the probability of evidence (i.e., partition function computation with a subset of “evidence” variables fixed) for general graphical models. [sent-65, score-0.235]
26 An alternative approach is to use Gibbs sampling to estimate the partition function: sample averages are used to estimate a sequence of multipliers that correspond to the ratios of the partition function evaluated at different weight levels [6]. [sent-67, score-0.584]
27 Lastly, the partition function for planar graphs where all variables are binary and have only pairwise interactions (i.e., planar Ising models with no external field) can be computed in polynomial time [2]. [sent-68, score-0.296]
28 Although we are interested in algorithms for the general (intractable) case, we used the software associated with this approach to obtain the ground truth for planar graphs and evaluate the accuracy of the estimates obtained by other methods. [sent-71, score-0.209]
29 3 Density of states. Our approach for computing the partition function is based on solving the density of states problem. [sent-72, score-0.566]
30 Given a combinatorial space such as the one defined earlier and an energy function E : {0, 1}N → R, the density of states (DOS) n is a function n : range(E) → N that maps energy levels to the number of configurations with that energy, i.e., n(E) = |{σ : E(σ) = E}|. [sent-73, score-0.996]
31 In our context, we are interested in computing the number of configurations that satisfy certain properties that are specified using an appropriate energy function. [sent-76, score-0.365]
32 For instance, we might define the energy E(σ) of a configuration σ to be the number of hard constraints that are violated by σ. [sent-77, score-0.52]
33 Given the density of states, i.e., the number of configurations at each possible energy level, it is straightforward to evaluate the partition function Z(w) for any weight vector w by summing up terms of the form n(i) exp(−E(i)), where E(i) denotes the energy of every configuration in energy bucket i. [sent-81, score-0.956]
34 This is the method we use in this work for estimating the partition function. [sent-82, score-0.208]
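As an illustration of this step, the sketch below reads the partition function off a precomputed density-of-states table; the table values are made up, and it assumes the energy of a configuration is its weighted sum of violated soft constraints, so every configuration in bucket i contributes exp(−E(i)):

```python
import math

# Hypothetical density-of-states table: energy level E(i) -> number of configurations
# n(i) with that energy (only configurations satisfying the hard constraints are counted).
dos = {0.0: 12, 1.5: 40, 3.0: 96, 4.5: 110}

# Z = sum_i n(i) * exp(-E(i)): each configuration in bucket i has weight exp(-E(i)).
Z = sum(n * math.exp(-E) for E, n in dos.items())
print(Z)

# The same table gives Z at any temperature T without re-running inference,
# as mentioned for the statistical physics setting (sentence 13 above).
def Z_at_temperature(T):
    return sum(n * math.exp(-E / T) for E, n in dos.items())

print(Z_at_temperature(2.0))
```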
35 More complex energy functions may be defined for other related tasks, such as weight learning, i.e., learning the weights of the soft constraints from training data. [sent-83, score-0.411]
36 Here we can define the energy E(σ) to be w · ℓ(σ), where ℓ(σ) = (ℓ1, ..., ℓM) and ℓi gives the number of constraints of weight wi violated by σ. [sent-86, score-0.337] [sent-89, score-0.238]
38 Our focus in the rest of the paper will thus be on computing the density of states efficiently. [sent-90, score-0.255]
39 3.1 The MCMCFlatSAT algorithm. MCMCFlatSAT [11] is an Adaptive Markov Chain Monte Carlo (adaptive MCMC) method for computing the density of states for combinatorial problems, inspired by the Wang-Landau algorithm [3] from statistical physics. [sent-92, score-0.299]
40 The algorithm is based on the flat histogram idea and works by trying to construct a reversible Markov Chain on the space {0, 1}N of all configurations such that the steady state probability of a configuration σ is inversely proportional to the density of states n(E(σ)). [sent-95, score-0.392]
41 In this way, the stationary distribution is such that all the energy levels are visited equally often (i.e., when we count the visits to each energy level, we see a flat visit histogram). [sent-96, score-0.449] [sent-98, score-0.415]
43 This means that the Markov Chain will reach a stationary probability distribution P (regardless of the initial state) such that the probability of a configuration σ with energy E = E(σ) is inversely proportional to the number of configurations with energy E. [sent-102, score-0.73]
44 This leads to an asymptotically flat histogram of the energies of the states visited, because P(E) = Σ_{σ : E(σ)=E} P(σ) ∝ n(E) · (1/n(E)) = 1, i.e., the same for every energy level E. [sent-103, score-0.284]
45 Since the density of states is not known a priori, and computing it is precisely the goal of the algorithm, it is not possible to directly construct a random walk with transition probability (3). [sent-106, score-0.336]
46 However, it is possible to start with an initial guess g(·) for n(·) and keep updating this estimate g(·) in a systematic way, so as to produce a flat energy histogram and simultaneously make the estimate g(E) converge to the true value n(E) for every energy level E. [sent-107, score-0.812]
47 Algorithm 1 (MCMCFlatSAT algorithm to compute the density of states) begins by setting the initial guess g(E) = 1 for all energy levels E. [sent-111, score-0.227]
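The flat-histogram idea can be sketched as follows; this is a generic Wang-Landau-style loop written from the description above, not the authors' implementation, and the clause encoding, flatness test, and modification-factor schedule are illustrative assumptions:

```python
import math
import random

def energy(x, clauses):
    """Number of clauses violated by configuration x (each clause is a set of signed literals)."""
    return sum(all(x[abs(l) - 1] != (l > 0) for l in c) for c in clauses)

def wang_landau(n_vars, clauses, max_energy, sweeps=200000, f_init=math.e, f_final=1.0001):
    x = [random.randint(0, 1) for _ in range(n_vars)]
    E = min(energy(x, clauses), max_energy)          # energy saturation at K = max_energy
    log_g = [0.0] * (max_energy + 1)                 # log of the running estimate g(E)
    hist = [0] * (max_energy + 1)                    # visit histogram
    log_f = math.log(f_init)
    while log_f > math.log(f_final):
        for _ in range(sweeps):
            i = random.randrange(n_vars)             # propose flipping a uniformly random variable
            x[i] ^= 1
            E_new = min(energy(x, clauses), max_energy)
            # accept with probability min(1, g(E)/g(E_new)) -> flat histogram over energy levels
            if math.log(random.random()) < log_g[E] - log_g[E_new]:
                E = E_new
            else:
                x[i] ^= 1                            # reject: undo the flip
            log_g[E] += log_f                        # update the estimate for the current level
            hist[E] += 1
        # crude flatness check over the non-empty buckets; refine f when roughly flat
        if min(h for h in hist if h) > 0.9 * (sum(hist) / sum(1 for h in hist if h)):
            log_f /= 2.0
            hist = [0] * (max_energy + 1)
    return log_g                                     # approximates log n(E) up to an additive constant
```

The returned values approximate log n(E) only up to an additive constant, which can be removed by normalizing against the known total number of configurations.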
48 The time needed for each iteration of MCMCFlatSAT to converge is significantly affected by the number of different non-empty energy levels (buckets). [sent-122, score-0.388]
49 When the weights are positive (i.e., there is an incentive to satisfy the constraints), and as an effect of the exponential discounting in Equation (1), configurations that violate a large number of constraints have a negligible contribution to the sum defining the partition function Z. [sent-125, score-0.266]
50 We therefore define a new saturated energy function E′(σ) = min{E(σ), K}, where K is a user-defined parameter. [sent-126, score-0.373]
51 For the positive weights case, the partition function Z′ associated with the saturated energy function is a guaranteed upper bound on the original Z, for any K. [sent-127, score-0.581]
52 When all constraints are hard, Z′ = Z for any value K ≥ 1 because only the first energy bucket matters. [sent-128, score-0.395]
53 In general, when soft constraints are present, the bound gets tighter as K increases, and we can obtain theoretical worst-case error bounds when K is chosen to be a percentile of the energy distribution. [sent-129, score-0.487]
54 In our experiments, we set K to be the average number of constraints violated by a random configuration, and we found that the error introduced by the saturation is negligible compared to other inherent approximations in density of states estimation. [sent-132, score-0.486]
55 Intuitively, this is because the states where the probability is concentrated turn out to typically have a much lower energy than K, and thus an exponentially larger contribution to Z. [sent-133, score-0.44]
56 Furthermore, energy saturation preserves the connectivity of the chain. [sent-134, score-0.469]
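As a small illustration of the choice of K used in our experiments, the following sketch estimates K as the average energy of random configurations; the sample size is an arbitrary assumption, and the energy function is passed in as a callable rather than fixed to a particular constraint encoding:

```python
import random

def choose_saturation_level(energy, n_vars, samples=1000):
    """Estimate K as the average energy (number of violated constraints) of a random configuration."""
    total = 0
    for _ in range(samples):
        x = [random.randint(0, 1) for _ in range(n_vars)]
        total += energy(x)
    return max(1, round(total / samples))

def saturated_energy(energy, x, K):
    """E'(x) = min(E(x), K): every configuration with energy above K falls in one shared bucket."""
    return min(energy(x), K)
```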
57 In the Wang-Landau algorithm, proposed configurations are then rejected with a probability that depends on the density of states of the respective energy levels. [sent-138, score-0.62]
58 This focused random-walk component is inspired by local search SAT solvers [12] and is especially critical for the class of highly combinatorial energy functions we consider in this work. [sent-141, score-0.381]
59 We note that if the acceptance probability is taken to be min{1, (n(E(σ)) T_{σ′→σ}) / (n(E(σ′)) T_{σ→σ′})}, i.e., the Metropolis-Hastings correction for a non-symmetric proposal distribution T, the chain retains the desired stationary distribution. [sent-142, score-0.216]
60 [Figure 1: Histograms depicting the number of proposed moves accepted and rejected from each energy level; the left panel shows MCMCFlatSAT.] [sent-154, score-0.188]
61 In order to understand the motivation behind the new proposal distribution, consider the move acceptance/rejection histogram shown in the left panel of Figure 1. [sent-159, score-0.221]
62 For the instance under consideration, MCMCFlatSAT converged to a flat histogram after having visited each of the 385 energy levels (on x-axis) roughly 2. [sent-160, score-0.552]
63 Each colored region shows the cumulative number of moves (on y-axis) accepted or rejected from each energy level (on x-axis) to another configuration with a higher, equal, or lower energy level, resp. [sent-162, score-0.878]
64 This gives six possible move types, and the histogram shows how often each is taken at any energy level. [sent-163, score-0.483]
65 Most importantly, notice that at low energy levels, a vast majority of the moves were proposed to a higher energy level and were rejected by the algorithm (shown as the dominating purple region). [sent-164, score-0.956]
66 This is an indirect consequence of the fact that in such instances, in the low energy regime, the density of states increases drastically as the energy level increases, i.e., n(E) grows very rapidly with E. [sent-165, score-0.941]
67 As a result, most of the proposed moves are to higher energy levels and are in turn rejected by the algorithm in the move acceptance/rejection step discussed above. [sent-168, score-0.6]
68 In order to address this issue, we propose to modify the proposal distribution in a way that increases the chance of proposing moves to the same or lower energy levels, despite the fact that there are relatively few such moves. [sent-169, score-0.52]
69 Inspired by local search SAT solvers, we enhance MCMCFlatSAT with a focused random walk component that gives preference to selecting variables to flip from violated constraints (if any), thereby introducing an indirect bias towards lower energy states. [sent-170, score-0.545]
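A sketch of the focused proposal just described follows; the mixing probability p and the clause encoding (signed integer literals) are illustrative, and the recipe mirrors WalkSAT-style solvers: with probability p flip a variable from some violated clause, otherwise flip a uniformly random variable:

```python
import random

def propose_flip(x, clauses, p=0.5):
    """Return the index of the variable to flip under the focused random-walk proposal."""
    violated = [c for c in clauses
                if all(x[abs(l) - 1] != (l > 0) for l in c)]      # clauses falsified by x
    if violated and random.random() < p:
        clause = random.choice(violated)                          # focused move: pick a violated clause
        return abs(random.choice(tuple(clause))) - 1              # ...and one of its variables
    return random.randrange(len(x))                               # otherwise: a uniformly random variable
```

Because this proposal is no longer symmetric, the acceptance step must use the Metropolis-Hastings ratio from sentence 59 rather than the plain g(E)/g(E′) ratio.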
70 2M visits per energy level for the method to converge. [sent-182, score-0.408]
71 Moreover, the number of rejected moves (shown in purple and green) in low energy states is significantly smaller than in the dominating purple region of the left panel. [sent-183, score-0.731]
72 As we see, incorporating energy saturation reduces the time to convergence (while achieving the same level of accuracy), and using focused random walk moves further decreases the convergence time, especially as n increases. [sent-186, score-0.698]
73 Experimental evaluation. We compare FocusedFlatSAT against several state-of-the-art methods for computing an estimate of, or bound on, the partition function. [sent-218, score-0.236]
74 An evaluation such as this is inherently challenging as the ground truth is very hard to obtain, and computational bounds can be orders of magnitude off from the truth, making a comparison of estimates not very meaningful. [sent-219, score-0.265]
75 We therefore propose to evaluate the methods on either small instances whose ground truth can be evaluated by “brute force,” or larger instances whose ground truth (or bounds on it) can be computed analytically or through other tools such as efficient model counters. [sent-220, score-0.347]
76 Efficient methods for handling instances of small treewidth are also well known; here we push the boundaries to instances of relatively higher treewidth. [sent-222, score-0.226]
77 For partition function evaluation, we compare against the tree re-weighting (TRW) variational method for upper bounds, the iterated join-graph propagation (IJGP), and Gibbs sampling; see Section 2 for a very brief discussion of these approaches. [sent-223, score-0.244]
78 Unless otherwise specified, the energy function used is always the number of violated constraints, and we use a 50% ratio of random moves (p = 0.5). [sent-225, score-0.514]
79 When all the constraints are hard, the problem of computing the partition function is equivalent to standard model counting. [sent-233, score-0.236]
80 The partition function for these planar models is computable with a specialized polynomial time algorithm, as long as there is no external magnetic field [2]. [sent-242, score-0.296]
81 In Figure 3, we compare the true value of the partition function Z* with the estimate obtained using FocusedFlatSAT and with the upper bound obtained by TRW. (Footnote 2: Benchmark instances available online at http://www. [...] edu/∼ermonste.) [sent-243, score-0.285]
82 [Figure 3: Error in log10(Z), i.e., log10(Z) − log10(Z*), plotted against the weight w for FocusedFlatSAT and TRW, in two panels.] [sent-246, score-0.148]
83 For the ferromagnetic model, the bounds obtained by TRW, on the other hand, are tight only when the weights are sufficiently high, when essentially only the two ground states of energy zero matter. [sent-309, score-0.614]
84 On spin glasses, where computing ground states is itself an intractable problem, TRW is unsurprisingly inaccurate even in the high weights regime. [sent-310, score-0.247]
85 The consistent accuracy of FocusedFlatSAT here is a strong indication that the method is accurately computing the density of most of the underlying states. [sent-311, score-0.191]
86 This is because, as the weight w changes, the value of the partition function is dominated by the contributions of a different set of states. [sent-312, score-0.282]
87 As another test case, we use formulas from a previously used model counting benchmark involving n × n Latin Square completion [11], and add a weight w to each constraint. [sent-321, score-0.167]
88 Since these instances have high treewidth, are non-planar, and are beyond the reach of direct enumeration, we do not have ground truth for this benchmark. [sent-322, score-0.159]
89 Nonetheless, on the un-simplified ls8-normalized with weight 6, both IJGP and Gibbs sampling underestimate by over 12 orders of magnitude. [sent-327, score-0.185]
90 Suppose the set of soft constraints Csoft is composed of M disjoint sets of constraints S1, ..., SM, where all the constraints c ∈ Si have the same weight wi that we wish to learn from data (for instance, these constraints can all be groundings of the same first order formula in Markov Logic [8]). [sent-356, score-0.406]
91 The key observation is that the partition function can be written as Z(w) = Σ_{ℓ} n(ℓ) exp( − Σ_{i=1}^{M} wi ℓi ), where ℓ = (ℓ1, ..., ℓM) and ℓi is the number of constraints in Si violated by a configuration. [sent-362, score-0.208]
92 The generalized density of states n(ℓ1, ..., ℓM) is therefore precisely the density of states required to compute Z(w) for all values of w, without additional inference steps. [sent-378, score-0.227]
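A sketch (with made-up table values) of how the generalized density of states yields Z(w) and the training log-likelihood for any candidate weight vector, with no further inference over configurations:

```python
import math

# Hypothetical generalized density of states: a tuple (l_1, ..., l_M) of violation
# counts per constraint group -> number of configurations with those counts.
dos = {(0, 0): 5, (0, 1): 20, (1, 0): 14, (1, 1): 60, (2, 1): 33}

def log_Z(w):
    """log Z(w) = log sum_l n(l) exp(-w . l), evaluated directly from the table."""
    terms = [math.log(n) - sum(wi * li for wi, li in zip(w, l)) for l, n in dos.items()]
    m = max(terms)
    return m + math.log(sum(math.exp(t - m) for t in terms))      # log-sum-exp for stability

def log_likelihood(w, training_counts):
    """Log-likelihood of a training configuration with the given violation counts."""
    return -sum(wi * li for wi, li in zip(w, training_counts)) - log_Z(w)

# Any weight vector can now be scored (e.g., inside a gradient or grid search) at
# negligible cost, since no new sampling or inference over configurations is needed.
print(log_likelihood([1.0, 0.5], (0, 1)))
```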
93 We evaluate this learning method on relatively simple instances on which commonly used software such as Alchemy can be a few orders of magnitude off from the optimal likelihood of the training data. [sent-389, score-0.205]
94 The Optimal Likelihood value is obtained using a partition function computed either by direct enumeration or using analytic results for the synthetic instances. [sent-391, score-0.208]
95 The instance ThreeChain(K) is a grounding of the following first order formulas: ∀x P(x) ⇒ Q(x), ∀x Q(x) ⇒ R(x), ∀x R(x) ⇒ P(x), while FourChain(K) is a similar chain of 4 implications. [sent-392, score-0.244]
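For illustration, a minimal sketch of grounding such a chain over a domain of K constants; the clause encoding is an assumption, and each grounding of a first order formula becomes one soft constraint carrying that formula's weight:

```python
def ground_three_chain(K):
    """Ground P(x)=>Q(x), Q(x)=>R(x), R(x)=>P(x) over constants 1..K into clauses.

    Each returned (formula_index, clause) pair is one grounding, where a clause is a
    list of (atom, is_positive) literals.
    """
    clauses = []
    for x in range(1, K + 1):
        p, q, r = f"P({x})", f"Q({x})", f"R({x})"
        clauses.append((0, [(p, False), (q, True)]))   # P(x) => Q(x)  ==  not P(x) or Q(x)
        clauses.append((1, [(q, False), (r, True)]))   # Q(x) => R(x)
        clauses.append((2, [(r, False), (p, True)]))   # R(x) => P(x)
    return clauses

# All groundings of formula i share the single weight w_i being learned, which is
# exactly the grouping S_i used in the weight-learning discussion above.
print(len(ground_three_chain(4)))   # 12 ground clauses for K = 4
```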
96 The instance SocialNetwork(K) (from the Alchemy Tutorial) is a grounding of a set of first order formulas over the constants x, y ∈ {a1, a2, ..., aK}. [sent-397, score-0.147]
97 6 Conclusion We introduced FocusedFlatSAT, a Markov Chain Monte Carlo technique based on the flat histogram method with a random walk style component to estimate the partition function from the density of states. [sent-407, score-0.511]
98 Moreover, from the density of states we can directly obtain the partition function Z(w) as a function of the model parameters w. [sent-410, score-0.435]
99 Reference [3]: F. Wang and D. P. Landau. Efficient, multiple-range random walk algorithm to calculate the density of states. [sent-430, score-0.205]
100 Reference [4]: M. J. Wainwright, T. S. Jaakkola, and A. S. Willsky. A new class of upper bounds on the log partition function. [sent-439, score-0.237]
wordName wordTfidf (topN-words)
[('focusedflatsat', 0.551), ('energy', 0.337), ('mcmcflatsat', 0.265), ('trw', 0.224), ('partition', 0.208), ('saturation', 0.132), ('alchemy', 0.125), ('density', 0.124), ('ijgp', 0.122), ('pw', 0.117), ('guration', 0.114), ('gurations', 0.109), ('moves', 0.108), ('states', 0.103), ('ferromagnetic', 0.102), ('logic', 0.101), ('histogram', 0.098), ('chain', 0.097), ('planar', 0.088), ('markov', 0.083), ('csoft', 0.082), ('walk', 0.081), ('instances', 0.077), ('proposal', 0.075), ('weight', 0.074), ('treewidth', 0.072), ('violated', 0.069), ('orders', 0.068), ('con', 0.067), ('formulas', 0.065), ('soft', 0.063), ('gibbs', 0.062), ('fourchain', 0.061), ('ec', 0.061), ('constraints', 0.058), ('ising', 0.058), ('rejected', 0.056), ('dh', 0.056), ('hard', 0.056), ('smokes', 0.054), ('levels', 0.051), ('energies', 0.05), ('grounding', 0.049), ('purple', 0.049), ('move', 0.048), ('visit', 0.047), ('gomes', 0.046), ('spin', 0.044), ('combinatorial', 0.044), ('ground', 0.043), ('sampling', 0.043), ('violates', 0.042), ('ermon', 0.041), ('hchain', 0.041), ('rejections', 0.041), ('sabharwal', 0.041), ('samplecount', 0.041), ('selman', 0.041), ('socialnetwork', 0.041), ('threechain', 0.041), ('xhard', 0.041), ('level', 0.04), ('steady', 0.039), ('truth', 0.039), ('accuracy', 0.039), ('wi', 0.037), ('propagation', 0.036), ('maxw', 0.036), ('counters', 0.036), ('gogate', 0.036), ('saturated', 0.036), ('ci', 0.035), ('boolean', 0.035), ('instance', 0.033), ('visited', 0.033), ('glasses', 0.033), ('sat', 0.033), ('ipping', 0.031), ('ipped', 0.031), ('poon', 0.031), ('visits', 0.031), ('likelihood', 0.03), ('wij', 0.03), ('magnitude', 0.03), ('ithaca', 0.029), ('dominating', 0.029), ('intricate', 0.029), ('inaccurate', 0.029), ('bounds', 0.029), ('counting', 0.028), ('computing', 0.028), ('logical', 0.028), ('inversely', 0.028), ('irreducible', 0.028), ('stationary', 0.028), ('runtime', 0.028), ('carlo', 0.028), ('graphical', 0.027), ('monte', 0.027), ('parameterize', 0.027)]
simIndex simValue paperId paperTitle
same-paper 1 0.99999964 17 nips-2011-Accelerated Adaptive Markov Chain for Partition Function Computation
Author: Stefano Ermon, Carla P. Gomes, Ashish Sabharwal, Bart Selman
Abstract: We propose a novel Adaptive Markov Chain Monte Carlo algorithm to compute the partition function. In particular, we show how to accelerate a flat histogram sampling technique by significantly reducing the number of “null moves” in the chain, while maintaining asymptotic convergence properties. Our experiments show that our method converges quickly to highly accurate solutions on a range of benchmark instances, outperforming other state-of-the-art methods such as IJGP, TRW, and Gibbs sampling both in run-time and accuracy. We also show how obtaining a so-called density of states distribution allows for efficient weight learning in Markov Logic theories. 1
2 0.10280293 255 nips-2011-Simultaneous Sampling and Multi-Structure Fitting with Adaptive Reversible Jump MCMC
Author: Trung T. Pham, Tat-jun Chin, Jin Yu, David Suter
Abstract: Multi-structure model fitting has traditionally taken a two-stage approach: First, sample a (large) number of model hypotheses, then select the subset of hypotheses that optimise a joint fitting and model selection criterion. This disjoint two-stage approach is arguably suboptimal and inefficient — if the random sampling did not retrieve a good set of hypotheses, the optimised outcome will not represent a good fit. To overcome this weakness we propose a new multi-structure fitting approach based on Reversible Jump MCMC. Instrumental in raising the effectiveness of our method is an adaptive hypothesis generator, whose proposal distribution is learned incrementally and online. We prove that this adaptive proposal satisfies the diminishing adaptation property crucial for ensuring ergodicity in MCMC. Our method effectively conducts hypothesis sampling and optimisation simultaneously, and yields superior computational efficiency over previous two-stage methods. 1
3 0.098668583 158 nips-2011-Learning unbelievable probabilities
Author: Xaq Pitkow, Yashar Ahmadian, Ken D. Miller
Abstract: Loopy belief propagation performs approximate inference on graphical models with loops. One might hope to compensate for the approximation by adjusting model parameters. Learning algorithms for this purpose have been explored previously, and the claim has been made that every set of locally consistent marginals can arise from belief propagation run on a graphical model. On the contrary, here we show that many probability distributions have marginals that cannot be reached by belief propagation using any set of model parameters or any learning algorithm. We call such marginals ‘unbelievable.’ This problem occurs whenever the Hessian of the Bethe free energy is not positive-definite at the target marginals. All learning algorithms for belief propagation necessarily fail in these cases, producing beliefs or sets of beliefs that may even be worse than the pre-learning approximation. We then show that averaging inaccurate beliefs, each obtained from belief propagation using model parameters perturbed about some learned mean values, can achieve the unbelievable marginals. 1
4 0.08683449 55 nips-2011-Collective Graphical Models
Author: Daniel R. Sheldon, Thomas G. Dietterich
Abstract: There are many settings in which we wish to fit a model of the behavior of individuals but where our data consist only of aggregate information (counts or low-dimensional contingency tables). This paper introduces Collective Graphical Models—a framework for modeling and probabilistic inference that operates directly on the sufficient statistics of the individual model. We derive a highlyefficient Gibbs sampling algorithm for sampling from the posterior distribution of the sufficient statistics conditioned on noisy aggregate observations, prove its correctness, and demonstrate its effectiveness experimentally. 1
5 0.085776493 197 nips-2011-On Tracking The Partition Function
Author: Guillaume Desjardins, Yoshua Bengio, Aaron C. Courville
Abstract: Markov Random Fields (MRFs) have proven very powerful both as density estimators and feature extractors for classification. However, their use is often limited by an inability to estimate the partition function Z. In this paper, we exploit the gradient descent training procedure of restricted Boltzmann machines (a type of MRF) to track the log partition function during learning. Our method relies on two distinct sources of information: (1) estimating the change ∆Z incurred by each gradient update, (2) estimating the difference in Z over a small set of tempered distributions using bridge sampling. The two sources of information are then combined using an inference procedure similar to Kalman filtering. Learning MRFs through Tempered Stochastic Maximum Likelihood, we can estimate Z using no more temperatures than are required for learning. Comparing to both exact values and estimates using annealed importance sampling (AIS), we show on several datasets that our method is able to accurately track the log partition function. In contrast to AIS, our method provides this estimate at each time-step, at a computational cost similar to that required for training alone. 1
6 0.079142503 57 nips-2011-Comparative Analysis of Viterbi Training and Maximum Likelihood Estimation for HMMs
7 0.075786658 276 nips-2011-Structured sparse coding via lateral inhibition
8 0.073012874 229 nips-2011-Query-Aware MCMC
9 0.063417859 246 nips-2011-Selective Prediction of Financial Trends with Hidden Markov Models
10 0.061341725 227 nips-2011-Pylon Model for Semantic Segmentation
11 0.053626329 223 nips-2011-Probabilistic Joint Image Segmentation and Labeling
12 0.052247528 107 nips-2011-Global Solution of Fully-Observed Variational Bayesian Matrix Factorization is Column-Wise Independent
13 0.052169878 43 nips-2011-Bayesian Partitioning of Large-Scale Distance Data
14 0.052139048 69 nips-2011-Differentially Private M-Estimators
15 0.05199134 296 nips-2011-Uniqueness of Belief Propagation on Signed Graphs
16 0.051812217 92 nips-2011-Expressive Power and Approximation Errors of Restricted Boltzmann Machines
17 0.051195074 217 nips-2011-Practical Variational Inference for Neural Networks
18 0.049941633 96 nips-2011-Fast and Balanced: Efficient Label Tree Learning for Large Scale Object Recognition
19 0.048392657 228 nips-2011-Quasi-Newton Methods for Markov Chain Monte Carlo
20 0.046802253 221 nips-2011-Priors over Recurrent Continuous Time Processes
topicId topicWeight
[(0, 0.161), (1, 0.008), (2, -0.003), (3, 0.016), (4, -0.025), (5, -0.074), (6, -0.066), (7, -0.054), (8, 0.009), (9, -0.046), (10, -0.008), (11, -0.105), (12, 0.001), (13, -0.072), (14, -0.051), (15, -0.01), (16, 0.011), (17, -0.088), (18, -0.067), (19, 0.07), (20, 0.041), (21, 0.073), (22, 0.045), (23, -0.028), (24, 0.005), (25, 0.096), (26, -0.05), (27, 0.011), (28, 0.068), (29, -0.017), (30, -0.05), (31, -0.089), (32, -0.019), (33, 0.053), (34, -0.055), (35, -0.046), (36, -0.011), (37, -0.05), (38, -0.077), (39, 0.033), (40, -0.079), (41, -0.15), (42, -0.071), (43, -0.071), (44, -0.021), (45, -0.056), (46, -0.121), (47, -0.055), (48, -0.079), (49, -0.026)]
simIndex simValue paperId paperTitle
same-paper 1 0.93758565 17 nips-2011-Accelerated Adaptive Markov Chain for Partition Function Computation
Author: Stefano Ermon, Carla P. Gomes, Ashish Sabharwal, Bart Selman
Abstract: We propose a novel Adaptive Markov Chain Monte Carlo algorithm to compute the partition function. In particular, we show how to accelerate a flat histogram sampling technique by significantly reducing the number of “null moves” in the chain, while maintaining asymptotic convergence properties. Our experiments show that our method converges quickly to highly accurate solutions on a range of benchmark instances, outperforming other state-of-the-art methods such as IJGP, TRW, and Gibbs sampling both in run-time and accuracy. We also show how obtaining a so-called density of states distribution allows for efficient weight learning in Markov Logic theories. 1
2 0.72669971 255 nips-2011-Simultaneous Sampling and Multi-Structure Fitting with Adaptive Reversible Jump MCMC
Author: Trung T. Pham, Tat-jun Chin, Jin Yu, David Suter
Abstract: Multi-structure model fitting has traditionally taken a two-stage approach: First, sample a (large) number of model hypotheses, then select the subset of hypotheses that optimise a joint fitting and model selection criterion. This disjoint two-stage approach is arguably suboptimal and inefficient — if the random sampling did not retrieve a good set of hypotheses, the optimised outcome will not represent a good fit. To overcome this weakness we propose a new multi-structure fitting approach based on Reversible Jump MCMC. Instrumental in raising the effectiveness of our method is an adaptive hypothesis generator, whose proposal distribution is learned incrementally and online. We prove that this adaptive proposal satisfies the diminishing adaptation property crucial for ensuring ergodicity in MCMC. Our method effectively conducts hypothesis sampling and optimisation simultaneously, and yields superior computational efficiency over previous two-stage methods. 1
3 0.7086063 55 nips-2011-Collective Graphical Models
Author: Daniel R. Sheldon, Thomas G. Dietterich
Abstract: There are many settings in which we wish to fit a model of the behavior of individuals but where our data consist only of aggregate information (counts or low-dimensional contingency tables). This paper introduces Collective Graphical Models—a framework for modeling and probabilistic inference that operates directly on the sufficient statistics of the individual model. We derive a highlyefficient Gibbs sampling algorithm for sampling from the posterior distribution of the sufficient statistics conditioned on noisy aggregate observations, prove its correctness, and demonstrate its effectiveness experimentally. 1
4 0.63043857 201 nips-2011-On the Completeness of First-Order Knowledge Compilation for Lifted Probabilistic Inference
Author: Guy Broeck
Abstract: Probabilistic logics are receiving a lot of attention today because of their expressive power for knowledge representation and learning. However, this expressivity is detrimental to the tractability of inference, when done at the propositional level. To solve this problem, various lifted inference algorithms have been proposed that reason at the first-order level, about groups of objects as a whole. Despite the existence of various lifted inference approaches, there are currently no completeness results about these algorithms. The key contribution of this paper is that we introduce a formal definition of lifted inference that allows us to reason about the completeness of lifted inference algorithms relative to a particular class of probabilistic models. We then show how to obtain a completeness result using a first-order knowledge compilation approach for theories of formulae containing up to two logical variables. 1 Introduction and related work Probabilistic logic models build on first-order logic to capture relational structure and on graphical models to represent and reason about uncertainty [1, 2]. Due to their expressivity, these models can concisely represent large problems with many interacting random variables. While the semantics of these logics is often defined through grounding the models [3], performing inference at the propositional level is – as for first-order logic – inefficient. This has motivated the quest for lifted inference methods that exploit the structure of probabilistic logic models for efficient inference, by reasoning about groups of objects as a whole and avoiding repeated computations. The first approaches to exact lifted inference have upgraded the variable elimination algorithm to the first-order level [4, 5, 6]. More recent work is based on methods from logical inference [7, 8, 9, 10], such as knowledge compilation. While these approaches often yield dramatic improvements in runtime over propositional inference methods on specific problems, it is still largely unclear for which classes of models these lifted inference operators will be useful and for which ones they will eventually have to resort to propositional inference. One notable exception in this regard is lifted belief propagation [11], which performs exact lifted inference on any model whose factor graph representation is a tree. A first contribution of this paper is that we introduce a notion of domain lifted inference, which formally defines what lifting means, and which can be used to characterize the classes of probabilistic models to which lifted inference applies. Domain lifted inference essentially requires that probabilistic inference runs in polynomial time in the domain size of the logical variables appearing in the model. As a second contribution we show that the class of models expressed as 2-WFOMC formulae (weighted first-order model counting with up to 2 logical variables per formula) can be domain lifted using an extended first-order knowledge compilation approach [10]. The resulting approach allows for lifted inference even in the presence of (anti-) symmetric or total relations in a theory. These are extremely common and useful concepts that cannot be lifted by any of the existing first-order knowledge compilation inference rules. 1 2 Background We will use standard concepts of function-free first-order logic (FOL). An atom p(t1 , . . . , tn ) consists of a predicate p/n of arity n followed by n arguments, which are either constants or logical variables. 
An atom is ground if it does not contain any variables. A literal is an atom a or its negation ¬a. A clause is a disjunction l1 ∨ ... ∨ lk of literals. If k = 1, it is a unit clause. An expression is an atom, literal or clause. The pred(a) function maps an atom to its predicate and the vars(e) function maps an expression to its logical variables. A theory in conjunctive normal form (CNF) is a conjunction of clauses. We often represent theories by their set of clauses and clauses by their set of literals. Furthermore, we will assume that all logical variables are universally quantified. In addition, we associate a set of constraints with each clause or atom, either of the form X = t, where X is a logical variable and t is a constant or variable, or of the form X ∈ D, where D is a domain, or the negation of these constraints. These define a finite domain for each logical variable. Abusing notation, we will use constraints of the form X = t to denote a substitution of X by t. The function atom(e) maps an expression e to its atoms, now associating the constraints on e with each atom individually. To add the constraint c to an expression e, we use the notation e ∧ c. Two atoms unify if there is a substitution which makes them identical and if the conjunction of the constraints on both atoms with the substitution is satisfiable. Two expressions e1 and e2 are independent, written e1 ⊥ e2 , if no atom a1 ∈ atom(e1 ) unifies with an atom a2 ∈ atom(e2 ). ⊥ We adopt the Weighted First-Order Model Counting (WFOMC) [10] formalism to represent probabilistic logic models, building on the notion of a Herbrand interpretation. Herbrand interpretations are subsets of the Herbrand base HB (T ), which consists of all ground atoms that can be constructed with the available predicates and constant symbols in T . The atoms in a Herbrand interpretation are assumed to be true. All other atoms in HB (T ) are assumed to be false. An interpretation I satisfies a theory T , written as I |= T , if it satisfies all the clauses c ∈ T . The WFOMC problem is defined on a weighted logic theory T , which is a logic theory augmented with a positive weight function w and a negative weight function w, which assign a weight to each predicate. The WFOMC problem involves computing wmc(T, w, w) = w(pred(a)) I|=T a∈I 3 3.1 w(pred(a)). (1) a∈HB(T )\I First-order knowledge compilation for lifted probabilistic inference Lifted probabilistic inference A first-order probabilistic model defines a probability distribution P over the set of Herbrand interpretations H. Probabilistic inference in these models is concerned with computing the posterior probability P(q|e) of query q given evidence e, where q and e are logical expressions in general: P(q|e) = h∈H,h|=q∧e P(h) h∈H,h|=e P(h) (2) We propose one notion of lifted inference for first-order probabilistic models, defined in terms of the computational complexity of inference w.r.t. the domains of logical variables. It is clear that other notions of lifted inference are conceivable, especially in the case of approximate inference. Definition 1 (Domain Lifted Probabilistic Inference). A probabilistic inference procedure is domain lifted for a model m, query q and evidence e iff the inference procedure runs in polynomial time in |D1 |, . . . , |Dk | with Di the domain of the logical variable vi ∈ vars(m, q, e). 
Domain lifted inference does not prohibit the algorithm to be exponential in the size of the vocabulary, that is, the number of predicates, arguments and constants, of the probabilistic model, query and evidence. In fact, the definition allows inference to be exponential in the number of constants which occur in arguments of atoms in the theory, query or evidence, as long as it is polynomial in the cardinality of the logical variable domains. This definition of lifted inference stresses the ability to efficiently deal with the domains of the logical variables that arise, regardless of their size, and formalizes what seems to be generally accepted in the lifted inference literature. 2 A class of probabilistic models is a set of probabilistic models expressed in a particular formalism. As examples, consider Markov logic networks (MLN) [12] or parfactors [4], or the weighted FOL theories for WFOMC that we introduced above, when the weights are normalized. Definition 2 (Completeness). Restricting queries to atoms and evidence to a conjunction of literals, a procedure that is domain lifted for all probabilistic models m in a class of models M and for all queries q and evidence e, is called complete for M . 3.2 First-order knowledge compilation First-order knowledge compilation is an approach to lifted probabilistic inference consisting of the following three steps (see Van den Broeck et al. [10] for details): 1. Convert the probabilistic logical model to a weighted CNF. Converting MLNs or parfactors requires adding new atoms to the theory that represent the (truth) value of each factor or formula. set-disjunction 2 friends(X, Y ) ∧ smokes(X) ⇒ smokes(Y ) Smokers ⊆ People decomposable conjunction unit clause leaf (a) MLN Model ∧ smokes(X), X ∈ Smokers smokes(Y ) ∨ ¬ smokes(X) ∨¬ friends(X, Y ) ∨ ¬ f(X, Y ) friends(X, Y ) ∨ f(X, Y ) smokes(X) ∨ f(X, Y ) ¬ smokes(Y ) ∨ f(X, Y ). ∧ f(X, Y ), Y ∈ Smokers ∧ ¬ smokes(Y ), Y ∈ Smokers / ∧ f(X, Y ), X ∈ Smokers, Y ∈ Smokers / / x ∈ Smokers (b) CNF Theory deterministic disjunction Predicate friends smokes f w 1 1 e2 w 1 1 1 y ∈ Smokers / ∨ set-conjunction ∧ f(x, y) (c) Weight Functions ¬ friends(x, y) ∧ friends(x, y) ¬ f(x, y) (d) First-Order d-DNNF Circuit Figure 1: Friends-smokers example (taken from [10]) Example 1. The MLN in Figure 1a assigns a weight to a formula in FOL. Figure 1b represents the same model as a weighted CNF, introducing a new atom f(X, Y ) to encode the truth value of the MLN formula. The probabilistic information is captured by the weight functions in Figure 1c. 2. Compile the logical theory into a First-Order d-DNNF (FO d-DNNF) circuit. Figure 1d shows an example of such a circuit. Leaves represent unit clauses. Inner nodes represent the disjunction or conjunction of their children l and r, but with the constraint that disjunctions must be deterministic (l ∧ r is unsatisfiable) and conjunctions must be decomposable (l ⊥ r). ⊥ 3. Perform WFOMC inference to compute posterior probabilities. In a FO d-DNNF circuit, WFOMC is polynomial in the size of the circuit and the cardinality of the domains. To compile the CNF theory into a FO d-DNNF circuit, Van den Broeck et al. [10] propose a set of compilation rules, which we will refer to as CR 1 . We will now briefly describe these rules. Unit Propagation introduces a decomposable conjunction when the theory contains a unit clause. Independence creates a decomposable conjunction when the theory contains independent subtheories. 
Shannon decomposition applies when the theory contains ground atoms and introduces a deterministic disjunction between two modified theories: one where the ground atom is true, and one where it is false. Shattering splits clauses in the theory until all pairs of atoms represent either a disjoint or identical set of ground atoms. Example 2. In Figure 2a, the first two clauses are made independent from the friends(X, X) clause and split off in a decomposable conjunction by unit propagation. The unit clause becomes a leaf of the FO d-DNNF circuit, while the other operand requires further compilation. 3 dislikes(X, Y ) ∨ friends(X, Y ) fun(X) ∨ ¬ friends(X, Y ) friends(X, Y ) ∨ dislikes(X, Y ) ¬ friends(X, Y ) ∨ likes(X, Y ) friends(X, X) fun(X) ∨ ¬ friends(X, Y ) fun(X) ∨ ¬ friends(Y, X) ∧ FunPeople ⊆ People x ∈ People friends(X, X) friends(X, Y ) ∨ dislikes(X, Y ), X = Y ¬ friends(X, Y ) ∨ likes(X, Y ), X = Y likes(X, X) dislikes(x, Y ) ∨ friends(x, Y ) fun(x) ∨ ¬ friends(x, Y ) fun(X), X ∈ FunPeople ¬ fun(X), X ∈ FunPeople / fun(X) ∨ ¬ friends(X, Y ) fun(X) ∨ ¬ friends(Y, X) (a) Unit propagation of friends(X, X) (b) Independent partial grounding (c) Atom counting of fun(X) Figure 2: Examples of compilation rules. Circles are FO d-DNNF inner nodes. White rectangles show theories before and after applying the rule. All variable domains are People. (taken from [10]) Independent Partial Grounding creates a decomposable conjunction over a set of child circuits, which are identical up to the value of a grounding constant. Since they are structurally identical, only one child circuit is actually compiled. Atom Counting applies when the theory contains an atom with a single logical variable X ∈ D. It explicitly represents the domain D ⊆ D of X for which the atom is true. It compiles the theory into a deterministic disjunction between all possible such domains. Again, these child circuits are identical up to the value of D and only one is compiled. Example 3. The theory in Figure 2b is compiled into a decomposable set-conjunction of theories that are independent and identical up to the value of the x constant. The theory in Figure 2c contains an atom with one logical variable: fun(X). Atom counting compiles it into a deterministic setdisjunction over theories that differ in FunPeople, which is the domain of X for which fun(X) is true. Subsequent steps of unit propagation remove the fun(X) atoms from the theory entirely. 3.3 Completeness We will now characterize those theories where the CR 1 compilation rules cannot be used, and where the inference procedure has to resort to grounding out the theory to propositional logic. For these, first-order knowledge compilation using CR 1 is not yet domain lifted. When a logical theory contains symmetric, anti-symmetric or total relations, such as friends(X, Y ) ⇒ friends(Y, X), parent(X, Y ) ⇒ ¬ parent(Y, X), X = Y, ≤ (X, Y) ∨ ≤ (Y, X), or more general formulas, such as enemies(X, Y ) ⇒ ¬ friend(X, Y ) ∧ ¬ friend(Y, X), (3) (4) (5) (6) none of the CR 1 rules apply. Intuitively, the underlying problem is the presence of either: • Two unifying (not independent) atoms in the same clause which contain the same logical variable in different positions of the argument list. Examples include (the CNF of) Formulas 3, 4 and 5, where the X and Y variable are bound by unifying two atoms from the same clause. • Two logical variables that bind when unifying one pair of atoms but appear in different positions of the argument list of two other unifying atoms. 
Examples include Formula 6, which in CNF is ¬ friend(X, Y ) ∨ ¬ enemies(X, Y ) ¬ friend(Y, X) ∨ ¬ enemies(X, Y ) Here, unifying the enemies(X, Y ) atoms binds the X variables from both clauses, which appear in different positions of the argument lists of the unifying atoms friend(X, Y ) and friend(Y, X). Both of these properties preclude the use of CR 1 rules. Also in the context of other model classes, such as MLNs, probabilistic versions of the above formulas cannot be processed by CR 1 rules. 4 Even though first-order knowledge compilation with CR 1 rules does not have a clear completeness result, we can show some properties of theories to which none of the compilation rules apply. First, we need to distinguish between the arity of an atom and its dimension. A predicate with arity two might have atoms with dimension one, when one of the arguments is ground or both are identical. Definition 3 (Dimension of an Expression). The dimension of an expression e is the number of logical variables it contains: dim(e) = | vars(e)|. Lemma 1 (CR 1 Postconditions). The CR 1 rules remove all atoms from the theory T which have zero or one logical variable arguments, such that afterwards ∀a ∈ atom(T ) : dim(a) > 1. When no CR 1 rule applies, the theory is shattered and contains no independent subtheories. Proof. Ground atoms are removed by the Shannon decomposition operator followed by unit propagation. Atoms with a single logical variable (including unary relations) are removed by the atom counting operator followed by unit propagation. If T contains independent subtheories, the independence operator can be applied. Shattering is always applied when T is not yet shattered. 4 Extending first-order knowledge compilation In this section we introduce a new operator which does apply to the theories from Section 3.3. 4.1 Logical variable properties To formally define the operator we propose, and prove its correctness, we first introduce some mathematical concepts related to the logical variables in a theory (partly after Jha et al. [8]). Definition 4 (Binding Variables). Two logical variables X, Y are directly binding b(X, Y ) if they are bound by unifying a pair of atoms in the theory. The binding relationship b+ (X, Y ) is the transitive closure of the directly binding relation b(X, Y ). Example 4. In the theory ¬ p(W, X) ∨ ¬ q(X) r(Y ) ∨ ¬ q(Y ) ¬ r(Z) ∨ s(Z) the variable pairs (X, Y ) and (Y, Z) are directly binding. The variables X, Y and Z are binding. Variable W does not bind to any other variable. Note that the binding relationship b+ (X, Y ) is an equivalence relation that defines two equivalence classes: {X, Y, Z} and {W }. Lemma 2 (Binding Domains). After shattering, binding logical variables have identical domains. Proof. During shattering (see Section 3.2), when two atoms unify, binding two variables with partially overlapping domains, the atoms’ clauses are split up into clauses where the domain of the variables is identical, and clauses where the domains are disjoint and the atoms no longer unify. Definition 5 (Root Binding Class). A root variable is a variable that appears in all the atoms in its clause. A root binding class is an equivalence class of binding variables where all variables are root. Example 5. In the theory of Example 4, {X, Y, Z} is a root binding class and {W } is not. 4.2 Domain recursion We will now introduce the new domain recursion operator, starting with its preconditions. Definition 6. 
A theory allows for domain recursion when (i) the theory is shattered, (ii) the theory contains no independent subtheories and (iii) there exists a root binding class. From now on, we will denote with C the set of clauses of the theory at hand and with B a root binding class guaranteed to exist if C allows for domain recursion. Lemma 2 states that all variables in B have identical domains. We will denote the domain of these variables with D. The intuition behind the domain recursion operator is that it modifies D by making one element explicit: D = D ∪ {xD } with xD ∈ D . This explicit domain element is introduced by the S PLIT D / function, which splits clauses w.r.t. the new subdomain D and element xD . 5 Definition 7 (S PLIT D). For a clause c and given set of variables Vc ⊆ vars(c) with domain D, let S PLIT D(c, Vc ) = c, if Vc = ∅ S PLIT D(c1 , Vc \ {V }) ∪ S PLIT D(c2 , Vc \ {V }), if Vc = ∅ (7) where c1 = c ∧ (V = xD ) and c2 = c ∧ (V = xD ) ∧ (V ∈ D ) for some V ∈ Vc . For a set of clauses C and set of variables V with domain D: S PLIT D(C, V) = c∈C S PLIT D(c, V ∩ vars(c)). The domain recursion operator creates three sets of clauses: S PLIT D(C, B) = Cx ∪ Cv ∪ Cr , with Cx = {c ∧ (V = xD )|c ∈ C}, (8) V ∈B∩vars(c) Cv = {c ∧ (V = xD ) ∧ (V ∈ D )|c ∈ C}, (9) V ∈B∩vars(c) Cr = S PLIT D(C, B) \ Cx \ Cv . (10) Proposition 3. The conjunction of the domain recursion sets is equivalent to the original theory: c∈Cr c . c∈Cv c ∧ c∈C c ≡ c∈S PLIT D(C,B) c and therefore c∈C c ≡ c∈Cx c ∧ We will now show that these sets are independent and that their conjunction is decomposable. Theorem 4. The theories Cx , Cv and Cr are independent: Cx ⊥ Cv , Cx ⊥ Cr and Cv ⊥ Cr . ⊥ ⊥ ⊥ The proof of Theorem 4 relies on the following Lemma. Lemma 5. If the theory allows for domain recursion, all clauses and atoms contain the same number of variables from B: ∃n, ∀c ∈ C, ∀a ∈ atom(C) : | vars(c) ∩ B | = | vars(a) ∩ B | = n. c Proof. Denote with Cn the clauses in C that contain n logical variables from B and with Cn its compliment in C. If C is nonempty, there is a n > 0 for which Cn is nonempty. Then every atom in Cn contains exactly n variables from B (Definition 5). Since the theory contains no independent c c subtheories, there must be an atom a in Cn which unifies with an atom ac in Cn , or Cn is empty. After shattering, all unifications bind one variable from a to a single variable from ac . Because a contains exactly n variables from B, ac must also contain exactly n (Definition 4), and because B is c a root binding class, the clause of ac also contains exactly n, which contradicts the definition of Cn . c Therefore, Cn is empty, and because the variables in B are root, they also appear in all atoms. Proof of Theorem 4. From Lemma 5, all atoms in C contain the same number of variables from B. In Cx , these variables are all constrained to be equal to xD , while in Cv and Cr at least one variable is constrained to be different from xD . An attempt to unify an atom from Cx with an atom from Cv or Cr therefore creates an unsatisfiable set of constraints. Similarly, atoms from Cv and Cr cannot be unified. Finally, we extend the FO d-DNNF language proposed in Van den Broeck et al. [10] with a new node, the recursive decomposable conjunction ∧ r , and define the domain recursion compilation rule. Definition 8 ( ∧ r ). The FO d-DNNF node ∧ r (nx , nr , D, D , V) represents a decomposable conjunction between the d-DNNF nodes nx , nr and a d-DNNF node isomorphic to the ∧ r node itself. 
In particular, the isomorphic operand is identical to the node itself, except for the size of the domain of the variables in V, which becomes one smaller, going from D to D in the isomorphic operand. We have shown that the conjunction between sets Cx , Cv and Cr is decomposable (Theorem 4) and logically equivalent to the original theory (Proposition 3). Furthermore, Cv is identical to C, up to the constraints on the domain of the variables in B. This leads us to the following definition of domain recursion. Definition 9 (Domain Recursion). The domain recursion compilation rule compiles C into ∧ r (nx , nr , D, D , B), where nx , nr are the compiled circuits for Cx , Cr . The third set Cv is represented by the recursion on D, according to Definition 8. 6 nv Cr ∧r ¬ friends(x, X) ∨ friends(X, x), X = x ¬ friends(X, x) ∨ friends(x, X), X = x P erson ← P erson \ {x} nr nx ∨ Cx ¬ friends(x, x) ∨ friends(x, x) x ∈P erson x =x ¬ friends(x, x) friends(x, x) ∨ ∧ ¬ friends(x, x ) ¬ friends(x , x) ∧ friends(x, x ) friends(x , x) Figure 3: Circuit for the symmetric relation in Equation 3, rooted in a recursive conjunction. Example 6. Figure 3 shows the FO d-DNNF circuit for Equation 3. The theory is split up into three independent theories: Cr and Cx , shown in the Figure 3, and Cv = {¬ friends(X, Y ) ∨ friends(Y, X), X = x, Y = x}. The conjunction of these theories is equivalent to Equation 3. Theory Cv is identical to Equation 3, up to the inequality constraints on X and Y . Theorem 6. Given a function size, which maps domains to their size, the weighted first-order model count of a ∧ r (nx , nr , D, D , V) node is size(D) wmc( ∧ r (nx , nr , D, D , V), size) = wmc(nx , size)size(D) wmc(nr , size ∪{D → s}), s=0 (11) where size ∪{D → s} adds to the size function that the subdomain D has cardinality s. Proof. If C allows for domain recursion, due to Theorem 4, the weighted model count is wmc(C, size) = 1, if size(D) = 0 wmc(Cx ) · wmc(Cv , size ) · wmc(Cr , size ) if size(D) > 0 (12) where size = size ∪{D → size(D) − 1}. Theorem 7. The Independent Partial Grounding compilation rule is a special case of the domain recursion rule, where ∀c ∈ C : | vars(c) ∩ B | = 1 (and therefore Cr = ∅). 4.3 Completeness In this section, we introduce a class of models for which first-order knowledge compilation with domain recursion is complete. Definition 10 (k-WFOMC). The class of k-WFOMC consist of WFOMC theories with clauses that have up to k logical variables. A first completeness result is for 2-WFOMC, using the set of knowledge compilation rules CR 2 , which are the rules in CR 1 extended with domain recursion. Theorem 8 (Completeness for 2-WFOMC). First-order knowledge compilation using the CR 2 compilation rules is a complete domain lifted probabilistic inference algorithm for 2-WFOMC. Proof. From Lemma 1, after applying the CR 1 rules, the theory contains only atoms with dimension larger than or equal to two. From Definition 10, each clause has dimension smaller than or equal to two. Therefore, each logical variable in the theory is a root variable and according to Definition 5, every equivalence class of binding variables is a root binding class. Because of Lemma 1, the theory allows for domain recursion, which requires further compilation of two theories: Cx and Cr into nx and nr . Both have dimension smaller than 2 and can be lifted by CR 1 compilation rules. The properties of 2-WFOMC are a sufficient but not necessary condition for first-order knowledge compilation to be domain lifted. 
We can obtain a similar result for MLNs or parfactors by reducing them to a WFOMC problem. If an MLN contains only formulae with up to k logical variables, then its WFOMC representation will be in k-WFOMC. 7 This result for 2-WFOMC is not trivial. Van den Broeck et al. [10] showed in their experiments that counting first-order variable elimination (C-FOVE) [6] fails to lift the “Friends Smoker Drinker” problem, which is in 2-WFOMC. We will show in the next section that the CR 1 rules fail to lift the theory in Figure 4a, which is in 2-WFOMC. Note that there are also useful theories that are not in 2-WFOMC, such as those containing the transitive relation friends(X, Y ) ∧ friends(Y, Z) ⇒ friends(X, Z). 5 Empirical evaluation To complement the theoretical results of the previous section, we extended the WFOMC implementation1 with the domain recursion rule. We performed experiments with the theory in Figure 4a, which is a version of the friends and smokers model [11] extended with the symmetric relation of Equation 3. We evaluate the performance querying P(smokes(bob)) with increasing domain size, comparing our approach to the existing WFOMC implementation and its propositional counterpart, which first grounds the theory and then compiles it with the c2d compiler [13] to a propositional d-DNNF circuit. We did not compare to C-FOVE [6] because it cannot perform lifted inference on this model. 2 smokes(X) ∧ friends(X, Y ) ⇒ smokes(Y ) friends(X, Y ) ⇒ friends(Y, X). Runtime [s] Propositional inference quickly becomes intractable when there are more than 20 people. The lifted inference algorithms scale much better. The CR 1 rules can exploit some regularities in the model. For example, they eliminate all the smokes(X) atoms from the theory. They do, however, resort to grounding at a later stage of the compilation process. With the domain recursion rule, there is no need for grounding. This advantage is clear in the experiments, our approach having an almost constant inference time in this range of domains sizes. Note that the runtimes for c2d include compilation and evaluation of the circuit, whereas the WFOMC runtimes only represent evaluation of the FO d-DNNF. After all, propositional compilation depends on the domain size but first-order compilation does not. First-order compilation takes a constant two seconds for both rule sets. 10000 1000 100 10 1 0.1 0.01 c2d WFOMC - CR1 WFOMC - CR2 10 20 30 40 50 60 Number of People 70 80 (b) Evaluation Runtime (a) MLN Model Figure 4: Symmetric friends and smokers experiment, comparing propositional knowledge compilation (c2d) to WFOMC using compilation rules CR 1 and CR 2 (which includes domain recursion). 6 Conclusions We proposed a definition of complete domain lifted probabilistic inference w.r.t. classes of probabilistic logic models. This definition considers algorithms to be lifted if they are polynomial in the size of logical variable domains. Existing first-order knowledge compilation turns out not to admit an intuitive completeness result. Therefore, we generalized the existing Independent Partial Grounding compilation rule to the domain recursion rule. With this one extra rule, we showed that first-order knowledge compilation is complete for a significant class of probabilistic logic models, where the WFOMC representation has up to two logical variables per clause. Acknowledgments The author would like to thank Luc De Raedt, Jesse Davis and the anonymous reviewers for valuable feedback. 
6 Conclusions

We proposed a definition of complete domain-lifted probabilistic inference w.r.t. classes of probabilistic logic models. This definition considers algorithms to be lifted if they are polynomial in the size of the logical variable domains. Existing first-order knowledge compilation turns out not to admit an intuitive completeness result. Therefore, we generalized the existing Independent Partial Grounding compilation rule to the domain recursion rule. With this one extra rule, we showed that first-order knowledge compilation is complete for a significant class of probabilistic logic models, namely those whose WFOMC representation has up to two logical variables per clause.

Acknowledgments

The author would like to thank Luc De Raedt, Jesse Davis and the anonymous reviewers for valuable feedback. This work was supported by the Research Foundation-Flanders (FWO-Vlaanderen).

References

[1] Lise Getoor and Ben Taskar, editors. An Introduction to Statistical Relational Learning. MIT Press, 2007.
[2] Luc De Raedt, Paolo Frasconi, Kristian Kersting, and Stephen Muggleton, editors. Probabilistic Inductive Logic Programming: Theory and Applications. Springer-Verlag, Berlin, Heidelberg, 2008.
[3] Daan Fierens, Guy Van den Broeck, Ingo Thon, Bernd Gutmann, and Luc De Raedt. Inference in probabilistic logic programs using weighted CNFs. In Proceedings of UAI, pages 256–265, 2011.
[4] David Poole. First-order probabilistic inference. In Proceedings of IJCAI, pages 985–991, 2003.
[5] Rodrigo de Salvo Braz, Eyal Amir, and Dan Roth. Lifted first-order probabilistic inference. In Proceedings of IJCAI, pages 1319–1325, 2005.
[6] Brian Milch, Luke S. Zettlemoyer, Kristian Kersting, Michael Haimes, and Leslie Pack Kaelbling. Lifted probabilistic inference with counting formulas. In Proceedings of AAAI, pages 1062–1068, 2008.
[7] Vibhav Gogate and Pedro Domingos. Exploiting logical structure in lifted probabilistic inference. In Proceedings of StarAI, 2010.
[8] Abhay Jha, Vibhav Gogate, Alexandra Meliou, and Dan Suciu. Lifted inference seen from the other side: The tractable features. In Proceedings of NIPS, 2010.
[9] Vibhav Gogate and Pedro Domingos. Probabilistic theorem proving. In Proceedings of UAI, pages 256–265, 2011.
[10] Guy Van den Broeck, Nima Taghipour, Wannes Meert, Jesse Davis, and Luc De Raedt. Lifted probabilistic inference by first-order knowledge compilation. In Proceedings of IJCAI, pages 2178–2185, 2011.
[11] Parag Singla and Pedro Domingos. Lifted first-order belief propagation. In Proceedings of AAAI, pages 1094–1099, 2008.
[12] Matthew Richardson and Pedro Domingos. Markov logic networks. Machine Learning, 62(1):107–136, 2006.
[13] Adnan Darwiche. New advances in compiling CNF to decomposable negation normal form. In Proceedings of ECAI, pages 328–332, 2004.
5 0.57870948 228 nips-2011-Quasi-Newton Methods for Markov Chain Monte Carlo
Author: Yichuan Zhang, Charles A. Sutton
Abstract: The performance of Markov chain Monte Carlo methods is often sensitive to the scaling and correlations between the random variables of interest. An important source of information about the local correlation and scale is given by the Hessian matrix of the target distribution, but this is often either computationally expensive or infeasible. In this paper we propose MCMC samplers that make use of quasi-Newton approximations, which approximate the Hessian of the target distribution from previous samples and gradients generated by the sampler. A key issue is that MCMC samplers that depend on the history of previous states are in general not valid. We address this problem by using limited memory quasi-Newton methods, which depend only on a fixed window of previous samples. On several real world datasets, we show that the quasi-Newton sampler is more effective than standard Hamiltonian Monte Carlo at a fraction of the cost of MCMC methods that require higher-order derivatives. 1
6 0.57278413 197 nips-2011-On Tracking The Partition Function
7 0.56319374 136 nips-2011-Inverting Grice's Maxims to Learn Rules from Natural Language Extractions
8 0.55143768 57 nips-2011-Comparative Analysis of Viterbi Training and Maximum Likelihood Estimation for HMMs
9 0.53779876 158 nips-2011-Learning unbelievable probabilities
10 0.52938688 192 nips-2011-Nonstandard Interpretations of Probabilistic Programs for Efficient Inference
11 0.47736043 229 nips-2011-Query-Aware MCMC
12 0.46563366 221 nips-2011-Priors over Recurrent Continuous Time Processes
13 0.44741091 274 nips-2011-Structure Learning for Optimization
14 0.44740418 246 nips-2011-Selective Prediction of Financial Trends with Hidden Markov Models
15 0.44348386 69 nips-2011-Differentially Private M-Estimators
16 0.44195032 296 nips-2011-Uniqueness of Belief Propagation on Signed Graphs
17 0.43313795 243 nips-2011-Select and Sample - A Model of Efficient Neural Inference and Learning
18 0.42874426 276 nips-2011-Structured sparse coding via lateral inhibition
19 0.4182626 173 nips-2011-Modelling Genetic Variations using Fragmentation-Coagulation Processes
20 0.41593498 92 nips-2011-Expressive Power and Approximation Errors of Restricted Boltzmann Machines
topicId topicWeight
[(0, 0.024), (4, 0.035), (20, 0.044), (26, 0.06), (31, 0.095), (33, 0.051), (43, 0.049), (45, 0.077), (57, 0.067), (74, 0.045), (83, 0.052), (87, 0.259), (90, 0.012), (99, 0.046)]
simIndex simValue paperId paperTitle
1 0.82573158 230 nips-2011-RTRMC: A Riemannian trust-region method for low-rank matrix completion
Author: Nicolas Boumal, Pierre-antoine Absil
Abstract: We consider large matrices of low rank. We address the problem of recovering such matrices when most of the entries are unknown. Matrix completion finds applications in recommender systems. In this setting, the rows of the matrix may correspond to items and the columns may correspond to users. The known entries are the ratings given by users to some items. The aim is to predict the unobserved ratings. This problem is commonly stated in a constrained optimization framework. We follow an approach that exploits the geometry of the low-rank constraint to recast the problem as an unconstrained optimization problem on the Grassmann manifold. We then apply first- and second-order Riemannian trust-region methods to solve it. The cost of each iteration is linear in the number of known entries. Our methods, RTRMC 1 and 2, outperform state-of-the-art algorithms on a wide range of problem instances. 1
same-paper 2 0.74651361 17 nips-2011-Accelerated Adaptive Markov Chain for Partition Function Computation
Author: Stefano Ermon, Carla P. Gomes, Ashish Sabharwal, Bart Selman
Abstract: We propose a novel Adaptive Markov Chain Monte Carlo algorithm to compute the partition function. In particular, we show how to accelerate a flat histogram sampling technique by significantly reducing the number of “null moves” in the chain, while maintaining asymptotic convergence properties. Our experiments show that our method converges quickly to highly accurate solutions on a range of benchmark instances, outperforming other state-of-the-art methods such as IJGP, TRW, and Gibbs sampling both in run-time and accuracy. We also show how obtaining a so-called density of states distribution allows for efficient weight learning in Markov Logic theories. 1
3 0.70010406 153 nips-2011-Learning large-margin halfspaces with more malicious noise
Author: Phil Long, Rocco Servedio
Abstract: We describe a simple algorithm that runs in time poly(n, 1/γ, 1/ε) and learns an unknown n-dimensional γ-margin halfspace to accuracy 1 − ε in the presence of malicious noise, when the noise rate is allowed to be as high as Θ(εγ log(1/γ)). Previous efficient algorithms could only learn to accuracy ε in the presence of malicious noise of rate at most Θ(εγ). Our algorithm does not work by optimizing a convex loss function. We show that no algorithm for learning γ-margin halfspaces that minimizes a convex proxy for misclassification error can tolerate malicious noise at a rate greater than Θ(εγ); this may partially explain why previous algorithms could not achieve the higher noise tolerance of our new algorithm. 1
4 0.66440433 210 nips-2011-PAC-Bayesian Analysis of Contextual Bandits
Author: Yevgeny Seldin, Peter Auer, John S. Shawe-taylor, Ronald Ortner, François Laviolette
Abstract: We derive an instantaneous (per-round) data-dependent regret bound for stochastic multiarmed bandits with side information (also known as contextual bandits). The scaling of our regret bound with the number of states (contexts) N goes as √(N I_ρt(S; A)), where I_ρt(S; A) is the mutual information between states and actions (the side information) used by the algorithm at round t. If the algorithm uses all the side information, the regret bound scales as √(N ln K), where K is the number of actions (arms). However, if the side information I_ρt(S; A) is not fully used, the regret bound is significantly tighter. In the extreme case, when I_ρt(S; A) = 0, the dependence on the number of states reduces from linear to logarithmic. Our analysis allows us to provide the algorithm with a large amount of side information, let the algorithm decide which side information is relevant for the task, and penalize the algorithm only for the side information that it uses de facto. We also present an algorithm for multiarmed bandits with side information with O(K) computational complexity per game round. 1
5 0.56731778 29 nips-2011-Algorithms and hardness results for parallel large margin learning
Author: Phil Long, Rocco Servedio
Abstract: We study the fundamental problem of learning an unknown large-margin halfspace in the context of parallel computation. Our main positive result is a parallel algorithm for learning a large-margin halfspace that is based on interior point methods from convex optimization and fast parallel algorithms for matrix computations. We show that this algorithm learns an unknown γ-margin halfspace over n dimensions using poly(n, 1/γ) processors and runs in time Õ(1/γ) + O(log n). In contrast, naive parallel algorithms that learn a γ-margin halfspace in time that depends polylogarithmically on n have Ω(1/γ²) runtime dependence on γ. Our main negative result deals with boosting, which is a standard approach to learning large-margin halfspaces. We give an information-theoretic proof that in the original PAC framework, in which a weak learning algorithm is provided as an oracle that is called by the booster, boosting cannot be parallelized: the ability to call the weak learner multiple times in parallel within a single boosting stage does not reduce the overall number of successive stages of boosting that are required. 1
6 0.55481529 66 nips-2011-Crowdclustering
7 0.55481136 227 nips-2011-Pylon Model for Semantic Segmentation
8 0.55205327 57 nips-2011-Comparative Analysis of Viterbi Training and Maximum Likelihood Estimation for HMMs
9 0.55200797 55 nips-2011-Collective Graphical Models
10 0.55069834 180 nips-2011-Multiple Instance Filtering
11 0.54990196 266 nips-2011-Spatial distance dependent Chinese restaurant processes for image segmentation
12 0.54979867 229 nips-2011-Query-Aware MCMC
13 0.54862362 223 nips-2011-Probabilistic Joint Image Segmentation and Labeling
14 0.54806107 253 nips-2011-Signal Estimation Under Random Time-Warpings and Nonlinear Signal Alignment
15 0.54803246 303 nips-2011-Video Annotation and Tracking with Active Learning
16 0.54548496 102 nips-2011-Generalised Coupled Tensor Factorisation
17 0.54499477 140 nips-2011-Kernel Embeddings of Latent Tree Graphical Models
18 0.54499251 219 nips-2011-Predicting response time and error rates in visual search
19 0.54209006 43 nips-2011-Bayesian Partitioning of Large-Scale Distance Data
20 0.54185683 84 nips-2011-EigenNet: A Bayesian hybrid of generative and conditional models for sparse learning