nips nips2010 nips2010-118 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Justin Domke
Abstract: This paper proposes a simple and efficient finite difference method for implicit differentiation of marginal inference results in discrete graphical models. Given an arbitrary loss function, defined on marginals, we show that the derivatives of this loss with respect to model parameters can be obtained by running the inference procedure twice, on slightly perturbed model parameters. This method can be used with approximate inference, with a loss function over approximate marginals. Convenient choices of loss functions make it practical to fit graphical models with hidden variables, high treewidth and/or model misspecification. 1
Reference: text
sentIndex sentText sentNum sentScore
1 edu Abstract This paper proposes a simple and efficient finite difference method for implicit differentiation of marginal inference results in discrete graphical models. [sent-3, score-0.358]
2 Given an arbitrary loss function, defined on marginals, we show that the derivatives of this loss with respect to model parameters can be obtained by running the inference procedure twice, on slightly perturbed model parameters. [sent-4, score-0.49]
3 This method can be used with approximate inference, with a loss function over approximate marginals. [sent-5, score-0.219]
4 Convenient choices of loss functions make it practical to fit graphical models with hidden variables, high treewidth and/or model misspecification. [sent-6, score-0.222]
5 Though the likelihood and conditional likelihood are the most widespread training objectives, these are sometimes undesirable and/or infeasible in real applications. [sent-8, score-0.275]
6 With low treewidth, if the data is truly distributed according to the chosen graphical model with some parameters, any consistent loss function will recover those true parameters in the high-data limit, and so one might select a loss function according to statistical convergence rates [1]. [sent-9, score-0.312]
7 In this case, different loss functions lead to different asymptotic parameter estimates. [sent-11, score-0.113]
8 For low-treewidth graphs, several loss functions have been proposed that prioritize different types of accuracy (section 2. [sent-13, score-0.113]
9 For parameters θ, these loss functions are given as a function L(µ(θ)) of marginals µ(θ). [sent-15, score-0.305]
10 The likelihood may also be infeasible to optimize, due to the computational intractability of computing the log-partition function or its derivatives in high treewidth graphs. [sent-18, score-0.211]
11 On the other hand, if an approximate inference algorithm will be used at test time, it is logical to design the loss function to compensate for defects in inference. [sent-19, score-0.279]
12 The surrogate likelihood (the likelihood with an approximate partition function) can give superior results to the true likelihood when approximate inference is used at test time [4]. [sent-20, score-0.611]
13 If µ(θ) is the function mapping parameters to (approximate) marginals, and there is some loss function L(µ) defined on those marginals, we desire to recover dL/dθ. [sent-22, score-0.144]
14 This enables the use of the marginal-based loss functions mentioned previously, but defined on approximate marginals. [sent-23, score-0.166]
15 The major disadvantage of this approach is that standard linear solvers can perform poorly on large graphs, meaning that calculating this gradient can be more expensive than performing inference (Section 4). [sent-26, score-0.425] [sent-28, score-0.21]
16 Figure 1: Example images from the Berkeley dataset, along with marginals for a conditional random field fit with various loss functions (panels: True (y), Noisy (x), Surrogate likelihood, Clique likelihood, Univariate likelihood, Smooth class. error). [sent-27, score-0.342]
18 However, it is tied to a specific entropy approximation (Bethe) and algorithm (Loopy Belief Propagation). [sent-33, score-0.258]
19 Extension to similar message-passing algorithms appears possible, but extension to more complex inference algorithms [7, 8, 9] is unclear. [sent-34, score-0.113]
20 Here, we observe that the loss gradient can be calculated by far more straightforward means. [sent-35, score-0.171]
21 This result follows from, first, the well-known trick of approximating Jacobian-vector products by finite differences and, second, the special property that for marginal inference, the Jacobian matrix dµ/dθᵀ is symmetric. [sent-37, score-0.106]
22 This result applies when marginal inference takes place over the local polytope with an entropy that is concave and obeys a minor technical condition. [sent-38, score-0.614]
23 It can also be used with non-concave entropies, with some assumptions on how inference recovers different local optima. [sent-39, score-0.145]
24 It is easy to use this to compute the gradient of essentially any differentiable loss function defined on marginals. [sent-40, score-0.171]
25 Effectively, all one needs to do is re-run the inference procedure on a set of parameters slightly "perturbed" in the direction ∂L/∂µ. [sent-41, score-0.113]
26 Aside from this, like the matrix inversion approach, it is independent of the algorithm used to perform inference, and applicable to a variety of different inference approximations. [sent-44, score-0.144]
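As a minimal illustration of the two ingredients above (finite-difference Jacobian-vector products and the symmetry of dµ/dθᵀ), here is a hedged Python sketch; the toy softmax map and the function names are illustrative stand-ins, not the paper's code:

```python
import numpy as np

def jacobian_vector_product(mu, theta, v, r=1e-4):
    """Central-difference approximation of (dmu/dtheta) v: run the map twice,
    on parameters perturbed in the direction v."""
    return (mu(theta + r * v) - mu(theta - r * v)) / (2.0 * r)

# Toy map whose Jacobian is symmetric: a softmax, i.e. the gradient of a
# log-partition function A(theta), mimicking the structure of marginal inference.
A = lambda th: np.logaddexp.reduce(th)
mu = lambda th: np.exp(th - A(th))

theta = np.array([0.2, -0.5, 1.0])
v = np.array([1.0, 0.0, -2.0])
# Because the Jacobian is symmetric, this is also the vector-Jacobian product
# needed in the chain rule dL/dtheta = (dmu/dtheta)^T dL/dmu.
print(jacobian_vector_product(mu, theta, v))
```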
27 1 Background: Marginal Inference. This section briefly reviews the aspects of graphical models and marginal inference that are required for the rest of the paper. [sent-47, score-0.274]
28 (2) Marginal inference means recovering the expected value of f or, equivalently, the marginal probability that each factor or variable has a particular value. [sent-56, score-0.219]
29 µ(θ) = Σ_x p(x; θ) f(x) (3) [sent-57, score-0.192]
30 Though marginals could, in principle, be computed by the brute-force sum in Eq. 3, it is useful to consider the paired variational representation [10, Chapter 3]: A(θ) = max_{µ∈M} θ · µ + H(µ) (4), µ(θ) = dA/dθ = arg max_{µ∈M} θ · µ + H(µ) (5), in which A and µ can both be recovered from solving the same optimization problem. [sent-58, score-0.128]
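A small sketch of the brute-force sum in Eq. 3 for a toy model; the indicator features and the exponential-family form p(x; θ) ∝ exp(θ · f(x)) are assumptions made purely for illustration:

```python
import itertools
import numpy as np

def brute_force_marginals(theta, f, n_vars, n_states):
    """mu(theta) = sum_x p(x; theta) f(x), computed by enumerating every
    configuration x (only feasible for tiny models)."""
    configs = list(itertools.product(range(n_states), repeat=n_vars))
    scores = np.array([theta @ f(x) for x in configs])
    p = np.exp(scores - np.logaddexp.reduce(scores))  # normalized probabilities
    return sum(pi * f(x) for pi, x in zip(p, configs))

# Two binary variables with one indicator feature per joint state.
def f(x):
    feat = np.zeros(4)
    feat[2 * x[0] + x[1]] = 1.0
    return feat

print(brute_force_marginals(np.array([0.0, 0.5, 0.5, 1.0]), f, n_vars=2, n_states=2))
```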
31 Here, M = {µ(θ) | θ ∈ ℜⁿ} is the marginal polytope, i.e., those marginals µ resulting from some parameter vector θ. [sent-59, score-0.298]
32 Similarly, H(µ) is the entropy of p(x; θ′), where θ′ is the vector of parameters that produces the marginals µ. [sent-60, score-0.389]
33 A variety of approximate inference methods can be seen as solving a modification of Eqs. [sent-64, score-0.166]
34 4 and 5, with the marginal polytope and entropy replaced with tractable approximations. [sent-65, score-0.412]
35 Notice that these are also paired; the approximate µ is the exact gradient of the approximate A. [sent-66, score-0.164]
36 The most common relaxation of M is the local polytope L = {µ ≥ 0 | Σ_{xα\i} µ(xα) = µ(xi), Σ_{xi} µ(xi) = 1} (6). [sent-67, score-0.141]
37 This underlies loopy belief propagation, as well as tree-reweighted belief propagation. [sent-68, score-0.346]
38 Since a valid set of marginals must obey these constraints, L ⊇ M. [sent-69, score-0.226]
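A minimal sketch of checking the local-consistency constraints of Eq. 6 for a single edge; the data layout is an illustrative assumption, not taken from the paper:

```python
import numpy as np

def in_local_polytope(pairwise, mu_i, mu_j, tol=1e-8):
    """Check the local-consistency constraints of Eq. 6 for one edge (i, j):
    nonnegativity, normalization of the univariate marginals, and agreement of
    the pairwise marginal with each univariate marginal."""
    ok = np.all(pairwise >= -tol)
    ok &= np.isclose(mu_i.sum(), 1.0, atol=tol) and np.isclose(mu_j.sum(), 1.0, atol=tol)
    ok &= np.allclose(pairwise.sum(axis=1), mu_i, atol=tol)
    ok &= np.allclose(pairwise.sum(axis=0), mu_j, atol=tol)
    return bool(ok)

mu_ij = np.array([[0.3, 0.2], [0.1, 0.4]])
print(in_local_polytope(mu_ij, mu_ij.sum(axis=1), mu_ij.sum(axis=0)))
```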
39 The Bethe approximation implicit in loopy belief propagation [11] is non-concave in general, which means inference sometimes fails to reach the global optimum. [sent-72, score-0.359]
40 Concave entropy functions include the tree-reweighted entropy [12], convexified Bethe entropies [13], and the class of entropies obeying Heskes’ conditions [14]. [sent-73, score-0.628]
41 The (negative) likelihood is the classic loss function for training graphical models. [sent-77, score-0.287]
42 The surrogate likelihood [4] is simply the likelihood as in Eq. [sent-85, score-0.392]
43 Unlike the losses below, the surrogate likelihood is convex when based on a concave inference method. [sent-89, score-0.443]
44 [15] for a variant of this for inference with local optima. [sent-91, score-0.145]
45 If the application will only make use of univariate marginals at test time, one might fit parameters specifically to make these univariate marginals accurate. [sent-93, score-0.846]
46 [3] proposed the loss L(x̂; θ) = −Σ_i log µ(x̂i; θ) (11). [sent-95, score-0.113]
47 This can be computed in tree-like graphs, after running belief propagation to compute marginals. [sent-96, score-0.252]
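A hedged sketch of the univariate likelihood loss of Eq. 11, assuming the (approximate) univariate marginals have already been computed by some inference routine; array shapes are illustrative:

```python
import numpy as np

def univariate_likelihood_loss(marginals, x_hat, eps=1e-12):
    """Eq. 11: L(x_hat; theta) = - sum_i log mu(x_hat_i; theta), evaluated on
    precomputed (approximate) univariate marginals.

    marginals : (n_vars, n_states) array, each row a univariate marginal
    x_hat     : length-n_vars integer array of observed values
    """
    probs = marginals[np.arange(len(x_hat)), x_hat]
    return -np.sum(np.log(probs + eps))

print(univariate_likelihood_loss(np.array([[0.9, 0.1], [0.3, 0.7]]), np.array([0, 1])))
```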
48 [2] considered the loss L(x̂; θ) = Σ_i S( max_{xi ≠ x̂i} µ(xi; θ) − µ(x̂i; θ) ) (12), where S is the step function. [sent-101, score-0.201]
49 This loss measures the number of incorrect components of x̂ if each is predicted to be the "max marginal". [sent-102, score-0.113]
50 As with the univariate likelihood, this loss can be computed if exact marginals are available. [sent-105, score-0.536]
51 One can easily define clique versions of the previous two loss functions, where the summations are over α, rather than i. [sent-108, score-0.25]
52 These measure the accuracy of clique-wise marginals, rather than univariate marginals. [sent-109, score-0.231]
53 7, the equality constraints in the local polytope are linear, and hence when the positivity constraint can be disregarded, approximate marginal inference algorithms can be seen as solving the optimization µ(θ) = arg max_{µ: Bµ=d} θ · µ + H(µ). [sent-112, score-0.482]
54 Domke showed [5], in our notation, that dL/dθ = (D⁻¹Bᵀ(BD⁻¹Bᵀ)⁻¹BD⁻¹ − D⁻¹) dL/dµ (13), where D = ∂²H/∂µ∂µᵀ is the (diagonal) Hessian of the entropy approximation. [sent-113, score-0.197]
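For comparison, a dense linear-algebra sketch of Eq. 13; the small random D, B and dL/dµ below are purely illustrative, and in practice these matrices are large and sparse, which is what makes this route expensive:

```python
import numpy as np

def loss_gradient_eq13(D_diag, B, dL_dmu):
    """Eq. 13: dL/dtheta = (D^-1 B^T (B D^-1 B^T)^-1 B D^-1 - D^-1) dL/dmu,
    with D the diagonal Hessian of the entropy approximation and B the
    equality-constraint matrix of the local polytope."""
    Dinv = 1.0 / D_diag              # D is diagonal, stored as a vector
    BDinv = B * Dinv                 # B D^-1 (scale each column of B)
    middle = np.linalg.solve(BDinv @ B.T, BDinv @ dL_dmu)
    return Dinv * (B.T @ middle) - Dinv * dL_dmu

rng = np.random.default_rng(0)
B = rng.standard_normal((3, 8))
D_diag = -rng.uniform(1.0, 2.0, 8)   # the entropy Hessian is negative definite
print(loss_gradient_eq13(D_diag, B, rng.standard_normal(8)))
```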
55 Nonlinear and tied parameters are easily dealt with by considering θ(φ) to be a function of the "true" parameters φ. Algorithm 1: Calculating loss derivatives (two-sided). [sent-122, score-0.212]
56 µ+ ← arg max_{µ∈M} (θ + r ∂L/∂µ) · µ + H(µ); µ− ← arg max_{µ∈M} (θ − r ∂L/∂µ) · µ + H(µ); dL/dθ ← (µ+ − µ−)/(2r). [sent-131, score-0.138]
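A minimal sketch of the two-sided Algorithm 1 above, with a black-box `infer` standing in for whatever approximate marginal inference routine is used; the function name and signature are assumptions:

```python
def loss_gradient_by_perturbation(infer, theta, dL_dmu, r=1e-4):
    """Two-sided Algorithm 1: re-run inference on parameters perturbed in the
    direction dL/dmu and take a central difference.

    infer  : black-box marginal inference, e.g. loopy or tree-reweighted BP
    theta  : parameter vector (numpy array)
    dL_dmu : gradient of the loss with respect to the marginals at mu(theta)
    r      : perturbation size
    """
    mu_plus = infer(theta + r * dL_dmu)
    mu_minus = infer(theta - r * dL_dmu)
    return (mu_plus - mu_minus) / (2.0 * r)
```

With the toy softmax map from the earlier sketch used as `infer`, this reproduces dL/dθ for any differentiable loss on the marginals, up to finite-difference error.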
57 dL/dφ = (dθᵀ/dφ) dL/dθ (14). Conditional training is similar: define a distribution over a random variable y, parametrized by θ(φ; x); the derivative on a particular pair (x, y) is given again by Eq. [sent-135, score-0.088]
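A small sketch of the chain rule in Eq. 14 for the common special case of linearly tied parameters θ = Tφ; the tying matrix T here is a hypothetical example:

```python
import numpy as np

def tied_parameter_gradient(T, dL_dtheta):
    """Eq. 14 for linear tying theta = T @ phi: dL/dphi = T^T dL/dtheta,
    so tied copies of a free parameter simply accumulate their gradients."""
    return T.T @ dL_dtheta

# Three entries of theta tied to phi[0], one entry tied to phi[1].
T = np.array([[1.0, 0.0],
              [1.0, 0.0],
              [1.0, 0.0],
              [0.0, 1.0]])
print(tied_parameter_gradient(T, np.array([0.2, -0.1, 0.4, 1.0])))
```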
58 3 Implicit Differentiation by Perturbation. This section shows that when µ(θ) = arg max_{µ∈L} θ · µ + H(µ), the loss gradient can be computed by Alg. [sent-138, score-0.21]
59 1 for a concave entropy approximation of the form H(µ) = − Σ_α cα Σ_{xα} µ(xα) log µ(xα) − Σ_i ci Σ_{xi} µ(xi) log µ(xi) (15), when the counting numbers c obey (as is true of most proposed entropies) cα > 0 and ci + Σ_{α: i∈α} cα > 0 (16). [sent-139, score-0.407]
60 For intuition, the following Lemma uses notation (µ, θ, H) suggesting the application to marginal inference. [sent-140, score-0.167]
61 If µ(θ) = arg max_µ µ · θ + H(µ) (17) s.t. Bµ − d = 0 (18), where H(µ) is strictly convex and twice differentiable, then dµ/dθᵀ exists and is symmetric. [sent-144, score-0.105]
62 θ + ∂H(µ)/∂µ + Bᵀλ = 0, Bµ − d = 0 (20). Recall the general implicit function theorem. [sent-148, score-0.084]
63 Again, this uses notation suggesting the application to implicit differentiation and marginal inference, but holds true for any functions satisfying the stated conditions. [sent-155, score-0.19]
64 1 follows from applying this theorem to marginal inference. [sent-165, score-0.106]
65 If H(µ) = Σ_α cα H(µα) + Σ_i ci H(µi), and µ* is a (possibly local) maximum of θ · µ + H(µ) under the local polytope L, then cα > 0 and ci + Σ_{α: i∈α} cα > 0 imply µ* > 0. [sent-169, score-0.263]
66 With a non-concave entropy this condition is still valid, but not unique, since there can be local optima. [sent-176, score-0.229]
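A short sketch of the counting-number condition in Eqs. 15-16; the dictionary-based representation is an illustrative assumption:

```python
def counting_numbers_ok(c_alpha, c_i, cliques):
    """Condition of Eq. 16: every clique counting number is positive, and for
    every variable i, c_i plus the sum of c_alpha over cliques containing i
    is positive."""
    if any(c <= 0 for c in c_alpha.values()):
        return False
    return all(c_i[i] + sum(c_alpha[a] for a in cliques if i in a) > 0
               for i in c_i)

# Bethe-style counting numbers on a 3-cycle: c_alpha = 1 per edge,
# c_i = 1 - degree(i) = -1 per variable.
edges = [(0, 1), (1, 2), (0, 2)]
print(counting_numbers_ok({e: 1.0 for e in edges}, {i: -1.0 for i in range(3)}, edges))
```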
67 BBP essentially calculates this. [sent-177, score-0.606]
68 (Figure 2 panels: Bethe entropy and TRW entropy, plotting running time (s) against grid size and interaction strength for pert-BP / pert-TRWS, symmlq, BBP, direct, and BP / TRWS.) [sent-178, score-0.718]
69 Figure 2: Times to compute dL/dθ by perturbation, Back Belief Propagation (BBP), sparse matrix factorization (direct) and the iterative symmetric-LQ method (symmlq). [sent-179, score-0.112]
70 As these results use two-sided differences, perturbation always takes twice the running time of the base inference algorithm. [sent-181, score-0.362]
71 , 5}, with univariate terms θ(xi) taken uniformly from [−1, +1] and interaction strengths θ(xi, xj) from [−a, +a] for varying a. [sent-186, score-0.352]
72 Top Left: Bethe entropy for varying grid sizes, with a = 1. [sent-187, score-0.311]
73 Top Right: Bethe entropy with a grid size of 32 and varying interaction strengths a. [sent-189, score-0.397]
74 Bottom Left: Varying grid sizes with the TRW entropy. [sent-191, score-0.079]
75 Bottom Right: TRW entropy with a grid size of 32 and varying interactions. [sent-192, score-0.311]
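A minimal sketch of how random grid potentials like those described above (univariate terms uniform on [−1, +1], interactions on [−a, +a]) might be generated; the array layout, seed, and function name are assumptions, not the paper's code:

```python
import numpy as np

def random_grid_potentials(size, n_states=5, a=1.0, seed=0):
    """Random pairwise grid model: univariate terms uniform on [-1, +1] and
    interaction terms uniform on [-a, +a], one table per grid edge."""
    rng = np.random.default_rng(seed)
    unary = rng.uniform(-1.0, 1.0, size=(size, size, n_states))
    horiz = rng.uniform(-a, a, size=(size, size - 1, n_states, n_states))
    vert = rng.uniform(-a, a, size=(size - 1, size, n_states, n_states))
    return unary, horiz, vert

unary, horiz, vert = random_grid_potentials(size=8, a=1.0)
print(unary.shape, horiz.shape, vert.shape)
```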
76 If perturbed beliefs are calculated from constant initial messages with a small step, one obtains the same result. [sent-194, score-0.133]
77 Thus, BBP and perturbation give the same gradient for the Bethe approximation. [sent-195, score-0.177]
78 4 Experiments. For inference, we used either loopy belief propagation or tree-reweighted belief propagation. [sent-200, score-0.095]
79 Base code was implemented in Python, with C++ extensions for inference algorithms for efficiency. [sent-205, score-0.113]
80 14-25] to be much slower than the more sophisticated SEQ_FIX mode, based on sequential updates [6, extended]. Table 1: Binary denoising results, comparing the surrogate likelihood against three loss functions fit by implicit differentiation. [sent-212, score-0.47]
81 All loss functions are per-pixel, based on tree-reweighted belief propagation with edge inclusion probabilities of . [sent-213, score-0.318]
82 The “Best Published” results are the lowest previously reported pixelwise test errors using essentially loopy-belief propagation based surrogate likelihood. [sent-215, score-0.25]
83 (Table 1 header: training losses Surrogate likelihood, Clique likelihood, Univariate likelihood, and Smooth Class. Error, each evaluated under the test losses with Train and Test columns.)
85 With the Bethe approximation, perturbation performs similarly to BBP. [sent-286, score-0.119]
86 Enforcing translation invariance gives a total of eight free parameters: four for a(yi, yj), and two each for b(yi) and c(yi). [sent-293, score-0.103]
87 14, recover derivatives with respect to tied parameters. [sent-295, score-0.13]
88 These are binarized by setting yi = 1 if a pixel is above the image mean, and yi = 0 otherwise. [sent-298, score-0.206]
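A one-line sketch of this binarization rule (illustrative only):

```python
import numpy as np

def binarize(image):
    """y_i = 1 for pixels above the image mean, 0 otherwise."""
    return (image > image.mean()).astype(int)

print(binarize(np.array([[0.1, 0.9], [0.4, 0.6]])))
```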
89 The noisy values xi are created by setting xi = yi (1 − t1. [sent-300, score-0.219]
90 The surrogate likelihood performs well, in fact beating the best reported results on the bimodal and Gaussian data. [sent-305, score-0.348]
91 However, the univariate and clique loss functions provide better univariate accuracy. [sent-306, score-0.682]
92 The surrogate likelihood (which is convex) was used to initialize the univariate and clique likelihood, while the univariate likelihood was used to initialize the smooth classification error. [sent-309, score-0.234]
93 Belief optimization for binary networks: A stable alternative to loopy belief propagation. [sent-332, score-0.179]
94 CCCP algorithms to minimize the Bethe and Kikuchi free energies: Convergent alternatives to belief propagation. [sent-338, score-0.16]
95 Constructing free energy approximations and generalized belief propagation algorithms. [sent-347, score-0.256]
96 Convexity arguments for efficient minimization of the Bethe and Kikuchi free energies. [sent-356, score-0.424]
97 Constrained approximate maximum entropy learning of Markov random fields. [sent-363, score-0.25]
98 Convergent message-passing algorithms for inference over general graphs with convex free energies. [sent-366, score-0.197]
99 Exploiting inference for approximate parameter learning in discriminative fields: An empirical study. [sent-386, score-0.166]
100 Accelerated training of conditional random fields with stochastic gradient methods. [sent-395, score-0.095]
wordName wordTfidf (topN-words)
[('dl', 0.333), ('bethe', 0.323), ('bbp', 0.257), ('univariate', 0.231), ('di', 0.215), ('entropy', 0.197), ('marginals', 0.192), ('erentiation', 0.171), ('trw', 0.171), ('surrogate', 0.154), ('trws', 0.143), ('perturbation', 0.119), ('likelihood', 0.119), ('entropies', 0.117), ('loss', 0.113), ('inference', 0.113), ('belief', 0.109), ('polytope', 0.109), ('clique', 0.107), ('marginal', 0.106), ('yi', 0.103), ('erent', 0.102), ('erentiable', 0.1), ('propagation', 0.096), ('symmlq', 0.086), ('implicit', 0.084), ('grid', 0.079), ('bimodal', 0.075), ('heskes', 0.075), ('loopy', 0.07), ('perturbed', 0.066), ('bp', 0.064), ('uai', 0.062), ('convergent', 0.061), ('tied', 0.061), ('ci', 0.061), ('xi', 0.058), ('gradient', 0.058), ('concave', 0.057), ('domke', 0.057), ('libdai', 0.057), ('pert', 0.057), ('sanjiv', 0.057), ('bd', 0.056), ('graphical', 0.055), ('treewidth', 0.054), ('approximate', 0.053), ('strengths', 0.053), ('yj', 0.052), ('free', 0.051), ('kikuchi', 0.05), ('base', 0.047), ('running', 0.047), ('justin', 0.046), ('derivative', 0.045), ('misspeci', 0.043), ('gross', 0.043), ('erence', 0.043), ('erences', 0.043), ('martial', 0.043), ('parametrized', 0.043), ('ganapathi', 0.041), ('message', 0.04), ('whye', 0.039), ('yee', 0.039), ('calculating', 0.039), ('arg', 0.039), ('solvers', 0.038), ('train', 0.038), ('derivatives', 0.038), ('calls', 0.038), ('conditional', 0.037), ('martin', 0.036), ('twice', 0.036), ('varying', 0.035), ('beliefs', 0.035), ('da', 0.035), ('inde', 0.035), ('amir', 0.035), ('yair', 0.035), ('obey', 0.034), ('graphs', 0.033), ('interaction', 0.033), ('globerson', 0.033), ('berkeley', 0.033), ('messages', 0.032), ('direct', 0.032), ('local', 0.032), ('tom', 0.031), ('recover', 0.031), ('inversion', 0.031), ('equality', 0.03), ('versions', 0.03), ('alan', 0.03), ('max', 0.03), ('poorly', 0.03), ('extremely', 0.03), ('paired', 0.029), ('kumar', 0.029), ('kakade', 0.029)]
simIndex simValue paperId paperTitle
same-paper 1 0.99999988 118 nips-2010-Implicit Differentiation by Perturbation
Author: Justin Domke
Abstract: This paper proposes a simple and efficient finite difference method for implicit differentiation of marginal inference results in discrete graphical models. Given an arbitrary loss function, defined on marginals, we show that the derivatives of this loss with respect to model parameters can be obtained by running the inference procedure twice, on slightly perturbed model parameters. This method can be used with approximate inference, with a loss function over approximate marginals. Convenient choices of loss functions make it practical to fit graphical models with hidden variables, high treewidth and/or model misspecification. 1
2 0.14345311 13 nips-2010-A Primal-Dual Message-Passing Algorithm for Approximated Large Scale Structured Prediction
Author: Tamir Hazan, Raquel Urtasun
Abstract: In this paper we propose an approximated structured prediction framework for large scale graphical models and derive message-passing algorithms for learning their parameters efficiently. We first relate CRFs and structured SVMs and show that in CRFs a variant of the log-partition function, known as the soft-max, smoothly approximates the hinge loss function of structured SVMs. We then propose an intuitive approximation for the structured prediction problem, using duality, based on a local entropy approximation and derive an efficient messagepassing algorithm that is guaranteed to converge. Unlike existing approaches, this allows us to learn efficiently graphical models with cycles and very large number of parameters. 1
3 0.12920478 242 nips-2010-Slice sampling covariance hyperparameters of latent Gaussian models
Author: Iain Murray, Ryan P. Adams
Abstract: The Gaussian process (GP) is a popular way to specify dependencies between random variables in a probabilistic model. In the Bayesian framework the covariance structure can be specified using unknown hyperparameters. Integrating over these hyperparameters considers different possible explanations for the data when making predictions. This integration is often performed using Markov chain Monte Carlo (MCMC) sampling. However, with non-Gaussian observations standard hyperparameter sampling approaches require careful tuning and may converge slowly. In this paper we present a slice sampling approach that requires little tuning while mixing well in both strong- and weak-data regimes. 1
4 0.12462141 144 nips-2010-Learning Efficient Markov Networks
Author: Vibhav Gogate, William Webb, Pedro Domingos
Abstract: We present an algorithm for learning high-treewidth Markov networks where inference is still tractable. This is made possible by exploiting context-specific independence and determinism in the domain. The class of models our algorithm can learn has the same desirable properties as thin junction trees: polynomial inference, closed-form weight learning, etc., but is much broader. Our algorithm searches for a feature that divides the state space into subspaces where the remaining variables decompose into independent subsets (conditioned on the feature and its negation) and recurses on each subspace/subset of variables until no useful new features can be found. We provide probabilistic performance guarantees for our algorithm under the assumption that the maximum feature length is bounded by a constant k (the treewidth can be much larger) and dependences are of bounded strength. We also propose a greedy version of the algorithm that, while forgoing these guarantees, is much more efficient. Experiments on a variety of domains show that our approach outperforms many state-of-the-art Markov network structure learners. 1
5 0.11747117 49 nips-2010-Computing Marginal Distributions over Continuous Markov Networks for Statistical Relational Learning
Author: Matthias Broecheler, Lise Getoor
Abstract: Continuous Markov random fields are a general formalism to model joint probability distributions over events with continuous outcomes. We prove that marginal computation for constrained continuous MRFs is #P-hard in general and present a polynomial-time approximation scheme under mild assumptions on the structure of the random field. Moreover, we introduce a sampling algorithm to compute marginal distributions and develop novel techniques to increase its efficiency. Continuous MRFs are a general purpose probabilistic modeling tool and we demonstrate how they can be applied to statistical relational learning. On the problem of collective classification, we evaluate our algorithm and show that the standard deviation of marginals serves as a useful measure of confidence. 1
6 0.11156607 239 nips-2010-Sidestepping Intractable Inference with Structured Ensemble Cascades
7 0.10905297 283 nips-2010-Variational Inference over Combinatorial Spaces
8 0.098431095 257 nips-2010-Structured Determinantal Point Processes
9 0.094414301 169 nips-2010-More data means less inference: A pseudo-max approach to structured learning
10 0.090644389 92 nips-2010-Fast global convergence rates of gradient methods for high-dimensional statistical recovery
11 0.085507311 225 nips-2010-Relaxed Clipping: A Global Training Method for Robust Regression and Classification
12 0.08101584 53 nips-2010-Copula Bayesian Networks
13 0.079878889 52 nips-2010-Convex Multiple-Instance Learning by Estimating Likelihood Ratio
14 0.078046724 126 nips-2010-Inference with Multivariate Heavy-Tails in Linear Models
15 0.076489545 101 nips-2010-Gaussian sampling by local perturbations
16 0.075677298 177 nips-2010-Multitask Learning without Label Correspondences
17 0.073604159 33 nips-2010-Approximate inference in continuous time Gaussian-Jump processes
18 0.073381826 80 nips-2010-Estimation of Renyi Entropy and Mutual Information Based on Generalized Nearest-Neighbor Graphs
19 0.072199076 32 nips-2010-Approximate Inference by Compilation to Arithmetic Circuits
20 0.070637897 83 nips-2010-Evidence-Specific Structures for Rich Tractable CRFs
topicId topicWeight
[(0, 0.206), (1, 0.053), (2, 0.054), (3, 0.006), (4, -0.085), (5, 0.006), (6, -0.033), (7, 0.052), (8, 0.054), (9, -0.026), (10, -0.196), (11, -0.066), (12, 0.09), (13, 0.005), (14, -0.089), (15, -0.033), (16, -0.001), (17, 0.03), (18, 0.053), (19, -0.012), (20, -0.025), (21, -0.033), (22, 0.082), (23, -0.124), (24, -0.055), (25, -0.14), (26, 0.083), (27, -0.114), (28, 0.101), (29, 0.018), (30, 0.014), (31, -0.073), (32, -0.031), (33, 0.0), (34, -0.023), (35, 0.032), (36, -0.144), (37, -0.044), (38, 0.04), (39, -0.051), (40, 0.015), (41, 0.009), (42, 0.002), (43, -0.034), (44, 0.09), (45, -0.057), (46, 0.004), (47, -0.016), (48, 0.026), (49, -0.083)]
simIndex simValue paperId paperTitle
same-paper 1 0.94181228 118 nips-2010-Implicit Differentiation by Perturbation
Author: Justin Domke
Abstract: This paper proposes a simple and efficient finite difference method for implicit differentiation of marginal inference results in discrete graphical models. Given an arbitrary loss function, defined on marginals, we show that the derivatives of this loss with respect to model parameters can be obtained by running the inference procedure twice, on slightly perturbed model parameters. This method can be used with approximate inference, with a loss function over approximate marginals. Convenient choices of loss functions make it practical to fit graphical models with hidden variables, high treewidth and/or model misspecification. 1
2 0.75133079 13 nips-2010-A Primal-Dual Message-Passing Algorithm for Approximated Large Scale Structured Prediction
Author: Tamir Hazan, Raquel Urtasun
Abstract: In this paper we propose an approximated structured prediction framework for large scale graphical models and derive message-passing algorithms for learning their parameters efficiently. We first relate CRFs and structured SVMs and show that in CRFs a variant of the log-partition function, known as the soft-max, smoothly approximates the hinge loss function of structured SVMs. We then propose an intuitive approximation for the structured prediction problem, using duality, based on a local entropy approximation and derive an efficient messagepassing algorithm that is guaranteed to converge. Unlike existing approaches, this allows us to learn efficiently graphical models with cycles and very large number of parameters. 1
3 0.69014442 283 nips-2010-Variational Inference over Combinatorial Spaces
Author: Alexandre Bouchard-côté, Michael I. Jordan
Abstract: Since the discovery of sophisticated fully polynomial randomized algorithms for a range of #P problems [1, 2, 3], theoretical work on approximate inference in combinatorial spaces has focused on Markov chain Monte Carlo methods. Despite their strong theoretical guarantees, the slow running time of many of these randomized algorithms and the restrictive assumptions on the potentials have hindered the applicability of these algorithms to machine learning. Because of this, in applications to combinatorial spaces simple exact models are often preferred to more complex models that require approximate inference [4]. Variational inference would appear to provide an appealing alternative, given the success of variational methods for graphical models [5]; unfortunately, however, it is not obvious how to develop variational approximations for combinatorial objects such as matchings, partial orders, plane partitions and sequence alignments. We propose a new framework that extends variational inference to a wide range of combinatorial spaces. Our method is based on a simple assumption: the existence of a tractable measure factorization, which we show holds in many examples. Simulations on a range of matching models show that the algorithm is more general and empirically faster than a popular fully polynomial randomized algorithm. We also apply the framework to the problem of multiple alignment of protein sequences, obtaining state-of-the-art results on the BAliBASE dataset [6]. 1
4 0.6651575 32 nips-2010-Approximate Inference by Compilation to Arithmetic Circuits
Author: Daniel Lowd, Pedro Domingos
Abstract: Arithmetic circuits (ACs) exploit context-specific independence and determinism to allow exact inference even in networks with high treewidth. In this paper, we introduce the first ever approximate inference methods using ACs, for domains where exact inference remains intractable. We propose and evaluate a variety of techniques based on exact compilation, forward sampling, AC structure learning, Markov network parameter learning, variational inference, and Gibbs sampling. In experiments on eight challenging real-world domains, we find that the methods based on sampling and learning work best: one such method (AC2 -F) is faster and usually more accurate than loopy belief propagation, mean field, and Gibbs sampling; another (AC2 -G) has a running time similar to Gibbs sampling but is consistently more accurate than all baselines. 1
5 0.65693724 126 nips-2010-Inference with Multivariate Heavy-Tails in Linear Models
Author: Danny Bickson, Carlos Guestrin
Abstract: Heavy-tailed distributions naturally occur in many real life problems. Unfortunately, it is typically not possible to compute inference in closed-form in graphical models which involve such heavy-tailed distributions. In this work, we propose a novel simple linear graphical model for independent latent random variables, called linear characteristic model (LCM), defined in the characteristic function domain. Using stable distributions, a heavy-tailed family of distributions which is a generalization of Cauchy, L´ vy and Gaussian distrie butions, we show for the first time, how to compute both exact and approximate inference in such a linear multivariate graphical model. LCMs are not limited to stable distributions, in fact LCMs are always defined for any random variables (discrete, continuous or a mixture of both). We provide a realistic problem from the field of computer networks to demonstrate the applicability of our construction. Other potential application is iterative decoding of linear channels with non-Gaussian noise. 1
6 0.60828394 239 nips-2010-Sidestepping Intractable Inference with Structured Ensemble Cascades
7 0.60536361 84 nips-2010-Exact inference and learning for cumulative distribution functions on loopy graphs
8 0.5955497 257 nips-2010-Structured Determinantal Point Processes
9 0.57355946 159 nips-2010-Lifted Inference Seen from the Other Side : The Tractable Features
10 0.56196815 49 nips-2010-Computing Marginal Distributions over Continuous Markov Networks for Statistical Relational Learning
11 0.55792046 169 nips-2010-More data means less inference: A pseudo-max approach to structured learning
12 0.5541001 83 nips-2010-Evidence-Specific Structures for Rich Tractable CRFs
13 0.54946339 61 nips-2010-Direct Loss Minimization for Structured Prediction
14 0.54763407 255 nips-2010-Static Analysis of Binary Executables Using Structural SVMs
15 0.53535426 113 nips-2010-Heavy-Tailed Process Priors for Selective Shrinkage
16 0.53404367 33 nips-2010-Approximate inference in continuous time Gaussian-Jump processes
17 0.50102317 202 nips-2010-Parallelized Stochastic Gradient Descent
18 0.49891832 53 nips-2010-Copula Bayesian Networks
19 0.49627772 54 nips-2010-Copula Processes
20 0.47792566 144 nips-2010-Learning Efficient Markov Networks
topicId topicWeight
[(13, 0.035), (17, 0.041), (27, 0.038), (30, 0.086), (45, 0.249), (46, 0.244), (50, 0.087), (52, 0.034), (60, 0.045), (77, 0.047), (90, 0.015)]
simIndex simValue paperId paperTitle
1 0.83824652 271 nips-2010-Tiled convolutional neural networks
Author: Jiquan Ngiam, Zhenghao Chen, Daniel Chia, Pang W. Koh, Quoc V. Le, Andrew Y. Ng
Abstract: Convolutional neural networks (CNNs) have been successfully applied to many tasks such as digit and object recognition. Using convolutional (tied) weights significantly reduces the number of parameters that have to be learned, and also allows translational invariance to be hard-coded into the architecture. In this paper, we consider the problem of learning invariances, rather than relying on hardcoding. We propose tiled convolution neural networks (Tiled CNNs), which use a regular “tiled” pattern of tied weights that does not require that adjacent hidden units share identical weights, but instead requires only that hidden units k steps away from each other to have tied weights. By pooling over neighboring units, this architecture is able to learn complex invariances (such as scale and rotational invariance) beyond translational invariance. Further, it also enjoys much of CNNs’ advantage of having a relatively small number of learned parameters (such as ease of learning and greater scalability). We provide an efficient learning algorithm for Tiled CNNs based on Topographic ICA, and show that learning complex invariant features allows us to achieve highly competitive results for both the NORB and CIFAR-10 datasets. 1
same-paper 2 0.82216996 118 nips-2010-Implicit Differentiation by Perturbation
Author: Justin Domke
Abstract: This paper proposes a simple and efficient finite difference method for implicit differentiation of marginal inference results in discrete graphical models. Given an arbitrary loss function, defined on marginals, we show that the derivatives of this loss with respect to model parameters can be obtained by running the inference procedure twice, on slightly perturbed model parameters. This method can be used with approximate inference, with a loss function over approximate marginals. Convenient choices of loss functions make it practical to fit graphical models with hidden variables, high treewidth and/or model misspecification. 1
3 0.75433707 49 nips-2010-Computing Marginal Distributions over Continuous Markov Networks for Statistical Relational Learning
Author: Matthias Broecheler, Lise Getoor
Abstract: Continuous Markov random fields are a general formalism to model joint probability distributions over events with continuous outcomes. We prove that marginal computation for constrained continuous MRFs is #P-hard in general and present a polynomial-time approximation scheme under mild assumptions on the structure of the random field. Moreover, we introduce a sampling algorithm to compute marginal distributions and develop novel techniques to increase its efficiency. Continuous MRFs are a general purpose probabilistic modeling tool and we demonstrate how they can be applied to statistical relational learning. On the problem of collective classification, we evaluate our algorithm and show that the standard deviation of marginals serves as a useful measure of confidence. 1
4 0.75068635 87 nips-2010-Extended Bayesian Information Criteria for Gaussian Graphical Models
Author: Rina Foygel, Mathias Drton
Abstract: Gaussian graphical models with sparsity in the inverse covariance matrix are of significant interest in many modern applications. For the problem of recovering the graphical structure, information criteria provide useful optimization objectives for algorithms searching through sets of graphs or for selection of tuning parameters of other methods such as the graphical lasso, which is a likelihood penalization technique. In this paper we establish the consistency of an extended Bayesian information criterion for Gaussian graphical models in a scenario where both the number of variables p and the sample size n grow. Compared to earlier work on the regression case, our treatment allows for growth in the number of non-zero parameters in the true model, which is necessary in order to cover connected graphs. We demonstrate the performance of this criterion on simulated data when used in conjunction with the graphical lasso, and verify that the criterion indeed performs better than either cross-validation or the ordinary Bayesian information criterion when p and the number of non-zero parameters q both scale with n. 1
5 0.74894387 155 nips-2010-Learning the context of a category
Author: Dan Navarro
Abstract: This paper outlines a hierarchical Bayesian model for human category learning that learns both the organization of objects into categories, and the context in which this knowledge should be applied. The model is fit to multiple data sets, and provides a parsimonious method for describing how humans learn context specific conceptual representations.
6 0.74871296 63 nips-2010-Distributed Dual Averaging In Networks
7 0.74866939 257 nips-2010-Structured Determinantal Point Processes
8 0.74755168 148 nips-2010-Learning Networks of Stochastic Differential Equations
9 0.74661815 287 nips-2010-Worst-Case Linear Discriminant Analysis
10 0.7464425 202 nips-2010-Parallelized Stochastic Gradient Descent
11 0.74545401 243 nips-2010-Smoothness, Low Noise and Fast Rates
12 0.7441262 92 nips-2010-Fast global convergence rates of gradient methods for high-dimensional statistical recovery
13 0.74242795 85 nips-2010-Exact learning curves for Gaussian process regression on large random graphs
14 0.74220675 239 nips-2010-Sidestepping Intractable Inference with Structured Ensemble Cascades
15 0.74209392 13 nips-2010-A Primal-Dual Message-Passing Algorithm for Approximated Large Scale Structured Prediction
16 0.7415123 216 nips-2010-Probabilistic Inference and Differential Privacy
17 0.74112576 51 nips-2010-Construction of Dependent Dirichlet Processes based on Poisson Processes
18 0.74036574 164 nips-2010-MAP Estimation for Graphical Models by Likelihood Maximization
19 0.74029368 201 nips-2010-PAC-Bayesian Model Selection for Reinforcement Learning
20 0.73980653 290 nips-2010-t-logistic regression