nips nips2010 nips2010-118 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Justin Domke
Abstract: This paper proposes a simple and efficient finite difference method for implicit differentiation of marginal inference results in discrete graphical models. Given an arbitrary loss function, defined on marginals, we show that the derivatives of this loss with respect to model parameters can be obtained by running the inference procedure twice, on slightly perturbed model parameters. This method can be used with approximate inference, with a loss function over approximate marginals. Convenient choices of loss functions make it practical to fit graphical models with hidden variables, high treewidth and/or model misspecification. 1
Reference: text
sentIndex sentText sentNum sentScore
1 edu Abstract This paper proposes a simple and efficient finite difference method for implicit differentiation of marginal inference results in discrete graphical models. [sent-3, score-0.358]
2 Given an arbitrary loss function, defined on marginals, we show that the derivatives of this loss with respect to model parameters can be obtained by running the inference procedure twice, on slightly perturbed model parameters. [sent-4, score-0.49]
3 This method can be used with approximate inference, with a loss function over approximate marginals. [sent-5, score-0.219]
4 Convenient choices of loss functions make it practical to fit graphical models with hidden variables, high treewidth and/or model misspecification. [sent-6, score-0.222]
5 Though the likelihood and conditional likelihood are the most widespread training objectives, these are sometimes undesirable and/or infeasible in real applications. [sent-8, score-0.275]
6 With low treewidth, if the data is truly distributed according to the chosen graphical model with some parameters, any consistent loss function will recover those true parameters in the high-data limit, and so one might select a loss function according to statistical convergence rates [1]. [sent-9, score-0.312]
7 In this case, different loss functions lead to different asymptotic parameter estimates. [sent-11, score-0.113]
8 For low-treewidth graphs, several loss functions have been proposed that prioritize different types of accuracy (section 2. [sent-13, score-0.113]
9 For parameters θ, these loss functions are given as a function L(µ(θ)) of marginals µ(θ). [sent-15, score-0.305]
10 The likelihood may also be infeasible to optimize, due to the computational intractability of computing the log-partition function or its derivatives in high treewidth graphs. [sent-18, score-0.211]
11 On the other hand, if an approximate inference algorithm will be used at test time, it is logical to design the loss function to compensate for defects in inference. [sent-19, score-0.279]
12 The surrogate likelihood (the likelihood with an approximate partition function) can give superior results to the true likelihood when approximate inference is used at test time [4]. [sent-20, score-0.611]
13 If µ(θ) is the function mapping parameters to (approximate) marginals, and there is some loss function L(µ) defined on those marginals, we desire to recover dL/dθ. [sent-22, score-0.144]
14 This enables the use of the marginal-based loss functions mentioned previously, but defined on approximate marginals. [sent-23, score-0.166]
15 The major disadvantage of this approach is that standard linear solvers can perform poorly on large graphs, meaning that calculating this gradient can be more expensive than performing inference (Section 4). [sent-26, score-0.425] [sent-28, score-0.21]
16 Figure 1: Example images from the Berkeley dataset, along with marginals for a conditional random field fit with various loss functions (panels: True (y), Noisy (x), Surrogate likelihood, Clique likelihood, Univariate likelihood, Smooth class. error). [sent-27, score-0.342]
18 However, it is tied to a specific entropy approximation (Bethe) and algorithm (Loopy Belief Propagation). [sent-33, score-0.258]
19 Extension to similar message-passing algorithms appears possible, but extension to more complex inference algorithms [7, 8, 9] is unclear. [sent-34, score-0.113]
20 Here, we observe that the loss gradient can be calculated by far more straightforward means. [sent-35, score-0.171]
21 This result follows from, first, the well-known trick of approximating Jacobian-vector products by finite differences and, second, the special property that for marginal inference, the Jacobian matrix dµ/dθᵀ is symmetric. [sent-37, score-0.106]
22 This result applies when marginal inference takes place over the local polytope with an entropy that is concave and obeys a minor technical condition. [sent-38, score-0.614]
23 It can also be used with non-concave entropies, with some assumptions on how inference recovers different local optima. [sent-39, score-0.145]
24 It is easy to use this to compute the gradient of essentially any differentiable loss function defined on marginals. [sent-40, score-0.171]
25 Effectively, all one needs to do is re-run the inference procedure on a set of parameters slightly "perturbed" in the direction ∂L/∂µ. [sent-41, score-0.113]
26 Aside from this, like the matrix inversion approach, it is independent of the algorithm used to perform inference, and applicable to a variety of different inference approximations. [sent-44, score-0.144]
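As a minimal illustration of the two ingredients above (finite-difference Jacobian-vector products and the symmetry of dµ/dθᵀ), here is a hedged Python sketch; the toy softmax map and the function names are illustrative stand-ins, not the paper's code:

```python
import numpy as np

def jacobian_vector_product(mu, theta, v, r=1e-4):
    """Central-difference approximation of (dmu/dtheta) v: run the map twice,
    on parameters perturbed in the direction v."""
    return (mu(theta + r * v) - mu(theta - r * v)) / (2.0 * r)

# Toy map whose Jacobian is symmetric: a softmax, i.e. the gradient of a
# log-partition function A(theta), mimicking the structure of marginal inference.
A = lambda th: np.logaddexp.reduce(th)
mu = lambda th: np.exp(th - A(th))

theta = np.array([0.2, -0.5, 1.0])
v = np.array([1.0, 0.0, -2.0])
# Because the Jacobian is symmetric, this is also the vector-Jacobian product
# needed in the chain rule dL/dtheta = (dmu/dtheta)^T dL/dmu.
print(jacobian_vector_product(mu, theta, v))
```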
27 1 Background: Marginal Inference. This section briefly reviews the aspects of graphical models and marginal inference that are required for the rest of the paper. [sent-47, score-0.274]
28 (2) Marginal inference means recovering the expected value of f or, equivalently, the marginal probability that each factor or variable has a particular value. [sent-56, score-0.219]
29 µ(θ) = Σ_x p(x; θ) f(x) (3) [sent-57, score-0.192]
30 Though marginals could, in principle, be computed by the brute-force sum in Eq. 3, it is useful to consider the paired variational representation [10, Chapter 3]: A(θ) = max_{µ∈M} θ · µ + H(µ) (4), µ(θ) = dA/dθ = arg max_{µ∈M} θ · µ + H(µ) (5), in which A and µ can both be recovered from solving the same optimization problem. [sent-58, score-0.128]
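A small sketch of the brute-force sum in Eq. 3 for a toy model; the indicator features and the exponential-family form p(x; θ) ∝ exp(θ · f(x)) are assumptions made purely for illustration:

```python
import itertools
import numpy as np

def brute_force_marginals(theta, f, n_vars, n_states):
    """mu(theta) = sum_x p(x; theta) f(x), computed by enumerating every
    configuration x (only feasible for tiny models)."""
    configs = list(itertools.product(range(n_states), repeat=n_vars))
    scores = np.array([theta @ f(x) for x in configs])
    p = np.exp(scores - np.logaddexp.reduce(scores))  # normalized probabilities
    return sum(pi * f(x) for pi, x in zip(p, configs))

# Two binary variables with one indicator feature per joint state.
def f(x):
    feat = np.zeros(4)
    feat[2 * x[0] + x[1]] = 1.0
    return feat

print(brute_force_marginals(np.array([0.0, 0.5, 0.5, 1.0]), f, n_vars=2, n_states=2))
```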
31 Here, M = {µ(θ) | θ ∈ ℜⁿ} is the marginal polytope, i.e., those marginals µ resulting from some parameter vector θ. [sent-59, score-0.298]
32 Similarly, H(µ) is the entropy of p(x; θ′), where θ′ is the vector of parameters that produces the marginals µ. [sent-60, score-0.389]
33 A variety of approximate inference methods can be seen as solving a modification of Eqs. [sent-64, score-0.166]
34 4 and 5, with the marginal polytope and entropy replaced with tractable approximations. [sent-65, score-0.412]
35 Notice that these are also paired; the approximate µ is the exact gradient of the approximate A. [sent-66, score-0.164]
36 The most common relaxation of M is the local polytope L = {µ ≥ 0 | Σ_{xα\i} µ(xα) = µ(xi), Σ_{xi} µ(xi) = 1} (6). [sent-67, score-0.141]
37 This underlies loopy belief propagation, as well as tree-reweighted belief propagation. [sent-68, score-0.346]
38 Since a valid set of marginals must obey these constraints, L ⊇ M. [sent-69, score-0.226]
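A minimal sketch of checking the local-consistency constraints of Eq. 6 for a single edge; the data layout is an illustrative assumption, not taken from the paper:

```python
import numpy as np

def in_local_polytope(pairwise, mu_i, mu_j, tol=1e-8):
    """Check the local-consistency constraints of Eq. 6 for one edge (i, j):
    nonnegativity, normalization of the univariate marginals, and agreement of
    the pairwise marginal with each univariate marginal."""
    ok = np.all(pairwise >= -tol)
    ok &= np.isclose(mu_i.sum(), 1.0, atol=tol) and np.isclose(mu_j.sum(), 1.0, atol=tol)
    ok &= np.allclose(pairwise.sum(axis=1), mu_i, atol=tol)
    ok &= np.allclose(pairwise.sum(axis=0), mu_j, atol=tol)
    return bool(ok)

mu_ij = np.array([[0.3, 0.2], [0.1, 0.4]])
print(in_local_polytope(mu_ij, mu_ij.sum(axis=1), mu_ij.sum(axis=0)))
```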
39 The Bethe approximation implicit in loopy belief propagation [11] is non-concave in general, which means inference sometimes fails to reach the global optimum. [sent-72, score-0.359]
40 Concave entropy functions include the tree-reweighted entropy [12], convexified Bethe entropies [13], and the class of entropies obeying Heskes’ conditions [14]. [sent-73, score-0.628]
41 The (negative) likelihood is the classic loss function for training graphical models. [sent-77, score-0.287]
42 The surrogate likelihood [4] is simply the likelihood as in Eq. [sent-85, score-0.392]
43 Unlike the losses below, the surrogate likelihood is convex when based on a concave inference method. [sent-89, score-0.443]
44 [15] for a variant of this for inference with local optima. [sent-91, score-0.145]
45 If the application will only make use of univariate marginals at test time, one might fit parameters specifically to make these univariate marginals accurate. [sent-93, score-0.846]
46 [3] proposed the loss L(x̂; θ) = −Σ_i log µ(x̂i; θ) (11). [sent-95, score-0.113]
47 This can be computed in tree-like graphs, after running belief propagation to compute marginals. [sent-96, score-0.252]
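A hedged sketch of the univariate likelihood loss of Eq. 11, assuming the (approximate) univariate marginals have already been computed by some inference routine; array shapes are illustrative:

```python
import numpy as np

def univariate_likelihood_loss(marginals, x_hat, eps=1e-12):
    """Eq. 11: L(x_hat; theta) = - sum_i log mu(x_hat_i; theta), evaluated on
    precomputed (approximate) univariate marginals.

    marginals : (n_vars, n_states) array, each row a univariate marginal
    x_hat     : length-n_vars integer array of observed values
    """
    probs = marginals[np.arange(len(x_hat)), x_hat]
    return -np.sum(np.log(probs + eps))

print(univariate_likelihood_loss(np.array([[0.9, 0.1], [0.3, 0.7]]), np.array([0, 1])))
```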
48 [2] considered the loss L(x̂; θ) = Σ_i S( max_{xi ≠ x̂i} µ(xi; θ) − µ(x̂i; θ) ) (12), where S is the step function. [sent-101, score-0.201]
49 This loss measures the number of incorrect components of x̂ if each is predicted to be the "max marginal". [sent-102, score-0.113]
50 As with the univariate likelihood, this loss can be computed if exact marginals are available. [sent-105, score-0.536]
51 One can easily define clique versions of the previous two loss functions, where the summations are over α, rather than i. [sent-108, score-0.25]
52 These measure the accuracy of clique-wise marginals, rather than univariate marginals. [sent-109, score-0.231]
53 7, the equality constraints in the local polytope are linear, and hence when the positivity constraint can be disregarded, approximate marginal inference algorithms can be seen as solving the optimization µ(θ) = arg max_{µ: Bµ=d} θ · µ + H(µ). [sent-112, score-0.482]
54 Domke showed [5], in our notation, that dL/dθ = (D⁻¹Bᵀ(BD⁻¹Bᵀ)⁻¹BD⁻¹ − D⁻¹) dL/dµ (13), where D = ∂²H/∂µ∂µᵀ is the (diagonal) Hessian of the entropy approximation. [sent-113, score-0.197]
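For comparison, a dense linear-algebra sketch of Eq. 13; the small random D, B and dL/dµ below are purely illustrative, and in practice these matrices are large and sparse, which is what makes this route expensive:

```python
import numpy as np

def loss_gradient_eq13(D_diag, B, dL_dmu):
    """Eq. 13: dL/dtheta = (D^-1 B^T (B D^-1 B^T)^-1 B D^-1 - D^-1) dL/dmu,
    with D the diagonal Hessian of the entropy approximation and B the
    equality-constraint matrix of the local polytope."""
    Dinv = 1.0 / D_diag              # D is diagonal, stored as a vector
    BDinv = B * Dinv                 # B D^-1 (scale each column of B)
    middle = np.linalg.solve(BDinv @ B.T, BDinv @ dL_dmu)
    return Dinv * (B.T @ middle) - Dinv * dL_dmu

rng = np.random.default_rng(0)
B = rng.standard_normal((3, 8))
D_diag = -rng.uniform(1.0, 2.0, 8)   # the entropy Hessian is negative definite
print(loss_gradient_eq13(D_diag, B, rng.standard_normal(8)))
```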
55 Nonlinear and tied parameters are easily dealt with by considering θ(φ) to be a function of the "true" parameters φ. Algorithm 1: Calculating loss derivatives (two-sided). [sent-122, score-0.212]
56 µ+ ← arg max_{µ∈M} (θ + r ∂L/∂µ) · µ + H(µ); µ− ← arg max_{µ∈M} (θ − r ∂L/∂µ) · µ + H(µ); dL/dθ ← (µ+ − µ−)/(2r). [sent-131, score-0.138]
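A minimal sketch of the two-sided Algorithm 1 above, with a black-box `infer` standing in for whatever approximate marginal inference routine is used; the function name and signature are assumptions:

```python
def loss_gradient_by_perturbation(infer, theta, dL_dmu, r=1e-4):
    """Two-sided Algorithm 1: re-run inference on parameters perturbed in the
    direction dL/dmu and take a central difference.

    infer  : black-box marginal inference, e.g. loopy or tree-reweighted BP
    theta  : parameter vector (numpy array)
    dL_dmu : gradient of the loss with respect to the marginals at mu(theta)
    r      : perturbation size
    """
    mu_plus = infer(theta + r * dL_dmu)
    mu_minus = infer(theta - r * dL_dmu)
    return (mu_plus - mu_minus) / (2.0 * r)
```

With the toy softmax map from the earlier sketch used as `infer`, this reproduces dL/dθ for any differentiable loss on the marginals, up to finite-difference error.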
57 dL/dφ = (dθᵀ/dφ) dL/dθ (14). Conditional training is similar: define a distribution over a random variable y, parametrized by θ(φ; x); the derivative on a particular pair (x, y) is given again by Eq. [sent-135, score-0.088]
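A small sketch of the chain rule in Eq. 14 for the common special case of linearly tied parameters θ = Tφ; the tying matrix T here is a hypothetical example:

```python
import numpy as np

def tied_parameter_gradient(T, dL_dtheta):
    """Eq. 14 for linear tying theta = T @ phi: dL/dphi = T^T dL/dtheta,
    so tied copies of a free parameter simply accumulate their gradients."""
    return T.T @ dL_dtheta

# Three entries of theta tied to phi[0], one entry tied to phi[1].
T = np.array([[1.0, 0.0],
              [1.0, 0.0],
              [1.0, 0.0],
              [0.0, 1.0]])
print(tied_parameter_gradient(T, np.array([0.2, -0.1, 0.4, 1.0])))
```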
58 3 Implicit Differentiation by Perturbation. This section shows that when µ(θ) = arg max_{µ∈L} θ · µ + H(µ), the loss gradient can be computed by Alg. [sent-138, score-0.21]
59 1 for a concave entropy approximation of the form H(µ) = − Σ_α cα Σ_{xα} µ(xα) log µ(xα) − Σ_i ci Σ_{xi} µ(xi) log µ(xi) (15), when the counting numbers c obey (as is true of most proposed entropies) cα > 0 and ci + Σ_{α: i∈α} cα > 0 (16). [sent-139, score-0.407]
60 For intuition, the following Lemma uses notation (µ, θ, H) suggesting the application to marginal inference. [sent-140, score-0.167]
61 If µ(θ) = arg max_µ µ · θ + H(µ) (17) s.t. Bµ − d = 0 (18), where H(µ) is strictly convex and twice differentiable, then dµ/dθᵀ exists and is symmetric. [sent-144, score-0.105]
62 θ + ∂H(µ)/∂µ + Bᵀλ = 0, Bµ − d = 0 (20). Recall the general implicit function theorem. [sent-148, score-0.084]
63 Again, this uses notation suggesting the application to implicit differentiation and marginal inference, but holds true for any functions satisfying the stated conditions. [sent-155, score-0.19]
64 1 follows from applying this theorem to marginal inference. [sent-165, score-0.106]
65 If H(µ) = Σ_α cα H(µα) + Σ_i ci H(µi), and µ* is a (possibly local) maximum of θ · µ + H(µ) under the local polytope L, then cα > 0 and ci + Σ_{α: i∈α} cα > 0 imply µ* > 0. [sent-169, score-0.263]
66 With a non-concave entropy this condition is still valid, but not unique, since there can be local optima. [sent-176, score-0.229]
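A short sketch of the counting-number condition in Eqs. 15-16; the dictionary-based representation is an illustrative assumption:

```python
def counting_numbers_ok(c_alpha, c_i, cliques):
    """Condition of Eq. 16: every clique counting number is positive, and for
    every variable i, c_i plus the sum of c_alpha over cliques containing i
    is positive."""
    if any(c <= 0 for c in c_alpha.values()):
        return False
    return all(c_i[i] + sum(c_alpha[a] for a in cliques if i in a) > 0
               for i in c_i)

# Bethe-style counting numbers on a 3-cycle: c_alpha = 1 per edge,
# c_i = 1 - degree(i) = -1 per variable.
edges = [(0, 1), (1, 2), (0, 2)]
print(counting_numbers_ok({e: 1.0 for e in edges}, {i: -1.0 for i in range(3)}, edges))
```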
67 BBP essentially calculates this. [sent-177, score-0.606]
68 (Figure 2 panels: Bethe entropy and TRW entropy, plotting running time (s) against grid size and interaction strength for pert-BP / pert-TRWS, symmlq, BBP, direct, and BP / TRWS.) [sent-178, score-0.718]
69 Figure 2: Times to compute dL/dθ by perturbation, Back Belief Propagation (BBP), sparse matrix factorization (direct) and the iterative symmetric-LQ method (symmlq). [sent-179, score-0.112]
70 As these results use two-sided differences, perturbation always takes twice the running time of the base inference algorithm. [sent-181, score-0.362]
71 , 5}, with univariate terms θ(xi) taken uniformly from [−1, +1] and interaction strengths θ(xi, xj) from [−a, +a] for varying a. [sent-186, score-0.352]
72 Top Left: Bethe entropy for varying grid sizes, with a = 1. [sent-187, score-0.311]
73 Top Right: Bethe entropy with a grid size of 32 and varying interaction strengths a. [sent-189, score-0.397]
74 Bottom Left: Varying grid sizes with the TRW entropy. [sent-191, score-0.079]
75 Bottom Right: TRW entropy with a grid size of 32 and varying interactions. [sent-192, score-0.311]
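A minimal sketch of how random grid potentials like those described above (univariate terms uniform on [−1, +1], interactions on [−a, +a]) might be generated; the array layout, seed, and function name are assumptions, not the paper's code:

```python
import numpy as np

def random_grid_potentials(size, n_states=5, a=1.0, seed=0):
    """Random pairwise grid model: univariate terms uniform on [-1, +1] and
    interaction terms uniform on [-a, +a], one table per grid edge."""
    rng = np.random.default_rng(seed)
    unary = rng.uniform(-1.0, 1.0, size=(size, size, n_states))
    horiz = rng.uniform(-a, a, size=(size, size - 1, n_states, n_states))
    vert = rng.uniform(-a, a, size=(size - 1, size, n_states, n_states))
    return unary, horiz, vert

unary, horiz, vert = random_grid_potentials(size=8, a=1.0)
print(unary.shape, horiz.shape, vert.shape)
```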
76 If perturbed beliefs are calculated from constant initial messages with a small step, one obtains the same result. [sent-194, score-0.133]
77 Thus, BBP and perturbation give the same gradient for the Bethe approximation. [sent-195, score-0.177]
78 4 Experiments. For inference, we used either loopy belief propagation or tree-reweighted belief propagation. [sent-200, score-0.095]
79 Base code was implemented in Python, with C++ extensions for inference algorithms for efficiency. [sent-205, score-0.113]
80 14-25] to be much slower than the more sophisticated SEQ_FIX mode, based on sequential updates [6, extended]. Table 1: Binary denoising results, comparing the surrogate likelihood against three loss functions fit by implicit differentiation. [sent-212, score-0.47]
81 All loss functions are per-pixel, based on tree-reweighted belief propagation with edge inclusion probabilities of . [sent-213, score-0.318]
82 The “Best Published” results are the lowest previously reported pixelwise test errors using essentially loopy-belief propagation based surrogate likelihood. [sent-215, score-0.25]
83 (Table 1 header: training losses Surrogate likelihood, Clique likelihood, Univariate likelihood, and Smooth Class. Error, each evaluated under the test losses with Train and Test columns.)
85 With the Bethe approximation, perturbation performs similarly to BBP. [sent-286, score-0.119]
86 Enforcing translation invariance gives a total of eight free parameters: four for a(yi, yj), and two each for b(yi) and c(yi). [sent-293, score-0.103]
87 14, recover derivatives with respect to tied parameters. [sent-295, score-0.13]
88 These are binarized by setting yi = 1 if a pixel is above the image mean, and yi = 0 otherwise. [sent-298, score-0.206]
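A one-line sketch of this binarization rule (illustrative only):

```python
import numpy as np

def binarize(image):
    """y_i = 1 for pixels above the image mean, 0 otherwise."""
    return (image > image.mean()).astype(int)

print(binarize(np.array([[0.1, 0.9], [0.4, 0.6]])))
```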
89 The noisy values xi are created by setting xi = yi (1 − t1. [sent-300, score-0.219]
90 The surrogate likelihood performs well, in fact beating the best reported results on the bimodal and Gaussian data. [sent-305, score-0.348]
91 However, the univariate and clique loss functions provide better univariate accuracy. [sent-306, score-0.682]
92 The surrogate likelihood (which is convex) was used to initialize the univariate and clique likelihood, while the univariate likelihood was used to initialize the smooth classification error. [sent-309, score-0.234]
93 Belief optimization for binary networks: A stable alternative to loopy belief propagation. [sent-332, score-0.179]
94 CCCP algorithms to minimize the Bethe and Kikuchi free energies: Convergent alternatives to belief propagation. [sent-338, score-0.16]
95 Constructing free energy approximations and generalized belief propagation algorithms. [sent-347, score-0.256]
96 Convexity arguments for efficient minimization of the Bethe and Kikuchi free energies. [sent-356, score-0.424]
97 Constrained approximate maximum entropy learning of Markov random fields. [sent-363, score-0.25]
98 Convergent message-passing algorithms for inference over general graphs with convex free energies. [sent-366, score-0.197]
99 Exploiting inference for approximate parameter learning in discriminative fields: An empirical study. [sent-386, score-0.166]
100 Accelerated training of conditional random fields with stochastic gradient methods. [sent-395, score-0.095]
wordName wordTfidf (topN-words)
[('dl', 0.333), ('bethe', 0.323), ('bbp', 0.257), ('univariate', 0.231), ('di', 0.215), ('entropy', 0.197), ('marginals', 0.192), ('erentiation', 0.171), ('trw', 0.171), ('surrogate', 0.154), ('trws', 0.143), ('perturbation', 0.119), ('likelihood', 0.119), ('entropies', 0.117), ('loss', 0.113), ('inference', 0.113), ('belief', 0.109), ('polytope', 0.109), ('clique', 0.107), ('marginal', 0.106), ('yi', 0.103), ('erent', 0.102), ('erentiable', 0.1), ('propagation', 0.096), ('symmlq', 0.086), ('implicit', 0.084), ('grid', 0.079), ('bimodal', 0.075), ('heskes', 0.075), ('loopy', 0.07), ('perturbed', 0.066), ('bp', 0.064), ('uai', 0.062), ('convergent', 0.061), ('tied', 0.061), ('ci', 0.061), ('xi', 0.058), ('gradient', 0.058), ('concave', 0.057), ('domke', 0.057), ('libdai', 0.057), ('pert', 0.057), ('sanjiv', 0.057), ('bd', 0.056), ('graphical', 0.055), ('treewidth', 0.054), ('approximate', 0.053), ('strengths', 0.053), ('yj', 0.052), ('free', 0.051), ('kikuchi', 0.05), ('base', 0.047), ('running', 0.047), ('justin', 0.046), ('derivative', 0.045), ('misspeci', 0.043), ('gross', 0.043), ('erence', 0.043), ('erences', 0.043), ('martial', 0.043), ('parametrized', 0.043), ('ganapathi', 0.041), ('message', 0.04), ('whye', 0.039), ('yee', 0.039), ('calculating', 0.039), ('arg', 0.039), ('solvers', 0.038), ('train', 0.038), ('derivatives', 0.038), ('calls', 0.038), ('conditional', 0.037), ('martin', 0.036), ('twice', 0.036), ('varying', 0.035), ('beliefs', 0.035), ('da', 0.035), ('inde', 0.035), ('amir', 0.035), ('yair', 0.035), ('obey', 0.034), ('graphs', 0.033), ('interaction', 0.033), ('globerson', 0.033), ('berkeley', 0.033), ('messages', 0.032), ('direct', 0.032), ('local', 0.032), ('tom', 0.031), ('recover', 0.031), ('inversion', 0.031), ('equality', 0.03), ('versions', 0.03), ('alan', 0.03), ('max', 0.03), ('poorly', 0.03), ('extremely', 0.03), ('paired', 0.029), ('kumar', 0.029), ('kakade', 0.029)]
simIndex simValue paperId paperTitle
same-paper 1 0.99999988 118 nips-2010-Implicit Differentiation by Perturbation
Author: Justin Domke
Abstract: This paper proposes a simple and efficient finite difference method for implicit differentiation of marginal inference results in discrete graphical models. Given an arbitrary loss function, defined on marginals, we show that the derivatives of this loss with respect to model parameters can be obtained by running the inference procedure twice, on slightly perturbed model parameters. This method can be used with approximate inference, with a loss function over approximate marginals. Convenient choices of loss functions make it practical to fit graphical models with hidden variables, high treewidth and/or model misspecification. 1
2 0.14345311 13 nips-2010-A Primal-Dual Message-Passing Algorithm for Approximated Large Scale Structured Prediction
Author: Tamir Hazan, Raquel Urtasun
Abstract: In this paper we propose an approximated structured prediction framework for large scale graphical models and derive message-passing algorithms for learning their parameters efficiently. We first relate CRFs and structured SVMs and show that in CRFs a variant of the log-partition function, known as the soft-max, smoothly approximates the hinge loss function of structured SVMs. We then propose an intuitive approximation for the structured prediction problem, using duality, based on a local entropy approximation and derive an efficient messagepassing algorithm that is guaranteed to converge. Unlike existing approaches, this allows us to learn efficiently graphical models with cycles and very large number of parameters. 1
3 0.12920478 242 nips-2010-Slice sampling covariance hyperparameters of latent Gaussian models
Author: Iain Murray, Ryan P. Adams
Abstract: The Gaussian process (GP) is a popular way to specify dependencies between random variables in a probabilistic model. In the Bayesian framework the covariance structure can be specified using unknown hyperparameters. Integrating over these hyperparameters considers different possible explanations for the data when making predictions. This integration is often performed using Markov chain Monte Carlo (MCMC) sampling. However, with non-Gaussian observations standard hyperparameter sampling approaches require careful tuning and may converge slowly. In this paper we present a slice sampling approach that requires little tuning while mixing well in both strong- and weak-data regimes. 1
4 0.12462141 144 nips-2010-Learning Efficient Markov Networks
Author: Vibhav Gogate, William Webb, Pedro Domingos
Abstract: We present an algorithm for learning high-treewidth Markov networks where inference is still tractable. This is made possible by exploiting context-specific independence and determinism in the domain. The class of models our algorithm can learn has the same desirable properties as thin junction trees: polynomial inference, closed-form weight learning, etc., but is much broader. Our algorithm searches for a feature that divides the state space into subspaces where the remaining variables decompose into independent subsets (conditioned on the feature and its negation) and recurses on each subspace/subset of variables until no useful new features can be found. We provide probabilistic performance guarantees for our algorithm under the assumption that the maximum feature length is bounded by a constant k (the treewidth can be much larger) and dependences are of bounded strength. We also propose a greedy version of the algorithm that, while forgoing these guarantees, is much more efficient. Experiments on a variety of domains show that our approach outperforms many state-of-the-art Markov network structure learners. 1
5 0.11747117 49 nips-2010-Computing Marginal Distributions over Continuous Markov Networks for Statistical Relational Learning
Author: Matthias Broecheler, Lise Getoor
Abstract: Continuous Markov random fields are a general formalism to model joint probability distributions over events with continuous outcomes. We prove that marginal computation for constrained continuous MRFs is #P-hard in general and present a polynomial-time approximation scheme under mild assumptions on the structure of the random field. Moreover, we introduce a sampling algorithm to compute marginal distributions and develop novel techniques to increase its efficiency. Continuous MRFs are a general purpose probabilistic modeling tool and we demonstrate how they can be applied to statistical relational learning. On the problem of collective classification, we evaluate our algorithm and show that the standard deviation of marginals serves as a useful measure of confidence. 1
6 0.11156607 239 nips-2010-Sidestepping Intractable Inference with Structured Ensemble Cascades
7 0.10905297 283 nips-2010-Variational Inference over Combinatorial Spaces
8 0.098431095 257 nips-2010-Structured Determinantal Point Processes
9 0.094414301 169 nips-2010-More data means less inference: A pseudo-max approach to structured learning
10 0.090644389 92 nips-2010-Fast global convergence rates of gradient methods for high-dimensional statistical recovery
11 0.085507311 225 nips-2010-Relaxed Clipping: A Global Training Method for Robust Regression and Classification
12 0.08101584 53 nips-2010-Copula Bayesian Networks
13 0.079878889 52 nips-2010-Convex Multiple-Instance Learning by Estimating Likelihood Ratio
14 0.078046724 126 nips-2010-Inference with Multivariate Heavy-Tails in Linear Models
15 0.076489545 101 nips-2010-Gaussian sampling by local perturbations
16 0.075677298 177 nips-2010-Multitask Learning without Label Correspondences
17 0.073604159 33 nips-2010-Approximate inference in continuous time Gaussian-Jump processes
18 0.073381826 80 nips-2010-Estimation of Renyi Entropy and Mutual Information Based on Generalized Nearest-Neighbor Graphs
19 0.072199076 32 nips-2010-Approximate Inference by Compilation to Arithmetic Circuits
20 0.070637897 83 nips-2010-Evidence-Specific Structures for Rich Tractable CRFs
topicId topicWeight
[(0, 0.206), (1, 0.053), (2, 0.054), (3, 0.006), (4, -0.085), (5, 0.006), (6, -0.033), (7, 0.052), (8, 0.054), (9, -0.026), (10, -0.196), (11, -0.066), (12, 0.09), (13, 0.005), (14, -0.089), (15, -0.033), (16, -0.001), (17, 0.03), (18, 0.053), (19, -0.012), (20, -0.025), (21, -0.033), (22, 0.082), (23, -0.124), (24, -0.055), (25, -0.14), (26, 0.083), (27, -0.114), (28, 0.101), (29, 0.018), (30, 0.014), (31, -0.073), (32, -0.031), (33, 0.0), (34, -0.023), (35, 0.032), (36, -0.144), (37, -0.044), (38, 0.04), (39, -0.051), (40, 0.015), (41, 0.009), (42, 0.002), (43, -0.034), (44, 0.09), (45, -0.057), (46, 0.004), (47, -0.016), (48, 0.026), (49, -0.083)]
simIndex simValue paperId paperTitle
same-paper 1 0.94181228 118 nips-2010-Implicit Differentiation by Perturbation
Author: Justin Domke
Abstract: This paper proposes a simple and efficient finite difference method for implicit differentiation of marginal inference results in discrete graphical models. Given an arbitrary loss function, defined on marginals, we show that the derivatives of this loss with respect to model parameters can be obtained by running the inference procedure twice, on slightly perturbed model parameters. This method can be used with approximate inference, with a loss function over approximate marginals. Convenient choices of loss functions make it practical to fit graphical models with hidden variables, high treewidth and/or model misspecification. 1
2 0.75133079 13 nips-2010-A Primal-Dual Message-Passing Algorithm for Approximated Large Scale Structured Prediction
Author: Tamir Hazan, Raquel Urtasun
Abstract: In this paper we propose an approximated structured prediction framework for large scale graphical models and derive message-passing algorithms for learning their parameters efficiently. We first relate CRFs and structured SVMs and show that in CRFs a variant of the log-partition function, known as the soft-max, smoothly approximates the hinge loss function of structured SVMs. We then propose an intuitive approximation for the structured prediction problem, using duality, based on a local entropy approximation and derive an efficient messagepassing algorithm that is guaranteed to converge. Unlike existing approaches, this allows us to learn efficiently graphical models with cycles and very large number of parameters. 1
3 0.69014442 283 nips-2010-Variational Inference over Combinatorial Spaces
Author: Alexandre Bouchard-côté, Michael I. Jordan
Abstract: Since the discovery of sophisticated fully polynomial randomized algorithms for a range of #P problems [1, 2, 3], theoretical work on approximate inference in combinatorial spaces has focused on Markov chain Monte Carlo methods. Despite their strong theoretical guarantees, the slow running time of many of these randomized algorithms and the restrictive assumptions on the potentials have hindered the applicability of these algorithms to machine learning. Because of this, in applications to combinatorial spaces simple exact models are often preferred to more complex models that require approximate inference [4]. Variational inference would appear to provide an appealing alternative, given the success of variational methods for graphical models [5]; unfortunately, however, it is not obvious how to develop variational approximations for combinatorial objects such as matchings, partial orders, plane partitions and sequence alignments. We propose a new framework that extends variational inference to a wide range of combinatorial spaces. Our method is based on a simple assumption: the existence of a tractable measure factorization, which we show holds in many examples. Simulations on a range of matching models show that the algorithm is more general and empirically faster than a popular fully polynomial randomized algorithm. We also apply the framework to the problem of multiple alignment of protein sequences, obtaining state-of-the-art results on the BAliBASE dataset [6]. 1
4 0.6651575 32 nips-2010-Approximate Inference by Compilation to Arithmetic Circuits
Author: Daniel Lowd, Pedro Domingos
Abstract: Arithmetic circuits (ACs) exploit context-specific independence and determinism to allow exact inference even in networks with high treewidth. In this paper, we introduce the first ever approximate inference methods using ACs, for domains where exact inference remains intractable. We propose and evaluate a variety of techniques based on exact compilation, forward sampling, AC structure learning, Markov network parameter learning, variational inference, and Gibbs sampling. In experiments on eight challenging real-world domains, we find that the methods based on sampling and learning work best: one such method (AC2 -F) is faster and usually more accurate than loopy belief propagation, mean field, and Gibbs sampling; another (AC2 -G) has a running time similar to Gibbs sampling but is consistently more accurate than all baselines. 1
5 0.65693724 126 nips-2010-Inference with Multivariate Heavy-Tails in Linear Models
Author: Danny Bickson, Carlos Guestrin
Abstract: Heavy-tailed distributions naturally occur in many real life problems. Unfortunately, it is typically not possible to compute inference in closed-form in graphical models which involve such heavy-tailed distributions. In this work, we propose a novel simple linear graphical model for independent latent random variables, called linear characteristic model (LCM), defined in the characteristic function domain. Using stable distributions, a heavy-tailed family of distributions which is a generalization of Cauchy, L´ vy and Gaussian distrie butions, we show for the first time, how to compute both exact and approximate inference in such a linear multivariate graphical model. LCMs are not limited to stable distributions, in fact LCMs are always defined for any random variables (discrete, continuous or a mixture of both). We provide a realistic problem from the field of computer networks to demonstrate the applicability of our construction. Other potential application is iterative decoding of linear channels with non-Gaussian noise. 1
6 0.60828394 239 nips-2010-Sidestepping Intractable Inference with Structured Ensemble Cascades
7 0.60536361 84 nips-2010-Exact inference and learning for cumulative distribution functions on loopy graphs
8 0.5955497 257 nips-2010-Structured Determinantal Point Processes
9 0.57355946 159 nips-2010-Lifted Inference Seen from the Other Side : The Tractable Features
10 0.56196815 49 nips-2010-Computing Marginal Distributions over Continuous Markov Networks for Statistical Relational Learning
11 0.55792046 169 nips-2010-More data means less inference: A pseudo-max approach to structured learning
12 0.5541001 83 nips-2010-Evidence-Specific Structures for Rich Tractable CRFs
13 0.54946339 61 nips-2010-Direct Loss Minimization for Structured Prediction
14 0.54763407 255 nips-2010-Static Analysis of Binary Executables Using Structural SVMs
15 0.53535426 113 nips-2010-Heavy-Tailed Process Priors for Selective Shrinkage
16 0.53404367 33 nips-2010-Approximate inference in continuous time Gaussian-Jump processes
17 0.50102317 202 nips-2010-Parallelized Stochastic Gradient Descent
18 0.49891832 53 nips-2010-Copula Bayesian Networks
19 0.49627772 54 nips-2010-Copula Processes
20 0.47792566 144 nips-2010-Learning Efficient Markov Networks
topicId topicWeight
[(13, 0.035), (17, 0.041), (27, 0.038), (30, 0.086), (45, 0.249), (46, 0.244), (50, 0.087), (52, 0.034), (60, 0.045), (77, 0.047), (90, 0.015)]
simIndex simValue paperId paperTitle
1 0.83824652 271 nips-2010-Tiled convolutional neural networks
Author: Jiquan Ngiam, Zhenghao Chen, Daniel Chia, Pang W. Koh, Quoc V. Le, Andrew Y. Ng
Abstract: Convolutional neural networks (CNNs) have been successfully applied to many tasks such as digit and object recognition. Using convolutional (tied) weights significantly reduces the number of parameters that have to be learned, and also allows translational invariance to be hard-coded into the architecture. In this paper, we consider the problem of learning invariances, rather than relying on hardcoding. We propose tiled convolution neural networks (Tiled CNNs), which use a regular “tiled” pattern of tied weights that does not require that adjacent hidden units share identical weights, but instead requires only that hidden units k steps away from each other to have tied weights. By pooling over neighboring units, this architecture is able to learn complex invariances (such as scale and rotational invariance) beyond translational invariance. Further, it also enjoys much of CNNs’ advantage of having a relatively small number of learned parameters (such as ease of learning and greater scalability). We provide an efficient learning algorithm for Tiled CNNs based on Topographic ICA, and show that learning complex invariant features allows us to achieve highly competitive results for both the NORB and CIFAR-10 datasets. 1
same-paper 2 0.82216996 118 nips-2010-Implicit Differentiation by Perturbation
Author: Justin Domke
Abstract: This paper proposes a simple and efficient finite difference method for implicit differentiation of marginal inference results in discrete graphical models. Given an arbitrary loss function, defined on marginals, we show that the derivatives of this loss with respect to model parameters can be obtained by running the inference procedure twice, on slightly perturbed model parameters. This method can be used with approximate inference, with a loss function over approximate marginals. Convenient choices of loss functions make it practical to fit graphical models with hidden variables, high treewidth and/or model misspecification. 1
3 0.75433707 49 nips-2010-Computing Marginal Distributions over Continuous Markov Networks for Statistical Relational Learning
Author: Matthias Broecheler, Lise Getoor
Abstract: Continuous Markov random fields are a general formalism to model joint probability distributions over events with continuous outcomes. We prove that marginal computation for constrained continuous MRFs is #P-hard in general and present a polynomial-time approximation scheme under mild assumptions on the structure of the random field. Moreover, we introduce a sampling algorithm to compute marginal distributions and develop novel techniques to increase its efficiency. Continuous MRFs are a general purpose probabilistic modeling tool and we demonstrate how they can be applied to statistical relational learning. On the problem of collective classification, we evaluate our algorithm and show that the standard deviation of marginals serves as a useful measure of confidence. 1
4 0.75068635 87 nips-2010-Extended Bayesian Information Criteria for Gaussian Graphical Models
Author: Rina Foygel, Mathias Drton
Abstract: Gaussian graphical models with sparsity in the inverse covariance matrix are of significant interest in many modern applications. For the problem of recovering the graphical structure, information criteria provide useful optimization objectives for algorithms searching through sets of graphs or for selection of tuning parameters of other methods such as the graphical lasso, which is a likelihood penalization technique. In this paper we establish the consistency of an extended Bayesian information criterion for Gaussian graphical models in a scenario where both the number of variables p and the sample size n grow. Compared to earlier work on the regression case, our treatment allows for growth in the number of non-zero parameters in the true model, which is necessary in order to cover connected graphs. We demonstrate the performance of this criterion on simulated data when used in conjunction with the graphical lasso, and verify that the criterion indeed performs better than either cross-validation or the ordinary Bayesian information criterion when p and the number of non-zero parameters q both scale with n. 1
5 0.74894387 155 nips-2010-Learning the context of a category
Author: Dan Navarro
Abstract: This paper outlines a hierarchical Bayesian model for human category learning that learns both the organization of objects into categories, and the context in which this knowledge should be applied. The model is fit to multiple data sets, and provides a parsimonious method for describing how humans learn context specific conceptual representations.
6 0.74871296 63 nips-2010-Distributed Dual Averaging In Networks
7 0.74866939 257 nips-2010-Structured Determinantal Point Processes
8 0.74755168 148 nips-2010-Learning Networks of Stochastic Differential Equations
9 0.74661815 287 nips-2010-Worst-Case Linear Discriminant Analysis
10 0.7464425 202 nips-2010-Parallelized Stochastic Gradient Descent
11 0.74545401 243 nips-2010-Smoothness, Low Noise and Fast Rates
12 0.7441262 92 nips-2010-Fast global convergence rates of gradient methods for high-dimensional statistical recovery
13 0.74242795 85 nips-2010-Exact learning curves for Gaussian process regression on large random graphs
14 0.74220675 239 nips-2010-Sidestepping Intractable Inference with Structured Ensemble Cascades
15 0.74209392 13 nips-2010-A Primal-Dual Message-Passing Algorithm for Approximated Large Scale Structured Prediction
16 0.7415123 216 nips-2010-Probabilistic Inference and Differential Privacy
17 0.74112576 51 nips-2010-Construction of Dependent Dirichlet Processes based on Poisson Processes
18 0.74036574 164 nips-2010-MAP Estimation for Graphical Models by Likelihood Maximization
19 0.74029368 201 nips-2010-PAC-Bayesian Model Selection for Reinforcement Learning
20 0.73980653 290 nips-2010-t-logistic regression