nips nips2012 nips2012-105 knowledge-graph by maker-knowledge-mining

105 nips-2012-Dynamic Pruning of Factor Graphs for Maximum Marginal Prediction


Source: pdf

Author: Christoph H. Lampert

Abstract: We study the problem of maximum marginal prediction (MMP) in probabilistic graphical models, a task that occurs, for example, as the Bayes optimal decision rule under a Hamming loss. MMP is typically performed as a two-stage procedure: one estimates each variable’s marginal probability and then forms a prediction from the states of maximal probability. In this work we propose a simple yet effective technique for accelerating MMP when inference is sampling-based: instead of the above two-stage procedure we directly estimate the posterior probability of each decision variable. This allows us to identify the point of time when we are sufficiently certain about any individual decision. Whenever this is the case, we dynamically prune the variables we are confident about from the underlying factor graph. Consequently, at any time only samples of variables whose decision is still uncertain need to be created. Experiments in two prototypical scenarios, multi-label classification and image inpainting, show that adaptive sampling can drastically accelerate MMP without sacrificing prediction accuracy. 1

Reference: text


Summary: the most important sentences generated by the tfidf model

sentIndex sentText sentNum sentScore

1 Abstract We study the problem of maximum marginal prediction (MMP) in probabilistic graphical models, a task that occurs, for example, as the Bayes optimal decision rule under a Hamming loss. [sent-7, score-0.489]

2 MMP is typically performed as a two-stage procedure: one estimates each variable’s marginal probability and then forms a prediction from the states of maximal probability. [sent-8, score-0.242]

3 In this work we propose a simple yet effective technique for accelerating MMP when inference is sampling-based: instead of the above two-stage procedure we directly estimate the posterior probability of each decision variable. [sent-9, score-0.255]

4 Whenever this is the case, we dynamically prune the variables we are confident about from the underlying factor graph. [sent-11, score-0.269]

5 Consequently, at any time only samples of variables whose decision is still uncertain need to be created. [sent-12, score-0.383]

6 Experiments in two prototypical scenarios, multi-label classification and image inpainting, show that adaptive sampling can drastically accelerate MMP without sacrificing prediction accuracy. [sent-13, score-0.516]

7 Unfortunately, inference by sampling is often computationally costly: many samples are required to reach a confident result, and generating the individual samples can be a complex task in itself, in particular if the underlying graphical model is large and highly connected. [sent-17, score-0.472]

8 In this work we study a particular inference problem: maximum marginal prediction (MMP) in binary-valued PGMs, i.e. [sent-18, score-0.296]

9 the task of determining for each variable in the graphical model which of its states has highest marginal probability. [sent-20, score-0.308]

10 MMP occurs naturally as the Bayes optimal decision rule under Hamming loss [8], and it has also found use as a building block for more complex prediction tasks, such as M -best MAP prediction [9]. [sent-21, score-0.295]

11 The standard approach to sampling-based MMP is to estimate each variable’s marginal probability distribution from a set of samples from the joint probability, and for each variable pick the state of highest estimated marginal probability. [sent-22, score-0.479]

12 We introduce one binary indicator variable for each decision we need to make, and keep estimates of the posterior probabilities of each of these during the process of sampling. [sent-24, score-0.313]

13 As soon as we are confident enough about any of the decisions, we remove it from the factor graph that underlies the sampling process, so no more samples are generated for it. [sent-25, score-0.362]

14 Consequently, the factor graph shrinks over time, and later steps in the sampling procedure are accelerated, often drastically so. [sent-26, score-0.318]

15 Our main contribution lies in the combination of two relatively elementary components that we will introduce in the following section: an estimate for the posterior distributions of the decision variables, and a mean field-like construction for removing individual variables from a factor graph. [sent-27, score-0.429]
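
The overall procedure described by the preceding sentences can be summarized in a short sketch. The Python pseudocode below is not the author's implementation: the factor-graph interface and the helper names (gibbs_step, decision_confidence, prune_variable) are assumptions made purely for illustration, and the concrete confidence test and pruning construction are the two components spelled out later in this summary.

```python
# A high-level sketch of adaptive MMP, assuming a hypothetical factor-graph
# object and injected helper functions; the actual confidence criterion and
# pruning construction follow in later sentences of this summary.
import numpy as np

def adaptive_mmp(graph, gibbs_step, decision_confidence, prune_variable,
                 eps=1e-5, max_iters=500):
    samples = {i: [] for i in graph.variables}   # per-variable sample history
    decisions = {}
    state = graph.random_state()
    for _ in range(max_iters):
        state = gibbs_step(graph, state)         # one sweep over remaining variables
        for i in list(graph.variables):
            samples[i].append(state[i])
            z, _ = decision_confidence(samples[i], eps)
            if z is not None:                    # eps-confident about this variable
                decisions[i] = z
                mu_hat = float(np.mean(samples[i]))
                graph = prune_variable(graph, i, mu_hat)   # shrink the factor graph
        if not graph.variables:                  # everything decided: stop early
            break
    for i in graph.variables:                    # undecided variables: plain MMP estimate
        decisions[i] = int(np.mean(samples[i]) > 0.5)
    return decisions
```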

16 We assume that p is given to us by means of a factor graph, G = (V, F), with factor set F = {F1, . . . }. [sent-32, score-0.172]

17 to infer the values of decision variables (zi)i∈V that are defined by zi := 0 if µi ≤ 0.5, [sent-41, score-0.288]

18 and zi := 1 otherwise, where µi := p(xi = 1) is the marginal probability of the ith variable taking the value 1. [sent-42, score-0.27]

19 Computing the marginals µi in a loopy graphical model is in general #P-complete [10], so one has to settle for approximate marginals and approximate predictions. [sent-43, score-0.261]

20 Under mild conditions on the sampling procedure, the law of large numbers guarantees that lim_{N→∞} µ̂i = µi, and the decisions will become correct almost surely. [sent-52, score-0.26]

21 The main problem with sampling-based inference is when to stop sampling [13]. [sent-53, score-0.246]

22 However, each sample we generate increases the computational cost at least proportionally to the numbers of factors and variables involved. [sent-55, score-0.288]

23 In the rest of this section, we explain our proposed idea of adaptive sampling in graphical models, which reduces the number of variables and factors during the course of the sampling procedure. [sent-58, score-0.745]

24 As an illustrative example we start with the classical situation of adaptive sampling in the case of a single binary variable. [sent-59, score-0.441]

25 data– has recently also been rediscovered in the pattern recognition literature, for example for evaluating decision trees [14]. [sent-63, score-0.165]

26 We then introduce our proposed extensions to correlated samples, and show how the per-variable decisions can be applied in the graphical model situation with potentially many variables and dependencies between them. [sent-64, score-0.379]

27 Ultimately, we are only interested in the value of the associated decision variable z. [sent-71, score-0.214]

28 The normalization factor B(α, β) = Γ(α)Γ(β)/Γ(α+β) is the beta function; the integral is called the incomplete beta function (here evaluated at 1/2). [sent-78, score-0.268]

29 If its value is above 1 − ε, we are ε-confident that the correct decision is z = 0. [sent-81, score-0.165]

30 If it is below ε, we are equally confident that the correct decision is z = 1. [sent-82, score-0.165]

31 An analogous derivation to the above leads to a confidence bound for estimates of the marginal probability, µ̂ = m/N, itself: p(|µ̂ − µ| ≤ δ | S) = I_{µ̂+δ}(m+1, N−m+1) − I_{µ̂−δ}(m+1, N−m+1). [sent-84, score-0.177]

32 Intuitively, this makes sense, since in this situation even a coarse estimate of the marginal is sufficient to make a decision with low error probability. [sent-89, score-0.417]

33 Only if the true marginal lies inside a relatively narrow interval around 0.5 [sent-90, score-0.244]

34 does the MMP decision become hard, and a large number of samples will be necessary to make a confident decision. [sent-91, score-0.241]

35 To first order, one estimates the effective sample size as N_ESS ≈ (1−r)/(1+r) · N, where r is the first-order autocorrelation coefficient, r = 1/((N−1)σ̂²) Σ_{j=1}^{N−1} (x^(j) − µ̂)(x^(j+1) − µ̂), and σ̂² is the estimated variance of the sample sequence. [sent-101, score-0.18]

36 Subsequently, we estimate the confidence of a decision by p(z = 0 | S) = I_{1/2}(µ̂N + 1, (1 − µ̂)N + 1). (5) [sent-103, score-0.165]
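
As a concrete illustration of the stopping rule in sentences 28-36, the following sketch estimates the effective sample size from the first-order autocorrelation and then evaluates the regularized incomplete beta function at 1/2. The variable and function names are ours, and plugging the effective sample size into Equation (5) is our reading of the text; treat this as an assumption, not reference code.

```python
# A minimal sketch of the per-variable stopping test: effective sample size
# from the first-order autocorrelation, then p(z = 0 | S) via the regularized
# incomplete beta function evaluated at 1/2.
import numpy as np
from scipy.special import betainc

def decision_confidence(samples, eps=1e-5):
    """samples: sequence of 0/1 samples of one variable; returns (decision, p(z=0|S))."""
    x = np.asarray(samples, dtype=float)
    N, mu = len(x), x.mean()
    var = x.var()
    # first-order autocorrelation coefficient r (zero if all samples identical)
    r = 0.0 if var == 0.0 else np.sum((x[:-1] - mu) * (x[1:] - mu)) / ((N - 1) * var)
    n_eff = max(1.0, (1.0 - r) / (1.0 + r) * N)          # effective sample size
    p_z0 = betainc(mu * n_eff + 1.0, (1.0 - mu) * n_eff + 1.0, 0.5)  # I_{1/2}(a, b)
    if p_z0 > 1.0 - eps:
        return 0, p_z0      # eps-confident that z = 0
    if p_z0 < eps:
        return 1, p_z0      # eps-confident that z = 1
    return None, p_z0       # still uncertain: keep sampling
```

With independent samples (r = 0) this reduces to the plain incomplete-beta test above; applied separately to each variable's sample sequence, it gives the per-variable decisions of sentences 38-41.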

37 2 Adaptive Sampling in Graphical Models In this section we extend the above confidence criterion from single binary decisions to the situation of joint sampling from the joint probability of multiple binary variables. [sent-107, score-0.445]

38 Note that we are only interested in per-variable decisions, so we can treat the value of each variable xi in a joint sample x^(j) as a separate sample from the marginal probability p(xi). [sent-108, score-0.391]

39 We will have to take the dependence between different samples xi^(j) and xi^(k) into account, but between-variable dependencies within a sample do not pose problems. [sent-109, score-0.302]

40 Consequently, estimating the confidence of any decision variable zi is straightforward from Equation (5), applied separately to the binary sample set Si = {xi^(1), . . . , xi^(N)}. [sent-110, score-0.364]

41 For example, each variable has its own autocorrelation estimate and effective sample size. [sent-115, score-0.178]

42 The difference from the binary situation lies in what we do when we are confident enough about the decision of some subset of variables, V^c ⊂ V. [sent-117, score-0.362]

43 Simply stopping all sampling would be too risky, since we are still uncertain about the decisions of V u := V \ V c . [sent-118, score-0.389]

44 Continuing to sample until we are certain about all decisions would be wasteful, since we know that variables with marginals close to 0.5 can require a very large number of samples. [sent-119, score-0.507]

45 This requires us to derive an expression for p(xu ), the marginal probability of all variables that we are still uncertain about. [sent-122, score-0.319]

46 Computing p(xu) = Σ_{x̄c ∈ {0,1}^{V^c}} p(x̄c, xu) exactly is almost always infeasible; otherwise, we would not have needed to resort to sampling-based inference in the first place. [sent-123, score-0.55]

47 An alternative idea would be to continue using the original factor graph, but to clamp all variables we are certain about to their MMP values. [sent-124, score-0.238]

48 This is computationally feasible, but it results in samples from a conditional distribution, p(xu |xc = zc ), not from the desired marginal one. [sent-125, score-0.253]

49 The new construction that we introduce combines advantages of both previous ideas: it is computationally as efficient as the value clamping, but it uses a distribution that approximates the marginal distribution as closely as possible. [sent-126, score-0.177]

50 Subsequently, q(xu) can be used as an approximate replacement for p(xu), because p(xu) = Σ_{xc ∈ {0,1}^{V^c}} p(x) ≈ Σ_{xc ∈ {0,1}^{V^c}} q′(x̄c) q(xu) = q(xu). [sent-128, score-0.278]

51 Because we want q also to respect the marginal probabilities µ̂i for i ∈ V^c, as estimated from the sampling process so far, we obtain q′(xc) = Π_{i∈V^c} µ̂i^{xi} (1 − µ̂i)^{1−xi}. [sent-131, score-0.329]

52 Instead, we define it as the solution of minimizing KL(p ‖ q q′) over all distributions q, which yields q(xu) ∝ exp( −E_{x̄c ∼ q′(xc)}[ E(x̄c, xu) ] ). (6) [sent-133, score-0.205]

53 x ¯ (6) What remains is to define factors F and log-potentials ψ , such that q(xu ) ∝ exp − F ∈F ψF (xF ) while also allowing for efficient sampling from q. [sent-134, score-0.273]

54 Each factor F0 ∈ F0 we split further into its certain and uncertain components, F0^c ⊂ V^c and F0^u ⊂ V^u, respectively. [sent-136, score-0.184]

55 The third sum we rewrite as Σ_{F^u = F ∩ V^u : F ∈ F0} ψ′_{F^u}(x_{F^u}), with ψ′_{F^u}(xu) = Σ_{x̄c} Π_{i∈F0^c} µ̂i^{x̄i} (1 − µ̂i)^{1−x̄i} ψF(x̄c, xu). [sent-139, score-0.205]

56 If factors with identical variable sets occur during this construction, we merge them by summing their log-potentials. [sent-141, score-0.17]

57 Ultimately, we obtain a new factor set F′ := F^u ∪ {F ∩ V^u : F ∈ F0}, and probability distribution q(xu) ∝ exp( −Σ_{F∈F′} ψ′_F(xF) ) for xu ∈ {0,1}^{V^u}. (8) [sent-142, score-0.291]

58 Note that during the process, not only is the number of variables reduced, but the number of factors and the size of each factor can also never grow. [sent-143, score-0.286]
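
To make the construction of sentences 50-58 concrete, here is a small sketch of how one factor's log-potential table could be reduced by averaging over its certain variables under the Bernoulli estimates µ̂. The table-based factor representation and the function name are our own assumptions, not the paper's code.

```python
# A sketch of the pruning step: average a factor's log-potential table over
# the certain variables under the product of Bernoulli(mu_hat) estimates,
# leaving a smaller factor over the still-uncertain variables only.
import itertools
import numpy as np

def prune_factor(scope, log_potential, certain, mu_hat):
    """scope: tuple of variable ids; log_potential: array of shape (2,)*len(scope);
    certain: set of decided variable ids; mu_hat: dict variable id -> estimated marginal."""
    keep = [k for k, v in enumerate(scope) if v not in certain]
    drop = [k for k, v in enumerate(scope) if v in certain]
    if not drop:                                  # factor untouched by pruning
        return scope, log_potential
    new_scope = tuple(scope[k] for k in keep)
    new_pot = np.zeros((2,) * len(keep))
    # psi'(x_u) = sum_{x_c} prod_i mu_i^{x_i} (1 - mu_i)^{1 - x_i} * psi(x_c, x_u)
    for xc in itertools.product([0, 1], repeat=len(drop)):
        weight = np.prod([mu_hat[scope[k]] if b else 1.0 - mu_hat[scope[k]]
                          for k, b in zip(drop, xc)])
        index = [slice(None)] * len(scope)
        for k, b in zip(drop, xc):
            index[k] = b
        new_pot = new_pot + weight * log_potential[tuple(index)]
    return new_scope, new_pot
```

Factors that end up with the same reduced scope would then be merged by summing their tables, matching sentence 56.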

59 3 Related Work Sequential sampling with the option of early stopping has a long tradition in Bayesian statistics. [sent-145, score-0.273]

60 First introduced by Wald in 1945 [18], the ability to continuously accumulate information until a decision can be made with sufficient confidence was one of the key factors that contributed to the success of Bayesian reasoning for decision making. [sent-146, score-0.451]

61 In current machine learning research, sequential sampling is used less frequently for making individual decisions, but in the form of MCMC it has become one of the most successful techniques for statistical inference of probability distributions with many dependent variables [12, 22]. [sent-154, score-0.317]

62 Nevertheless, to the best of our knowledge, the method we propose is the first one that performs early stopping of subsets of variables in this context. [sent-155, score-0.2]

63 Many other approaches to reduce the complexity of sampling iterations exist, however, for example approximating complex graphical models by simpler ones, such as trees [23] or loopy models of low treewidth [24]. [sent-156, score-0.318]

64 They dynamically exclude low-likelihood label combinations from the inference process, but they keep the size and topology of the factor graph fixed. [sent-159, score-0.271]

65 Select and sample [26] disregards a data-dependent subset of variables during each sampling iteration. [sent-160, score-0.282]

66 It is not directly applicable in our situation, though, since it requires that the underlying graphical model is bipartite, such that the individual variables are conditionally independent of each other. [sent-161, score-0.193]

67 Given their complementary nature, we believe that the idea of combining adaptive MMP with beam search and/or select and sample could be a promising direction for future work. [sent-162, score-0.251]

68 4 Experimental Evaluation To demonstrate the effect of adaptive MMP compared to naive MMP, we performed experiments in two prototypical applications: multi-label classification and binary image inpainting. [sent-163, score-0.322]

69 Given an input y, each label variable i has a unary factor Fi = {i} with log-linear potential ψi(xi) = ⟨wi, y⟩ xi, where wi is a label-specific weight vector that was learned from training data. [sent-177, score-0.287]

70 The resulting conditional joint distribution has the form of a Boltzmann machine, p(x|y) ∝ exp(−Ey(x)), with energy function Ey(x) = Σ_{i=1}^{L} ηi xi + Σ_{i=1}^{L} Σ_{j=i+1}^{L} ηij xi xj in minimal representation, where ηi and ηij depend on y. [sent-180, score-0.163]
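
For the Boltzmann-machine form above, a single Gibbs sweep only needs the conditional p(xi = 1 | x_rest) = σ(−ηi − Σ_{j≠i} ηij xj). The sketch below is a generic illustration, assuming the pairwise coefficients are stored in a symmetric matrix with zero diagonal; it is not the paper's implementation.

```python
# A minimal sketch of one Gibbs sweep for p(x|y) ∝ exp(-E_y(x)) with
# E_y(x) = sum_i eta_i x_i + sum_{i<j} eta_ij x_i x_j.
import numpy as np

def gibbs_sweep(eta, eta_pair, x, rng):
    """eta: (L,) unary coefficients; eta_pair: (L, L) symmetric, zero-diagonal
    pairwise coefficients; x: (L,) current 0/1 label vector (updated in place)."""
    for i in range(len(x)):
        delta_e = eta[i] + eta_pair[i] @ x        # E(x_i = 1) - E(x_i = 0)
        p1 = 1.0 / (1.0 + np.exp(delta_e))        # p(x_i = 1 | x_rest)
        x[i] = 1 if rng.random() < p1 else 0
    return x
```

Here rng could be np.random.default_rng(); repeating the sweep yields the sample sequence that the per-variable confidence test above is applied to.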

71 The necessary gradients are computed using a junction tree algorithm for problems with 20 variables or fewer, and by Gibbs sampling otherwise. [sent-182, score-0.282]

72 The remaining columns give the MMP prediction accuracy of the trained CRF models: Exact computes the exact marginal values by a junction tree; Gibbs and Proposed perform ordinary Gibbs sampling or the proposed adaptive version with ε = 10^−2/10^−5/10^−8, both run for up to 500 iterations. [sent-253, score-0.51]

73 1 1 10 100 1000 10000 Figure 1: Results of adaptive pruning on RCV 1 dataset for = 10−2 , 10−5 , 10−8 (left to right). [sent-350, score-0.254]

74 x-axis: regularization parameter C used for training; y-axis: ratio of iterations/variables/factors/runtime used by adaptive sampling relative to Gibbs sampling. [sent-351, score-0.384]

75 Figures 1 and 2 show in more detail how the adaptive sampling behaves on two exemplary datasets with respect to four aspects: the number of iterations, the number of variables, the number of factors, and the overall runtime. [sent-354, score-0.369]

76 A value of 0.5 in iterations means that the adaptive sampler terminated after 250 iterations instead of the maximum of 500, because it was confident about all decisions. [sent-357, score-0.284]

77 Values of 0.2 in variables and 0.1 in factors mean that the number of variable states sampled by the adaptive sampler was 20%, and the number of factors in the corresponding factor graphs was 10%, of the corresponding quantities for the Gibbs sampler. [sent-359, score-0.756]

78 1 1 10 100 1000 10000 Figure 2: Results of adaptive pruning on YEAST dataset for = 10−2 , 10−5 , 10−8 (left to right). [sent-457, score-0.254]

79 x-axis: regularization parameter C used for training; y-axis: ratio of iterations/variables/factors/runtime used by adaptive sampling relative to Gibbs sampling. [sent-458, score-0.384]

80 As one can see, a large number of variables and factors are removed quickly from the factor graph, leading to a large speedup compared to the ordinary Gibbs sampler. [sent-461, score-0.344]

81 In fact, as the first row shows, it was possible to make a confident decision for all variables far before the 500th iteration, such that the adaptive method terminated early. [sent-462, score-0.403]

82 As a general trend, the weaker the regularization (larger C value in the plot), the earlier the adaptive sampler is able to remove variables and factors, presumably because more extreme values of the energy function result in more marginal probabilities close to 0 or 1. [sent-463, score-0.561]

83 On the hard YEAST dataset (Figure 2), in the majority of cases the adaptive sampling does not terminate early, indicating that some of the variables have marginal probabilities close to 0.5. [sent-466, score-0.611]

84 Image inpainting has been tackled successfully by grid-shaped Markov random field models, where each pixel is represented by a random variable, unary factors encode local evidence extracted from the image, and pairwise terms encode the co-occurrence of pixel values. [sent-471, score-0.438]

85 Each variable has one unary and up to 64 pairwise factors, leading to an overall factor count of 146224 to 553726. [sent-476, score-0.226]

86 Because many of the pairwise factors act repulsively, the underlying energy function is highly non-submodular, and sampling has proven a more successful means of inference than, for example, message passing [31]. [sent-477, score-0.402]
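
As an illustration of the grid construction described in sentences 84-86, the following sketch enumerates one unary factor per pixel and pairwise factors within a neighborhood radius. The radius and the pixel index encoding are illustrative choices, not the dataset's exact up-to-64-neighbor layout.

```python
# A sketch of enumerating factors for a grid-shaped binary MRF: one unary
# factor per pixel, pairwise factors to neighbors within a chosen radius.
import itertools

def grid_factors(height, width, radius=1):
    factors = []
    for r, c in itertools.product(range(height), range(width)):
        i = r * width + c
        factors.append((i,))                          # unary factor for pixel i
        for dr, dc in itertools.product(range(-radius, radius + 1), repeat=2):
            rr, cc = r + dr, c + dc
            if (dr, dc) == (0, 0) or not (0 <= rr < height and 0 <= cc < width):
                continue
            j = rr * width + cc
            if i < j:                                 # add each pairwise factor once
                factors.append((i, j))
    return factors
```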

87 In each case, we ran an ordinary Gibbs sampler and the adaptive sampler for 30 seconds. Figure 3: Example results of binary image inpainting on the HECC dataset (columns: input, Gibbs, ε = 10^−2, ε = 10^−5, ε = 10^−8). [sent-480, score-0.611]

88 From left to right: image to be inpainted, result of Gibbs sampling, result of adaptive sampling, where each method was run for up to 30 seconds per image. [sent-481, score-0.226]

89 The left plot of each result shows the marginal probabilities; the right plot shows how often each pixel was sampled, on a log scale from 10 (dark blue) to 100000 (bright red). [sent-482, score-0.285]

90 Gibbs sampling treats all pixels uniformly, reaching around 100 sampling sweeps within the given time budget. [sent-483, score-0.364]

91 Adaptive sampling stops early for parts of the image that it is certain about, and concentrates its samples in the uncertain regions. [sent-484, score-0.482]

92 and we visualize the resulting marginal probabilities as well as the number of samples created for each of the pixels. [sent-489, score-0.297]

93 One can see that adaptive sampling comes to a more confident prediction within the given time budget. [sent-490, score-0.376]

94 The larger the parameter, the earlier it stops sampling the 'easy' pixels, spending more time on the difficult cases. [sent-491, score-0.186]

95 In combination, this allowed us to more efficiently infer the MMPs: starting from the whole factor graph, we sample sequentially, and whenever we are sufficiently certain about a prediction, we prune it from the factor graph before continuing to sample. [sent-496, score-0.402]

96 Experiments on multilabel classification and image inpainting show a clear increase in performance at virtually no loss in accuracy, unless the confidence is chosen too optimistically. [sent-497, score-0.209]

97 This is likely a consequence of the fact that our pruning uses the estimated marginals to build a new factor graph, and even if the decision confidence is high, the marginals can still vary considerably. [sent-500, score-0.589]

98 This is sufficient for multi-label classification and many problems in image processing, but ultimately, one would hope to derive similar early stopping criteria also for graphical models with larger label sets. [sent-503, score-0.306]

99 Our pruning method would be readily applicable to this situation, but an open challenge lies in finding a suitable criterion when to prune variables. [sent-504, score-0.219]

100 This will require a deeper understanding of tail probabilities of multinomial decision variables, but we are confident it will be achievable, for example based on existing prior work from the i.i.d. case. [sent-505, score-0.209]


similar papers computed by tfidf model

tfidf for this paper:

wordName wordTfidf (topN-words)

[('mmp', 0.496), ('dent', 0.274), ('xu', 0.205), ('marginal', 0.177), ('xf', 0.175), ('decision', 0.165), ('adaptive', 0.159), ('sampling', 0.152), ('dence', 0.152), ('gibbs', 0.152), ('inpainting', 0.142), ('xc', 0.139), ('con', 0.13), ('factors', 0.121), ('decisions', 0.108), ('pruning', 0.095), ('factor', 0.086), ('graphical', 0.082), ('variables', 0.079), ('samples', 0.076), ('situation', 0.075), ('runtime', 0.073), ('image', 0.067), ('lies', 0.067), ('marginals', 0.066), ('stopping', 0.066), ('prediction', 0.065), ('sampler', 0.065), ('beta', 0.064), ('uncertain', 0.063), ('xi', 0.063), ('pixels', 0.06), ('austria', 0.059), ('ordinary', 0.058), ('chl', 0.058), ('ediamill', 0.058), ('exemplary', 0.058), ('hecc', 0.058), ('mmps', 0.058), ('rcv', 0.058), ('ynth', 0.058), ('prune', 0.057), ('early', 0.055), ('binary', 0.055), ('inference', 0.054), ('incomplete', 0.054), ('unary', 0.053), ('exc', 0.051), ('pietra', 0.051), ('tests', 0.051), ('sample', 0.051), ('junction', 0.051), ('variable', 0.049), ('classi', 0.048), ('graph', 0.048), ('della', 0.047), ('pgms', 0.047), ('loopy', 0.047), ('dynamically', 0.047), ('drug', 0.045), ('probabilities', 0.044), ('zi', 0.044), ('yeast', 0.042), ('autocorrelation', 0.042), ('pixel', 0.042), ('hamming', 0.042), ('fj', 0.041), ('fc', 0.041), ('yanover', 0.041), ('lampert', 0.041), ('ess', 0.041), ('prototypical', 0.041), ('beam', 0.041), ('dq', 0.041), ('ultimately', 0.04), ('stop', 0.04), ('consequently', 0.039), ('continuing', 0.039), ('fu', 0.039), ('ij', 0.038), ('nevertheless', 0.038), ('pairwise', 0.038), ('continue', 0.038), ('iterations', 0.037), ('energy', 0.037), ('proportionally', 0.037), ('ey', 0.037), ('effective', 0.036), ('label', 0.036), ('certain', 0.035), ('correlated', 0.035), ('stops', 0.034), ('plot', 0.033), ('monte', 0.032), ('drastically', 0.032), ('crf', 0.032), ('individual', 0.032), ('cv', 0.032), ('labels', 0.031), ('regions', 0.031)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.99999893 105 nips-2012-Dynamic Pruning of Factor Graphs for Maximum Marginal Prediction

Author: Christoph H. Lampert

Abstract: We study the problem of maximum marginal prediction (MMP) in probabilistic graphical models, a task that occurs, for example, as the Bayes optimal decision rule under a Hamming loss. MMP is typically performed as a two-stage procedure: one estimates each variable’s marginal probability and then forms a prediction from the states of maximal probability. In this work we propose a simple yet effective technique for accelerating MMP when inference is sampling-based: instead of the above two-stage procedure we directly estimate the posterior probability of each decision variable. This allows us to identify the point of time when we are sufficiently certain about any individual decision. Whenever this is the case, we dynamically prune the variables we are confident about from the underlying factor graph. Consequently, at any time only samples of variables whose decision is still uncertain need to be created. Experiments in two prototypical scenarios, multi-label classification and image inpainting, show that adaptive sampling can drastically accelerate MMP without sacrificing prediction accuracy. 1

2 0.14474438 97 nips-2012-Diffusion Decision Making for Adaptive k-Nearest Neighbor Classification

Author: Yung-kyun Noh, Frank Park, Daniel D. Lee

Abstract: This paper sheds light on some fundamental connections of the diffusion decision making model of neuroscience and cognitive psychology with k-nearest neighbor classification. We show that conventional k-nearest neighbor classification can be viewed as a special problem of the diffusion decision model in the asymptotic situation. By applying the optimal strategy associated with the diffusion decision model, an adaptive rule is developed for determining appropriate values of k in knearest neighbor classification. Making use of the sequential probability ratio test (SPRT) and Bayesian analysis, we propose five different criteria for adaptively acquiring nearest neighbors. Experiments with both synthetic and real datasets demonstrate the effectiveness of our classification criteria. 1

3 0.13957424 82 nips-2012-Continuous Relaxations for Discrete Hamiltonian Monte Carlo

Author: Yichuan Zhang, Zoubin Ghahramani, Amos J. Storkey, Charles A. Sutton

Abstract: Continuous relaxations play an important role in discrete optimization, but have not seen much use in approximate probabilistic inference. Here we show that a general form of the Gaussian Integral Trick makes it possible to transform a wide class of discrete variable undirected models into fully continuous systems. The continuous representation allows the use of gradient-based Hamiltonian Monte Carlo for inference, results in new ways of estimating normalization constants (partition functions), and in general opens up a number of new avenues for inference in difficult discrete systems. We demonstrate some of these continuous relaxation inference algorithms on a number of illustrative problems. 1

4 0.12733965 349 nips-2012-Training sparse natural image models with a fast Gibbs sampler of an extended state space

Author: Lucas Theis, Jascha Sohl-dickstein, Matthias Bethge

Abstract: We present a new learning strategy based on an efficient blocked Gibbs sampler for sparse overcomplete linear models. Particular emphasis is placed on statistical image modeling, where overcomplete models have played an important role in discovering sparse representations. Our Gibbs sampler is faster than general purpose sampling schemes while also requiring no tuning as it is free of parameters. Using the Gibbs sampler and a persistent variant of expectation maximization, we are able to extract highly sparse distributions over latent sources from data. When applied to natural images, our algorithm learns source distributions which resemble spike-and-slab distributions. We evaluate the likelihood and quantitatively compare the performance of the overcomplete linear model to its complete counterpart as well as a product of experts model, which represents another overcomplete generalization of the complete linear model. In contrast to previous claims, we find that overcomplete representations lead to significant improvements, but that the overcomplete linear model still underperforms other models. 1

5 0.11773229 84 nips-2012-Convergence Rate Analysis of MAP Coordinate Minimization Algorithms

Author: Ofer Meshi, Amir Globerson, Tommi S. Jaakkola

Abstract: Finding maximum a posteriori (MAP) assignments in graphical models is an important task in many applications. Since the problem is generally hard, linear programming (LP) relaxations are often used. Solving these relaxations efficiently is thus an important practical problem. In recent years, several authors have proposed message passing updates corresponding to coordinate descent in the dual LP. However, these are generally not guaranteed to converge to a global optimum. One approach to remedy this is to smooth the LP, and perform coordinate descent on the smoothed dual. However, little is known about the convergence rate of this procedure. Here we perform a thorough rate analysis of such schemes and derive primal and dual convergence rates. We also provide a simple dual to primal mapping that yields feasible primal solutions with a guaranteed rate of convergence. Empirical evaluation supports our theoretical claims and shows that the method is highly competitive with state of the art approaches that yield global optima. 1

6 0.11605185 186 nips-2012-Learning as MAP Inference in Discrete Graphical Models

7 0.11560854 159 nips-2012-Image Denoising and Inpainting with Deep Neural Networks

8 0.10031749 316 nips-2012-Small-Variance Asymptotics for Exponential Family Dirichlet Process Mixture Models

9 0.095199883 81 nips-2012-Context-Sensitive Decision Forests for Object Detection

10 0.094267823 111 nips-2012-Efficient Sampling for Bipartite Matching Problems

11 0.09392111 96 nips-2012-Density Propagation and Improved Bounds on the Partition Function

12 0.093081832 204 nips-2012-MAP Inference in Chains using Column Generation

13 0.092519671 200 nips-2012-Local Supervised Learning through Space Partitioning

14 0.091200896 228 nips-2012-Multilabel Classification using Bayesian Compressed Sensing

15 0.091080278 326 nips-2012-Structure estimation for discrete graphical models: Generalized covariance matrices and their inverses

16 0.088115521 126 nips-2012-FastEx: Hash Clustering with Exponential Families

17 0.087755255 251 nips-2012-On Lifting the Gibbs Sampling Algorithm

18 0.084659658 278 nips-2012-Probabilistic n-Choose-k Models for Classification and Ranking

19 0.084010638 33 nips-2012-Active Learning of Model Evidence Using Bayesian Quadrature

20 0.081162684 355 nips-2012-Truncation-free Online Variational Inference for Bayesian Nonparametric Models


similar papers computed by lsi model

lsi for this paper:

topicId topicWeight

[(0, 0.252), (1, 0.03), (2, -0.027), (3, -0.021), (4, -0.104), (5, -0.064), (6, 0.024), (7, -0.006), (8, -0.053), (9, 0.0), (10, -0.04), (11, -0.004), (12, 0.056), (13, 0.031), (14, -0.074), (15, -0.078), (16, -0.062), (17, -0.135), (18, 0.007), (19, 0.029), (20, 0.116), (21, -0.015), (22, -0.112), (23, 0.076), (24, -0.039), (25, -0.028), (26, 0.015), (27, -0.012), (28, 0.011), (29, 0.03), (30, -0.038), (31, 0.005), (32, -0.042), (33, -0.011), (34, 0.0), (35, -0.03), (36, -0.031), (37, 0.049), (38, 0.056), (39, -0.047), (40, -0.055), (41, -0.133), (42, -0.025), (43, -0.005), (44, 0.086), (45, 0.092), (46, -0.012), (47, 0.036), (48, -0.032), (49, 0.001)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.96408409 105 nips-2012-Dynamic Pruning of Factor Graphs for Maximum Marginal Prediction

Author: Christoph H. Lampert

Abstract: We study the problem of maximum marginal prediction (MMP) in probabilistic graphical models, a task that occurs, for example, as the Bayes optimal decision rule under a Hamming loss. MMP is typically performed as a two-stage procedure: one estimates each variable’s marginal probability and then forms a prediction from the states of maximal probability. In this work we propose a simple yet effective technique for accelerating MMP when inference is sampling-based: instead of the above two-stage procedure we directly estimate the posterior probability of each decision variable. This allows us to identify the point of time when we are sufficiently certain about any individual decision. Whenever this is the case, we dynamically prune the variables we are confident about from the underlying factor graph. Consequently, at any time only samples of variables whose decision is still uncertain need to be created. Experiments in two prototypical scenarios, multi-label classification and image inpainting, show that adaptive sampling can drastically accelerate MMP without sacrificing prediction accuracy. 1

2 0.79076993 82 nips-2012-Continuous Relaxations for Discrete Hamiltonian Monte Carlo

Author: Yichuan Zhang, Zoubin Ghahramani, Amos J. Storkey, Charles A. Sutton

Abstract: Continuous relaxations play an important role in discrete optimization, but have not seen much use in approximate probabilistic inference. Here we show that a general form of the Gaussian Integral Trick makes it possible to transform a wide class of discrete variable undirected models into fully continuous systems. The continuous representation allows the use of gradient-based Hamiltonian Monte Carlo for inference, results in new ways of estimating normalization constants (partition functions), and in general opens up a number of new avenues for inference in difficult discrete systems. We demonstrate some of these continuous relaxation inference algorithms on a number of illustrative problems. 1

3 0.76117724 349 nips-2012-Training sparse natural image models with a fast Gibbs sampler of an extended state space

Author: Lucas Theis, Jascha Sohl-dickstein, Matthias Bethge

Abstract: We present a new learning strategy based on an efficient blocked Gibbs sampler for sparse overcomplete linear models. Particular emphasis is placed on statistical image modeling, where overcomplete models have played an important role in discovering sparse representations. Our Gibbs sampler is faster than general purpose sampling schemes while also requiring no tuning as it is free of parameters. Using the Gibbs sampler and a persistent variant of expectation maximization, we are able to extract highly sparse distributions over latent sources from data. When applied to natural images, our algorithm learns source distributions which resemble spike-and-slab distributions. We evaluate the likelihood and quantitatively compare the performance of the overcomplete linear model to its complete counterpart as well as a product of experts model, which represents another overcomplete generalization of the complete linear model. In contrast to previous claims, we find that overcomplete representations lead to significant improvements, but that the overcomplete linear model still underperforms other models. 1

4 0.66255659 248 nips-2012-Nonparanormal Belief Propagation (NPNBP)

Author: Gal Elidan, Cobi Cario

Abstract: The empirical success of the belief propagation approximate inference algorithm has inspired numerous theoretical and algorithmic advances. Yet, for continuous non-Gaussian domains performing belief propagation remains a challenging task: recent innovations such as nonparametric or kernel belief propagation, while useful, come with a substantial computational cost and offer little theoretical guarantees, even for tree structured models. In this work we present Nonparanormal BP for performing efficient inference on distributions parameterized by a Gaussian copulas network and any univariate marginals. For tree structured networks, our approach is guaranteed to be exact for this powerful class of non-Gaussian models. Importantly, the method is as efficient as standard Gaussian BP, and its convergence properties do not depend on the complexity of the univariate marginals, even when a nonparametric representation is used. 1

5 0.64465499 96 nips-2012-Density Propagation and Improved Bounds on the Partition Function

Author: Stefano Ermon, Ashish Sabharwal, Bart Selman, Carla P. Gomes

Abstract: Given a probabilistic graphical model, its density of states is a distribution that, for any likelihood value, gives the number of configurations with that probability. We introduce a novel message-passing algorithm called Density Propagation (DP) for estimating this distribution. We show that DP is exact for tree-structured graphical models and is, in general, a strict generalization of both sum-product and max-product algorithms. Further, we use density of states and tree decomposition to introduce a new family of upper and lower bounds on the partition function. For any tree decomposition, the new upper bound based on finer-grained density of state information is provably at least as tight as previously known bounds based on convexity of the log-partition function, and strictly stronger if a general condition holds. We conclude with empirical evidence of improvement over convex relaxations and mean-field based bounds. 1

6 0.62242973 251 nips-2012-On Lifting the Gibbs Sampling Algorithm

7 0.61059308 111 nips-2012-Efficient Sampling for Bipartite Matching Problems

8 0.60121238 230 nips-2012-Multiple Choice Learning: Learning to Produce Multiple Structured Outputs

9 0.59154648 186 nips-2012-Learning as MAP Inference in Discrete Graphical Models

10 0.58968145 52 nips-2012-Bayesian Nonparametric Modeling of Suicide Attempts

11 0.57955432 220 nips-2012-Monte Carlo Methods for Maximum Margin Supervised Topic Models

12 0.56607735 8 nips-2012-A Generative Model for Parts-based Object Segmentation

13 0.56297714 279 nips-2012-Projection Retrieval for Classification

14 0.55880594 336 nips-2012-The Coloured Noise Expansion and Parameter Estimation of Diffusion Processes

15 0.55568165 171 nips-2012-Latent Coincidence Analysis: A Hidden Variable Model for Distance Metric Learning

16 0.55367512 204 nips-2012-MAP Inference in Chains using Column Generation

17 0.55360609 359 nips-2012-Variational Inference for Crowdsourcing

18 0.54476869 97 nips-2012-Diffusion Decision Making for Adaptive k-Nearest Neighbor Classification

19 0.54364467 205 nips-2012-MCMC for continuous-time discrete-state systems

20 0.54223454 136 nips-2012-Forward-Backward Activation Algorithm for Hierarchical Hidden Markov Models


similar papers computed by lda model

lda for this paper:

topicId topicWeight

[(0, 0.034), (17, 0.014), (21, 0.335), (38, 0.126), (39, 0.013), (42, 0.014), (54, 0.016), (55, 0.013), (74, 0.06), (76, 0.141), (80, 0.118), (92, 0.048)]

similar papers list:

simIndex simValue paperId paperTitle

1 0.94495732 195 nips-2012-Learning visual motion in recurrent neural networks

Author: Marius Pachitariu, Maneesh Sahani

Abstract: We present a dynamic nonlinear generative model for visual motion based on a latent representation of binary-gated Gaussian variables. Trained on sequences of images, the model learns to represent different movement directions in different variables. We use an online approximate inference scheme that can be mapped to the dynamics of networks of neurons. Probed with drifting grating stimuli and moving bars of light, neurons in the model show patterns of responses analogous to those of direction-selective simple cells in primary visual cortex. Most model neurons also show speed tuning and respond equally well to a range of motion directions and speeds aligned to the constraint line of their respective preferred speed. We show how these computations are enabled by a specific pattern of recurrent connections learned by the model. 1

2 0.91133505 1 nips-2012-3D Object Detection and Viewpoint Estimation with a Deformable 3D Cuboid Model

Author: Sanja Fidler, Sven Dickinson, Raquel Urtasun

Abstract: This paper addresses the problem of category-level 3D object detection. Given a monocular image, our aim is to localize the objects in 3D by enclosing them with tight oriented 3D bounding boxes. We propose a novel approach that extends the well-acclaimed deformable part-based model [1] to reason in 3D. Our model represents an object class as a deformable 3D cuboid composed of faces and parts, which are both allowed to deform with respect to their anchors on the 3D box. We model the appearance of each face in fronto-parallel coordinates, thus effectively factoring out the appearance variation induced by viewpoint. Our model reasons about face visibility patters called aspects. We train the cuboid model jointly and discriminatively and share weights across all aspects to attain efficiency. Inference then entails sliding and rotating the box in 3D and scoring object hypotheses. While for inference we discretize the search space, the variables are continuous in our model. We demonstrate the effectiveness of our approach in indoor and outdoor scenarios, and show that our approach significantly outperforms the stateof-the-art in both 2D [1] and 3D object detection [2]. 1

3 0.86390013 300 nips-2012-Scalable nonconvex inexact proximal splitting

Author: Suvrit Sra

Abstract: We study a class of large-scale, nonsmooth, and nonconvex optimization problems. In particular, we focus on nonconvex problems with composite objectives. This class includes the extensively studied class of convex composite objective problems as a subclass. To solve composite nonconvex problems we introduce a powerful new framework based on asymptotically nonvanishing errors, avoiding the common stronger assumption of vanishing errors. Within our new framework we derive both batch and incremental proximal splitting algorithms. To our knowledge, our work is first to develop and analyze incremental nonconvex proximalsplitting algorithms, even if we were to disregard the ability to handle nonvanishing errors. We illustrate one instance of our general framework by showing an application to large-scale nonsmooth matrix factorization. 1

4 0.85605145 60 nips-2012-Bayesian nonparametric models for ranked data

Author: Francois Caron, Yee W. Teh

Abstract: We develop a Bayesian nonparametric extension of the popular Plackett-Luce choice model that can handle an infinite number of choice items. Our framework is based on the theory of random atomic measures, with the prior specified by a gamma process. We derive a posterior characterization and a simple and effective Gibbs sampler for posterior simulation. We develop a time-varying extension of our model, and apply it to the New York Times lists of weekly bestselling books. 1

same-paper 5 0.83875936 105 nips-2012-Dynamic Pruning of Factor Graphs for Maximum Marginal Prediction

Author: Christoph H. Lampert

Abstract: We study the problem of maximum marginal prediction (MMP) in probabilistic graphical models, a task that occurs, for example, as the Bayes optimal decision rule under a Hamming loss. MMP is typically performed as a two-stage procedure: one estimates each variable’s marginal probability and then forms a prediction from the states of maximal probability. In this work we propose a simple yet effective technique for accelerating MMP when inference is sampling-based: instead of the above two-stage procedure we directly estimate the posterior probability of each decision variable. This allows us to identify the point of time when we are sufficiently certain about any individual decision. Whenever this is the case, we dynamically prune the variables we are confident about from the underlying factor graph. Consequently, at any time only samples of variables whose decision is still uncertain need to be created. Experiments in two prototypical scenarios, multi-label classification and image inpainting, show that adaptive sampling can drastically accelerate MMP without sacrificing prediction accuracy. 1

6 0.79937607 302 nips-2012-Scaling MPE Inference for Constrained Continuous Markov Random Fields with Consensus Optimization

7 0.79601252 123 nips-2012-Exponential Concentration for Mutual Information Estimation with Application to Forests

8 0.73705918 23 nips-2012-A lattice filter model of the visual pathway

9 0.734097 18 nips-2012-A Simple and Practical Algorithm for Differentially Private Data Release

10 0.72107971 113 nips-2012-Efficient and direct estimation of a neural subunit model for sensory coding

11 0.69866842 190 nips-2012-Learning optimal spike-based representations

12 0.6901772 112 nips-2012-Efficient Spike-Coding with Multiplicative Adaptation in a Spike Response Model

13 0.68276668 94 nips-2012-Delay Compensation with Dynamical Synapses

14 0.68153667 114 nips-2012-Efficient coding provides a direct link between prior and likelihood in perceptual Bayesian inference

15 0.67537278 152 nips-2012-Homeostatic plasticity in Bayesian spiking networks as Expectation Maximization with posterior constraints

16 0.67325598 24 nips-2012-A mechanistic model of early sensory processing based on subtracting sparse representations

17 0.6727072 137 nips-2012-From Deformations to Parts: Motion-based Segmentation of 3D Objects

18 0.66271859 333 nips-2012-Synchronization can Control Regularization in Neural Systems via Correlated Noise Processes

19 0.66262537 341 nips-2012-The topographic unsupervised learning of natural sounds in the auditory cortex

20 0.66255587 239 nips-2012-Neuronal Spike Generation Mechanism as an Oversampling, Noise-shaping A-to-D converter