nips nips2012 nips2012-143 knowledge-graph by maker-knowledge-mining

143 nips-2012-Globally Convergent Dual MAP LP Relaxation Solvers using Fenchel-Young Margins


Source: pdf

Author: Alex Schwing, Tamir Hazan, Marc Pollefeys, Raquel Urtasun

Abstract: While finding the exact solution for the MAP inference problem is intractable for many real-world tasks, MAP LP relaxations have been shown to be very effective in practice. However, the most efficient methods that perform block coordinate descent can get stuck in sub-optimal points as they are not globally convergent. In this work we propose to augment these algorithms with an ε-descent approach and present a method to efficiently optimize for a descent direction in the subdifferential using a margin-based formulation of the Fenchel-Young duality theorem. Furthermore, the presented approach provides a methodology to construct a primal optimal solution from its dual optimal counterpart. We demonstrate the efficiency of the presented approach on spin glass models and protein interaction problems and show that our approach outperforms state-of-the-art solvers.

Reference: text


Summary: the most important sentences generated by the tfidf model

sentIndex sentText sentNum sentScore

1 Furthermore, the presented approach provides a methodology to construct a primal optimal solution from its dual optimal counterpart. [sent-11, score-0.942]

2 Many of these designed solvers consider the dual program, thus they are based on local updates that follow the graphical model structure, which ensures suitability for very large problems. [sent-23, score-0.653]

3 Unfortunately, the dual program is non-smooth, hence introducing difficulties to existing solvers. [sent-24, score-0.679]

4 For example, block coordinate descent algorithms, typically referred to as convex max-product, monotonically decrease the dual objective and converge very fast, but are not guaranteed to reach the global optimum of the dual program [3, 6, 11, 14, 17, 20, 22, 24, 25]. [sent-25, score-1.877]

5 Different approaches to overcome the sub-optimality of the convex max-product introduced different perturbed programs for which convergence to the dual optimum is guaranteed, e.g. [sent-26, score-0.773]

6 In this work we propose to augment the convex max-product algorithm with a steepest ε-descent approach to monotonically decrease the dual objective until reaching the global optimum of the dual program. [sent-30, score-1.587]

7 To perform the ε-descent we explore the ε-subgradients of the dual program, and provide a method to search for a descent direction in the ε-subdifferential using a margin-based formulation of the Fenchel-Young duality theorem. [sent-31, score-0.868]

8 This characterization also provides a new algorithm to construct a primal optimal solution for the LP relaxation from a dual optimal solution. [sent-32, score-0.987]

9 We begin by introducing the notation, MAP LP relaxations and their dual programs. [sent-35, score-0.622]

10 We subsequently describe the subgradients of the dual and provide an efficient procedure to recover a primal optimal solution. [sent-36, score-1.01]

11 We explore the ε-subgradients of the dual objective, and introduce an efficient globally convergent dual solver based on the ε-margin of the Fenchel-Young duality theorem. [sent-37, score-1.278]

12 Its tractable relaxation is obtained by replacing the integral constraints with non-negativity constraints as follows: max_{b_i,b_α} Σ_{α,x_α} b_α(x_α)θ_α(x_α) + Σ_{i,x_i} b_i(x_i)θ_i(x_i)  (2)  s.t. [sent-61, score-0.442]

13 b_i(x_i), b_α(x_α) ≥ 0, Σ_{x_α} b_α(x_α) = 1, Σ_{x_i} b_i(x_i) = 1, Σ_{x_α\x_i} b_α(x_α) = b_i(x_i). [sent-63, score-0.636]

14 When the optimal beliefs are integral, i.e., they satisfy b_i(x_i), b_α(x_α) ∈ {0, 1}, the program value equals the MAP value. [sent-66, score-0.508]
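
As a concrete illustration (not from the paper), the following sketch instantiates the relaxation in Eq. (2) for a hypothetical model with two variables and a single pairwise factor, and solves it with a generic LP solver; the random potentials and the use of scipy.optimize.linprog are assumptions made only for this example.

```python
import numpy as np
from scipy.optimize import linprog

k = 3                                                        # states per variable (toy choice)
rng = np.random.default_rng(0)
theta_i, theta_j = rng.normal(size=k), rng.normal(size=k)    # unary potentials
theta_ij = rng.normal(size=(k, k))                           # pairwise potential

# Belief vector layout: [b_i (k), b_j (k), b_ij (k*k, index = x_i*k + x_j)].
n = 2 * k + k * k
c = -np.concatenate([theta_i, theta_j, theta_ij.ravel()])    # linprog minimizes

A_eq, b_eq = [], []
for sl in (slice(0, k), slice(k, 2 * k), slice(2 * k, n)):   # normalization constraints
    row = np.zeros(n); row[sl] = 1.0
    A_eq.append(row); b_eq.append(1.0)
for xi in range(k):                                          # sum_{x_j} b_ij(x_i, x_j) = b_i(x_i)
    row = np.zeros(n)
    row[2 * k + xi * k: 2 * k + (xi + 1) * k] = 1.0
    row[xi] = -1.0
    A_eq.append(row); b_eq.append(0.0)
for xj in range(k):                                          # sum_{x_i} b_ij(x_i, x_j) = b_j(x_j)
    row = np.zeros(n)
    row[2 * k + xj::k] = 1.0
    row[k + xj] = -1.0
    A_eq.append(row); b_eq.append(0.0)

res = linprog(c, A_eq=np.array(A_eq), b_eq=np.array(b_eq), bounds=(0, None))
print("LP relaxation value:", -res.fun)                      # upper bound on the MAP value
```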

15 Following [22, 27] we consider the re-parametrized dual q(λ) = Σ_i max_{x_i} [θ_i(x_i) + Σ_{α∈N(i)} λ_{i→α}(x_i)] + Σ_α max_{x_α} [θ_α(x_α) − Σ_{i∈N(α)} λ_{i→α}(x_i)]  (3). [sent-69, score-0.604]

16 The dual program value in Eq. (3) upper bounds the primal program described in Eq. (2). [sent-70, score-1.227]
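
For the same hypothetical single-edge model, the re-parametrized dual of Eq. (3) reduces to three max terms, one per variable and one for the factor. A minimal sketch of evaluating it is given below; any choice of λ yields an upper bound on the LP value above, and the function name and toy model are assumptions of the example.

```python
import numpy as np

def dual_value(theta_i, theta_j, theta_ij, lam_i, lam_j):
    """q(lambda) from Eq. (3), specialized to a single pairwise factor alpha = (i, j).

    lam_i and lam_j play the role of the messages lambda_{i->alpha} and
    lambda_{j->alpha}; both are length-k arrays.
    """
    unary = (theta_i + lam_i).max() + (theta_j + lam_j).max()
    factor = (theta_ij - lam_i[:, None] - lam_j[None, :]).max()
    return unary + factor

# With lam_i = lam_j = 0 this is the trivial bound max(theta_i) + max(theta_j) + max(theta_ij);
# minimizing over the messages tightens it down to the LP relaxation value of Eq. (2).
```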

17 Therefore to compute the primal optimal value one can minimize the dual upper bound. [sent-72, score-0.854]

18 Using block coordinate descent on the dual objective amounts to optimizing blocks of dual variables while holding the remaining ones fixed. [sent-73, score-1.364]

19 The convex max-product algorithm is guaranteed to converge since it minimizes the dual function, which is lower bounded by the primal program. [sent-82, score-0.942]

20 This happens because the dual objective is non-smooth: the algorithm can reach a corner, at which the dual objective stays fixed when changing only a few variables. [sent-86, score-1.22]

21 The second drawback of convex max-product is that it does not always produce a primal optimal solution, b_i(x_i), b_α(x_α), even when it reaches a dual optimal solution. [sent-89, score-1.183]

22 In the next section, we consider the dual subgradients, and provide an efficient algorithm for detecting corners, as well as for decoding a primal optimal solution from a dual optimal solution. [sent-90, score-1.459]

23 It provides an efficient way to get out of corners, and to reach the optimal dual value. [sent-93, score-0.674]

24 The supporting hyperplane at (λ, q(λ)) with slope d takes the form d⊤λ − q*(d), when defining the conjugate dual as q*(d) = max_λ {d⊤λ − q(λ)}. [sent-100, score-0.568]

25 Using Eq. (4) we can verify that λ is dual optimal if and only if 0 ∈ ∂q(λ). [sent-105, score-0.605]

26 In the following claim we characterize the subdifferential of the dual function q(λ) using the Fenchel-Young duality theorem (Claim 1). [sent-106, score-0.797]
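
For reference, the Fenchel-Young characterization that Claim 1 builds on can be written compactly as follows; this is the standard convex-analysis statement, restated here in the notation of the dual q(λ) and its conjugate q*(d).

```latex
% Fenchel-Young duality: the inequality q(lambda) + q*(d) >= d' lambda always
% holds, and equality characterizes the subgradients of q at lambda.
\[
  d \in \partial q(\lambda)
  \quad\Longleftrightarrow\quad
  q(\lambda) + q^{*}(d) = d^{\top}\lambda .
\]
```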

27 The claim above implies that the convex max-product iterates over i and generates beliefs b_i(x_i), b_α(x_α) for every x_i, α ∈ N(i) that agree on their marginal probabilities. [sent-116, score-0.655]

28 The representation of the subdifferential as the amount of disagreement between the marginalization constraints provides a simple procedure to verify dual optimality, as well as to construct primal optimal solutions. [sent-123, score-1.044]

29 Consider the quadratic program min_{b_i,b_α} Σ_{i,x*_i,α∈N(i)} ( Σ_{x*_α\x*_i} b_α(x*_α) − b_i(x*_i) )²  s.t. [sent-127, score-0.453]

30 b_i(x*_i), b_α(x*_α) ≥ 0, Σ_{x*_i} b_i(x*_i) = 1, Σ_{x*_α} b_α(x*_α) = 1. λ is a dual optimal solution if and only if the value of the above program equals zero. [sent-130, score-0.779]

31 Moreover, if λ is a dual optimal solution, then the optimal beliefs b*_α(x_α), b*_i(x_i) are also the optimal solution of the primal program in Eq. (2). [sent-131, score-1.23]

32 However, if λ is not dual optimal, then the vector d*_{i→α}(x_i) = Σ_{x*_α\x*_i} b*_α(x*_α) − b*_i(x*_i) points towards the steepest descent direction of the dual function, i.e. [sent-133, score-1.628]

33 Proof: The steepest descent direction d of q is given by minimizing the directional derivative q_d(λ): min_{‖d‖≤1} q_d(λ) = min_{‖d‖≤1} max_{y∈∂q} d⊤y = max_{y∈∂q} min_{‖d‖≤1} d⊤y = max_{y∈∂q} −‖y‖₂, which yields the above program (cf. [sent-136, score-0.676]

34 If the zero vector is part of the subdifferential, we are dual optimal. [sent-138, score-0.542]

35 One can monotonically decrease the dual objective by minimizing it along the steepest descent direction. [sent-140, score-1.018]

36 Unfortunately, following the steepest descent direction does not guarantee convergence to the global minimum of the dual function [28]. [sent-141, score-0.928]

37 Performing steepest descent might keep minimizing the dual objective with smaller and smaller increments, thus converging to a suboptimal solution. [sent-142, score-0.894]

38 The main drawback of steepest descent as well as block coordinate descent when applied to the dual objective in Eq. (3) is that they only take the supports of the max-functions into account. [sent-143, score-1.128]

39 In the following we show that by considering the ε-margin of these supports we can guarantee that at every iteration we decrease the dual value by at least ε. [sent-145, score-0.601]

40 This procedure results in an efficient algorithm that reaches both dual and primal optimal solutions. [sent-146, score-0.854]

41 4 The ε-Subgradients of the Dual Objective and Steepest ε-Descent: To monotonically decrease the dual value while converging to the optimum, we suggest to explore the ε-neighborhood of the dual objective in Eq. (3). [sent-147, score-1.282]

42 Given our convex dual function q(λ) and a positive scalar ε, we say that a vector d is an ε-subgradient at λ if it supports the epigraph of q(λ) with an ε-margin: ∀λ̂, q(λ̂) − d⊤λ̂ ≥ q(λ) − d⊤λ − ε. [sent-150, score-0.71]

43 Using the conjugate dual q*(d), we can characterize the ε-subdifferential by employing the ε-margin Fenchel-Young duality theorem. [sent-153, score-0.675]
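
In the same notation, the ε-margin version of the Fenchel-Young theorem relaxes the equality above by ε; this is the characterization used to describe the ε-subdifferential.

```latex
% eps-margin Fenchel-Young: relaxing the equality by eps characterizes the
% eps-subdifferential from which the steepest eps-descent direction is drawn.
\[
  d \in \partial_{\epsilon} q(\lambda)
  \quad\Longleftrightarrow\quad
  q(\lambda) + q^{*}(d) - d^{\top}\lambda \;\le\; \epsilon .
\]
```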

44 Whenever one finds a steepest descent direction within ∂_ε q(λ), it is guaranteed to improve the dual objective by at least ε. [sent-155, score-1.017]

45 Moreover, if one cannot find such a direction within the ε-subdifferential, then q(λ) is guaranteed to be ε-close to the dual optimum. [sent-156, score-0.665]

46 On the one hand, if 0 ∉ ∂̃_ε q(λ) then the direction of steepest descent taken from ∂̃_ε q(λ) reduces the dual objective by at least ε. [sent-173, score-0.974]

47 If 0 ∈ ∂̃_ε q(λ) then q(λ) is mε-close to the dual optimum. [sent-174, score-0.542]

48 We use Eq. (6) to characterize the approximated ε-subdifferential of the dual function. [sent-176, score-0.582]

49 That is, d ∈ ∂̃_ε q(λ) if and only if d_{i→α}(x_i) = Σ_{x_α\x_i} b_α(x_α) − b_i(x_i) for probability distributions b_i(x_i), b_α(x_α) that satisfy Σ_{x_i} b_i(x_i) (θ_i(x_i) + Σ_{α∈N(i)} λ_{i→α}(x_i)) ≥ max_{x_i} (θ_i(x_i) + Σ_{α∈N(i)} λ_{i→α}(x_i)) − ε for all i, and Σ_{x_α} b_α(x_α) (θ_α(x_α) − Σ_{i∈N(α)} λ_{i→α}(x_i)) ≥ max_{x_α} (θ_α(x_α) − Σ_{i∈N(α)} λ_{i→α}(x_i)) − ε for all α. [sent-182, score-0.829]

50 Eq. (6) implies b ∈ ∂_ε q_r(θ̂) if and only if q_r(θ̂) + q*_r(b) − b⊤θ̂ ≤ ε, with q*_r(b) denoting the conjugate dual of q_r(θ̂). [sent-184, score-1.153]

51 Plugging in q_r, q*_r we obtain not only the maximizing beliefs but all beliefs within an ε-margin. [sent-185, score-0.46]

52 Claim 2 ensures that minimizing the dual objective along a direction of descent decreases its value by at least ε. [sent-188, score-0.806]

53 Moreover, we are guaranteed to be (|V|+|E|)ε-close to a dual optimal solution if no direction of descent is found in ∂̃_ε q(λ). [sent-189, score-0.866]

54 Therefore, we are able to get out of corners and efficiently reach an approximated dual optimal solution. [sent-190, score-0.776]

55 The interpretation of the Fenchel-Young margin as the amount of disagreement between the marginalization constraints also provides a simple way to reconstruct an approximately optimal primal solution. [sent-191, score-0.391]

56 Consider the quadratic program min_{b_i,b_α} Σ_{i,x_i,α∈N(i)} ( Σ_{x_α\x_i} b_α(x_α) − b_i(x_i) )²  s.t. [sent-195, score-0.453]

57 b_i(x_i), b_α(x_α) ≥ 0, Σ_{x_α} b_α(x_α) = 1, Σ_{x_i} b_i(x_i) = 1, Σ_{x_i} b_i(x_i)θ̂_i(x_i) ≥ max_{x_i}{θ̂_i(x_i)} − ε, Σ_{x_α} b_α(x_α)θ̂_α(x_α) ≥ max_{x_α}{θ̂_α(x_α)} − ε. [sent-197, score-0.96]

58 q(λ) is (|V| + |E|)ε-close to the dual optimal value if and only if the value of the above program equals zero. [sent-198, score-0.779]

59 Moreover, the primal value of the optimal beliefs b*_α(x_α), b*_i(x_i) is (|V| + |E|)ε-close to the optimal primal value in Eq. (2). [sent-199, score-0.737]

60 However, if q(λ) is not (|V| + |E|)ε-close to the dual optimal value then the vector d*_{i→α}(x_i) = Σ_{x_α\x_i} b*_α(x_α) − b*_i(x_i) points towards the steepest ε-descent direction of the function, namely d* = argmin_{‖d‖≤1} inf_{α>0} ( q(λ + αd) − q(λ) + ε ) / α. [sent-201, score-0.853]
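
A possible sketch of the quadratic program of Corollary 2 for the hypothetical single-edge model from above is shown next: it returns None when the optimum of the QP is zero (so q(λ) is (|V|+|E|)ε-close to the dual optimum) and otherwise returns the disagreement vectors that define the steepest ε-descent direction. The use of cvxpy, the sign convention of the reparametrized potentials, and all names are assumptions of the example.

```python
import numpy as np
import cvxpy as cp

def eps_descent_direction(theta_i, theta_j, theta_ij, lam_i, lam_j, eps):
    # Reparametrized potentials (theta-hat) induced by the current messages,
    # following the sign convention of Eq. (3).
    th_i = theta_i + lam_i
    th_j = theta_j + lam_j
    th_ij = theta_ij - lam_i[:, None] - lam_j[None, :]

    k = len(theta_i)
    b_i = cp.Variable(k, nonneg=True)
    b_j = cp.Variable(k, nonneg=True)
    b_ij = cp.Variable((k, k), nonneg=True)

    cons = [cp.sum(b_i) == 1, cp.sum(b_j) == 1, cp.sum(b_ij) == 1,
            # beliefs must lie within an eps-margin of the max (cf. Corollary 2)
            b_i @ th_i >= th_i.max() - eps,
            b_j @ th_j >= th_j.max() - eps,
            cp.sum(cp.multiply(b_ij, th_ij)) >= th_ij.max() - eps]

    d_i = cp.sum(b_ij, axis=1) - b_i        # sum over x_j of b_ij minus b_i
    d_j = cp.sum(b_ij, axis=0) - b_j        # sum over x_i of b_ij minus b_j
    prob = cp.Problem(cp.Minimize(cp.sum_squares(d_i) + cp.sum_squares(d_j)), cons)
    prob.solve()

    if prob.value <= 1e-9:                  # zero => (|V|+|E|)eps-close to dual optimal
        return None
    return d_i.value, d_j.value             # steepest eps-descent direction d*
```

In a full implementation one would then perform a line search along this direction; by Claim 2 this decreases the dual objective by at least ε.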

61 (|V| + |E|)ε-closeness to the dual optimum is given by ([2], Proposition 4. [sent-203, score-0.599]

62 where the primal on the left hand side of the resulting inequality is larger than the dual minus (|V| + |E|)ε. [sent-212, score-0.953]

63 With the dual itself upper bounding the primal, the corollary follows. [sent-213, score-0.603]

64 Thus, we can construct an algorithm that performs improvements over the dual function in each iteration. [sent-214, score-0.567]

65 Since both methods monotonically improve the same dual function, our approach is guaranteed to reach the optimal dual solution and to recover the primal optimal solution. [sent-218, score-1.689]
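
At a high level, the resulting solver interleaves convex max-product passes with steepest ε-descent steps whenever progress stalls. The outline below is only pseudocode: every helper it calls (message initialization, a block coordinate descent pass, the ε-descent direction, the line search) is assumed to exist and is not defined here.

```python
def solve_dual(model, eps, max_iters=1000):
    """Pseudocode sketch; all helper functions are hypothetical placeholders."""
    lam = init_messages(model)                       # e.g., all-zero messages
    prev = dual(model, lam)                          # evaluate q(lambda)
    for _ in range(max_iters):
        lam = convex_max_product_pass(model, lam)    # monotone block coordinate descent
        cur = dual(model, lam)
        if prev - cur < eps:                         # stalled: possibly stuck in a corner
            d = steepest_eps_descent_direction(model, lam, eps)
            if d is None:                            # 0 in the eps-subdifferential:
                break                                # q(lambda) is eps-close to optimal
            lam = line_search(model, lam, d)         # decreases q by at least eps
            cur = dual(model, lam)
        prev = cur
    return lam
```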

66 Figure 1: (a) Difference between the minimal dual value attained by convex max-product, q(λ_CMP), and our approach, q(λ_ε). [sent-219, score-0.65]

67 s.t. b_r(x_r) ≥ 0, Σ_{x_r} b_r(x_r) = 1, Σ_{x_s\x_r} b_s(x_s) = b_r(x_r) ∀r, s ∈ P(r). Following [5, 22, 27] we consider the re-parametrized dual program q(λ) = Σ_r max_{x_r} [θ_r(x_r) + Σ_{c∈C(r)} λ_{c→r}(x_c) − Σ_{p∈P(r)} λ_{r→p}(x_r)], which is a sum of max-functions. [sent-236, score-1.746]

68 As in Sec. 4 we present a simple way to recover an ε-steepest descent direction, as well as to reconstruct an approximately optimal primal solution. [sent-239, score-0.541]

69 s.t. b_r(x_r) ≥ 0, Σ_{x_r} b_r(x_r) = 1, Σ_{x_r} b_r(x_r)θ̂_r(x_r) ≥ max_{x_r}{θ̂_r(x_r)} − ε. Let |R| be the total number of regions in the graph; then λ is |R|ε-close to the dual optimal solution if and only if the value of the above program equals zero. [sent-244, score-2.293]

70 Moreover, the optimal beliefs b*_r(x_r) are also |R|ε-close to the optimal solution of the corresponding primal program. [sent-245, score-0.625]

71 However, if q(λ) is not |R|ε-close to the dual optimal solution then the vector d*_{r→p}(x_r) = Σ_{x_p\x_r} b*_p(x_p) − b*_r(x_r) points towards the steepest ε-descent direction of the dual function. [sent-247, score-1.436]

72 Figure 2: Average time required for different solvers to achieve a specified accuracy on 30 spin glass models, (a) when solvers are applied to “hard” problems only, i.e. [sent-253, score-0.436]

73 Average results over 30 models are shown in (b); (c) shows the decrease of the dual value over time for ADLP and our ε-descent approach. [sent-256, score-0.574]

74 6 Experimental Evaluation: To benefit from the efficiency of convex max-product, our implementation starts by employing block-coordinate descent iterations before switching to the globally convergent ε-descent approach once the dual value decreases by less than ε = 0. [sent-257, score-0.901]

75 We use three states as convex max-product is optimal for pairwise spin glass models with only two states per random variable. [sent-263, score-0.483]

76 We generate a set of 1000 spin glass models and estimate the distribution of the dual value difference comparing the ε-descent approach with the convex max-product result after 10,000 iterations. [sent-267, score-0.962]

77 We observe in Fig. 1(a) that about 20% of the spin glass models have a dual value difference larger than zero. [sent-269, score-0.854]

78 We compare our implementation of the ε-steepest descent algorithm with the alternating direction method for dual MAP-LP relaxations (ADLP) [18]. [sent-271, score-0.84]

79 This is shown in Fig. 1(b), where we evaluate a single spin glass model and illustrate the dual value obtained after a certain amount of time. [sent-276, score-0.854]

80 We first focus on “hard” problems, where we defined “hard” as those spin glass models whose difference between convex max-product and the ε-descent method is larger than 0. [sent-280, score-0.42]

81 We used the minimum across all dual values found by all algorithms as the optimum. [sent-284, score-0.542]

82 The dual energy obtained after a given amount of time is illustrated in Fig. [sent-296, score-0.577]

83 7 Related Work: We explore methods to solve LP relaxations by monotonically decreasing the value of the dual objective and reconstructing a primal optimal solution. [sent-298, score-1.136]

84 For this purpose we investigate approximated subgradients of the dual program using the Fenchel-Young margins, and provide a method to reduce the dual objective in every step by a constant value until convergence to the optimum. [sent-299, score-1.412]

85 Efficient dual solvers were extensively studied in the context of LP relaxations for the MAP problem [14, 20, 25]. [sent-300, score-0.684]

86 The dual program is non-smooth, thus subgradient descent algorithms are guaranteed to reach the dual optimum [12], as well as recover the primal optimum [12]. [sent-301, score-1.891]

87 Dual block coordinate descent methods, typically referred to as convex max-product algorithms, are monotonically decreasing, and were shown to be faster than subgradient methods [3, 6, 11, 17, 22, 24, 27]. [sent-303, score-0.465]

88 Since the dual program is non-smooth, these algorithms can get stuck in non-optimal stationary points and cannot in general recover a primal optimal solution [26]. [sent-304, score-1.196]

89 Some methods use the soft-max with low temperature to smooth the dual objective in order to avoid corners as well as to recover primal optimal solutions [6, 7, 8]. [sent-308, score-1.013]

90 [19] applied the proximal method, employing a primal strictly concave perturbation, which results in a smooth dual approximation that is temperature independent. [sent-310, score-0.844]

91 This approach converges to the dual optimum and recovers the primal optimal solution. [sent-311, score-0.911]

92 Alternative methods applied augmented Lagrangian techniques to the primal [16] and the dual programs [18]. [sent-313, score-0.846]

93 The augmented Lagrangian method guarantees to reach the global optimum and recover the dual and primal solutions. [sent-314, score-0.971]

94 Unlike our approach, this method is not monotonically decreasing and works on a perturbed objective, thus cannot be efficiently integrated with convex max-product updates that perform block coordinate descent on the dual of the LP relaxation. [sent-315, score-1.051]

95 We use the ε-margin of the Fenchel-Young duality theorem to adjust the ε-subdifferential to the dual objective of the LP relaxation, thus augmenting the convex max-product with the ability to get out of corners. [sent-317, score-0.801]

96 However, our algorithm is monotonically decreasing and can be used for general graphical models, while the auction algorithm might increase its dual and its convergence properties hold only for network flow problems. [sent-322, score-0.752]

97 Mainly, these solvers can get stuck in a non-optimal stationary point, thus they cannot recover the primal optimal solution. [sent-325, score-0.579]

98 We explore the properties of subgradients of the dual objective and construct a simple algorithm that determines if the dual stationary point is optimal and recovers the primal optimal solution in this case (Corollary 1). [sent-326, score-1.693]

99 Moreover, we investigate the family of ε-subgradients using Fenchel-Young margins and construct a monotonically decreasing algorithm that is guaranteed to achieve optimal dual and primal solutions (Corollary 2), including general region graphs (Corollary 3). [sent-327, score-1.127]

100 The approximated steepest descent direction is recovered by solving a quadratic program subject to linear constraints. [sent-329, score-0.563]


similar papers computed by tfidf model

tfidf for this paper:

wordName wordTfidf (topN-words)

[('dual', 0.542), ('xr', 0.398), ('primal', 0.249), ('steepest', 0.168), ('xi', 0.162), ('spin', 0.162), ('bi', 0.158), ('glass', 0.15), ('descent', 0.138), ('program', 0.137), ('adlp', 0.13), ('qr', 0.117), ('beliefs', 0.113), ('cmp', 0.112), ('convex', 0.108), ('lp', 0.108), ('subgradients', 0.105), ('stuck', 0.099), ('monotonically', 0.092), ('claim', 0.089), ('subdifferential', 0.086), ('br', 0.08), ('direction', 0.08), ('duality', 0.08), ('relaxations', 0.08), ('ddg', 0.074), ('map', 0.069), ('optimal', 0.063), ('corners', 0.062), ('solvers', 0.062), ('corollary', 0.061), ('optimum', 0.057), ('convergent', 0.056), ('ddr', 0.056), ('pic', 0.056), ('marginalization', 0.054), ('recover', 0.051), ('lagrangian', 0.049), ('graphical', 0.049), ('block', 0.049), ('coordinate', 0.047), ('objective', 0.046), ('region', 0.046), ('relaxation', 0.045), ('reach', 0.044), ('guaranteed', 0.043), ('xp', 0.041), ('protein', 0.04), ('approximated', 0.04), ('perturbed', 0.039), ('regions', 0.039), ('disagreements', 0.039), ('message', 0.038), ('equals', 0.037), ('tamir', 0.037), ('hazan', 0.037), ('decreasing', 0.036), ('energy', 0.035), ('gets', 0.035), ('interactions', 0.034), ('globerson', 0.034), ('pollefeys', 0.033), ('auction', 0.033), ('stair', 0.033), ('epigraph', 0.033), ('tightening', 0.033), ('decrease', 0.032), ('propagation', 0.032), ('subgradient', 0.031), ('assignment', 0.031), ('margins', 0.031), ('max', 0.031), ('stationary', 0.03), ('schwing', 0.03), ('tti', 0.03), ('qd', 0.03), ('globally', 0.03), ('meltzer', 0.028), ('explore', 0.028), ('augmented', 0.028), ('passing', 0.028), ('graph', 0.027), ('belief', 0.027), ('zurich', 0.027), ('employing', 0.027), ('supports', 0.027), ('programs', 0.027), ('jaakkola', 0.027), ('conjugate', 0.026), ('nondifferentiable', 0.026), ('bundle', 0.026), ('noticing', 0.026), ('eth', 0.026), ('proximal', 0.026), ('iterates', 0.025), ('construct', 0.025), ('constraints', 0.025), ('get', 0.025), ('uai', 0.025), ('pami', 0.025)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.99999911 143 nips-2012-Globally Convergent Dual MAP LP Relaxation Solvers using Fenchel-Young Margins

Author: Alex Schwing, Tamir Hazan, Marc Pollefeys, Raquel Urtasun

Abstract: While finding the exact solution for the MAP inference problem is intractable for many real-world tasks, MAP LP relaxations have been shown to be very effective in practice. However, the most efficient methods that perform block coordinate descent can get stuck in sub-optimal points as they are not globally convergent. In this work we propose to augment these algorithms with an ε-descent approach and present a method to efficiently optimize for a descent direction in the subdifferential using a margin-based formulation of the Fenchel-Young duality theorem. Furthermore, the presented approach provides a methodology to construct a primal optimal solution from its dual optimal counterpart. We demonstrate the efficiency of the presented approach on spin glass models and protein interaction problems and show that our approach outperforms state-of-the-art solvers.

2 0.40418336 84 nips-2012-Convergence Rate Analysis of MAP Coordinate Minimization Algorithms

Author: Ofer Meshi, Amir Globerson, Tommi S. Jaakkola

Abstract: Finding maximum a posteriori (MAP) assignments in graphical models is an important task in many applications. Since the problem is generally hard, linear programming (LP) relaxations are often used. Solving these relaxations efficiently is thus an important practical problem. In recent years, several authors have proposed message passing updates corresponding to coordinate descent in the dual LP. However, these are generally not guaranteed to converge to a global optimum. One approach to remedy this is to smooth the LP, and perform coordinate descent on the smoothed dual. However, little is known about the convergence rate of this procedure. Here we perform a thorough rate analysis of such schemes and derive primal and dual convergence rates. We also provide a simple dual to primal mapping that yields feasible primal solutions with a guaranteed rate of convergence. Empirical evaluation supports our theoretical claims and shows that the method is highly competitive with state of the art approaches that yield global optima. 1

3 0.22518353 204 nips-2012-MAP Inference in Chains using Column Generation

Author: David Belanger, Alexandre Passos, Sebastian Riedel, Andrew McCallum

Abstract: Linear chains and trees are basic building blocks in many applications of graphical models, and they admit simple exact maximum a-posteriori (MAP) inference algorithms based on message passing. However, in many cases this computation is prohibitively expensive, due to quadratic dependence on variables’ domain sizes. The standard algorithms are inefficient because they compute scores for hypotheses for which there is strong negative local evidence. For this reason there has been significant previous interest in beam search and its variants; however, these methods provide only approximate results. This paper presents new exact inference algorithms based on the combination of column generation and pre-computed bounds on terms of the model’s scoring function. While we do not improve worst-case performance, our method substantially speeds real-world, typical-case inference in chains and trees. Experiments show our method to be twice as fast as exact Viterbi for Wall Street Journal part-of-speech tagging and over thirteen times faster for a joint part-of-speed and named-entity-recognition task. Our algorithm is also extendable to new techniques for approximate inference, to faster 0/1 loss oracles, and new opportunities for connections between inference and learning. We encourage further exploration of high-level reasoning about the optimization problem implicit in dynamic programs. 1

4 0.17657216 6 nips-2012-A Convex Formulation for Learning Scale-Free Networks via Submodular Relaxation

Author: Aaron Defazio, Tibério S. Caetano

Abstract: A key problem in statistics and machine learning is the determination of network structure from data. We consider the case where the structure of the graph to be reconstructed is known to be scale-free. We show that in such cases it is natural to formulate structured sparsity inducing priors using submodular functions, and we use their Lovász extension to obtain a convex relaxation. For tractable classes such as Gaussian graphical models, this leads to a convex optimization problem that can be efficiently solved. We show that our method results in an improvement in the accuracy of reconstructed networks for synthetic data. We also show how our prior encourages scale-free reconstructions on a bioinformatics dataset.

5 0.16320577 164 nips-2012-Iterative Thresholding Algorithm for Sparse Inverse Covariance Estimation

Author: Benjamin Rolfs, Bala Rajaratnam, Dominique Guillot, Ian Wong, Arian Maleki

Abstract: The ℓ1-regularized maximum likelihood estimation problem has recently become a topic of great interest within the machine learning, statistics, and optimization communities as a method for producing sparse inverse covariance estimators. In this paper, a proximal gradient method (G-ISTA) for performing ℓ1-regularized covariance matrix estimation is presented. Although numerous algorithms have been proposed for solving this problem, this simple proximal gradient method is found to have attractive theoretical and numerical properties. G-ISTA has a linear rate of convergence, resulting in an O(log(1/ε)) iteration complexity to reach a tolerance of ε. This paper gives eigenvalue bounds for the G-ISTA iterates, providing a closed-form linear convergence rate. The rate is shown to be closely related to the condition number of the optimal point. Numerical convergence results and timing comparisons for the proposed method are presented. G-ISTA is shown to perform very well, especially when the optimal point is well-conditioned.

6 0.10216817 325 nips-2012-Stochastic optimization and sparse statistical recovery: Optimal algorithms for high dimensions

7 0.10096862 186 nips-2012-Learning as MAP Inference in Discrete Graphical Models

8 0.10011719 290 nips-2012-Recovery of Sparse Probability Measures via Convex Programming

9 0.088307686 263 nips-2012-Optimal Regularized Dual Averaging Methods for Stochastic Optimization

10 0.088227861 324 nips-2012-Stochastic Gradient Descent with Only One Projection

11 0.086975522 326 nips-2012-Structure estimation for discrete graphical models: Generalized covariance matrices and their inverses

12 0.08627481 134 nips-2012-Finite Sample Convergence Rates of Zero-Order Stochastic Optimization Methods

13 0.078242749 44 nips-2012-Approximating Concavely Parameterized Optimization Problems

14 0.076266393 282 nips-2012-Proximal Newton-type methods for convex optimization

15 0.072710901 15 nips-2012-A Polylog Pivot Steps Simplex Algorithm for Classification

16 0.072411343 29 nips-2012-Accelerated Training for Matrix-norm Regularization: A Boosting Approach

17 0.070752293 131 nips-2012-Feature Clustering for Accelerating Parallel Coordinate Descent

18 0.07045988 69 nips-2012-Clustering Sparse Graphs

19 0.068089508 86 nips-2012-Convex Multi-view Subspace Learning

20 0.067272425 96 nips-2012-Density Propagation and Improved Bounds on the Partition Function


similar papers computed by lsi model

lsi for this paper:

topicId topicWeight

[(0, 0.188), (1, 0.03), (2, 0.129), (3, -0.112), (4, 0.062), (5, 0.083), (6, -0.04), (7, -0.147), (8, -0.047), (9, 0.009), (10, -0.004), (11, 0.024), (12, 0.07), (13, 0.177), (14, 0.086), (15, 0.091), (16, -0.227), (17, -0.156), (18, 0.006), (19, 0.113), (20, 0.195), (21, -0.01), (22, 0.027), (23, -0.001), (24, -0.137), (25, -0.019), (26, 0.056), (27, -0.014), (28, 0.135), (29, 0.2), (30, 0.12), (31, -0.03), (32, -0.046), (33, 0.047), (34, -0.007), (35, -0.098), (36, -0.056), (37, 0.019), (38, 0.018), (39, -0.0), (40, -0.045), (41, 0.077), (42, 0.01), (43, 0.105), (44, 0.003), (45, 0.059), (46, 0.042), (47, -0.002), (48, -0.035), (49, 0.019)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.98356539 143 nips-2012-Globally Convergent Dual MAP LP Relaxation Solvers using Fenchel-Young Margins

Author: Alex Schwing, Tamir Hazan, Marc Pollefeys, Raquel Urtasun

Abstract: While finding the exact solution for the MAP inference problem is intractable for many real-world tasks, MAP LP relaxations have been shown to be very effective in practice. However, the most efficient methods that perform block coordinate descent can get stuck in sub-optimal points as they are not globally convergent. In this work we propose to augment these algorithms with an ε-descent approach and present a method to efficiently optimize for a descent direction in the subdifferential using a margin-based formulation of the Fenchel-Young duality theorem. Furthermore, the presented approach provides a methodology to construct a primal optimal solution from its dual optimal counterpart. We demonstrate the efficiency of the presented approach on spin glass models and protein interaction problems and show that our approach outperforms state-of-the-art solvers.

2 0.95233256 84 nips-2012-Convergence Rate Analysis of MAP Coordinate Minimization Algorithms

Author: Ofer Meshi, Amir Globerson, Tommi S. Jaakkola

Abstract: Finding maximum a posteriori (MAP) assignments in graphical models is an important task in many applications. Since the problem is generally hard, linear programming (LP) relaxations are often used. Solving these relaxations efficiently is thus an important practical problem. In recent years, several authors have proposed message passing updates corresponding to coordinate descent in the dual LP. However, these are generally not guaranteed to converge to a global optimum. One approach to remedy this is to smooth the LP, and perform coordinate descent on the smoothed dual. However, little is known about the convergence rate of this procedure. Here we perform a thorough rate analysis of such schemes and derive primal and dual convergence rates. We also provide a simple dual to primal mapping that yields feasible primal solutions with a guaranteed rate of convergence. Empirical evaluation supports our theoretical claims and shows that the method is highly competitive with state of the art approaches that yield global optima. 1

3 0.8832283 204 nips-2012-MAP Inference in Chains using Column Generation

Author: David Belanger, Alexandre Passos, Sebastian Riedel, Andrew McCallum

Abstract: Linear chains and trees are basic building blocks in many applications of graphical models, and they admit simple exact maximum a-posteriori (MAP) inference algorithms based on message passing. However, in many cases this computation is prohibitively expensive, due to quadratic dependence on variables’ domain sizes. The standard algorithms are inefficient because they compute scores for hypotheses for which there is strong negative local evidence. For this reason there has been significant previous interest in beam search and its variants; however, these methods provide only approximate results. This paper presents new exact inference algorithms based on the combination of column generation and pre-computed bounds on terms of the model’s scoring function. While we do not improve worst-case performance, our method substantially speeds real-world, typical-case inference in chains and trees. Experiments show our method to be twice as fast as exact Viterbi for Wall Street Journal part-of-speech tagging and over thirteen times faster for a joint part-of-speed and named-entity-recognition task. Our algorithm is also extendable to new techniques for approximate inference, to faster 0/1 loss oracles, and new opportunities for connections between inference and learning. We encourage further exploration of high-level reasoning about the optimization problem implicit in dynamic programs. 1

4 0.55236179 131 nips-2012-Feature Clustering for Accelerating Parallel Coordinate Descent

Author: Chad Scherrer, Ambuj Tewari, Mahantesh Halappanavar, David Haglin

Abstract: Large-scale ℓ1-regularized loss minimization problems arise in high-dimensional applications such as compressed sensing and high-dimensional supervised learning, including classification and regression problems. High-performance algorithms and implementations are critical to efficiently solving these problems. Building upon previous work on coordinate descent algorithms for ℓ1-regularized problems, we introduce a novel family of algorithms called block-greedy coordinate descent that includes, as special cases, several existing algorithms such as SCD, Greedy CD, Shotgun, and Thread-Greedy. We give a unified convergence analysis for the family of block-greedy algorithms. The analysis suggests that block-greedy coordinate descent can better exploit parallelism if features are clustered so that the maximum inner product between features in different blocks is small. Our theoretical convergence analysis is supported with experimental results using data from diverse real-world applications. We hope that algorithmic approaches and convergence analysis we provide will not only advance the field, but will also encourage researchers to systematically explore the design space of algorithms for solving large-scale ℓ1-regularization problems.

5 0.55124366 302 nips-2012-Scaling MPE Inference for Constrained Continuous Markov Random Fields with Consensus Optimization

Author: Stephen Bach, Matthias Broecheler, Lise Getoor, Dianne O'leary

Abstract: Probabilistic graphical models are powerful tools for analyzing constrained, continuous domains. However, finding most-probable explanations (MPEs) in these models can be computationally expensive. In this paper, we improve the scalability of MPE inference in a class of graphical models with piecewise-linear and piecewise-quadratic dependencies and linear constraints over continuous domains. We derive algorithms based on a consensus-optimization framework and demonstrate their superior performance over state of the art. We show empirically that in a large-scale voter-preference modeling problem our algorithms scale linearly in the number of dependencies and constraints. 1

6 0.50668371 164 nips-2012-Iterative Thresholding Algorithm for Sparse Inverse Covariance Estimation

7 0.50360161 6 nips-2012-A Convex Formulation for Learning Scale-Free Networks via Submodular Relaxation

8 0.49511465 263 nips-2012-Optimal Regularized Dual Averaging Methods for Stochastic Optimization

9 0.47235024 96 nips-2012-Density Propagation and Improved Bounds on the Partition Function

10 0.43894267 290 nips-2012-Recovery of Sparse Probability Measures via Convex Programming

11 0.43401206 248 nips-2012-Nonparanormal Belief Propagation (NPNBP)

12 0.42702007 85 nips-2012-Convergence and Energy Landscape for Cheeger Cut Clustering

13 0.42562783 230 nips-2012-Multiple Choice Learning: Learning to Produce Multiple Structured Outputs

14 0.41299662 186 nips-2012-Learning as MAP Inference in Discrete Graphical Models

15 0.41299421 325 nips-2012-Stochastic optimization and sparse statistical recovery: Optimal algorithms for high dimensions

16 0.40012014 214 nips-2012-Minimizing Sparse High-Order Energies by Submodular Vertex-Cover

17 0.39870304 44 nips-2012-Approximating Concavely Parameterized Optimization Problems

18 0.39860165 240 nips-2012-Newton-Like Methods for Sparse Inverse Covariance Estimation

19 0.39770788 267 nips-2012-Perceptron Learning of SAT

20 0.39650762 134 nips-2012-Finite Sample Convergence Rates of Zero-Order Stochastic Optimization Methods


similar papers computed by lda model

lda for this paper:

topicId topicWeight

[(0, 0.032), (21, 0.014), (38, 0.27), (42, 0.032), (54, 0.029), (55, 0.014), (68, 0.238), (74, 0.047), (76, 0.092), (80, 0.083), (92, 0.062)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.87049609 143 nips-2012-Globally Convergent Dual MAP LP Relaxation Solvers using Fenchel-Young Margins

Author: Alex Schwing, Tamir Hazan, Marc Pollefeys, Raquel Urtasun

Abstract: While finding the exact solution for the MAP inference problem is intractable for many real-world tasks, MAP LP relaxations have been shown to be very effective in practice. However, the most efficient methods that perform block coordinate descent can get stuck in sub-optimal points as they are not globally convergent. In this work we propose to augment these algorithms with an ε-descent approach and present a method to efficiently optimize for a descent direction in the subdifferential using a margin-based formulation of the Fenchel-Young duality theorem. Furthermore, the presented approach provides a methodology to construct a primal optimal solution from its dual optimal counterpart. We demonstrate the efficiency of the presented approach on spin glass models and protein interaction problems and show that our approach outperforms state-of-the-art solvers.

2 0.80242127 290 nips-2012-Recovery of Sparse Probability Measures via Convex Programming

Author: Mert Pilanci, Laurent E. Ghaoui, Venkat Chandrasekaran

Abstract: We consider the problem of cardinality penalized optimization of a convex function over the probability simplex with additional convex constraints. The classical ℓ1 regularizer fails to promote sparsity on the probability simplex since the ℓ1 norm on the probability simplex is trivially constant. We propose a direct relaxation of the minimum cardinality problem and show that it can be efficiently solved using convex programming. As a first application we consider recovering a sparse probability measure given moment constraints, in which our formulation becomes linear programming, hence can be solved very efficiently. A sufficient condition for exact recovery of the minimum cardinality solution is derived for arbitrary affine constraints. We then develop a penalized version for the noisy setting which can be solved using second order cone programs. The proposed method outperforms known rescaling heuristics based on the ℓ1 norm. As a second application we consider convex clustering using a sparse Gaussian mixture and compare our results with the well known soft k-means algorithm.

3 0.77501178 29 nips-2012-Accelerated Training for Matrix-norm Regularization: A Boosting Approach

Author: Xinhua Zhang, Dale Schuurmans, Yao-liang Yu

Abstract: Sparse learning models typically combine a smooth loss with a nonsmooth penalty, such as trace norm. Although recent developments in sparse approximation have offered promising solution methods, current approaches either apply only to matrix-norm constrained problems or provide suboptimal convergence rates. In this paper, we propose a boosting method for regularized learning that guarantees accuracy within O(1/ ) iterations. Performance is further accelerated by interlacing boosting with fixed-rank local optimization—exploiting a simpler local objective than previous work. The proposed method yields state-of-the-art performance on large-scale problems. We also demonstrate an application to latent multiview learning for which we provide the first efficient weak-oracle. 1

4 0.77381992 278 nips-2012-Probabilistic n-Choose-k Models for Classification and Ranking

Author: Kevin Swersky, Brendan J. Frey, Daniel Tarlow, Richard S. Zemel, Ryan P. Adams

Abstract: In categorical data there is often structure in the number of variables that take on each label. For example, the total number of objects in an image and the number of highly relevant documents per query in web search both tend to follow a structured distribution. In this paper, we study a probabilistic model that explicitly includes a prior distribution over such counts, along with a count-conditional likelihood that defines probabilities over all subsets of a given size. When labels are binary and the prior over counts is a Poisson-Binomial distribution, a standard logistic regression model is recovered, but for other count distributions, such priors induce global dependencies and combinatorics that appear to complicate learning and inference. However, we demonstrate that simple, efficient learning procedures can be derived for more general forms of this model. We illustrate the utility of the formulation by exploring applications to multi-object classification, learning to rank, and top-K classification. 1

5 0.77212298 268 nips-2012-Perfect Dimensionality Recovery by Variational Bayesian PCA

Author: Shinichi Nakajima, Ryota Tomioka, Masashi Sugiyama, S. D. Babacan

Abstract: The variational Bayesian (VB) approach is one of the best tractable approximations to the Bayesian estimation, and it was demonstrated to perform well in many applications. However, its good performance was not fully understood theoretically. For example, VB sometimes produces a sparse solution, which is regarded as a practical advantage of VB, but such sparsity is hardly observed in the rigorous Bayesian estimation. In this paper, we focus on probabilistic PCA and give more theoretical insight into the empirical success of VB. More specifically, for the situation where the noise variance is unknown, we derive a sufficient condition for perfect recovery of the true PCA dimensionality in the large-scale limit when the size of an observed matrix goes to infinity. In our analysis, we obtain bounds for a noise variance estimator and simple closed-form solutions for other parameters, which themselves are actually very useful for better implementation of VB-PCA. 1

6 0.7718547 165 nips-2012-Iterative ranking from pair-wise comparisons

7 0.77055144 84 nips-2012-Convergence Rate Analysis of MAP Coordinate Minimization Algorithms

8 0.76637822 208 nips-2012-Matrix reconstruction with the local max norm

9 0.76370537 262 nips-2012-Optimal Neural Tuning Curves for Arbitrary Stimulus Distributions: Discrimax, Infomax and Minimum $L p$ Loss

10 0.76369667 15 nips-2012-A Polylog Pivot Steps Simplex Algorithm for Classification

11 0.7636289 241 nips-2012-No-Regret Algorithms for Unconstrained Online Convex Optimization

12 0.75976944 181 nips-2012-Learning Multiple Tasks using Shared Hypotheses

13 0.75645691 86 nips-2012-Convex Multi-view Subspace Learning

14 0.75517935 240 nips-2012-Newton-Like Methods for Sparse Inverse Covariance Estimation

15 0.75375444 227 nips-2012-Multiclass Learning with Simplex Coding

16 0.75141984 255 nips-2012-On the Use of Non-Stationary Policies for Stationary Infinite-Horizon Markov Decision Processes

17 0.75034857 285 nips-2012-Query Complexity of Derivative-Free Optimization

18 0.75033498 17 nips-2012-A Scalable CUR Matrix Decomposition Algorithm: Lower Time Complexity and Tighter Bound

19 0.75018746 178 nips-2012-Learning Label Trees for Probabilistic Modelling of Implicit Feedback

20 0.74732995 187 nips-2012-Learning curves for multi-task Gaussian process regression