nips nips2012 nips2012-84 knowledge-graph by maker-knowledge-mining

84 nips-2012-Convergence Rate Analysis of MAP Coordinate Minimization Algorithms


Source: pdf

Author: Ofer Meshi, Amir Globerson, Tommi S. Jaakkola

Abstract: Finding maximum a posteriori (MAP) assignments in graphical models is an important task in many applications. Since the problem is generally hard, linear programming (LP) relaxations are often used. Solving these relaxations efficiently is thus an important practical problem. In recent years, several authors have proposed message passing updates corresponding to coordinate descent in the dual LP. However, these are generally not guaranteed to converge to a global optimum. One approach to remedy this is to smooth the LP, and perform coordinate descent on the smoothed dual. However, little is known about the convergence rate of this procedure. Here we perform a thorough rate analysis of such schemes and derive primal and dual convergence rates. We also provide a simple dual to primal mapping that yields feasible primal solutions with a guaranteed rate of convergence. Empirical evaluation supports our theoretical claims and shows that the method is highly competitive with state of the art approaches that yield global optima. 1

Reference: text


Summary: the most important sentences generated by the tfidf model

sentIndex sentText sentNum sentScore

1 In recent years, several authors have proposed message passing updates corresponding to coordinate descent in the dual LP. [sent-12, score-0.712]

2 One approach to remedy this is to smooth the LP, and perform coordinate descent on the smoothed dual. [sent-14, score-0.429]

3 Here we perform a thorough rate analysis of such schemes and derive primal and dual convergence rates. [sent-16, score-0.865]

4 We also provide a simple dual to primal mapping that yields feasible primal solutions with a guaranteed rate of convergence. [sent-17, score-1.293]

5 One class of particularly simple algorithms, which we will focus on here, consists of coordinate minimization based approaches. [sent-35, score-0.414]

6 These work by first taking the dual of the LP and then optimizing the dual in a block coordinate fashion [21]. [sent-37, score-0.98]

7 In many cases, the coordinate block operations can be performed in closed form, resulting in updates quite similar to the max-product message passing algorithm. [sent-38, score-0.528]

8 This is different from a coordinate descent strategy where instead a gradient step is performed on the chosen coordinates (rather than full optimization). [sent-40, score-0.429]

9 A main caveat of the coordinate minimization approach is that it will not necessarily find the global optimum of the LP (although in practice it often does). [sent-41, score-0.472]

10 Several authors have proposed to smooth the LP with entropy terms and employ variants of coordinate minimization [7, 26]. [sent-43, score-0.45]

11 Moreover, since the algorithms work in the dual, there is no simple procedure to map the result back into primal feasible variables. [sent-45, score-0.593]

12 We seek to address all these shortcomings: we present a convergence rate analysis of dual coordinate minimization algorithms, provide a simple scheme for generating primal variables from dual ones, and analyze the resulting primal convergence rates. [sent-46, score-2.121]

13 Convergence rates for coordinate minimization are not common in the literature. [sent-47, score-0.497]

14 On the other hand, for coordinate descent methods, some rates have been recently obtained for greedy and stochastic update schemes [16, 20]. [sent-50, score-0.722]

15 These do not apply directly to the full coordinate minimization case which we study. [sent-51, score-0.414]

16 Specifically, M_L = { µ ≥ 0 : Σ_{x_{c\i}} µ_c(x_c) = µ_i(x_i) ∀c, i ∈ c, x_i ; Σ_{x_i} µ_i(x_i) = 1 ∀i } (3). If the maximizer of P-MAP has only integral values (i.e. [sent-71, score-0.291]

17 Given a smoothing parameter τ > 0, we consider the following smoothed primal problem: P-MAPτ: max_{µ ∈ M_L} P_τ(µ) = max_{µ ∈ M_L} µ·θ + (1/τ) Σ_c H(µ_c) + (1/τ) Σ_i H(µ_i). (Footnote: although singleton factors are not needed for generality, we keep them for notational convenience.) [sent-81, score-0.591]
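To make the smoothed objective concrete, here is a minimal sketch (not code from the paper) of how P_τ(µ) could be evaluated on a toy factor graph; the dictionary layout of µ and θ and the helper names are illustrative assumptions.

```python
import numpy as np

def entropy(p, eps=1e-12):
    """Shannon entropy H(p) of a distribution stored as a (possibly multi-dimensional) array."""
    p = np.asarray(p, dtype=float)
    return float(-np.sum(p * np.log(p + eps)))

def smoothed_primal(mu_c, mu_i, theta_c, theta_i, tau):
    """P_tau(mu) = mu.theta + (1/tau) sum_c H(mu_c) + (1/tau) sum_i H(mu_i)."""
    linear = sum(np.sum(mu_c[c] * theta_c[c]) for c in mu_c)
    linear += sum(np.dot(mu_i[i], theta_i[i]) for i in mu_i)
    ent = sum(entropy(m) for m in mu_c.values()) + sum(entropy(m) for m in mu_i.values())
    return linear + ent / tau

# Toy example: two binary variables joined by a single pairwise factor.
theta_i = {0: np.array([0.0, 1.0]), 1: np.array([0.5, 0.0])}
theta_c = {(0, 1): np.array([[0.0, 0.2], [0.2, 0.0]])}
mu_i = {0: np.array([0.5, 0.5]), 1: np.array([0.5, 0.5])}
mu_c = {(0, 1): np.full((2, 2), 0.25)}
print(smoothed_primal(mu_c, mu_i, theta_c, theta_i, tau=10.0))
```

As τ grows, the entropy contribution (1/τ)ΣH(·) shrinks and the value approaches the unsmoothed LP objective µ·θ.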

18 Note that as τ → ∞ we obtain the original primal problem. [sent-84, score-0.427]

19 We shall be particularly interested in the dual of P M APτ since it facilitates simple coordinate minimization updates. [sent-97, score-0.772]

20 Our dual variables will be denoted by δci (xi ), which can be interpreted as the messages from subset c to node i about the value of variable xi . [sent-98, score-0.462]

21 The dual variables are therefore indexed by (c, i, xi ) and written as δci (xi ). [sent-99, score-0.462]
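As a concrete illustration (an assumed representation, not the authors' code), the dual variables can be stored as one array per (factor, variable) pair:

```python
import numpy as np

def init_duals(factors, num_states):
    """delta[(c, i)][x_i] plays the role of delta_{ci}(x_i).

    factors: iterable of tuples of variable indices; num_states: dict i -> |X_i|.
    """
    return {(c, i): np.zeros(num_states[i]) for c in factors for i in c}

delta = init_duals(factors=[(0, 1), (1, 2)], num_states={0: 2, 1: 3, 2: 2})
print(sorted(delta.keys()))  # [((0, 1), 0), ((0, 1), 1), ((1, 2), 1), ((1, 2), 2)]
```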

22 Finally, we shall be interested in transformations between dual variables δ and primal variables µ (see Section 5). [sent-101, score-0.873]

23 (i.e., they can be used to switch from optimal dual variables to optimal primal variables). [sent-104, score-0.767]

24 For the dual variables δ that minimize F (δ) it holds that µ(δ) are feasible (i. [sent-106, score-0.424]

25 However, we will also consider µ(δ) for non-optimal δ, and show how to obtain primal feasible approximations from µ(δ). [sent-109, score-0.534]

26 These will be helpful in obtaining primal convergence rates. [sent-110, score-0.494]

27 We shall make repeated use of this fact to link primal and dual variables. [sent-116, score-0.759]

28 3 Coordinate Minimization Algorithms: In this section we propose several coordinate minimization procedures for solving DMAPτ (Eq. [sent-117, score-0.414]

29 We first set some notation to define block coordinate minimization algorithms. [sent-119, score-0.583]

30 Block coordinate minimization algorithms work as follows: at each iteration, first set δ^{t+1} = δ^t. [sent-132, score-0.414]

31 Next choose a block S_i and set δ^{t+1}_{S_i} = argmin_{δ_{S_i}} F_i(δ_{S_i}; δ^t) (9), where we use F_i(δ_{S_i}; δ^t) to denote the function F restricted to the variables δ_{S_i} and where all other variables are set to their value in δ^t. [sent-133, score-0.283]
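The following is a minimal sketch of the generic scheme in Eq. (9). The objective F (a quadratic stands in for the smoothed dual), the block index sets, and the use of a numerical solver instead of the paper's closed-form block updates are all illustrative assumptions.

```python
import numpy as np
from scipy.optimize import minimize

def block_coordinate_minimize(F, delta0, blocks, num_iters, choose_block):
    delta = np.array(delta0, dtype=float)
    for t in range(num_iters):
        S = blocks[choose_block(F, delta, blocks, t)]    # indices of the chosen block S_i
        def F_restricted(z):
            d = delta.copy()
            d[S] = z                                     # all other coordinates keep their value in delta^t
            return F(d)
        delta[S] = minimize(F_restricted, delta[S]).x    # delta^{t+1}_{S_i} = argmin F_i(.; delta^t)
    return delta

# Toy usage: a smooth quadratic standing in for the smoothed dual F.
A = np.array([[2.0, 0.5, 0.0], [0.5, 1.0, 0.2], [0.0, 0.2, 1.5]])
b = np.array([1.0, 0.0, 2.0])
F = lambda d: 0.5 * d @ A @ d - b @ d
blocks = [[0], [1, 2]]
cyclic = lambda F, d, blocks, t: t % len(blocks)         # simple cyclic block choice
print(block_coordinate_minimize(F, np.zeros(3), blocks, 20, cyclic))
```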

32 The greedy scheme is to choose the S_i that maximizes ‖∇_{S_i} F(δ^t)‖_∞. [sent-145, score-0.192]

33 Note that to find the best coordinate block we presumably must process all the sets S_i. [sent-148, score-0.245]
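A sketch of that greedy rule: score each block by the infinity norm of its gradient and pick the largest. The finite-difference gradient is purely for illustration (the analytic block gradient of F would be used in practice); this function plugs into the loop sketched above as its choose_block argument.

```python
import numpy as np

def numeric_grad(F, delta, h=1e-6):
    g = np.zeros_like(delta)
    for j in range(delta.size):
        e = np.zeros_like(delta)
        e[j] = h
        g[j] = (F(delta + e) - F(delta - e)) / (2 * h)   # central difference
    return g

def greedy_block(F, delta, blocks, t):
    g = numeric_grad(F, delta)
    # argmax_i || grad_{S_i} F(delta^t) ||_inf
    return int(np.argmax([np.max(np.abs(g[S])) for S in blocks]))
```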

34 Another consideration when designing coordinate minimization algorithms is the choice of block size. [sent-154, score-0.583]

35 This is the block chosen in the max-sum-diffusion (MSD) algorithm (see [25] and [26] for non-smooth and smooth MSD). [sent-156, score-0.205]

36 A larger block that also facilitates closed-form updates is the set of variables δ_{·i}(·). [sent-157, score-0.281]

37 The update is used in [13] for the non-smoothed dual (but the possibility of applying it to the smoothed version is mentioned). [sent-160, score-0.425]

38 To derive the star update around variable i, one needs to fix all variables except δ_{·i}(·) and then set the latter to minimize F(δ). [sent-162, score-0.258]

39 4 Dual Convergence Rate Analysis: We begin with the convergence rates of the dual F using the greedy and random schemes described in Section 3. [sent-169, score-0.644]

40 In Section 5 we subsequently show how to obtain a primal feasible solution and how the dual rates give rise to primal rates. [sent-170, score-1.336]

41 Our analysis builds on the fact that we can lower bound the improvement at each step, as a function of some norm of the block gradient. [sent-171, score-0.209]

42 Define B_1 to be a constant such that ‖δ^t − δ^*‖_1 ≤ B_1 for all t. [sent-175, score-0.338]

43 If coordinate minimization of each block S_i satisfies F(δ^t) − F(δ^{t+1}) ≥ (1/k) ‖∇_{S_i} F(δ^t)‖_∞^2 (11) for all t, then for any ε > 0, after T = k B_1^2 / ε iterations of the greedy algorithm, F(δ^T) − F(δ^*) ≤ ε. [sent-176, score-0.438]
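For intuition, here is a sketch of the standard argument behind bounds of this form (an illustration under the stated assumptions, not the paper's exact proof):

```latex
% Convexity and Hoelder's inequality give
\[
  F(\delta^t) - F(\delta^*) \;\le\; \nabla F(\delta^t)\cdot(\delta^t-\delta^*)
  \;\le\; \|\nabla F(\delta^t)\|_\infty \, \|\delta^t-\delta^*\|_1
  \;\le\; B_1 \max_i \|\nabla_{S_i} F(\delta^t)\|_\infty .
\]
% Writing \Delta_t = F(\delta^t) - F(\delta^*) and plugging this into the
% per-step improvement condition (11) for the greedy block choice,
\[
  \Delta_t - \Delta_{t+1} \;\ge\; \tfrac{1}{k}\max_i \|\nabla_{S_i} F(\delta^t)\|_\infty^2
  \;\ge\; \frac{\Delta_t^2}{k B_1^2},
\]
% and the recursion \Delta_{t+1} \le \Delta_t - \Delta_t^2/(k B_1^2) gives
% \Delta_T = O(k B_1^2 / T), i.e. a dual gap of at most \epsilon after
% T = O(k B_1^2 / \epsilon) iterations.
```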

44 If coordinate minimization of each block S_i satisfies F(δ^t) − F(δ^{t+1}) ≥ (1/k) ‖∇_{S_i} F(δ^t)‖_2^2 (15) for all t, then for any ε > 0, after T = … iterations, E[F(δ^T)] − F(δ^*) ≤ ε. [sent-187, score-0.583]

45 Note that since the cost of the update is roughly linear in the size of the block, this bound does not tell us which block size is better (the cost of an update times the number of blocks is roughly constant). [sent-191, score-0.494]

46 We can now obtain rates for our coordinate minimization scheme for optimizing DMAPτ by finding the k to be used in the conditions of Eq. [sent-193, score-0.522]

47 The star update for xi satisfies the conditions in Eqs. [sent-199, score-0.323]

48 We can then use the fact that the Lipschitz constant of a star block is at most 2τ Ni (this can be calculated as in [18]) to obtain the result. [sent-205, score-0.306]

49 First, we see that both stochastic and greedy schemes achieve a rate of O(τ/ε). [sent-211, score-0.304]

50 This matches the known rates for regular (non-accelerated) gradient descent on functions with Lipschitz continuous gradient (e. [sent-212, score-0.281]

51 (e.g., see [14]), although in practice coordinate minimization is often much faster. [sent-214, score-0.414]

52 5 The main difference between the greedy and stochastic rates is that the factor |S| (the number of blocks) does not appear in the greedy rate, and does appear in the stochastic one. [sent-217, score-0.515]

53 This can have a considerable effect since |S| is either the number of variables n (in the star update) or the number of factors |C| (in MPLP). [sent-218, score-0.245]

54 The greedy algorithm does pay a price for this advantage, since it has to find the optimal block to update at each iteration. [sent-222, score-0.4]

55 Indeed, our empirical results show that the greedy algorithm consistently outperforms the stochastic one (see Section 6). [sent-229, score-0.216]

56 5 Primal Convergence: Thus far we have considered only dual variables. [sent-230, score-0.35]

57 However, it is often important to recover the primal variables. [sent-231, score-0.427]

58 We therefore focus on extracting primal feasible solutions from the current δ, and characterize the degree of primal optimality and the associated rates. [sent-232, score-0.938]

59 This is true also for other approaches to recovering primal variables from the dual, such as averaging subgradients when using subgradient descent (see, e. [sent-236, score-0.554]

60 We propose a simple two-step algorithm for transforming any dual variables δ into primal feasible variables µ̃(δ) ∈ M_L. [sent-239, score-0.908]

61 The resulting µ̃(δ) will also be shown to converge to the optimal primal solution in Section 5. [sent-240, score-0.482]

62 Algorithm 1 (Mapping to a feasible primal solution), Step 1: Make marginals consistent, setting µ̄_c(x_c) = µ_c(x_c) + Σ_{i ∈ c} (1/|X_{c\i}|) (µ_i(x_i) − µ_c(x_i)). [sent-243, score-0.598]

63 Step 2: Pull the marginals into [0, 1]:
λ = 0
for c ∈ C, x_c do
  if µ̄_c(x_c) < 0 then λ = max(λ, −µ̄_c(x_c) / (−µ̄_c(x_c) + 1/|X_c|))
  else if µ̄_c(x_c) > 1 then λ = max(λ, (µ̄_c(x_c) − 1) / (µ̄_c(x_c) − 1/|X_c|))
  end if
end for
for … [sent-245, score-0.404]
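A minimal Python sketch of the two steps as reconstructed above, restricted to pairwise factors; the data layout is an assumption, and it presumes each input µ_c sums to one (as the entropy-based µ(δ) does), so treat it as an illustration rather than the authors' implementation.

```python
import numpy as np

def map_to_feasible(mu_c, mu_i):
    """Two-step mapping to a feasible primal point (pairwise factors only).

    mu_c: dict (i, j) -> 2-D array over (x_i, x_j); mu_i: dict i -> 1-D array over x_i.
    """
    # Step 1: make each factor marginal agree with the singleton marginals.
    mu_bar = {}
    for (i, j), m in mu_c.items():
        corr_i = (mu_i[i] - m.sum(axis=1)) / m.shape[1]   # correction spread over |X_{c \ i}| states
        corr_j = (mu_i[j] - m.sum(axis=0)) / m.shape[0]
        mu_bar[(i, j)] = m + corr_i[:, None] + corr_j[None, :]
    # Step 2: smallest lambda that pulls every entry back into [0, 1].
    lam = 0.0
    for m in mu_bar.values():
        u = 1.0 / m.size                                  # uniform value 1/|X_c|
        neg, pos = m[m < 0.0], m[m > 1.0]
        if neg.size:
            lam = max(lam, float(np.max(-neg / (u - neg))))
        if pos.size:
            lam = max(lam, float(np.max((pos - 1.0) / (pos - u))))
    # Convex combination with the uniform distribution preserves consistency.
    mu_c_t = {c: (1 - lam) * m + lam / m.size for c, m in mu_bar.items()}
    mu_i_t = {i: (1 - lam) * v + lam / v.size for i, v in mu_i.items()}
    return mu_c_t, mu_i_t

# Tiny usage example with slightly inconsistent marginals.
mu_c = {(0, 1): np.array([[0.6, 0.1], [0.1, 0.2]])}
mu_i = {0: np.array([0.5, 0.5]), 1: np.array([0.4, 0.6])}
print(map_to_feasible(mu_c, mu_i))
```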

64 In the second step we “pull” µ̄ back into the feasible regime by taking a convex combination with the uniform distribution. (Footnote 7: this was also used in the residual belief propagation approach [4], which however is less theoretically justified than what we propose here.) [sent-253, score-0.191]

65 5.1 Primal Convergence Rate: Now that we have a procedure for obtaining a primal solution, we analyze the corresponding convergence rate. [sent-258, score-0.664]

66 First, we show that if we have δ for which ‖∇F(δ)‖_∞ ≤ ε, then µ̃(δ) (after Algorithm 1) is an O(ε) primal optimal solution. [sent-259, score-0.427]

67 Denote by P_τ^* the optimum of the smoothed primal P-MAPτ. [sent-262, score-0.563]

68 For any set of dual variables δ, and any ε ∈ R(τ) (see supp. [sent-263, score-0.34]

69 We can now translate dual rates into primal rates. [sent-270, score-0.819]

70 Given any algorithm for optimizing DMAPτ and ε ∈ R(τ), if the algorithm is guaranteed to achieve F(δ^t) − F(δ^*) ≤ ε after O(g(ε)) iterations, then it is guaranteed to be ε primal optimal, i.e., P_τ^* − P_τ(µ̃(δ^t)) ≤ ε … [sent-277, score-0.483]

71 The theorem lets us directly translate dual convergence rates into primal ones. [sent-280, score-0.886]

72 Note that it applies to any algorithm for DMAPτ (not only coordinate minimization), and the only property of the algorithm used in the proof is F(δ^t) ≤ F(0) for all t. [sent-281, score-0.245]

73 6 Experiments: In this section we evaluate coordinate minimization algorithms on a MAP problem, and compare them to state-of-the-art baselines. [sent-283, score-0.414]

74 Specifically, we compare the running time of greedy coordinate minimization, stochastic coordinate minimization, full gradient descent, and FISTA – an accelerated gradient method [1] (details on the gradient-based algorithms are provided in the supplementary, Section 3). [sent-284, score-0.896]

75 We first notice that the greedy algorithm converges faster than the stochastic one. [sent-291, score-0.216]

76 Second, we observe that the coordinate minimization algorithms are competitive with the accelerated gradient method FISTA and are much faster than the gradient method. [sent-293, score-0.604]

77 As Theorem 3 predicts, primal convergence is slower than dual convergence (notice the logarithmic timescale). [sent-295, score-0.844]

78 Finally, we can see that better convergence of the dual objective corresponds to better convergence of the primal objective, in both fractional and integral domains. [sent-296, score-0.991]

79 In our experiments the quality of the decoded integral solution (dashed lines) significantly exceeds that of the fractional solution. [sent-297, score-0.189]

80 Although sometimes a fractional solution can be useful in itself, this suggests that if only an integral solution is sought then it could be enough to decode directly from the dual variables. [sent-298, score-0.511]

81 Figure 1: Comparison of coordinate minimization, gradient descent, and the accelerated gradient algorithms on a protein side-chain prediction task (x-axis: Runtime (secs)). [sent-306, score-0.472]

82 The objective (Eq. (4)) of the feasible primal solution of Algorithm 1 is also shown (lower solid line), as well as the objective (Eq. [sent-311, score-0.575]

83 (1)) of the best decoded integer solution (dashed line; those are decoded directly from the dual variables δ). [sent-312, score-0.456]

84 7 Discussion: We presented the first convergence rate analysis of dual coordinate minimization algorithms on MAP-LP relaxations. [sent-326, score-0.808]

85 We also showed how such dual iterates can be turned into primal feasible iterates and analyzed the rate with which these primal iterates converge to the primal optimum. [sent-327, score-1.814]

86 The primal mapping is of considerable practical value, as it allows us to monitor the distance between the upper (dual) and lower (primal) bounds on the optimum and use this as a stopping criterion. [sent-328, score-0.508]

87 Note that this cannot be done without a primal feasible solution. [sent-329, score-0.511]
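A small sketch of that stopping rule, wired together from the (assumed) helpers in the earlier sketches: the smoothed dual objective F, a dual-to-primal map µ(δ), the feasibility mapping of Algorithm 1, and the smoothed primal objective.

```python
def duality_gap_stop(F, delta, delta_to_mu, map_to_feasible, smoothed_primal,
                     theta_c, theta_i, tau, tol=1e-3):
    """Stop once the dual upper bound and the feasible primal lower bound are within tol."""
    mu_c, mu_i = delta_to_mu(delta)                  # mu(delta), the dual-to-primal mapping
    mu_c_t, mu_i_t = map_to_feasible(mu_c, mu_i)     # Algorithm 1: project to a feasible point
    gap = F(delta) - smoothed_primal(mu_c_t, mu_i_t, theta_c, theta_i, tau)
    return gap <= tol
```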

88 This could partially explain the excellent performance of the greedy scheme when compared to FISTA (see Section 6). [sent-335, score-0.192]

89 Our analysis also highlights the advantage of using greedy block choice for MAP problems. [sent-336, score-0.336]

90 The advantage comes from the fact that the choice of which block to update is quite efficient, since its cost is of the same order as the other computations required by the algorithm. [sent-337, score-0.233]

91 How should one choose between different block updates (e. [sent-340, score-0.198]

92 An alternative commonly used progress criterion is to decode an integral solution from the dual variables, and see if its value is close to the dual upper bound. [sent-349, score-0.694]

93 Fast and smooth: Accelerated dual decomposition for MAP inference. [sent-404, score-0.313]

94 An alternating direction method for dual map lp relaxation. [sent-434, score-0.506]

95 Efficiency of coordinate descent methods on huge-scale optimization problems. [sent-457, score-0.315]

96 On the finite time convergence of cyclic coordinate descent methods, 2010. [sent-462, score-0.429]

97 A study of Nesterov’s scheme for lagrangian decomposition and map labeling. [sent-470, score-0.18]

98 Efficient mrf energy minimization via adaptive diminishing smoothing. [sent-478, score-0.239]

99 Convergence of a block coordinate descent method for nondifferentiable minimization 1. [sent-498, score-0.653]

100 How to compute primal solution from dual one in MAP inference in MRF? [sent-523, score-0.742]


similar papers computed by tfidf model

tfidf for this paper:

wordName wordTfidf (topN-words)

[('primal', 0.427), ('xc', 0.404), ('dual', 0.283), ('ap', 0.247), ('coordinate', 0.245), ('minimization', 0.169), ('block', 0.169), ('greedy', 0.167), ('si', 0.152), ('lp', 0.141), ('star', 0.137), ('xi', 0.122), ('ml', 0.118), ('fista', 0.11), ('dm', 0.11), ('meshi', 0.088), ('ci', 0.086), ('feasible', 0.084), ('rates', 0.083), ('map', 0.082), ('smoothed', 0.078), ('ni', 0.073), ('msd', 0.073), ('descent', 0.07), ('fractional', 0.068), ('convergence', 0.067), ('gradient', 0.064), ('update', 0.064), ('accelerated', 0.062), ('lipschitz', 0.058), ('optimum', 0.058), ('relaxations', 0.058), ('variables', 0.057), ('marginals', 0.055), ('coordinates', 0.05), ('message', 0.049), ('shall', 0.049), ('decode', 0.049), ('stochastic', 0.049), ('lagrangean', 0.049), ('talg', 0.049), ('cyclic', 0.047), ('integral', 0.047), ('globerson', 0.044), ('rate', 0.044), ('schemes', 0.044), ('lagrangian', 0.043), ('kappes', 0.043), ('ldpc', 0.043), ('mplp', 0.043), ('savchynskyy', 0.043), ('decoded', 0.042), ('propagation', 0.042), ('nesterov', 0.04), ('improvement', 0.04), ('ofer', 0.04), ('hmax', 0.04), ('mrf', 0.039), ('protein', 0.039), ('runtime', 0.037), ('smooth', 0.036), ('passing', 0.036), ('programming', 0.036), ('belief', 0.036), ('scheduling', 0.035), ('jaakkola', 0.035), ('proteins', 0.034), ('yanover', 0.034), ('iterates', 0.033), ('tommi', 0.033), ('smoothing', 0.032), ('objective', 0.032), ('solution', 0.032), ('sontag', 0.031), ('priority', 0.031), ('energy', 0.031), ('decomposition', 0.03), ('fi', 0.029), ('updates', 0.029), ('residual', 0.029), ('blocks', 0.028), ('factors', 0.028), ('guaranteed', 0.028), ('analyze', 0.027), ('polytope', 0.027), ('schmidt', 0.027), ('iterations', 0.026), ('singleton', 0.026), ('subsets', 0.026), ('translate', 0.026), ('facilitates', 0.026), ('sm', 0.026), ('scheme', 0.025), ('convergent', 0.024), ('assignment', 0.024), ('supplementary', 0.024), ('solvers', 0.023), ('converge', 0.023), ('considerable', 0.023), ('non', 0.023)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.99999958 84 nips-2012-Convergence Rate Analysis of MAP Coordinate Minimization Algorithms

Author: Ofer Meshi, Amir Globerson, Tommi S. Jaakkola

Abstract: Finding maximum a posteriori (MAP) assignments in graphical models is an important task in many applications. Since the problem is generally hard, linear programming (LP) relaxations are often used. Solving these relaxations efficiently is thus an important practical problem. In recent years, several authors have proposed message passing updates corresponding to coordinate descent in the dual LP. However, these are generally not guaranteed to converge to a global optimum. One approach to remedy this is to smooth the LP, and perform coordinate descent on the smoothed dual. However, little is known about the convergence rate of this procedure. Here we perform a thorough rate analysis of such schemes and derive primal and dual convergence rates. We also provide a simple dual to primal mapping that yields feasible primal solutions with a guaranteed rate of convergence. Empirical evaluation supports our theoretical claims and shows that the method is highly competitive with state of the art approaches that yield global optima. 1

2 0.40418336 143 nips-2012-Globally Convergent Dual MAP LP Relaxation Solvers using Fenchel-Young Margins

Author: Alex Schwing, Tamir Hazan, Marc Pollefeys, Raquel Urtasun

Abstract: While finding the exact solution for the MAP inference problem is intractable for many real-world tasks, MAP LP relaxations have been shown to be very effective in practice. However, the most efficient methods that perform block coordinate descent can get stuck in sub-optimal points as they are not globally convergent. In this work we propose to augment these algorithms with an ε-descent approach and present a method to efficiently optimize for a descent direction in the subdifferential using a margin-based formulation of the Fenchel-Young duality theorem. Furthermore, the presented approach provides a methodology to construct a primal optimal solution from its dual optimal counterpart. We demonstrate the efficiency of the presented approach on spin glass models and protein interaction problems and show that our approach outperforms state-of-the-art solvers. 1

3 0.22662826 204 nips-2012-MAP Inference in Chains using Column Generation

Author: David Belanger, Alexandre Passos, Sebastian Riedel, Andrew McCallum

Abstract: Linear chains and trees are basic building blocks in many applications of graphical models, and they admit simple exact maximum a-posteriori (MAP) inference algorithms based on message passing. However, in many cases this computation is prohibitively expensive, due to quadratic dependence on variables’ domain sizes. The standard algorithms are inefficient because they compute scores for hypotheses for which there is strong negative local evidence. For this reason there has been significant previous interest in beam search and its variants; however, these methods provide only approximate results. This paper presents new exact inference algorithms based on the combination of column generation and pre-computed bounds on terms of the model’s scoring function. While we do not improve worst-case performance, our method substantially speeds real-world, typical-case inference in chains and trees. Experiments show our method to be twice as fast as exact Viterbi for Wall Street Journal part-of-speech tagging and over thirteen times faster for a joint part-of-speed and named-entity-recognition task. Our algorithm is also extendable to new techniques for approximate inference, to faster 0/1 loss oracles, and new opportunities for connections between inference and learning. We encourage further exploration of high-level reasoning about the optimization problem implicit in dynamic programs. 1

4 0.19181187 164 nips-2012-Iterative Thresholding Algorithm for Sparse Inverse Covariance Estimation

Author: Benjamin Rolfs, Bala Rajaratnam, Dominique Guillot, Ian Wong, Arian Maleki

Abstract: The 1 -regularized maximum likelihood estimation problem has recently become a topic of great interest within the machine learning, statistics, and optimization communities as a method for producing sparse inverse covariance estimators. In this paper, a proximal gradient method (G-ISTA) for performing 1 -regularized covariance matrix estimation is presented. Although numerous algorithms have been proposed for solving this problem, this simple proximal gradient method is found to have attractive theoretical and numerical properties. G-ISTA has a linear rate of convergence, resulting in an O(log ε) iteration complexity to reach a tolerance of ε. This paper gives eigenvalue bounds for the G-ISTA iterates, providing a closed-form linear convergence rate. The rate is shown to be closely related to the condition number of the optimal point. Numerical convergence results and timing comparisons for the proposed method are presented. G-ISTA is shown to perform very well, especially when the optimal point is well-conditioned. 1

5 0.15894613 131 nips-2012-Feature Clustering for Accelerating Parallel Coordinate Descent

Author: Chad Scherrer, Ambuj Tewari, Mahantesh Halappanavar, David Haglin

Abstract: Large-scale `1 -regularized loss minimization problems arise in high-dimensional applications such as compressed sensing and high-dimensional supervised learning, including classification and regression problems. High-performance algorithms and implementations are critical to efficiently solving these problems. Building upon previous work on coordinate descent algorithms for `1 -regularized problems, we introduce a novel family of algorithms called block-greedy coordinate descent that includes, as special cases, several existing algorithms such as SCD, Greedy CD, Shotgun, and Thread-Greedy. We give a unified convergence analysis for the family of block-greedy algorithms. The analysis suggests that block-greedy coordinate descent can better exploit parallelism if features are clustered so that the maximum inner product between features in different blocks is small. Our theoretical convergence analysis is supported with experimental results using data from diverse real-world applications. We hope that algorithmic approaches and convergence analysis we provide will not only advance the field, but will also encourage researchers to systematically explore the design space of algorithms for solving large-scale `1 -regularization problems. 1

6 0.12625009 325 nips-2012-Stochastic optimization and sparse statistical recovery: Optimal algorithms for high dimensions

7 0.12353658 186 nips-2012-Learning as MAP Inference in Discrete Graphical Models

8 0.11850668 344 nips-2012-Timely Object Recognition

9 0.11821128 6 nips-2012-A Convex Formulation for Learning Scale-Free Networks via Submodular Relaxation

10 0.11773229 105 nips-2012-Dynamic Pruning of Factor Graphs for Maximum Marginal Prediction

11 0.10178432 324 nips-2012-Stochastic Gradient Descent with Only One Projection

12 0.10127676 134 nips-2012-Finite Sample Convergence Rates of Zero-Order Stochastic Optimization Methods

13 0.098447874 326 nips-2012-Structure estimation for discrete graphical models: Generalized covariance matrices and their inverses

14 0.096554741 82 nips-2012-Continuous Relaxations for Discrete Hamiltonian Monte Carlo

15 0.087624572 236 nips-2012-Near-Optimal MAP Inference for Determinantal Point Processes

16 0.084591642 15 nips-2012-A Polylog Pivot Steps Simplex Algorithm for Classification

17 0.0812774 96 nips-2012-Density Propagation and Improved Bounds on the Partition Function

18 0.078835636 7 nips-2012-A Divide-and-Conquer Method for Sparse Inverse Covariance Estimation

19 0.07617496 316 nips-2012-Small-Variance Asymptotics for Exponential Family Dirichlet Process Mixture Models

20 0.070476837 262 nips-2012-Optimal Neural Tuning Curves for Arbitrary Stimulus Distributions: Discrimax, Infomax and Minimum $L p$ Loss


similar papers computed by lsi model

lsi for this paper:

topicId topicWeight

[(0, 0.22), (1, 0.014), (2, 0.112), (3, -0.118), (4, 0.043), (5, 0.076), (6, -0.035), (7, -0.127), (8, -0.032), (9, -0.019), (10, -0.012), (11, 0.011), (12, 0.049), (13, 0.186), (14, 0.106), (15, 0.083), (16, -0.24), (17, -0.192), (18, 0.006), (19, 0.098), (20, 0.231), (21, -0.033), (22, 0.036), (23, 0.03), (24, -0.149), (25, -0.044), (26, 0.051), (27, -0.006), (28, 0.121), (29, 0.195), (30, 0.145), (31, -0.051), (32, -0.032), (33, 0.021), (34, -0.044), (35, -0.12), (36, -0.071), (37, 0.031), (38, -0.004), (39, -0.011), (40, 0.012), (41, 0.064), (42, 0.007), (43, 0.082), (44, 0.019), (45, 0.071), (46, 0.016), (47, -0.01), (48, -0.024), (49, -0.003)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.97862494 84 nips-2012-Convergence Rate Analysis of MAP Coordinate Minimization Algorithms

Author: Ofer Meshi, Amir Globerson, Tommi S. Jaakkola

Abstract: Finding maximum a posteriori (MAP) assignments in graphical models is an important task in many applications. Since the problem is generally hard, linear programming (LP) relaxations are often used. Solving these relaxations efficiently is thus an important practical problem. In recent years, several authors have proposed message passing updates corresponding to coordinate descent in the dual LP. However, these are generally not guaranteed to converge to a global optimum. One approach to remedy this is to smooth the LP, and perform coordinate descent on the smoothed dual. However, little is known about the convergence rate of this procedure. Here we perform a thorough rate analysis of such schemes and derive primal and dual convergence rates. We also provide a simple dual to primal mapping that yields feasible primal solutions with a guaranteed rate of convergence. Empirical evaluation supports our theoretical claims and shows that the method is highly competitive with state of the art approaches that yield global optima. 1

2 0.96679455 143 nips-2012-Globally Convergent Dual MAP LP Relaxation Solvers using Fenchel-Young Margins

Author: Alex Schwing, Tamir Hazan, Marc Pollefeys, Raquel Urtasun

Abstract: While finding the exact solution for the MAP inference problem is intractable for many real-world tasks, MAP LP relaxations have been shown to be very effective in practice. However, the most efficient methods that perform block coordinate descent can get stuck in sub-optimal points as they are not globally convergent. In this work we propose to augment these algorithms with an ε-descent approach and present a method to efficiently optimize for a descent direction in the subdifferential using a margin-based formulation of the Fenchel-Young duality theorem. Furthermore, the presented approach provides a methodology to construct a primal optimal solution from its dual optimal counterpart. We demonstrate the efficiency of the presented approach on spin glass models and protein interaction problems and show that our approach outperforms state-of-the-art solvers. 1

3 0.87812579 204 nips-2012-MAP Inference in Chains using Column Generation

Author: David Belanger, Alexandre Passos, Sebastian Riedel, Andrew McCallum

Abstract: Linear chains and trees are basic building blocks in many applications of graphical models, and they admit simple exact maximum a-posteriori (MAP) inference algorithms based on message passing. However, in many cases this computation is prohibitively expensive, due to quadratic dependence on variables’ domain sizes. The standard algorithms are inefficient because they compute scores for hypotheses for which there is strong negative local evidence. For this reason there has been significant previous interest in beam search and its variants; however, these methods provide only approximate results. This paper presents new exact inference algorithms based on the combination of column generation and pre-computed bounds on terms of the model’s scoring function. While we do not improve worst-case performance, our method substantially speeds real-world, typical-case inference in chains and trees. Experiments show our method to be twice as fast as exact Viterbi for Wall Street Journal part-of-speech tagging and over thirteen times faster for a joint part-of-speed and named-entity-recognition task. Our algorithm is also extendable to new techniques for approximate inference, to faster 0/1 loss oracles, and new opportunities for connections between inference and learning. We encourage further exploration of high-level reasoning about the optimization problem implicit in dynamic programs. 1

4 0.60761189 131 nips-2012-Feature Clustering for Accelerating Parallel Coordinate Descent

Author: Chad Scherrer, Ambuj Tewari, Mahantesh Halappanavar, David Haglin

Abstract: Large-scale `1 -regularized loss minimization problems arise in high-dimensional applications such as compressed sensing and high-dimensional supervised learning, including classification and regression problems. High-performance algorithms and implementations are critical to efficiently solving these problems. Building upon previous work on coordinate descent algorithms for `1 -regularized problems, we introduce a novel family of algorithms called block-greedy coordinate descent that includes, as special cases, several existing algorithms such as SCD, Greedy CD, Shotgun, and Thread-Greedy. We give a unified convergence analysis for the family of block-greedy algorithms. The analysis suggests that block-greedy coordinate descent can better exploit parallelism if features are clustered so that the maximum inner product between features in different blocks is small. Our theoretical convergence analysis is supported with experimental results using data from diverse real-world applications. We hope that algorithmic approaches and convergence analysis we provide will not only advance the field, but will also encourage researchers to systematically explore the design space of algorithms for solving large-scale `1 -regularization problems. 1

5 0.54505992 302 nips-2012-Scaling MPE Inference for Constrained Continuous Markov Random Fields with Consensus Optimization

Author: Stephen Bach, Matthias Broecheler, Lise Getoor, Dianne O'leary

Abstract: Probabilistic graphical models are powerful tools for analyzing constrained, continuous domains. However, finding most-probable explanations (MPEs) in these models can be computationally expensive. In this paper, we improve the scalability of MPE inference in a class of graphical models with piecewise-linear and piecewise-quadratic dependencies and linear constraints over continuous domains. We derive algorithms based on a consensus-optimization framework and demonstrate their superior performance over state of the art. We show empirically that in a large-scale voter-preference modeling problem our algorithms scale linearly in the number of dependencies and constraints. 1

6 0.53643125 164 nips-2012-Iterative Thresholding Algorithm for Sparse Inverse Covariance Estimation

7 0.51948434 263 nips-2012-Optimal Regularized Dual Averaging Methods for Stochastic Optimization

8 0.49718207 6 nips-2012-A Convex Formulation for Learning Scale-Free Networks via Submodular Relaxation

9 0.47986442 96 nips-2012-Density Propagation and Improved Bounds on the Partition Function

10 0.47159478 230 nips-2012-Multiple Choice Learning: Learning to Produce Multiple Structured Outputs

11 0.46640557 248 nips-2012-Nonparanormal Belief Propagation (NPNBP)

12 0.45578489 325 nips-2012-Stochastic optimization and sparse statistical recovery: Optimal algorithms for high dimensions

13 0.45415896 82 nips-2012-Continuous Relaxations for Discrete Hamiltonian Monte Carlo

14 0.44111645 85 nips-2012-Convergence and Energy Landscape for Cheeger Cut Clustering

15 0.43113476 186 nips-2012-Learning as MAP Inference in Discrete Graphical Models

16 0.43054512 134 nips-2012-Finite Sample Convergence Rates of Zero-Order Stochastic Optimization Methods

17 0.42041034 20 nips-2012-A Stochastic Gradient Method with an Exponential Convergence Rate for Finite Training Sets

18 0.41804728 214 nips-2012-Minimizing Sparse High-Order Energies by Submodular Vertex-Cover

19 0.41363198 290 nips-2012-Recovery of Sparse Probability Measures via Convex Programming

20 0.40716031 240 nips-2012-Newton-Like Methods for Sparse Inverse Covariance Estimation


similar papers computed by lda model

lda for this paper:

topicId topicWeight

[(0, 0.02), (21, 0.018), (38, 0.581), (42, 0.027), (54, 0.014), (74, 0.027), (76, 0.1), (80, 0.075), (92, 0.031)]

similar papers list:

simIndex simValue paperId paperTitle

1 0.99525881 262 nips-2012-Optimal Neural Tuning Curves for Arbitrary Stimulus Distributions: Discrimax, Infomax and Minimum $L p$ Loss

Author: Zhuo Wang, Alan Stocker, Daniel Lee

Abstract: In this work we study how the stimulus distribution influences the optimal coding of an individual neuron. Closed-form solutions to the optimal sigmoidal tuning curve are provided for a neuron obeying Poisson statistics under a given stimulus distribution. We consider a variety of optimality criteria, including maximizing discriminability, maximizing mutual information and minimizing estimation error under a general Lp norm. We generalize the Cramer-Rao lower bound and show how the Lp loss can be written as a functional of the Fisher Information in the asymptotic limit, by proving the moment convergence of certain functions of Poisson random variables. In this manner, we show how the optimal tuning curve depends upon the loss function, and the equivalence of maximizing mutual information with minimizing Lp loss in the limit as p goes to zero. 1

2 0.98356402 29 nips-2012-Accelerated Training for Matrix-norm Regularization: A Boosting Approach

Author: Xinhua Zhang, Dale Schuurmans, Yao-liang Yu

Abstract: Sparse learning models typically combine a smooth loss with a nonsmooth penalty, such as trace norm. Although recent developments in sparse approximation have offered promising solution methods, current approaches either apply only to matrix-norm constrained problems or provide suboptimal convergence rates. In this paper, we propose a boosting method for regularized learning that guarantees accuracy within O(1/ ) iterations. Performance is further accelerated by interlacing boosting with fixed-rank local optimization—exploiting a simpler local objective than previous work. The proposed method yields state-of-the-art performance on large-scale problems. We also demonstrate an application to latent multiview learning for which we provide the first efficient weak-oracle. 1

same-paper 3 0.98235106 84 nips-2012-Convergence Rate Analysis of MAP Coordinate Minimization Algorithms

Author: Ofer Meshi, Amir Globerson, Tommi S. Jaakkola

Abstract: Finding maximum a posteriori (MAP) assignments in graphical models is an important task in many applications. Since the problem is generally hard, linear programming (LP) relaxations are often used. Solving these relaxations efficiently is thus an important practical problem. In recent years, several authors have proposed message passing updates corresponding to coordinate descent in the dual LP. However, these are generally not guaranteed to converge to a global optimum. One approach to remedy this is to smooth the LP, and perform coordinate descent on the smoothed dual. However, little is known about the convergence rate of this procedure. Here we perform a thorough rate analysis of such schemes and derive primal and dual convergence rates. We also provide a simple dual to primal mapping that yields feasible primal solutions with a guaranteed rate of convergence. Empirical evaluation supports our theoretical claims and shows that the method is highly competitive with state of the art approaches that yield global optima. 1

4 0.9752667 165 nips-2012-Iterative ranking from pair-wise comparisons

Author: Sahand Negahban, Sewoong Oh, Devavrat Shah

Abstract: The question of aggregating pairwise comparisons to obtain a global ranking over a collection of objects has been of interest for a very long time: be it ranking of online gamers (e.g. MSR’s TrueSkill system) and chess players, aggregating social opinions, or deciding which product to sell based on transactions. In most settings, in addition to obtaining ranking, finding ‘scores’ for each object (e.g. player’s rating) is of interest to understanding the intensity of the preferences. In this paper, we propose a novel iterative rank aggregation algorithm for discovering scores for objects from pairwise comparisons. The algorithm has a natural random walk interpretation over the graph of objects with edges present between two objects if they are compared; the scores turn out to be the stationary probability of this random walk. The algorithm is model independent. To establish the efficacy of our method, however, we consider the popular Bradley-Terry-Luce (BTL) model in which each object has an associated score which determines the probabilistic outcomes of pairwise comparisons between objects. We bound the finite sample error rates between the scores assumed by the BTL model and those estimated by our algorithm. This, in essence, leads to order-optimal dependence on the number of samples required to learn the scores well by our algorithm. Indeed, the experimental evaluation shows that our (model independent) algorithm performs as well as the Maximum Likelihood Estimator of the BTL model and outperforms a recently proposed algorithm by Ammar and Shah [1]. 1

5 0.95173353 278 nips-2012-Probabilistic n-Choose-k Models for Classification and Ranking

Author: Kevin Swersky, Brendan J. Frey, Daniel Tarlow, Richard S. Zemel, Ryan P. Adams

Abstract: In categorical data there is often structure in the number of variables that take on each label. For example, the total number of objects in an image and the number of highly relevant documents per query in web search both tend to follow a structured distribution. In this paper, we study a probabilistic model that explicitly includes a prior distribution over such counts, along with a count-conditional likelihood that defines probabilities over all subsets of a given size. When labels are binary and the prior over counts is a Poisson-Binomial distribution, a standard logistic regression model is recovered, but for other count distributions, such priors induce global dependencies and combinatorics that appear to complicate learning and inference. However, we demonstrate that simple, efficient learning procedures can be derived for more general forms of this model. We illustrate the utility of the formulation by exploring applications to multi-object classification, learning to rank, and top-K classification. 1

6 0.95098013 181 nips-2012-Learning Multiple Tasks using Shared Hypotheses

7 0.95003355 268 nips-2012-Perfect Dimensionality Recovery by Variational Bayesian PCA

8 0.93648052 240 nips-2012-Newton-Like Methods for Sparse Inverse Covariance Estimation

9 0.9335925 143 nips-2012-Globally Convergent Dual MAP LP Relaxation Solvers using Fenchel-Young Margins

10 0.92857921 208 nips-2012-Matrix reconstruction with the local max norm

11 0.88796139 285 nips-2012-Query Complexity of Derivative-Free Optimization

12 0.88167715 241 nips-2012-No-Regret Algorithms for Unconstrained Online Convex Optimization

13 0.87896979 86 nips-2012-Convex Multi-view Subspace Learning

14 0.87383926 15 nips-2012-A Polylog Pivot Steps Simplex Algorithm for Classification

15 0.86324966 324 nips-2012-Stochastic Gradient Descent with Only One Projection

16 0.8625527 17 nips-2012-A Scalable CUR Matrix Decomposition Algorithm: Lower Time Complexity and Tighter Bound

17 0.86173475 30 nips-2012-Accuracy at the Top

18 0.85555553 187 nips-2012-Learning curves for multi-task Gaussian process regression

19 0.85518855 120 nips-2012-Exact and Stable Recovery of Sequences of Signals with Sparse Increments via Differential 1-Minimization

20 0.85235411 178 nips-2012-Learning Label Trees for Probabilistic Modelling of Implicit Feedback