nips nips2012 nips2012-204 knowledge-graph by maker-knowledge-mining

204 nips-2012-MAP Inference in Chains using Column Generation


Source: pdf

Author: David Belanger, Alexandre Passos, Sebastian Riedel, Andrew McCallum

Abstract: Linear chains and trees are basic building blocks in many applications of graphical models, and they admit simple exact maximum a-posteriori (MAP) inference algorithms based on message passing. However, in many cases this computation is prohibitively expensive, due to quadratic dependence on variables’ domain sizes. The standard algorithms are inefficient because they compute scores for hypotheses for which there is strong negative local evidence. For this reason there has been significant previous interest in beam search and its variants; however, these methods provide only approximate results. This paper presents new exact inference algorithms based on the combination of column generation and pre-computed bounds on terms of the model’s scoring function. While we do not improve worst-case performance, our method substantially speeds real-world, typical-case inference in chains and trees. Experiments show our method to be twice as fast as exact Viterbi for Wall Street Journal part-of-speech tagging and over thirteen times faster for a joint part-of-speech and named-entity-recognition task. Our algorithm is also extendable to new techniques for approximate inference, to faster 0/1 loss oracles, and new opportunities for connections between inference and learning. We encourage further exploration of high-level reasoning about the optimization problem implicit in dynamic programs. 1

Reference: text


Summary: the most important sentences generated by tfidf model

sentIndex sentText sentNum sentScore

1 Abstract Linear chains and trees are basic building blocks in many applications of graphical models, and they admit simple exact maximum a-posteriori (MAP) inference algorithms based on message passing. [sent-7, score-0.443]

2 For this reason there has been significant previous interest in beam search and its variants; however, these methods provide only approximate results. [sent-10, score-0.213]

3 This paper presents new exact inference algorithms based on the combination of column generation and pre-computed bounds on terms of the model’s scoring function. [sent-11, score-0.503]

4 While we do not improve worst-case performance, our method substantially speeds real-world, typical-case inference in chains and trees. [sent-12, score-0.266]

5 Experiments show our method to be twice as fast as exact Viterbi for Wall Street Journal part-of-speech tagging and over thirteen times faster for a joint part-of-speech and named-entity-recognition task. [sent-13, score-0.302]

6 Our algorithm is also extendable to new techniques for approximate inference, to faster 0/1 loss oracles, and new opportunities for connections between inference and learning. [sent-14, score-0.187]

7 1 Introduction Many uses of graphical models either directly employ chains or tree structures—as in part-of-speech tagging—or employ them to enable inference in more complex models—as in junction trees and tree block coordinate descent [1]. [sent-16, score-0.474]

8 Traditional message-passing inference in these structures requires an amount of computation dependent on the product of the domain sizes of variables sharing an edge in the graph. [sent-17, score-0.296]

9 Even in chains, exact inference is prohibitive in tasks with large domains due to the quadratic dependence on domain size. [sent-18, score-0.268]

10 For this reason, many practitioners rely on beam search or other approximate inference techniques [2]. [sent-19, score-0.358]

11 We present a new algorithm for exact MAP inference in chains that is substantially faster than Viterbi in the typical case. [sent-22, score-0.391]

12 The combination of these ideas yields provably exact MAP inference for chains and trees that can be more than an order of magnitude faster than traditional methods. [sent-26, score-0.439]

13 Our algorithm has wide-ranging applicability, and we believe it could beneficially replace many traditional uses of Viterbi and beam search. [sent-27, score-0.163]

14 The use of column generation has not been widely explored or appreciated in graphical models. [sent-34, score-0.395]

15 5 times faster than Viterbi, and also faster than beam search with a width of two. [sent-38, score-0.355]

16 In joint POS tagging and named entity recognition, our method is thirteen times faster than Viterbi and also faster than beam search with a width of seven. [sent-39, score-0.532]

17 Column generation is a technique for exploiting this sparsity for faster inference. [sent-44, score-0.301]

18 It restricts an LP to a subset of its variables (implicitly setting the others to zero) and alternates between solving this restricted linear program and selecting which variables should be added to it, based on whether they could potentially improve the objective. [sent-45, score-0.243]

19 When no candidates remain, the current solution to the restricted problem is guaranteed to be the exact solution of the unrestricted problem. [sent-46, score-0.215]

20 The test to determine whether un-generated variables could potentially improve the objective is whether their reduced cost is positive, which is also the test employed by some pivoting rules in the simplex algorithm [8, 7]. [sent-47, score-0.318]

21 The difference between the algorithms is that simplex enumerates primal variables explicitly, while column generation “generates” them only as needed. [sent-48, score-0.479]

22 The key to an efficient column generation algorithm is an oracle that can either prove that no variable with positive reduced cost exists or produce one. [sent-49, score-0.69]
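To make the loop just described concrete, here is a minimal Python sketch; the routine names solve_restricted_lp and reduced_cost_oracle are hypothetical stand-ins for the two components discussed above, not an interface from the paper.

    def column_generation(all_vars, initial_vars, solve_restricted_lp, reduced_cost_oracle):
        """Generic column-generation loop: alternate between solving a restricted LP
        and adding variables whose reduced cost is positive (for a maximization LP)."""
        active = set(initial_vars)                       # variables allowed to be nonzero
        while True:
            primal, duals = solve_restricted_lp(active)  # solve the LP over `active` only
            # Ask the oracle for un-generated variables that could improve the objective.
            candidates = reduced_cost_oracle(duals, all_vars - active)
            if not candidates:
                # Duals are feasible for the full problem: a primal/dual optimal pair.
                return primal, duals
            active |= set(candidates)                    # grow the restricted problem

Because the restricted solution stays primal feasible throughout, the loop can also be stopped early to return an approximate, anytime answer.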

23 max c^T x s.t. Ax ≤ b, x ≥ 0 (1). With corresponding Lagrangian: L(x, λ) = c^T x + λ^T (b − Ax) = Σi (ci − Ai^T λ) xi + λ^T b. (2) [sent-53, score-0.428]

24 For a given assignment to the dual variables λ, a variable xi is a candidate for being added to the restricted problem if its reduced cost ri = ci − Ai^T λ, the scalar multiplying it in the Lagrangian, is positive. [sent-54, score-0.973]

25 The dual is min λ^T b s.t. A^T λ ≥ c, λ ≥ 0 (3). Here, the reduced cost of a primal variable equals the degree to which its dual constraint is violated, and thus column generation in the primal is equivalent to cutting planes in the dual [7]. [sent-58, score-1.094]

26 If there is no variable of positive reduced cost, then the current dual variables from the restricted problem are feasible in the unrestricted problem, and thus we have a primal-dual optimal pair, and can terminate column generation. [sent-59, score-0.622]

27 An advantageous property of column generation that we employ later on is that it maintains primal feasibility across iterations, and thus it can be halted to provide approximate, anytime solutions. [sent-60, score-0.493]

28 2 3 Connection Between LP Relaxations and Message-Passing in Chains This section provides background showing how the LP formulation of the inference problem in chains leads to the known message-passing algorithm. [sent-61, score-0.266]

29 The LP for MAP inference in chains is as follows (a standard form is sketched below). [sent-63, score-0.266]
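The program itself is cut off in this extract; a standard pairwise-marginal form of the chain MAP LP (an assumed reconstruction, not necessarily the paper's exact equation (4)) is:

    \begin{aligned}
    \max_{\mu \ge 0}\;\; & \sum_{i,\,x_i} \mu_i(x_i)\,\theta_i(x_i)
      + \sum_{i,\,x_i,\,x_{i+1}} \mu_i(x_i, x_{i+1})\,\tau_i(x_i, x_{i+1}) \\
    \text{s.t.}\;\; & \sum_{x_i} \mu_i(x_i) = 1 \;\;\text{for all } i, \qquad
      \sum_{x_{i+1}} \mu_i(x_i, x_{i+1}) = \mu_i(x_i), \qquad
      \sum_{x_i} \mu_i(x_i, x_{i+1}) = \mu_{i+1}(x_{i+1}).
    \end{aligned}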

30 We can restructure this LP to only depend on the pairwise assignment variables µi (xi , xi+1 ) by creating an edge between the last variable in the chain and an artificial variable and then “billing” all local scores to the pairwise edge that touches them from the right. [sent-69, score-0.387]

31 This leaves the following LP (with dual variables written after their corresponding constraints). [sent-71, score-0.163]

32 max over µ of Σ_{i,xi,xi+1} µi(xi, xi+1) (τi(xi, xi+1) + θi(xi)) s.t. Σ_{xn} µn(xn, ·) = 1 (dual variable N), Σ_{xi−1} µi−1(xi−1, xi) = Σ_{xi+1} µi(xi, xi+1) (dual variables αi(xi)). (5) The dual of this linear program is a minimization over the dual variables N and αi(xi) (equation (6)). [sent-75, score-0.53]

33 A setting of the dual variables is optimal if maximization of the problem’s Lagrangian over the primal variables yields a primal-feasible setting. [sent-82, score-0.335]

34 The coefficients on the edge variables µi(xi, xi+1) are their reduced costs, αi(xi) − αi+1(xi+1) + θi(xi) + τ(xi, xi+1). (8) [sent-83, score-0.268]

35 For duals that obey the constraints of (6), it is clear that the maximal reduced cost is zero, when xi is set to the argmax used when constructing αi+1(xi+1). [sent-84, score-0.716]
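As a concrete illustration of the left-to-right recursion and the reduced cost in (8), the following numpy sketch assumes theta is a list of per-position local-score vectors and tau a list of per-edge transition matrices; the names and the convention of billing each θi(xi) to the edge on its right follow the description above, but are otherwise assumptions.

    import numpy as np

    def left_messages(theta, tau):
        """alpha[i+1][x_{i+1}] = max_{x_i} alpha[i][x_i] + theta[i][x_i] + tau[i][x_i, x_{i+1}]."""
        alpha = [np.zeros(len(theta[0]))]
        for i in range(len(tau)):
            alpha.append(np.max(alpha[i][:, None] + theta[i][:, None] + tau[i], axis=0))
        return alpha

    def reduced_cost_lr(alpha, theta, tau, i):
        """Matrix of reduced costs per (8) over all settings (x_i, x_{i+1}) of edge i."""
        return alpha[i][:, None] - alpha[i + 1][None, :] + theta[i][:, None] + tau[i]

For duals produced by this recursion, every entry returned by reduced_cost_lr is at most zero, with equality at the maximizing predecessor of each x_{i+1}.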

36 1 Improving the reduced cost with information from both ends of the chain Column generation adds all variables with positive reduced cost to the restricted LP, but equation (8) leads to an inefficient algorithm because it is positive for many irrelevant edge settings. [sent-87, score-0.951]

37 Therefore, even if there is very strong local evidence against a particular setting xi+1, pairs xi, xi+1 may have positive reduced cost if the global transition factor τ(xi, xi+1) places positive weight on their compatibility. [sent-90, score-0.654]

38 Instead, if we split it halfway (now using phantom edges on both sides of the chain), we would obtain slightly different message passing rules and the following reduced cost expression: αi(xi) − αi+1(xi+1) + (1/2)(θi(xi) + θi+1(xi+1)) + τ(xi, xi+1). (9) [sent-93, score-0.304]

39 This contains local information for both xi and xi+1, though it halves the magnitude of it. [sent-94, score-0.375]

40 In table 2 we demonstrate that this yields comparable performance to using the reduced cost of (8), which still outperforms Viterbi. [sent-95, score-0.232]

41 An even better reduced cost expression can be obtained by duplicating the marginalization constraints. [sent-96, score-0.232]

42 This redundancy is important, though, because of the resulting reduced cost: 2Ri(xi, xi+1) = 2τ(xi, xi+1) + θi(xi) + θi+1(xi+1) + (αi(xi) − αi+1(xi+1)) + (βi+1(xi+1) − βi(xi)). (12) [sent-101, score-0.232]

43 In table 2 we show that column generation with (12) is fastest, which is not obvious, given the extra overhead of computing the β messages. [sent-103, score-0.333]

44 This is the reduced cost that we use in the following discussion and experiments, unless explicitly stated otherwise. [sent-104, score-0.232]
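Continuing the conventions of the earlier numpy sketch, and treating beta as the analogous right-to-left messages (an assumption consistent with the text), the bidirectional reduced cost of (12) can be written as:

    def reduced_cost_bidirectional(alpha, beta, theta, tau, i):
        """2 * R_i(x_i, x_{i+1}) from (12): both endpoints contribute local evidence."""
        return (2.0 * tau[i]
                + theta[i][:, None] + theta[i + 1][None, :]
                + (alpha[i][:, None] - alpha[i + 1][None, :])
                + (beta[i + 1][None, :] - beta[i][:, None]))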

45 4 Column Generation Algorithm We present an algorithm for exact MAP inference that in practice is usually faster than traditional message passing. [sent-105, score-0.284]

46 Like all column generation algorithms, our technique requires components for three tasks: choosing the initial set of variables in the restricted LP, solving the restricted LP, and finding variables with positive reduced cost. [sent-106, score-0.841]

47 When no variable of positive reduced cost exists, the current solution to the restricted problem is optimal because we have a primal-feasible, dual-feasible pair. [sent-107, score-0.354]

48 1 Initialization To initialize the LP, we first define a restricted domain for each node in the graphical model consisting of only xi^L = argmax_{xi} θi(xi). [sent-112, score-0.229]

49 Other initialization strategies, such as adding the high-scoring transitions, or the k best xi, are also valid. [sent-113, score-0.375]
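A minimal sketch of this initialization, assuming theta[i] is the vector of local scores for node i (names are illustrative): k = 1 reproduces the single-argmax restricted domain, while larger k corresponds to the k-best variant mentioned above.

    import numpy as np

    def init_restricted_domains(theta, k=1):
        """Restrict each node's domain to its k highest-scoring labels under theta[i]."""
        return [list(np.argsort(-theta_i)[:k]) for theta_i in theta]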

50 2 Warm-Starting the Restricted LP Updating all messages using the max-product rules of equations (7) and (11) is a valid way to solve the restricted LP, but it doesn’t leverage the messages that were optimal for previous calls to the problem. [sent-117, score-0.321]

51 In practice, the restricted domains of every node are not updated at every iteration, and hence many of the previous messages may still appear in a dual-optimal setting of the current restricted problem. [sent-118, score-0.333]

52 As usual, solving the restricted LP can be decomposed into independently solving each of the one-directional LPs, and thus we update α independently of β. [sent-119, score-0.162]

53 In some regions of the chain, we can avoid updating messages because we can guarantee that the proposed message updates would yield the same maximization and thus the same primal setting. [sent-121, score-0.282]

54 Simple rules include, for example, avoiding updating α to the left of the first updated domain, and avoiding updating αi(∗) if |Di−1| = 1, since maximization over Di−1 is trivial. [sent-122, score-0.162]

55 Furthermore, to the right of the last updated domain, if we compute new messages αi(∗) and find that the argmax at the current MAP assignment xi∗ doesn’t change, we can revert to the previous αi(∗) and terminate message passing. [sent-123, score-0.215]

56 When solving the restricted LP, some constraints are trivially satisfied because they only involve variables that are implicitly set to zero, and hence the corresponding dual variables can be set arbitrarily. [sent-125, score-0.374]

57 To prevent extraneous un-generated variables from having a high reduced cost, we choose duals by guessing values that should be feasible in the unrestricted LP, with a smaller computational cost than solving the unrestricted LP directly. [sent-126, score-0.49]

58 We employ the same update equation used for the in-domain messages in (7) and (11), and maximize over the restricted domain of the variable’s neighbor. [sent-127, score-0.315]
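One plausible rendering of this guess for the left-to-right message, using the same array conventions as the earlier sketches (the function name and the restriction to α alone are assumptions):

    def guess_alpha_out_of_domain(alpha_i, theta_i, tau_i, restricted_domain_i):
        """Guess alpha_{i+1}(x_{i+1}) for un-generated labels by applying the usual
        max-product update, but maximizing only over node i's restricted domain."""
        dom = list(restricted_domain_i)
        return (alpha_i[dom][:, None] + theta_i[dom][:, None] + tau_i[dom, :]).max(axis=0)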

59 3 Reduced-Cost Oracle Exhaustively searching the chain for variables of positive reduced cost by iterating over all settings of all edges would be as expensive as exact max-product message-passing. [sent-130, score-0.403]

60 However, our oracle search strategy is efficient because it prunes these away using precomputed bounds on the transitions. [sent-131, score-0.257]

61 If, in practice, most settings for each edge have negative reduced cost, we can efficiently find candidate settings by first upper-bounding Si(xi) + 2τ(xi, xi+1), finding all possible values xi+1 that could yield a positive reduced cost, and then doing the reverse. [sent-133, score-0.354]

62 Finally, we search over the much smaller set of candidates for xi and xi+1 . [sent-134, score-0.425]
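The following sketch illustrates the pruning idea (not the paper's exact oracle): precomputed row and column maxima of the transition matrix bound the reduced cost of (12), so only a small candidate set of labels on each side of the edge needs to be checked exhaustively. Array conventions follow the earlier sketches.

    import numpy as np

    def oracle_candidates(alpha, beta, theta, tau, i, row_bound=None, col_bound=None):
        """Return edge settings (x_i, x_{i+1}) whose reduced cost 2*R_i could be positive."""
        if row_bound is None:
            row_bound = tau[i].max(axis=1)   # max over x_{i+1} of tau(x_i, .); precomputable
        if col_bound is None:
            col_bound = tau[i].max(axis=0)   # max over x_i of tau(., x_{i+1}); precomputable
        a = theta[i] + alpha[i] - beta[i]                # terms depending only on x_i
        b = theta[i + 1] - alpha[i + 1] + beta[i + 1]    # terms depending only on x_{i+1}
        # Keep x_i values that could still be positive for *some* x_{i+1}, and vice versa.
        cand_i = np.nonzero(a + b.max() + 2.0 * row_bound > 0)[0]
        cand_j = np.nonzero(b + a.max() + 2.0 * col_bound > 0)[0]
        # Exhaustively check only the (usually tiny) candidate product set.
        return [(int(xi), int(xj)) for xi in cand_i for xj in cand_j
                if a[xi] + b[xj] + 2.0 * tau[i][xi, xj] > 0]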

63 After the first round of column generation, if Ri (xi , xi+1 ) hasn’t changed for every xi , xi+1 , then no variables of positive reduced cost can exist because they would have been added in the previous iteration, and we can skip the oracle. [sent-136, score-0.793]

64 Lastly, a final pruning strategy is that if there are settings xi, xi′ such that θi(xi) + min_{xi−1} τ(xi−1, xi) + min_{xi+1} τ(xi, xi+1) > θi(xi′) + max_{xi−1} τ(xi−1, xi′) + max_{xi+1} τ(xi′, xi+1), (14) then we know with certainty that setting xi′ is suboptimal. [sent-138, score-2.002]

65 We can do so by first linearly searching through the labels for a node for the one with highest local score and then using precomputed bounds on the transition scores to linearly discard states whose upper bound on the score is smaller than the lower bound of the best state. [sent-140, score-0.224]
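A sketch of this per-node pruning check, corresponding to the bound in (14), for an interior node with incoming and outgoing transition matrices; the function name and the choice to anchor the lower bound at the highest-local-score label follow the sentence above, but the details are assumptions.

    import numpy as np

    def prune_domain(theta_i, tau_in, tau_out):
        """Keep labels of node i whose optimistic score can reach the pessimistic
        score of the label with the highest local score (cf. equation (14))."""
        ub = theta_i + tau_in.max(axis=0) + tau_out.max(axis=1)  # upper bound per label
        lb = theta_i + tau_in.min(axis=0) + tau_out.min(axis=1)  # lower bound per label
        best_lb = lb[np.argmax(theta_i)]
        return np.nonzero(ub >= best_lb)[0]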

66 First of all, our algorithm generalizes easily to MAP inference in trees by using a similar structure but a different reduced cost expression that considers messages flowing in both directions across each edge (appendix A. [sent-143, score-0.562]

67 The reduced cost oracle can also be used to compute the duality gap of an approximate solution. [sent-145, score-0.399]

68 This allows early stopping of our algorithm if the gap is small and also provides analysis of the sub-optimality of the output of beam search (appendix A. [sent-146, score-0.244]

69 Furthermore, margin violation queries when doing structured SVM training with a 0/1 loss can be done efficiently using a small modification of our algorithm, in which we also add variables of small negative reduced cost and do 2-best inference within the restricted domains (appendix A. [sent-148, score-0.571]

70 Most standard inference algorithms, such as Viterbi, do not have this behavior where the inference time is affected by the actual model scores. [sent-152, score-0.232]

71 By coupling inference and learning, practitioners have more freedom to trade off test-time speed vs. [sent-153, score-0.188]

72 6 Related Work Column generation has been employed as a way of dramatically speeding up MAP inference problems in Riedel et al. [10], which applies it directly to the LP relaxation for dependency parsing with grandparent edges. [sent-155, score-0.345]

73 There has been substantial prior work on improving the speed of max-product inference in chains by pruning the search process. [sent-156, score-0.451]

74 CarpeDiem [11] relies on an expression similar to the oriented, left-to-right reduced cost equation of (8), also with a similar pruning strategy to the one described in section 4. [sent-157, score-0.39]

75 [12] presented a staggered decoding strategy that similarly attempts to bound the best achievable score using uninstantiated domains, but only used local scores when searching for new candidates. [sent-160, score-0.229]

76 The dual variables obtained in earlier runs were then used to warm-start the inference in later runs, similarly to what is done in section 4. [sent-161, score-0.279]

77 However, their algorithms do not provide extensions to inference in trees, a margin-violation oracle, and approximate inference using a duality gap. [sent-164, score-0.269]

78 For example, in dual decomposition [16], inference in joint models is reduced to repeated inference in independent models. [sent-173, score-0.481]

79 Tree block-coordinate descent performs approximate inference in loopy models using exact inference in trees as a subroutine [1]. [sent-174, score-0.363]

80 Column generation is cutting planes in the dual, and cutting planes have been used successfully in various machine learning contexts. [sent-175, score-0.43]

81 The reduced cost provides an estimate of the desirability of an edge setting, and thus our algorithm is a heuristic search in the space of edge settings. [sent-179, score-0.217]

82 With dual feasibility, this heuristic is consistent, and thus our algorithm is iteratively constructing a heuristic such that it can perform A∗ search for the final restricted LP [20]. [sent-180, score-0.252]

83 7 Experiments We compare the performance of column generation with exact and approximate inference on Wall Street Journal [21] part-of-speech (POS) tagging and joint POS tagging and named-entity recognition (POS/NER). [sent-181, score-0.781]

84 Table 1 compares the inference times and accuracies of column generation (CG), Viterbi, Viterbi with the final pruning technique described in section 4. [sent-185, score-0.567]

85 For POS, CG is more than twice as fast as Viterbi, with speed comparable to a beam of size 3. [sent-188, score-0.206]

86 Exact inference in the model obtains a tagging accuracy of 95. [sent-191, score-0.255]

87 We observe a 13x speedup over Viterbi and are comparable in speed with a beam of size 7, while being exact. [sent-194, score-0.206]

88 Column generation always finished in at most three iterations, and 22% of the time it terminated after one. [sent-197, score-0.204]

89 86% of the time, the reduced-cost oracle iterated over at most 5 candidate edge settings, which is a significant reduction from the worst-case behavior of 45². [sent-198, score-0.167]

90 The pruning strategy in Viterbi+P manages to restrict the number of possible labels for each token to at most 5 for over 65% of the tokens, and prunes the size of each domain by half over 95% of the time. [sent-199, score-0.217]

91 Table 2(B) shows column generation with two other reduced-cost formulations on the same POS tagging task. [sent-209, score-0.499]

92 1 Implemented by replacing all maximizations in the Viterbi code with two-best maximizations. [sent-212, score-0.46]

93 Table 1: Comparing inference time and exactness of Column Generation (CG), Viterbi, Viterbi with the final pruning technique of section 4. [sent-242, score-0.234]

94 15%(CG+DG), and beam search on POS tagging (left) and joint POS/NER (right). [sent-244, score-0.352]

95 Table 2: (A) speedups for a 0/1 loss oracle; (B) comparison of reduced-cost formulations. In Figure 2, we explore the ability to manipulate training-time regularization to trade off test accuracy and test speed, as discussed in section 5. [sent-260, score-0.388]

96 8 Conclusions and future work In this paper we presented an efficient family of algorithms based on column generation for MAP inference in chains and trees. [sent-265, score-0.599]

97 Depending on the parameter settings, it can be twice as fast as Viterbi in WSJ POS tagging and 13x faster in a joint POS-NER task. [sent-267, score-0.21]

98 One avenue of further work is to extend the bounding strategies in this algorithm for inference in cluster graphs or junction trees, allowing faster inference in higher-order chains or even loopy graphical models. [sent-268, score-0.485]

99 Parse, price and cut: delayed column and row generation for graph-based parsers. [sent-334, score-0.333]

100 On dual decomposition and linear programming relaxations for natural language processing. [sent-373, score-0.157]


similar papers computed by tfidf model

tfidf for this paper:

wordName wordTfidf (topN-words)

[('viterbi', 0.46), ('xi', 0.375), ('lp', 0.296), ('pos', 0.248), ('generation', 0.204), ('cg', 0.182), ('beam', 0.163), ('chains', 0.15), ('reduced', 0.143), ('tagging', 0.139), ('column', 0.129), ('inference', 0.116), ('dual', 0.106), ('oracle', 0.099), ('messages', 0.098), ('restricted', 0.096), ('pruning', 0.092), ('primal', 0.089), ('cost', 0.089), ('riedel', 0.085), ('di', 0.076), ('lps', 0.075), ('faster', 0.071), ('edge', 0.068), ('unrestricted', 0.065), ('kaji', 0.064), ('maxxi', 0.064), ('staggered', 0.064), ('planes', 0.062), ('variables', 0.057), ('map', 0.057), ('sontag', 0.055), ('domain', 0.055), ('si', 0.055), ('exact', 0.054), ('linguistics', 0.053), ('ci', 0.053), ('relaxations', 0.051), ('cutting', 0.051), ('search', 0.05), ('street', 0.049), ('mcauley', 0.049), ('xn', 0.049), ('trees', 0.048), ('transition', 0.047), ('argmax', 0.046), ('xj', 0.045), ('message', 0.043), ('speed', 0.043), ('domains', 0.043), ('carpediem', 0.043), ('reducedcostoracle', 0.043), ('restructure', 0.043), ('samplerank', 0.043), ('scores', 0.04), ('thirteen', 0.038), ('duals', 0.038), ('precomputed', 0.038), ('wall', 0.037), ('duality', 0.037), ('feasibility', 0.036), ('appendix', 0.036), ('strategy', 0.035), ('employ', 0.035), ('score', 0.035), ('prunes', 0.035), ('contract', 0.035), ('xl', 0.034), ('solving', 0.033), ('graphical', 0.032), ('ui', 0.032), ('inef', 0.031), ('equation', 0.031), ('throughput', 0.031), ('gap', 0.031), ('chain', 0.031), ('dynamic', 0.03), ('explored', 0.03), ('speedups', 0.03), ('searching', 0.029), ('rules', 0.029), ('tree', 0.029), ('practitioners', 0.029), ('subroutine', 0.029), ('assignment', 0.028), ('transitions', 0.028), ('lagrangian', 0.028), ('prime', 0.027), ('violation', 0.027), ('formulations', 0.027), ('technique', 0.026), ('maximization', 0.026), ('dg', 0.026), ('tokens', 0.026), ('variable', 0.026), ('decoding', 0.026), ('updating', 0.026), ('wainwright', 0.025), ('constraints', 0.025), ('al', 0.025)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.99999952 204 nips-2012-MAP Inference in Chains using Column Generation

Author: David Belanger, Alexandre Passos, Sebastian Riedel, Andrew McCallum

Abstract: Linear chains and trees are basic building blocks in many applications of graphical models, and they admit simple exact maximum a-posteriori (MAP) inference algorithms based on message passing. However, in many cases this computation is prohibitively expensive, due to quadratic dependence on variables’ domain sizes. The standard algorithms are inefficient because they compute scores for hypotheses for which there is strong negative local evidence. For this reason there has been significant previous interest in beam search and its variants; however, these methods provide only approximate results. This paper presents new exact inference algorithms based on the combination of column generation and pre-computed bounds on terms of the model’s scoring function. While we do not improve worst-case performance, our method substantially speeds real-world, typical-case inference in chains and trees. Experiments show our method to be twice as fast as exact Viterbi for Wall Street Journal part-of-speech tagging and over thirteen times faster for a joint part-of-speech and named-entity-recognition task. Our algorithm is also extendable to new techniques for approximate inference, to faster 0/1 loss oracles, and new opportunities for connections between inference and learning. We encourage further exploration of high-level reasoning about the optimization problem implicit in dynamic programs. 1

2 0.22662826 84 nips-2012-Convergence Rate Analysis of MAP Coordinate Minimization Algorithms

Author: Ofer Meshi, Amir Globerson, Tommi S. Jaakkola

Abstract: Finding maximum a posteriori (MAP) assignments in graphical models is an important task in many applications. Since the problem is generally hard, linear programming (LP) relaxations are often used. Solving these relaxations efficiently is thus an important practical problem. In recent years, several authors have proposed message passing updates corresponding to coordinate descent in the dual LP. However, these are generally not guaranteed to converge to a global optimum. One approach to remedy this is to smooth the LP, and perform coordinate descent on the smoothed dual. However, little is known about the convergence rate of this procedure. Here we perform a thorough rate analysis of such schemes and derive primal and dual convergence rates. We also provide a simple dual to primal mapping that yields feasible primal solutions with a guaranteed rate of convergence. Empirical evaluation supports our theoretical claims and shows that the method is highly competitive with state of the art approaches that yield global optima. 1

3 0.22518353 143 nips-2012-Globally Convergent Dual MAP LP Relaxation Solvers using Fenchel-Young Margins

Author: Alex Schwing, Tamir Hazan, Marc Pollefeys, Raquel Urtasun

Abstract: While finding the exact solution for the MAP inference problem is intractable for many real-world tasks, MAP LP relaxations have been shown to be very effective in practice. However, the most efficient methods that perform block coordinate descent can get stuck in sub-optimal points as they are not globally convergent. In this work we propose to augment these algorithms with an ε-descent approach and present a method to efficiently optimize for a descent direction in the subdifferential using a margin-based formulation of the Fenchel-Young duality theorem. Furthermore, the presented approach provides a methodology to construct a primal optimal solution from its dual optimal counterpart. We demonstrate the efficiency of the presented approach on spin glass models and protein interaction problems and show that our approach outperforms state-of-the-art solvers. 1

4 0.14113028 186 nips-2012-Learning as MAP Inference in Discrete Graphical Models

Author: Xianghang Liu, James Petterson, Tibério S. Caetano

Abstract: We present a new formulation for binary classification. Instead of relying on convex losses and regularizers such as in SVMs, logistic regression and boosting, or instead non-convex but continuous formulations such as those encountered in neural networks and deep belief networks, our framework entails a non-convex but discrete formulation, where estimation amounts to finding a MAP configuration in a graphical model whose potential functions are low-dimensional discrete surrogates for the misclassification loss. We argue that such a discrete formulation can naturally account for a number of issues that are typically encountered in either the convex or the continuous non-convex approaches, or both. By reducing the learning problem to a MAP inference problem, we can immediately translate the guarantees available for many inference settings to the learning problem itself. We empirically demonstrate in a number of experiments that this approach is promising in dealing with issues such as severe label noise, while still having global optimality guarantees. Due to the discrete nature of the formulation, it also allows for direct regularization through cardinality-based penalties, such as the ℓ0 pseudo-norm, thus providing the ability to perform feature selection and trade-off interpretability and predictability in a principled manner. We also outline a number of open problems arising from the formulation. 1

5 0.13170572 173 nips-2012-Learned Prioritization for Trading Off Accuracy and Speed

Author: Jiarong Jiang, Adam Teichert, Jason Eisner, Hal Daume

Abstract: Users want inference to be both fast and accurate, but quality often comes at the cost of speed. The field has experimented with approximate inference algorithms that make different speed-accuracy tradeoffs (for particular problems and datasets). We aim to explore this space automatically, focusing here on the case of agenda-based syntactic parsing [12]. Unfortunately, off-the-shelf reinforcement learning techniques fail to learn good policies: the state space is simply too large to explore naively. An attempt to counteract this by applying imitation learning algorithms also fails: the “teacher” follows a far better policy than anything in our learner’s policy space, free of the speed-accuracy tradeoff that arises when oracle information is unavailable, and thus largely insensitive to the known reward function. We propose a hybrid reinforcement/apprenticeship learning algorithm that learns to speed up an initial policy, trading off accuracy for speed according to various settings of a speed term in the loss function. 1

6 0.11076834 15 nips-2012-A Polylog Pivot Steps Simplex Algorithm for Classification

7 0.094657429 262 nips-2012-Optimal Neural Tuning Curves for Arbitrary Stimulus Distributions: Discrimax, Infomax and Minimum $L p$ Loss

8 0.093160279 290 nips-2012-Recovery of Sparse Probability Measures via Convex Programming

9 0.093081832 105 nips-2012-Dynamic Pruning of Factor Graphs for Maximum Marginal Prediction

10 0.087405011 96 nips-2012-Density Propagation and Improved Bounds on the Partition Function

11 0.083345443 248 nips-2012-Nonparanormal Belief Propagation (NPNBP)

12 0.083104067 200 nips-2012-Local Supervised Learning through Space Partitioning

13 0.080765679 279 nips-2012-Projection Retrieval for Classification

14 0.077878706 82 nips-2012-Continuous Relaxations for Discrete Hamiltonian Monte Carlo

15 0.074992463 6 nips-2012-A Convex Formulation for Learning Scale-Free Networks via Submodular Relaxation

16 0.074049778 242 nips-2012-Non-linear Metric Learning

17 0.070146002 227 nips-2012-Multiclass Learning with Simplex Coding

18 0.069503717 301 nips-2012-Scaled Gradients on Grassmann Manifolds for Matrix Completion

19 0.069044724 206 nips-2012-Majorization for CRFs and Latent Likelihoods

20 0.066057406 88 nips-2012-Cost-Sensitive Exploration in Bayesian Reinforcement Learning


similar papers computed by lsi model

lsi for this paper:

topicId topicWeight

[(0, 0.213), (1, -0.009), (2, 0.049), (3, -0.075), (4, -0.004), (5, 0.009), (6, -0.029), (7, -0.056), (8, -0.066), (9, 0.011), (10, 0.007), (11, 0.002), (12, 0.079), (13, 0.122), (14, 0.024), (15, 0.025), (16, -0.189), (17, -0.115), (18, 0.043), (19, 0.126), (20, 0.13), (21, 0.002), (22, -0.011), (23, 0.026), (24, -0.126), (25, 0.008), (26, -0.009), (27, -0.023), (28, 0.167), (29, 0.147), (30, 0.126), (31, 0.003), (32, -0.015), (33, 0.071), (34, -0.043), (35, -0.121), (36, -0.051), (37, 0.007), (38, -0.001), (39, 0.024), (40, -0.069), (41, 0.092), (42, -0.023), (43, 0.103), (44, 0.012), (45, 0.057), (46, 0.061), (47, -0.01), (48, -0.06), (49, -0.034)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.97703063 204 nips-2012-MAP Inference in Chains using Column Generation

Author: David Belanger, Alexandre Passos, Sebastian Riedel, Andrew McCallum

Abstract: Linear chains and trees are basic building blocks in many applications of graphical models, and they admit simple exact maximum a-posteriori (MAP) inference algorithms based on message passing. However, in many cases this computation is prohibitively expensive, due to quadratic dependence on variables’ domain sizes. The standard algorithms are inefficient because they compute scores for hypotheses for which there is strong negative local evidence. For this reason there has been significant previous interest in beam search and its variants; however, these methods provide only approximate results. This paper presents new exact inference algorithms based on the combination of column generation and pre-computed bounds on terms of the model’s scoring function. While we do not improve worst-case performance, our method substantially speeds real-world, typical-case inference in chains and trees. Experiments show our method to be twice as fast as exact Viterbi for Wall Street Journal part-of-speech tagging and over thirteen times faster for a joint part-of-speech and named-entity-recognition task. Our algorithm is also extendable to new techniques for approximate inference, to faster 0/1 loss oracles, and new opportunities for connections between inference and learning. We encourage further exploration of high-level reasoning about the optimization problem implicit in dynamic programs. 1

2 0.91888571 143 nips-2012-Globally Convergent Dual MAP LP Relaxation Solvers using Fenchel-Young Margins

Author: Alex Schwing, Tamir Hazan, Marc Pollefeys, Raquel Urtasun

Abstract: While finding the exact solution for the MAP inference problem is intractable for many real-world tasks, MAP LP relaxations have been shown to be very effective in practice. However, the most efficient methods that perform block coordinate descent can get stuck in sub-optimal points as they are not globally convergent. In this work we propose to augment these algorithms with an ε-descent approach and present a method to efficiently optimize for a descent direction in the subdifferential using a margin-based formulation of the Fenchel-Young duality theorem. Furthermore, the presented approach provides a methodology to construct a primal optimal solution from its dual optimal counterpart. We demonstrate the efficiency of the presented approach on spin glass models and protein interaction problems and show that our approach outperforms state-of-the-art solvers. 1

3 0.8868348 84 nips-2012-Convergence Rate Analysis of MAP Coordinate Minimization Algorithms

Author: Ofer Meshi, Amir Globerson, Tommi S. Jaakkola

Abstract: Finding maximum a posteriori (MAP) assignments in graphical models is an important task in many applications. Since the problem is generally hard, linear programming (LP) relaxations are often used. Solving these relaxations efficiently is thus an important practical problem. In recent years, several authors have proposed message passing updates corresponding to coordinate descent in the dual LP. However, these are generally not guaranteed to converge to a global optimum. One approach to remedy this is to smooth the LP, and perform coordinate descent on the smoothed dual. However, little is known about the convergence rate of this procedure. Here we perform a thorough rate analysis of such schemes and derive primal and dual convergence rates. We also provide a simple dual to primal mapping that yields feasible primal solutions with a guaranteed rate of convergence. Empirical evaluation supports our theoretical claims and shows that the method is highly competitive with state of the art approaches that yield global optima. 1

4 0.57733625 302 nips-2012-Scaling MPE Inference for Constrained Continuous Markov Random Fields with Consensus Optimization

Author: Stephen Bach, Matthias Broecheler, Lise Getoor, Dianne O'leary

Abstract: Probabilistic graphical models are powerful tools for analyzing constrained, continuous domains. However, finding most-probable explanations (MPEs) in these models can be computationally expensive. In this paper, we improve the scalability of MPE inference in a class of graphical models with piecewise-linear and piecewise-quadratic dependencies and linear constraints over continuous domains. We derive algorithms based on a consensus-optimization framework and demonstrate their superior performance over state of the art. We show empirically that in a large-scale voter-preference modeling problem our algorithms scale linearly in the number of dependencies and constraints. 1

5 0.54751009 48 nips-2012-Augmented-SVM: Automatic space partitioning for combining multiple non-linear dynamics

Author: Ashwini Shukla, Aude Billard

Abstract: Non-linear dynamical systems (DS) have been used extensively for building generative models of human behavior. Their applications range from modeling brain dynamics to encoding motor commands. Many schemes have been proposed for encoding robot motions using dynamical systems with a single attractor placed at a predefined target in state space. Although these enable the robots to react against sudden perturbations without any re-planning, the motions are always directed towards a single target. In this work, we focus on combining several such DS with distinct attractors, resulting in a multi-stable DS. We show its applicability in reach-to-grasp tasks where the attractors represent several grasping points on the target object. While exploiting multiple attractors provides more flexibility in recovering from unseen perturbations, it also increases the complexity of the underlying learning problem. Here we present the Augmented-SVM (A-SVM) model which inherits region partitioning ability of the well known SVM classifier and is augmented with novel constraints derived from the individual DS. The new constraints modify the original SVM dual whose optimal solution then results in a new class of support vectors (SV). These new SV ensure that the resulting multistable DS incurs minimum deviation from the original dynamics and is stable at each of the attractors within a finite region of attraction. We show, via implementations on a simulated 10 degrees of freedom mobile robotic platform, that the model is capable of real-time motion generation and is able to adapt on-the-fly to perturbations.

6 0.54557979 96 nips-2012-Density Propagation and Improved Bounds on the Partition Function

7 0.5365538 248 nips-2012-Nonparanormal Belief Propagation (NPNBP)

8 0.50644058 230 nips-2012-Multiple Choice Learning: Learning to Produce Multiple Structured Outputs

9 0.50604522 288 nips-2012-Rational inference of relative preferences

10 0.50061208 267 nips-2012-Perceptron Learning of SAT

11 0.48931956 85 nips-2012-Convergence and Energy Landscape for Cheeger Cut Clustering

12 0.48206896 115 nips-2012-Efficient high dimensional maximum entropy modeling via symmetric partition functions

13 0.47946006 105 nips-2012-Dynamic Pruning of Factor Graphs for Maximum Marginal Prediction

14 0.47837591 131 nips-2012-Feature Clustering for Accelerating Parallel Coordinate Descent

15 0.47390023 82 nips-2012-Continuous Relaxations for Discrete Hamiltonian Monte Carlo

16 0.46323261 15 nips-2012-A Polylog Pivot Steps Simplex Algorithm for Classification

17 0.45846468 186 nips-2012-Learning as MAP Inference in Discrete Graphical Models

18 0.45047578 242 nips-2012-Non-linear Metric Learning

19 0.44954059 44 nips-2012-Approximating Concavely Parameterized Optimization Problems

20 0.43315962 290 nips-2012-Recovery of Sparse Probability Measures via Convex Programming


similar papers computed by lda model

lda for this paper:

topicId topicWeight

[(0, 0.028), (21, 0.021), (38, 0.139), (42, 0.014), (54, 0.016), (55, 0.015), (74, 0.047), (76, 0.081), (80, 0.527), (92, 0.031)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.97414249 204 nips-2012-MAP Inference in Chains using Column Generation

Author: David Belanger, Alexandre Passos, Sebastian Riedel, Andrew McCallum

Abstract: Linear chains and trees are basic building blocks in many applications of graphical models, and they admit simple exact maximum a-posteriori (MAP) inference algorithms based on message passing. However, in many cases this computation is prohibitively expensive, due to quadratic dependence on variables’ domain sizes. The standard algorithms are inefficient because they compute scores for hypotheses for which there is strong negative local evidence. For this reason there has been significant previous interest in beam search and its variants; however, these methods provide only approximate results. This paper presents new exact inference algorithms based on the combination of column generation and pre-computed bounds on terms of the model’s scoring function. While we do not improve worst-case performance, our method substantially speeds real-world, typical-case inference in chains and trees. Experiments show our method to be twice as fast as exact Viterbi for Wall Street Journal part-of-speech tagging and over thirteen times faster for a joint part-of-speech and named-entity-recognition task. Our algorithm is also extendable to new techniques for approximate inference, to faster 0/1 loss oracles, and new opportunities for connections between inference and learning. We encourage further exploration of high-level reasoning about the optimization problem implicit in dynamic programs. 1

2 0.96427387 138 nips-2012-Fully Bayesian inference for neural models with negative-binomial spiking

Author: James Scott, Jonathan W. Pillow

Abstract: Characterizing the information carried by neural populations in the brain requires accurate statistical models of neural spike responses. The negative-binomial distribution provides a convenient model for over-dispersed spike counts, that is, responses with greater-than-Poisson variability. Here we describe a powerful data-augmentation framework for fully Bayesian inference in neural models with negative-binomial spiking. Our approach relies on a recently described latentvariable representation of the negative-binomial distribution, which equates it to a Polya-gamma mixture of normals. This framework provides a tractable, conditionally Gaussian representation of the posterior that can be used to design efficient EM and Gibbs sampling based algorithms for inference in regression and dynamic factor models. We apply the model to neural data from primate retina and show that it substantially outperforms Poisson regression on held-out data, and reveals latent structure underlying spike count correlations in simultaneously recorded spike trains. 1

3 0.96392274 314 nips-2012-Slice Normalized Dynamic Markov Logic Networks

Author: Tivadar Papai, Henry Kautz, Daniel Stefankovic

Abstract: Markov logic is a widely used tool in statistical relational learning, which uses a weighted first-order logic knowledge base to specify a Markov random field (MRF) or a conditional random field (CRF). In many applications, a Markov logic network (MLN) is trained in one domain, but used in a different one. This paper focuses on dynamic Markov logic networks, where the size of the discretized time-domain typically varies between training and testing. It has been previously pointed out that the marginal probabilities of truth assignments to ground atoms can change if one extends or reduces the domains of predicates in an MLN. We show that in addition to this problem, the standard way of unrolling a Markov logic theory into a MRF may result in time-inhomogeneity of the underlying Markov chain. Furthermore, even if these representational problems are not significant for a given domain, we show that the more practical problem of generating samples in a sequential conditional random field for the next slice relying on the samples from the previous slice has high computational cost in the general case, due to the need to estimate a normalization factor for each sample. We propose a new discriminative model, slice normalized dynamic Markov logic networks (SN-DMLN), that suffers from none of these issues. It supports efficient online inference, and can directly model influences between variables within a time slice that do not have a causal direction, in contrast with fully directed models (e.g., DBNs). Experimental results show an improvement in accuracy over previous approaches to online inference in dynamic Markov logic networks. 1

4 0.95884305 228 nips-2012-Multilabel Classification using Bayesian Compressed Sensing

Author: Ashish Kapoor, Raajay Viswanathan, Prateek Jain

Abstract: In this paper, we present a Bayesian framework for multilabel classification using compressed sensing. The key idea in compressed sensing for multilabel classification is to first project the label vector to a lower dimensional space using a random transformation and then learn regression functions over these projections. Our approach considers both of these components in a single probabilistic model, thereby jointly optimizing over compression as well as learning tasks. We then derive an efficient variational inference scheme that provides joint posterior distribution over all the unobserved labels. The two key benefits of the model are that a) it can naturally handle datasets that have missing labels and b) it can also measure uncertainty in prediction. The uncertainty estimate provided by the model allows for active learning paradigms where an oracle provides information about labels that promise to be maximally informative for the prediction task. Our experiments show significant boost over prior methods in terms of prediction performance over benchmark datasets, both in the fully labeled and the missing labels case. Finally, we also highlight various useful active learning scenarios that are enabled by the probabilistic model. 1

5 0.94335461 100 nips-2012-Discriminative Learning of Sum-Product Networks

Author: Robert Gens, Pedro Domingos

Abstract: Sum-product networks are a new deep architecture that can perform fast, exact inference on high-treewidth models. Only generative methods for training SPNs have been proposed to date. In this paper, we present the first discriminative training algorithms for SPNs, combining the high accuracy of the former with the representational power and tractability of the latter. We show that the class of tractable discriminative SPNs is broader than the class of tractable generative ones, and propose an efficient backpropagation-style algorithm for computing the gradient of the conditional log likelihood. Standard gradient descent suffers from the diffusion problem, but networks with many layers can be learned reliably using “hard” gradient descent, where marginal inference is replaced by MPE inference (i.e., inferring the most probable state of the non-evidence variables). The resulting updates have a simple and intuitive form. We test discriminative SPNs on standard image classification tasks. We obtain the best results to date on the CIFAR-10 dataset, using fewer features than prior methods with an SPN architecture that learns local image structure discriminatively. We also report the highest published test accuracy on STL-10 even though we only use the labeled portion of the dataset. 1

6 0.93271011 67 nips-2012-Classification Calibration Dimension for General Multiclass Losses

7 0.81727117 121 nips-2012-Expectation Propagation in Gaussian Process Dynamical Systems

8 0.81236851 218 nips-2012-Mixing Properties of Conditional Markov Chains with Unbounded Feature Functions

9 0.78743511 281 nips-2012-Provable ICA with Unknown Gaussian Noise, with Implications for Gaussian Mixtures and Autoencoders

10 0.77387977 293 nips-2012-Relax and Randomize : From Value to Algorithms

11 0.77277446 251 nips-2012-On Lifting the Gibbs Sampling Algorithm

12 0.76300311 200 nips-2012-Local Supervised Learning through Space Partitioning

13 0.75710422 171 nips-2012-Latent Coincidence Analysis: A Hidden Variable Model for Distance Metric Learning

14 0.74358076 207 nips-2012-Mandatory Leaf Node Prediction in Hierarchical Multilabel Classification

15 0.73384619 56 nips-2012-Bayesian active learning with localized priors for fast receptive field characterization

16 0.72991943 197 nips-2012-Learning with Recursive Perceptual Representations

17 0.7238825 279 nips-2012-Projection Retrieval for Classification

18 0.72027367 252 nips-2012-On Multilabel Classification and Ranking with Partial Feedback

19 0.71427113 130 nips-2012-Feature-aware Label Space Dimension Reduction for Multi-label Classification

20 0.71303803 80 nips-2012-Confusion-Based Online Learning and a Passive-Aggressive Scheme