nips nips2013 nips2013-318 knowledge-graph by maker-knowledge-mining

318 nips-2013-Structured Learning via Logistic Regression


Source: pdf

Author: Justin Domke

Abstract: A successful approach to structured learning is to write the learning objective as a joint function of linear parameters and inference messages, and iterate between updates to each. This paper observes that if the inference problem is “smoothed” through the addition of entropy terms, for fixed messages, the learning objective reduces to a traditional (non-structured) logistic regression problem with respect to parameters. In these logistic regression problems, each training example has a bias term determined by the current set of messages. Based on this insight, the structured energy function can be extended from linear factors to any function class where an “oracle” exists to minimize a logistic loss.

Reference: text


Summary: the most important sentences generated by tfidf model

sentIndex sentText sentNum sentScore

1 Abstract A successful approach to structured learning is to write the learning objective as a joint function of linear parameters and inference messages, and iterate between updates to each. [sent-4, score-0.3]

2 This paper observes that if the inference problem is “smoothed” through the addition of entropy terms, for fixed messages, the learning objective reduces to a traditional (non-structured) logistic regression problem with respect to parameters. [sent-5, score-0.529]

3 In these logistic regression problems, each training example has a bias term determined by the current set of messages. [sent-6, score-0.381]

4 Based on this insight, the structured energy function can be extended from linear factors to any function class where an “oracle” exists to minimize a logistic loss. [sent-7, score-0.467]

5 1 Introduction The structured learning problem is to find a function F (x, y) to map from inputs x to outputs as y ∗ = arg maxy F (x, y). [sent-8, score-0.156]

6 F is chosen to optimize a loss function defined on these outputs. [sent-9, score-0.133]

7 A major challenge is that evaluating the loss for a given function F requires solving the inference optimization to find the highest-scoring output y for each exemplar, which is NP-hard in general. [sent-10, score-0.19]

8 A standard solution to this is to write the loss function using an LP-relaxation of the inference problem, meaning an upper-bound on the true loss. [sent-11, score-0.19]

9 The learning problem can then be phrased as a joint optimization of parameters and inference variables, which can be solved, e. [sent-12, score-0.117]

10 , by alternating message-passing updates to inference variables with gradient descent updates to parameters [16, 9]. [sent-14, score-0.353]

11 Previous work has mostly focused on linear energy functions F (x, y) = wT Φ(x, y), where a vector of weights w is adjusted in learning, and Φ(x, y) = Σα Φα (x, yα ) decomposes over subsets of variables yα . [sent-15, score-0.327]

12 ensembles of trees [20, 8, 25, 13, 24, 18, 19] or multi-layer perceptrons [10, 21]) to predict each variable independently. [sent-19, score-0.165]

13 The learning problem is to select fα from some set of functions Fα . [sent-22, score-0.099]

14 Here, following previous work [15], we add entropy smoothing to the LP-relaxation of the inference problem. [sent-23, score-0.194]

15 Again, this leads to phrasing the learning problem as a joint optimization of learning and inference variables, alternating between message-passing updates to inference variables and optimization of the functions fα . [sent-24, score-0.456]

16 The major result is that minimization of the loss over fα ∈ Fα can be re-formulated as a logistic regression problem, with a “bias” vector added to each example reflecting the current messages incoming to factor α. [sent-25, score-0.679]

17 No assumptions are needed on the sets of functions Fα , beyond assuming that an algorithm exists to optimize the logistic loss on a given dataset over all fα ∈ Fα . We experimentally test the results of varying Fα to be the set of linear functions, multi-layer perceptrons, or boosted decision trees. [sent-26, score-0.571]

18 2 Structured Prediction The structured prediction problem can be written as seeking a function h that will predict an output y from an input x. [sent-28, score-0.086]

19 Most commonly, it can be written in the form h(x; w) = arg maxy wT Φ(x, y), (1) where Φ is a fixed function of both x and y. [sent-29, score-0.145]

20 It is further assumed that Φ decomposes into a sum of functions evaluated over subsets of variables yα as Φ(x, y) = Σα Φα (x, yα ). [sent-31, score-0.149]

21 This paper considers the structured learning problem in a more general setting, directly handling nonlinear function classes. [sent-33, score-0.086]

22 We generalize the function h to h(x; F ) = arg maxy F (x, y), where the energy F again decomposes as F (x, y) = Σα fα (x, yα ). [sent-34, score-0.372]
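To make the decomposed setup concrete, here is a minimal sketch (our own toy example, not code from the paper): a tiny chain of binary variables with hypothetical factor functions fα, where F (x, y) = Σα fα (x, yα ) is evaluated directly and h(x; F ) is computed by brute-force enumeration; a real model would replace the enumeration with (approximate) inference.

```python
import itertools
import numpy as np

def energy(F_factors, x, y):
    """Evaluate F(x, y) = sum_alpha f_alpha(x, y_alpha) for a decomposed energy.

    F_factors: list of (indices, f) pairs, where f(x, y_sub) scores the
    configuration of the variables listed in `indices`.
    """
    return sum(f(x, tuple(y[i] for i in idx)) for idx, f in F_factors)

def predict(F_factors, x, n_vars, n_labels=2):
    """h(x; F) = argmax_y F(x, y), by brute force (only viable for tiny y)."""
    best_y, best_val = None, -np.inf
    for y in itertools.product(range(n_labels), repeat=n_vars):
        val = energy(F_factors, x, y)
        if val > best_val:
            best_y, best_val = y, val
    return best_y, best_val

# Hypothetical example: a 3-variable chain with unary and pairwise factors.
x = np.array([0.2, 0.9, 0.4])   # one input feature per variable
unary = lambda i: ((i,), lambda x, y: (2 * x[i] - 1) * (1 if y[0] == 1 else -1))
pair = lambda i, j: ((i, j), lambda x, y: 0.5 if y[0] == y[1] else -0.5)
F_factors = [unary(0), unary(1), unary(2), pair(0, 1), pair(1, 2)]

print(predict(F_factors, x, n_vars=3))
```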

23 The learning problem now becomes to select {fα ∈ Fα } for some set of functions Fα . [sent-35, score-0.099]

24 Here, we do not make any assumption on the class of functions Fα other than assuming that there exists an algorithm to find the best function fα ∈ Fα in terms of the logistic regression loss (Section 6). [sent-37, score-0.42]

25 , (xN , y N ), we wish to select the energy F to minimize the empirical risk R(F ) = Σk l(xk , y k ; F ), (2) for some loss function l. [sent-41, score-0.25]

26 Absent computational concerns, a standard choice would be the slack-rescaled loss [22] l0 (xk , y k ; F ) = maxy F (xk , y) − F (xk , y k ) + ∆(y k , y), (3) where ∆(y k , y) is some measure of discrepancy. [sent-42, score-0.148]
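For illustration, the sketch below evaluates the loss of Eq. 3 by brute force on a hypothetical tiny model, using a Hamming discrepancy for ∆ (which decomposes over variables, as assumed next); the enumeration over y is exactly the step that becomes intractable for realistic problems.

```python
import itertools
import numpy as np

def hamming(y_true, y):
    """Delta(y^k, y): a discrepancy that decomposes over variables."""
    return sum(a != b for a, b in zip(y_true, y))

def slack_rescaled_loss(score_fn, y_true, n_vars, n_labels=2):
    """l0(x^k, y^k; F) = max_y [ F(x^k, y) - F(x^k, y^k) + Delta(y^k, y) ],
    evaluated by brute force over all labelings (tiny problems only)."""
    f_true = score_fn(y_true)
    return max(score_fn(y) - f_true + hamming(y_true, y)
               for y in itertools.product(range(n_labels), repeat=n_vars))

# Hypothetical 3-variable example with random unary scores only.
rng = np.random.default_rng(0)
theta = rng.normal(size=(3, 2))   # theta[i, y_i]: unary scores
score = lambda y: sum(theta[i, yi] for i, yi in enumerate(y))
y_true = (0, 1, 0)

print(slack_rescaled_loss(score, y_true, n_vars=3))
```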

27 We assume that ∆ is a function that decomposes over α (i. [sent-43, score-0.089]

28 If this inference problem must be solved approximately, there is strong motivation [6] for using relaxations of the maximization in Eq. [sent-49, score-0.198]

29 It is easy to show that l1 ≥ l0 , since the two would be equivalent if µ were restricted to binary values, and hence the maximization in l1 takes place over a larger set [6]. [sent-53, score-0.081]

30 The maximization in l1 is of a linear objective under linear constraints, and is thus a linear program (LP), solvable in polynomial time using a generic LP solver. [sent-56, score-0.201]
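As a sketch of what this LP looks like, the following code (a hypothetical two-variable binary model; the variable ordering and scores are ours) maximizes θ · µ over the local polytope with a generic solver.

```python
import numpy as np
from scipy.optimize import linprog

# Variable order: mu1(0), mu1(1), mu2(0), mu2(1), mu12(00), mu12(01), mu12(10), mu12(11)
theta = np.array([0.5, -0.2,              # unary scores for variable 1
                  0.1, 0.3,               # unary scores for variable 2
                  1.0, -1.0, -1.0, 1.0])  # pairwise scores favouring agreement

A_eq = np.array([
    [1, 1, 0, 0, 0, 0, 0, 0],    # mu1 sums to 1
    [0, 0, 1, 1, 0, 0, 0, 0],    # mu2 sums to 1
    [-1, 0, 0, 0, 1, 1, 0, 0],   # sum_{y2} mu12(0, y2) = mu1(0)
    [0, -1, 0, 0, 0, 0, 1, 1],   # sum_{y2} mu12(1, y2) = mu1(1)
    [0, 0, -1, 0, 1, 0, 1, 0],   # sum_{y1} mu12(y1, 0) = mu2(0)
    [0, 0, 0, -1, 0, 1, 0, 1],   # sum_{y1} mu12(y1, 1) = mu2(1)
])
b_eq = np.array([1, 1, 0, 0, 0, 0])

# linprog minimizes, so negate theta; the default-style bounds keep mu >= 0.
res = linprog(c=-theta, A_eq=A_eq, b_eq=b_eq, bounds=(0, None), method="highs")
print("LP value:", -res.fun)
print("pseudomarginals:", np.round(res.x, 3))
```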

31 Here, we make a further approximation to the loss, replacing the inference problem of maxµ∈M θ · µ with the “smoothed” problem maxµ∈M θ · µ + Σα H(µα ), where H(µα ) is the entropy of the marginals µα . [sent-58, score-0.194]
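The effect of the entropy term is easiest to see on a single simplex: maxµ θ · µ + H(µ) has the closed form log Σ exp(θ), attained at the softmax distribution. The check below is generic and not taken from the paper.

```python
import numpy as np
from scipy.special import logsumexp, softmax
from scipy.optimize import minimize

theta = np.array([1.0, -0.5, 2.0])

def neg_smoothed_objective(z):
    """-(theta . mu + H(mu)), with mu parameterized through a softmax of z,
    which keeps mu on the probability simplex automatically."""
    mu = softmax(z)
    entropy = -np.sum(mu * np.log(mu + 1e-12))
    return -(theta @ mu + entropy)

res = minimize(neg_smoothed_objective, x0=np.zeros(3))
print("numerical max :", -res.fun)
print("log-sum-exp   :", logsumexp(theta))  # closed form
print("argmax mu     :", softmax(theta))    # the maximizing marginals
```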

32 The relaxed loss is l(xk , y k ; F ) = −F (xk , y k ) + maxµ∈M [ θF^k · µ + Σα H(µα ) ]. [sent-61, score-0.148]

33 A similar result was previously given [15] bounding the difference of the objective obtained by inference with and without entropy smoothing. [sent-66, score-0.194]

34 4 Overview Now, the learning problem is to select the functions fα composing F to minimize R as defined in Eq. [sent-69, score-0.14]

35 Inspired by previous work [16, 9], our solution (Section 5) is to introduce a vector of “messages” λ to write A in the dual form A(θ) = minλ A(λ, θ), which leads to phrasing learning as the joint minimization of Σk [ −F (xk , y k ) + A(λk , θF ) ]. [sent-74, score-0.158]

36 For fixed F , message-passing can be used to perform coordinate ascent updates to all the messages λk (Section 5). [sent-76, score-0.305]

37 These updates are trivially parallelized with respect to k. [sent-77, score-0.093]

38 However, the problem remains, for fixed messages, how to optimize the functions fα composing F . [sent-78, score-0.161]

39 Section 7 observes that this problem can be re-formulated into a (non-structured) logistic regression problem, with “bias” terms added to each example that reflect the current messages into factor α. [sent-79, score-0.619]

40 5 Inference In order to evaluate the loss, it is necessary to solve the maximization in Eq. [sent-80, score-0.081]

41 For a given θ, consider doing inference over µ, that is, in solving the maximization in Eq. [sent-82, score-0.198]

42 Standard Lagrangian duality theory gives the following dual representation for A(θ) in terms of “messages” λα (xβ ) from a region α to a subregion β ⊂ α, a variant of the representation of Heskes [11]. [sent-84, score-0.096]

43 Algorithm 1 Reducing structured learning to logistic regression. [sent-85, score-0.289]

44 1. For all k, for all α, set the bias term to bα^k (y α ) ← ∆(yα^k , y α ) + Σβ⊂α λα^k (yβ ) − Σγ⊃α λγ^k (y α ).

45 2. For all α, solve the logistic regression problem

46 fα ← arg maxfα ∈Fα Σk [ fα (xk , yα^k ) + bα^k (yα^k ) − log Σyα exp( fα (xk , yα ) + bα^k (yα ) ) ].
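A minimal sketch of step 1, assuming the reconstruction of the bias formula given above (the paper's exact message bookkeeping and any scaling constants may differ); the array layout is ours.

```python
import numpy as np

def factor_bias(delta_row, incoming, outgoing):
    """b_alpha^k(y_alpha) = Delta(y_alpha^k, y_alpha)
                            + sum of messages involving subregions beta of alpha
                            - sum of messages from superregions gamma of alpha,
    all given as arrays indexed by the configurations y_alpha of factor alpha.

    delta_row: (n_configs,) discrepancy Delta(y_alpha^k, .)
    incoming:  list of (n_configs,) arrays, lambda_alpha(y_beta) lifted to y_alpha
    outgoing:  list of (n_configs,) arrays, lambda_gamma(y_alpha)
    """
    b = delta_row.astype(float).copy()
    for lam in incoming:
        b += lam
    for lam in outgoing:
        b -= lam
    return b

# Hypothetical pairwise factor with 4 configurations (two binary variables).
delta_row = np.array([0.0, 1.0, 1.0, 2.0])      # Hamming distance to y_alpha^k = (0, 0)
incoming = [np.array([0.3, 0.3, -0.1, -0.1]),   # message on the first variable
            np.array([0.2, -0.2, 0.2, -0.2])]   # message on the second variable
outgoing = []                                   # no superregions in a pairwise model
print(factor_bias(delta_row, incoming, outgoing))
```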

47 A(θ) can be represented in the dual form A(θ) = minλ A(λ, θ), where A(λ, θ) = maxµ∈N [ θ · µ + Σα H(µα ) + Σα Σβ⊂α Σxβ λα (xβ ) (µαβ (yβ ) − µβ (yβ )) ], (8) and N = {µ | Σyα µα (yα ) = 1, µα (yα ) ≥ 0} is the set of locally normalized pseudomarginals. [sent-97, score-0.139]

48 Thus, for any set of messages λ, there is an easily-evaluated upper-bound A(λ, θ) ≥ A(θ), and when A(λ, θ) is minimized with respect to λ, this bound is tight. [sent-99, score-0.248]

49 In our experiments, we use blocks consisting of the set of all messages λα (yν ) for all regions α containing ν. [sent-102, score-0.248]

50 When the graph only contains regions for single variables and pairs, this is a “star update” of all the messages from pairs that contain a variable i. [sent-103, score-0.248]

51 It can be shown [11, 15] that the update is λα (yν ) ← λα (yν ) + (1/(1 + Nν )) ( log µν (yν ) + Σα′⊃ν log µα′ (yν ) ) − log µα (yν ), (10) for all α ⊃ ν, where Nν = |{α | α ⊃ ν}|. [sent-104, score-0.082]
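A direct transcription of this update (assuming the reconstruction of Eq. 10 above), taking the current variable and factor marginals as given; how those marginals are obtained from θ and λ is omitted here.

```python
import numpy as np

def star_update(lam, mu_nu, mu_factors):
    """Block update of all messages lambda_alpha(y_nu) for regions alpha containing nu.

    lam:        dict alpha -> current message vector over the states of nu
    mu_nu:      current marginal of variable nu (vector over its states)
    mu_factors: dict alpha -> current marginal of factor alpha, already
                marginalized down to the states of nu
    """
    n_nu = len(mu_factors)   # N_nu = |{alpha : alpha contains nu}|
    log_mu_nu = np.log(mu_nu)
    log_mu = {a: np.log(m) for a, m in mu_factors.items()}
    total = log_mu_nu + sum(log_mu.values())
    new_lam = {}
    for a in mu_factors:
        new_lam[a] = lam[a] + total / (1.0 + n_nu) - log_mu[a]
    return new_lam

# Hypothetical variable nu with two states, contained in two pairwise regions.
lam = {"a12": np.zeros(2), "a13": np.zeros(2)}
mu_nu = np.array([0.6, 0.4])
mu_factors = {"a12": np.array([0.7, 0.3]), "a13": np.array([0.5, 0.5])}
print(star_update(lam, mu_nu, mu_factors))
```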

52 6 Logistic Regression Logistic regression is traditionally understood as defining a conditional distribution p(y|x; W ) = exp ((W x)y ) /Z(x) where W is a matrix that maps the input features x to a vector of margins W x. [sent-107, score-0.18]

53 It is easy to show that the maximum conditional likelihood training problem maxW Σk log p(y k |xk ; W ) is equivalent to maxW Σk [ (W xk )yk − log Σy exp((W xk )y ) ]. [sent-108, score-1.03]
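A quick numerical check of this equivalence on random, hypothetical data:

```python
import numpy as np
from scipy.special import logsumexp

rng = np.random.default_rng(0)
W = rng.normal(size=(3, 5))   # 3 classes, 5 features
X = rng.normal(size=(4, 5))   # 4 examples
y = np.array([0, 2, 1, 2])

margins = X @ W.T             # (W x^k) for each example, shape (4, 3)
log_lik = np.sum(margins[np.arange(4), y] - logsumexp(margins, axis=1))

# Same value via the normalized conditional distribution p(y | x; W).
probs = np.exp(margins - logsumexp(margins, axis=1, keepdims=True))
print(log_lik, np.sum(np.log(probs[np.arange(4), y])))
```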

54 Secondly, we assume that there is a pre-determined “bias” vector bk associated with each training example. [sent-112, score-0.289]

55 This yields the learning problem maxf ∈F Σk [ f (xk , y k ) + bk (y k ) − log Σy exp( f (xk , y) + bk (y) ) ]. (11) Aside from linear logistic regression, one can see decision trees, multi-layer perceptrons, and boosted ensembles under an appropriate loss as solving Eq. [sent-113, score-1.152]

56 11 for different sets of functions F (albeit possibly to a local maximum). [sent-114, score-0.099]

57 7 Training Recall that the learning problem is to select the functions fα ∈ Fα so as to minimize the empirical risk R(F ) = Σk [ −F (xk , y k ) + A(θF^k ) ]. [sent-115, score-0.099]

58 12, we alternate between optimization of the messages {λk } and the energy functions {fα }. [sent-119, score-0.492]

59 Optimization with respect to λk for fixed F decomposes into minimizing A(λk , θF ) independently for each y k , which can be done by running message-passing updates as in Section 5 using the parameter vector θF^k . [sent-120, score-0.146]

60 ρ is the minimizer of Eq. 12 for fixed messages λ, then fα∗ = arg maxfα Σk [ fα (xk , yα^k ) + bα^k (yα^k ) − log Σyα exp( fα (xk , yα ) + bα^k (yα ) ) ], (13) where the set of biases is defined as bα^k (y α ) = ∆(yα^k , y α ) + Σβ⊂α λα (yβ ) − Σγ⊃α λγ (yα ). [sent-128, score-1.219]

61 Using the definition of bk from Eq.

62 Applying Lemma 3 to the inner maximization gives the closed-form expression A(λk , θF^k ) = Σα log Σyα exp( fα (x, yα ) + bα (yα ) ). [sent-193, score-0.175]

63 For example, it is common to model image segmentation problems using a 4-connected grid with an energy like F (x, y) = Σi u(φi , yi ) + Σij v(φij , yi , yj ), where φi /φij are univariate/pairwise features determined by x, and u and v are functions mapping local features to local energies. [sent-203, score-0.928]
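A small sketch of such a grid energy (our own example, with hypothetical local functions u and v), evaluating F (x, y) on a 4-connected grid:

```python
import numpy as np

def grid_energy(phi_unary, phi_pair, y, u, v):
    """F(x, y) = sum_i u(phi_i, y_i) + sum_{ij} v(phi_ij, y_i, y_j) on a
    4-connected grid. phi_unary[i, j] and phi_pair[edge] hold local features;
    y[i, j] is the labeling."""
    H, W = y.shape
    total = 0.0
    for i in range(H):
        for j in range(W):
            total += u(phi_unary[i, j], y[i, j])
            if j + 1 < W:   # right neighbour
                total += v(phi_pair[(i, j, i, j + 1)], y[i, j], y[i, j + 1])
            if i + 1 < H:   # bottom neighbour
                total += v(phi_pair[(i, j, i + 1, j)], y[i, j], y[i + 1, j])
    return total

# Hypothetical local energies: u rewards label 1 on bright pixels,
# v rewards agreement more strongly when the pairwise feature is small.
u = lambda phi, yi: (2 * phi - 1) * (1 if yi == 1 else -1)
v = lambda phi, yi, yj: (1 - phi) * (1 if yi == yj else -1)

rng = np.random.default_rng(0)
H, W = 3, 4
phi_unary = rng.random((H, W))
phi_pair = {(i, j, i2, j2): abs(phi_unary[i, j] - phi_unary[i2, j2])
            for i in range(H) for j in range(W)
            for (i2, j2) in [(i, j + 1), (i + 1, j)] if i2 < H and j2 < W}
y = (phi_unary > 0.5).astype(int)
print(grid_energy(phi_unary, phi_pair, y, u, v))
```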

64 In this case, u would be selected to maximize Σk Σi [ u(φi^k , yi^k ) + bi^k (yi^k ) − log Σyi exp( u(φi^k , yi ) + bi^k (yi ) ) ], and an analogous expression exists for v. [sent-204, score-0.897]

65 8 Experiments These experiments consider three different function classes: linear, boosted decision trees, and multi-layer perceptrons. [sent-206, score-0.135]

66 11 under linear functions f (x, y) = (W x)y , we simply compute the gradient with respect to W and use batch L-BFGS. [sent-208, score-0.176]
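A sketch of this linear case under our reading of Eq. 11, with per-example bias vectors bk: the objective and its gradient with respect to W, minimized with SciPy's L-BFGS (a generic implementation, not the authors' code).

```python
import numpy as np
from scipy.optimize import minimize
from scipy.special import logsumexp, softmax

def neg_objective_and_grad(w_flat, X, y, B, n_classes):
    """Negative of Eq. 11 for f(x, y) = (W x)_y plus a bias b^k(y) per example.

    X: (n, d) features, y: (n,) labels, B: (n, n_classes) bias terms."""
    n, d = X.shape
    W = w_flat.reshape(n_classes, d)
    scores = X @ W.T + B                     # (n, n_classes)
    obj = np.sum(scores[np.arange(n), y] - logsumexp(scores, axis=1))
    P = softmax(scores, axis=1)              # predicted distributions
    Y = np.zeros_like(P)
    Y[np.arange(n), y] = 1.0
    grad_W = (Y - P).T @ X                   # d(obj)/dW
    return -obj, -grad_W.ravel()

# Hypothetical data: 50 examples, 4 features, 3 classes, random biases.
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 4))
y = rng.integers(0, 3, size=50)
B = 0.1 * rng.normal(size=(50, 3))           # stands in for message-derived biases

res = minimize(neg_objective_and_grad, x0=np.zeros(3 * 4), jac=True,
               args=(X, y, B, 3), method="L-BFGS-B")
W_opt = res.x.reshape(3, 4)
print("final objective:", -res.fun)
```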

67 Boosted decision trees use stochastic gradient boosting [7]: the gradient of the logistic loss is computed for each exemplar, and a regression tree is induced to fit this (one tree for each class). [sent-212, score-0.965]

68 Then, an optimization adjusts the values of leaf nodes to optimize the logistic loss. [sent-214, score-0.31]

69 Finally, the tree values are multiplied by 2 At each time, the new step is a combination of . [sent-215, score-0.087]
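A sketch of one boosting round in the binary case (for brevity), assuming scikit-learn's DecisionTreeRegressor as the tree inducer and omitting the stochastic subsampling: fit a regression tree to the negative gradient of the logistic loss, replace each leaf value by a Newton step on that loss, then take a damped step.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def boost_round(F, X, y, max_depth=3, step=0.1):
    """One round of gradient boosting on the binary logistic loss.

    F: current scores on the training set (n,), y: labels in {0, 1}.
    Returns (tree, leaf_values, updated_scores)."""
    p = 1.0 / (1.0 + np.exp(-F))
    residual = y - p    # negative gradient of the logistic loss
    tree = DecisionTreeRegressor(max_depth=max_depth).fit(X, residual)

    # Adjust each leaf to (approximately) optimize the logistic loss: a Newton
    # step sum(residual) / sum(p * (1 - p)) over the examples in that leaf.
    leaf_id = tree.apply(X)
    leaf_values = {}
    for leaf in np.unique(leaf_id):
        mask = leaf_id == leaf
        hess = np.sum(p[mask] * (1 - p[mask])) + 1e-12
        leaf_values[leaf] = np.sum(residual[mask]) / hess

    F_new = F + step * np.array([leaf_values[l] for l in leaf_id])
    return tree, leaf_values, F_new

# Hypothetical data.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
y = (X[:, 0] + 0.5 * X[:, 1] > 0).astype(float)
F = np.zeros(200)
for _ in range(20):
    _, _, F = boost_round(F, X, y)
print("train accuracy:", np.mean((F > 0) == (y == 1)))
```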

Figure 1: The univariate (top) and pairwise (bottom) energy functions learned on denoising data. [sent-242, score-0.431]

74 Each column shows the result of training both univariate and pairwise terms with one function class. [sent-243, score-0.2]

Figure 2: Dashed/Solid lines show univariate train/test error rates as a function of learning iterations for varying univariate (rows) and pairwise (columns) classifiers. [sent-256, score-0.3]

76 Each learning iteration consists of updating fi , performing 25 iterations of message passing, updating fij , and then performing another 25 iterations of message-passing. [sent-262, score-0.515]

77 Next, if yi = 0, φk i k is sampled uniformly from [0, . [sent-266, score-0.105]

78 Finally, for a pair i k k k k (i, j), if yi = yj , then φk is sampled from [0, . [sent-269, score-0.15]

79 A constant feature is also added to both φi^k and φij^k . [sent-272, score-0.197]

80 Finally, because there is only a single input feature for univariate and pairwise terms, the resulting functions are plotted in Fig. [sent-279, score-0.215]

81 Second, as a more realistic example, we use the Weizmann horses dataset. [sent-281, score-0.128]

82 We use 42 univariate features fi^k , consisting of a constant (1), the RGB values of the pixel (3), the vertical and horizontal position (2), and a histogram of gradients [2] (36). [sent-282, score-0.155]

83 9 Discussion This paper observes that in the structured learning setting, the optimization with respect to energy can be formulated as a logistic regression problem for each factor, “biased” by the current messages. [sent-286, score-0.559]

84 Thus, it is possible to use any function class where an “oracle” exists to optimize a logistic loss. [sent-287, score-0.263]

85 Besides the possibility of using more general classes of energies, another advantage of the proposed method is the “software engineering” benefit of having the algorithm for fitting the energy modularized from the rest of the learning procedure. [sent-288, score-0.138]

86 The ability to easily define new energy functions for individual problems could have practical impact. [sent-289, score-0.198]

87 In related work, Hazan and Urtasun [9] use a linear energy, and alternate between updating all inference variables and a gradient descent update to parameters, using an entropy-smoothed inference objective. [sent-291, score-0.402]

88 [16] also use a linear energy, with a stochastic algorithm updating inference variables and taking a stochastic gradient step on parameters for one exemplar at a time, with a pure LP-relaxation of inference. [sent-293, score-0.358]

89 The proposed method iterates between updating all inference variables and performing a full optimization of the energy. [sent-294, score-0.207]

90 In practice, however, inference is easily parallelized over the data, and the majority of computational time is spent in the logistic regression subproblems. [sent-296, score-0.44]
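For example, the per-example message updates can be farmed out with the standard library alone; the update function below is a trivial stand-in for whatever implements Section 5, kept only so the sketch runs as written.

```python
from concurrent.futures import ProcessPoolExecutor

def update_messages_for_example(args):
    """Stand-in for the per-example message-passing updates of Section 5:
    here it just returns its input unchanged, to keep the sketch runnable."""
    k, messages_k = args
    return k, messages_k

if __name__ == "__main__":
    messages = {k: [0.0, 0.0] for k in range(8)}   # one message block per example
    with ProcessPoolExecutor() as pool:
        for k, new_m in pool.map(update_messages_for_example, messages.items()):
            messages[k] = new_m
    print(messages)
```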

91 Another related work is Gradient Tree Boosting [4], in which, to train a CRF, the functional gradient of the conditional likelihood is computed and a regression tree is induced. [sent-298, score-0.247]

92 The main limitation is the assumption that inference can be solved exactly. [sent-300, score-0.117]

93 It appears possible to extend this to inexact inference, where the tree is induced to improve a dual bound, but this has not been done so far. [sent-301, score-0.151]

94 Experimentally, however, simply inducing a tree on the loss gradient leads to much slower learning if the leaf nodes are not modified to optimize the logistic loss. [sent-302, score-0.546]

95 Thus, it is likely that such a strategy would still benefit from using the logistic regression reformulation. [sent-303, score-0.287]

96 Efficient learning of structured predictors in general graphical models. [sent-334, score-0.086]

97 Convexity arguments for efficient minimization of the bethe and kikuchi free energies. [sent-343, score-0.085]

98 On parameter learning in CRF-based approaches to object class image segmentation. [sent-374, score-0.089]

99 Textonboost for image understanding: Multi-class object recognition and segmentation by jointly modeling texture, layout, and context. [sent-384, score-0.193]

100 Scene segmentation with crfs learned from partially labeled images. [sent-394, score-0.157]


similar papers computed by tfidf model

tfidf for this paper:

wordName wordTfidf (topN-words)

[('xk', 0.414), ('mlp', 0.249), ('messages', 0.248), ('bk', 0.244), ('logistic', 0.203), ('boosting', 0.187), ('ij', 0.161), ('meshi', 0.147), ('fij', 0.145), ('energy', 0.138), ('yij', 0.128), ('horses', 0.128), ('inference', 0.117), ('univariate', 0.112), ('yi', 0.105), ('segmentation', 0.104), ('boosted', 0.091), ('decomposes', 0.089), ('tree', 0.087), ('structured', 0.086), ('fi', 0.085), ('regression', 0.084), ('maximization', 0.081), ('denoising', 0.078), ('entropy', 0.077), ('gradient', 0.076), ('max', 0.075), ('loss', 0.073), ('perceptrons', 0.073), ('exemplar', 0.073), ('pseudomarginals', 0.072), ('rbg', 0.072), ('arg', 0.07), ('dual', 0.064), ('optimize', 0.06), ('functions', 0.06), ('justin', 0.059), ('phrasing', 0.059), ('urtasun', 0.059), ('updates', 0.057), ('er', 0.056), ('classi', 0.056), ('jamie', 0.055), ('pushmeet', 0.055), ('rother', 0.055), ('iccv', 0.055), ('exp', 0.053), ('carsten', 0.053), ('crfs', 0.053), ('hmax', 0.053), ('updating', 0.052), ('bethe', 0.05), ('nowozin', 0.05), ('bias', 0.049), ('antonio', 0.048), ('trees', 0.048), ('observes', 0.048), ('hazan', 0.047), ('leaf', 0.047), ('ofer', 0.047), ('daphne', 0.047), ('alternating', 0.046), ('image', 0.046), ('tommi', 0.045), ('winn', 0.045), ('layout', 0.045), ('yj', 0.045), ('training', 0.045), ('decision', 0.044), ('ensembles', 0.044), ('features', 0.043), ('object', 0.043), ('marginalized', 0.043), ('pairwise', 0.043), ('ers', 0.042), ('amir', 0.042), ('sebastian', 0.042), ('composing', 0.041), ('log', 0.041), ('linear', 0.04), ('ijcv', 0.04), ('local', 0.039), ('iteration', 0.039), ('select', 0.039), ('wt', 0.038), ('performing', 0.038), ('jaakkola', 0.037), ('parallelized', 0.036), ('added', 0.036), ('minimization', 0.035), ('smoothed', 0.034), ('iterations', 0.033), ('stephen', 0.033), ('lp', 0.032), ('custom', 0.032), ('heskes', 0.032), ('deva', 0.032), ('elidan', 0.032), ('gal', 0.032), ('subregion', 0.032)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 1.0000005 318 nips-2013-Structured Learning via Logistic Regression

Author: Justin Domke

Abstract: A successful approach to structured learning is to write the learning objective as a joint function of linear parameters and inference messages, and iterate between updates to each. This paper observes that if the inference problem is “smoothed” through the addition of entropy terms, for fixed messages, the learning objective reduces to a traditional (non-structured) logistic regression problem with respect to parameters. In these logistic regression problems, each training example has a bias term determined by the current set of messages. Based on this insight, the structured energy function can be extended from linear factors to any function class where an “oracle” exists to minimize a logistic loss.

2 0.15025586 257 nips-2013-Projected Natural Actor-Critic

Author: Philip S. Thomas, William C. Dabney, Stephen Giguere, Sridhar Mahadevan

Abstract: Natural actor-critics form a popular class of policy search algorithms for finding locally optimal policies for Markov decision processes. In this paper we address a drawback of natural actor-critics that limits their real-world applicability—their lack of safety guarantees. We present a principled algorithm for performing natural gradient descent over a constrained domain. In the context of reinforcement learning, this allows for natural actor-critic algorithms that are guaranteed to remain within a known safe region of policy space. While deriving our class of constrained natural actor-critic algorithms, which we call Projected Natural ActorCritics (PNACs), we also elucidate the relationship between natural gradient descent and mirror descent. 1

3 0.13844207 149 nips-2013-Latent Structured Active Learning

Author: Wenjie Luo, Alex Schwing, Raquel Urtasun

Abstract: In this paper we present active learning algorithms in the context of structured prediction problems. To reduce the amount of labeling necessary to learn good models, our algorithms operate with weakly labeled data and we query additional examples based on entropies of local marginals, which are a good surrogate for uncertainty. We demonstrate the effectiveness of our approach in the task of 3D layout prediction from single images, and show that good models are learned when labeling only a handful of random variables. In particular, the same performance as using the full training set can be obtained while only labeling ∼10% of the random variables. 1

4 0.12452736 168 nips-2013-Learning to Pass Expectation Propagation Messages

Author: Nicolas Heess, Daniel Tarlow, John Winn

Abstract: Expectation Propagation (EP) is a popular approximate posterior inference algorithm that often provides a fast and accurate alternative to sampling-based methods. However, while the EP framework in theory allows for complex nonGaussian factors, there is still a significant practical barrier to using them within EP, because doing so requires the implementation of message update operators, which can be difficult and require hand-crafted approximations. In this work, we study the question of whether it is possible to automatically derive fast and accurate EP updates by learning a discriminative model (e.g., a neural network or random forest) to map EP message inputs to EP message outputs. We address the practical concerns that arise in the process, and we provide empirical analysis on several challenging and diverse factors, indicating that there is a space of factors where this approach appears promising. 1

5 0.11705869 211 nips-2013-Non-Linear Domain Adaptation with Boosting

Author: Carlos J. Becker, Christos M. Christoudias, Pascal Fua

Abstract: A common assumption in machine vision is that the training and test samples are drawn from the same distribution. However, there are many problems when this assumption is grossly violated, as in bio-medical applications where different acquisitions can generate drastic variations in the appearance of the data due to changing experimental conditions. This problem is accentuated with 3D data, for which annotation is very time-consuming, limiting the amount of data that can be labeled in new acquisitions for training. In this paper we present a multitask learning algorithm for domain adaptation based on boosting. Unlike previous approaches that learn task-specific decision boundaries, our method learns a single decision boundary in a shared feature space, common to all tasks. We use the boosting-trick to learn a non-linear mapping of the observations in each task, with no need for specific a-priori knowledge of its global analytical form. This yields a more parameter-free domain adaptation approach that successfully leverages learning on new tasks where labeled data is scarce. We evaluate our approach on two challenging bio-medical datasets and achieve a significant improvement over the state of the art. 1

6 0.1160751 214 nips-2013-On Algorithms for Sparse Multi-factor NMF

7 0.10958635 150 nips-2013-Learning Adaptive Value of Information for Structured Prediction

8 0.10503943 93 nips-2013-Discriminative Transfer Learning with Tree-based Priors

9 0.1034807 258 nips-2013-Projecting Ising Model Parameters for Fast Mixing

10 0.10100085 142 nips-2013-Information-theoretic lower bounds for distributed statistical estimation with communication constraints

11 0.099624187 90 nips-2013-Direct 0-1 Loss Minimization and Margin Maximization with Boosting

12 0.09895806 235 nips-2013-Online learning in episodic Markovian decision processes by relative entropy policy search

13 0.097190946 75 nips-2013-Convex Two-Layer Modeling

14 0.092548683 268 nips-2013-Reflection methods for user-friendly submodular optimization

15 0.092229672 275 nips-2013-Reservoir Boosting : Between Online and Offline Ensemble Learning

16 0.086604275 271 nips-2013-Regularized M-estimators with nonconvexity: Statistical and algorithmic theory for local optima

17 0.086488806 82 nips-2013-Decision Jungles: Compact and Rich Models for Classification

18 0.085398801 171 nips-2013-Learning with Noisy Labels

19 0.084461704 200 nips-2013-Multi-Prediction Deep Boltzmann Machines

20 0.083246998 160 nips-2013-Learning Stochastic Feedforward Neural Networks


similar papers computed by lsi model

lsi for this paper:

topicId topicWeight

[(0, 0.254), (1, 0.049), (2, -0.038), (3, -0.037), (4, 0.098), (5, 0.064), (6, -0.085), (7, 0.021), (8, 0.064), (9, 0.025), (10, 0.006), (11, -0.001), (12, 0.031), (13, -0.088), (14, -0.039), (15, -0.004), (16, -0.108), (17, 0.022), (18, 0.012), (19, 0.047), (20, -0.087), (21, 0.088), (22, 0.069), (23, -0.012), (24, -0.012), (25, 0.01), (26, 0.098), (27, -0.051), (28, 0.099), (29, 0.009), (30, -0.04), (31, 0.18), (32, -0.076), (33, 0.054), (34, 0.087), (35, 0.055), (36, -0.13), (37, -0.061), (38, 0.066), (39, -0.006), (40, -0.133), (41, -0.052), (42, -0.103), (43, -0.012), (44, 0.099), (45, 0.047), (46, 0.014), (47, 0.048), (48, 0.172), (49, 0.099)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.95332116 318 nips-2013-Structured Learning via Logistic Regression

Author: Justin Domke

Abstract: A successful approach to structured learning is to write the learning objective as a joint function of linear parameters and inference messages, and iterate between updates to each. This paper observes that if the inference problem is “smoothed” through the addition of entropy terms, for fixed messages, the learning objective reduces to a traditional (non-structured) logistic regression problem with respect to parameters. In these logistic regression problems, each training example has a bias term determined by the current set of messages. Based on this insight, the structured energy function can be extended from linear factors to any function class where an “oracle” exists to minimize a logistic loss.

2 0.67006177 168 nips-2013-Learning to Pass Expectation Propagation Messages

Author: Nicolas Heess, Daniel Tarlow, John Winn

Abstract: Expectation Propagation (EP) is a popular approximate posterior inference algorithm that often provides a fast and accurate alternative to sampling-based methods. However, while the EP framework in theory allows for complex nonGaussian factors, there is still a significant practical barrier to using them within EP, because doing so requires the implementation of message update operators, which can be difficult and require hand-crafted approximations. In this work, we study the question of whether it is possible to automatically derive fast and accurate EP updates by learning a discriminative model (e.g., a neural network or random forest) to map EP message inputs to EP message outputs. We address the practical concerns that arise in the process, and we provide empirical analysis on several challenging and diverse factors, indicating that there is a space of factors where this approach appears promising. 1

3 0.62378728 275 nips-2013-Reservoir Boosting : Between Online and Offline Ensemble Learning

Author: Leonidas Lefakis, François Fleuret

Abstract: We propose to train an ensemble with the help of a reservoir in which the learning algorithm can store a limited number of samples. This novel approach lies in the area between offline and online ensemble approaches and can be seen either as a restriction of the former or an enhancement of the latter. We identify some basic strategies that can be used to populate this reservoir and present our main contribution, dubbed Greedy Edge Expectation Maximization (GEEM), that maintains the reservoir content in the case of Boosting by viewing the samples through their projections into the weak classifier response space. We propose an efficient algorithmic implementation which makes it tractable in practice, and demonstrate its efficiency experimentally on several compute-vision data-sets, on which it outperforms both online and offline methods in a memory constrained setting. 1

4 0.59968674 152 nips-2013-Learning Efficient Random Maximum A-Posteriori Predictors with Non-Decomposable Loss Functions

Author: Tamir Hazan, Subhransu Maji, Joseph Keshet, Tommi Jaakkola

Abstract: In this work we develop efficient methods for learning random MAP predictors for structured label problems. In particular, we construct posterior distributions over perturbations that can be adjusted via stochastic gradient methods. We show that any smooth posterior distribution would suffice to define a smooth PAC-Bayesian risk bound suitable for gradient methods. In addition, we relate the posterior distributions to computational properties of the MAP predictors. We suggest multiplicative posteriors to learn super-modular potential functions that accompany specialized MAP predictors such as graph-cuts. We also describe label-augmented posterior models that can use efficient MAP approximations, such as those arising from linear program relaxations. 1

5 0.59761155 211 nips-2013-Non-Linear Domain Adaptation with Boosting

Author: Carlos J. Becker, Christos M. Christoudias, Pascal Fua

Abstract: A common assumption in machine vision is that the training and test samples are drawn from the same distribution. However, there are many problems when this assumption is grossly violated, as in bio-medical applications where different acquisitions can generate drastic variations in the appearance of the data due to changing experimental conditions. This problem is accentuated with 3D data, for which annotation is very time-consuming, limiting the amount of data that can be labeled in new acquisitions for training. In this paper we present a multitask learning algorithm for domain adaptation based on boosting. Unlike previous approaches that learn task-specific decision boundaries, our method learns a single decision boundary in a shared feature space, common to all tasks. We use the boosting-trick to learn a non-linear mapping of the observations in each task, with no need for specific a-priori knowledge of its global analytical form. This yields a more parameter-free domain adaptation approach that successfully leverages learning on new tasks where labeled data is scarce. We evaluate our approach on two challenging bio-medical datasets and achieve a significant improvement over the state of the art. 1

6 0.58761173 184 nips-2013-Marginals-to-Models Reducibility

7 0.58678389 90 nips-2013-Direct 0-1 Loss Minimization and Margin Maximization with Boosting

8 0.58164811 340 nips-2013-Understanding variable importances in forests of randomized trees

9 0.57573462 189 nips-2013-Message Passing Inference with Chemical Reaction Networks

10 0.5646795 82 nips-2013-Decision Jungles: Compact and Rich Models for Classification

11 0.55643022 176 nips-2013-Linear decision rule as aspiration for simple decision heuristics

12 0.54366183 111 nips-2013-Estimation, Optimization, and Parallelism when Data is Sparse

13 0.54286498 153 nips-2013-Learning Feature Selection Dependencies in Multi-task Learning

14 0.53992897 138 nips-2013-Higher Order Priors for Joint Intrinsic Image, Objects, and Attributes Estimation

15 0.53932607 244 nips-2013-Parametric Task Learning

16 0.53108674 214 nips-2013-On Algorithms for Sparse Multi-factor NMF

17 0.51278597 8 nips-2013-A Graphical Transformation for Belief Propagation: Maximum Weight Matchings and Odd-Sized Cycles

18 0.50986415 150 nips-2013-Learning Adaptive Value of Information for Structured Prediction

19 0.50749725 181 nips-2013-Machine Teaching for Bayesian Learners in the Exponential Family

20 0.49463487 313 nips-2013-Stochastic Majorization-Minimization Algorithms for Large-Scale Optimization


similar papers computed by lda model

lda for this paper:

topicId topicWeight

[(2, 0.022), (16, 0.05), (33, 0.164), (34, 0.114), (41, 0.071), (46, 0.158), (49, 0.035), (56, 0.116), (70, 0.033), (85, 0.039), (89, 0.028), (93, 0.083), (95, 0.03)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.87284237 318 nips-2013-Structured Learning via Logistic Regression

Author: Justin Domke

Abstract: A successful approach to structured learning is to write the learning objective as a joint function of linear parameters and inference messages, and iterate between updates to each. This paper observes that if the inference problem is “smoothed” through the addition of entropy terms, for fixed messages, the learning objective reduces to a traditional (non-structured) logistic regression problem with respect to parameters. In these logistic regression problems, each training example has a bias term determined by the current set of messages. Based on this insight, the structured energy function can be extended from linear factors to any function class where an “oracle” exists to minimize a logistic loss.

2 0.83136541 201 nips-2013-Multi-Task Bayesian Optimization

Author: Kevin Swersky, Jasper Snoek, Ryan P. Adams

Abstract: Bayesian optimization has recently been proposed as a framework for automatically tuning the hyperparameters of machine learning models and has been shown to yield state-of-the-art performance with impressive ease and efficiency. In this paper, we explore whether it is possible to transfer the knowledge gained from previous optimizations to new tasks in order to find optimal hyperparameter settings more efficiently. Our approach is based on extending multi-task Gaussian processes to the framework of Bayesian optimization. We show that this method significantly speeds up the optimization process when compared to the standard single-task approach. We further propose a straightforward extension of our algorithm in order to jointly minimize the average error across multiple tasks and demonstrate how this can be used to greatly speed up k-fold cross-validation. Lastly, we propose an adaptation of a recently developed acquisition function, entropy search, to the cost-sensitive, multi-task setting. We demonstrate the utility of this new acquisition function by leveraging a small dataset to explore hyperparameter settings for a large dataset. Our algorithm dynamically chooses which dataset to query in order to yield the most information per unit cost. 1

3 0.83114415 99 nips-2013-Dropout Training as Adaptive Regularization

Author: Stefan Wager, Sida Wang, Percy Liang

Abstract: Dropout and other feature noising schemes control overfitting by artificially corrupting the training data. For generalized linear models, dropout performs a form of adaptive regularization. Using this viewpoint, we show that the dropout regularizer is first-order equivalent to an L2 regularizer applied after scaling the features by an estimate of the inverse diagonal Fisher information matrix. We also establish a connection to AdaGrad, an online learning algorithm, and find that a close relative of AdaGrad operates by repeatedly solving linear dropout-regularized problems. By casting dropout as regularization, we develop a natural semi-supervised algorithm that uses unlabeled data to create a better adaptive regularizer. We apply this idea to document classification tasks, and show that it consistently boosts the performance of dropout training, improving on state-of-the-art results on the IMDB reviews dataset. 1

4 0.83037591 45 nips-2013-BIG & QUIC: Sparse Inverse Covariance Estimation for a Million Variables

Author: Cho-Jui Hsieh, Matyas A. Sustik, Inderjit Dhillon, Pradeep Ravikumar, Russell Poldrack

Abstract: The 1 -regularized Gaussian maximum likelihood estimator (MLE) has been shown to have strong statistical guarantees in recovering a sparse inverse covariance matrix even under high-dimensional settings. However, it requires solving a difficult non-smooth log-determinant program with number of parameters scaling quadratically with the number of Gaussian variables. State-of-the-art methods thus do not scale to problems with more than 20, 000 variables. In this paper, we develop an algorithm B IG QUIC, which can solve 1 million dimensional 1 regularized Gaussian MLE problems (which would thus have 1000 billion parameters) using a single machine, with bounded memory. In order to do so, we carefully exploit the underlying structure of the problem. Our innovations include a novel block-coordinate descent method with the blocks chosen via a clustering scheme to minimize repeated computations; and allowing for inexact computation of specific components. In spite of these modifications, we are able to theoretically analyze our procedure and show that B IG QUIC can achieve super-linear or even quadratic convergence rates. 1

5 0.82990032 333 nips-2013-Trading Computation for Communication: Distributed Stochastic Dual Coordinate Ascent

Author: Tianbao Yang

Abstract: We present and study a distributed optimization algorithm by employing a stochastic dual coordinate ascent method. Stochastic dual coordinate ascent methods enjoy strong theoretical guarantees and often have better performances than stochastic gradient descent methods in optimizing regularized loss minimization problems. It still lacks of efforts in studying them in a distributed framework. We make a progress along the line by presenting a distributed stochastic dual coordinate ascent algorithm in a star network, with an analysis of the tradeoff between computation and communication. We verify our analysis by experiments on real data sets. Moreover, we compare the proposed algorithm with distributed stochastic gradient descent methods and distributed alternating direction methods of multipliers for optimizing SVMs in the same distributed framework, and observe competitive performances. 1

6 0.82634383 2 nips-2013-(Nearly) Optimal Algorithms for Private Online Learning in Full-information and Bandit Settings

7 0.82303739 11 nips-2013-A New Convex Relaxation for Tensor Completion

8 0.82299286 180 nips-2013-Low-rank matrix reconstruction and clustering via approximate message passing

9 0.82278377 5 nips-2013-A Deep Architecture for Matching Short Texts

10 0.82165086 22 nips-2013-Action is in the Eye of the Beholder: Eye-gaze Driven Model for Spatio-Temporal Action Localization

11 0.8211655 251 nips-2013-Predicting Parameters in Deep Learning

12 0.81942749 153 nips-2013-Learning Feature Selection Dependencies in Multi-task Learning

13 0.81907451 301 nips-2013-Sparse Additive Text Models with Low Rank Background

14 0.81888914 350 nips-2013-Wavelets on Graphs via Deep Learning

15 0.8178817 69 nips-2013-Context-sensitive active sensing in humans

16 0.81769317 114 nips-2013-Extracting regions of interest from biological images with convolutional sparse block coding

17 0.81636339 239 nips-2013-Optimistic policy iteration and natural actor-critic: A unifying view and a non-optimality result

18 0.81628805 321 nips-2013-Supervised Sparse Analysis and Synthesis Operators

19 0.81551737 149 nips-2013-Latent Structured Active Learning

20 0.81443608 97 nips-2013-Distributed Submodular Maximization: Identifying Representative Elements in Massive Data