nips nips2009 nips2009-189 knowledge-graph by maker-knowledge-mining

189 nips-2009-Periodic Step Size Adaptation for Single Pass On-line Learning


Source: pdf

Author: Chun-nan Hsu, Yu-ming Chang, Hanshen Huang, Yuh-jye Lee

Abstract: It has been established that the second-order stochastic gradient descent (2SGD) method can potentially achieve generalization performance as well as empirical optimum in a single pass (i.e., epoch) through the training examples. However, 2SGD requires computing the inverse of the Hessian matrix of the loss function, which is prohibitively expensive. This paper presents Periodic Step-size Adaptation (PSA), which approximates the Jacobian matrix of the mapping function and explores a linear relation between the Jacobian and Hessian to approximate the Hessian periodically and achieve near-optimal results in experiments on a wide variety of models and tasks. 1

Reference: text


Summary: the most important sentences generated by the tfidf model

sentIndex sentText sentNum sentScore

1 It has been established that the second-order stochastic gradient descent (2SGD) method can potentially achieve generalization performance as well as empirical optimum in a single pass (i.e., epoch). [sent-2, score-0.112]

2 However, 2SGD requires computing the inverse of the Hessian matrix of the loss function, which is prohibitively expensive. [sent-5, score-0.037]

3 Early works concentrate on minimizing the required number of model corrections made by the algorithm through a single pass of training examples. [sent-8, score-0.096]

4 More recently, on-line learning has been considered a solution to large-scale learning, mainly because of its fast convergence. [sent-9, score-0.023]

5 They usually still require several passes (or epochs) through the training examples to converge to a satisfactory model. [sent-11, score-0.086]

6 Reading a large data set from disk into memory usually takes much longer than the CPU time spent on learning. [sent-13, score-0.048]

7 That is, after processing all available training examples once, the learned model should generalize as well as possible so that used training examples can be removed from memory to minimize disk I/O time. [sent-15, score-0.104]

8 In natural learning, single-pass learning is also interesting because it allows for continual learning from unlimited training examples under the constraint of limited storage, resembling a natural learner. [sent-16, score-0.05]

9 Previously, many authors, including [3] and [4], have established that given a sufficiently large set of training examples, 2SGD can potentially achieve generalization performance as well as empirical optimum in a single pass through the training examples. [sent-17, score-0.16]

10 However, 2SGD requires computing the inverse of the Hessian matrix of the loss function, which is prohibitively expensive. [sent-18, score-0.037]

11 Many attempts to approximate the Hessian have been made. [sent-19, score-0.016]

12 But in online settings, we only have the surface of the loss function for one training example, as opposed to all training examples in batch settings. [sent-22, score-0.111]

13 The search direction obtained by line search on such a surface rarely leads to the empirical optimum. [sent-23, score-0.018]

14 A review of similar attempts can be found in Bottou’s tutorial [6], where he suggested that none is actually sufficient to achieve theoretical single-pass performance in practice. [sent-24, score-0.03]

15 PSA approximates the Jacobian matrix of the mapping function and explores a linear relation between the Jacobian and Hessian to approximate the Hessian periodically. [sent-26, score-0.069]

16 We analyze the accuracy of the approximation and derive the asymptotic rate of convergence for PSA. [sent-28, score-0.038]

17 Experimental results show that for a wide variety of models and tasks, PSA is always very close to the empirical optimum in a single pass. [sent-29, score-0.029]

18 A machine learning problem can be formulated as a fixed-point iteration that solves the equation w = ℳ(w), where ℳ is a mapping ℳ : ℝ^d → ℝ^d. [sent-35, score-0.037]

19 Then we can apply Aitken’s acceleration, which attempts to extrapolate to the local optimum in one step, to accelerate the convergence of the mapping: w∗ = w(t) + (I − J)−1(ℳ(w(t)) − w(t)), (1) [sent-41, score-0.087]

20 where J := ℳ′(w∗) is the Jacobian of the mapping ℳ at w∗. [sent-44, score-0.037]
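As an illustration of equation (1), below is a minimal numerical sketch of Aitken-style extrapolation for a generic fixed-point mapping. The function name aitken_extrapolate, the component-wise ratio estimate of eig(J), and the clipping are illustrative choices for this sketch, not the paper's estimator (2).

    import numpy as np

    def aitken_extrapolate(M, w, eps=1e-8):
        # One Aitken-style extrapolation step for a fixed-point mapping M:
        # approximate w* = w + (I - J)^{-1} (M(w) - w) component-wise, using the
        # ratio of successive displacements as a cheap estimate of eig(J).
        w1 = M(w)
        w2 = M(w1)
        d1, d2 = w1 - w, w2 - w1
        denom = np.where(np.abs(d1) < eps, eps, d1)
        gamma = np.clip(d2 / denom, -0.99, 0.99)   # per-component estimate of eig(J)
        return w2 + gamma / (1.0 - gamma) * d2     # extrapolate toward the fixed point

    # Example: M(w) = 0.5*w + 1 has fixed point [2, 2, 2]; one extrapolation recovers it.
    M = lambda w: 0.5 * w + np.ones_like(w)
    print(aitken_extrapolate(M, np.zeros(3)))      # [2. 2. 2.]

For a linear mapping with a diagonal Jacobian the extrapolation is exact, which is the intuition behind the fast local convergence noted in sentence 28 below.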

21 When λ_i := eig(J) ∈ (−1, 1), the mapping ℳ is guaranteed to converge. [sent-47, score-0.037]

22 It is usually difficult to compute J for even a simple machine learning model. [sent-51, score-0.015]

23 (3) In practice, Aitken’s acceleration alternates a step for preparing … [sent-80, score-0.065]

24 A benefit of the above approximation is that the cost of performing an extrapolation is O(d). [sent-87, score-0.037]

25 3 Periodic Step-Size Adaptation When ℳ is a gradient descent update rule, that is, ℳ(w) = w − η g(w; D), [sent-90, score-0.056]

26 where η is a scalar step size, D is the entire set of training examples, and g(w; D) is the gradient of a loss function to be minimized, Aitken’s acceleration is equivalent to Newton’s method, because J = ℳ′(w) = I − η g′(w; D) = I − ηH, [sent-93, score-0.144]

27 where H := g′(w; D) is the Hessian matrix of the loss function, and the extrapolation given in (1) becomes w = w + (I − J)−1(ℳ(w) − w) = w − H−1 g(w; D). [sent-98, score-0.059]

28 In this case, Aitken’s acceleration enjoys the same local quadratic convergence as Newton’s method. [sent-101, score-0.071]
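Spelling out the identity in sentences 26–27 as a one-line check (only quantities already defined there are used):

    J = \mathcal{M}'(w) = I - \eta H \;\Rightarrow\; (I - J)^{-1} = \tfrac{1}{\eta} H^{-1},
    \qquad \mathcal{M}(w) - w = -\eta\, g(w; D),
    \quad\text{so}\quad
    w^* = w + (I - J)^{-1}\bigl(\mathcal{M}(w) - w\bigr)
        = w - \tfrac{1}{\eta} H^{-1}\,\eta\, g(w; D)
        = w - H^{-1} g(w; D),

i.e., exactly a Newton step on the loss.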

29 This can also be extended to an SGD update rule: w(t+1) = w(t) − η g(w(t); B(t)), where B(t) is a small batch. [sent-102, score-0.034]

30 A genuine on-line learner usually has |B| = 1. [sent-109, score-0.015]

31 γ is an estimated eigenvalue of J as given in (2); when H is a symmetric matrix, its eigenvalue is given by eig(H) = (1 − eig(J))/η, because eig(J) = 1 − η · eig(H). [sent-118, score-0.106]

32 Therefore, we can update the step size component-wise by (5). [sent-123, score-0.051]
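The eigenvalue relation in sentence 31 follows directly from J = I − ηH. Since the exact component-wise rule (5) is not recoverable from this extraction, the Newton-like rescaling below is only an illustration of the idea, not the paper's formula:

    \operatorname{eig}(J) = 1 - \eta\,\operatorname{eig}(H)
    \quad\Longleftrightarrow\quad
    \operatorname{eig}(H) = \frac{1 - \operatorname{eig}(J)}{\eta},
    \qquad\text{suggesting the Newton-like choice}\qquad
    \eta_i \;\leftarrow\; \frac{1}{\operatorname{eig}(H)_i} \;=\; \frac{\eta_i}{1 - \gamma_i},

where γ_i is the estimated i-th eigenvalue of J; the periodic rule (8) additionally applies a discount factor v with α ≤ 1 so that the step sizes keep decreasing (sentences 42–43).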

33 Since the mapping ℳ in SGD involves the gradient g(w(t); B(t)), … [sent-140, score-0.059]

34 It is unlikely that we can obtain a reliable eigenvalue estimation at each single iteration. [sent-144, score-0.053]

35 To increase the stationarity of the mapping, we take advantage of the law of large numbers and aggregate consecutive SGD mappings into a new, aggregated mapping, [sent-145, score-0.082]

36 which reduces the variance of the gradient estimation by the reciprocal of the number of aggregated mappings, compared to the plain SGD mapping ℳ. [sent-155, score-0.13]
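The variance-reduction claim in sentence 36 is the standard i.i.d. argument; here m denotes the number of aggregated SGD mappings (the actual symbol is lost in this extraction):

    \operatorname{Var}\!\left[\frac{1}{m}\sum_{j=1}^{m} g\bigl(w; B^{(j)}\bigr)\right]
    = \frac{1}{m}\,\operatorname{Var}\bigl[g(w; B)\bigr]
    \qquad\text{for i.i.d. batches } B^{(j)} \text{ (holding } w \text{ fixed)},

which is why the aggregate mapping is a more stationary target for eigenvalue estimation than a single stochastic step.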

37 We can proceed to estimate the eigenvalues of the aggregated mapping. [sent-168, score-0.016]

38 (6) We note that our aggregate mapping differs from … [sent-197, score-0.062]

39 Their difference is similar to that between batch and stochastic gradient descent. [sent-202, score-0.058]

40 chances to adjust its search direction, while mappings that use … [sent-205, score-0.02]

41 With the estimated eigenvalues, we can present the complete update rule to adjust the step size vector η. [sent-208, score-0.051]

42 … (8), where v is a discount factor with components defined by … [sent-243, score-0.015]

43 α ≤ 1 ensures that the step size is decreasing and approaches zero, so that SGD is guaranteed to converge [7]. [sent-315, score-0.017]

44 In a nutshell, PSA applies SGD with a fixed step size and periodically updates the step size by approximating the Jacobian of the aggregated mapping. [sent-317, score-0.054]

45 … O(d), because the cost of the eigenvalue estimation given in (6) is 2d. [sent-320, score-0.053]

46 κ ⊳ Equation (10) 3: repeat 4: Choose a small batch B(t) [sent-352, score-0.036]

47 uniformly at random from the set of training examples D 5: update … [sent-353, score-0.084]
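Putting sentences 29, 35, 44, and the algorithm fragment above together, here is a minimal single-pass sketch of the periodic structure PSA describes: plain SGD with a fixed per-component step size in between, and an occasional step-size update driven by estimated eigenvalues of the aggregate mapping. The checkpoint-ratio eigenvalue estimate and the alpha discount below are illustrative stand-ins for the paper's equations (6) and (8), which are not recoverable from this summary; psa_sketch, m, and kappa are hypothetical names.

    import numpy as np

    def psa_sketch(grad, w, data, eta0=0.1, m=16, kappa=4, alpha=0.99, seed=0):
        # grad(w, example) -> gradient of the loss on one training example (|B| = 1).
        # Between step-size updates this is plain SGD with fixed per-component step sizes.
        rng = np.random.default_rng(seed)
        eta = np.full_like(w, eta0)                   # per-component step sizes
        checkpoints = [w.copy()]
        for t, idx in enumerate(rng.permutation(len(data))):   # a single pass over D
            w = w - eta * grad(w, data[idx])
            if (t + 1) % m == 0:                      # end of one aggregate mapping
                checkpoints.append(w.copy())
            if len(checkpoints) == 2 * kappa + 1:     # enough aggregated mappings observed
                d1 = checkpoints[kappa] - checkpoints[0]
                d2 = checkpoints[-1] - checkpoints[kappa]
                denom = np.where(np.abs(d1) < 1e-12, 1e-12, d1)
                gamma = np.clip(d2 / denom, -0.9, 0.9)   # per-component eig(J) estimate
                eta = alpha * eta / (1.0 - gamma)        # Newton-like rescale plus discount
                checkpoints = [w.copy()]
        return w

A real implementation would also bound the rescaling so step sizes cannot grow without limit, which is the role the discount factor v and α ≤ 1 play in (8) (sentence 43).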

48 Let λ_i be the largest eigenvalue of J such that … [sent-483, score-0.053]

49 This happens when there are high percentages of missing data for a Bayesian network model trained by EM [8] and when features are uncorrelated for training a conditional random field model [9]. [sent-576, score-0.068]

50 The rate of convergence is governed by the largest eigenvalue of S(t). [sent-620, score-0.076]

51 The asymptotic rate of convergence of PSA is bounded by … [sent-625, score-0.023]

52 □ Though this analysis suggests that for rapid convergence to … [sent-725, score-0.023]

53 When the training set size |D| ≍ 2000, set … [sent-740, score-0.035]

54 This setting implies that the step size will be adjusted per |D|/1000 examples. [sent-744, score-0.017]

55 Similarly, we can obtain that the factors for the other two settings are 0. [sent-788, score-0.022]

56 5 Experimental Results Table 1 shows the tasks chosen for our comparison. [sent-791, score-0.021]

57 The tasks for CRF have been used in competitions and the performance was measured by F-score. [sent-792, score-0.021]

58 Target provides the empirically optimal performance achieved by batch learners. [sent-794, score-0.036]

59 If PSA accurately approximates 2SGD, then its single-pass performance should be very close to Target. [sent-795, score-0.017]

60 Conditional Random Field We compared PSA with plain SGD and SMD [1] to evaluate PSA’s performance for training conditional random fields (CRF). [sent-806, score-0.14]

61 We implemented PSA by replacing the L-BFGS optimizer in CRF++ [11]. [sent-807, score-0.02]

62 Finally, we ran the original CRF++ with default settings to obtain the performance results of L-BFGS. [sent-811, score-0.045]

63 We simply used the original parameter settings for SGD and SMD as given in the literature. [sent-812, score-0.022]

64 All of the experiments reported here for CRF were run on an Intel Q6600 Fedora 8 i686 PC with 4G RAM. [sent-828, score-0.023]

65 Table 2 compares SGD variants in terms of the execution time and F-scores achieved after processing the training examples for a single pass. [sent-829, score-0.064]

66 Since the loss function in CRF training is convex, the convergence results of L-BFGS can be considered as the empirical minimum. [sent-830, score-0.08]

67 Though plain SGD is, as expected, the fastest, it is remarkable that PSA is faster than SMD for all tasks. [sent-834, score-0.071]

68 But PSA is still faster than SMD, partly because PSA can take advantage of the same sparsity trick as plain SGD [15]. [sent-836, score-0.133]
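Reference [15] is not reproduced in this summary; the "sparsity trick" used in SvmSgd-style solvers is commonly implemented by storing the weights as w = scale · v, so that the L2 shrinkage w ← (1 − ηλ)w costs O(1) and the gradient step touches only the features present in the current example. A minimal sketch under those assumptions (hinge loss, sparse examples as (index, value) pairs; the class and parameter names are illustrative):

    import numpy as np

    class ScaledSparseSGD:
        # L2-regularized hinge-loss SGD that stores w = scale * v, so each step
        # costs O(nnz(x)) rather than O(d).
        def __init__(self, dim, lam=1e-4, eta=0.1):
            self.v = np.zeros(dim)
            self.scale = 1.0
            self.lam, self.eta = lam, eta

        def margin(self, x):                 # x: list of (feature_index, value) pairs
            return self.scale * sum(self.v[i] * xi for i, xi in x)

        def step(self, x, y):                # y in {-1, +1}
            self.scale *= (1.0 - self.eta * self.lam)   # O(1) shrinkage of every weight
            if y * self.margin(x) < 1.0:                # hinge loss is active
                for i, xi in x:                         # touch only the nonzero features
                    self.v[i] += self.eta * y * xi / self.scale

For dense data the extra bookkeeping brings no savings, which is consistent with the later remark (sentence 73) that the trick "otherwise may slow down."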

69 2 Linear SVM We also evaluated PSA’s single-pass performance for training linear SVM. [sent-838, score-0.035]

70 It is straightforward to apply PSA as a primal optimizer for linear SVM. [sent-839, score-0.036]

71 Table 2: CPU time in seconds and F-scores achieved after a single pass of CRF training. [sent-883, score-0.094]

72 We selected L2-regularized logistic regression as the loss function for PSA and Liblinear because it is twice differentiable. [sent-895, score-0.022]

73 PSA (1) is faster than SvmSgd (1) for SVM because SvmSgd uses the sparsity trick [15], which speeds up training for sparse data, but otherwise may slow down. [sent-906, score-0.097]

74 We implemented PSA with the sparsity trick for CRF only but not for SVM and CNN. [sent-910, score-0.062]

75 Method (pass): Liblinear (converge), Liblinear (1), SvmSgd (20), SvmSgd (10), SvmSgd (1), PSA (1); columns: LS and FD accuracy and time. [sent-911, score-0.029]

76 Table 3: Test accuracy rates and elapsed CPU time in seconds by various linear SVM solvers. [sent-933, score-0.048]

77 The parameter settings for PSA are basically the same as those for CRF but with a large period κ. [sent-934, score-0.022]

78 3 Convolutional Neural Network Approximating the Hessian is particularly challenging when the loss function is non-convex. [sent-946, score-0.022]

79 We tested PSA in such a setting by applying it to train a large convolutional neural network for the original 10-class MNIST task (see Table 1). [sent-947, score-0.032]

80 The differences include that the sub-sampling layers in LeNet-S pick only the upper-left value from a 2 × 2 area and abandon the other three. [sent-951, score-0.025]

81 We also adapted a trick given in [19], which advises that step sizes in the lower layers should be larger than those in the higher layers. [sent-973, score-0.104]

82 Following their trick, the initial step sizes for the first and the third layers were 5 and 2. [sent-974, score-0.042]

83 The experiments were run on an Intel Q6600 Fedora 8 i686 PC with 4G RAM. [sent-976, score-0.023]

84 To obtain the empirically optimal error rate of our LeNet-S model, we ran plain SGD for sufficiently many passes and obtained 0. [sent-978, score-0.115]

85 Single-pass performance of PSA with the layer trick is within one percentage point of the target. [sent-981, score-0.09]

86 Starting from an initial weight closer to the optimum helped improve PSA’s performance further. [sent-982, score-0.029]

87 We ran SGD for 100 passes on 10K randomly selected training examples, then re-started training with PSA using the remaining 50K training examples for a single pass. [sent-983, score-0.179]

88 Though PSA did achieve a better error rate, this is infeasible because it took 4492 seconds to run SGD for 100 passes. [sent-984, score-0.019]

89 Methods in Table 4: PSA w/o layer trick (1), PSA w/ layer trick (1), PSA re-start (1); columns: time and error. [sent-993, score-0.194]

90 90 Table 4: CPU time in seconds and percentage test error rates for various neural network trainers. [sent-999, score-0.047]

91 6 Conclusions It has been shown that given a sufficiently large training set, a single pass of 2SGD generalizes as well as the empirical optimum. [sent-1000, score-0.096]

92 Our results show that PSA provides a practical way to achieve the near-optimal performance of 2SGD predicted by theory, for a variety of large-scale models and tasks, at a reasonably low cost per iteration compared to competing 2SGD methods. [sent-1001, score-0.021]

93 The benefit of 2SGD with PSA over plain SGD becomes clearer as the scale of the tasks grows. [sent-1002, score-0.092]

94 For non-convex neural network tasks, the curvature of the error surface is so complex that they remain very challenging for an eigenvalue-approximation method like PSA. [sent-1003, score-0.085]

95 Accelerated training of conditional random fields with stochastic gradient methods. [sent-1017, score-0.091]

96 Exponentiated gradient algorithms for conditional random fields and max-margin Markov networks. [sent-1021, score-0.056]

97 Global and componentwise extrapolation for accelerating data mining from large incomplete data sets with the EM algorithm. [sent-1043, score-0.108]

98 Global and componentwise extrapolations for accelerating training of Bayesian networks and conditional random fields. [sent-1046, score-0.107]

99 Biomedical named entity recognition using conditional random fields and novel feature sets. [sent-1057, score-0.034]

100 Periodic step-size adaptation in second-order gradient descent for single-pass on-line structured learning. [sent-1097, score-0.052]


similar papers computed by tfidf model

tfidf for this paper:

wordName wordTfidf (topN-words)

[('psa', 0.843), ('sgd', 0.274), ('eig', 0.2), ('crf', 0.154), ('smd', 0.149), ('svmsgd', 0.1), ('aitken', 0.071), ('plain', 0.071), ('trick', 0.062), ('jacobian', 0.061), ('pass', 0.061), ('biocreative', 0.057), ('hessian', 0.056), ('eigenvalue', 0.053), ('liblinear', 0.05), ('fd', 0.048), ('acceleration', 0.048), ('periodic', 0.046), ('yann', 0.046), ('cpu', 0.045), ('ocr', 0.043), ('basenp', 0.043), ('tonga', 0.043), ('bfgs', 0.041), ('taiwan', 0.038), ('mapping', 0.037), ('ls', 0.037), ('extrapolation', 0.037), ('batch', 0.036), ('training', 0.035), ('update', 0.034), ('componentwise', 0.034), ('chunking', 0.034), ('chang', 0.033), ('adaptation', 0.03), ('bottou', 0.03), ('sec', 0.029), ('optimum', 0.029), ('fedora', 0.029), ('lgpl', 0.029), ('lsvm', 0.029), ('taipei', 0.029), ('layer', 0.028), ('hsu', 0.026), ('layers', 0.025), ('aggregate', 0.025), ('orr', 0.025), ('huang', 0.025), ('lecun', 0.024), ('ran', 0.023), ('convergence', 0.023), ('intel', 0.023), ('svm', 0.023), ('gradient', 0.022), ('settings', 0.022), ('loss', 0.022), ('tasks', 0.021), ('passes', 0.021), ('optimizer', 0.02), ('periodically', 0.02), ('mod', 0.02), ('url', 0.02), ('mappings', 0.02), ('mnist', 0.019), ('pegasos', 0.019), ('yoshua', 0.019), ('disk', 0.019), ('conditional', 0.019), ('seconds', 0.019), ('accelerating', 0.019), ('extrapolate', 0.019), ('surface', 0.018), ('convolutional', 0.018), ('mining', 0.018), ('kong', 0.017), ('sgn', 0.017), ('approximates', 0.017), ('step', 0.017), ('usa', 0.017), ('hong', 0.016), ('http', 0.016), ('attempts', 0.016), ('score', 0.016), ('primal', 0.016), ('eigenvalues', 0.016), ('table', 0.015), ('np', 0.015), ('elds', 0.015), ('examples', 0.015), ('accuracy', 0.015), ('pc', 0.015), ('prohibitively', 0.015), ('december', 0.015), ('explores', 0.015), ('discount', 0.015), ('newton', 0.015), ('usually', 0.015), ('time', 0.014), ('network', 0.014), ('tutorial', 0.014)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.99999994 189 nips-2009-Periodic Step Size Adaptation for Single Pass On-line Learning

Author: Chun-nan Hsu, Yu-ming Chang, Hanshen Huang, Yuh-jye Lee

Abstract: It has been established that the second-order stochastic gradient descent (2SGD) method can potentially achieve generalization performance as well as empirical optimum in a single pass (i.e., epoch) through the training examples. However, 2SGD requires computing the inverse of the Hessian matrix of the loss function, which is prohibitively expensive. This paper presents Periodic Step-size Adaptation (PSA), which approximates the Jacobian matrix of the mapping function and explores a linear relation between the Jacobian and Hessian to approximate the Hessian periodically and achieve near-optimal results in experiments on a wide variety of models and tasks. 1

2 0.092381746 73 nips-2009-Dual Averaging Method for Regularized Stochastic Learning and Online Optimization

Author: Lin Xiao

Abstract: We consider regularized stochastic learning and online optimization problems, where the objective function is the sum of two convex terms: one is the loss function of the learning task, and the other is a simple regularization term such as the ℓ1-norm for promoting sparsity. We develop a new online algorithm, the regularized dual averaging (RDA) method, that can explicitly exploit the regularization structure in an online setting. In particular, at each iteration, the learning variables are adjusted by solving a simple optimization problem that involves the running average of all past subgradients of the loss functions and the whole regularization term, not just its subgradient. Computational experiments show that the RDA method can be very effective for sparse online learning with ℓ1-regularization. 1

3 0.053935621 56 nips-2009-Conditional Neural Fields

Author: Jian Peng, Liefeng Bo, Jinbo Xu

Abstract: Conditional random fields (CRF) are widely used for sequence labeling such as natural language processing and biological sequence analysis. Most CRF models use a linear potential function to represent the relationship between input features and output. However, in many real-world applications such as protein structure prediction and handwriting recognition, the relationship between input features and output is highly complex and nonlinear, which cannot be accurately modeled by a linear function. To model the nonlinear relationship between input and output we propose a new conditional probabilistic graphical model, Conditional Neural Fields (CNF), for sequence labeling. CNF extends CRF by adding one (or possibly more) middle layer between input and output. The middle layer consists of a number of gate functions, each acting as a local neuron or feature extractor to capture the nonlinear relationship between input and output. Therefore, conceptually CNF is much more expressive than CRF. Experiments on two widely-used benchmarks indicate that CNF performs significantly better than a number of popular methods. In particular, CNF is the best among approximately 10 machine learning methods for protein secondary structure prediction and also among a few of the best methods for handwriting recognition.

4 0.042960353 214 nips-2009-Semi-supervised Regression using Hessian energy with an application to semi-supervised dimensionality reduction

Author: Kwang I. Kim, Florian Steinke, Matthias Hein

Abstract: Semi-supervised regression based on the graph Laplacian suffers from the fact that the solution is biased towards a constant and the lack of extrapolating power. Based on these observations, we propose to use the second-order Hessian energy for semi-supervised regression which overcomes both these problems. If the data lies on or close to a low-dimensional submanifold in feature space, the Hessian energy prefers functions whose values vary linearly with respect to geodesic distance. We first derive the Hessian energy for smooth manifolds and continue to give a stable estimation procedure for the common case where only samples of the underlying manifold are given. The preference of ‘’linear” functions on manifolds renders the Hessian energy particularly suited for the task of semi-supervised dimensionality reduction, where the goal is to find a user-defined embedding function given some labeled points which varies smoothly (and ideally linearly) along the manifold. The experimental results suggest superior performance of our method compared with semi-supervised regression using Laplacian regularization or standard supervised regression techniques applied to this task. 1

5 0.040404305 72 nips-2009-Distribution Matching for Transduction

Author: Novi Quadrianto, James Petterson, Alex J. Smola

Abstract: Many transductive inference algorithms assume that distributions over training and test estimates should be related, e.g. by providing a large margin of separation on both sets. We use this idea to design a transduction algorithm which can be used without modification for classification, regression, and structured estimation. At its heart we exploit the fact that for a good learner the distributions over the outputs on training and test sets should match. This is a classical two-sample problem which can be solved efficiently in its most general form by using distance measures in Hilbert Space. It turns out that a number of existing heuristics can be viewed as special cases of our approach. 1

6 0.040216215 57 nips-2009-Conditional Random Fields with High-Order Features for Sequence Labeling

7 0.033106182 75 nips-2009-Efficient Large-Scale Distributed Training of Conditional Maximum Entropy Models

8 0.033017837 220 nips-2009-Slow Learners are Fast

9 0.031091798 84 nips-2009-Evaluating multi-class learning strategies in a generative hierarchical framework for object detection

10 0.030747607 77 nips-2009-Efficient Match Kernel between Sets of Features for Visual Recognition

11 0.025683567 2 nips-2009-3D Object Recognition with Deep Belief Nets

12 0.025215304 160 nips-2009-Multiple Incremental Decremental Learning of Support Vector Machines

13 0.024522595 219 nips-2009-Slow, Decorrelated Features for Pretraining Complex Cell-like Networks

14 0.02388645 250 nips-2009-Training Factor Graphs with Reinforcement Learning for Efficient MAP Inference

15 0.023342595 15 nips-2009-A Rate Distortion Approach for Semi-Supervised Conditional Random Fields

16 0.022647502 128 nips-2009-Learning Non-Linear Combinations of Kernels

17 0.022546524 138 nips-2009-Learning with Compressible Priors

18 0.022012332 151 nips-2009-Measuring Invariances in Deep Networks

19 0.021302046 87 nips-2009-Exponential Family Graph Matching and Ranking

20 0.021259176 76 nips-2009-Efficient Learning using Forward-Backward Splitting


similar papers computed by lsi model

lsi for this paper:

topicId topicWeight

[(0, -0.078), (1, 0.021), (2, -0.002), (3, 0.001), (4, 0.001), (5, -0.009), (6, -0.008), (7, 0.034), (8, -0.012), (9, 0.001), (10, 0.011), (11, 0.021), (12, -0.043), (13, 0.002), (14, -0.001), (15, 0.026), (16, 0.021), (17, -0.047), (18, -0.013), (19, -0.035), (20, -0.005), (21, -0.032), (22, 0.012), (23, 0.028), (24, -0.006), (25, 0.017), (26, 0.025), (27, 0.004), (28, 0.027), (29, 0.051), (30, -0.037), (31, -0.034), (32, -0.002), (33, 0.024), (34, -0.006), (35, 0.001), (36, -0.029), (37, -0.042), (38, 0.04), (39, 0.011), (40, -0.034), (41, -0.073), (42, 0.02), (43, 0.022), (44, -0.05), (45, 0.14), (46, -0.025), (47, 0.021), (48, -0.046), (49, 0.032)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.86840719 189 nips-2009-Periodic Step Size Adaptation for Single Pass On-line Learning

Author: Chun-nan Hsu, Yu-ming Chang, Hanshen Huang, Yuh-jye Lee

Abstract: It has been established that the second-order stochastic gradient descent (2SGD) method can potentially achieve generalization performance as well as empirical optimum in a single pass (i.e., epoch) through the training examples. However, 2SGD requires computing the inverse of the Hessian matrix of the loss function, which is prohibitively expensive. This paper presents Periodic Step-size Adaptation (PSA), which approximates the Jacobian matrix of the mapping function and explores a linear relation between the Jacobian and Hessian to approximate the Hessian periodically and achieve near-optimal results in experiments on a wide variety of models and tasks. 1

2 0.64662904 73 nips-2009-Dual Averaging Method for Regularized Stochastic Learning and Online Optimization

Author: Lin Xiao

Abstract: We consider regularized stochastic learning and online optimization problems, where the objective function is the sum of two convex terms: one is the loss function of the learning task, and the other is a simple regularization term such as the ℓ1-norm for promoting sparsity. We develop a new online algorithm, the regularized dual averaging (RDA) method, that can explicitly exploit the regularization structure in an online setting. In particular, at each iteration, the learning variables are adjusted by solving a simple optimization problem that involves the running average of all past subgradients of the loss functions and the whole regularization term, not just its subgradient. Computational experiments show that the RDA method can be very effective for sparse online learning with ℓ1-regularization. 1

3 0.63047409 75 nips-2009-Efficient Large-Scale Distributed Training of Conditional Maximum Entropy Models

Author: Ryan Mcdonald, Mehryar Mohri, Nathan Silberman, Dan Walker, Gideon S. Mann

Abstract: Training conditional maximum entropy models on massive data sets requires significant computational resources. We examine three common distributed training methods for conditional maxent: a distributed gradient computation method, a majority vote method, and a mixture weight method. We analyze and compare the CPU and network time complexity of each of these methods and present a theoretical analysis of conditional maxent models, including a study of the convergence of the mixture weight method, the most resource-efficient technique. We also report the results of large-scale experiments comparing these three methods which demonstrate the benefits of the mixture weight method: this method consumes less resources, while achieving a performance comparable to that of standard approaches.

4 0.58043635 76 nips-2009-Efficient Learning using Forward-Backward Splitting

Author: Yoram Singer, John C. Duchi

Abstract: We describe, analyze, and experiment with a new framework for empirical loss minimization with regularization. Our algorithmic framework alternates between two phases. On each iteration we first perform an unconstrained gradient descent step. We then cast and solve an instantaneous optimization problem that trades off minimization of a regularization term while keeping close proximity to the result of the first phase. This yields a simple yet effective algorithm for both batch penalized risk minimization and online learning. Furthermore, the two phase approach enables sparse solutions when used in conjunction with regularization functions that promote sparsity, such as ℓ1. We derive concrete and very simple algorithms for minimization of loss functions with ℓ1, ℓ2, ℓ2², and ℓ∞ regularization. We also show how to construct efficient algorithms for mixed-norm ℓ1/ℓq regularization. We further extend the algorithms and give efficient implementations for very high-dimensional data with sparsity. We demonstrate the potential of the proposed framework in experiments with synthetic and natural datasets. 1

5 0.51766372 72 nips-2009-Distribution Matching for Transduction

Author: Novi Quadrianto, James Petterson, Alex J. Smola

Abstract: Many transductive inference algorithms assume that distributions over training and test estimates should be related, e.g. by providing a large margin of separation on both sets. We use this idea to design a transduction algorithm which can be used without modification for classification, regression, and structured estimation. At its heart we exploit the fact that for a good learner the distributions over the outputs on training and test sets should match. This is a classical two-sample problem which can be solved efficiently in its most general form by using distance measures in Hilbert Space. It turns out that a number of existing heuristics can be viewed as special cases of our approach. 1

6 0.50203133 56 nips-2009-Conditional Neural Fields

7 0.4917022 192 nips-2009-Posterior vs Parameter Sparsity in Latent Variable Models

8 0.4745571 210 nips-2009-STDP enables spiking neurons to detect hidden causes of their inputs

9 0.45388275 15 nips-2009-A Rate Distortion Approach for Semi-Supervised Conditional Random Fields

10 0.44191405 220 nips-2009-Slow Learners are Fast

11 0.43991855 160 nips-2009-Multiple Incremental Decremental Learning of Support Vector Machines

12 0.4366056 106 nips-2009-Heavy-Tailed Symmetric Stochastic Neighbor Embedding

13 0.43423927 219 nips-2009-Slow, Decorrelated Features for Pretraining Complex Cell-like Networks

14 0.43157807 33 nips-2009-Analysis of SVM with Indefinite Kernels

15 0.41570535 57 nips-2009-Conditional Random Fields with High-Order Features for Sequence Labeling

16 0.41471609 97 nips-2009-Free energy score space

17 0.38086334 236 nips-2009-Structured output regression for detection with partial truncation

18 0.37306991 203 nips-2009-Replacing supervised classification learning by Slow Feature Analysis in spiking neural networks

19 0.36779901 60 nips-2009-Convergent Temporal-Difference Learning with Arbitrary Smooth Function Approximation

20 0.35717925 2 nips-2009-3D Object Recognition with Deep Belief Nets


similar papers computed by lda model

lda for this paper:

topicId topicWeight

[(7, 0.016), (10, 0.013), (24, 0.039), (25, 0.078), (35, 0.06), (36, 0.104), (39, 0.025), (54, 0.319), (58, 0.05), (61, 0.013), (71, 0.055), (81, 0.027), (86, 0.089), (91, 0.011)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.75148493 189 nips-2009-Periodic Step Size Adaptation for Single Pass On-line Learning

Author: Chun-nan Hsu, Yu-ming Chang, Hanshen Huang, Yuh-jye Lee

Abstract: It has been established that the second-order stochastic gradient descent (2SGD) method can potentially achieve generalization performance as well as empirical optimum in a single pass (i.e., epoch) through the training examples. However, 2SGD requires computing the inverse of the Hessian matrix of the loss function, which is prohibitively expensive. This paper presents Periodic Step-size Adaptation (PSA), which approximates the Jacobian matrix of the mapping function and explores a linear relation between the Jacobian and Hessian to approximate the Hessian periodically and achieve near-optimal results in experiments on a wide variety of models and tasks. 1

2 0.70056266 168 nips-2009-Non-stationary continuous dynamic Bayesian networks

Author: Marco Grzegorczyk, Dirk Husmeier

Abstract: Dynamic Bayesian networks have been applied widely to reconstruct the structure of regulatory processes from time series data. The standard approach is based on the assumption of a homogeneous Markov chain, which is not valid in many realworld scenarios. Recent research efforts addressing this shortcoming have considered undirected graphs, directed graphs for discretized data, or over-flexible models that lack any information sharing among time series segments. In the present article, we propose a non-stationary dynamic Bayesian network for continuous data, in which parameters are allowed to vary among segments, and in which a common network structure provides essential information sharing across segments. Our model is based on a Bayesian multiple change-point process, where the number and location of the change-points is sampled from the posterior distribution. 1

3 0.64404666 36 nips-2009-Asymptotic Analysis of MAP Estimation via the Replica Method and Compressed Sensing

Author: Sundeep Rangan, Vivek Goyal, Alyson K. Fletcher

Abstract: The replica method is a non-rigorous but widely-accepted technique from statistical physics used in the asymptotic analysis of large, random, nonlinear problems. This paper applies the replica method to non-Gaussian maximum a posteriori (MAP) estimation. It is shown that with random linear measurements and Gaussian noise, the asymptotic behavior of the MAP estimate of an n-dimensional vector “decouples” as n scalar MAP estimators. The result is a counterpart to Guo and Verdú’s replica analysis of minimum mean-squared error estimation. The replica MAP analysis can be readily applied to many estimators used in compressed sensing, including basis pursuit, lasso, linear estimation with thresholding, and zero norm-regularized estimation. In the case of lasso estimation the scalar estimator reduces to a soft-thresholding operator, and for zero norm-regularized estimation it reduces to a hard-threshold. Among other benefits, the replica method provides a computationally-tractable method for exactly computing various performance metrics including mean-squared error and sparsity pattern recovery probability.

4 0.6265828 45 nips-2009-Beyond Convexity: Online Submodular Minimization

Author: Elad Hazan, Satyen Kale

Abstract: We consider an online decision problem over a discrete space in which the loss function is submodular. We give algorithms which are computationally efficient and are Hannan-consistent in both the full information and bandit settings. 1

5 0.50600207 132 nips-2009-Learning in Markov Random Fields using Tempered Transitions

Author: Ruslan Salakhutdinov

Abstract: Markov random fields (MRF’s), or undirected graphical models, provide a powerful framework for modeling complex dependencies among random variables. Maximum likelihood learning in MRF’s is hard due to the presence of the global normalizing constant. In this paper we consider a class of stochastic approximation algorithms of the Robbins-Monro type that use Markov chain Monte Carlo to do approximate maximum likelihood learning. We show that using MCMC operators based on tempered transitions enables the stochastic approximation algorithm to better explore highly multimodal distributions, which considerably improves parameter estimates in large, densely-connected MRF’s. Our results on MNIST and NORB datasets demonstrate that we can successfully learn good generative models of high-dimensional, richly structured data that perform well on digit and object recognition tasks.

6 0.50049472 17 nips-2009-A Sparse Non-Parametric Approach for Single Channel Separation of Known Sounds

7 0.49921933 129 nips-2009-Learning a Small Mixture of Trees

8 0.49840724 174 nips-2009-Nonparametric Latent Feature Models for Link Prediction

9 0.49819997 250 nips-2009-Training Factor Graphs with Reinforcement Learning for Efficient MAP Inference

10 0.49778292 167 nips-2009-Non-Parametric Bayesian Dictionary Learning for Sparse Image Representations

11 0.49762011 97 nips-2009-Free energy score space

12 0.49523145 169 nips-2009-Nonlinear Learning using Local Coordinate Coding

13 0.49480629 260 nips-2009-Zero-shot Learning with Semantic Output Codes

14 0.49327794 2 nips-2009-3D Object Recognition with Deep Belief Nets

15 0.49300849 220 nips-2009-Slow Learners are Fast

16 0.49263322 175 nips-2009-Occlusive Components Analysis

17 0.49199283 68 nips-2009-Dirichlet-Bernoulli Alignment: A Generative Model for Multi-Class Multi-Label Multi-Instance Corpora

18 0.49161565 71 nips-2009-Distribution-Calibrated Hierarchical Classification

19 0.49156687 131 nips-2009-Learning from Neighboring Strokes: Combining Appearance and Context for Multi-Domain Sketch Recognition

20 0.49122906 137 nips-2009-Learning transport operators for image manifolds