nips nips2011 nips2011-45 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Elad Hazan, Tomer Koren, Nati Srebro
Abstract: We present an optimization approach for linear SVMs based on a stochastic primal-dual approach, where the primal step is akin to an importance-weighted SGD, and the dual step is a stochastic update on the importance weights. This yields an optimization method with a sublinear dependence on the training set size, and the first method for learning linear SVMs with runtime less than the size of the training set required for learning!
Reference: text
sentIndex sentText sentNum sentScore
1 Abstract: We present an optimization approach for linear SVMs based on a stochastic primal-dual approach, where the primal step is akin to an importance-weighted SGD, and the dual step is a stochastic update on the importance weights. [sent-5, score-0.613]
2 This yields an optimization method with a sublinear dependence on the training set size, and the first method for learning linear SVMs with runtime less than the size of the training set required for learning! [sent-6, score-0.761]
3 1 Introduction Stochastic approximation (online) approaches, such as stochastic gradient descent and stochastic dual averaging, have become the optimization method of choice for many learning problems, including linear SVMs. [sent-7, score-0.376]
4 This is not surprising, since such methods yield optimal generalization guarantees with only a single pass over the data. [sent-8, score-0.096]
5 They therefore in a sense have optimal, unbeatable runtime: from a learning (generalization) point of view, in a “data laden” setting [2, 13], the runtime to get to a desired generalization goal is the same as the size of the data set required to do so. [sent-9, score-0.653]
6 Their runtime is therefore equal (up to a small constant factor) to the runtime required to just read the data. [sent-10, score-0.887]
7 The key here is that, unlike online methods that consider an entire training vector at each iteration, our method accesses single features (coordinates) of training vectors. [sent-12, score-0.479]
8 Our computational model is thus that of random access to a desired coordinate of a desired training vector (as is standard for sublinear time algorithms), and our main computational cost is that of these feature accesses. [sent-13, score-0.318]
9 This is the first theoretical guarantee on the number of feature accesses that is less than that of simply observing entire feature vectors. [sent-18, score-0.314]
10 We emphasize that our method is not online in nature, and we do require repeated access to training examples, but the resulting runtime (as well as the overall number of features accessed) is less (in some regimes) than that of any online algorithm that considers entire training vectors. [sent-19, score-0.781]
11 As discussed in Section 3, our method is a primal-dual method, where both the primal and dual steps are stochastic. [sent-23, score-0.326]
12 The primal steps can be viewed as importance-weighted stochastic gradient descent, and the dual step as a stochastic update on the importance weighting, informed by the current primal solution. [sent-24, score-0.789]
13 This approach builds on the work of [4] that presented a sublinear time algorithm for approximating the margin of a linearly separable data set. [sent-25, score-0.16]
14 Here, we extend that work to the more relevant noisy (non-separable) setting, and show how it can be applied to a learning problem, yielding a generalization runtime better than that of SGD. [sent-26, score-0.472]
15 2 The SVM Optimization Problem We consider training a linear binary SVM based on a training set of n labeled points {(xi, yi)}_{i=1}^n. [sent-28, score-0.174]
Here xi ∈ R^d and yi ∈ {±1}, with the data normalized such that ‖xi‖ ≤ 1. [sent-31, score-0.136]
17 A predictor is specified by w ∈ Rd and a bias b ∈ R. [sent-32, score-0.164]
18 In training, we wish to minimize the empirical error, measured in terms of the average hinge loss R̂hinge(w, b) = (1/n) Σ_{i=1}^n [1 − yi(⟨w, xi⟩ + b)]_+, and the norm of w. [sent-33, score-0.133]
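As a quick concrete check, the average hinge loss above can be computed directly (a minimal NumPy sketch; the function and array names are ours, not the paper's):

```python
import numpy as np

def avg_hinge_loss(w, b, X, y):
    """Average hinge loss (1/n) * sum_i [1 - y_i(<w, x_i> + b)]_+ ."""
    margins = y * (X @ w + b)
    return np.maximum(0.0, 1.0 - margins).mean()

# Tiny example: two unit-norm points in R^2, correctly classified with margin 1.
X = np.array([[1.0, 0.0], [0.0, 1.0]])
y = np.array([1.0, -1.0])
w = np.array([1.0, -1.0])
loss = avg_hinge_loss(w, 0.0, X, y)  # both margins equal 1, so the loss is 0
```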
19 Specifically, we introduce slack variables and consider the optimization problem:
    max_{w∈R^d, b∈R, ξ≥0, ‖w‖≤1}  min_{i∈[n]}  yi(⟨w, xi⟩ + b) + ξi   s.t.   Σ_{i=1}^n ξi ≤ nν,   (3)
[sent-37, score-0.206]
where the parameter ν controls the trade-off between desiring a large margin (low norm) and a small error (low slack), and parameterizes solutions along the regularization path. [sent-39, score-0.094]
21 Before proceeding, we also note that any solution of (1) that classifies at least some positive and negative points within the desired margin must have ‖w‖ ≥ 1, and so in Lemma 2. [sent-50, score-0.136]
22 3 Overview: Primal-Dual Algorithms and Our Approach The CHW framework The method of [4] applies to saddle-point problems of the form
    max_{z∈K} min_{i∈[n]} ci(z),   (4)
[sent-53, score-0.143]
where the ci(z) are concave functions of z over some set K ⊆ R^d. [sent-54, score-0.111]
24 The method is a stochastic primal-dual method, where the dual solution can be viewed as an importance weighting over the n terms ci(z). [sent-55, score-0.429]
25 To better understand this view, consider the equivalent problem:
    max_{z∈K} min_{p∈∆n} Σ_{i=1}^n pi·ci(z),   (5)
where ∆n = {p ∈ R^n | pi ≥ 0, ‖p‖₁ = 1} is the probability simplex. [sent-56, score-0.169]
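The equivalence of (4) and (5) comes from the fact that a linear function of p is minimized at a vertex of the simplex, so the inner minimum in (5) equals min_i ci(z). A quick numerical sanity check (the values of ci(z) below are made up for illustration):

```python
import numpy as np

c = np.array([0.7, -0.2, 1.3])              # illustrative values of c_i(z) for a fixed z
rng = np.random.default_rng(0)
p = rng.dirichlet(np.ones(3), size=1000)    # random points of the simplex Delta_3

# Every convex combination is at least the smallest c_i ...
assert np.all(p @ c >= c.min() - 1e-12)

# ... and the vertex concentrated on the argmin attains it exactly.
vertex = np.zeros(3)
vertex[c.argmin()] = 1.0
best = vertex @ c                            # equals c.min() = -0.2
```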
26 The method maintains and (stochastically) improves both a primal solution (in our case, a predictor w ∈ Rd ) and a dual solution, which is a distribution p over [n]. [sent-57, score-0.45]
27 Stochastic primal update: (a) A term i ∈ [n] is chosen according to the distribution p, in time O(n). [sent-60, score-0.214]
(b) The primal variable z is updated according to the gradient of ci(z), via an online low-regret update. [sent-61, score-0.43]
29 This update is in fact a Stochastic Gradient Descent (SGD) step on the objective of (5), as explained in section 4. [sent-62, score-0.096]
Since we use only a single term ci(z), this can usually be done in time O(d). [sent-63, score-0.138]
31 Stochastic dual update: (a) We obtain a stochastic estimate of ci (z), for each i ∈ [n]. [sent-65, score-0.316]
When the ci's are linear functions, this can be achieved using a form of ℓ2-sampling for estimating an inner product in R^d. [sent-69, score-0.111]
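The ℓ2-sampling idea can be sketched as follows (our minimal reading of the technique, not the paper's code): to estimate ⟨x, w⟩, sample a coordinate j with probability w(j)²/‖w‖² and return x(j)·‖w‖²/w(j). The estimate is unbiased, and its second moment is ‖x‖²‖w‖², so the variance is bounded in the paper's normalized setting.

```python
import numpy as np

def l2_sample_inner(x, w, rng):
    """One-sample unbiased estimate of <x, w>: pick j with probability
    w_j^2 / ||w||^2, return x_j * ||w||^2 / w_j."""
    sq = w ** 2
    j = rng.choice(len(w), p=sq / sq.sum())
    return x[j] * sq.sum() / w[j]

rng = np.random.default_rng(0)
x = np.array([0.2, -0.5, 0.1])
w = np.array([1.0, 2.0, -1.0])
est = np.mean([l2_sample_inner(x, w, rng) for _ in range(50000)])
# est concentrates around the true inner product <x, w> = -0.9
```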
33 (b) The distribution p is updated toward those terms with low estimated values of ci (z). [sent-70, score-0.186]
34 This is accomplished using a variant of the Multiplicative Updates (MW) framework for online optimization over the simplex (see for example [1]), adapted to our case in which the updates are based on random variables with bounded variance. [sent-71, score-0.197]
35 Evidently, the overall runtime per iteration is O(n + d). [sent-73, score-0.478]
36 In addition, the regret bounds on the updates of z and p can be used to bound the number of iterations required to reach an ε-suboptimal solution. [sent-74, score-0.202]
37 Hence, the CHW approach is particularly effective when this regret bound has a favorable dependence on d and n. [sent-75, score-0.108]
38 The PST framework The Plotkin-Shmoys-Tardos framework [10] is a deterministic primal-dual method, originally proposed for approximately solving certain types of linear programs known as “fractional packing and covering” problems. [sent-77, score-0.101]
39 These iterations yield convergence to the optimum of (5), and the regret bound of the MW updates is used to derive a convergence rate guarantee. [sent-80, score-0.215]
40 Since each iteration of the framework relies on the entire set of functions ci , it is reasonable to apply it only on relatively small-sized problems. [sent-81, score-0.21]
41 Indeed, in our application we shall use this method for the update of the slack variables ξ and the bias term b, for which the implied cost is only O(n) time. [sent-82, score-0.241]
Specifically, we suggest using an SGD-like low-regret update for the variable w, while updating the variables ξ and b via a PST-like step; the dual update of our method is similar to that of CHW. [sent-88, score-0.281]
For simplicity of presentation, we omit the bias term for now (i.e., fix b = 0 in (3)) and later explain how adding such a bias to our framework is almost immediate and does not affect the analysis. [sent-93, score-0.096] [sent-95, score-0.101]
45 This allows us to ignore the labels yi , by setting xi ← −xi for any i with yi = −1. [sent-96, score-0.128]
46 Finally, we stack the training instances xi as the rows of a matrix X ∈ Rn×d , although we treat each xi as a column vector. [sent-101, score-0.192]
Algorithm SIMBA:
 1:  Let T ← 100²·ε⁻²·log n,  η ← √(log(n)/T),  u1 ← 0,  q1 ← 1_n
 2:  for t = 1 to T do
 3:      pt ← qt/‖qt‖₁
 4:      Choose it ← i with probability pt(i)
 5:      Let ut ← u_{t−1} + x_{it}/√(2T),   ξt ← arg max_{ξ∈Ξν} pt⊤ξ
 6:      wt ← ut/max{1, ‖ut‖}
 7:      Choose jt ← j with probability wt(j)²/‖wt‖²
 8:      for i = 1 to n do
 9:          ṽt(i) ← xi(jt)·‖wt‖²/wt(jt) + ξt(i)
10:          vt(i) ← clip(ṽt(i), 1/η)
11:          qt+1(i) ← qt(i)·(1 − η·vt(i) + η²·vt(i)²)
12:      end for
13:  end for
14:  return w̄ = (1/T)·Σt wt,  ξ̄ = (1/T)·Σt ξt
The pseudo-code of the SIMBA algorithm is given in Figure 1. [sent-104, score-1.116]
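The loop above can be sketched in NumPy as follows. This is our illustrative reading, not the authors' code: we take Ξν = {ξ ≥ 0 : Σᵢ ξᵢ ≤ nν} as in (3) (so the PST step puts the whole slack budget on the currently heaviest example), clip symmetrically at ±1/η, and drop the constant factors in T. Labels are assumed pre-folded into the rows of X.

```python
import numpy as np

def simba(X, eps=0.1, nu=0.1, seed=0):
    """Illustrative sketch of the SIMBA primal-dual loop.  Rows of X are
    assumed pre-multiplied by their labels y_i.  The slack set Xi_nu is
    taken to be {xi >= 0 : sum_i xi_i <= n*nu}, so the PST step puts the
    whole budget on the heaviest example; this is our simplification, not
    necessarily the paper's exact step."""
    n, d = X.shape
    rng = np.random.default_rng(seed)
    T = max(1, int(np.log(n) / eps ** 2))       # constants dropped for the sketch
    eta = np.sqrt(np.log(n) / T)
    u, q = np.zeros(d), np.ones(n)
    w_sum, xi_sum = np.zeros(d), np.zeros(n)
    for _ in range(T):
        p = q / q.sum()
        i_t = rng.choice(n, p=p)                # primal: sample an example
        u = u + X[i_t] / np.sqrt(2 * T)         # SGD step on p^T (Xw + xi)
        xi = np.zeros(n)
        xi[p.argmax()] = n * nu                 # simplified PST step (see above)
        w = u / max(1.0, np.linalg.norm(u))     # project onto the unit ball
        sq = w ** 2
        j_t = rng.choice(d, p=sq / sq.sum())    # dual: l2-sample a feature
        v = X[:, j_t] * sq.sum() / w[j_t] + xi  # unbiased estimate of Xw + xi
        v = np.clip(v, -1.0 / eta, 1.0 / eta)   # clip(., 1/eta)
        q = q * (1 - eta * v + eta ** 2 * v ** 2)  # variance-MW update
        w_sum += w
        xi_sum += xi
    return w_sum / T, xi_sum / T
```

Each iteration touches one row and one column of X plus O(n + d) bookkeeping, matching the O(n + d) per-iteration cost discussed earlier.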
49 In the primal part (lines 4 through 6), the vector ut is updated by adding an instance xi , randomly chosen according to the distribution pt . [sent-105, score-0.56]
This is a version of SGD applied to the function pt⊤(Xw + ξt), whose gradient with respect to w is X⊤pt; by the sampling procedure for it, the vector x_{it} is an unbiased estimator of this gradient. [sent-106, score-0.535]
51 The vector ut is then projected onto the unit ball, to obtain wt . [sent-107, score-0.217]
On the other hand, the primal variable ξt is updated by a complete optimization of pt⊤ξ with respect to ξ ∈ Ξν. [sent-108, score-0.489]
In the dual part (lines 7 through 13), the algorithm first updates the vector qt using the jt-th column of X and the value of wt(jt), where jt is randomly selected according to the distribution wt(·)²/‖wt‖². [sent-111, score-0.922]
54 Before stating the main theorem, we describe in detail the MW algorithm we use for the dual update. [sent-117, score-0.139]
Let w1 ← 1_n, and for t ≥ 1,
    pt ← wt/‖wt‖₁,    wt+1(i) ← wt(i)·(1 − η·vt(i) + η²·vt(i)²).   (6)
[sent-125, score-0.785]
The following lemma establishes a regret bound for the Variance MW algorithm. [sent-126, score-0.159]
The Variance MW algorithm satisfies
    Σ_{t∈[T]} pt⊤vt ≤ min_{i∈[n]} Σ_{t∈[T]} max{vt(i), −1/η} + (log n)/η + η·Σ_{t∈[T]} pt⊤vt²,
where vt² denotes the coordinate-wise square. [sent-129, score-0.696]
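A single round of the Variance MW update (6) is short enough to write out (a minimal sketch with made-up cost estimates):

```python
import numpy as np

def variance_mw_step(weights, v, eta):
    """One Variance-MW round from (6): multiplicative weights with a
    second-order (variance) correction for noisy cost estimates."""
    w_next = weights * (1 - eta * v + eta ** 2 * v ** 2)
    p_next = w_next / w_next.sum()
    return w_next, p_next

weights = np.ones(4)
costs = np.array([0.5, -0.5, 0.0, 1.0])   # one round of estimated costs v_t
weights, p = variance_mw_step(weights, costs, eta=0.1)
# mass shifts toward the low-cost coordinate (index 1)
```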
The main idea of the proof is to establish lower and upper bounds on the average objective value (1/T)·Σ_{t∈[T]} pt⊤(Xwt + ξt). [sent-137, score-0.238]
59 For the lower bound, we consider the primal part of the algorithm. [sent-140, score-0.187]
Noting that Σ_{t∈[T]} pt⊤ξt ≥ Σ_{t∈[T]} pt⊤ξ* (which follows from the PST step) and employing a standard regret guarantee for bounding the regret of the SGD update, we obtain the lower bound (with probability ≥ 1 − O(1/n)):
    (1/T)·Σ_{t∈[T]} pt⊤(Xwt + ξt) ≥ γ* − O(√(log n / T)).
[sent-141, score-0.894]
61 t∈[T ] For the upper bound, we examine the dual part of the algorithm. [sent-142, score-0.139]
Using the regret bound of the Variance MW lemma for bounding the regret of the MW update, we get the following upper bound (with probability > 3/4 − O(1/n)):
    (1/T)·Σ_{t∈[T]} pt⊤(Xwt + ξt) ≤ min_{i∈[n]} (1/T)·Σ_{t∈[T]} [xi⊤wt + ξt(i)] + O(√(log n / T)).
[sent-144, score-0.497]
In each iteration, the update of the vectors wt and pt takes O(d) and O(n) time respectively, while ξt can be computed in O(n) time as explained above. [sent-148, score-0.434]
64 Incorporating a bias term We return to the optimization problem (3) presented in section 2, and show how the bias term b can be integrated into our algorithm. [sent-150, score-0.236]
65 Unlike with SGD-based approaches, including the bias term in our framework is straightforward. [sent-151, score-0.128]
Notice that we still assume that the labels yi were subsumed into the instances xi, as in Section 4. [sent-154, score-0.171]
The update of bt, on the other hand, admits a simple, closed-form formula: bt = sign(pt⊤y). [sent-156, score-0.237]
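That closed-form step reads directly as code (illustrative values; p is the current dual distribution and y the ±1 label vector):

```python
import numpy as np

def bias_step(p, y):
    """b_t = sign(p_t . y): set the bias toward the class that currently
    carries more dual weight."""
    return float(np.sign(p @ y))

p = np.array([0.6, 0.2, 0.2])
y = np.array([1.0, -1.0, -1.0])
b = bias_step(p, y)   # 0.6 - 0.2 - 0.2 > 0, so b = 1.0
```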
1, we see that for any Pareto optimal point w* of (1) with ‖w*‖ = B and R̂hinge(w*) = R̂*, the runtime required for our method to find a predictor w with ‖w‖ ≤ 2B and R̂hinge(w) ≤ R̂* + δ is
    Õ( B²·(n + d)·((R̂* + δ)/δ)² ).   (7)
[sent-165, score-0.55]
This guarantee is rather different from the guarantees of other SVM optimization approaches. [sent-166, score-0.124]
Using a stochastic gradient descent (SGD) approach, we could find a predictor with ‖w‖ ≤ B and R̂hinge(w) ≤ R̂* + δ in time O(B²d/δ²). [sent-169, score-0.222]
Compared with SGD, we only ensure a constant-factor approximation to the norm, and our runtime does depend on the training set size n, but the dependence on δ is more favorable. [sent-170, score-0.496]
72 Following [13], instead of comparing the runtime to achieve a certain optimization accuracy on the empirical optimization problem, we analyze the runtime to achieve a desired generalization performance. [sent-172, score-1.012]
Recall that our true learning objective is to find a predictor with low generalization error Rerr(w) = Pr_{(x,y)}(y·⟨w, x⟩ ≤ 0), where (x, y) is distributed according to some unknown source distribution and the training set is drawn i.i.d. from it. [sent-173, score-0.322]
We assume that there exists some (unknown) predictor w* that has norm ‖w*‖ ≤ B and low expected hinge loss R* = Rhinge(w*) = E[[1 − y·⟨w*, x⟩]_+], and analyze the runtime to find a predictor w with generalization error Rerr(w) ≤ R* + δ. [sent-177, score-0.813]
In order to understand the runtime from this perspective, we must consider the sample size required to obtain generalization to within δ, as well as the required suboptimality for ‖w‖ and R̂hinge(w). [sent-178, score-0.602]
76 But since, as we will see, our analysis will be sensitive to the value of R∗ , we will consider a more refined generalization guarantee which gives a better rate when R∗ is small relative to δ. [sent-180, score-0.109]
Following Theorem 5 of [14] (and recalling that the hinge loss is an upper bound on margin violations), we have that with high probability over a sample of size n, for all predictors w:
    Rerr(w) ≤ R̂hinge(w) + O( √( ‖w‖²·R̂hinge(w)/n ) + ‖w‖²/n ).   (8)
[sent-181, score-0.111]
This implies that a training set of size
    n = Õ( (B²/δ) · ((R* + δ)/δ) )   (9)
is enough for generalization to within δ. [sent-182, score-0.162]
We will be mostly concerned here with the regime where R* is small and we seek generalization to within δ = Ω(R*), a typical regime in learning. [sent-183, score-0.199]
This is always the case in the realizable setting, where R* = 0, but it also includes the non-realizable setting, as long as the desired estimation error δ is not much smaller than the unavoidable error R*. [sent-184, score-0.121]
In fact, an online approach can find a predictor with Rerr(w) ≤ R* + δ with a single pass over n = Õ(B²/δ · (δ + R*)/δ) training points. [sent-186, score-0.216]
Since each step takes O(d) time (the time required to read the training point), the overall runtime is:
    O( B²d/δ · (R* + δ)/δ ).   (10)
[sent-188, score-0.591]
Returning to our approach, approximating the norm to within a factor of two is fine, as it only affects the required sample size, and hence the runtime, by a constant factor. [sent-189, score-0.505]
Plugging this into the runtime analysis (7) yields: [sent-191, score-0.403]
Corollary 5.1. For any B ≥ 1 and δ > 0, with high probability over a training set of size n = Õ(B²/δ · (δ + R*)/δ), Algorithm 1 outputs a predictor w with Rerr(w) ≤ R* + δ in time
    Õ( (B²d + B⁴·(δ + R*)/δ²) · ((δ + R*)/δ)² ),
where R* = inf_{‖w*‖≤B} Rhinge(w*). [sent-193, score-0.162]
Let us compare the above runtime to the online runtime (10), focusing on the regime where R* is small and δ = Ω(R*), so that (R* + δ)/δ = O(1), and ignoring the logarithmic factors hidden in the Õ(·) notation in Corollary 5.1. [sent-194, score-0.925]
To do so, we will first rewrite the runtime in Corollary 5.1. [sent-196, score-0.403]
88 Now, the first term in (11) is more directly comparable to the online runtime (10), and is always smaller by a factor of (R∗ + δ) ≤ 1. [sent-203, score-0.484]
89 We see, then, that our proposed approach can yield a significant reduction in runtime when the resulting error rate is small. [sent-205, score-0.466]
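The size of this reduction is easy to see numerically. The following is purely illustrative arithmetic with made-up constants, plugging values into (10) and into the first term of the rewritten bound, which the text states is smaller by the factor (R* + δ):

```python
# Online runtime (10): ~ B^2 * d * (R* + delta) / delta^2.
# The first term of the rewritten bound is smaller by the extra factor
# (R* + delta) <= 1, so the speedup is roughly 1 / (R* + delta).
B, d = 10.0, 10 ** 6
for r_star, delta in [(0.01, 0.01), (0.001, 0.001)]:
    online = B ** 2 * d * (r_star + delta) / delta ** 2
    ours_first = online * (r_star + delta)     # first term of the rewritten bound
    print(f"R*={r_star}: speedup ~ {online / ours_first:.0f}x")
```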
Returning to the form of the runtime in Corollary 5.1, we can also understand the runtime as follows: initially, a runtime of O(B²d) is required in order for the estimates of w and p to start being reasonable. [sent-207, score-0.403] [sent-208, score-0.858]
92 15 107 108 feature accesses 109 109 1010 feature accesses 1011 Figure 1: The test error, averaged over 10 repetitions, vs. [sent-233, score-0.432]
93 Note that our algorithm assumes random access to features (as opposed to instances), thus it is not meaningful to compare the test error as a function of the number of iterations of each algorithm. [sent-242, score-0.1]
94 Instead, and according to our computational model, we compare the test error as a function of the number of feature accesses of each algorithm. [sent-243, score-0.252]
95 • Explicitly introducing the slack variables (which are not typically represented in primal SGD approaches). [sent-248, score-0.261]
96 This differs from heuristic importance weighting approaches for stochastic learning, which tend to focus on all samples with a non-zero loss gradient. [sent-250, score-0.15]
97 The ideas can also be extended to kernels, either through linearization [11], using an implicit linearization as in [4], or through a representation approach. [sent-253, score-0.108]
98 Beyond SVMs, the framework can apply more broadly, whenever we have a low-regret method for the primal problem, and a sampling procedure for the dual updates. [sent-254, score-0.358]
[1] The multiplicative weights update method: a meta-algorithm and applications. [sent-264, score-0.101]
[10] Fast approximation algorithms for fractional packing and covering problems. [sent-324, score-0.094]
Title: Beating SGD: Learning SVMs in Sublinear Time
Author: Elad Hazan, Tomer Koren, Nati Srebro
Abstract: We present an optimization approach for linear SVMs based on a stochastic primal-dual approach, where the primal step is akin to an importance-weighted SGD, and the dual step is a stochastic update on the importance weights. This yields an optimization method with a sublinear dependence on the training set size, and the first method for learning linear SVMs with runtime less then the size of the training set required for learning! 1
2 0.20519039 39 nips-2011-Approximating Semidefinite Programs in Sublinear Time
Author: Dan Garber, Elad Hazan
Abstract: In recent years semidefinite optimization has become a tool of major importance in various optimization and machine learning problems. In many of these problems the amount of data in practice is so large that there is a constant need for faster algorithms. In this work we present the first sublinear time approximation algorithm for semidefinite programs which we believe may be useful for such problems in which the size of data may cause even linear time algorithms to have prohibitive running times in practice. We present the algorithm and its analysis alongside with some theoretical lower bounds and an improved algorithm for the special problem of supervised learning of a distance metric. 1
3 0.19068058 271 nips-2011-Statistical Tests for Optimization Efficiency
Author: Levi Boyles, Anoop Korattikara, Deva Ramanan, Max Welling
Abstract: Learning problems, such as logistic regression, are typically formulated as pure optimization problems defined on some loss function. We argue that this view ignores the fact that the loss function depends on stochastically generated data which in turn determines an intrinsic scale of precision for statistical estimation. By considering the statistical properties of the update variables used during the optimization (e.g. gradients), we can construct frequentist hypothesis tests to determine the reliability of these updates. We utilize subsets of the data for computing updates, and use the hypothesis tests for determining when the batch-size needs to be increased. This provides computational benefits and avoids overfitting by stopping when the batch-size has become equal to size of the full dataset. Moreover, the proposed algorithms depend on a single interpretable parameter – the probability for an update to be in the wrong direction – which is set to a single value across all algorithms and datasets. In this paper, we illustrate these ideas on three L1 regularized coordinate descent algorithms: L1 -regularized L2 -loss SVMs, L1 -regularized logistic regression, and the Lasso, but we emphasize that the underlying methods are much more generally applicable. 1
4 0.18674001 46 nips-2011-Better Mini-Batch Algorithms via Accelerated Gradient Methods
Author: Andrew Cotter, Ohad Shamir, Nati Srebro, Karthik Sridharan
Abstract: Mini-batch algorithms have been proposed as a way to speed-up stochastic convex optimization problems. We study how such algorithms can be improved using accelerated gradient methods. We provide a novel analysis, which shows how standard gradient methods may sometimes be insufficient to obtain a significant speed-up and propose a novel accelerated gradient algorithm, which deals with this deficiency, enjoys a uniformly superior guarantee and works well in practice. 1
5 0.13348398 185 nips-2011-Newtron: an Efficient Bandit algorithm for Online Multiclass Prediction
Author: Elad Hazan, Satyen Kale
Abstract: We present an efficient algorithm for the problem of online multiclass prediction with bandit feedback in the fully adversarial setting. We measure its regret with respect to the log-loss defined in [AR09], which is parameterized by a scalar α. We prove that the regret of N EWTRON is O(log T ) when α is a constant that does not vary with horizon T , and at most O(T 2/3 ) if α is allowed to increase to infinity √ with T . For α = O(log T ), the regret is bounded by O( T ), thus solving the open problem of [KSST08, AR09]. Our algorithm is based on a novel application of the online Newton method [HAK07]. We test our algorithm and show it to perform well in experiments, even when α is a small constant. 1
6 0.1244158 187 nips-2011-Non-Asymptotic Analysis of Stochastic Approximation Algorithms for Machine Learning
7 0.11973131 80 nips-2011-Efficient Online Learning via Randomized Rounding
8 0.11471771 204 nips-2011-Online Learning: Stochastic, Constrained, and Smoothed Adversaries
9 0.10624897 202 nips-2011-On the Universality of Online Mirror Descent
10 0.1017798 220 nips-2011-Prediction strategies without loss
11 0.091441482 282 nips-2011-The Fast Convergence of Boosting
12 0.087207615 57 nips-2011-Comparative Analysis of Viterbi Training and Maximum Likelihood Estimation for HMMs
13 0.086221829 128 nips-2011-Improved Algorithms for Linear Stochastic Bandits
14 0.082939893 178 nips-2011-Multiclass Boosting: Theory and Algorithms
15 0.082061604 121 nips-2011-Hogwild: A Lock-Free Approach to Parallelizing Stochastic Gradient Descent
16 0.080296427 210 nips-2011-PAC-Bayesian Analysis of Contextual Bandits
17 0.080000542 114 nips-2011-Hierarchical Multitask Structured Output Learning for Large-scale Sequence Segmentation
18 0.078260072 145 nips-2011-Learning Eigenvectors for Free
19 0.078177951 25 nips-2011-Adaptive Hedge
20 0.076030515 96 nips-2011-Fast and Balanced: Efficient Label Tree Learning for Large Scale Object Recognition
topicId topicWeight
[(0, 0.235), (1, -0.134), (2, -0.066), (3, -0.094), (4, 0.095), (5, 0.063), (6, 0.059), (7, -0.061), (8, -0.157), (9, -0.0), (10, 0.135), (11, -0.098), (12, -0.034), (13, -0.1), (14, -0.062), (15, 0.157), (16, -0.071), (17, 0.099), (18, 0.011), (19, 0.055), (20, -0.032), (21, -0.015), (22, -0.087), (23, -0.01), (24, -0.011), (25, 0.081), (26, 0.009), (27, 0.034), (28, -0.029), (29, 0.043), (30, 0.002), (31, 0.001), (32, 0.002), (33, -0.009), (34, 0.022), (35, 0.042), (36, -0.037), (37, 0.026), (38, 0.111), (39, -0.009), (40, 0.042), (41, -0.032), (42, 0.072), (43, -0.074), (44, -0.067), (45, 0.043), (46, -0.039), (47, -0.041), (48, -0.007), (49, -0.002)]
simIndex simValue paperId paperTitle
same-paper 1 0.92210335 45 nips-2011-Beating SGD: Learning SVMs in Sublinear Time
Author: Elad Hazan, Tomer Koren, Nati Srebro
Abstract: We present an optimization approach for linear SVMs based on a stochastic primal-dual approach, where the primal step is akin to an importance-weighted SGD, and the dual step is a stochastic update on the importance weights. This yields an optimization method with a sublinear dependence on the training set size, and the first method for learning linear SVMs with runtime less then the size of the training set required for learning! 1
2 0.80837309 46 nips-2011-Better Mini-Batch Algorithms via Accelerated Gradient Methods
Author: Andrew Cotter, Ohad Shamir, Nati Srebro, Karthik Sridharan
Abstract: Mini-batch algorithms have been proposed as a way to speed-up stochastic convex optimization problems. We study how such algorithms can be improved using accelerated gradient methods. We provide a novel analysis, which shows how standard gradient methods may sometimes be insufficient to obtain a significant speed-up and propose a novel accelerated gradient algorithm, which deals with this deficiency, enjoys a uniformly superior guarantee and works well in practice. 1
3 0.76552445 271 nips-2011-Statistical Tests for Optimization Efficiency
Author: Levi Boyles, Anoop Korattikara, Deva Ramanan, Max Welling
Abstract: Learning problems, such as logistic regression, are typically formulated as pure optimization problems defined on some loss function. We argue that this view ignores the fact that the loss function depends on stochastically generated data which in turn determines an intrinsic scale of precision for statistical estimation. By considering the statistical properties of the update variables used during the optimization (e.g. gradients), we can construct frequentist hypothesis tests to determine the reliability of these updates. We utilize subsets of the data for computing updates, and use the hypothesis tests for determining when the batch-size needs to be increased. This provides computational benefits and avoids overfitting by stopping when the batch-size has become equal to size of the full dataset. Moreover, the proposed algorithms depend on a single interpretable parameter – the probability for an update to be in the wrong direction – which is set to a single value across all algorithms and datasets. In this paper, we illustrate these ideas on three L1 regularized coordinate descent algorithms: L1 -regularized L2 -loss SVMs, L1 -regularized logistic regression, and the Lasso, but we emphasize that the underlying methods are much more generally applicable. 1
4 0.71990919 202 nips-2011-On the Universality of Online Mirror Descent
Author: Nati Srebro, Karthik Sridharan, Ambuj Tewari
Abstract: We show that for a general class of convex online learning problems, Mirror Descent can always achieve a (nearly) optimal regret guarantee. 1
5 0.71525288 187 nips-2011-Non-Asymptotic Analysis of Stochastic Approximation Algorithms for Machine Learning
Author: Eric Moulines, Francis R. Bach
Abstract: We consider the minimization of a convex objective function defined on a Hilbert space, which is only available through unbiased estimates of its gradients. This problem includes standard machine learning algorithms such as kernel logistic regression and least-squares regression, and is commonly referred to as a stochastic approximation problem in the operations research community. We provide a non-asymptotic analysis of the convergence of two well-known algorithms, stochastic gradient descent (a.k.a. Robbins-Monro algorithm) as well as a simple modification where iterates are averaged (a.k.a. Polyak-Ruppert averaging). Our analysis suggests that a learning rate proportional to the inverse of the number of iterations, while leading to the optimal convergence rate in the strongly convex case, is not robust to the lack of strong convexity or the setting of the proportionality constant. This situation is remedied when using slower decays together with averaging, robustly leading to the optimal rate of convergence. We illustrate our theoretical results with simulations on synthetic and standard datasets.
6 0.68992233 121 nips-2011-Hogwild: A Lock-Free Approach to Parallelizing Stochastic Gradient Descent
7 0.6375801 80 nips-2011-Efficient Online Learning via Randomized Rounding
8 0.61290592 185 nips-2011-Newtron: an Efficient Bandit algorithm for Online Multiclass Prediction
9 0.6115182 72 nips-2011-Distributed Delayed Stochastic Optimization
10 0.53702325 77 nips-2011-Efficient Learning of Generalized Linear and Single Index Models with Isotonic Regression
11 0.53591442 39 nips-2011-Approximating Semidefinite Programs in Sublinear Time
12 0.49487579 103 nips-2011-Generalization Bounds and Consistency for Latent Structural Probit and Ramp Loss
13 0.48912612 4 nips-2011-A Convergence Analysis of Log-Linear Training
14 0.46398064 63 nips-2011-Convergence Rates of Inexact Proximal-Gradient Methods for Convex Optimization
15 0.4623107 272 nips-2011-Stochastic convex optimization with bandit feedback
16 0.45313141 181 nips-2011-Multiple Instance Learning on Structured Data
17 0.45179546 262 nips-2011-Sparse Inverse Covariance Matrix Estimation Using Quadratic Approximation
18 0.44687128 282 nips-2011-The Fast Convergence of Boosting
19 0.44676274 51 nips-2011-Clustered Multi-Task Learning Via Alternating Structure Optimization
20 0.43921098 204 nips-2011-Online Learning: Stochastic, Constrained, and Smoothed Adversaries
topicId topicWeight
[(0, 0.035), (4, 0.091), (20, 0.037), (26, 0.033), (31, 0.077), (33, 0.016), (43, 0.072), (45, 0.161), (57, 0.028), (65, 0.013), (74, 0.058), (83, 0.039), (89, 0.17), (99, 0.104)]
simIndex simValue paperId paperTitle
1 0.9508363 136 nips-2011-Inverting Grice's Maxims to Learn Rules from Natural Language Extractions
Author: Mohammad S. Sorower, Janardhan R. Doppa, Walker Orr, Prasad Tadepalli, Thomas G. Dietterich, Xiaoli Z. Fern
Abstract: We consider the problem of learning rules from natural language text sources. These sources, such as news articles and web texts, are created by a writer to communicate information to a reader, where the writer and reader share substantial domain knowledge. Consequently, the texts tend to be concise and mention the minimum information necessary for the reader to draw the correct conclusions. We study the problem of learning domain knowledge from such concise texts, which is an instance of the general problem of learning in the presence of missing data. However, unlike standard approaches to missing data, in this setting we know that facts are more likely to be missing from the text in cases where the reader can infer them from the facts that are mentioned combined with the domain knowledge. Hence, we can explicitly model this “missingness” process and invert it via probabilistic inference to learn the underlying domain knowledge. This paper introduces a mention model that models the probability of facts being mentioned in the text based on what other facts have already been mentioned and domain knowledge in the form of Horn clause rules. Learning must simultaneously search the space of rules and learn the parameters of the mention model. We accomplish this via an application of Expectation Maximization within a Markov Logic framework. An experimental evaluation on synthetic and natural text data shows that the method can learn accurate rules and apply them to new texts to make correct inferences. Experiments also show that the method outperforms the standard EM approach that assumes mentions are missing at random.
2 0.92287338 296 nips-2011-Uniqueness of Belief Propagation on Signed Graphs
Author: Yusuke Watanabe
Abstract: While loopy Belief Propagation (LBP) has been utilized in a wide variety of applications with empirical success, it comes with few theoretical guarantees. In particular, if the interactions of random variables in a graphical model are strong, the behaviors of the algorithm can be difficult to analyze due to underlying phase transitions. In this paper, we develop a novel approach to the uniqueness problem of the LBP fixed point; our new “necessary and sufficient” condition is stated in terms of graphs and signs, where the sign denotes the type (attractive/repulsive) of the interaction (i.e., compatibility function) on the edge. In all previous works, uniqueness is guaranteed only in the situations where the strength of the interactions is “sufficiently” small in certain senses. In contrast, our condition covers arbitrarily strong interactions on the specified class of signed graphs. The result of this paper is based on the recent theoretical advance in the LBP algorithm; the connection with the graph zeta function.
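For readers unfamiliar with the algorithm under analysis, here is a minimal sum-product (belief propagation) implementation on a three-node binary chain with an attractive pairwise potential. The graph and potentials are illustrative assumptions, not the paper's examples; since a chain is a tree, the fixed point is unique and the converged beliefs match the exact marginals, which the brute-force enumeration confirms.

```python
import itertools
import numpy as np

nodes = [0, 1, 2]
edges = [(0, 1), (1, 2)]                       # a chain, i.e. a tree: BP is exact here
unary = {0: np.array([0.6, 0.4]),
         1: np.array([0.5, 0.5]),
         2: np.array([0.3, 0.7])}
pair = np.array([[1.2, 0.8],
                 [0.8, 1.2]])                  # attractive pairwise potential

def neighbors(i):
    return [b if a == i else a for a, b in edges if i in (a, b)]

# messages m[(i, j)] from node i to node j, initialised uniform
msgs = {(i, j): np.ones(2)
        for a, b in edges for i, j in [(a, b), (b, a)]}

for _ in range(20):                            # parallel message updates
    new = {}
    for (i, j) in msgs:
        prod = unary[i].copy()                 # local potential times incoming messages
        for k in neighbors(i):
            if k != j:
                prod = prod * msgs[(k, i)]
        m = pair.T @ prod                      # marginalise out x_i (pair is symmetric)
        new[(i, j)] = m / m.sum()
    msgs = new

def belief(i):
    b = unary[i].copy()
    for k in neighbors(i):
        b = b * msgs[(k, i)]
    return b / b.sum()

def joint(x):
    p = 1.0
    for i in nodes:
        p *= unary[i][x[i]]
    for a, b in edges:
        p *= pair[x[a], x[b]]
    return p

states = list(itertools.product([0, 1], repeat=len(nodes)))
Z = sum(joint(x) for x in states)
exact = [sum(joint(x) for x in states if x[i] == 0) / Z for i in nodes]
bp = [belief(i)[0] for i in nodes]
print(bp, exact)
```

On loopy (non-tree) graphs the same updates may have multiple fixed points or fail to converge, which is exactly the regime the paper's zeta-function condition characterises.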
3 0.89196479 217 nips-2011-Practical Variational Inference for Neural Networks
Author: Alex Graves
Abstract: Variational methods have been previously explored as a tractable approximation to Bayesian inference for neural networks. However, the approaches proposed so far have only been applicable to a few simple network architectures. This paper introduces an easy-to-implement stochastic variational method (or equivalently, minimum description length loss function) that can be applied to most neural networks. Along the way it revisits several common regularisers from a variational perspective. It also provides a simple pruning heuristic that can both drastically reduce the number of network weights and lead to improved generalisation. Experimental results are provided for a hierarchical multidimensional recurrent neural network applied to the TIMIT speech corpus.
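As background, the core trick, fitting a diagonal Gaussian over the weights by gradient descent on a sampled data term plus a KL penalty toward the prior, can be sketched on a one-layer linear model. This is a hand-rolled illustration under assumed hyperparameters (prior scale, learning rate, softplus parameterisation), not the paper's method or code.

```python
import numpy as np

rng = np.random.default_rng(2)
d, n = 3, 400
w_true = np.array([1.0, -2.0, 0.5])
X = rng.normal(size=(n, d))
y = X @ w_true + 0.1 * rng.normal(size=n)

mu = np.zeros(d)                        # variational posterior mean
rho = np.full(d, -3.0)                  # sigma = softplus(rho) keeps scales positive
lr, sigma_p = 0.01, 1.0                 # learning rate and prior std (assumptions)

for step in range(3000):
    sigma = np.log1p(np.exp(rho))
    eps = rng.normal(size=d)
    w = mu + sigma * eps                # reparameterised weight sample
    grad_w = X.T @ (X @ w - y) / n      # gradient of the average squared error / 2
    # gradients of KL(q || N(0, sigma_p^2 I)) for a diagonal Gaussian q,
    # scaled by 1/n so the penalty is per data point
    kl_mu = mu / sigma_p**2
    kl_sigma = sigma / sigma_p**2 - 1.0 / sigma
    dsigma = 1.0 / (1.0 + np.exp(-rho))  # softplus'(rho)
    mu -= lr * (grad_w + kl_mu / n)
    rho -= lr * (grad_w * eps + kl_sigma / n) * dsigma

print(mu)
```

The posterior mean recovers the true weights; the learned per-weight scales are what a pruning heuristic like the paper's would inspect.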
same-paper 4 0.85883921 45 nips-2011-Beating SGD: Learning SVMs in Sublinear Time
Author: Elad Hazan, Tomer Koren, Nati Srebro
Abstract: We present an optimization approach for linear SVMs based on a stochastic primal-dual approach, where the primal step is akin to an importance-weighted SGD, and the dual step is a stochastic update on the importance weights. This yields an optimization method with a sublinear dependence on the training set size, and the first method for learning linear SVMs with runtime less than the size of the training set required for learning!
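A caricature of the primal-dual idea in this abstract can be sketched as follows: the primal step is an importance-weighted hinge-loss subgradient step, and the dual step is a stochastic multiplicative update on a single example's importance weight. This is not the paper's actual algorithm (which also samples individual features to make each iteration sublinear in the dimension); the data generator, step sizes, and margin filter below are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(1)
n, d = 500, 10
w_star = rng.normal(size=d)
X = rng.normal(size=(2 * n, d))
m = X @ w_star
keep = np.abs(m) > 0.5                  # enforce a margin: cleanly separable toy data
X, y = X[keep][:n], np.sign(m[keep])[:n]

w = np.zeros(d)
q = np.ones(n)                          # unnormalised importance weights (dual variable)
eta_w, eta_q = 0.1, 0.1

for t in range(5000):
    # primal step: stochastic hinge-loss subgradient at an importance-sampled example
    i = rng.choice(n, p=q / q.sum())
    if y[i] * (X[i] @ w) < 1:
        w += eta_w * y[i] * X[i]
    # dual step: multiplicative update on one uniformly sampled example's weight,
    # so high-loss examples get sampled more often in later primal steps
    j = rng.integers(n)
    loss_j = max(0.0, 1.0 - y[j] * (X[j] @ w))
    q[j] *= np.exp(eta_q * loss_j)
    if q.max() > 1e6:                   # rescale to keep the weights bounded
        q /= q.max()

acc = np.mean(np.sign(X @ w) == y)
print(f"training accuracy: {acc:.3f}")
```

The multiplicative dual update is the standard multiplicative-weights rule; it concentrates the sampling distribution on hard examples, which is the mechanism behind the importance-weighted primal step.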
5 0.804263 263 nips-2011-Sparse Manifold Clustering and Embedding
Author: Ehsan Elhamifar, René Vidal
Abstract: We propose an algorithm called Sparse Manifold Clustering and Embedding (SMCE) for simultaneous clustering and dimensionality reduction of data lying in multiple nonlinear manifolds. Similar to most dimensionality reduction methods, SMCE finds a small neighborhood around each data point and connects each point to its neighbors with appropriate weights. The key difference is that SMCE finds both the neighbors and the weights automatically. This is done by solving a sparse optimization problem, which encourages selecting nearby points that lie in the same manifold and approximately span a low-dimensional affine subspace. The optimal solution encodes information that can be used for clustering and dimensionality reduction using spectral clustering and embedding. Moreover, the size of the optimal neighborhood of a data point, which can be different for different points, provides an estimate of the dimension of the manifold to which the point belongs. Experiments demonstrate that our method can effectively handle multiple manifolds that are very close to each other, manifolds with non-uniform sampling and holes, as well as estimate the intrinsic dimensions of the manifolds.
6 0.80246282 186 nips-2011-Noise Thresholds for Spectral Clustering
7 0.79979551 29 nips-2011-Algorithms and hardness results for parallel large margin learning
8 0.79654729 80 nips-2011-Efficient Online Learning via Randomized Rounding
9 0.78789389 159 nips-2011-Learning with the weighted trace-norm under arbitrary sampling distributions
10 0.78596926 185 nips-2011-Newtron: an Efficient Bandit algorithm for Online Multiclass Prediction
11 0.78304893 16 nips-2011-A reinterpretation of the policy oscillation phenomenon in approximate policy iteration
12 0.77976418 149 nips-2011-Learning Sparse Representations of High Dimensional Data on Large Scale Dictionaries
13 0.77807659 204 nips-2011-Online Learning: Stochastic, Constrained, and Smoothed Adversaries
14 0.77644694 202 nips-2011-On the Universality of Online Mirror Descent
15 0.77615094 271 nips-2011-Statistical Tests for Optimization Efficiency
16 0.77463722 150 nips-2011-Learning a Distance Metric from a Network
17 0.7741071 172 nips-2011-Minimax Localization of Structural Information in Large Noisy Matrices
18 0.77372718 64 nips-2011-Convergent Bounds on the Euclidean Distance
19 0.77172226 70 nips-2011-Dimensionality Reduction Using the Sparse Linear Model
20 0.7715711 118 nips-2011-High-dimensional regression with noisy and missing data: Provable guarantees with non-convexity