nips nips2011 nips2011-271 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Levi Boyles, Anoop Korattikara, Deva Ramanan, Max Welling
Abstract: Learning problems, such as logistic regression, are typically formulated as pure optimization problems defined on some loss function. We argue that this view ignores the fact that the loss function depends on stochastically generated data which in turn determines an intrinsic scale of precision for statistical estimation. By considering the statistical properties of the update variables used during the optimization (e.g. gradients), we can construct frequentist hypothesis tests to determine the reliability of these updates. We utilize subsets of the data for computing updates, and use the hypothesis tests for determining when the batch-size needs to be increased. This provides computational benefits and avoids overfitting by stopping when the batch-size has become equal to size of the full dataset. Moreover, the proposed algorithms depend on a single interpretable parameter – the probability for an update to be in the wrong direction – which is set to a single value across all algorithms and datasets. In this paper, we illustrate these ideas on three L1 regularized coordinate descent algorithms: L1 -regularized L2 -loss SVMs, L1 -regularized logistic regression, and the Lasso, but we emphasize that the underlying methods are much more generally applicable. 1
Reference: text
sentIndex sentText sentNum sentScore
1 Learning problems, such as logistic regression, are typically formulated as pure optimization problems defined on some loss function. [sent-3, score-0.201]
2 We argue that this view ignores the fact that the loss function depends on stochastically generated data which in turn determines an intrinsic scale of precision for statistical estimation. [sent-4, score-0.262]
3 By considering the statistical properties of the update variables used during the optimization (e. [sent-5, score-0.219]
4 gradients), we can construct frequentist hypothesis tests to determine the reliability of these updates. [sent-7, score-0.479]
5 We utilize subsets of the data for computing updates, and use the hypothesis tests for determining when the batch-size needs to be increased. [sent-8, score-0.294]
6 Moreover, the proposed algorithms depend on a single interpretable parameter – the probability for an update to be in the wrong direction – which is set to a single value across all algorithms and datasets. [sent-10, score-0.222]
7 In this paper, we illustrate these ideas on three L1 regularized coordinate descent algorithms: L1 -regularized L2 -loss SVMs, L1 -regularized logistic regression, and the Lasso, but we emphasize that the underlying methods are much more generally applicable. [sent-11, score-0.329]
8 The most important feature we will exploit is the fact that the statistical properties of an estimation problem determine an intrinsic scale of precision (that is usually much larger than machine precision). [sent-15, score-0.231]
9 Besides a natural stopping criterion it also leads to much faster optimization before we reach that scale by realizing that far away from optimality we need much less precision to determine a parameter update than when close to optimality. [sent-17, score-0.423]
10 One of the important conclusions was that a not so impressive optimization algorithm such as stochastic gradient descent (SGD) can nevertheless be a very good learning algorithm because it can process more data per unit time. [sent-20, score-0.314]
11 In a frequentist world we may ask how different the value of the loss would have been if we had sampled another dataset of the same size from a single shared underlying distribution. [sent-28, score-0.221]
12 The role of an optimization algorithm is then to propose parameter updates that will be accepted or rejected on statistical grounds. [sent-29, score-0.24]
13 The test we propose determines whether the direction of a parameter update is correct with high probability. [sent-30, score-0.255]
14 If we do not pass our tests when using all the available data-cases then we stop learning (or alternatively we switch to sampling or bagging), because we have reached the intrinsic scale of precision set by the statistical properties of the estimation problem. [sent-31, score-0.554]
15 However, we can use the same tests to speed up the optimization process itself, that is before we reach the above stopping criterion. [sent-32, score-0.279]
16 In batch mode, using the whole (infinite) dataset, one would not take a single optimization step in finite time. [sent-34, score-0.169]
17 Our algorithm adaptively grows a subset of the data by requiring that we have just enough precision to confidently move in the correct direction. [sent-40, score-0.166]
18 For instance, gradient descent falls in this class, as the gradient is defined by an average (or sum). [sent-44, score-0.275]
19 As in [11], with large enough batch sizes we can use the Central Limit Theorem to claim that the average gradients are normally distributed and estimate their variance without actually seeing more data (this assumption is empirically verified in section 5. [sent-45, score-0.179]
20 We have furthermore implemented methods to avoid testing updates for parameters which are likely to fail their test. [sent-47, score-0.151]
21 • They depend on a single interpretable parameter ǫ – the probability to update parameters in the wrong direction. [sent-52, score-0.172]
22 The algorithms terminate when the probability to update the parameters in the wrong direction can not be made smaller than ǫ. [sent-55, score-0.222]
23 Throughout the learning process they determine the size of the data subset required to perform updates that move in the correct direction with probability at least 1 − ǫ. [sent-61, score-0.233]
24 In this paper we show how these considerations can be applied to L1 -regularized coordinate descent algorithms: L1 -regularized L2 -loss SVMs, L1 -regularized logistic regression, and Lasso [9]. [sent-63, score-0.329]
25 Coordinate descent algorithms are convenient because they do not require any tuning of hyper-parameters to be effective, and are still efficient when training sparse models. [sent-64, score-0.134]
26 In section 2 we review the coordinate descent algorithms. [sent-66, score-0.246]
27 In section 3.2 we formulate our hypothesis testing framework, followed by a heuristic for predicting hypothesis test failures in section 4. [sent-68, score-0.509]
28 Notably, the partial derivatives are functions of statistical averages computed over N training points. [sent-74, score-0.18]
29 We show that one can use frequentist hypothesis tests to elegantly manage the amount of data needed (N ) to reliably compute these quantities. [sent-75, score-0.399]
30 We write x_ij for the j-th element of datapoint x_i. [sent-78, score-0.218]
31 2.2 L1-regularized Logistic Regression. Using a log-loss function in (1), we obtain an L1-regularized logistic regression model: loss_log = log(1 + e^(−y_i β^T x_i)) (7); Appendix G of [10] derives the corresponding partial derivatives L′_j(0, β). [sent-81, score-0.213]
32 We can use this expression as an estimator for β from a dataset {x_i, y_i}. [sent-83, score-0.152]
33 The above update rule assumes standardized data ((1/N) Σ_i x_ij = 0, (1/N) Σ_i x_ij² = 1), but it is straightforward to extend for the general case. [sent-84, score-0.37]
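To make the coordinate update concrete, here is a minimal sketch of a single Lasso coordinate-descent step under the standardized-data assumption above. The soft-thresholding form and all function names are illustrative assumptions, not the authors' reference implementation.

```python
import numpy as np

def soft_threshold(a, gamma):
    """Soft-thresholding operator S(a, gamma) used in Lasso coordinate descent."""
    return np.sign(a) * max(abs(a) - gamma, 0.0)

def lasso_coordinate_update(X, y, beta, j, gamma):
    """One coordinate-descent proposal for the Lasso, assuming each column of X
    has zero mean and unit second moment (standardized data)."""
    N = X.shape[0]
    # Residual with feature j's current contribution removed.
    r = y - X @ beta + X[:, j] * beta[j]
    alpha_j = (X[:, j] @ r) / N            # average correlation of feature j with the residual
    return soft_threshold(alpha_j, gamma)  # proposed new value for beta_j
```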
34 3 Hypothesis Testing. Each update β_j^new = β_j + d_j is computed using a statistical average over a batch of N training points. [sent-85, score-0.548]
35 We wish to estimate the reliability of an update as a function of N . [sent-86, score-0.176]
36 This also makes the proposed updates dj and βj random variables because they are functions of the training points. [sent-88, score-0.382]
37 β_j^new, d_j, x_ij, y_i and their instantiations β̂_j^new, d̂_j, x̂_ij, ŷ_i. [sent-91, score-1.03]
38 We would like to determine whether or not a particular update is statistically justified. [sent-92, score-0.162]
39 To this end, we use hypothesis tests where if there is high uncertainty in the direction of the update, we say this update is not justified and the update is not performed. [sent-93, score-0.602]
40 For example, if our proposed update β̂_j^new is positive, we want to ensure that P(β_j^new < 0) is small. [sent-94, score-0.164]
41 1 Algorithm Overview We propose a “growing batch” algorithm for handling very large or infinite datasets: first we select a very small subsample of the data of size Nb ≪ N , and optimize until the entire set of parameters are failing their hypothesis tests (described in more detail below). [sent-96, score-0.33]
42 We then query more data points and include them in our batch, reducing the variance of our estimates and making it more likely that they will pass their tests. [sent-97, score-0.158]
43 We continue adding data to our batch until we are using the full dataset of size N . [sent-98, score-0.167]
44 Once all of the parameters are failing their hypothesis tests on the full batch of data, we stop training. [sent-99, score-0.495]
45 Thus, our algorithm behaves like a stochastic online algorithm during early stages and like a batch algorithm during later stages, equipped with a natural stopping condition. [sent-101, score-0.308]
46 In our experiments, we increase the batch size Nb by a factor of 10 once all parameters fail their hypothesis tests for a given batch. [sent-102, score-0.409]
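The control flow described in this overview could be sketched roughly as below. Here `coordinate_update` and `passes_test` stand in for the per-coordinate update and hypothesis test discussed in the surrounding sections, and the growth factor of 10 follows the experimental choice quoted above; everything else is an assumed illustration rather than the authors' code.

```python
import numpy as np

def growing_batch_descent(X, y, beta, coordinate_update, passes_test,
                          N_b=1000, growth=10, max_sweeps=1000):
    """Grow the batch whenever every coordinate fails its hypothesis test;
    stop once all coordinates fail on the full dataset."""
    N, p = X.shape
    N_b = min(N_b, N)
    for _ in range(max_sweeps):
        Xb, yb = X[:N_b], y[:N_b]        # current batch (assumes rows are pre-shuffled)
        any_accepted = False
        for j in range(p):
            proposal = coordinate_update(Xb, yb, beta, j)
            if passes_test(Xb, yb, beta, j, proposal):
                beta[j] = proposal
                any_accepted = True
        if not any_accepted:
            if N_b == N:                 # all tests fail on the full data: stop learning
                break
            N_b = min(N_b * growth, N)   # query more data and try again
    return beta
```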
47 3.2 Lasso. For quadratic loss functions with standardized variables, we can directly analyze the densities of d_j and β_j^new. [sent-105, score-0.335]
48 We accept an update if the sign of dj can be estimated with sufficient probability. [sent-106, score-0.416]
49 We rewrite it as: α_j = (1/N) Σ_{i=1}^N z_ij(β), where z_ij(β) = x_ij (y_i − ỹ_i^(j)) (12). Because the z_ij(β) are given by a fixed transformation of the iid training points, they themselves are iid. [sent-108, score-0.978]
50 So, for any given α_j, we can provide estimates E[z_ij] ≈ ẑ_j = (1/N) Σ_i ẑ_ij and Var(z_ij) ≈ σ̂²_zj = (1/(N−1)) Σ_i (ẑ_ij − ẑ_j)² (13). [Figure 1 panel titles: AP and Time responses to ε (LR on INRIA dataset); Q-Q Plots of Gradient Distributions.] [sent-112, score-0.307]
51 (middle) Q-Q plot demonstrating the normality of the gradients on the L1 -regularized L2 -loss SVM, computed at various stages of the algorithm (i. [sent-143, score-0.198]
52 (right ) Plot showing the behavior of our algorithm with respect to ǫ, using logistic regression on the INRIA dataset. [sent-147, score-0.133]
53 ǫ = 0.5 corresponds to an algorithm which always updates (with no stopping criteria), so for these experiments ǫ was chosen in the range [...]. [sent-149, score-0.191]
54 Let d̂_j be the realization of the random variable d_j = β_j^new − β_j, computed from the sample batch of N training points. [sent-157, score-0.659]
55 If d̂_j > 0, then we want P(d_j ≤ 0) to be small, and vice versa; specifically, for d̂_j > 0 we require P(d_j ≤ 0) < ǫ. [sent-158, score-0.307]
56 Similarly, one can define an analogous test of P(d_j ≥ 0) < ǫ for d̂_j < 0. [sent-161, score-0.277]
57 These are the hypothesis test equations for a single coordinate, so this test is performed once for each coordinate at its iteration in the coordinate descent algorithm. [sent-162, score-0.615]
58 If a coordinate update fails its test, then we assume that we do not have enough evidence to perform an update on the coordinate, and do not update. [sent-163, score-0.401]
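As a rough sketch of how such a one-sided direction test could be computed in practice, one can estimate the mean and standard error of the per-point contributions to the update statistic, invoke the CLT-based normal approximation mentioned earlier, and compare the wrong-direction tail probability against ǫ. The exact statistic used by the authors may differ; the function below is an assumption for illustration.

```python
import numpy as np
from scipy.stats import norm

def sign_is_reliable(contributions, eps=1e-4):
    """Return True if the sign of the mean of `contributions` can be trusted,
    i.e. the probability of the opposite sign is below eps (CLT approximation)."""
    N = len(contributions)
    mean = contributions.mean()
    se = contributions.std(ddof=1) / np.sqrt(N)       # standard error of the mean
    if mean > 0:
        p_wrong = norm.cdf(0.0, loc=mean, scale=se)   # P(statistic <= 0)
    else:
        p_wrong = norm.sf(0.0, loc=mean, scale=se)    # P(statistic >= 0)
    return p_wrong < eps
```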
59 3.3 Gradient-Based Hypothesis Tests. For general convex loss functions, it is difficult to construct a pdf for d_j and β_j^new. [sent-167, score-0.349]
60 Instead, we accept an update β_j^new if the sign of the partial derivative ∂f(β)/∂β_j can be estimated with sufficient reliability. [sent-168, score-0.219]
61 The minimal (in magnitude) subgradient g_j, associated with the flattest lower tangent, is: g_j = α_j − γ if β_j < 0, g_j = α_j + γ if β_j > 0, and g_j = S(α_j, γ) otherwise, where α_j = L′_j(0, β) = (1/N) Σ_{i=1}^N z_ij (15) [sent-170, score-0.373]
62 where z_ij(β) = −2 y_i x_ij b_i(β) for the squared hinge-loss and z_ij(β) = −y_i x_ij / (1 + e^(y_i β^T x_i)) for the log-loss. Appealing to the same arguments as in Sec. [sent-171, score-0.713]
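For concreteness, the per-point terms z_ij just quoted could be computed along the following lines; the hinge truncation (only margin violators contribute) and the numerator of the log-loss term are assumptions based on the reconstruction above.

```python
import numpy as np

def z_squared_hinge(X, y, beta, j):
    """Per-point contributions z_ij = -2 * y_i * x_ij * b_i(beta) for the
    L2-loss (squared hinge) SVM, with b_i = 1 - y_i * beta^T x_i clipped at 0
    so that only margin violators contribute (an assumption of this sketch)."""
    b = np.maximum(1.0 - y * (X @ beta), 0.0)
    return -2.0 * y * X[:, j] * b

def z_log_loss(X, y, beta, j):
    """Per-point contributions z_ij = -y_i * x_ij / (1 + exp(y_i * beta^T x_i))
    for logistic regression."""
    margins = y * (X @ beta)
    return -y * X[:, j] / (1.0 + np.exp(margins))
```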
63 To formulate our hypothesis test, we write ĝ_j as the realization of the random variable g_j, computed from the batch of N training points. [sent-175, score-0.612]
64 We want to take an update only if our update is in the correct direction. [Figure 2 panel titles: SVM Algorithm Comparison on the INRIA dataset; SVM Algorithm Comparison on the VOC dataset; Logistic Regression Algorithm Comparison on the INRIA Dataset.] [sent-176, score-0.433]
65 “CD-Full” denotes our method using all applicable heuristic speedups, “CD-Hyp Testing” does not use the shrinking heuristic while “vanilla CD” simply performs coordinate descent without any speedup methods. [sent-202, score-0.454]
66 “SGD” is stochastic gradient descent with an annealing schedule. [sent-203, score-0.306]
67 Optimization of the hyper-parameters of the annealing schedule (on train data) was not included in the total runtime. [sent-204, score-0.176]
68 Note that our method achieves the optimal precision faster than SGD and also stops learning approximately when overfitting sets in. [sent-205, score-0.13]
69 We want this direction to be correct with high probability: for ĝ_j > 0, we want P(g_j ≤ 0) < ǫ, where P(g_j ≤ 0) = Φ((0 − (μ_αj − γ))/σ_αj) if β_j ≤ 0 and Φ((0 − (μ_αj + γ))/σ_αj) if β_j > 0 (16). We can likewise define a test of P(g_j ≥ 0) < ǫ which we use to accept updates given a negative estimated gradient ĝ_j < 0. [sent-206, score-0.617]
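A sketch of the reconstructed test (16): estimate μ_αj and σ_αj from the batch, shift by ±γ depending on the sign of β_j, and accept the update only if the wrong-direction probability falls below ǫ. Where the extraction above was ambiguous, the details below are assumptions.

```python
import numpy as np
from scipy.stats import norm

def gradient_test_passes(z, beta_j, gamma, eps=1e-4):
    """Accept an update on coordinate j only if the sign of the (sub)gradient
    g_j is reliable at level eps, following the reconstructed test (16).

    z      : per-point contributions z_ij, so alpha_j is their mean
    beta_j : current value of the coordinate
    gamma  : L1 regularization strength
    """
    N = len(z)
    mu_alpha = z.mean()
    sigma_alpha = z.std(ddof=1) / np.sqrt(N)     # CLT standard error of alpha_j
    shift = -gamma if beta_j <= 0 else gamma
    g_hat = mu_alpha + shift                     # estimated subgradient g_j
    if g_hat > 0:
        p_wrong = norm.cdf(0.0, loc=g_hat, scale=sigma_alpha)   # P(g_j <= 0)
    else:
        p_wrong = norm.sf(0.0, loc=g_hat, scale=sigma_alpha)    # P(g_j >= 0)
    return p_wrong < eps
```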
70 4 Additional Speedups. It often occurs that many coordinates will fail their respective hypothesis tests for several consecutive iterations, so predicting these consecutive failures and skipping computations on these coordinates could potentially save computation. [sent-207, score-0.374]
71 In our case, we wish to predict when the gradient will have moved to a point where the associated hypothesis test will pass. [sent-214, score-0.272]
72 We wish to detect when the gradient will result in the hypothesis test passing, that is, we want to find the values μ_α ≈ α̂, where α̂ is a realization of the random variable α, such that P(g ≥ 0) = ǫ or P(g ≤ 0) = ǫ. [sent-216, score-0.346]
73 For this purpose, we need to draw the distinction between an update which was taken, and one which is proposed but for which the hypothesis test failed. [sent-217, score-0.315]
74 Let the set of accepted updates be indexed by t, as in ĝ_t, and let the set of potential updates, after an accepted update at time t, be indexed by s, as in ĝ_t(s). [sent-218, score-0.595]
75 Thus the algorithm described in the previous section will compute ĝ_t(1). [sent-219, score-0.14]
76 We compute ĝ_t(s) for increasing s until the hypothesis test passes for ĝ_t(s*); we then set ĝ_{t+1}(0) = ĝ_t(s*) and perform an update to β using ĝ_{t+1}(0). [sent-222, score-0.595]
77 Ideally we would not compute ĝ_t(1), . . . , ĝ_t(s* − 1) at all, and instead only compute the gradient when we know the hypothesis test will pass, s* iterations after the last accept. [sent-226, score-0.272]
78 Given that we have some scheme for skipping k iterations, we estimate a "velocity" at which ĝ_t(s) = α̂_t(s) changes: Δe ≡ α̂_t(s) − α̂_t(s−k−1). [sent-227, score-0.186]
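One way the velocity-based skipping just described might look in code; since the normalization of Δe and the extrapolation target were not fully recoverable from the extracted text, both are assumptions of this sketch.

```python
import numpy as np

def predict_iterations_to_skip(alpha_now, alpha_then, k, alpha_target):
    """Estimate how many future iterations to skip before the hypothesis test
    on this coordinate is expected to pass.

    alpha_now    : current estimate, alpha_t(s)
    alpha_then   : estimate recorded k+1 proposals ago, alpha_t(s-k-1)
    k            : number of iterations skipped between the two estimates
    alpha_target : value of alpha at which the test would just pass (P = eps boundary)
    """
    velocity = (alpha_now - alpha_then) / (k + 1)   # assumed per-iteration change
    gap = alpha_target - alpha_now
    if velocity == 0.0 or np.sign(velocity) != np.sign(gap):
        return 1                                    # not moving toward the target: re-test next sweep
    return max(1, int(np.ceil(gap / velocity)))
```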
79 Note that all algorithms have very similar precision scores in the interval [0. [sent-263, score-0.13]
80 4.1 Shrinking Strategy. It is common in SVM algorithms to employ a "shrinking" strategy in which datapoints which do not contribute to the loss are removed from future computations. [sent-278, score-0.131]
81 Specifically, if a data point (x_i, y_i) has the property that b_i = 1 − y_i β^T x_i < ǫ_shrink < 0, for some ǫ_shrink, then the data point is removed from the current batch. [sent-279, score-0.289]
82 We employ this heuristic in our SVM implementation, and Figure 2 shows the relative performance with and without it. [sent-281, score-0.164]
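A minimal sketch of the shrinking heuristic as described above: data points whose margin term b_i has fallen below the negative threshold ǫ_shrink no longer contribute to the loss and are dropped from the current batch. Variable names and the default threshold are illustrative assumptions.

```python
import numpy as np

def shrink_batch(X, y, beta, eps_shrink=-1.0):
    """Remove points that are confidently outside the margin, i.e. with
    b_i = 1 - y_i * beta^T x_i < eps_shrink < 0, from the current batch."""
    b = 1.0 - y * (X @ beta)
    keep = b >= eps_shrink       # keep points that may still contribute to the loss
    return X[keep], y[keep]
```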
83 For both datasets, we measure performance using the standard PASCAL evaluation protocol of average precision (with 50% overlap of predicted/ground truth bounding boxes). [sent-286, score-0.13]
84 On such large training sets, one would expect delicately-tuned stochastic online algorithms (such as SGD) to outperform standard batch optimization (such as coordinate descent). [sent-287, score-0.408]
85 For these experiments, we focus on the gradients of the L1 -regularized, L2 -loss SVM computed during various stages of the optimization process. [sent-292, score-0.169]
86 Figure 2 shows a comparison between our method and stochastic gradient descent on the INRIA and VOC datasets. [sent-303, score-0.223]
87 Our method including the shrinking strategy is faster for the SVM, while methods without a data shrinking strategy, such as logistic regression, are still competitive (see Figure 2). [sent-304, score-0.264]
88 In comparing our methods to the coordinate descent upon which ours are based, we see that our framework provides a considerable speedup over standard coordinate descent. [sent-305, score-0.389]
89 We do this with a method which eventually uses the entire batch of data, so the tricks that enable SGD to converge in an L1 -regularized problem are not necessary. [sent-306, score-0.151]
90 To further demonstrate the robustness of our method to ǫ, we performed 5 trials of logistic regression on the INRIA dataset with a wide range of values of ǫ, with random initializations, shown in Figure 1. [sent-319, score-0.185]
91 6 Conclusions We have introduced a new framework for optimization problems from a statistical, frequentist point of view. [sent-322, score-0.159]
92 In fact, we argue that when we are using all of our data and cannot determine with statistical confidence that our update is in the correct direction, we should stop learning to avoid overfitting. [sent-325, score-0.284]
93 maximum likelihood) and learning-theory approaches which formulate learning as the optimization of some loss function. [sent-329, score-0.157]
94 By predicting when updates will pass their statistical tests, we can update each feature approximately with the correct frequency. [sent-337, score-0.473]
95 However, the variable has a clear meaning – the allowed probability that an update moves in the wrong direction. [sent-339, score-0.172]
96 Our method is not limited to L1 methods or linear models; our framework can be used on any algorithm in which we take updates which are simple functions on averages over the data. [sent-342, score-0.147]
97 Relative to vanilla coordinate descent, our algorithms can handle dense datasets with N >> p. [sent-343, score-0.218]
98 Relative to SGD2 our method can be thought of as “self-annealing” in the sense that it increases its precision by adaptively increasing the dataset size. [sent-344, score-0.182]
99 The advantages over SGD are therefore that we avoid tuning hyper-parameters of an annealing schedule and that we have an automated stopping criterion. [sent-345, score-0.253]
100 Stochastic gradient descent training for l1-regularized log-linear models with cumulative penalty. [sent-441, score-0.22]
wordName wordTfidf (topN-words)
[('sgd', 0.513), ('dj', 0.237), ('zij', 0.223), ('xij', 0.178), ('inria', 0.174), ('pass', 0.158), ('tests', 0.148), ('hypothesis', 0.146), ('coordinate', 0.143), ('gt', 0.14), ('precision', 0.13), ('update', 0.129), ('gj', 0.121), ('cd', 0.118), ('batch', 0.115), ('updates', 0.114), ('frequentist', 0.105), ('dej', 0.104), ('descent', 0.103), ('yi', 0.1), ('schedule', 0.093), ('eyi', 0.091), ('gradient', 0.086), ('svm', 0.085), ('unregularized', 0.084), ('logistic', 0.083), ('annealing', 0.083), ('normality', 0.083), ('lasso', 0.082), ('voc', 0.078), ('stopping', 0.077), ('vanilla', 0.075), ('shrinking', 0.074), ('lj', 0.071), ('nb', 0.069), ('heuristic', 0.067), ('gradients', 0.064), ('loss', 0.064), ('optimization', 0.054), ('dataset', 0.052), ('boyles', 0.052), ('korattikara', 0.052), ('kskip', 0.052), ('squashed', 0.052), ('stages', 0.051), ('regression', 0.05), ('accept', 0.05), ('direction', 0.05), ('stop', 0.05), ('pascal', 0.05), ('bi', 0.049), ('pdf', 0.048), ('speedups', 0.048), ('seconds', 0.047), ('reliability', 0.047), ('skipping', 0.046), ('adverse', 0.046), ('grenoble', 0.046), ('wrong', 0.043), ('welling', 0.043), ('ar', 0.043), ('zj', 0.042), ('quantiles', 0.042), ('derivatives', 0.04), ('xi', 0.04), ('partial', 0.04), ('test', 0.04), ('formulate', 0.039), ('realization', 0.039), ('datapoints', 0.037), ('impressive', 0.037), ('citeseer', 0.037), ('testing', 0.037), ('decay', 0.037), ('statistical', 0.036), ('correct', 0.036), ('accepted', 0.036), ('tricks', 0.036), ('failing', 0.036), ('straight', 0.036), ('want', 0.035), ('failures', 0.034), ('standardized', 0.034), ('stochastic', 0.034), ('central', 0.034), ('determine', 0.033), ('schedules', 0.033), ('competitive', 0.033), ('averages', 0.033), ('intrinsic', 0.032), ('const', 0.032), ('justi', 0.032), ('bottou', 0.031), ('irvine', 0.031), ('worked', 0.031), ('online', 0.031), ('training', 0.031), ('shifted', 0.03), ('employ', 0.03), ('ij', 0.029)]
simIndex simValue paperId paperTitle
same-paper 1 1.0000005 271 nips-2011-Statistical Tests for Optimization Efficiency
Author: Levi Boyles, Anoop Korattikara, Deva Ramanan, Max Welling
Abstract: Learning problems, such as logistic regression, are typically formulated as pure optimization problems defined on some loss function. We argue that this view ignores the fact that the loss function depends on stochastically generated data which in turn determines an intrinsic scale of precision for statistical estimation. By considering the statistical properties of the update variables used during the optimization (e.g. gradients), we can construct frequentist hypothesis tests to determine the reliability of these updates. We utilize subsets of the data for computing updates, and use the hypothesis tests for determining when the batch-size needs to be increased. This provides computational benefits and avoids overfitting by stopping when the batch-size has become equal to size of the full dataset. Moreover, the proposed algorithms depend on a single interpretable parameter – the probability for an update to be in the wrong direction – which is set to a single value across all algorithms and datasets. In this paper, we illustrate these ideas on three L1 regularized coordinate descent algorithms: L1 -regularized L2 -loss SVMs, L1 -regularized logistic regression, and the Lasso, but we emphasize that the underlying methods are much more generally applicable. 1
2 0.24250574 187 nips-2011-Non-Asymptotic Analysis of Stochastic Approximation Algorithms for Machine Learning
Author: Eric Moulines, Francis R. Bach
Abstract: We consider the minimization of a convex objective function defined on a Hilbert space, which is only available through unbiased estimates of its gradients. This problem includes standard machine learning algorithms such as kernel logistic regression and least-squares regression, and is commonly referred to as a stochastic approximation problem in the operations research community. We provide a non-asymptotic analysis of the convergence of two well-known algorithms, stochastic gradient descent (a.k.a. Robbins-Monro algorithm) as well as a simple modification where iterates are averaged (a.k.a. Polyak-Ruppert averaging). Our analysis suggests that a learning rate proportional to the inverse of the number of iterations, while leading to the optimal convergence rate in the strongly convex case, is not robust to the lack of strong convexity or the setting of the proportionality constant. This situation is remedied when using slower decays together with averaging, robustly leading to the optimal rate of convergence. We illustrate our theoretical results with simulations on synthetic and standard datasets.
3 0.21607222 46 nips-2011-Better Mini-Batch Algorithms via Accelerated Gradient Methods
Author: Andrew Cotter, Ohad Shamir, Nati Srebro, Karthik Sridharan
Abstract: Mini-batch algorithms have been proposed as a way to speed-up stochastic convex optimization problems. We study how such algorithms can be improved using accelerated gradient methods. We provide a novel analysis, which shows how standard gradient methods may sometimes be insufficient to obtain a significant speed-up and propose a novel accelerated gradient algorithm, which deals with this deficiency, enjoys a uniformly superior guarantee and works well in practice. 1
4 0.19068058 45 nips-2011-Beating SGD: Learning SVMs in Sublinear Time
Author: Elad Hazan, Tomer Koren, Nati Srebro
Abstract: We present an optimization approach for linear SVMs based on a stochastic primal-dual approach, where the primal step is akin to an importance-weighted SGD, and the dual step is a stochastic update on the importance weights. This yields an optimization method with a sublinear dependence on the training set size, and the first method for learning linear SVMs with runtime less then the size of the training set required for learning! 1
5 0.15591924 121 nips-2011-Hogwild: A Lock-Free Approach to Parallelizing Stochastic Gradient Descent
Author: Benjamin Recht, Christopher Re, Stephen Wright, Feng Niu
Abstract: Stochastic Gradient Descent (SGD) is a popular algorithm that can achieve stateof-the-art performance on a variety of machine learning tasks. Several researchers have recently proposed schemes to parallelize SGD, but all require performancedestroying memory locking and synchronization. This work aims to show using novel theoretical analysis, algorithms, and implementation that SGD can be implemented without any locking. We present an update scheme called H OGWILD ! which allows processors access to shared memory with the possibility of overwriting each other’s work. We show that when the associated optimization problem is sparse, meaning most gradient updates only modify small parts of the decision variable, then H OGWILD ! achieves a nearly optimal rate of convergence. We demonstrate experimentally that H OGWILD ! outperforms alternative schemes that use locking by an order of magnitude. 1
6 0.13853215 182 nips-2011-Nearest Neighbor based Greedy Coordinate Descent
7 0.13497397 262 nips-2011-Sparse Inverse Covariance Matrix Estimation Using Quadratic Approximation
8 0.11862093 165 nips-2011-Matrix Completion for Multi-label Image Classification
9 0.11388683 247 nips-2011-Semantic Labeling of 3D Point Clouds for Indoor Scenes
10 0.10384889 118 nips-2011-High-dimensional regression with noisy and missing data: Provable guarantees with non-convexity
11 0.10028391 96 nips-2011-Fast and Balanced: Efficient Label Tree Learning for Large Scale Object Recognition
12 0.095091477 276 nips-2011-Structured sparse coding via lateral inhibition
13 0.095037676 98 nips-2011-From Bandits to Experts: On the Value of Side-Observations
14 0.081859395 180 nips-2011-Multiple Instance Filtering
15 0.076847836 72 nips-2011-Distributed Delayed Stochastic Optimization
16 0.075833254 178 nips-2011-Multiclass Boosting: Theory and Algorithms
17 0.074693576 149 nips-2011-Learning Sparse Representations of High Dimensional Data on Large Scale Dictionaries
18 0.071857974 289 nips-2011-Trace Lasso: a trace norm regularization for correlated designs
19 0.067424245 202 nips-2011-On the Universality of Online Mirror Descent
20 0.067022935 258 nips-2011-Sparse Bayesian Multi-Task Learning
topicId topicWeight
[(0, 0.23), (1, -0.02), (2, -0.078), (3, -0.061), (4, -0.025), (5, 0.105), (6, 0.022), (7, -0.037), (8, -0.169), (9, 0.084), (10, 0.078), (11, -0.132), (12, -0.156), (13, -0.148), (14, -0.057), (15, 0.228), (16, -0.05), (17, 0.124), (18, 0.01), (19, 0.106), (20, -0.091), (21, 0.048), (22, -0.117), (23, -0.011), (24, -0.035), (25, 0.143), (26, -0.049), (27, 0.062), (28, -0.004), (29, 0.162), (30, 0.05), (31, -0.085), (32, 0.009), (33, -0.113), (34, -0.056), (35, -0.002), (36, -0.019), (37, 0.1), (38, 0.008), (39, 0.054), (40, 0.035), (41, -0.008), (42, 0.072), (43, 0.052), (44, -0.024), (45, 0.019), (46, 0.08), (47, -0.095), (48, 0.055), (49, 0.058)]
simIndex simValue paperId paperTitle
same-paper 1 0.94627434 271 nips-2011-Statistical Tests for Optimization Efficiency
Author: Levi Boyles, Anoop Korattikara, Deva Ramanan, Max Welling
Abstract: Learning problems, such as logistic regression, are typically formulated as pure optimization problems defined on some loss function. We argue that this view ignores the fact that the loss function depends on stochastically generated data which in turn determines an intrinsic scale of precision for statistical estimation. By considering the statistical properties of the update variables used during the optimization (e.g. gradients), we can construct frequentist hypothesis tests to determine the reliability of these updates. We utilize subsets of the data for computing updates, and use the hypothesis tests for determining when the batch-size needs to be increased. This provides computational benefits and avoids overfitting by stopping when the batch-size has become equal to size of the full dataset. Moreover, the proposed algorithms depend on a single interpretable parameter – the probability for an update to be in the wrong direction – which is set to a single value across all algorithms and datasets. In this paper, we illustrate these ideas on three L1 regularized coordinate descent algorithms: L1 -regularized L2 -loss SVMs, L1 -regularized logistic regression, and the Lasso, but we emphasize that the underlying methods are much more generally applicable. 1
2 0.78584003 46 nips-2011-Better Mini-Batch Algorithms via Accelerated Gradient Methods
Author: Andrew Cotter, Ohad Shamir, Nati Srebro, Karthik Sridharan
Abstract: Mini-batch algorithms have been proposed as a way to speed-up stochastic convex optimization problems. We study how such algorithms can be improved using accelerated gradient methods. We provide a novel analysis, which shows how standard gradient methods may sometimes be insufficient to obtain a significant speed-up and propose a novel accelerated gradient algorithm, which deals with this deficiency, enjoys a uniformly superior guarantee and works well in practice. 1
3 0.77795315 187 nips-2011-Non-Asymptotic Analysis of Stochastic Approximation Algorithms for Machine Learning
Author: Eric Moulines, Francis R. Bach
Abstract: We consider the minimization of a convex objective function defined on a Hilbert space, which is only available through unbiased estimates of its gradients. This problem includes standard machine learning algorithms such as kernel logistic regression and least-squares regression, and is commonly referred to as a stochastic approximation problem in the operations research community. We provide a non-asymptotic analysis of the convergence of two well-known algorithms, stochastic gradient descent (a.k.a. Robbins-Monro algorithm) as well as a simple modification where iterates are averaged (a.k.a. Polyak-Ruppert averaging). Our analysis suggests that a learning rate proportional to the inverse of the number of iterations, while leading to the optimal convergence rate in the strongly convex case, is not robust to the lack of strong convexity or the setting of the proportionality constant. This situation is remedied when using slower decays together with averaging, robustly leading to the optimal rate of convergence. We illustrate our theoretical results with simulations on synthetic and standard datasets.
4 0.77115422 121 nips-2011-Hogwild: A Lock-Free Approach to Parallelizing Stochastic Gradient Descent
Author: Benjamin Recht, Christopher Re, Stephen Wright, Feng Niu
Abstract: Stochastic Gradient Descent (SGD) is a popular algorithm that can achieve stateof-the-art performance on a variety of machine learning tasks. Several researchers have recently proposed schemes to parallelize SGD, but all require performancedestroying memory locking and synchronization. This work aims to show using novel theoretical analysis, algorithms, and implementation that SGD can be implemented without any locking. We present an update scheme called H OGWILD ! which allows processors access to shared memory with the possibility of overwriting each other’s work. We show that when the associated optimization problem is sparse, meaning most gradient updates only modify small parts of the decision variable, then H OGWILD ! achieves a nearly optimal rate of convergence. We demonstrate experimentally that H OGWILD ! outperforms alternative schemes that use locking by an order of magnitude. 1
5 0.70205528 45 nips-2011-Beating SGD: Learning SVMs in Sublinear Time
Author: Elad Hazan, Tomer Koren, Nati Srebro
Abstract: We present an optimization approach for linear SVMs based on a stochastic primal-dual approach, where the primal step is akin to an importance-weighted SGD, and the dual step is a stochastic update on the importance weights. This yields an optimization method with a sublinear dependence on the training set size, and the first method for learning linear SVMs with runtime less then the size of the training set required for learning! 1
6 0.59140581 72 nips-2011-Distributed Delayed Stochastic Optimization
7 0.56809074 182 nips-2011-Nearest Neighbor based Greedy Coordinate Descent
8 0.54316723 202 nips-2011-On the Universality of Online Mirror Descent
9 0.45861369 143 nips-2011-Learning Anchor Planes for Classification
10 0.44589612 262 nips-2011-Sparse Inverse Covariance Matrix Estimation Using Quadratic Approximation
11 0.44248363 77 nips-2011-Efficient Learning of Generalized Linear and Single Index Models with Isotonic Regression
12 0.44111583 27 nips-2011-Advice Refinement in Knowledge-Based SVMs
13 0.43410787 148 nips-2011-Learning Probabilistic Non-Linear Latent Variable Models for Tracking Complex Activities
14 0.43402502 4 nips-2011-A Convergence Analysis of Log-Linear Training
15 0.41794708 63 nips-2011-Convergence Rates of Inexact Proximal-Gradient Methods for Convex Optimization
16 0.41342118 276 nips-2011-Structured sparse coding via lateral inhibition
17 0.39974317 197 nips-2011-On Tracking The Partition Function
18 0.39007396 111 nips-2011-Hashing Algorithms for Large-Scale Learning
19 0.37904334 69 nips-2011-Differentially Private M-Estimators
20 0.37596259 118 nips-2011-High-dimensional regression with noisy and missing data: Provable guarantees with non-convexity
topicId topicWeight
[(0, 0.023), (4, 0.035), (20, 0.037), (26, 0.051), (31, 0.081), (33, 0.032), (43, 0.075), (45, 0.161), (57, 0.039), (65, 0.015), (74, 0.065), (83, 0.078), (85, 0.131), (99, 0.109)]
simIndex simValue paperId paperTitle
1 0.93077993 144 nips-2011-Learning Auto-regressive Models from Sequence and Non-sequence Data
Author: Tzu-kuo Huang, Jeff G. Schneider
Abstract: Vector Auto-regressive models (VAR) are useful tools for analyzing time series data. In quite a few modern time series modelling tasks, the collection of reliable time series turns out to be a major challenge, either due to the slow progression of the dynamic process of interest, or inaccessibility of repetitive measurements of the same dynamic process over time. In those situations, however, we observe that it is often easier to collect a large amount of non-sequence samples, or snapshots of the dynamic process of interest. In this work, we assume a small amount of time series data are available, and propose methods to incorporate non-sequence data into penalized least-square estimation of VAR models. We consider non-sequence data as samples drawn from the stationary distribution of the underlying VAR model, and devise a novel penalization scheme based on the Lyapunov equation concerning the covariance of the stationary distribution. Experiments on synthetic and video data demonstrate the effectiveness of the proposed methods. 1
2 0.92307615 299 nips-2011-Variance Penalizing AdaBoost
Author: Pannagadatta K. Shivaswamy, Tony Jebara
Abstract: This paper proposes a novel boosting algorithm called VadaBoost which is motivated by recent empirical Bernstein bounds. VadaBoost iteratively minimizes a cost function that balances the sample mean and the sample variance of the exponential loss. Each step of the proposed algorithm minimizes the cost efficiently by providing weighted data to a weak learner rather than requiring a brute force evaluation of all possible weak learners. Thus, the proposed algorithm solves a key limitation of previous empirical Bernstein boosting methods which required brute force enumeration of all possible weak learners. Experimental results confirm that the new algorithm achieves the performance improvements of EBBoost yet goes beyond decision stumps to handle any weak learner. Significant performance gains are obtained over AdaBoost for arbitrary weak learners including decision trees (CART). 1
3 0.90506881 35 nips-2011-An ideal observer model for identifying the reference frame of objects
Author: Joseph L. Austerweil, Abram L. Friesen, Thomas L. Griffiths
Abstract: The object people perceive in an image can depend on its orientation relative to the scene it is in (its reference frame). For example, the images of the symbols × and + differ by a 45 degree rotation. Although real scenes have multiple images and reference frames, psychologists have focused on scenes with only one reference frame. We propose an ideal observer model based on nonparametric Bayesian statistics for inferring the number of reference frames in a scene and their parameters. When an ambiguous image could be assigned to two conflicting reference frames, the model predicts two factors should influence the reference frame inferred for the image: The image should be more likely to share the reference frame of the closer object (proximity) and it should be more likely to share the reference frame containing the most objects (alignment). We confirm people use both cues using a novel methodology that allows for easy testing of human reference frame inference. 1
same-paper 4 0.89228952 271 nips-2011-Statistical Tests for Optimization Efficiency
Author: Levi Boyles, Anoop Korattikara, Deva Ramanan, Max Welling
Abstract: Learning problems, such as logistic regression, are typically formulated as pure optimization problems defined on some loss function. We argue that this view ignores the fact that the loss function depends on stochastically generated data which in turn determines an intrinsic scale of precision for statistical estimation. By considering the statistical properties of the update variables used during the optimization (e.g. gradients), we can construct frequentist hypothesis tests to determine the reliability of these updates. We utilize subsets of the data for computing updates, and use the hypothesis tests for determining when the batch-size needs to be increased. This provides computational benefits and avoids overfitting by stopping when the batch-size has become equal to size of the full dataset. Moreover, the proposed algorithms depend on a single interpretable parameter – the probability for an update to be in the wrong direction – which is set to a single value across all algorithms and datasets. In this paper, we illustrate these ideas on three L1 regularized coordinate descent algorithms: L1 -regularized L2 -loss SVMs, L1 -regularized logistic regression, and the Lasso, but we emphasize that the underlying methods are much more generally applicable. 1
5 0.86709064 186 nips-2011-Noise Thresholds for Spectral Clustering
Author: Sivaraman Balakrishnan, Min Xu, Akshay Krishnamurthy, Aarti Singh
Abstract: Although spectral clustering has enjoyed considerable empirical success in machine learning, its theoretical properties are not yet fully developed. We analyze the performance of a spectral algorithm for hierarchical clustering and show that on a class of hierarchically structured similarity matrices, this algorithm can tolerate noise that grows with the number of data points while still perfectly recovering the hierarchical clusters with high probability. We additionally improve upon previous results for k-way spectral clustering to derive conditions under which spectral clustering makes no mistakes. Further, using minimax analysis, we derive tight upper and lower bounds for the clustering problem and compare the performance of spectral clustering to these information theoretic limits. We also present experiments on simulated and real world data illustrating our results. 1
6 0.8491652 159 nips-2011-Learning with the weighted trace-norm under arbitrary sampling distributions
7 0.848858 263 nips-2011-Sparse Manifold Clustering and Embedding
8 0.84820133 16 nips-2011-A reinterpretation of the policy oscillation phenomenon in approximate policy iteration
9 0.84365273 202 nips-2011-On the Universality of Online Mirror Descent
10 0.84171969 29 nips-2011-Algorithms and hardness results for parallel large margin learning
11 0.84046376 149 nips-2011-Learning Sparse Representations of High Dimensional Data on Large Scale Dictionaries
12 0.84015644 84 nips-2011-EigenNet: A Bayesian hybrid of generative and conditional models for sparse learning
13 0.83975101 37 nips-2011-Analytical Results for the Error in Filtering of Gaussian Processes
14 0.83560395 80 nips-2011-Efficient Online Learning via Randomized Rounding
15 0.83465719 164 nips-2011-Manifold Precis: An Annealing Technique for Diverse Sampling of Manifolds
16 0.83424079 185 nips-2011-Newtron: an Efficient Bandit algorithm for Online Multiclass Prediction
17 0.83368719 258 nips-2011-Sparse Bayesian Multi-Task Learning
18 0.83230728 45 nips-2011-Beating SGD: Learning SVMs in Sublinear Time
19 0.8321687 118 nips-2011-High-dimensional regression with noisy and missing data: Provable guarantees with non-convexity
20 0.83103365 102 nips-2011-Generalised Coupled Tensor Factorisation