nips nips2010 nips2010-202 knowledge-graph by maker-knowledge-mining

202 nips-2010-Parallelized Stochastic Gradient Descent


Source: pdf

Author: Martin Zinkevich, Markus Weimer, Lihong Li, Alex J. Smola

Abstract: With the increase in available data, parallel machine learning has become an increasingly pressing problem. In this paper we present the first parallel stochastic gradient descent algorithm including a detailed analysis and experimental evidence. Unlike prior work on parallel optimization algorithms [5, 7] our variant comes with parallel acceleration guarantees and it poses no overly tight latency constraints, which might only be available in the multicore setting. Our analysis introduces a novel proof technique — contractive mappings to quantify the speed of convergence of parameter distributions to their asymptotic limits. As a side effect this answers the question of how quickly stochastic gradient descent algorithms reach the asymptotically normal regime [1, 8]. 1

Reference: text


Summary: the most important sentences generated by tfidf model

sentIndex sentText sentNum sentScore

1 With the increase in available data, parallel machine learning has become an increasingly pressing problem. [sent-10, score-0.107]

2 In this paper we present the first parallel stochastic gradient descent algorithm including a detailed analysis and experimental evidence. [sent-11, score-0.42]

3 Unlike prior work on parallel optimization algorithms [5, 7] our variant comes with parallel acceleration guarantees and it poses no overly tight latency constraints, which might only be available in the multicore setting. [sent-12, score-0.493]

4 Our analysis introduces a novel proof technique — contractive mappings to quantify the speed of convergence of parameter distributions to their asymptotic limits. [sent-13, score-0.152]

5 As a side effect this answers the question of how quickly stochastic gradient descent algorithms reach the asymptotically normal regime [1, 8]. [sent-14, score-0.367]

6 1 Introduction Over the past decade the amount of available data has increased steadily. [sent-15, score-0.077]

7 Given that the bandwidth of storage and network per computer has not been able to keep up with the increase in data, the need to design data analysis algorithms which are able to perform most steps in a distributed fashion without tight constraints on communication has become ever more pressing. [sent-17, score-0.123]

8 At current disk bandwidth and capacity (2TB at 100MB/s throughput) it takes at least 6 hours to read the content of a single hard disk. [sent-19, score-0.126]

9 For a decade, the move from batch to online learning algorithms was able to deal with increasing data set sizes, since it reduced the runtime behavior of inference algorithms from cubic or quadratic to linear in the sample size. [sent-20, score-0.123]

10 However, whenever we have more than a single disk of data, it becomes computationally infeasible to process all data by stochastic gradient descent which is an inherently sequential algorithm, at least if we want the result within a matter of hours rather than days. [sent-21, score-0.388]

11 Three recent papers attempted to break this parallelization barrier, each of them with mixed success. [sent-22, score-0.222]

12 [5] show that parallelization is easily possible for the multicore setting where we have a tight coupling of the processing units, thus ensuring extremely low latency between the processors. [sent-23, score-0.423]

13 In particular, for non-adversarial settings it is possible to obtain algorithms which scale perfectly in the number of processors, both in the case of bounded gradients and in the strongly convex case. [sent-24, score-0.106]

14 Unfortunately, these algorithms are not applicable to a MapReduce setting since the latter is fraught with considerable latency and bandwidth constraints between the computers. [sent-25, score-0.147]

15 In a nutshell, they rely on distributed computation of gradients locally on each computer which holds parts of the data and subsequent aggregation of gradients to perform a global update step. [sent-27, score-0.101]

16 [7] attempted to deal with this issue by a rather ingenious strategy: solve the sub-problems exactly on each processor and in the end average these solutions to obtain a joint solution. [sent-34, score-0.152]

17 The key advantage of this strategy is that only a single MapReduce pass is required, thus dramatically reducing the amount of communication. [sent-35, score-0.136]

18 However, since regularization tends to decrease with increasing sample size the guarantees become increasingly loose in practice as we see more data. [sent-39, score-0.073]

19 We attempt to combine the benefits of a single-average strategy as proposed by [7] with asymptotic analysis [8] of online learning. [sent-40, score-0.125]

20 Our proposed algorithm is strikingly simple: denote by c_i(w) a loss function indexed by i and with parameter w. [sent-41, score-0.17]

21 Then each processor carries out stochastic gradient descent on the set of c_i(w) with a fixed learning rate η for T steps as described in Algorithm 1. [sent-42, score-0.6]

22 Algorithm 1 SGD({c_1, . . . , c_m}, T, η, w_0): for t = 1 to T do draw j ∈ {1, . . . , m} uniformly at random and update w_t ← w_{t−1} − η ∂_w c_j(w_{t−1}). [sent-46, score-0.075]
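
As a concrete illustration, the per-machine loop can be written in a few lines of Python. This is a minimal sketch of the Algorithm 1 fragment above, not the authors' code; the quadratic toy losses and the names (sgd, grads, anchors) are assumptions made only for the example.

import numpy as np

def sgd(grads, T, eta, w0, rng):
    # Sketch of Algorithm 1: T steps of stochastic gradient descent with a
    # fixed learning rate eta, drawing the loss index j uniformly at random.
    w = np.array(w0, dtype=float)
    m = len(grads)
    for _ in range(T):
        j = rng.integers(m)           # draw j in {1, ..., m} uniformly at random
        w = w - eta * grads[j](w)     # w_t <- w_{t-1} - eta * grad c_j(w_{t-1})
    return w

# Toy usage with quadratic losses c_j(w) = 0.5 * ||w - a_j||^2, so grad c_j(w) = w - a_j.
rng = np.random.default_rng(0)
anchors = rng.standard_normal((50, 3))
grads = [(lambda w, a=a: w - a) for a in anchors]
w_T = sgd(grads, T=1000, eta=0.05, w0=np.zeros(3), rng=rng)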

23 This means that if we process T instances per machine, each processor ends up seeing T/m of the data, which is likely to exceed 1/k. [sent-63, score-0.115]

24 Algorithm comparison: Distributed subgradient [3, 9]: latency tolerance moderate, MapReduce yes, network IO high. Distributed convex solver [7]: latency tolerance high, MapReduce yes, network IO low. Multicore stochastic gradient [5]: latency tolerance low, MapReduce no, network IO n.a. This paper: latency tolerance high, MapReduce yes. [sent-64, score-0.562]

25 Large scale learning, as defined in [2], is when an algorithm is bounded by the time available instead of by the amount of data available. [sent-67, score-0.074]

26 Practically speaking, that means that one can consider the actual data in the real dataset to be a subset of a virtually infinite set, and drawing with replacement (as the theory here implies) and drawing without replacement on the actual dataset. Algorithm 3 SimuParallelSGD(Examples {c_1, . . . , [sent-68, score-0.166]

27 c_m}, Learning Rate η, Machines k): define T = m/k and randomly partition the examples, giving T examples to each machine. [sent-71, score-0.075]

28 for all i ∈ {1, . . . , k} parallel do: randomly shuffle the data on machine i. [sent-75, score-0.107]

29 for all t ∈ {1, . . . , T} do: get the t-th example on the i-th machine (this machine), c_{i,t}, and update w_{i,t} ← w_{i,t−1} − η ∂_w c_{i,t}(w_{i,t−1}); end for; end for. Aggregate from all computers v = (1/k) Σ_{i=1}^{k} w_{i,T} and return v. [sent-80, score-0.215]
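
A compact single-process simulation of this procedure might look as follows. It mirrors the structure above (random partition, independent per-machine SGD with no communication, averaging at the end), but the ridge-regression losses, the variable names and the constants are assumptions made for this sketch, not the authors' implementation.

import numpy as np

def simu_parallel_sgd(X, y, k, eta, lam, seed=0):
    # Sketch of SimuParallelSGD: partition the m examples over k machines,
    # run independent fixed-learning-rate SGD on each partition with no
    # communication between machines, then average the k final weight vectors.
    rng = np.random.default_rng(seed)
    m, d = X.shape
    T = m // k                                # T = m/k examples per machine
    perm = rng.permutation(m)                 # random partition of the data
    w_final = np.zeros((k, d))
    for i in range(k):                        # "for all i in {1, ..., k} parallel do"
        idx = perm[i * T:(i + 1) * T].copy()
        rng.shuffle(idx)                      # randomly shuffle the data on machine i
        w = np.zeros(d)
        for t in idx:                         # one pass over the local data
            # gradient of c_t(w) = (lam/2)*||w||^2 + 0.5*(y_t - w.x_t)^2
            g = lam * w + (w @ X[t] - y[t]) * X[t]
            w = w - eta * g                   # w_{i,t} <- w_{i,t-1} - eta * grad c_{i,t}
        w_final[i] = w
    return w_final.mean(axis=0)               # v = (1/k) * sum_i w_{i,T}

# Toy usage on synthetic linear regression data.
rng = np.random.default_rng(1)
X = rng.standard_normal((10000, 5))
w_true = rng.standard_normal(5)
y = X @ w_true + 0.1 * rng.standard_normal(10000)
v = simu_parallel_sgd(X, y, k=10, eta=0.01, lam=1e-3)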

30 Our paper applies an anytime algorithm via stochastic gradient descent. [sent-84, score-0.203]

31 The algorithm requires no communication between machines until the end. [sent-85, score-0.129]

32 The amount of time required is independent of the number of examples, only depending upon the regularization parameter and the desired error at the end. [sent-88, score-0.104]

33 Before delving into details we briefly outline the proof strategy: • When performing stochastic gradient descent with fixed (and sufficiently small) learning rate η the distribution of the parameter vector is asymptotically normal [1, 8]. [sent-92, score-0.45]

34 Since all computers are drawing from the same data distribution they all converge to the same limit. [sent-93, score-0.172]

35 • Averaging between the parameter vectors of k computers reduces variance by O(k^{−1/2}), similar to the result of [7]. [sent-94, score-0.125]

36 • To show that the bias due to joint initialization decreases we need to show that the distribution of parameters per machine converges sufficiently quickly to the limit distribution. [sent-96, score-0.104]

37 • Finally, we also need to show that the mean of the limit distribution for fixed learning rate is sufficiently close to the risk minimizer. [sent-97, score-0.174]

38 That is, we need to take finite-size learning rate effects into account relative to the asymptotically normal regime. [sent-98, score-0.142]

39 1 Loss and Contractions In this paper we consider estimation with convex loss functions c_i : ℓ_2 → [0, ∞). [sent-100, score-0.17]

40 For instance, in the case of regularized risk minimization we have c_i(w) = (λ/2) ||w||^2 + L(x_i, y_i, w · x_i) (1), where L is a convex function in w·x_i, such as (1/2)(y_i − w·x_i)^2 for regression or log[1 + exp(−y_i w·x_i)] for binary classification. [sent-102, score-0.182]
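
For concreteness, the two losses mentioned in (1) and their gradients could be coded as below; this is an illustrative sketch only, and the function names are ours rather than the paper's.

import numpy as np

def ridge_loss(w, x, y, lam):
    # c_i(w) = (lam/2)*||w||^2 + 0.5*(y - w.x)^2   (regression case of eq. 1)
    return 0.5 * lam * np.dot(w, w) + 0.5 * (y - np.dot(w, x)) ** 2

def ridge_grad(w, x, y, lam):
    return lam * w + (np.dot(w, x) - y) * x

def logistic_loss(w, x, y, lam):
    # c_i(w) = (lam/2)*||w||^2 + log(1 + exp(-y * w.x))   (binary classification, y in {-1, +1})
    return 0.5 * lam * np.dot(w, w) + np.log1p(np.exp(-y * np.dot(w, x)))

def logistic_grad(w, x, y, lam):
    margin = y * np.dot(w, x)
    return lam * w - y * x / (1.0 + np.exp(margin))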

41 The goal is to find an approximate minimizer of the overall risk c(w) = (1/m) Σ_{i=1}^{m} c_i(w). (2) [sent-103, score-0.182]

42 To deal with stochastic gradient descent we need tools for quantifying distributions over w. [sent-104, score-0.313]

43 Hölder continuity: A function f is Hölder continuous with constant L and exponent α if |f(x) − f(y)| ≤ L d^α(x, y) for all x, y ∈ X. [sent-106, score-0.144]

44 Hölder seminorm: Extending the Lipschitz norm for α ≥ 1: ||f||_{Lip_α} := inf {l where |f(x) − f(y)| ≤ l d^α(x, y) for all x, y ∈ X}. [sent-110, score-0.111]

45 Contraction: For a metric space (M, d), f : M → M is a contraction mapping if ||f||_{Lip} < 1. (4) [sent-111, score-0.484]

46 Theorem 1 (Banach’s Fixed Point Theorem) If (M, d) is a non-empty complete metric space, then any contraction mapping f on (M, d) has a unique fixed point x∗ = f (x∗ ). [sent-113, score-0.484]

47 Our strategy is to show that the stochastic gradient descent mapping w ← φ_i(w) := w − η ∂_w c_i(w) (5) is a contraction, where i is selected uniformly at random from {1, . . . , m}. [sent-115, score-0.568]

48 Then if η ≤ η∗, the update rule (5) is a contraction mapping in ℓ_2 with Lipschitz constant 1 − ηλ. [sent-122, score-0.436]

49 If we choose η “low enough”, gradient descent uniformly becomes a contraction. [sent-124, score-0.207]
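
This contraction property can be checked numerically for the regularized squared loss: for a sufficiently small η, one update with the same example brings any two parameter vectors closer by at least the factor 1 − ηλ. The snippet below is an illustration of the claim under that assumption, not a proof; the dimensions and constants are arbitrary.

import numpy as np

rng = np.random.default_rng(2)
d, lam, eta = 5, 1e-2, 1e-3
x, y = rng.standard_normal(d), 1.0

def phi(w):
    # phi_i(w) = w - eta * grad c_i(w) for c_i(w) = (lam/2)*||w||^2 + 0.5*(y - w.x)^2
    return w - eta * (lam * w + (np.dot(w, x) - y) * x)

for _ in range(1000):
    w1, w2 = rng.standard_normal(d), rng.standard_normal(d)
    lhs = np.linalg.norm(phi(w1) - phi(w2))
    rhs = (1 - eta * lam) * np.linalg.norm(w1 - w2)
    assert lhs <= rhs + 1e-12   # one step contracts by at least the factor 1 - eta*lam
print("contraction with constant 1 - eta*lam holds on all sampled pairs")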

50 Contraction for Distributions For fixed learning rate η stochastic gradient descent is a Markov process with state vector w. [sent-127, score-0.361]

51 While there is considerable research regarding the asymptotic properties of this process [1, 8], not much is known regarding the number of iterations required until the asymptotic regime is assumed. [sent-128, score-0.128]

52 We now address the latter by extending the notion of contractions from mappings of points to mappings of distributions. [sent-129, score-0.266]

53 The Wasserstein distance between two distributions X, Y ∈ P(M, d) is W_z(X, Y) = [ inf_{γ∈Γ(X,Y)} ∫_{x,y} d^z(x, y) dγ(x, y) ]^{1/z} (7), where Γ(X, Y) is the set of probability distributions on (M, d) × (M, d) with marginals X and Y. [sent-132, score-0.075]

54 This metric has two very important properties: it is complete, and a contraction in (M, d) induces a contraction in (P(M, d), W_z). [sent-133, score-0.734]
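
For intuition, in the special case of one-dimensional empirical distributions with equally many atoms, W_z can be computed by matching sorted samples (the monotone coupling is optimal on the real line). The sketch below covers only that special case; the sample sizes and seed are arbitrary.

import numpy as np

def wasserstein_1d(xs, ys, z=2):
    # W_z between two one-dimensional empirical measures with the same number
    # of atoms: the optimal coupling simply matches order statistics.
    xs, ys = np.sort(xs), np.sort(ys)
    return float(np.mean(np.abs(xs - ys) ** z) ** (1.0 / z))

rng = np.random.default_rng(3)
a = rng.standard_normal(1000)           # samples from a distribution X
b = 0.5 + rng.standard_normal(1000)     # samples from a shifted distribution Y
print(wasserstein_1d(a, b, z=2))        # close to the mean shift of 0.5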

55 Given a mapping φ : M → M , we can construct p : P (M, d) → P (M, d) by applying φ pointwise to M . [sent-134, score-0.093]

56 Lemma 5 Given a metric space (M, d) and a contraction mapping φ on (M, d) with constant c, p is a contraction mapping on (P(M, d), W_z) with constant c. [sent-138, score-0.92]

57 This shows that any single mapping is a contraction. [sent-140, score-0.093]

58 However, since we draw c_i at random we need to show that a mixture of such mappings is a contraction, too. [sent-141, score-0.229]

59 Here the fact that we operate on distributions comes in handy, since a mixture of mappings on distributions is itself a mapping on distributions. [sent-142, score-0.233]

60 Corollary 6 If p_1, . . . , p_k are contraction mappings with constants c_1, . . . , [sent-146, score-0.448]

61 c_k with respect to W_z, and Σ_i a_i = 1 where a_i ≥ 0, then p = Σ_{i=1}^{k} a_i p_i is a contraction mapping with a constant of no more than [Σ_i a_i (c_i)^z]^{1/z}. [sent-149, score-0.285]

62 Corollary 7 If for all i, c_i ≤ c, then p is a contraction mapping with a constant of no more than c. [sent-150, score-0.56]

63 We apply this to SGD as follows: define p∗ = (1/m) Σ_{i=1}^{m} p_i to be the stochastic operation in one step. [sent-152, score-0.106]

64 Then the following holds: Theorem 8 For any z ∈ N, if η ≤ η∗, then p∗ is a contraction mapping on (M, W_z) with contraction rate (1 − ηλ). [sent-154, score-0.516]

65 The contraction rate (1 − ηλ) can be proven by applying Lemma 3, Lemma 5, and Corollary 6. [sent-158, score-0.49]

66 This means that for a suitable choice of η we achieve exponentially fast convergence in T to some stationary distribution D∗_η. [sent-161, score-0.133]
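
Since the distance to D∗_η shrinks by a factor of (1 − ηλ) per step, the number of steps needed to reduce it by a factor ε is roughly log(1/ε)/(ηλ). A small back-of-the-envelope helper, with purely illustrative constants, is shown below.

import math

def steps_to_contract(eta, lam, eps):
    # Smallest T with (1 - eta*lam)**T <= eps, i.e.
    # T >= log(1/eps) / (-log(1 - eta*lam)) ~ log(1/eps) / (eta*lam) for small eta*lam.
    return math.ceil(math.log(1.0 / eps) / -math.log1p(-eta * lam))

print(steps_to_contract(eta=1e-2, lam=1e-3, eps=1e-3))   # roughly 6.9e5 steps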

67 Note that this distribution need not be centered at the risk minimizer of c(w). [sent-162, score-0.093]

68 3 Guarantees for the Stationary Distribution At this point, we know there exists a stationary distribution, and our algorithms are converging to that distribution exponentially fast. [sent-165, score-0.133]

69 However, unlike in traditional gradient descent, the stationary distribution is not necessarily just the optimal point. [sent-166, score-0.23]

70 In particular, the harder parts of understanding this algorithm involve understanding the properties of the stationary distribution. [sent-167, score-0.098]

71 First, we show that the mean of the stationary distribution has low error. [sent-168, score-0.133]

72 Secondly, we show that the squared distance from the optimal point, and therefore the variance, is low. [sent-172, score-0.123]

73 Theorem 10 The average squared distance of D∗_η from the optimal point is bounded by: E_{w∈D∗_η}[(w − w∗)^2] ≤ 4ηG^2 / ((2 − ηλ)λ). [sent-173, score-0.162]

74 In other words, the squared distance is bounded by O(ηG^2/λ). [sent-174, score-0.162]
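
Plugging illustrative constants into this bound makes the O(ηG^2/λ) scaling concrete; the values of η, G and λ below are arbitrary choices for the sketch, not numbers taken from the paper.

def stationary_variance_bound(eta, G, lam):
    # Theorem 10 bound: E[(w - w*)^2] <= 4*eta*G^2 / ((2 - eta*lam) * lam).
    return 4.0 * eta * G ** 2 / ((2.0 - eta * lam) * lam)

for eta in (1e-1, 1e-2, 1e-3):
    # shrinking eta by 10x shrinks the bound by roughly 10x: the stationary
    # distribution tightens around the risk minimizer as the learning rate drops.
    print(eta, stationary_variance_bound(eta, G=1.0, lam=1e-3))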

75 Throughout the appendix, we develop tools to show that the distribution over the output vector of the algorithm is "near" µ_{D∗_η}, the mean of the stationary distribution. [sent-177, score-0.133]

76 In particular, if D^{T,k}_η is the distribution over the final vector of ParallelSGD after T iterations on each of k machines with a learning rate η, then W_2(µ_{D∗_η}, D^{T,k}_η) = (E_{x∈D^{T,k}_η}[(x − µ_{D∗_η})^2])^{1/2} becomes small. [sent-178, score-0.246]

77 Then, we need to connect the error of the mean of the stationary distribution to a distribution that is near to this mean. [sent-179, score-0.205]

78 Modulo logarithmic factors, we require 1/λ machines to run for 1/(ηλ) time, which is the same order of computation, but a dramatic speedup of a factor of 1/λ in wall clock time. [sent-194, score-0.329]

79 During training, the model is stored on disk after k = 10,000 · 2^i updates. [sent-207, score-0.075]

80 This approach is geared equally towards estimating the parallelization ability of our optimization algorithm and towards its application to machine learning. [sent-210, score-0.185]

81 Evaluation measures: We report both the normalized root mean squared error (RMSE) on the test set and the normalized value of the objective function during training. [sent-212, score-0.161]

82 1.0 is the RMSE obtained by training a model in one single, sequential pass over the data. [sent-214, score-0.095]

83 The objective function values are normalized in much the same way such that the objective function value of a single, full sequential pass over the data reaches the value 1. [sent-215, score-0.137]
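
The normalization described here amounts to dividing each measurement by the corresponding value from a single, sequential pass over the data; a sketch with hypothetical variable names is given below.

import numpy as np

def normalized_rmse(predictions, targets, baseline_rmse):
    # Relative test RMSE: 1.0 corresponds to the RMSE of one single, sequential
    # pass over the data; values below 1.0 indicate an improvement over it.
    rmse = float(np.sqrt(np.mean((predictions - targets) ** 2)))
    return rmse / baseline_rmse

def normalized_objective(objective_value, baseline_objective):
    # Same normalization applied to the training objective.
    return objective_value / baseline_objective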

84 Configurations: We studied both the Huber and the squared error loss. [sent-217, score-0.124]

85 While the latter does not satisfy all the assumptions of our proofs (its gradient is unbounded), it is included due to its popularity. [sent-218, score-0.097]
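
The difference between the two losses is easiest to see in their gradients with respect to the residual: the Huber gradient is clipped to a constant, while the squared-error gradient grows without bound. The sketch below assumes the standard Huber definition with threshold delta; the paper's exact parameterization may differ.

import numpy as np

def huber_grad(residual, delta=1.0):
    # Derivative of the Huber loss w.r.t. the residual r: r for |r| <= delta,
    # delta * sign(r) otherwise, so its magnitude never exceeds delta (bounded).
    return np.clip(residual, -delta, delta)

def squared_grad(residual):
    # Derivative of 0.5 * r^2: grows linearly with the residual (unbounded).
    return residual

r = np.array([-10.0, -0.5, 0.0, 0.5, 10.0])
print(huber_grad(r))    # gradients clipped to the range [-1, 1]
print(squared_grad(r))  # gradients grow with the residual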

86 1 Results and Discussion Optimization: Figure 1 shows the relative objective function values for training using 1, 10 and 100 machines with λ = 1e−3 . [sent-222, score-0.238]

87 In terms of wall clock time, the models obtained on 100 machines clearly outperform the ones obtained on 10 machines, which in turn outperform the model trained on a single machine. [sent-223, score-0.329]

88 There is no significant difference in behavior between the squared error and the Huber loss in these experiments, despite the fact that the squared error is effectively unbounded. [sent-224, score-0.294]

89 Thus, the parallelization works in the sense that many machines obtain a better objective function value after each machine has seen k instances. [sent-225, score-0.351]

90 Additionally, the results also show that data-local parallelized training is feasible and beneficial with the proposed algorithm in practice. [sent-226, score-0.093]

91 Note that the parallel training needs slightly more machine time to obtain the same objective function value, which is to be expected. [sent-227, score-0.176]

92 Also unsurprising, yet noteworthy, is the trade-off between the number of machines and the quality of the solution: The solution obtained by 10 machines is much more of an improvement over using one machine than using 100 machines is over 10. [sent-228, score-0.387]

93 Predictive Performance: Figure 2 shows the relative test RMSE for 1, 10 and 100 machines with λ = 1e−3 . [sent-229, score-0.169]

94 As expected, the results are very similar to the objective function comparison: The parallel training decreases wall clock time at the price of slightly higher machine time. [sent-230, score-0.376]

95 Again, the gain in performance between 1 and 10 machines is much higher than the one between 10 and 100. [sent-231, score-0.129]

96 The plots exhibit exactly this characteristic: For λ = 1e−6 , the loss for 10 and 100 machines not only drops faster, but the final solution for both beats the solution found by a single pass, adding further empirical evidence for the behaviour predicted by our theory. [sent-234, score-0.175]

97 Our analysis of the algorithm’s performance is based on a novel technique that uses contraction theory to quantify finite-sample convergence rate of stochastic gradient descent. [sent-239, score-0.594]

98 We show worst-case bounds that are comparable to sequential stochastic gradient descent in terms of overall computation, and vastly faster in terms of wall clock time. [sent-240, score-0.513]

99 Lastly, our experiments on a large-scale real world dataset show that the parallelization reduces the wall-clock time needed to obtain a set solution quality. [sent-241, score-0.185]

100 Adaptive subgradient methods for online learning and stochastic optimization. [sent-270, score-0.146]


similar papers computed by tfidf model

tfidf for this paper:

wordName wordTfidf (topN-words)

[('contraction', 0.343), ('wz', 0.291), ('mapreduce', 0.246), ('lip', 0.215), ('parallelization', 0.185), ('sgd', 0.154), ('huber', 0.15), ('lipschitz', 0.134), ('machines', 0.129), ('ci', 0.124), ('ew', 0.123), ('sunnyvale', 0.123), ('clock', 0.123), ('processor', 0.115), ('ld', 0.113), ('descent', 0.11), ('appendix', 0.109), ('parallel', 0.107), ('stochastic', 0.106), ('mappings', 0.105), ('multicore', 0.105), ('proven', 0.099), ('stationary', 0.098), ('gradient', 0.097), ('latency', 0.096), ('mapping', 0.093), ('yes', 0.092), ('computers', 0.091), ('labs', 0.089), ('squared', 0.087), ('shuf', 0.085), ('yahoo', 0.083), ('rmse', 0.081), ('wall', 0.077), ('disk', 0.075), ('cm', 0.075), ('lder', 0.072), ('parallelsgd', 0.07), ('radon', 0.07), ('seminorm', 0.07), ('simuparallelsgd', 0.07), ('wasserstein', 0.07), ('smola', 0.069), ('wt', 0.064), ('pass', 0.063), ('weimer', 0.061), ('parallelized', 0.061), ('risk', 0.058), ('contractions', 0.056), ('asymptotically', 0.054), ('lihong', 0.053), ('continuity', 0.053), ('bandwidth', 0.051), ('hashing', 0.05), ('corollary', 0.05), ('minw', 0.048), ('metric', 0.048), ('rate', 0.048), ('asymptotic', 0.047), ('solver', 0.047), ('loss', 0.046), ('drawing', 0.046), ('batch', 0.045), ('theorem', 0.042), ('decade', 0.042), ('guarantees', 0.041), ('ai', 0.04), ('relative', 0.04), ('online', 0.04), ('inf', 0.039), ('bounded', 0.039), ('runtime', 0.038), ('strategy', 0.038), ('langford', 0.037), ('attempted', 0.037), ('error', 0.037), ('tight', 0.037), ('ln', 0.037), ('objective', 0.037), ('aggregate', 0.037), ('replacement', 0.037), ('ca', 0.037), ('distance', 0.036), ('bias', 0.036), ('tolerance', 0.036), ('amount', 0.035), ('lemma', 0.035), ('distributed', 0.035), ('distribution', 0.035), ('alex', 0.035), ('perfectly', 0.034), ('iterations', 0.034), ('variance', 0.034), ('ing', 0.034), ('learners', 0.033), ('limit', 0.033), ('gradients', 0.033), ('tion', 0.032), ('training', 0.032), ('regularization', 0.032)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 1.0000004 202 nips-2010-Parallelized Stochastic Gradient Descent

Author: Martin Zinkevich, Markus Weimer, Lihong Li, Alex J. Smola

Abstract: With the increase in available data, parallel machine learning has become an increasingly pressing problem. In this paper we present the first parallel stochastic gradient descent algorithm including a detailed analysis and experimental evidence. Unlike prior work on parallel optimization algorithms [5, 7] our variant comes with parallel acceleration guarantees and it poses no overly tight latency constraints, which might only be available in the multicore setting. Our analysis introduces a novel proof technique — contractive mappings to quantify the speed of convergence of parameter distributions to their asymptotic limits. As a side effect this answers the question of how quickly stochastic gradient descent algorithms reach the asymptotically normal regime [1, 8]. 1

2 0.13295713 243 nips-2010-Smoothness, Low Noise and Fast Rates

Author: Nathan Srebro, Karthik Sridharan, Ambuj Tewari

Abstract: We establish an excess risk bound of Õ(H R_n^2 + √(H L∗) R_n) for ERM with an H-smooth loss function and a hypothesis class with Rademacher complexity R_n, where L∗ is the best risk achievable by the hypothesis class. For typical hypothesis classes where R_n = √(R/n), this translates to a learning rate of Õ(RH/n) in the separable (L∗ = 0) case and Õ(RH/n + √(L∗RH/n)) more generally. We also provide similar guarantees for online and stochastic convex optimization of a smooth non-negative objective. 1

3 0.10436456 63 nips-2010-Distributed Dual Averaging In Networks

Author: Alekh Agarwal, Martin J. Wainwright, John C. Duchi

Abstract: The goal of decentralized optimization over a network is to optimize a global objective formed by a sum of local (possibly nonsmooth) convex functions using only local computation and communication. We develop and analyze distributed algorithms based on dual averaging of subgradients, and provide sharp bounds on their convergence rates as a function of the network size and topology. Our analysis clearly separates the convergence of the optimization algorithm itself from the effects of communication constraints arising from the network structure. We show that the number of iterations required by our algorithm scales inversely in the spectral gap of the network. The sharpness of this prediction is confirmed both by theoretical lower bounds and simulations for various networks. 1

4 0.092385001 222 nips-2010-Random Walk Approach to Regret Minimization

Author: Hariharan Narayanan, Alexander Rakhlin

Abstract: We propose a computationally efficient random walk on a convex body which rapidly mixes to a time-varying Gibbs distribution. In the setting of online convex optimization and repeated games, the algorithm yields low regret and presents a novel efficient method for implementing mixture forecasting strategies. 1

5 0.090490945 193 nips-2010-Online Learning: Random Averages, Combinatorial Parameters, and Learnability

Author: Alexander Rakhlin, Karthik Sridharan, Ambuj Tewari

Abstract: We develop a theory of online learning by defining several complexity measures. Among them are analogues of Rademacher complexity, covering numbers and fatshattering dimension from statistical learning theory. Relationship among these complexity measures, their connection to online learning, and tools for bounding them are provided. We apply these results to various learning problems. We provide a complete characterization of online learnability in the supervised setting. 1

6 0.082347505 92 nips-2010-Fast global convergence rates of gradient methods for high-dimensional statistical recovery

7 0.08163967 72 nips-2010-Efficient algorithms for learning kernels from multiple similarity matrices with general convex loss functions

8 0.079434745 232 nips-2010-Sample Complexity of Testing the Manifold Hypothesis

9 0.076186888 270 nips-2010-Tight Sample Complexity of Large-Margin Learning

10 0.075278506 265 nips-2010-The LASSO risk: asymptotic results and real world examples

11 0.074009188 225 nips-2010-Relaxed Clipping: A Global Training Method for Robust Regression and Classification

12 0.069801584 208 nips-2010-Policy gradients in linearly-solvable MDPs

13 0.069251969 177 nips-2010-Multitask Learning without Label Correspondences

14 0.067754388 196 nips-2010-Online Markov Decision Processes under Bandit Feedback

15 0.065648042 148 nips-2010-Learning Networks of Stochastic Differential Equations

16 0.065430172 210 nips-2010-Practical Large-Scale Optimization for Max-norm Regularization

17 0.06454251 104 nips-2010-Generative Local Metric Learning for Nearest Neighbor Classification

18 0.063603275 13 nips-2010-A Primal-Dual Message-Passing Algorithm for Approximated Large Scale Structured Prediction

19 0.062959053 163 nips-2010-Lower Bounds on Rate of Convergence of Cutting Plane Methods

20 0.062905155 185 nips-2010-Nonparametric Density Estimation for Stochastic Optimization with an Observable State Variable


similar papers computed by lsi model

lsi for this paper:

topicId topicWeight

[(0, 0.22), (1, -0.01), (2, 0.109), (3, 0.008), (4, 0.039), (5, 0.043), (6, -0.011), (7, -0.005), (8, -0.063), (9, 0.021), (10, 0.026), (11, -0.06), (12, 0.026), (13, 0.004), (14, -0.053), (15, 0.015), (16, -0.005), (17, -0.007), (18, 0.012), (19, -0.023), (20, 0.042), (21, 0.006), (22, 0.038), (23, -0.004), (24, -0.002), (25, -0.046), (26, 0.043), (27, -0.03), (28, -0.005), (29, -0.004), (30, -0.078), (31, -0.062), (32, 0.036), (33, -0.076), (34, -0.061), (35, 0.044), (36, -0.009), (37, -0.077), (38, 0.004), (39, -0.035), (40, 0.009), (41, -0.064), (42, -0.001), (43, 0.018), (44, 0.145), (45, -0.094), (46, -0.006), (47, 0.001), (48, 0.031), (49, 0.026)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.92589527 202 nips-2010-Parallelized Stochastic Gradient Descent

Author: Martin Zinkevich, Markus Weimer, Lihong Li, Alex J. Smola

Abstract: With the increase in available data, parallel machine learning has become an increasingly pressing problem. In this paper we present the first parallel stochastic gradient descent algorithm including a detailed analysis and experimental evidence. Unlike prior work on parallel optimization algorithms [5, 7] our variant comes with parallel acceleration guarantees and it poses no overly tight latency constraints, which might only be available in the multicore setting. Our analysis introduces a novel proof technique — contractive mappings to quantify the speed of convergence of parameter distributions to their asymptotic limits. As a side effect this answers the question of how quickly stochastic gradient descent algorithms reach the asymptotically normal regime [1, 8]. 1

2 0.67237049 92 nips-2010-Fast global convergence rates of gradient methods for high-dimensional statistical recovery

Author: Alekh Agarwal, Sahand Negahban, Martin J. Wainwright

Abstract: Many statistical M -estimators are based on convex optimization problems formed by the weighted sum of a loss function with a norm-based regularizer. We analyze the convergence rates of first-order gradient methods for solving such problems within a high-dimensional framework that allows the data dimension d to grow with (and possibly exceed) the sample size n. This high-dimensional structure precludes the usual global assumptions— namely, strong convexity and smoothness conditions—that underlie classical optimization analysis. We define appropriately restricted versions of these conditions, and show that they are satisfied with high probability for various statistical models. Under these conditions, our theory guarantees that Nesterov’s first-order method [12] has a globally geometric rate of convergence up to the statistical precision of the model, meaning the typical Euclidean distance between the true unknown parameter θ∗ and the optimal solution θ. This globally linear rate is substantially faster than previous analyses of global convergence for specific methods that yielded only sublinear rates. Our analysis applies to a wide range of M -estimators and statistical models, including sparse linear regression using Lasso ( 1 regularized regression), group Lasso, block sparsity, and low-rank matrix recovery using nuclear norm regularization. Overall, this result reveals an interesting connection between statistical precision and computational efficiency in high-dimensional estimation. 1

3 0.65948433 222 nips-2010-Random Walk Approach to Regret Minimization

Author: Hariharan Narayanan, Alexander Rakhlin

Abstract: We propose a computationally efficient random walk on a convex body which rapidly mixes to a time-varying Gibbs distribution. In the setting of online convex optimization and repeated games, the algorithm yields low regret and presents a novel efficient method for implementing mixture forecasting strategies. 1

4 0.65255886 163 nips-2010-Lower Bounds on Rate of Convergence of Cutting Plane Methods

Author: Xinhua Zhang, Ankan Saha, S.v.n. Vishwanathan

Abstract: In a recent paper Joachims [1] presented SVM-Perf, a cutting plane method (CPM) for training linear Support Vector Machines (SVMs) which converges to an accurate solution in O(1/ε^2) iterations. By tightening the analysis, Teo et al. [2] showed that O(1/ε) iterations suffice. Given the impressive convergence speed of CPM on a number of practical problems, it was conjectured that these rates could be further improved. In this paper we disprove this conjecture. We present counter examples which are not only applicable for training linear SVMs with hinge loss, but also hold for support vector methods which optimize a multivariate performance score. However, surprisingly, these problems are not inherently hard. By exploiting the structure of the objective function we can devise an algorithm that converges in O(1/√ε) iterations. 1

5 0.63972849 243 nips-2010-Smoothness, Low Noise and Fast Rates

Author: Nathan Srebro, Karthik Sridharan, Ambuj Tewari

Abstract: We establish an excess risk bound of Õ(H R_n^2 + √(H L∗) R_n) for ERM with an H-smooth loss function and a hypothesis class with Rademacher complexity R_n, where L∗ is the best risk achievable by the hypothesis class. For typical hypothesis classes where R_n = √(R/n), this translates to a learning rate of Õ(RH/n) in the separable (L∗ = 0) case and Õ(RH/n + √(L∗RH/n)) more generally. We also provide similar guarantees for online and stochastic convex optimization of a smooth non-negative objective. 1

6 0.62726128 226 nips-2010-Repeated Games against Budgeted Adversaries

7 0.61755276 61 nips-2010-Direct Loss Minimization for Structured Prediction

8 0.60632646 118 nips-2010-Implicit Differentiation by Perturbation

9 0.60062087 269 nips-2010-Throttling Poisson Processes

10 0.59818667 248 nips-2010-Sparse Inverse Covariance Selection via Alternating Linearization Methods

11 0.58045101 210 nips-2010-Practical Large-Scale Optimization for Max-norm Regularization

12 0.57923645 63 nips-2010-Distributed Dual Averaging In Networks

13 0.57621771 183 nips-2010-Non-Stochastic Bandit Slate Problems

14 0.56463748 289 nips-2010-b-Bit Minwise Hashing for Estimating Three-Way Similarities

15 0.55227697 38 nips-2010-Batch Bayesian Optimization via Simulation Matching

16 0.55147666 35 nips-2010-Auto-Regressive HMM Inference with Incomplete Data for Short-Horizon Wind Forecasting

17 0.54473275 283 nips-2010-Variational Inference over Combinatorial Spaces

18 0.53965002 107 nips-2010-Global seismic monitoring as probabilistic inference

19 0.53260458 290 nips-2010-t-logistic regression

20 0.52685612 185 nips-2010-Nonparametric Density Estimation for Stochastic Optimization with an Observable State Variable


similar papers computed by lda model

lda for this paper:

topicId topicWeight

[(13, 0.036), (17, 0.023), (27, 0.047), (30, 0.101), (35, 0.013), (42, 0.248), (45, 0.193), (50, 0.073), (52, 0.048), (60, 0.03), (77, 0.07), (78, 0.023), (90, 0.039)]

similar papers list:

simIndex simValue paperId paperTitle

1 0.84714681 110 nips-2010-Guaranteed Rank Minimization via Singular Value Projection

Author: Prateek Jain, Raghu Meka, Inderjit S. Dhillon

Abstract: Minimizing the rank of a matrix subject to affine constraints is a fundamental problem with many important applications in machine learning and statistics. In this paper we propose a simple and fast algorithm SVP (Singular Value Projection) for rank minimization under affine constraints (ARMP) and show that SVP recovers the minimum rank solution for affine constraints that satisfy a restricted isometry property (RIP). Our method guarantees geometric convergence rate even in the presence of noise and requires strictly weaker assumptions on the RIP constants than the existing methods. We also introduce a Newton-step for our SVP framework to speed-up the convergence with substantial empirical gains. Next, we address a practically important application of ARMP - the problem of lowrank matrix completion, for which the defining affine constraints do not directly obey RIP, hence the guarantees of SVP do not hold. However, we provide partial progress towards a proof of exact recovery for our algorithm by showing a more restricted isometry property and observe empirically that our algorithm recovers low-rank incoherent matrices from an almost optimal number of uniformly sampled entries. We also demonstrate empirically that our algorithms outperform existing methods, such as those of [5, 18, 14], for ARMP and the matrix completion problem by an order of magnitude and are also more robust to noise and sampling schemes. In particular, results show that our SVP-Newton method is significantly robust to noise and performs impressively on a more realistic power-law sampling scheme for the matrix completion problem. 1

same-paper 2 0.80180299 202 nips-2010-Parallelized Stochastic Gradient Descent

Author: Martin Zinkevich, Markus Weimer, Lihong Li, Alex J. Smola

Abstract: With the increase in available data, parallel machine learning has become an increasingly pressing problem. In this paper we present the first parallel stochastic gradient descent algorithm including a detailed analysis and experimental evidence. Unlike prior work on parallel optimization algorithms [5, 7] our variant comes with parallel acceleration guarantees and it poses no overly tight latency constraints, which might only be available in the multicore setting. Our analysis introduces a novel proof technique — contractive mappings to quantify the speed of convergence of parameter distributions to their asymptotic limits. As a side effect this answers the question of how quickly stochastic gradient descent algorithms reach the asymptotically normal regime [1, 8]. 1

3 0.78020006 160 nips-2010-Linear Complementarity for Regularized Policy Evaluation and Improvement

Author: Jeffrey Johns, Christopher Painter-wakefield, Ronald Parr

Abstract: Recent work in reinforcement learning has emphasized the power of L1 regularization to perform feature selection and prevent overfitting. We propose formulating the L1 regularized linear fixed point problem as a linear complementarity problem (LCP). This formulation offers several advantages over the LARS-inspired formulation, LARS-TD. The LCP formulation allows the use of efficient off-theshelf solvers, leads to a new uniqueness result, and can be initialized with starting points from similar problems (warm starts). We demonstrate that warm starts, as well as the efficiency of LCP solvers, can speed up policy iteration. Moreover, warm starts permit a form of modified policy iteration that can be used to approximate a “greedy” homotopy path, a generalization of the LARS-TD homotopy path that combines policy evaluation and optimization. 1

4 0.69445795 222 nips-2010-Random Walk Approach to Regret Minimization

Author: Hariharan Narayanan, Alexander Rakhlin

Abstract: We propose a computationally efficient random walk on a convex body which rapidly mixes to a time-varying Gibbs distribution. In the setting of online convex optimization and repeated games, the algorithm yields low regret and presents a novel efficient method for implementing mixture forecasting strategies. 1

5 0.69367242 270 nips-2010-Tight Sample Complexity of Large-Margin Learning

Author: Sivan Sabato, Nathan Srebro, Naftali Tishby

Abstract: We obtain a tight distribution-specific characterization of the sample complexity of large-margin classification with L2 regularization: We introduce the γ-adapted-dimension, which is a simple function of the spectrum of a distribution’s covariance matrix, and show distribution-specific upper and lower bounds on the sample complexity, both governed by the γ-adapted-dimension of the source distribution. We conclude that this new quantity tightly characterizes the true sample complexity of large-margin classification. The bounds hold for a rich family of sub-Gaussian distributions. 1

6 0.69236386 117 nips-2010-Identifying graph-structured activation patterns in networks

7 0.692307 51 nips-2010-Construction of Dependent Dirichlet Processes based on Poisson Processes

8 0.69202828 238 nips-2010-Short-term memory in neuronal networks through dynamical compressed sensing

9 0.69179952 148 nips-2010-Learning Networks of Stochastic Differential Equations

10 0.69112992 155 nips-2010-Learning the context of a category

11 0.68707681 63 nips-2010-Distributed Dual Averaging In Networks

12 0.6850751 49 nips-2010-Computing Marginal Distributions over Continuous Markov Networks for Statistical Relational Learning

13 0.68483317 92 nips-2010-Fast global convergence rates of gradient methods for high-dimensional statistical recovery

14 0.68444961 243 nips-2010-Smoothness, Low Noise and Fast Rates

15 0.6838485 87 nips-2010-Extended Bayesian Information Criteria for Gaussian Graphical Models

16 0.68328309 265 nips-2010-The LASSO risk: asymptotic results and real world examples

17 0.68178642 193 nips-2010-Online Learning: Random Averages, Combinatorial Parameters, and Learnability

18 0.68147761 280 nips-2010-Unsupervised Kernel Dimension Reduction

19 0.67911118 260 nips-2010-Sufficient Conditions for Generating Group Level Sparsity in a Robust Minimax Framework

20 0.67899716 282 nips-2010-Variable margin losses for classifier design