nips nips2011 nips2011-72 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Alekh Agarwal, John C. Duchi
Abstract: We analyze the convergence of gradient-based optimization algorithms whose updates depend on delayed stochastic gradient information. The main application of our results is to the development of distributed minimization algorithms where a master node performs parameter updates while worker nodes compute stochastic gradients based on local information in parallel, which may give rise to delays due to asynchrony. Our main contribution is to show that for smooth stochastic problems, the delays are asymptotically negligible. In application to distributed optimization, we show n-node architectures whose optimization error in stochastic problems—in spite of asynchronous delays—scales asymptotically as O(1/√(nT)), which is known to be optimal even in the absence of delays.
Reference: text
sentIndex sentText sentNum sentScore
1 Abstract: We analyze the convergence of gradient-based optimization algorithms whose updates depend on delayed stochastic gradient information. [sent-4, score-0.904]
2 The main application of our results is to the development of distributed minimization algorithms where a master node performs parameter updates while worker nodes compute stochastic gradients based on local information in parallel, which may give rise to delays due to asynchrony. [sent-5, score-1.479]
3 Our main contribution is to show that for smooth stochastic problems, the delays are asymptotically negligible. [sent-6, score-0.485]
4 In application to distributed optimization, we show n-node architectures whose optimization error in stochastic problems—in spite of asynchronous delays—scales asymptotically as O(1/√(nT)), which is known to be optimal even in the absence of delays. [sent-7, score-0.732]
5 1 Introduction: We focus on stochastic convex optimization problems of the form minimize_{x∈X} f(x) for f(x) := E_P[F(x; ξ)] = ∫_Ξ F(x; ξ) dP(ξ), (1) where X ⊆ R^d is a closed convex set, P is a probability distribution over Ξ, and F(· ; ξ) is convex for all ξ ∈ Ξ, so that f is convex. [sent-8, score-0.321]
6 Classical stochastic gradient algorithms [18, 16] iteratively update a parameter x(t) ∈ X by sampling ξ ∼ P , computing g(t) = ∇F (x(t); ξ), and performing the update x(t + 1) = ΠX (x(t) − α(t)g(t)), where ΠX denotes projection onto the set X and α(t) ∈ R is a stepsize. [sent-9, score-0.441]
7 In this paper, we analyze asynchronous gradient methods, where instead of receiving current information g(t), the procedure receives out-of-date gradients g(t − τ(t)) = ∇F(x(t − τ(t)); ξ), where τ(t) is the (potentially random) delay at time t. [sent-10, score-0.836]
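To make the two updates above concrete, the following is a minimal Python/NumPy sketch of the delayed projected stochastic gradient iteration. The quadratic objective, noise level, fixed delay tau, and Euclidean-ball constraint are illustrative assumptions of the sketch, not the paper's setting.

import numpy as np

rng = np.random.default_rng(0)
d, R, T, tau = 5, 1.0, 1000, 3             # dimension, radius of X, iterations, fixed delay

def project_ball(v, R):
    # Euclidean projection onto X = {x : ||x||_2 <= R}
    norm = np.linalg.norm(v)
    return v if norm <= R else (R / norm) * v

def stoch_grad(x):
    # Unbiased gradient of the illustrative objective f(x) = 0.5*||x - x_star||^2
    x_star = 0.1 * np.ones(d)
    return (x - x_star) + 0.01 * rng.standard_normal(d)

x = np.zeros(d)
history = [x.copy()]                        # past iterates, so delayed gradients can be formed
for t in range(T):
    g = stoch_grad(history[max(0, t - tau)])   # g(t - tau(t)) = grad F(x(t - tau(t)); xi)
    alpha = R / np.sqrt(t + 1)                 # O(1/sqrt(t)) stepsize
    x = project_ball(x - alpha * g, R)         # x(t+1) = Pi_X(x(t) - alpha(t) g(t - tau(t)))
    history.append(x.copy())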
8 The central contribution of this paper is to develop algorithms that—under natural assumptions about the functions F in the objective (1)—achieve asymptotically optimal convergence rates for stochastic convex optimization in spite of delays. [sent-11, score-0.598]
9 Our model of delayed gradient information is particularly relevant in distributed optimization scenarios, where a master maintains the parameters x while workers compute stochastic gradients of the objective (1) using a local subset of the data. [sent-12, score-1.621]
10 By allowing delayed and asynchronous updates, we can avoid synchronization issues that commonly handicap distributed systems. [sent-14, score-0.729]
11 Distributed optimization has been studied for several decades, tracing back at least to seminal work of Bertsekas and Tsitsiklis ([3, 19, 4]) on asynchronous computation and minimization of smooth functions where the parameter vector is distributed. [sent-15, score-0.28]
Nedić et al.'s asynchronous incremental subgradient method [12]. Figure 1: Cyclic delayed update architecture. [sent-18, score-0.71]
13 Workers compute gradients in parallel, passing out-of-date (stochastic) gradients gi(t − τ) = ∇fi(x(t − τ)) to the master. [sent-19, score-0.483]
14 Nedić et al. prove that the procedure converges, and a slight extension of their results shows that the optimization error of the procedure after T iterations is at most O(√(τ/T)), τ being the delay in gradients. [sent-25, score-0.259]
15 Without delay, a centralized stochastic gradient algorithm attains convergence rate O(1/√T). [sent-26, score-0.604]
16 All the approaches mentioned above give slower convergence than this centralized rate in distributed settings, paying a penalty for data being split across a network; as Dekel et al. [sent-27, score-0.468]
17 [10] also study asynchronous methods in the setup of stochastic optimization and attempt to remove the penalty for the delayed procedure under an additional smoothness assumption; however, their paper has a technical error (see the long version [2] for details). [sent-30, score-0.896]
18 The main contributions of our paper are (1) to remove the delay penalty for smooth functions and (2) to demonstrate benefits in convergence rate by leveraging parallel computation even in spite of delays. [sent-31, score-0.464]
19 [8, 9]) to show that for smooth objectives f, when n processors compute stochastic gradients in parallel using a common parameter x it is possible to achieve convergence rate O(1/√(Tn)). [sent-35, score-0.512]
20 We show similar results, but we analyze the effects of asynchronous gradient updates where all the nodes in the network can suffer delays, quantifying the impact of the delays. [sent-37, score-0.505]
21 In application to distributed optimization, we show that under different network assumptions, we achieve convergence rates ranging from O(min{n³/T, (n/T)^(2/3)} + 1/√(Tn)) to O(min{n/T, 1/T^(2/3)} + 1/√(Tn)), which is O(1/√(nT)) asymptotically in T. [sent-38, score-0.381]
22 The time necessary to achieve an ε-optimal solution to the problem (1) is asymptotically O(1/(nε²)), a factor of n (the size of the network) better than a centralized procedure in spite of delay. [sent-39, score-0.433]
23 x(t + 1) = argmin_{x∈X} { ⟨z(t + 1), x⟩ + (1/α(t + 1)) ψ(x) } (3). For the remainder of the paper, we will use the following three essentially standard assumptions [8, 9, 20] about the stochastic optimization problem (1). [sent-55, score-0.264]
24 Under Assumption I, dual averaging satisfies E[f(x(T))] − f(x*) = O(RG/√T) for the stepsize choice α(t) = R/(G√t) [15, 20]. [sent-70, score-0.337]
25 [5, Appendix A] show that the stepsize choice α(t)^(-1) = L + (σ/R)√t yields the convergence rate E[f(x(T))] − f(x*) = O(LR²/T + σR/√T). [sent-74, score-0.259]
26 Combining the delay analysis Nedić et al. [12] developed with the convergence proofs of dual averaging [15], it is possible to show that so long as E[τ(t)] ≤ B < ∞ for all t, Assumptions I and III and the stepsize choice α(t) = R/(G√(Bt)) give E[f(x(T))] − f(x*) = O(RG√B/√T). [sent-79, score-0.468]
27 z(t + 1) = z(t) + g(t − τ(t)) and x(t + 1) = argmin_{x∈X} { ⟨z(t + 1), x⟩ + (1/α(t + 1)) ψ(x) } (5). 3 Convergence rates for delayed optimization of smooth functions: We now state and discuss the implications of two results for asynchronous stochastic gradient methods. [sent-81, score-1.028]
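A minimal Python sketch of the delayed dual averaging rule (5), under the same illustrative assumptions as the earlier sketch; with ψ(x) = (1/2)‖x‖² and X a Euclidean ball, the argmin reduces to a projection of −α(t + 1) z(t + 1).

import numpy as np

rng = np.random.default_rng(1)
d, R, T, tau = 5, 1.0, 1000, 3

def project_ball(v, R):
    n = np.linalg.norm(v)
    return v if n <= R else (R / n) * v

def stoch_grad(x):
    return (x - 0.1 * np.ones(d)) + 0.01 * rng.standard_normal(d)   # illustrative objective

z, x = np.zeros(d), np.zeros(d)            # z accumulates the (delayed) gradients
history = [x.copy()]
for t in range(T):
    z = z + stoch_grad(history[max(0, t - tau)])   # z(t+1) = z(t) + g(t - tau(t))
    alpha = R / np.sqrt(t + 1)
    # With psi(x) = 0.5*||x||^2 and X a ball, the argmin in (5) is Pi_X(-alpha * z(t+1)).
    x = project_ball(-alpha * z, R)
    history.append(x.copy())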
28 Our first convergence result is for the update rule (5), while the second averages several stochastic subgradients for every update, each with a potentially different delay. [sent-82, score-0.355]
29 3.1 Simple delayed optimization: Our focus in this section is to remove the √B penalty for the delayed update rule (5) using Assumption II, which arises for non-smooth optimization because subgradients can vary drastically even when measured at nearby points. [sent-84, score-0.985]
30 We show that under the smoothness condition, the errors from delay become second order: the penalty is asymptotically negligible. [sent-85, score-0.274]
31 Furthermore, though we omit it here for lack of space, the analysis also extends to random delays as long as E[τ(t)²] ≤ B²; see the long version [2] for details. [sent-92, score-0.327]
32 The take-home message from Theorem 1 is thus that the penalty in convergence rate due to the delay τ (t) is asymptotically negligible. [sent-94, score-0.383]
33 In the next section, we show the implications of this result for robust distributed stochastic optimization algorithms. [sent-95, score-0.379]
34 3.2 Combinations of delays: In some scenarios—including distributed settings similar to those we discuss in the next section—the procedure has access not only to a single delayed gradient but to several stochastic gradients with different delays. [sent-97, score-1.21]
35 To abstract away the essential parts of this situation, we assume that the procedure receives n stochastic gradients g1, . . . , gn. [sent-98, score-0.434]
36 Then the procedure performs the following updates at time t: z(t + 1) = z(t) + Σ_{i=1}^n λi gi(t − τ(i)), x(t + 1) = argmin_{x∈X} { ⟨z(t + 1), x⟩ + (1/α(t + 1)) ψ(x) } (6). [sent-103, score-0.286]
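A short Python sketch of update (6), combining n stochastic gradients with per-source delays τ(i) and simplex weights λ; the uniform weights, the specific delays, and the toy objective are assumptions of the sketch.

import numpy as np

rng = np.random.default_rng(2)
d, R, T, n = 5, 1.0, 500, 4
taus = [1, 2, 3, 4]                 # per-source delays tau(i); illustrative
lam = np.ones(n) / n                # weights lambda in the simplex (uniform here)

def project_ball(v, R):
    m = np.linalg.norm(v)
    return v if m <= R else (R / m) * v

def stoch_grad(x):
    return (x - 0.1 * np.ones(d)) + 0.01 * rng.standard_normal(d)

z, x = np.zeros(d), np.zeros(d)
history = [x.copy()]
for t in range(T):
    # z(t+1) = z(t) + sum_i lambda_i g_i(t - tau(i))
    z = z + sum(lam[i] * stoch_grad(history[max(0, t - taus[i])]) for i in range(n))
    x = project_ball(-(R / np.sqrt(t + 1)) * z, R)
    history.append(x.copy())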
37 4 Distributed Optimization We now turn to what we see as the main purpose and application of the above results: developing robust and efficient algorithms for distributed stochastic optimization. [sent-109, score-0.324]
38 We divide the N samples among n workers so that each worker has an N/n-sized subset of data. [sent-115, score-0.575]
39 In online learning applications, the distribution P is the unknown distribution generating the data, and each worker receives a stream of independent data points ξ ∼ P . [sent-116, score-0.478]
40 Worker i uses its subset of the data, or its stream, to compute gi ∈ R^d, an estimate of the gradient ∇f of the global objective f. [sent-117, score-0.334]
41 We assume that gi is an unbiased estimate of ∇f (x), which is satisfied, for example, in the online setting or when each worker computes the gradient gi based on samples picked at random without replacement from its subset of the data. [sent-118, score-0.897]
42 The architectural assumptions we make are based on master/worker topologies, but the convergence results in Section 3 allow us to give procedures robust to delay and asynchrony. [sent-119, score-0.293]
43 A node at distance d from master has the parameter x(t − d). [sent-122, score-0.378]
44 A node at distance d from master computes gradient g(t − d). [sent-124, score-0.585]
45 Figure 3: Communication of gradient information toward the master node at time t from node 1 at depth d; node 1 passes up the average (1/3)(g1(t − d) + g2(t − d − 2) + g3(t − d − 2)) of its own gradient and the older gradients of its children at depth d + 1. [sent-125, score-0.654]
46 While the n gradients are computed in parallel in the naïve scheme, accumulating and averaging n gradients at the master takes Ω(n) time, offsetting the gains of parallelization, and the procedure is non-robust to laggard workers. [sent-128, score-0.797]
47 Cyclic Delayed Architecture This protocol is the delayed update algorithm mentioned in the introduction, and it computes n stochastic gradients of f (x) in parallel. [sent-129, score-0.836]
48 Formally, worker i has parameter x(t−τ ) and computes gi (t−τ ) = ∇F (x(t−τ ); ξi (t)) ∈ Rd , where ξi (t) is a random variable sampled at worker i from the distribution P . [sent-130, score-0.861]
49 At time t, the master receives gi (t − τ ) from some worker i, computes x(t + 1) and passes it back to worker i only. [sent-132, score-1.264]
50 Other workers do not see x(t + 1) and continue their gradient computations on stale parameter vectors. [sent-133, score-0.368]
51 In the simplest case, each node suffers a delay of τ = n, though our analysis applies to random delays as well. [sent-134, score-0.419]
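A minimal Python simulation of this cyclic protocol, showing why each worker's gradient arrives roughly n steps stale: the master hears from workers in round-robin order and returns the new parameter only to the worker it just heard from. The toy objective and constants are assumptions of the sketch.

import numpy as np

rng = np.random.default_rng(3)
d, R, n, rounds = 5, 1.0, 4, 200

def project_ball(v, R):
    m = np.linalg.norm(v)
    return v if m <= R else (R / m) * v

def stoch_grad(x):
    return (x - 0.1 * np.ones(d)) + 0.01 * rng.standard_normal(d)

z, x = np.zeros(d), np.zeros(d)
worker_param = [x.copy() for _ in range(n)]   # parameter each worker last received
t = 0
for _ in range(rounds):
    for i in range(n):                        # master hears from workers cyclically
        g = stoch_grad(worker_param[i])       # computed at a parameter ~n steps stale
        z = z + g
        t += 1
        x = project_ball(-(R / np.sqrt(t)) * z, R)
        worker_param[i] = x.copy()            # only worker i sees x(t+1); others stay stale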
52 Locally Averaged Delayed Architecture At a high level, the protocol we now describe combines the delayed updates of the cyclic delayed architecture with averaging techniques of previous work [13, 7]. [sent-137, score-1.209]
53 The algorithm works via a series of multicasting and aggregation steps on a spanning tree rooted at the master node. [sent-140, score-0.336]
54 In the broadcast phase, the master sends its current parameter vector x(t) to its immediate neighbors. [sent-141, score-0.289]
55 Every worker computes its local gradient at its new parameter (see Fig. 2). [sent-145, score-0.541]
56 The leaf nodes communicate their gradients to their parents, and the parent takes the gradients of the leaf nodes from the previous round (received at iteration t − 1) and averages them with its own gradient, passing this averaged gradient back up the tree. [sent-148, score-0.661]
57 Again simultaneously, each node takes the averaged gradient vectors of its children from the previous rounds, averages them with its current gradient vector, and passes the result up the spanning tree. [sent-149, score-0.535]
58 The master node receives an average of delayed gradients from the entire tree, giving rise to updates of the form (6). [sent-152, score-1.046]
59 We note that this is similar to the MPI all-reduce operation, except our implementation is non-blocking since we average delayed gradients with different delays at different nodes. [sent-153, score-0.718]
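A small Python sketch of one bottom-up aggregation pass on a spanning tree: each node averages its own gradient with its children's (older) averaged gradients, so the master ends up with a convex combination of delayed gradients, exactly the form (6). The tree shape and the stand-in gradients are assumptions; a real implementation pipelines these passes and is non-blocking.

import numpy as np

children = {0: [1, 2], 1: [3, 4], 2: [], 3: [], 4: []}   # node 0 is the master
grad = {i: np.full(3, float(i)) for i in children}        # stand-in local gradients

def aggregate(node):
    # Average this node's gradient with its children's aggregated gradients.
    # Nested averaging yields simplex weights lambda_i, not a uniform mean.
    vecs = [grad[node]] + [aggregate(c) for c in children[node]]
    return sum(vecs) / len(vecs)

g_master = aggregate(0)   # convex combination of (in practice, delayed) gradients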
60 4.1 Convergence rates for delayed distributed minimization: We turn now to corollaries of the results from the previous sections that show that even asynchronous distributed procedures achieve asymptotically faster rates (over centralized procedures). [sent-155, score-0.381]
61 The key is that workers can pipeline updates by computing asynchronously and in parallel, so each worker can compute a low variance estimate of the gradient ∇f (x). [sent-156, score-0.782]
62 We also assume that each worker i uses m independent samples ξi(j) ∼ P, j = 1, . . . , m, [sent-158, score-0.368]
63 to compute the stochastic gradient as gi(t) = (1/m) Σ_{j=1}^m ∇F(x(t); ξi(j)). [sent-161, score-0.492]
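A tiny Python sketch of this per-worker mini-batch estimate; the single-sample gradient below is a stand-in, and the point is only that averaging m independent samples shrinks the variance of gi(t) as O(1/m).

import numpy as np

rng = np.random.default_rng(4)
d, m = 5, 64

def grad_sample(x):
    # One-sample gradient grad F(x; xi); the noise stands in for data randomness.
    return (x - 0.1 * np.ones(d)) + rng.standard_normal(d)

def worker_gradient(x, m):
    # g_i(t) = (1/m) sum_{j=1}^m grad F(x(t); xi_i(j)); variance is O(1/m).
    return np.mean([grad_sample(x) for _ in range(m)], axis=0)

g = worker_gradient(np.zeros(d), m)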
64 Let ψ(x) = (1/2)‖x‖², assume the conditions in Theorem 1, and assume that each worker uses m samples ξ ∼ P to compute the gradient it communicates to the master. [sent-165, score-0.555]
65 . . . + σR/√(Tm). Proof: Noting that σ² = E[‖∇f(x) − gi(t)‖²] = E[‖∇f(x) − ∇F(x; ξ)‖²]/m = O(1/m) when workers use m independent stochastic gradient samples, the corollary is immediate. [sent-167, score-0.729]
66 As in Theorem 1, the corollary generalizes to random delays as long as E[τ²(t)] ≤ B² < ∞, with τ replaced by B in the result. [sent-168, score-0.301]
67 The cyclic delayed architecture has the drawback that information from a worker can take τ = O(n) time to reach the master. [sent-170, score-0.999]
68 As a result of the communication procedure, the master receives a convex combination of the stochastic gradients evaluated at each worker i. [sent-174, score-1.184]
69 Specifically, the master receives gradients of the form gλ(t) = Σ_{i=1}^n λi gi(t − τ(i)) for some λ in the simplex, where τ(i) is the delay of worker i, which puts us in the setting of Theorem 2. [sent-175, score-1.151]
70 We now make the reasonable assumption that the gradient errors ∇f (x(t)) − gi (t) are uncorrelated across the nodes in the network. [sent-176, score-0.377]
71 In statistical applications, for example, each worker may own independent data or receive streaming data from independent sources. [sent-177, score-0.362]
72 We also set ψ(x) = (1/2)‖x‖², and observe E[‖Σ_{i=1}^n λi(∇f(x(t − τ(i))) − gi(t − τ(i)))‖²] = Σ_{i=1}^n λi² E[‖∇f(x(t − τ(i))) − gi(t − τ(i))‖²]. [sent-178, score-0.294]
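For instance, with uniform weights λi = 1/n and a per-worker variance bound E‖∇f(x) − gi‖² ≤ σ², the display above gives the familiar 1/n variance reduction; this worked instance is spelled out here for concreteness, under the assumed uniform choice:

\[
\mathbb{E}\Big\|\sum_{i=1}^n \tfrac{1}{n}\big(\nabla f(x(t-\tau(i))) - g_i(t-\tau(i))\big)\Big\|^2
= \sum_{i=1}^n \tfrac{1}{n^2}\,\mathbb{E}\big\|\nabla f(x(t-\tau(i))) - g_i(t-\tau(i))\big\|^2
\le \frac{\sigma^2}{n}.
\]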
73 It is also possible—but outside the scope of this extended abstract—to give fast(er) convergence rates dependent on communication costs (details can be found in the long version [2]). [sent-187, score-0.287]
74 4.2 Running-time comparisons: We now explicitly study the running times of the centralized stochastic gradient algorithm (3), the cyclic delayed protocol with the update (5), and the locally averaged architecture with the update (6). [sent-189, score-1.395]
75 Let T be the number of units of time allocated to each algorithm, and let the centralized, cyclic delayed and locally averaged delayed algorithms complete Tcent , Tcycle and Tdist iterations, respectively, in time T . [sent-196, score-1.126]
76 We assume that the distributed methods use mcycle and mdist samples of ξ ∼ P to compute stochastic gradients. [sent-198, score-0.59]
77 For concreteness, we assume that communication is of the same order as computing the gradient of one sample ∇F (x; ξ). [sent-199, score-0.285]
78 For mcycle = Ω(n), the master requires mcycle/n units of time to receive one gradient update, so (mcycle/n) Tcycle = T. [sent-203, score-0.785]
79 In the local communication framework, if each node uses mdist samples to compute a gradient, the master receives a gradient every mdist units of time, and hence mdist Tdist = T . [sent-204, score-1.208]
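A back-of-the-envelope Python sketch of these timing relations, where one unit of time is one single-sample gradient evaluation; the budget and batch sizes are illustrative assumptions:

T = 10_000          # total time budget, in single-sample gradient evaluations
n = 16              # number of workers
m_cycle = n         # samples per gradient in the cyclic scheme (m_cycle = Omega(n))
m_dist = 4          # samples per gradient in the locally averaged scheme

T_cent = T                      # centralized: one sample gradient per unit of time
T_cycle = T * n // m_cycle      # from (m_cycle / n) * T_cycle = T
T_dist = T // m_dist            # from m_dist * T_dist = T
print(T_cent, T_cycle, T_dist)  # iteration counts completed in time T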
80 Asymptotically in the number of units of time T , both the cyclic and locally communicating stochastic optimization schemes have the same convergence rate. [sent-207, score-0.638]
81 Comparing the lower order terms, since D ≤ n for any network, the locally averaged algorithm always guarantees better performance than the cyclic algorithm. [sent-208, score-0.283]
82 • √n-by-√n grid: D = √n, so the distributed method has a factor of n^(2/3)/n^(1/3) = n^(1/3) improvement over the cyclic architecture. [sent-210, score-0.329]
83 5 Numerical Results Though this paper focuses mostly on the theoretical analysis of delayed stochastic methods, it is important to understand their practical aspects. [sent-212, score-0.529]
84 To that end, we use the cyclic delayed method (5) to solve a somewhat large logistic regression problem: minimize_x f(x) = (1/N) Σ_{i=1}^N log(1 + exp(−bi ⟨ai, x⟩)) subject to ‖x‖2 ≤ R. (8) [sent-213, score-0.56]
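A self-contained Python sketch of this objective and its mini-batch stochastic gradient, on synthetic stand-in data; the dimensions, data generation, and batch size are assumptions of the sketch, not the paper's experiment.

import numpy as np

rng = np.random.default_rng(5)
N, d, R = 1000, 20, 5.0
A = rng.standard_normal((N, d))                                           # rows are the a_i
b = np.sign(A @ rng.standard_normal(d) + 0.1 * rng.standard_normal(N))    # labels b_i in {-1, +1}

def f(x):
    # f(x) = (1/N) sum_i log(1 + exp(-b_i <a_i, x>))
    return np.mean(np.logaddexp(0.0, -b * (A @ x)))

def stoch_grad(x, idx):
    # Mini-batch gradient: d/dx log(1 + exp(m)) = sigmoid(m) * dm/dx, with m = -b_i <a_i, x>.
    margins = -b[idx] * (A[idx] @ x)
    s = 1.0 / (1.0 + np.exp(-margins))
    return (A[idx] * (-b[idx] * s)[:, None]).mean(axis=0)

g = stoch_grad(np.zeros(d), rng.integers(0, N, size=32))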
85 We simulate the cyclic delayed optimization algorithm (5) for the problem (8) for several choices of the number of workers n and the number of samples m computed at each worker. [sent-216, score-0.856]
86 Plot (a): convergence time assuming the cost of communication to the master and gradient computation are the same. [sent-221, score-0.678]
87 Plot (b): convergence time assuming the cost of communication to the master is 16 times that of gradient computation. [sent-222, score-0.678]
88 The two plots differ in the amount of time C required to communicate the parameters x between the master and the workers (relative to the amount of time to compute the gradient on one sample in the objective (8)). [sent-225, score-0.792]
89 In Fig. 4(a), each worker uses m = n samples to compute a stochastic gradient for the objective (8). [sent-230, score-0.741]
90 The plotted results show the delayed update (5) enjoys speedup (the ratio of time to ε-accuracy for an n-node system versus the centralized procedure) nearly linear in the number n of worker machines until n ≥ 15 or so. [sent-231, score-0.993]
91 Since we use the stepsize choice η(t) ∝ √(t/n), which yields the predicted convergence rate given by Corollary 1, the n²m/T ≈ n³/T term in the convergence rate presumably becomes non-negligible for larger n. [sent-232, score-0.368]
92 In Fig. 4(b), we study the effects of more costly communication by assuming that communication is C = 16 times more expensive than gradient computation. [sent-235, score-0.409]
93 As argued in the long version [2], we set the number of samples each worker computes to m = Cn = 16n and correspondingly reduce the damping stepsize η(t) ∝ √(t/(Cn)). [sent-236, score-0.617]
94 In the regime of more expensive communication—as our theoretical results predict—small numbers of workers still enjoy significant speedups over a centralized method, but eventually the cost of communication and delays mitigate some of the benefits of parallelization. [sent-237, score-0.699]
95 6 Conclusion and Discussion In this paper, we have studied delayed dual averaging algorithms for stochastic optimization, showing applications of our results to distributed optimization. [sent-239, score-0.856]
96 We showed that for smooth problems, we can preserve the performance benefits of parallelization over centralized stochastic optimization even when we relax synchronization requirements. [sent-240, score-0.497]
97 Specifically, we presented methods that take advantage of distributed computational resources and are robust to node failures, communication latency, and node slowdowns. [sent-241, score-0.468]
98 Dual averaging for distributed optimization: convergence analysis and network scaling. [sent-296, score-0.356]
99 Distributed stochastic subgradient projection c algorithms for convex optimization. [sent-367, score-0.257]
100 Dual averaging methods for regularized stochastic learning and online optimization. [sent-380, score-0.286]
wordName wordTfidf (topN-words)
[('delayed', 0.371), ('worker', 0.334), ('master', 0.289), ('workers', 0.207), ('delays', 0.192), ('cyclic', 0.189), ('asynchronous', 0.183), ('centralized', 0.176), ('gradient', 0.161), ('stochastic', 0.158), ('gradients', 0.155), ('stepsize', 0.15), ('gi', 0.147), ('distributed', 0.14), ('delay', 0.138), ('erent', 0.125), ('communication', 0.124), ('nedi', 0.122), ('mcycle', 0.116), ('mdist', 0.116), ('averaging', 0.1), ('asymptotically', 0.093), ('node', 0.089), ('dekel', 0.089), ('receives', 0.088), ('dual', 0.087), ('di', 0.084), ('architecture', 0.079), ('convergence', 0.078), ('tcent', 0.07), ('tcycle', 0.07), ('tdist', 0.07), ('url', 0.069), ('spite', 0.067), ('parallel', 0.065), ('subgradient', 0.063), ('update', 0.061), ('bertsekas', 0.057), ('corollary', 0.056), ('optimization', 0.055), ('updates', 0.054), ('long', 0.053), ('assumptions', 0.051), ('units', 0.049), ('averaged', 0.048), ('spanning', 0.047), ('computes', 0.046), ('locally', 0.046), ('protocol', 0.045), ('gr', 0.044), ('agarwal', 0.043), ('penalty', 0.043), ('nodes', 0.042), ('smooth', 0.042), ('ii', 0.04), ('network', 0.038), ('communicating', 0.037), ('convex', 0.036), ('shamir', 0.036), ('architectures', 0.036), ('processors', 0.035), ('synchronization', 0.035), ('topologies', 0.035), ('tn', 0.034), ('nt', 0.034), ('rd', 0.034), ('samples', 0.034), ('corollaries', 0.033), ('procedure', 0.033), ('gn', 0.032), ('rates', 0.032), ('theorem', 0.031), ('maintains', 0.031), ('rate', 0.031), ('parallelization', 0.031), ('nemirovski', 0.031), ('depth', 0.03), ('juditsky', 0.03), ('omit', 0.029), ('fi', 0.029), ('averages', 0.029), ('satis', 0.029), ('communicate', 0.029), ('subgradients', 0.029), ('objective', 0.028), ('receive', 0.028), ('iii', 0.028), ('stream', 0.028), ('online', 0.028), ('analyze', 0.027), ('scenarios', 0.027), ('government', 0.027), ('duchi', 0.027), ('assumption', 0.027), ('robust', 0.026), ('time', 0.026), ('argmin', 0.026), ('compute', 0.026), ('speedup', 0.025), ('receiving', 0.025)]
simIndex simValue paperId paperTitle
same-paper 1 0.99999946 72 nips-2011-Distributed Delayed Stochastic Optimization
Author: Alekh Agarwal, John C. Duchi
Abstract: We analyze the convergence of gradient-based optimization algorithms whose updates depend on delayed stochastic gradient information. The main application of our results is to the development of distributed minimization algorithms where a master node performs parameter updates while worker nodes compute stochastic gradients based on local information in parallel, which may give rise to delays due to asynchrony. Our main contribution is to show that for smooth stochastic problems, the delays are asymptotically negligible. In application to distributed optimization, we show n-node architectures whose optimization error in stochastic problems—in √ spite of asynchronous delays—scales asymptotically as O(1/ nT ), which is known to be optimal even in the absence of delays. 1
2 0.25778168 66 nips-2011-Crowdclustering
Author: Ryan G. Gomes, Peter Welinder, Andreas Krause, Pietro Perona
Abstract: Is it possible to crowdsource categorization? Amongst the challenges: (a) each worker has only a partial view of the data, (b) different workers may have different clustering criteria and may produce different numbers of categories, (c) the underlying category structure may be hierarchical. We propose a Bayesian model of how workers may approach clustering and show how one may infer clusters / categories, as well as worker parameters, using this model. Our experiments, carried out on large collections of images, suggest that Bayesian crowdclustering works well and may be superior to single-expert annotations. 1
3 0.24609482 137 nips-2011-Iterative Learning for Reliable Crowdsourcing Systems
Author: David R. Karger, Sewoong Oh, Devavrat Shah
Abstract: Crowdsourcing systems, in which tasks are electronically distributed to numerous “information piece-workers”, have emerged as an effective paradigm for humanpowered solving of large scale problems in domains such as image classification, data entry, optical character recognition, recommendation, and proofreading. Because these low-paid workers can be unreliable, nearly all crowdsourcers must devise schemes to increase confidence in their answers, typically by assigning each task multiple times and combining the answers in some way such as majority voting. In this paper, we consider a general model of such crowdsourcing tasks, and pose the problem of minimizing the total price (i.e., number of task assignments) that must be paid to achieve a target overall reliability. We give a new algorithm for deciding which tasks to assign to which workers and for inferring correct answers from the workers’ answers. We show that our algorithm significantly outperforms majority voting and, in fact, is asymptotically optimal through comparison to an oracle that knows the reliability of every worker. 1
4 0.14081709 121 nips-2011-Hogwild: A Lock-Free Approach to Parallelizing Stochastic Gradient Descent
Author: Benjamin Recht, Christopher Re, Stephen Wright, Feng Niu
Abstract: Stochastic Gradient Descent (SGD) is a popular algorithm that can achieve stateof-the-art performance on a variety of machine learning tasks. Several researchers have recently proposed schemes to parallelize SGD, but all require performancedestroying memory locking and synchronization. This work aims to show using novel theoretical analysis, algorithms, and implementation that SGD can be implemented without any locking. We present an update scheme called H OGWILD ! which allows processors access to shared memory with the possibility of overwriting each other’s work. We show that when the associated optimization problem is sparse, meaning most gradient updates only modify small parts of the decision variable, then H OGWILD ! achieves a nearly optimal rate of convergence. We demonstrate experimentally that H OGWILD ! outperforms alternative schemes that use locking by an order of magnitude. 1
5 0.11761653 187 nips-2011-Non-Asymptotic Analysis of Stochastic Approximation Algorithms for Machine Learning
Author: Eric Moulines, Francis R. Bach
Abstract: We consider the minimization of a convex objective function defined on a Hilbert space, which is only available through unbiased estimates of its gradients. This problem includes standard machine learning algorithms such as kernel logistic regression and least-squares regression, and is commonly referred to as a stochastic approximation problem in the operations research community. We provide a non-asymptotic analysis of the convergence of two well-known algorithms, stochastic gradient descent (a.k.a. Robbins-Monro algorithm) as well as a simple modification where iterates are averaged (a.k.a. Polyak-Ruppert averaging). Our analysis suggests that a learning rate proportional to the inverse of the number of iterations, while leading to the optimal convergence rate in the strongly convex case, is not robust to the lack of strong convexity or the setting of the proportionality constant. This situation is remedied when using slower decays together with averaging, robustly leading to the optimal rate of convergence. We illustrate our theoretical results with simulations on synthetic and standard datasets.
6 0.10879168 46 nips-2011-Better Mini-Batch Algorithms via Accelerated Gradient Methods
7 0.076847836 271 nips-2011-Statistical Tests for Optimization Efficiency
8 0.07525675 202 nips-2011-On the Universality of Online Mirror Descent
9 0.072580762 45 nips-2011-Beating SGD: Learning SVMs in Sublinear Time
10 0.071663037 118 nips-2011-High-dimensional regression with noisy and missing data: Provable guarantees with non-convexity
11 0.071124554 96 nips-2011-Fast and Balanced: Efficient Label Tree Learning for Large Scale Object Recognition
12 0.066796884 78 nips-2011-Efficient Methods for Overlapping Group Lasso
13 0.066577874 79 nips-2011-Efficient Offline Communication Policies for Factored Multiagent POMDPs
14 0.064959832 29 nips-2011-Algorithms and hardness results for parallel large margin learning
15 0.064298324 63 nips-2011-Convergence Rates of Inexact Proximal-Gradient Methods for Convex Optimization
16 0.063970454 282 nips-2011-The Fast Convergence of Boosting
17 0.062481672 141 nips-2011-Large-Scale Category Structure Aware Image Categorization
18 0.060303636 218 nips-2011-Predicting Dynamic Difficulty
19 0.056827363 215 nips-2011-Policy Gradient Coagent Networks
20 0.056308772 207 nips-2011-Optimal learning rates for least squares SVMs using Gaussian kernels
topicId topicWeight
[(0, 0.185), (1, -0.044), (2, -0.023), (3, -0.034), (4, -0.034), (5, 0.023), (6, -0.063), (7, -0.046), (8, -0.104), (9, -0.005), (10, 0.035), (11, -0.065), (12, -0.176), (13, -0.177), (14, 0.077), (15, 0.188), (16, -0.059), (17, 0.029), (18, 0.162), (19, 0.187), (20, 0.004), (21, -0.178), (22, -0.11), (23, -0.235), (24, 0.121), (25, -0.11), (26, 0.053), (27, -0.091), (28, -0.084), (29, -0.079), (30, -0.04), (31, 0.064), (32, 0.033), (33, -0.027), (34, 0.013), (35, -0.058), (36, 0.04), (37, -0.011), (38, 0.053), (39, 0.033), (40, 0.059), (41, 0.047), (42, -0.079), (43, 0.035), (44, -0.04), (45, 0.001), (46, 0.054), (47, -0.01), (48, 0.05), (49, -0.052)]
simIndex simValue paperId paperTitle
same-paper 1 0.92409176 72 nips-2011-Distributed Delayed Stochastic Optimization
Author: Alekh Agarwal, John C. Duchi
Abstract: We analyze the convergence of gradient-based optimization algorithms whose updates depend on delayed stochastic gradient information. The main application of our results is to the development of distributed minimization algorithms where a master node performs parameter updates while worker nodes compute stochastic gradients based on local information in parallel, which may give rise to delays due to asynchrony. Our main contribution is to show that for smooth stochastic problems, the delays are asymptotically negligible. In application to distributed optimization, we show n-node architectures whose optimization error in stochastic problems—in spite of asynchronous delays—scales asymptotically as O(1/√(nT)), which is known to be optimal even in the absence of delays.
2 0.76786584 137 nips-2011-Iterative Learning for Reliable Crowdsourcing Systems
Author: David R. Karger, Sewoong Oh, Devavrat Shah
Abstract: Crowdsourcing systems, in which tasks are electronically distributed to numerous “information piece-workers”, have emerged as an effective paradigm for humanpowered solving of large scale problems in domains such as image classification, data entry, optical character recognition, recommendation, and proofreading. Because these low-paid workers can be unreliable, nearly all crowdsourcers must devise schemes to increase confidence in their answers, typically by assigning each task multiple times and combining the answers in some way such as majority voting. In this paper, we consider a general model of such crowdsourcing tasks, and pose the problem of minimizing the total price (i.e., number of task assignments) that must be paid to achieve a target overall reliability. We give a new algorithm for deciding which tasks to assign to which workers and for inferring correct answers from the workers’ answers. We show that our algorithm significantly outperforms majority voting and, in fact, is asymptotically optimal through comparison to an oracle that knows the reliability of every worker. 1
3 0.69361395 66 nips-2011-Crowdclustering
Author: Ryan G. Gomes, Peter Welinder, Andreas Krause, Pietro Perona
Abstract: Is it possible to crowdsource categorization? Amongst the challenges: (a) each worker has only a partial view of the data, (b) different workers may have different clustering criteria and may produce different numbers of categories, (c) the underlying category structure may be hierarchical. We propose a Bayesian model of how workers may approach clustering and show how one may infer clusters / categories, as well as worker parameters, using this model. Our experiments, carried out on large collections of images, suggest that Bayesian crowdclustering works well and may be superior to single-expert annotations. 1
4 0.57043719 121 nips-2011-Hogwild: A Lock-Free Approach to Parallelizing Stochastic Gradient Descent
Author: Benjamin Recht, Christopher Re, Stephen Wright, Feng Niu
Abstract: Stochastic Gradient Descent (SGD) is a popular algorithm that can achieve stateof-the-art performance on a variety of machine learning tasks. Several researchers have recently proposed schemes to parallelize SGD, but all require performancedestroying memory locking and synchronization. This work aims to show using novel theoretical analysis, algorithms, and implementation that SGD can be implemented without any locking. We present an update scheme called H OGWILD ! which allows processors access to shared memory with the possibility of overwriting each other’s work. We show that when the associated optimization problem is sparse, meaning most gradient updates only modify small parts of the decision variable, then H OGWILD ! achieves a nearly optimal rate of convergence. We demonstrate experimentally that H OGWILD ! outperforms alternative schemes that use locking by an order of magnitude. 1
5 0.52611023 187 nips-2011-Non-Asymptotic Analysis of Stochastic Approximation Algorithms for Machine Learning
Author: Eric Moulines, Francis R. Bach
Abstract: We consider the minimization of a convex objective function defined on a Hilbert space, which is only available through unbiased estimates of its gradients. This problem includes standard machine learning algorithms such as kernel logistic regression and least-squares regression, and is commonly referred to as a stochastic approximation problem in the operations research community. We provide a non-asymptotic analysis of the convergence of two well-known algorithms, stochastic gradient descent (a.k.a. Robbins-Monro algorithm) as well as a simple modification where iterates are averaged (a.k.a. Polyak-Ruppert averaging). Our analysis suggests that a learning rate proportional to the inverse of the number of iterations, while leading to the optimal convergence rate in the strongly convex case, is not robust to the lack of strong convexity or the setting of the proportionality constant. This situation is remedied when using slower decays together with averaging, robustly leading to the optimal rate of convergence. We illustrate our theoretical results with simulations on synthetic and standard datasets.
6 0.49501538 46 nips-2011-Better Mini-Batch Algorithms via Accelerated Gradient Methods
7 0.43502954 45 nips-2011-Beating SGD: Learning SVMs in Sublinear Time
8 0.41842094 271 nips-2011-Statistical Tests for Optimization Efficiency
9 0.33658144 202 nips-2011-On the Universality of Online Mirror Descent
10 0.325073 208 nips-2011-Optimistic Optimization of a Deterministic Function without the Knowledge of its Smoothness
11 0.31931984 63 nips-2011-Convergence Rates of Inexact Proximal-Gradient Methods for Convex Optimization
12 0.31448925 42 nips-2011-Bayesian Bias Mitigation for Crowdsourcing
13 0.30963376 31 nips-2011-An Application of Tree-Structured Expectation Propagation for Channel Decoding
14 0.30361706 218 nips-2011-Predicting Dynamic Difficulty
15 0.26868597 95 nips-2011-Fast and Accurate k-means For Large Datasets
16 0.26539877 242 nips-2011-See the Tree Through the Lines: The Shazoo Algorithm
17 0.25990799 77 nips-2011-Efficient Learning of Generalized Linear and Single Index Models with Isotonic Regression
18 0.25641277 177 nips-2011-Multi-armed bandits on implicit metric spaces
19 0.25591311 235 nips-2011-Recovering Intrinsic Images with a Global Sparsity Prior on Reflectance
20 0.25546533 36 nips-2011-Analysis and Improvement of Policy Gradient Estimation
topicId topicWeight
[(0, 0.047), (4, 0.037), (20, 0.024), (26, 0.029), (31, 0.103), (33, 0.012), (43, 0.055), (45, 0.126), (56, 0.013), (57, 0.05), (65, 0.01), (74, 0.051), (83, 0.033), (86, 0.262), (99, 0.079)]
simIndex simValue paperId paperTitle
1 0.90080661 41 nips-2011-Autonomous Learning of Action Models for Planning
Author: Neville Mehta, Prasad Tadepalli, Alan Fern
Abstract: This paper introduces two new frameworks for learning action models for planning. In the mistake-bounded planning framework, the learner has access to a planner for the given model representation, a simulator, and a planning problem generator, and aims to learn a model with at most a polynomial number of faulty plans. In the planned exploration framework, the learner does not have access to a problem generator and must instead design its own problems, plan for them, and converge with at most a polynomial number of planning attempts. The paper reduces learning in these frameworks to concept learning with one-sided error and provides algorithms for successful learning in both frameworks. A specific family of hypothesis spaces is shown to be efficiently learnable in both the frameworks. 1
same-paper 2 0.80583489 72 nips-2011-Distributed Delayed Stochastic Optimization
Author: Alekh Agarwal, John C. Duchi
Abstract: We analyze the convergence of gradient-based optimization algorithms whose updates depend on delayed stochastic gradient information. The main application of our results is to the development of distributed minimization algorithms where a master node performs parameter updates while worker nodes compute stochastic gradients based on local information in parallel, which may give rise to delays due to asynchrony. Our main contribution is to show that for smooth stochastic problems, the delays are asymptotically negligible. In application to distributed optimization, we show n-node architectures whose optimization error in stochastic problems—in spite of asynchronous delays—scales asymptotically as O(1/√(nT)), which is known to be optimal even in the absence of delays.
3 0.75138915 57 nips-2011-Comparative Analysis of Viterbi Training and Maximum Likelihood Estimation for HMMs
Author: Armen Allahverdyan, Aram Galstyan
Abstract: We present an asymptotic analysis of Viterbi Training (VT) and contrast it with a more conventional Maximum Likelihood (ML) approach to parameter estimation in Hidden Markov Models. While ML estimator works by (locally) maximizing the likelihood of the observed data, VT seeks to maximize the probability of the most likely hidden state sequence. We develop an analytical framework based on a generating function formalism and illustrate it on an exactly solvable model of HMM with one unambiguous symbol. For this particular model the ML objective function is continuously degenerate. VT objective, in contrast, is shown to have only finite degeneracy. Furthermore, VT converges faster and results in sparser (simpler) models, thus realizing an automatic Occam’s razor for HMM learning. For more general scenario VT can be worse compared to ML but still capable of correctly recovering most of the parameters. 1
4 0.61945015 186 nips-2011-Noise Thresholds for Spectral Clustering
Author: Sivaraman Balakrishnan, Min Xu, Akshay Krishnamurthy, Aarti Singh
Abstract: Although spectral clustering has enjoyed considerable empirical success in machine learning, its theoretical properties are not yet fully developed. We analyze the performance of a spectral algorithm for hierarchical clustering and show that on a class of hierarchically structured similarity matrices, this algorithm can tolerate noise that grows with the number of data points while still perfectly recovering the hierarchical clusters with high probability. We additionally improve upon previous results for k-way spectral clustering to derive conditions under which spectral clustering makes no mistakes. Further, using minimax analysis, we derive tight upper and lower bounds for the clustering problem and compare the performance of spectral clustering to these information theoretic limits. We also present experiments on simulated and real world data illustrating our results. 1
5 0.61880118 37 nips-2011-Analytical Results for the Error in Filtering of Gaussian Processes
Author: Alex K. Susemihl, Ron Meir, Manfred Opper
Abstract: Bayesian filtering of stochastic stimuli has received a great deal of attention recently. It has been applied to describe the way in which biological systems dynamically represent and make decisions about the environment. There have been no exact results for the error in the biologically plausible setting of inference on point process, however. We present an exact analysis of the evolution of the mean-squared error in a state estimation task using Gaussian-tuned point processes as sensors. This allows us to study the dynamics of the error of an optimal Bayesian decoder, providing insights into the limits obtainable in this task. This is done for Markovian and a class of non-Markovian Gaussian processes. We find that there is an optimal tuning width for which the error is minimized. This leads to a characterization of the optimal encoding for the setting as a function of the statistics of the stimulus, providing a mathematically sound primer for an ecological theory of sensory processing. 1
6 0.61608005 159 nips-2011-Learning with the weighted trace-norm under arbitrary sampling distributions
7 0.61333436 16 nips-2011-A reinterpretation of the policy oscillation phenomenon in approximate policy iteration
8 0.61199361 278 nips-2011-TD gamma: Re-evaluating Complex Backups in Temporal Difference Learning
9 0.60972005 32 nips-2011-An Empirical Evaluation of Thompson Sampling
10 0.60715985 263 nips-2011-Sparse Manifold Clustering and Embedding
11 0.60673541 258 nips-2011-Sparse Bayesian Multi-Task Learning
12 0.60672706 149 nips-2011-Learning Sparse Representations of High Dimensional Data on Large Scale Dictionaries
13 0.60660076 271 nips-2011-Statistical Tests for Optimization Efficiency
14 0.60487938 121 nips-2011-Hogwild: A Lock-Free Approach to Parallelizing Stochastic Gradient Descent
15 0.60433412 283 nips-2011-The Fixed Points of Off-Policy TD
16 0.60386878 180 nips-2011-Multiple Instance Filtering
17 0.60349089 102 nips-2011-Generalised Coupled Tensor Factorisation
18 0.60338074 29 nips-2011-Algorithms and hardness results for parallel large margin learning
19 0.60191119 66 nips-2011-Crowdclustering
20 0.6017133 144 nips-2011-Learning Auto-regressive Models from Sequence and Non-sequence Data