nips nips2013 nips2013-333 knowledge-graph by maker-knowledge-mining

333 nips-2013-Trading Computation for Communication: Distributed Stochastic Dual Coordinate Ascent


Source: pdf

Author: Tianbao Yang

Abstract: We present and study a distributed optimization algorithm that employs a stochastic dual coordinate ascent method. Stochastic dual coordinate ascent methods enjoy strong theoretical guarantees and often perform better than stochastic gradient descent methods in optimizing regularized loss minimization problems, yet they have received little study in a distributed framework. We make progress along this line by presenting a distributed stochastic dual coordinate ascent algorithm for a star network, together with an analysis of the tradeoff between computation and communication. We verify our analysis by experiments on real data sets. Moreover, we compare the proposed algorithm with distributed stochastic gradient descent methods and distributed alternating direction methods of multipliers for optimizing SVMs in the same distributed framework, and observe competitive performance.

Reference: text


Summary: the most important sentences generated by the tfidf model

sentIndex sentText sentNum sentScore

1 Abstract: We present and study a distributed optimization algorithm that employs a stochastic dual coordinate ascent method. [sent-2, score-0.444]

2 Stochastic dual coordinate ascent methods enjoy strong theoretical guarantees and often perform better than stochastic gradient descent methods in optimizing regularized loss minimization problems. [sent-3, score-0.564]

3 However, they have received little study in a distributed framework. [sent-4, score-0.181]

4 We make progress along this line by presenting a distributed stochastic dual coordinate ascent algorithm for a star network, with an analysis of the tradeoff between computation and communication. [sent-5, score-0.569]

5 Moreover, we compare the proposed algorithm with distributed stochastic gradient descent methods and distributed alternating direction methods of multipliers for optimizing SVMs in the same distributed framework, and observe competitive performances. [sent-7, score-0.514]

6 However, little effort has been devoted to studying them in a distributed fashion and to comparing them with SGD-based and ADMM-based distributed algorithms. [sent-21, score-0.316]

7 It enjoys strong convergence-rate guarantees for smooth and non-smooth loss functions. [sent-24, score-0.148]

8 • We analyze the tradeoff between computation and communication of DisDCA induced by m and K. [sent-25, score-0.271]

9 Intuitively, increasing the number m of dual variables per iteration aims at reducing the number of iterations for convergence and therefore mitigating the pressure caused by communication. [sent-26, score-0.302]

10 • We present a practical variant of DisDCA and make a comparison with distributed ADMM. [sent-28, score-0.3]

11 We verify our analysis by experiments and demonstrate the effectiveness of DisDCA by comparing with SGD-based and ADMM-based distributed optimization algorithms running in the same distributed framework. [sent-29, score-0.264]

12 2 Related Work. Recent years have seen the emergence of many distributed algorithms for solving machine learning related problems [2, 9]. [sent-30, score-0.137]

13 Many of them are based on stochastic gradient descent methods or alternating direction methods of multipliers. [sent-32, score-0.175]

14 Distributed SGD methods utilize the computing resources of multiple machines to handle a large number of examples simultaneously, which to some extent alleviates the high per-iteration computational load of GD methods and also improves the performance of sequential SGD methods. [sent-33, score-0.163]

15 The simplest implementation of a distributed SGD method is to calculate the stochastic gradients on multiple machines, and to collect these stochastic gradients for updating the solution on a master machine. [sent-34, score-0.264]
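The master-averaging scheme just described can be made concrete with a minimal sketch; this is not code from the paper, and the shard layout, learning rate, and least-squares loss are illustrative assumptions:

```python
import numpy as np

# Minimal sketch of the simplest distributed SGD scheme described above:
# each "machine" holds a shard, computes a stochastic gradient on one sampled
# example, and a master averages the K gradients to update the shared solution.
def sgd_grad(w, X, y, rng):
    i = rng.integers(len(y))                  # one sampled example on this machine
    return (X[i] @ w - y[i]) * X[i]           # gradient of 0.5 * (x_i' w - y_i)^2

def distributed_sgd_step(w, shards, lr, rng):
    grads = [sgd_grad(w, X, y, rng) for (X, y) in shards]  # parallel in practice
    return w - lr * np.mean(grads, axis=0)    # master collects and averages

rng = np.random.default_rng(0)
w_true = np.array([1.0, -2.0])
shards = [(X, X @ w_true) for X in (rng.normal(size=(50, 2)) for _ in range(4))]
w = np.zeros(2)
for _ in range(3000):
    w = distributed_sgd_step(w, shards, lr=0.05, rng=rng)
```

Because the toy data are noiseless, the iterate converges to the generating vector; with noisy data the constant step size would leave a variance floor.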

16 ADMM has been employed for solving machine learning problems in a distributed fashion [2, 23], due to its superior convergence and performance [5, 23]. [sent-37, score-0.223]

17 The algorithms that adopt ADMM for solving the RLM problems in a distributed framework are based on the idea of global variable consensus. [sent-39, score-0.156]
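The global-variable-consensus idea can be sketched as follows; this is an illustrative toy with quadratic losses chosen so the local subproblems have closed-form solves (the function names and parameters are assumptions, not the paper's or [2]'s implementation):

```python
import numpy as np

# Toy global-variable-consensus ADMM: K machines each hold a quadratic loss
# f_k(w) = 0.5 * ||A_k w - b_k||^2 and solve a local subproblem; a consensus
# variable z plus per-machine Lagrangian variables u_k tie the copies together.
def consensus_admm(As, bs, rho=1.0, iters=1000):
    K, d = len(As), As[0].shape[1]
    ws, us, z = np.zeros((K, d)), np.zeros((K, d)), np.zeros(d)
    for _ in range(iters):
        for k in range(K):  # local solves, parallel across machines in practice
            ws[k] = np.linalg.solve(As[k].T @ As[k] + rho * np.eye(d),
                                    As[k].T @ bs[k] + rho * (z - us[k]))
        z = (ws + us).mean(axis=0)   # consensus (averaging) update
        us += ws - z                 # dual / Lagrangian update
    return z

rng = np.random.default_rng(1)
w_true = np.array([0.5, -1.0, 2.0])
As = [rng.normal(size=(30, 3)) for _ in range(4)]
bs = [A @ w_true for A in As]
z = consensus_admm(As, bs)
```

Only z and the u_k cross the network each round, which is what makes the consensus formulation natural for a star topology.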

18 Zhang [16] has derived new bounds on the duality gap, which have been shown to be superior to earlier results. [sent-48, score-0.123]

19 However, little effort has been made to extend these types of methods to a distributed setting and to compare them with SGD-based and ADMM-based distributed algorithms. [sent-49, score-0.299]

20 We bridge this gap by presenting and studying a distributed stochastic dual coordinate ascent algorithm. [sent-50, score-0.484]

21 [20] have recently published a paper studying the parallel speedup of mini-batch primal and dual methods for SVMs with the hinge loss, and establish convergence bounds for mini-batch Pegasos and SDCA that depend on the size of the mini-batch. [sent-53, score-0.417]

22 Other related but different work includes [3], which presents Shotgun, a parallel coordinate descent algorithm for solving ℓ1-regularized minimization problems. [sent-55, score-0.171]

23 All these issues are related to the tradeoff between communication and computation [22, 24]. [sent-59, score-0.247]

24 3 Distributed Stochastic Dual Coordinate Ascent. In this section, we present a distributed stochastic dual coordinate ascent (DisDCA) algorithm and its convergence bound, and analyze the tradeoff between computation and communication. [sent-61, score-0.609]

25 We also present a practical variant of DisDCA and make a comparison with ADMM. [sent-62, score-0.197]

26 It is easy to show that the problem in (1) has the dual problem given below: $\max_{\alpha \in \mathbb{R}^n} D(\alpha)$, where $D(\alpha) = \frac{1}{n}\sum_{i=1}^{n} -\phi_i^*(-\alpha_i) - \lambda g^*\left(\frac{1}{\lambda n}\sum_{i=1}^{n} \alpha_i x_i\right)$. (2) [sent-67, score-0.204]

27 Let $w_*$ be the optimal solution to the primal problem in (1) and $\alpha_*$ be the optimal solution to the dual problem in (2). [sent-68, score-0.288]
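As a concrete instance of this primal-dual pair, the sketch below evaluates P(w) and D(α) for the L1 hinge loss with g(w) = ||w||²/2 (the standard SVM dual form; the variable names are illustrative) and checks weak duality, D(α) ≤ P(w):

```python
import numpy as np

# Primal and dual objectives for the hinge-loss RLM problem with
# g(w) = 0.5 * ||w||^2, so that g* is also 0.5 * ||.||^2.
def primal(w, X, y, lam):
    return np.mean(np.maximum(0.0, 1.0 - y * (X @ w))) + 0.5 * lam * (w @ w)

def dual(alpha, X, y, lam):
    n = len(y)
    v = X.T @ alpha / (lam * n)              # the w(alpha) mapping
    # for the hinge loss, -phi_i^*(-alpha_i) = alpha_i * y_i on the feasible box
    return np.mean(y * alpha) - 0.5 * lam * (v @ v)

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))
y = np.where(rng.normal(size=50) > 0, 1.0, -1.0)
lam = 0.1
alpha = y * rng.uniform(0.0, 1.0, size=50)   # feasible: alpha_i * y_i in [0, 1]
w = X.T @ alpha / (lam * len(y))
gap = primal(w, X, y, lam) - dual(alpha, X, y, lam)   # weak duality: gap >= 0
```

The duality gap P(w_T) − D(α_T) computed this way is exactly the quantity bounded in the convergence analysis below.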

28 In this paper, we aim to optimize the dual problem (2) in a distributed environment where the data are distributed evenly across K machines. [sent-70, score-0.378]

29 We denote by $\alpha_{k,i}$ the dual variable associated with $x_{k,i}$, and by $\phi_{k,i}(\cdot)$, $\phi_{k,i}^*(\cdot)$ the corresponding loss function and its convex conjugate. [sent-76, score-0.25]

30 e.g., the L2 (squared) hinge loss $\phi_i(z) = \max(0, 1 - y_i z)^2$ and the logistic loss $\phi_i(z) = \log(1 + \exp(-y_i z))$. [sent-82, score-0.181]

31 Commonly used Lipschitz continuous functions are the L1 hinge loss $\phi_i(z) = \max(0, 1 - y_i z)$ and the absolute loss $\phi_i(z) = |y_i - z|$. [sent-83, score-0.2]
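The four losses just listed can be written down directly; a small sketch (the function names are illustrative):

```python
import numpy as np

# The losses from the text.  The first two are smooth (Lipschitz-continuous
# gradients); the last two are only Lipschitz continuous themselves.
def squared_hinge(z, y): return max(0.0, 1.0 - y * z) ** 2
def logistic(z, y):      return float(np.log1p(np.exp(-y * z)))
def hinge(z, y):         return max(0.0, 1.0 - y * z)
def absolute(z, y):      return abs(y - z)
```

The smooth/non-smooth distinction is what separates the two convergence theorems stated later.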

32 DisDCA Algorithm: The Basic Variant. The detailed steps of the basic variant of the DisDCA algorithm are described in pseudocode in Figure 1. [sent-96, score-0.176]

33 (i) At each iteration of the outer loop, m examples instead of one are randomly sampled for updating their dual variables. [sent-99, score-0.264]

34 (ii) After updating the m randomly selected dual variables, it invokes a Reduce function to collect the updated information from all K machines, which fits naturally into the distributed environment. [sent-101, score-0.401]

35 It is this step that involves the communication among the K machines. [sent-104, score-0.122]

36 Intuitively, smaller m yields less computation and slower convergence, and therefore more communication, and vice versa. [sent-105, score-0.224]

37 Remark: The goal of the updates is to increase the dual objective. [sent-107, score-0.196]

38 The particular options presented in the routine IncDual are chosen to maximize lower bounds of the dual objective. [sent-108, score-0.228]

39 Note that, different from the options presented in [16], the ones in IncDual use a slightly different scalar factor mK in the quadratic term to adapt to the number of dual variables updated per iteration. [sent-120, score-0.251]
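For the L1 hinge loss, the IncDual step has a well-known SDCA closed form; the sketch below includes the scaling factor scl (mK in the basic variant) in the quadratic term as described above. The function name and signature are assumptions for illustration:

```python
import numpy as np

# One IncDual-style step for the L1 hinge loss: the standard SDCA closed form,
# with the quadratic term scaled by `scl` (mK in the basic variant).
def inc_dual_hinge(w, x, y, alpha_i, lam, n, scl):
    grad = 1.0 - y * (x @ w)                 # margin violation at the current w
    denom = scl * (x @ x) / (lam * n)        # scaled quadratic coefficient
    new_val = np.clip(alpha_i * y + grad / denom, 0.0, 1.0)   # box constraint
    return y * new_val - alpha_i             # increment Delta alpha_i

# With w = 0 the margin is maximally violated, so the dual variable jumps to
# its upper bound:
delta = inc_dual_hinge(np.zeros(2), np.array([1.0, 0.0]), 1.0, 0.0,
                       lam=1.0, n=1, scl=1)
```

A larger scl shrinks each step, which is how the update compensates for mK dual variables moving at once.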

40 Convergence Analysis: Tradeoff between Computation and Communication. In this subsection, we present the convergence bound of the DisDCA algorithm and analyze the tradeoff among computation, convergence, and communication. [sent-122, score-0.166]

41 The theorem below states the convergence rate of the DisDCA algorithm for smooth loss functions (the omitted proofs and other derivations can be found in the supplementary materials). [sent-123, score-0.125]

42 For a $(1/\gamma)$-smooth loss function $\phi_i$ and a 1-strongly convex function $g(w)$, to obtain a duality gap of $\mathbb{E}[P(w_T) - D(\alpha_T)] \le \epsilon_P$, it suffices to have $T \ge \left(\frac{n}{mK} + \frac{1}{\lambda\gamma}\right) \log\left(\left(\frac{n}{mK} + \frac{1}{\lambda\gamma}\right) \frac{1}{\epsilon_P}\right)$. [sent-125, score-0.28]
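Under this bound (whose exact constants are reconstructed from the garbled text, so treat them as an assumption), the required number of iterations can be computed directly, and the sketch shows the qualitative effect of increasing mK:

```python
import math

# Iteration count suggested by the smooth-loss bound above.
def iterations_for_gap(n, m, K, lam, gamma, eps):
    c = n / (m * K) + 1.0 / (lam * gamma)
    return math.ceil(c * math.log(c / eps))

# More parallel dual updates per round (larger m*K) means fewer iterations,
# i.e. fewer communication rounds:
t_small = iterations_for_gap(n=10**5, m=1,   K=10, lam=1e-3, gamma=1.0, eps=1e-3)
t_large = iterations_for_gap(n=10**5, m=100, K=10, lam=1e-3, gamma=1.0, eps=1e-3)
```

Note the 1/(λγ) term does not shrink with mK, which is exactly the saturation effect analyzed in the tradeoff discussion below.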

43 The authors in [20] also presented an aggressive variant to adjust β adaptively and observed improvements. [sent-141, score-0.192]

44 We further develop a practical variant that enjoys more speed-up compared to the basic variant and their aggressive variant. [sent-143, score-0.451]

45 Tradeoff between Computation and Communication. We are now ready to discuss the tradeoff between computation and communication based on the worst-case analysis indicated by Theorem 1. [sent-144, score-0.247]

46 For the analysis of the tradeoff between computation and communication induced by the number of samples m and the number of machines K, we fix the number of examples n and the number of dimensions d. [sent-145, score-0.342]

47 We note that the communication cost is proportional to the number of iterations $T = \Omega\left(\frac{n}{mK} + \frac{n^{1/(1+\tau)}}{\gamma}\right)$, while the computation cost per node is proportional to $mT = \Omega\left(\frac{n}{K} + \frac{m\,n^{1/(1+\tau)}}{\gamma}\right)$, since each iteration involves m examples. [sent-155, score-0.318]

48 When $m \le \Theta\left(\frac{n^{\tau/(1+\tau)}}{K}\right)$, the communication cost decreases as m increases, and the computation cost increases as m increases, though it is dominated by $\Omega(n/K)$. [sent-156, score-0.211]

49 When the value of m is greater than $\Theta\left(\frac{n^{\tau/(1+\tau)}}{K}\right)$, the communication cost is dominated by $\Omega\left(\frac{n^{1/(1+\tau)}}{\gamma}\right)$; increasing m further has little effect on reducing the communication cost, while the computation cost blows up substantially. [sent-157, score-0.39]

50 Similarly, we can understand how the number of nodes K affects the tradeoff between the communication cost, proportional to $\tilde\Omega(KT) = \tilde\Omega\left(\frac{n}{m} + \frac{K\,n^{1/(1+\tau)}}{\gamma}\right)$, and the computation cost, proportional to $\Omega\left(\frac{n}{K} + \frac{m\,n^{1/(1+\tau)}}{\gamma}\right)$. [sent-158, score-0.159]

51 When $K \le \Theta\left(\frac{n^{\tau/(1+\tau)}}{m}\right)$, as K increases the computation cost decreases and the communication cost increases. [sent-159, score-0.221]

52 When K is greater than $\Theta\left(\frac{n^{\tau/(1+\tau)}}{m}\right)$, the computation cost is dominated by $\Omega\left(\frac{m\,n^{1/(1+\tau)}}{\gamma}\right)$, and the benefit of increasing K in reducing the computation cost diminishes. [sent-160, score-0.158]

53 Meanwhile, increasing the number of samples m would increase the computation cost and similarly increasing the number of nodes K would increase the communication cost. [sent-162, score-0.245]
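The tradeoff just described can be explored numerically; the cost model below uses the reconstructed exponents and should be read as an assumption-laden sketch, not the paper's exact formulas:

```python
# Toy cost model for the tradeoff above: T ~ n/(mK) + n^(1/(1+tau))/gamma,
# with communication ~ K*T and computation per node ~ m*T.
def costs(n, m, K, tau=1.0, gamma=1.0):
    T = n / (m * K) + n ** (1.0 / (1.0 + tau)) / gamma
    return K * T, m * T   # (communication, computation per node)

comm1, comp1 = costs(n=10**6, m=1, K=10)
comm2, comp2 = costs(n=10**6, m=64, K=10)
# Larger m: less communication, more computation per node.
```

Sweeping m in this model reproduces the two regimes in the text: communication falls steeply while n/(mK) dominates, then flattens once the n^(1/(1+τ))/γ term takes over.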

54 To verify the tradeoff between communication and computation, we present empirical studies in Section 4. [sent-164, score-0.245]

55 Although the smooth loss functions are the most interesting, we also state in the theorem below the convergence of DisDCA for Lipschitz continuous loss functions. [sent-165, score-0.2]

56 Remark: In this case, the effective region of m and K is $mK \le \Theta(n\lambda\epsilon_P)$, which is narrower than that for smooth loss functions, especially when $\epsilon_P \ll \gamma$. [sent-168, score-0.138]

57 Again, the practical variant presented in the next section yields more speed-up. [sent-170, score-0.197]

58 For j = 1, ..., m: randomly pick $i \in \{1, \dots, n_k\}$ and let $i_j = i$; find $\Delta\alpha_{k,i}$ by calling the routine IncDual($w = u^{j-1}$, scl $= K$); update $\alpha_{k,i}^{t} = \alpha_{k,i}^{t-1} + \Delta\alpha_{k,i}$ and $u^{j} = u^{j-1} + \frac{1}{\lambda n_k}\Delta\alpha_{k,i} x_{k,i}$. Figure 2: the updates at the t-th iteration of the practical variant of DisDCA. [sent-175, score-0.539]
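A Python sketch of the per-machine loop in Figure 2, using the hinge-loss closed form for the IncDual step (the array shapes, names, and the scl = K choice are assumptions for illustration):

```python
import numpy as np

# Per-machine inner loop of the practical variant: the local copy u of the
# primal solution is refreshed after every single dual update, and the
# quadratic term is scaled by K rather than mK.
def practical_local_updates(w_global, X_k, y_k, alpha_k, lam, n_k, K, m, rng):
    u = w_global.copy()
    delta = np.zeros_like(alpha_k)
    for _ in range(m):
        i = rng.integers(len(y_k))
        x, y = X_k[i], y_k[i]
        denom = K * (x @ x) / (lam * n_k)        # scl = K in the practical variant
        new_val = np.clip(alpha_k[i] * y + (1.0 - y * (x @ u)) / denom, 0.0, 1.0)
        d = y * new_val - alpha_k[i]
        alpha_k[i] += d
        delta[i] += d
        u += d * x / (lam * n_k)                 # use up-to-date local information
    return delta   # the master later forms w^t from all machines' increments

rng = np.random.default_rng(0)
X_k = rng.normal(size=(20, 3))
y_k = np.where(rng.normal(size=20) > 0, 1.0, -1.0)
alpha_k = np.zeros(20)
delta = practical_local_updates(np.zeros(3), X_k, y_k, alpha_k,
                                lam=0.1, n_k=20, K=2, m=10, rng=rng)
```

The only difference from the basic variant is that u is updated inside the loop, so each of the m updates sees the effect of the previous ones.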

59 A Practical Variant. We note that in Algorithm 1, when updating the values of the subsequently sampled dual variables, the algorithm does not use the updated information but instead uses $w^{t-1}$ from the last iteration. [sent-178, score-0.245]

60 Therefore, a potential improvement is to leverage the up-to-date information when updating the dual variables. [sent-179, score-0.208]

61 At the beginning of iteration t, all $w_k^0$, $k = 1, \dots, K$, are synchronized with the global $w^{t-1}$. [sent-181, score-0.14]

62 Then on each individual machine, the j-th sampled dual variable is updated by IncDual($w_k^{j-1}$, $K$), and the local copy is also updated by $w_k^{j} = w_k^{j-1} + \frac{1}{\lambda n_k}\Delta\alpha_{k,i_j} x_{k,i_j}$ for updating the next dual variable. [sent-182, score-0.628]

63 At the end of the iteration, the local solutions are synchronized to the global variable $w^{t} = w^{t-1} + \frac{1}{\lambda n}\sum_{k=1}^{K}\sum_{j=1}^{m}\Delta\alpha_{k,i_j} x_{k,i_j}$. [sent-183, score-0.184]

64 It is important to note that the scalar factor in IncDual is now K, because the dual variables are updated incrementally and there are K processes running in parallel. [sent-184, score-0.249]

65 The experiments in Section 4 verify the improvements of the practical variant over the basic variant. [sent-186, score-0.297]

66 However, next we establish a connection between DisDCA and ADMM that sheds light on the motivation behind the practical variant and the differences between the two algorithms. [sent-188, score-0.197]

67 The updates presented in Algorithm 1 are solutions to maximizing the lower bounds of the above objective function by decoupling the m dual variables. [sent-190, score-0.196]

68 It is not difficult to derive that the dual problem in (3) has the following primal problem (a detailed derivation and others can be found in the supplementary materials): DisDCA: $\min_{w} \frac{1}{n_k}\sum_{i=1}^{m}\phi_i(x_i^{\top} w) + \frac{\lambda}{2}\left\| w - \left(w^{t-1} - \frac{1}{\lambda n_k}\sum_{i=1}^{m}\alpha_i^{t-1} x_i\right)\right\|^2$. (4) [sent-191, score-0.391]

69 We refer to the shifted point in the quadratic term as the penalty solution. [sent-192, score-0.165]

70 (1) Both aim at solving the same type of problem, increasing the dual objective or decreasing the primal objective. [sent-196, score-0.288]

71 In DisDCA, $\hat{w}^{t-1}$ is constructed by subtracting from the global solution the local solution defined by the dual variables $\alpha$, while in ADMM it is constructed by subtracting from the global solution the local Lagrangian variables $u$. [sent-199, score-0.299]

72 Now, let us explain the practical variant of DisDCA from the viewpoint of inexactly solving the subproblem (4). [sent-201, score-0.273]

73 In fact, the updates at the t-th iteration of the practical variant of DisDCA optimize the subproblem (4) by the SDCA algorithm with only one pass over the sampled data points and an initialization of $\alpha_i^{0} = \alpha_i^{t-1}$, $i = 1, \dots, m$. [sent-206, score-0.301]

74 It means that the initial primal solution for solving the subproblem (3) is $u^{0} = \left(w^{t-1} - \frac{1}{\lambda n_k}\sum_{i=1}^{m}\alpha_i^{t-1} x_i\right) + \frac{1}{\lambda n_k}\sum_{i=1}^{m}\alpha_i^{t-1} x_i = w^{t-1}$. [sent-210, score-0.207]

75 This validates, from another point of view, the practical variant of DisDCA. [sent-215, score-0.197]

76 The experiments are performed on two large data sets with different numbers of features, covtype and kdd. [sent-220, score-0.154]

77 For the covtype data, we use 522,911 examples for training. [sent-223, score-0.172]

78 We apply the algorithms to two SVM formulations, namely the L2-SVM with the squared hinge loss and the L1-SVM with the hinge loss, to demonstrate the capabilities of DisDCA in handling both smooth loss functions and Lipschitz continuous loss functions. [sent-224, score-0.396]

79 In the figure legends, we use DisDCA-b to denote the basic variant, DisDCA-p the practical variant, and DisDCA-a the aggressive variant of DisDCA [20]. [sent-225, score-0.291]

80 It is notable that the effective region of m and K for the practical variant is much larger than that for the basic variant. [sent-231, score-0.31]

81 The speed-up gain on the kdd data from increasing m is even larger because the communication cost is much higher. [sent-233, score-0.213]

82 In the supplement, we present more results visualizing the communication and computation tradeoff. [sent-234, score-0.161]

83 The Practical Variant vs. the Basic Variant. To further demonstrate the usefulness of the practical variant, we present a comparison between the practical variant and the basic variant for optimizing the two SVM formulations (a reported time of 0 seconds means less than 1 second). [sent-235, score-0.487]

84 We exclude the time for computing the duality gap at each iteration. [sent-236, score-0.202]

[Figure 3 panels (a)-(f): duality-gap and primal-objective curves for DisDCA-b and DisDCA-p on covtype and kdd (L2SVM) under varying m, K, and λ, compared with ADMM-s (η=100) and Pegasos; numeric axis residue omitted.]

89 Figure 3: (a,b): duality gap with varying m; (d,e): duality gap with varying K; (c,f): comparison of different algorithms for optimizing SVMs. [sent-268, score-0.499]

90 We also include the performance of the aggressive variant proposed in [20], applying the aggressive updates to the m sampled examples on each machine without incurring additional communication cost. [sent-271, score-0.465]

91 The results show that the practical variant converges much faster than both the basic variant and the aggressive variant. [sent-272, score-0.428]

92 Comparison with Other Baselines. Lastly, we compare DisDCA with SGD-based and ADMM-based distributed algorithms running in the same distributed framework. [sent-273, score-0.227]

93 For optimizing the L2-SVM, we implement the stochastic average gradient (SAG) algorithm [15], which also enjoys linear convergence for smooth and strongly convex problems. [sent-274, score-0.237]
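For reference, SAG keeps one stored gradient per example and steps along their running average; a minimal sketch on ridge-regularized least squares rather than the L2-SVM (the step size, data, and loss are illustrative assumptions):

```python
import numpy as np

# Minimal SAG sketch: keep the most recent gradient seen for every example
# and step along the running average of the stored gradients.
def sag(X, y, lam, steps, lr, rng):
    n, d = X.shape
    g = np.zeros((n, d))          # stored per-example gradients
    g_sum = np.zeros(d)           # their running sum
    w = np.zeros(d)
    for _ in range(steps):
        i = rng.integers(n)
        gi = (X[i] @ w - y[i]) * X[i] + lam * w   # fresh gradient of example i
        g_sum += gi - g[i]
        g[i] = gi                                  # replace the stored copy
        w -= lr * g_sum / n
    return w

rng = np.random.default_rng(0)
w_true = np.array([1.0, -1.0])
X = rng.normal(size=(100, 2))
y = X @ w_true
w = sag(X, y, lam=1e-3, steps=40000, lr=0.002, rng=rng)
```

Unlike plain SGD, the averaged stored gradients let SAG use a constant step size while still driving the error down linearly on strongly convex problems.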

94 5 Conclusions. We have presented a distributed stochastic dual coordinate ascent algorithm with its convergence rates, and analyzed the tradeoff between computation and communication. [sent-285, score-0.586]

95 The practical variant achieves substantial improvements over the basic variant and other variants. [sent-286, score-0.373]

96 (Footnote 4: The primal objective of Pegasos on covtype is above the display range.) [sent-288, score-0.236]

97 A dual algorithm for the solution of nonlinear variational problems via finite element approximation. [sent-342, score-0.189]

98 A dual coordinate descent method for large-scale linear svm. [sent-357, score-0.264]

99 On the convergence of the coordinate descent method for convex differentiable minimization. [sent-388, score-0.154]

100 Efficient distributed linear classification algorithms via the alternating direction method of multipliers. [sent-470, score-0.166]


similar papers computed by tfidf model

tfidf for this paper:

wordName wordTfidf (topN-words)

[('disdca', 0.773), ('admm', 0.217), ('mk', 0.184), ('dual', 0.172), ('covtype', 0.154), ('wt', 0.14), ('variant', 0.137), ('duality', 0.123), ('communication', 0.122), ('incdual', 0.108), ('nk', 0.105), ('distributed', 0.103), ('tradeoff', 0.086), ('primal', 0.082), ('gap', 0.079), ('sdca', 0.079), ('scl', 0.077), ('dca', 0.075), ('practical', 0.06), ('ascent', 0.059), ('wk', 0.058), ('loss', 0.056), ('coordinate', 0.056), ('aggressive', 0.055), ('rlm', 0.055), ('stochastic', 0.054), ('performances', 0.054), ('machines', 0.053), ('sag', 0.047), ('obj', 0.047), ('hinge', 0.047), ('sgd', 0.047), ('ut', 0.046), ('pegasos', 0.045), ('subproblem', 0.042), ('convergence', 0.04), ('computation', 0.039), ('alternating', 0.039), ('basic', 0.039), ('iteration', 0.038), ('updated', 0.037), ('verify', 0.037), ('updating', 0.036), ('descent', 0.036), ('gd', 0.036), ('calling', 0.035), ('region', 0.035), ('efforts', 0.035), ('solving', 0.034), ('kdd', 0.034), ('routine', 0.033), ('fashion', 0.032), ('xi', 0.032), ('optimizing', 0.03), ('cost', 0.03), ('uj', 0.03), ('mpi', 0.029), ('smooth', 0.029), ('increasing', 0.027), ('lacks', 0.026), ('iterations', 0.025), ('penalty', 0.025), ('tak', 0.025), ('synchronized', 0.025), ('regularized', 0.025), ('direction', 0.024), ('vs', 0.024), ('updates', 0.024), ('invoked', 0.024), ('enjoys', 0.023), ('vice', 0.023), ('options', 0.023), ('gradient', 0.022), ('yi', 0.022), ('convex', 0.022), ('notable', 0.021), ('norm', 0.021), ('running', 0.021), ('option', 0.021), ('dominated', 0.02), ('parallel', 0.02), ('varying', 0.02), ('smola', 0.02), ('subtracting', 0.019), ('scalar', 0.019), ('lipschitz', 0.019), ('global', 0.019), ('iterate', 0.019), ('continuous', 0.019), ('effective', 0.018), ('capabilities', 0.018), ('xing', 0.018), ('examples', 0.018), ('studying', 0.017), ('implement', 0.017), ('subsection', 0.017), ('rd', 0.017), ('cores', 0.017), ('solution', 0.017), ('proportional', 0.017)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.99999988 333 nips-2013-Trading Computation for Communication: Distributed Stochastic Dual Coordinate Ascent


2 0.1585996 175 nips-2013-Linear Convergence with Condition Number Independent Access of Full Gradients

Author: Lijun Zhang, Mehrdad Mahdavi, Rong Jin

Abstract: For smooth and strongly convex optimizations, the optimal iteration complexity of the gradient-based algorithm is O(√κ log 1/ǫ), where κ is the condition number. In the case that the optimization problem is ill-conditioned, we need to evaluate a large number of full gradients, which could be computationally expensive. In this paper, we propose to remove the dependence on the condition number by allowing the algorithm to access stochastic gradients of the objective function. To this end, we present a novel algorithm named Epoch Mixed Gradient Descent (EMGD) that is able to utilize two kinds of gradients. A distinctive step in EMGD is the mixed gradient descent, where we use a combination of the full and stochastic gradients to update the intermediate solution. Theoretical analysis shows that EMGD is able to find an ǫ-optimal solution by computing O(log 1/ǫ) full gradients and O(κ² log 1/ǫ) stochastic gradients.

3 0.15344207 311 nips-2013-Stochastic Convex Optimization with Multiple Objectives

Author: Mehrdad Mahdavi, Tianbao Yang, Rong Jin

Abstract: In this paper, we are interested in the development of efficient algorithms for convex optimization problems in the simultaneous presence of multiple objectives and stochasticity in the first-order information. We cast the stochastic multiple objective optimization problem into a constrained optimization problem by choosing one function as the objective and trying to bound the other objectives by appropriate thresholds. We first examine a two-stage exploration-exploitation based algorithm which first approximates the stochastic objectives by sampling and then solves a constrained stochastic optimization problem by the projected gradient method. This method attains a suboptimal convergence rate even under strong assumptions on the objectives. Our second approach is an efficient primal-dual stochastic algorithm. It leverages the theory of the Lagrangian method in constrained optimization and attains the optimal convergence rate of O(1/√T) in high probability for general Lipschitz continuous objectives.

4 0.12885548 19 nips-2013-Accelerated Mini-Batch Stochastic Dual Coordinate Ascent

Author: Shai Shalev-Shwartz, Tong Zhang

Abstract: Stochastic dual coordinate ascent (SDCA) is an effective technique for solving regularized loss minimization problems in machine learning. This paper considers an extension of SDCA under the mini-batch setting that is often used in practice. Our main contribution is to introduce an accelerated mini-batch version of SDCA and prove a fast convergence rate for this method. We discuss an implementation of our method over a parallel computing system, and compare the results to both the vanilla stochastic dual coordinate ascent and to the accelerated deterministic gradient descent method of Nesterov [2007]. 1

5 0.12618284 193 nips-2013-Mixed Optimization for Smooth Functions

Author: Mehrdad Mahdavi, Lijun Zhang, Rong Jin

Abstract: It is well known that the optimal convergence rate for stochastic optimization of smooth functions is O(1/√T), which is the same as for stochastic optimization of Lipschitz continuous convex functions. This is in contrast to optimizing smooth functions using full gradients, which yields a convergence rate of O(1/T²). In this work, we consider a new setup for optimizing smooth functions, termed Mixed Optimization, which allows access to both a stochastic oracle and a full gradient oracle. Our goal is to significantly improve the convergence rate of stochastic optimization of smooth functions by allowing an additional small number of accesses to the full gradient oracle. We show that, with O(ln T) calls to the full gradient oracle and O(T) calls to the stochastic oracle, the proposed mixed optimization algorithm is able to achieve an optimization error of O(1/T).

6 0.10974785 20 nips-2013-Accelerating Stochastic Gradient Descent using Predictive Variance Reduction

7 0.10193785 268 nips-2013-Reflection methods for user-friendly submodular optimization

8 0.085868984 146 nips-2013-Large Scale Distributed Sparse Precision Estimation

9 0.078504778 158 nips-2013-Learning Multiple Models via Regularized Weighting

10 0.076560773 116 nips-2013-Fantope Projection and Selection: A near-optimal convex relaxation of sparse PCA

11 0.073025107 2 nips-2013-(Nearly) Optimal Algorithms for Private Online Learning in Full-information and Bandit Settings

12 0.06913317 321 nips-2013-Supervised Sparse Analysis and Synthesis Operators

13 0.068594292 1 nips-2013-(More) Efficient Reinforcement Learning via Posterior Sampling

14 0.063562155 142 nips-2013-Information-theoretic lower bounds for distributed statistical estimation with communication constraints

15 0.061708849 318 nips-2013-Structured Learning via Logistic Regression

16 0.060044199 94 nips-2013-Distributed $k$-means and $k$-median Clustering on General Topologies

17 0.057007138 89 nips-2013-Dimension-Free Exponentiated Gradient

18 0.052462727 65 nips-2013-Compressive Feature Learning

19 0.047938962 345 nips-2013-Variance Reduction for Stochastic Gradient Optimization

20 0.047885913 16 nips-2013-A message-passing algorithm for multi-agent trajectory planning


similar papers computed by lsi model

lsi for this paper:

topicId topicWeight

[(0, 0.136), (1, 0.004), (2, 0.081), (3, -0.028), (4, 0.026), (5, 0.04), (6, -0.136), (7, 0.067), (8, 0.128), (9, 0.046), (10, -0.002), (11, 0.112), (12, 0.045), (13, -0.106), (14, -0.043), (15, 0.088), (16, -0.047), (17, 0.058), (18, 0.055), (19, 0.043), (20, 0.021), (21, 0.005), (22, 0.026), (23, -0.009), (24, -0.007), (25, -0.004), (26, -0.042), (27, -0.016), (28, -0.005), (29, -0.015), (30, 0.056), (31, 0.057), (32, 0.03), (33, 0.03), (34, 0.001), (35, 0.087), (36, -0.105), (37, 0.036), (38, 0.024), (39, -0.042), (40, -0.046), (41, -0.055), (42, -0.049), (43, 0.143), (44, -0.061), (45, -0.007), (46, 0.068), (47, -0.058), (48, -0.035), (49, -0.134)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.93782598 333 nips-2013-Trading Computation for Communication: Distributed Stochastic Dual Coordinate Ascent


2 0.8275646 19 nips-2013-Accelerated Mini-Batch Stochastic Dual Coordinate Ascent


3 0.76841342 20 nips-2013-Accelerating Stochastic Gradient Descent using Predictive Variance Reduction

Author: Rie Johnson, Tong Zhang

Abstract: Stochastic gradient descent is popular for large scale optimization but has slow convergence asymptotically due to the inherent variance. To remedy this problem, we introduce an explicit variance reduction method for stochastic gradient descent which we call stochastic variance reduced gradient (SVRG). For smooth and strongly convex functions, we prove that this method enjoys the same fast convergence rate as those of stochastic dual coordinate ascent (SDCA) and Stochastic Average Gradient (SAG). However, our analysis is significantly simpler and more intuitive. Moreover, unlike SDCA or SAG, our method does not require the storage of gradients, and thus is more easily applicable to complex problems such as some structured prediction problems and neural network learning. 1

4 0.67262441 193 nips-2013-Mixed Optimization for Smooth Functions


5 0.65369445 175 nips-2013-Linear Convergence with Condition Number Independent Access of Full Gradients

Author: Lijun Zhang, Mehrdad Mahdavi, Rong Jin

Abstract: For smooth and strongly convex optimization, the optimal iteration complexity of gradient-based algorithms is O(√κ log(1/ε)), where κ is the condition number. When the optimization problem is ill-conditioned, we need to evaluate a large number of full gradients, which can be computationally expensive. In this paper, we propose to remove the dependence on the condition number by allowing the algorithm to access stochastic gradients of the objective function. To this end, we present a novel algorithm named Epoch Mixed Gradient Descent (EMGD) that is able to utilize both kinds of gradients. A distinctive step in EMGD is the mixed gradient descent, where we use a combination of the full and stochastic gradients to update the intermediate solution. Theoretical analysis shows that EMGD finds an ε-optimal solution by computing O(log(1/ε)) full gradients and O(κ² log(1/ε)) stochastic gradients. 1

6 0.63086951 111 nips-2013-Estimation, Optimization, and Parallelism when Data is Sparse

7 0.62702882 311 nips-2013-Stochastic Convex Optimization with Multiple Objectives

8 0.51974326 313 nips-2013-Stochastic Majorization-Minimization Algorithms for Large-Scale Optimization

9 0.49625888 45 nips-2013-BIG & QUIC: Sparse Inverse Covariance Estimation for a Million Variables

10 0.46741772 214 nips-2013-On Algorithms for Sparse Multi-factor NMF

11 0.45477754 33 nips-2013-An Approximate, Efficient LP Solver for LP Rounding

12 0.45281023 238 nips-2013-Optimistic Concurrency Control for Distributed Unsupervised Learning

13 0.43203369 245 nips-2013-Pass-efficient unsupervised feature selection

14 0.42848384 268 nips-2013-Reflection methods for user-friendly submodular optimization

15 0.4274686 94 nips-2013-Distributed $k$-means and $k$-median Clustering on General Topologies

16 0.42597464 198 nips-2013-More Effective Distributed ML via a Stale Synchronous Parallel Parameter Server

17 0.42290774 321 nips-2013-Supervised Sparse Analysis and Synthesis Operators

18 0.42234528 318 nips-2013-Structured Learning via Logistic Regression

19 0.41292509 146 nips-2013-Large Scale Distributed Sparse Precision Estimation

20 0.40647143 271 nips-2013-Regularized M-estimators with nonconvexity: Statistical and algorithmic theory for local optima


similar papers computed by LDA model

LDA for this paper:

topicId topicWeight

[(1, 0.172), (2, 0.015), (16, 0.035), (30, 0.012), (33, 0.144), (34, 0.084), (36, 0.016), (41, 0.057), (49, 0.028), (56, 0.092), (70, 0.039), (85, 0.046), (89, 0.029), (93, 0.087), (95, 0.053)]

similar papers list:

simIndex simValue paperId paperTitle

1 0.85685337 328 nips-2013-The Total Variation on Hypergraphs - Learning on Hypergraphs Revisited

Author: Matthias Hein, Simon Setzer, Leonardo Jost, Syama Sundar Rangapuram

Abstract: Hypergraphs allow one to encode higher-order relationships in data and are thus a very flexible modeling tool. Current learning methods are either based on approximations of the hypergraphs via graphs or on tensor methods which are only applicable under special conditions. In this paper, we present a new learning framework on hypergraphs which fully uses the hypergraph structure. The key element is a family of regularization functionals based on the total variation on hypergraphs. 1

2 0.85068655 112 nips-2013-Estimation Bias in Multi-Armed Bandit Algorithms for Search Advertising

Author: Min Xu, Tao Qin, Tie-Yan Liu

Abstract: In search advertising, the search engine needs to select the most profitable advertisements to display, which can be formulated as an instance of online learning with partial feedback, also known as the stochastic multi-armed bandit (MAB) problem. In this paper, we show that the naive application of MAB algorithms to search advertising for advertisement selection will produce sample selection bias that harms the search engine by decreasing expected revenue and “estimation of the largest mean” (ELM) bias that harms the advertisers by increasing game-theoretic player-regret. We then propose simple bias-correction methods with benefits to both the search engine and the advertisers. 1
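The "estimation of the largest mean" (ELM) bias described above is easy to reproduce with a small Monte-Carlo experiment. This is a hypothetical illustration with Gaussian arms, not the paper's advertising model or its correction method:

```python
import numpy as np

def elm_bias(mus, n_pulls=10, n_trials=20000, seed=0):
    """Monte-Carlo estimate of the ELM bias: the empirical mean of the
    empirically-best arm systematically overestimates the true largest mean.

    Each arm's sample mean over n_pulls unit-variance Gaussian pulls is
    drawn directly as N(mu, 1/n_pulls); we then take the per-trial maximum.
    """
    rng = np.random.default_rng(seed)
    mus = np.asarray(mus, dtype=float)
    # sample means for every arm in every trial
    means = rng.normal(mus, 1.0 / np.sqrt(n_pulls), size=(n_trials, len(mus)))
    # positive result => upward bias of max-of-estimates vs. true max
    return means.max(axis=1).mean() - mus.max()
```

With three identical arms the maximum of the noisy estimates sits well above the true mean, which is exactly the effect the abstract says harms advertisers in game-theoretic terms.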

3 0.83339393 324 nips-2013-The Fast Convergence of Incremental PCA

Author: Akshay Balsubramani, Sanjoy Dasgupta, Yoav Freund

Abstract: We consider a situation in which we see samples Xn ∈ Rd drawn i.i.d. from some distribution with mean zero and unknown covariance A. We wish to compute the top eigenvector of A in an incremental fashion - with an algorithm that maintains an estimate of the top eigenvector in O(d) space, and incrementally adjusts the estimate with each new data point that arrives. Two classical such schemes are due to Krasulina (1969) and Oja (1983). We give finite-sample convergence rates for both. 1
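Oja's scheme mentioned in the abstract is simple enough to state in code. A hedged O(d)-memory sketch follows (the 1/t step-size schedule and normalization at every step are standard assumptions, not details taken from this paper's analysis):

```python
import numpy as np

def oja_top_eigvec(samples, lr0=1.0):
    """Oja's (1983) incremental estimate of the top eigenvector of the
    covariance of zero-mean samples, using only O(d) memory.

    Update: v <- normalize(v + eta_t * x (x . v)) with eta_t = lr0 / t.
    """
    it = iter(samples)
    v = np.asarray(next(it), dtype=float)  # initialize from the first sample
    v /= np.linalg.norm(v)
    for t, x in enumerate(it, start=2):
        v += (lr0 / t) * x * (x @ v)       # rank-one stochastic update
        v /= np.linalg.norm(v)             # project back to the unit sphere
    return v
```

Each arriving point only contributes a rank-one update, which is why the estimate can be maintained in O(d) space as the abstract describes.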

same-paper 4 0.82664061 333 nips-2013-Trading Computation for Communication: Distributed Stochastic Dual Coordinate Ascent

Author: Tianbao Yang

Abstract: We present and study a distributed optimization algorithm that employs a stochastic dual coordinate ascent method. Stochastic dual coordinate ascent methods enjoy strong theoretical guarantees and often perform better than stochastic gradient descent methods in optimizing regularized loss minimization problems, yet efforts to study them in a distributed framework have been lacking. We make progress along this line by presenting a distributed stochastic dual coordinate ascent algorithm in a star network, with an analysis of the tradeoff between computation and communication. We verify our analysis by experiments on real data sets. Moreover, we compare the proposed algorithm with distributed stochastic gradient descent methods and distributed alternating direction methods of multipliers for optimizing SVMs in the same distributed framework, and observe competitive performance. 1
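For context on the building block that the distributed algorithm parallelizes, here is a minimal single-machine SDCA sketch for the hinge-loss SVM, using the standard closed-form coordinate update. It is an illustration under assumed data and parameters, not the paper's distributed implementation:

```python
import numpy as np

def sdca_svm(X, y, lam=0.01, n_epochs=30, seed=0):
    """Single-machine stochastic dual coordinate ascent for the SVM

        min_w (lam/2)||w||^2 + (1/n) sum_i max(0, 1 - y_i w.x_i),

    maintaining dual variables alpha with y_i * alpha_i in [0, 1] and the
    primal-dual link w = (1/(lam*n)) sum_i alpha_i x_i.
    """
    rng = np.random.default_rng(seed)
    n, d = X.shape
    alpha = np.zeros(n)
    w = np.zeros(d)
    for _ in range(n_epochs):
        for i in rng.permutation(n):
            # closed-form maximization of the dual objective in alpha_i alone
            grad = 1.0 - y[i] * (X[i] @ w)
            new = y[i] * np.clip(
                y[i] * alpha[i] + lam * n * grad / (X[i] @ X[i]), 0.0, 1.0
            )
            w += (new - alpha[i]) * X[i] / (lam * n)  # keep w consistent with alpha
            alpha[i] = new
    return w
```

Each coordinate step touches one example and updates w in O(d) time; the distributed variant in the paper trades extra local coordinate updates of this kind against rounds of communication over the star network.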

5 0.78134179 235 nips-2013-Online learning in episodic Markovian decision processes by relative entropy policy search

Author: Alexander Zimin, Gergely Neu

Abstract: We study the problem of online learning in finite episodic Markov decision processes (MDPs) where the loss function is allowed to change between episodes. The natural performance measure in this learning problem is the regret, defined as the difference between the total loss suffered by the learner and the total loss of the best stationary policy. We assume that the learner is given access to a finite action space A and that the state space X has a layered structure with L layers, so that state transitions are only possible between consecutive layers. We describe a variant of the recently proposed Relative Entropy Policy Search algorithm and show that its regret after T episodes is 2√(L|X||A|T log(|X||A|/L)) in the bandit setting and 2L√(T log(|X||A|/L)) in the full information setting, given that the learner has perfect knowledge of the transition probabilities of the underlying MDP. These guarantees largely improve previously known results under much milder assumptions and cannot be significantly improved under general assumptions. 1

6 0.76719743 99 nips-2013-Dropout Training as Adaptive Regularization

7 0.76241386 45 nips-2013-BIG & QUIC: Sparse Inverse Covariance Estimation for a Million Variables

8 0.75929499 215 nips-2013-On Decomposing the Proximal Map

9 0.75656235 201 nips-2013-Multi-Task Bayesian Optimization

10 0.75360882 19 nips-2013-Accelerated Mini-Batch Stochastic Dual Coordinate Ascent

11 0.75177324 22 nips-2013-Action is in the Eye of the Beholder: Eye-gaze Driven Model for Spatio-Temporal Action Localization

12 0.75123775 251 nips-2013-Predicting Parameters in Deep Learning

13 0.7507574 69 nips-2013-Context-sensitive active sensing in humans

14 0.75017428 149 nips-2013-Latent Structured Active Learning

15 0.74959952 82 nips-2013-Decision Jungles: Compact and Rich Models for Classification

16 0.74949503 5 nips-2013-A Deep Architecture for Matching Short Texts

17 0.74937004 285 nips-2013-Robust Transfer Principal Component Analysis with Rank Constraints

18 0.74729502 153 nips-2013-Learning Feature Selection Dependencies in Multi-task Learning

19 0.747042 30 nips-2013-Adaptive dropout for training deep neural networks

20 0.74686539 318 nips-2013-Structured Learning via Logistic Regression