nips2013-175 knowledge-graph by maker-knowledge-mining

175 nips-2013-Linear Convergence with Condition Number Independent Access of Full Gradients


Source: pdf

Author: Lijun Zhang, Mehrdad Mahdavi, Rong Jin

Abstract: For smooth and strongly convex optimizations, the optimal iteration complexity of the gradient-based algorithm is O(√κ log 1/ǫ), where κ is the condition number. In the case that the optimization problem is ill-conditioned, we need to evaluate a large number of full gradients, which could be computationally expensive. In this paper, we propose to remove the dependence on the condition number by allowing the algorithm to access stochastic gradients of the objective function. To this end, we present a novel algorithm named Epoch Mixed Gradient Descent (EMGD) that is able to utilize two kinds of gradients. A distinctive step in EMGD is the mixed gradient descent, where we use a combination of the full and stochastic gradients to update the intermediate solution. Theoretical analysis shows that EMGD is able to find an ǫ-optimal solution by computing O(log 1/ǫ) full gradients and O(κ² log 1/ǫ) stochastic gradients.

Reference: text


Summary: the most important sentences generated by the tfidf model

sentIndex sentText sentNum sentScore

1 Abstract: For smooth and strongly convex optimizations, the optimal iteration complexity of the gradient-based algorithm is O(√κ log 1/ǫ), where κ is the condition number. [sent-2, score-0.441]

2 In the case that the optimization problem is ill-conditioned, we need to evaluate a large number of full gradients, which could be computationally expensive. [sent-3, score-0.121]

3 In this paper, we propose to remove the dependence on the condition number by allowing the algorithm to access stochastic gradients of the objective function. [sent-4, score-0.418]

4 A distinctive step in EMGD is the mixed gradient descent, where we use a combination of the full and stochastic gradients to update the intermediate solution. [sent-6, score-0.61]

5 Theoretical analysis shows that EMGD is able to find an ǫ-optimal solution by computing O(log 1/ǫ) full gradients and O(κ² log 1/ǫ) stochastic gradients. [sent-7, score-0.396]

6 The problem of convex optimization is typically given as min_{w∈W} F(w), where W is a convex domain and F(·) is a convex function. [sent-9, score-0.357]

7 In most cases, the optimization algorithm for solving the above problem is an iterative process, and the convergence rate is characterized by the iteration complexity, i.e., the number of iterations needed to reach an ǫ-optimal solution. [sent-10, score-0.194]

8 In this study, we focus on first-order methods, where we only have access to the (stochastic) gradient of the objective function. [sent-13, score-0.27]

9 For most convex optimization problems, the iteration complexity of an optimization algorithm depends on the following two factors. [sent-14, score-0.329]

10 For example, do we have access to the full gradient or the stochastic gradient of F (·)? [sent-20, score-0.572]

11 The optimal iteration complexities for some popular combinations of the above two factors are summarized in Table 1 and elaborated in the related work section. [sent-21, score-0.111]

12 We observe that when the objective function is smooth (and strongly convex), the convergence rate for full gradients is much faster than that for stochastic gradients. [sent-22, score-0.608]

13 On the other hand, the evaluation of a stochastic gradient is usually significantly more efficient than that of a full gradient. [sent-23, score-0.357]

14 Thus, replacing full gradients with stochastic gradients essentially trades a larger number of iterations for a lower computational cost per iteration. [sent-24, score-0.516]

15 Table 1: The optimal iteration complexity of convex optimization. [sent-25, score-0.218]

16 For the optimization problems that are ill-conditioned, the condition number κ can be very large, leading to many evaluations of full gradients, an operation that is computationally expensive for large data sets. [sent-29, score-0.186]

17 To reduce the computational cost, we are interested in the possibility of making the number of full gradients required independent of κ. [sent-30, score-0.233]

18 Although the O(√κ log 1/ǫ) rate is in general not improvable for any first-order method, we bypass this difficulty by allowing the algorithm to have access to both full and stochastic gradients. [sent-31, score-0.324]

19 Our objective is to reduce the iteration complexity from O(√κ log 1/ǫ) to O(log 1/ǫ) by replacing most of the evaluations of full gradients with evaluations of stochastic gradients. [sent-32, score-0.608]

20 Under the assumption that stochastic gradients can be computed efficiently, this tradeoff could lead to a significant improvement in computational efficiency. [sent-33, score-0.281]

21 To this end, we develop a novel optimization algorithm named Epoch Mixed Gradient Descent (EMGD). [sent-34, score-0.091]

22 It divides the optimization process into a sequence of epochs, an idea borrowed from epoch gradient descent [9]. [sent-35, score-0.554]

23 At each epoch, the proposed algorithm performs mixed gradient descent by evaluating one full gradient and O(κ²) stochastic gradients. [sent-36, score-0.72]

24 It achieves a constant reduction in the optimization error for every epoch, leading to a linear convergence rate. [sent-37, score-0.106]

25 Our analysis shows that EMGD is able to find an ǫ-optimal solution by computing O(log 1/ǫ) full gradients and O(κ² log 1/ǫ) stochastic gradients. [sent-38, score-0.396]

26 In other words, with the help of stochastic gradients, the number of full gradients required is reduced from O(√κ log 1/ǫ) to O(log 1/ǫ), independent of the condition number. [sent-39, score-0.418]
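
The bookkeeping behind these counts can be made explicit; the following is a sketch of the arithmetic, assuming (as the text states) that every epoch reduces the optimization error by a constant factor, say 1/2, at the cost of one full gradient and T = O(κ²) stochastic gradients:

    \epsilon_k \le \tfrac{1}{2}\,\epsilon_{k-1}
    \;\Longrightarrow\;
    \epsilon_m \le 2^{-m}\epsilon_0,
    \qquad
    m = \big\lceil \log_2(\epsilon_0/\epsilon) \big\rceil = O\!\big(\log\tfrac{1}{\epsilon}\big).

Running m epochs therefore costs m = O(log 1/ǫ) full gradients and mT = O(κ² log 1/ǫ) stochastic gradients in total.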

27 2 Related Work. During the last three decades, there have been significant advances in convex optimization [3, 15, 17]. [sent-40, score-0.161]

28 We first discuss deterministic optimization, where the gradient of the objective function is available. [sent-42, score-0.246]

29 For the general convex and Lipschitz continuous optimization problem, the iteration complexity of gradient (subgradient) descent is O(1/ǫ²), which is optimal up to constant factors [15]. [sent-43, score-0.546]

30 When the objective function is convex and smooth, the optimal optimization scheme is the accelerated gradient descent developed by Nesterov, whose iteration complexity is O(√(L/ǫ)) [16, 18]. [sent-44, score-0.619]

31 With slight modifications, the accelerated gradient descent algorithm can also be applied to optimize a smooth and strongly convex objective function, whose iteration complexity is O(√κ log 1/ǫ) and is in general not improvable [17, 19]. [sent-45, score-0.756]

32 The objective of our work is to reduce the number of accesses to full gradients by exploiting the availability of stochastic gradients. [sent-46, score-0.495]

33 In stochastic optimization, we have access to the stochastic gradient, which is an unbiased estimate of the full gradient [14]. [sent-47, score-0.525]

34 Similar to the case in deterministic optimization, if the objective function is convex and Lipschitz continuous, stochastic gradient (subgradient) descent is the optimal algorithm and the iteration complexity is also O(1/ǫ²) [14, 15]. [sent-48, score-0.682]

35 When the objective function is λ-strongly convex, the algorithms proposed in very recent works [9, 10, 21, 26] achieve the optimal O(1/(λǫ)) iteration complexity [1]. [sent-49, score-0.175]

36 Since the convergence rate of stochastic optimization is dominated by the randomness in the gradient [6, 11], smoothness usually does not lead to a faster convergence rate. [sent-50, score-0.657]

37 A variant of stochastic optimization is the “semi-stochastic” approximation, which interleaves stochastic gradient descent and full gradient descent [12]. [sent-51, score-0.92]

38 In the strongly convex case, if the stochastic gradients are taken at a decreasing rate, the convergence rate can be improved to approach O(1/(λ√ǫ)) [13]. [sent-52, score-0.524]

39 From the above discussion, we observe that the iteration complexity in stochastic optimization is polynomial in 1/ǫ, making it difficult to find high-precision solutions. [sent-53, score-0.294]

40 However, when the objective function is strongly convex and can be written as a sum of a finite number of functions, i.e., [sent-54, score-0.228]

41 F(w) = (1/n) Σ_{i=1}^n f_i(w), (1) where each f_i(·) is smooth, the iteration complexity of some specific algorithms may exhibit a logarithmic dependence on 1/ǫ, i.e., a linear convergence rate. [sent-56, score-0.245]

42 The two very recent works are the stochastic average gradient (SAG) [22], whose iteration complexity is O(n log 1/ǫ), provided n ≥ 8κ, and the stochastic dual coordinate ascent (SDCA) [23], whose iteration complexity is O((n + κ) log 1/ǫ). [sent-59, score-0.728]

43 Under appropriate conditions, the incremental gradient method [2] and the hybrid method [5] can also minimize the function in (1) with a linear convergence rate. [sent-60, score-0.25]

44 But those algorithms usually treat one pass over all the f_i’s (or a subset of the f_i’s) as one iteration, and thus have high computational cost per iteration. [sent-61, score-0.14]

45 The first one is a gradient oracle Og , which for a given input point w ∈ W returns the gradient ∇F (w), that is, Og (w) = ∇F (w). [sent-65, score-0.444]

46 The second one is a function oracle Of, each call of which returns a random function f(·), such that F(w) = E_f[f(w)], ∀w ∈ W, and f(·) is L-smooth, that is, ‖∇f(w) − ∇f(w′)‖ ≤ L‖w − w′‖, ∀w, w′ ∈ W. (2) [sent-67, score-0.115]

47 Although we do not define a stochastic gradient oracle directly, the function oracle Of allows us to evaluate the stochastic gradient of F(·) at any point w ∈ W. [sent-68, score-0.794]

48 Notice that the assumption about the function oracle Of implies that the objective function F (·) is also L-smooth. [sent-69, score-0.153]
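
As a concrete illustration of the two oracles, here is a minimal Python sketch for a hypothetical finite-sum least-squares objective; the instance, the function names, and the uniform sampling scheme are illustrative assumptions, not details from the paper.

    import numpy as np

    # Hypothetical instance: F(w) = (1/n) * sum_i 0.5 * (x_i . w - y_i)^2.
    rng = np.random.default_rng(0)
    n, d = 1000, 20
    X = rng.standard_normal((n, d))
    y = X @ rng.standard_normal(d) + 0.1 * rng.standard_normal(n)

    def gradient_oracle(w):
        """O_g: returns the full gradient of F at w."""
        return X.T @ (X @ w - y) / n

    def function_oracle():
        """O_f: returns a random smooth function f with F(w) = E[f(w)],
        represented here by its gradient map, so E[grad f(w)] = grad F(w)."""
        i = rng.integers(n)
        return lambda w: X[i] * (X[i] @ w - y[i])

Each sampled component is L-smooth with L = max_i ‖x_i‖², so condition (2) holds for that L.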

49 Similar to the epoch gradient descent (EGD) [9], we divide the optimization process into a sequence of epochs (steps 3 to 10). [sent-80, score-0.56]

50 While the number of accesses to the gradient oracle in EGD increases exponentially over the epochs, the number of accesses to the two oracles in EMGD is fixed. [sent-81, score-0.433]

51 In order to apply SDCA, we need to assume each function f_i is λ-strongly convex, so that we can rewrite f_i(w) as g_i(w) + (λ/2)‖w‖², where g_i(w) = f_i(w) − (λ/2)‖w‖² is convex. [sent-82, score-0.244]

52 Algorithm 1 Epoch Mixed Gradient Descent (EMGD). Input: step size η, the initial domain size ∆_1, the number of iterations T per epoch, and the number of epochs m. 1: Initialize w̄_1 = 0. 2: for k = 1, . . . , m do [sent-83, score-0.122]

53 3: Set w_1^k = w̄_k. 4: Call the gradient oracle Og to obtain ∇F(w̄_k). 5: for t = 1, . . . , T do [sent-86, score-0.419]

54 At each iteration t of epoch k, we call the function oracle Of to obtain a random function f_t^k(·) and define the mixed gradient at the current solution w_t^k as g̃_t^k = ∇F(w̄_k) + ∇f_t^k(w_t^k) − ∇f_t^k(w̄_k), which involves both the full gradient and the stochastic gradient. [sent-90, score-2.046]

55 The mixed gradient can be divided into two parts: the deterministic part ∇F(w̄_k) and the stochastic part ∇f_t^k(w_t^k) − ∇f_t^k(w̄_k). [sent-91, score-0.415]

56 Due to the smoothness property of f_t^k(·) and the shrinkage of the domain size, the norm of the stochastic part is well bounded, which is the reason why our algorithm can achieve linear convergence. [sent-92, score-0.244]

57 Based on the mixed gradient, we update w_t^k by a gradient mapping over a shrinking domain (i.e., over W ∩ B(w̄_k, ∆_k), as in (10)). [sent-93, score-0.965]

58 Since the updating is similar to standard gradient descent except for the domain constraint, we refer to it as mixed gradient descent for short. [sent-96, score-0.658]

59 At the end of epoch k, we compute the average of the T + 1 solutions, instead of T solutions, and shrink the domain size by a factor of √2. [sent-97, score-0.303]
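
Assembling the steps just described, the following Python sketch runs EMGD for the unconstrained case W = R^d, reusing the hypothetical oracles (and the numpy import) from the earlier snippet; with W = R^d the update in step 8 reduces to a gradient step followed by projection onto the ball B(w̄_k, ∆_k), and all parameter values below are illustrative.

    def project_ball(w, center, radius):
        """Euclidean projection onto the ball B(center, radius)."""
        diff = w - center
        norm = np.linalg.norm(diff)
        return w if norm <= radius else center + diff * (radius / norm)

    def emgd(eta, delta1, T, m, d):
        """Epoch Mixed Gradient Descent, sketched for W = R^d."""
        w_bar, delta = np.zeros(d), delta1
        for _ in range(m):                        # one full gradient per epoch
            full_grad = gradient_oracle(w_bar)
            w, iterates = w_bar.copy(), [w_bar.copy()]
            for _ in range(T):                    # T stochastic gradients per epoch
                grad_f = function_oracle()
                mixed = full_grad + grad_f(w) - grad_f(w_bar)    # mixed gradient
                w = project_ball(w - eta * mixed, w_bar, delta)  # step 8
                iterates.append(w)
            w_bar = np.mean(iterates, axis=0)     # average of the T + 1 solutions
            delta /= np.sqrt(2)                   # shrink the domain size
        return w_bar

For example, emgd(eta=0.01, delta1=10.0, T=2000, m=10, d=20) runs ten epochs on the least-squares instance; the analysis suggests choosing T on the order of κ².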

60 Let w̄_{m+1} be the solution returned by Algorithm 1 after m epochs, which makes m accesses to oracle Og and mT accesses to oracle Of. [sent-103, score-0.446]

61 Theorem 1 immediately implies that EMGD is able to achieve an ǫ optimization error by computing O(log 1/ǫ) full gradients and O(κ² log 1/ǫ) stochastic gradients. [sent-105, score-0.441]

62 Table 2: The computational complexity for minimizing F(w) = (1/n) Σ_{i=1}^n f_i(w). Nesterov’s algorithm [17]: O(√κ n log 1/ǫ); EMGD: O((n + κ²) log 1/ǫ); SAG (n ≥ 8κ) [22]: O(n log 1/ǫ); SDCA [23]: O((n + κ) log 1/ǫ). [sent-106, score-0.083]

63 3.4 Comparisons. Compared to the optimization algorithms that only rely on full gradients [17], the number of full gradients needed in EMGD is O(log 1/ǫ) instead of O(√κ log 1/ǫ). [sent-107, score-0.73]

64 Compared to the optimization algorithms that only rely on stochastic gradients [9, 10, 21], EMGD is more efficient since it achieves a linear convergence rate. [sent-108, score-0.387]

65 The proposed EMGD algorithm can also be applied to the special optimization problem considered in [22, 23], where F(w) = (1/n) Σ_{i=1}^n f_i(w). [sent-109, score-0.133]

66 To make quantitative comparisons, let’s assume the full gradient is n times more expensive to compute than the stochastic gradient. [sent-110, score-0.357]

67 As can be seen, the computational complexity of EMGD is lower than that of Nesterov’s algorithm [17] as long as the condition number κ ≤ n^{2/3}, the complexity of SAG [22] is lower than Nesterov’s algorithm if κ ≤ n/8, and the complexity of SDCA [23] is lower than Nesterov’s algorithm if κ ≤ n². [sent-112, score-0.172]
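
These crossover points can be checked numerically; the sketch below uses the Table 2 expressions with constants and the shared log(1/ǫ) factor dropped, an illustrative simplification:

    import math

    def costs(n, kappa):
        """Leading-order costs from Table 2 (constants dropped; illustrative)."""
        return {
            "Nesterov": math.sqrt(kappa) * n,
            "EMGD": n + kappa ** 2,
            "SAG (valid when n >= 8*kappa)": n,
            "SDCA": n + kappa,
        }

    n = 10 ** 6
    for kappa in (n ** (2 / 3), n / 8, n ** 2):  # the three thresholds above
        print(f"kappa = {kappa:.3g}:",
              {name: f"{c:.2e}" for name, c in costs(n, kappa).items()})

At κ = n^{2/3} the EMGD and Nesterov costs coincide at leading order (both n^{4/3}), matching the first threshold in the text.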

68 Unlike SAG and SDCA, which only work for unconstrained optimization problems, the proposed algorithm works for both constrained and unconstrained optimization problems, provided that the constrained problem in Step 8 can be solved efficiently. [sent-116, score-0.176]

69 The only step in Algorithm 1 that depends on n is step 4, which computes the gradient ∇F(w̄_k). [sent-120, score-0.173]

70 The linear convergence of SAG and SDCA only holds in expectation, whereas the linear convergence of EMGD holds with high probability, which is much stronger. [sent-124, score-0.086]

71 4 The Analysis. In the proof, we frequently use the following property of strongly convex functions [9]. [sent-125, score-0.173]

72 Lemma 1. Let f(x) be a λ-strongly convex function over the domain X, and x* = argmin_{x∈X} f(x); then f(x) ≥ f(x*) + (λ/2)‖x − x*‖² for any x ∈ X. [sent-127, score-0.128]

73 Define ḡ = ∇F(w̄), F̂(w) = F(w) − ⟨w, ḡ⟩, and ĝ_t(w) = f_t(w) − ⟨w, ∇f_t(w̄)⟩. (9) [sent-143, score-0.248]

74 And the mixed gradient can be rewritten as g̃_t = ḡ + ∇ĝ_t(w_t). [sent-145, score-0.289]

75 Then, the updating rule given in Algorithm 1 becomes w_{t+1} = argmin_{w ∈ W ∩ B(w̄, ∆)} η⟨w − w_t, ḡ + ∇ĝ_t(w_t)⟩ + (1/2)‖w − w_t‖². (10) [sent-146, score-1.343]

76 Using the fact that w* ∈ W ∩ B(w̄, ∆) and Lemma 1 (with x* = w_{t+1} and x = w*), we have η⟨w_{t+1} − w_t, ḡ + ∇ĝ_t(w_t)⟩ + (1/2)‖w_{t+1} − w_t‖² ≤ η⟨w* − w_t, ḡ + ∇ĝ_t(w_t)⟩ + (1/2)‖w* − w_t‖² − (1/2)‖w* − w_{t+1}‖². (12) [sent-148, score-2.656]
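
A brief reconstruction of where (12) comes from, under the completed Lemma 1 above: the update objective in (10) is 1-strongly convex in w with minimizer w_{t+1}, so Lemma 1 with λ = 1, x* = w_{t+1}, and x = w* gives

    \phi(w) \;=\; \eta\,\langle w - w_t,\ \bar g + \nabla \hat g_t(w_t)\rangle \;+\; \tfrac{1}{2}\,\|w - w_t\|^2,
    \qquad
    \phi(w^*) \;\ge\; \phi(w_{t+1}) \;+\; \tfrac{1}{2}\,\|w^* - w_{t+1}\|^2,

which, expanded, is exactly (12).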

77 Combining (13) and (14), we have F(w_t) − F(w*) ≤ (‖w_t − w*‖² − ‖w_{t+1} − w*‖²)/(2η) − (λ/2)‖w_t − w*‖² + ⟨ḡ, w_t − w_{t+1}⟩ + (η/2)‖∇ĝ_t(w_t)‖² + ⟨∇F̂(w_t) − ∇ĝ_t(w_t), w_t − w*⟩. [sent-150, score-2.656]

78 By adding the inequalities over all iterations, we have Σ_{t=1}^T (F(w_t) − F(w*)) ≤ ‖w̄ − w*‖²/(2η) − ‖w_{T+1} − w*‖²/(2η) − (λ/2) Σ_{t=1}^T ‖w_t − w*‖² + ⟨ḡ, w̄ − w_{T+1}⟩ + (η/2) Σ_{t=1}^T ‖∇ĝ_t(w_t)‖² + Σ_{t=1}^T ⟨∇F̂(w_t) − ∇ĝ_t(w_t), w_t − w*⟩. (15) [sent-151, score-2.015]

79 The upper bound of A_T is given by A_T = Σ_{t=1}^T ‖∇ĝ_t(w_t)‖² = Σ_{t=1}^T ‖∇f_t(w_t) − ∇f_t(w̄)‖² ≤ L² Σ_{t=1}^T ‖w_t − w̄‖² ≤ T L²∆², where the first inequality uses (2). [sent-155, score-0.664]

80 Recall the definitions of F̂(·) and ĝ_t(·) in (9). [sent-173, score-0.189]

81 Based on our assumption about the function oracle Of, it is straightforward to check that V_1, . . . , V_T is a martingale difference sequence. [sent-174, score-0.098]

82 The value of V_t can be bounded by |V_t| ≤ ‖∇F̂(w_t) − ∇ĝ_t(w_t)‖ ‖w_t − w*‖ ≤ 2∆ (‖∇F(w_t) − ∇F(w̄)‖ + ‖∇f_t(w_t) − ∇f_t(w̄)‖) ≤ 4L∆ ‖w_t − w̄‖ ≤ 4L∆². [sent-181, score-1.328]

83 By choosing η = 1/(L√T), we have Σ_{t=1}^{T+1} (F(w_t) − F(w*)) ≤ L∆² (√(T+1) + 4√(2T ln(1/δ))) ≤ 6L∆² √(2T ln(1/δ)), (20) where in the second inequality we use the condition δ ≤ e^{−1/2} in (5). [sent-184, score-0.151]

84 By Jensen’s inequality, we have F(w̄_{k+1}) − F(w*) ≤ (1/(T+1)) Σ_{t=1}^{T+1} (F(w_t) − F(w*)) ≤ 6L∆² √(2 ln(1/δ)) / √(T+1) by (20), and therefore ‖w̄_{k+1} − w*‖² ≤ (12L∆²/λ) √(2 ln(1/δ)) / √(T+1) by (6). [sent-185, score-0.092]
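
A hedged reading of how this closes the per-epoch argument (the constants follow from the reconstructed bound above and are assumptions to that extent): with κ = L/λ and the shrinkage ∆_{k+1}² = ∆_k²/2 from Algorithm 1, guaranteeing ‖w̄_{k+1} − w*‖² ≤ ∆_k²/2 requires

    \frac{12L}{\lambda}\cdot\frac{\sqrt{2\ln(1/\delta)}}{\sqrt{T+1}} \;\le\; \frac{1}{2}
    \quad\Longleftrightarrow\quad
    T + 1 \;\ge\; 1152\,\kappa^{2}\ln\frac{1}{\delta},

i.e., T = O(κ² ln(1/δ)) stochastic gradient steps per epoch, which is the source of the O(κ²) factor in Theorem 1.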

85 Conclusion and Future Work. In this paper, we consider how to reduce the number of full gradients needed for smooth and strongly convex optimization problems. [sent-187, score-0.553]

86 Under the assumption that both the full gradient and the stochastic gradient are available, a novel algorithm named Epoch Mixed Gradient Descent (EMGD) is proposed. [sent-188, score-0.5]

87 Theoretical analysis shows that with the help of stochastic gradients, we are able to reduce the number of full gradients needed from O(√κ log 1/ǫ) to O(log 1/ǫ). [sent-189, score-0.355]

88 When the objective function is a sum of n smooth functions as in (1), EMGD has lower computational cost than the full gradient method [17] if the condition number κ ≤ n^{2/3}. [sent-192, score-0.039]

89 Furthermore, if there exists a strongly convex regularizer in the objective function, which happens in many machine learning problems [8], knowledge of the regularizer itself allows us to find an upper bound on κ. [sent-199, score-0.256]

90 Information-theoretic lower bounds on the oracle complexity of stochastic convex optimization. [sent-208, score-0.366]

91 A new class of incremental gradient methods for least squares problems. [sent-213, score-0.189]

92 Optimal stochastic approximation algorithms for strongly convex stochastic composite optimization I: a generic algorithmic framework. [sent-233, score-0.515]

93 Beyond the regret minimization barrier: an optimal algorithm for stochastic strongly-convex optimization. [sent-252, score-0.141]

94 On solutions of stochastic programming problems by descent procedures with stochastic and deterministic directions. [sent-265, score-0.393]

95 Rates of convergence of semi-stochastic approximation procedures for solving stochastic optimization problems. [sent-270, score-0.246]

96 A method for unconstrained convex minimization problem with the rate of convergence O(1/k²). [sent-287, score-0.193]

97 Introductory lectures on convex optimization: a basic course, volume 87 of Applied optimization. [sent-294, score-0.098]

98 Making gradient descent optimal for strongly convex stochastic optimization. [sent-315, score-0.579]

99 A stochastic gradient method with an exponential convergence rate for finite training sets. [sent-322, score-0.369]

100 O(log T ) projections for stochastic optimization of smooth and strongly convex functions. [sent-347, score-0.431]


similar papers computed by tfidf model

tfidf for this paper:

wordName wordTfidf (topN-words)

[('wt', 0.664), ('emgd', 0.416), ('epoch', 0.212), ('gt', 0.189), ('gradient', 0.173), ('ftk', 0.159), ('gradients', 0.155), ('wk', 0.148), ('sdca', 0.13), ('stochastic', 0.126), ('sag', 0.121), ('og', 0.099), ('oracle', 0.098), ('mixed', 0.098), ('convex', 0.098), ('descent', 0.092), ('accesses', 0.081), ('strongly', 0.075), ('epoches', 0.07), ('fi', 0.07), ('smooth', 0.069), ('optimization', 0.063), ('iteration', 0.061), ('ft', 0.059), ('full', 0.058), ('nesterov', 0.057), ('wm', 0.055), ('objective', 0.055), ('ln', 0.046), ('complexity', 0.044), ('convergence', 0.043), ('access', 0.042), ('condition', 0.04), ('bt', 0.039), ('log', 0.039), ('egd', 0.035), ('improvable', 0.032), ('domain', 0.03), ('smoothness', 0.029), ('named', 0.028), ('martingale', 0.028), ('subgradient', 0.027), ('composite', 0.027), ('rate', 0.027), ('juditsky', 0.026), ('evaluations', 0.025), ('vt', 0.025), ('unconstrained', 0.025), ('lipschitz', 0.024), ('nemirovski', 0.024), ('inequalities', 0.023), ('ci', 0.022), ('iterations', 0.022), ('jensen', 0.021), ('jin', 0.021), ('complexities', 0.021), ('reduce', 0.02), ('epochs', 0.02), ('inequality', 0.019), ('hessian', 0.019), ('deterministic', 0.018), ('solution', 0.018), ('sn', 0.018), ('rewritten', 0.018), ('accelerated', 0.018), ('hybrid', 0.018), ('palomar', 0.017), ('consequentially', 0.017), ('friedlander', 0.017), ('ghadimi', 0.017), ('interleave', 0.017), ('lansing', 0.017), ('lijun', 0.017), ('moduli', 0.017), ('sssr', 0.017), ('siam', 0.017), ('ai', 0.017), ('programming', 0.017), ('gi', 0.017), ('call', 0.017), ('storage', 0.016), ('elicited', 0.016), ('doklady', 0.016), ('ltd', 0.016), ('incremental', 0.016), ('argmin', 0.015), ('optimal', 0.015), ('mehrdad', 0.015), ('ascent', 0.015), ('needed', 0.015), ('convexity', 0.015), ('borrowed', 0.014), ('num', 0.014), ('eldar', 0.014), ('elaborated', 0.014), ('regularizer', 0.014), ('procedures', 0.014), ('eigenvalues', 0.014), ('lemma', 0.014), ('comparisons', 0.014)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 1.0000002 175 nips-2013-Linear Convergence with Condition Number Independent Access of Full Gradients

Author: Lijun Zhang, Mehrdad Mahdavi, Rong Jin

Abstract: For smooth and strongly convex optimizations, the optimal iteration complexity of the gradient-based algorithm is O(√κ log 1/ǫ), where κ is the condition number. In the case that the optimization problem is ill-conditioned, we need to evaluate a large number of full gradients, which could be computationally expensive. In this paper, we propose to remove the dependence on the condition number by allowing the algorithm to access stochastic gradients of the objective function. To this end, we present a novel algorithm named Epoch Mixed Gradient Descent (EMGD) that is able to utilize two kinds of gradients. A distinctive step in EMGD is the mixed gradient descent, where we use a combination of the full and stochastic gradients to update the intermediate solution. Theoretical analysis shows that EMGD is able to find an ǫ-optimal solution by computing O(log 1/ǫ) full gradients and O(κ² log 1/ǫ) stochastic gradients.

2 0.57537115 311 nips-2013-Stochastic Convex Optimization with Multiple Objectives

Author: Mehrdad Mahdavi, Tianbao Yang, Rong Jin

Abstract: In this paper, we are interested in the development of efficient algorithms for convex optimization problems in the simultaneous presence of multiple objectives and stochasticity in the first-order information. We cast the stochastic multiple-objective optimization problem into a constrained optimization problem by choosing one function as the objective and trying to bound the other objectives by appropriate thresholds. We first examine a two-stage exploration-exploitation based algorithm, which first approximates the stochastic objectives by sampling and then solves a constrained stochastic optimization problem by the projected gradient method. This method attains a suboptimal convergence rate even under strong assumptions on the objectives. Our second approach is an efficient primal-dual stochastic algorithm. It leverages the theory of the Lagrangian method in constrained optimization and attains the optimal convergence rate of O(1/√T) in high probability for general Lipschitz continuous objectives.

3 0.47956669 193 nips-2013-Mixed Optimization for Smooth Functions

Author: Mehrdad Mahdavi, Lijun Zhang, Rong Jin

Abstract: It is well known that the optimal convergence rate for stochastic optimization of smooth functions is O(1/√T), which is the same as for stochastic optimization of Lipschitz continuous convex functions. This is in contrast to optimizing smooth functions using full gradients, which yields a convergence rate of O(1/T²). In this work, we consider a new setup for optimizing smooth functions, termed Mixed Optimization, which allows access to both a stochastic oracle and a full gradient oracle. Our goal is to significantly improve the convergence rate of stochastic optimization of smooth functions by having an additional small number of accesses to the full gradient oracle. We show that, with O(ln T) calls to the full gradient oracle and O(T) calls to the stochastic oracle, the proposed mixed optimization algorithm is able to achieve an optimization error of O(1/T).

4 0.31479555 2 nips-2013-(Nearly) Optimal Algorithms for Private Online Learning in Full-information and Bandit Settings

Author: Abhradeep Guha Thakurta, Adam Smith

Abstract: We give differentially private algorithms for a large class of online learning algorithms, in both the full information and bandit settings. Our algorithms aim to minimize a convex loss function which is a sum of smaller convex loss terms, one for each data point. To design our algorithms, we modify the popular mirror descent approach, or rather a variant called follow the approximate leader. The technique leads to the first nonprivate algorithms for private online learning in the bandit setting. In the full information setting, our algorithms improve over the regret bounds of previous work (due to Dwork, Naor, Pitassi and Rothblum (2010) and Jain, Kothari and Thakurta (2012)). In many cases, our algorithms (in both settings) match the dependence on the input length, T, of the optimal nonprivate regret bounds up to logarithmic factors in T. Our algorithms require logarithmic space and update time.

5 0.22940288 20 nips-2013-Accelerating Stochastic Gradient Descent using Predictive Variance Reduction

Author: Rie Johnson, Tong Zhang

Abstract: Stochastic gradient descent is popular for large scale optimization but has slow convergence asymptotically due to the inherent variance. To remedy this problem, we introduce an explicit variance reduction method for stochastic gradient descent which we call stochastic variance reduced gradient (SVRG). For smooth and strongly convex functions, we prove that this method enjoys the same fast convergence rate as those of stochastic dual coordinate ascent (SDCA) and Stochastic Average Gradient (SAG). However, our analysis is significantly simpler and more intuitive. Moreover, unlike SDCA or SAG, our method does not require the storage of gradients, and thus is more easily applicable to complex problems such as some structured prediction problems and neural network learning.

6 0.19168872 89 nips-2013-Dimension-Free Exponentiated Gradient

7 0.1585996 333 nips-2013-Trading Computation for Communication: Distributed Stochastic Dual Coordinate Ascent

8 0.14728083 240 nips-2013-Optimization, Learning, and Games with Predictable Sequences

9 0.14419293 191 nips-2013-Minimax Optimal Algorithms for Unconstrained Linear Optimization

10 0.13414349 26 nips-2013-Adaptive Market Making via Online Learning

11 0.12002781 19 nips-2013-Accelerated Mini-Batch Stochastic Dual Coordinate Ascent

12 0.11283528 230 nips-2013-Online Learning with Costly Features and Labels

13 0.093451478 325 nips-2013-The Pareto Regret Frontier

14 0.091277465 313 nips-2013-Stochastic Majorization-Minimization Algorithms for Large-Scale Optimization

15 0.088615797 53 nips-2013-Bayesian inference for low rank spatiotemporal neural receptive fields

16 0.086839043 125 nips-2013-From Bandits to Experts: A Tale of Domination and Independence

17 0.0848561 48 nips-2013-Bayesian Inference and Learning in Gaussian Process State-Space Models with Particle MCMC

18 0.083440408 239 nips-2013-Optimistic policy iteration and natural actor-critic: A unifying view and a non-optimality result

19 0.079355657 318 nips-2013-Structured Learning via Logistic Regression

20 0.074291751 222 nips-2013-On the Linear Convergence of the Proximal Gradient Method for Trace Norm Regularization


similar papers computed by lsi model

lsi for this paper:

topicId topicWeight

[(0, 0.18), (1, -0.064), (2, 0.238), (3, -0.176), (4, 0.013), (5, 0.012), (6, -0.286), (7, 0.243), (8, 0.39), (9, 0.162), (10, -0.044), (11, 0.171), (12, 0.103), (13, -0.267), (14, -0.034), (15, 0.172), (16, -0.103), (17, 0.058), (18, -0.024), (19, 0.037), (20, 0.061), (21, 0.015), (22, -0.087), (23, -0.039), (24, 0.042), (25, 0.053), (26, -0.057), (27, 0.027), (28, -0.082), (29, -0.039), (30, 0.085), (31, -0.069), (32, 0.019), (33, 0.033), (34, 0.029), (35, 0.053), (36, 0.067), (37, 0.041), (38, -0.046), (39, 0.016), (40, 0.028), (41, 0.068), (42, 0.011), (43, -0.029), (44, 0.013), (45, -0.027), (46, -0.015), (47, 0.002), (48, -0.029), (49, 0.031)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.9826169 175 nips-2013-Linear Convergence with Condition Number Independent Access of Full Gradients

Author: Lijun Zhang, Mehrdad Mahdavi, Rong Jin

Abstract: For smooth and strongly convex optimizations, the optimal iteration complexity of the gradient-based algorithm is O(√κ log 1/ǫ), where κ is the condition number. In the case that the optimization problem is ill-conditioned, we need to evaluate a large number of full gradients, which could be computationally expensive. In this paper, we propose to remove the dependence on the condition number by allowing the algorithm to access stochastic gradients of the objective function. To this end, we present a novel algorithm named Epoch Mixed Gradient Descent (EMGD) that is able to utilize two kinds of gradients. A distinctive step in EMGD is the mixed gradient descent, where we use a combination of the full and stochastic gradients to update the intermediate solution. Theoretical analysis shows that EMGD is able to find an ǫ-optimal solution by computing O(log 1/ǫ) full gradients and O(κ² log 1/ǫ) stochastic gradients.

2 0.93980956 311 nips-2013-Stochastic Convex Optimization with Multiple Objectives

Author: Mehrdad Mahdavi, Tianbao Yang, Rong Jin

Abstract: In this paper, we are interested in the development of efficient algorithms for convex optimization problems in the simultaneous presence of multiple objectives and stochasticity in the first-order information. We cast the stochastic multiple-objective optimization problem into a constrained optimization problem by choosing one function as the objective and trying to bound the other objectives by appropriate thresholds. We first examine a two-stage exploration-exploitation based algorithm, which first approximates the stochastic objectives by sampling and then solves a constrained stochastic optimization problem by the projected gradient method. This method attains a suboptimal convergence rate even under strong assumptions on the objectives. Our second approach is an efficient primal-dual stochastic algorithm. It leverages the theory of the Lagrangian method in constrained optimization and attains the optimal convergence rate of O(1/√T) in high probability for general Lipschitz continuous objectives.

3 0.90329808 193 nips-2013-Mixed Optimization for Smooth Functions

Author: Mehrdad Mahdavi, Lijun Zhang, Rong Jin

Abstract: It is well known that the optimal convergence rate for stochastic optimization of smooth functions is O(1/√T), which is the same as for stochastic optimization of Lipschitz continuous convex functions. This is in contrast to optimizing smooth functions using full gradients, which yields a convergence rate of O(1/T²). In this work, we consider a new setup for optimizing smooth functions, termed Mixed Optimization, which allows access to both a stochastic oracle and a full gradient oracle. Our goal is to significantly improve the convergence rate of stochastic optimization of smooth functions by having an additional small number of accesses to the full gradient oracle. We show that, with O(ln T) calls to the full gradient oracle and O(T) calls to the stochastic oracle, the proposed mixed optimization algorithm is able to achieve an optimization error of O(1/T).

4 0.81235868 20 nips-2013-Accelerating Stochastic Gradient Descent using Predictive Variance Reduction

Author: Rie Johnson, Tong Zhang

Abstract: Stochastic gradient descent is popular for large scale optimization but has slow convergence asymptotically due to the inherent variance. To remedy this problem, we introduce an explicit variance reduction method for stochastic gradient descent which we call stochastic variance reduced gradient (SVRG). For smooth and strongly convex functions, we prove that this method enjoys the same fast convergence rate as those of stochastic dual coordinate ascent (SDCA) and Stochastic Average Gradient (SAG). However, our analysis is significantly simpler and more intuitive. Moreover, unlike SDCA or SAG, our method does not require the storage of gradients, and thus is more easily applicable to complex problems such as some structured prediction problems and neural network learning.

5 0.65114009 333 nips-2013-Trading Computation for Communication: Distributed Stochastic Dual Coordinate Ascent

Author: Tianbao Yang

Abstract: We present and study a distributed optimization algorithm by employing a stochastic dual coordinate ascent method. Stochastic dual coordinate ascent methods enjoy strong theoretical guarantees and often have better performance than stochastic gradient descent methods in optimizing regularized loss minimization problems. Efforts to study them in a distributed framework are still lacking. We make progress along this line by presenting a distributed stochastic dual coordinate ascent algorithm in a star network, with an analysis of the tradeoff between computation and communication. We verify our analysis by experiments on real data sets. Moreover, we compare the proposed algorithm with distributed stochastic gradient descent methods and distributed alternating direction methods of multipliers for optimizing SVMs in the same distributed framework, and observe competitive performances.

6 0.6263634 2 nips-2013-(Nearly) Optimal Algorithms for Private Online Learning in Full-information and Bandit Settings

7 0.50468546 19 nips-2013-Accelerated Mini-Batch Stochastic Dual Coordinate Ascent

8 0.48223847 89 nips-2013-Dimension-Free Exponentiated Gradient

9 0.47783315 26 nips-2013-Adaptive Market Making via Online Learning

10 0.46804482 313 nips-2013-Stochastic Majorization-Minimization Algorithms for Large-Scale Optimization

11 0.44567439 345 nips-2013-Variance Reduction for Stochastic Gradient Optimization

12 0.35758409 111 nips-2013-Estimation, Optimization, and Parallelism when Data is Sparse

13 0.32275978 314 nips-2013-Stochastic Optimization of PCA with Capped MSG

14 0.31939894 240 nips-2013-Optimization, Learning, and Games with Predictable Sequences

15 0.31744051 53 nips-2013-Bayesian inference for low rank spatiotemporal neural receptive fields

16 0.3077468 222 nips-2013-On the Linear Convergence of the Proximal Gradient Method for Trace Norm Regularization

17 0.30759659 214 nips-2013-On Algorithms for Sparse Multi-factor NMF

18 0.3070465 45 nips-2013-BIG & QUIC: Sparse Inverse Covariance Estimation for a Million Variables

19 0.28907049 198 nips-2013-More Effective Distributed ML via a Stale Synchronous Parallel Parameter Server

20 0.28000101 230 nips-2013-Online Learning with Costly Features and Labels


similar papers computed by lda model

lda for this paper:

topicId topicWeight

[(2, 0.045), (16, 0.02), (33, 0.172), (34, 0.067), (41, 0.232), (49, 0.019), (56, 0.088), (70, 0.014), (85, 0.022), (89, 0.016), (93, 0.025), (95, 0.024), (99, 0.159)]

similar papers list:

simIndex simValue paperId paperTitle

1 0.86499912 257 nips-2013-Projected Natural Actor-Critic

Author: Philip S. Thomas, William C. Dabney, Stephen Giguere, Sridhar Mahadevan

Abstract: Natural actor-critics form a popular class of policy search algorithms for finding locally optimal policies for Markov decision processes. In this paper we address a drawback of natural actor-critics that limits their real-world applicability—their lack of safety guarantees. We present a principled algorithm for performing natural gradient descent over a constrained domain. In the context of reinforcement learning, this allows for natural actor-critic algorithms that are guaranteed to remain within a known safe region of policy space. While deriving our class of constrained natural actor-critic algorithms, which we call Projected Natural Actor-Critics (PNACs), we also elucidate the relationship between natural gradient descent and mirror descent.

2 0.85222334 193 nips-2013-Mixed Optimization for Smooth Functions

Author: Mehrdad Mahdavi, Lijun Zhang, Rong Jin

Abstract: It is well known that the optimal convergence rate for stochastic optimization of smooth functions is O(1/√T), which is the same as for stochastic optimization of Lipschitz continuous convex functions. This is in contrast to optimizing smooth functions using full gradients, which yields a convergence rate of O(1/T²). In this work, we consider a new setup for optimizing smooth functions, termed Mixed Optimization, which allows access to both a stochastic oracle and a full gradient oracle. Our goal is to significantly improve the convergence rate of stochastic optimization of smooth functions by having an additional small number of accesses to the full gradient oracle. We show that, with O(ln T) calls to the full gradient oracle and O(T) calls to the stochastic oracle, the proposed mixed optimization algorithm is able to achieve an optimization error of O(1/T).

same-paper 3 0.84646928 175 nips-2013-Linear Convergence with Condition Number Independent Access of Full Gradients

Author: Lijun Zhang, Mehrdad Mahdavi, Rong Jin

Abstract: For smooth and strongly convex optimizations, the optimal iteration complexity of the gradient-based algorithm is O(√κ log 1/ǫ), where κ is the condition number. In the case that the optimization problem is ill-conditioned, we need to evaluate a large number of full gradients, which could be computationally expensive. In this paper, we propose to remove the dependence on the condition number by allowing the algorithm to access stochastic gradients of the objective function. To this end, we present a novel algorithm named Epoch Mixed Gradient Descent (EMGD) that is able to utilize two kinds of gradients. A distinctive step in EMGD is the mixed gradient descent, where we use a combination of the full and stochastic gradients to update the intermediate solution. Theoretical analysis shows that EMGD is able to find an ǫ-optimal solution by computing O(log 1/ǫ) full gradients and O(κ² log 1/ǫ) stochastic gradients.

4 0.83683598 258 nips-2013-Projecting Ising Model Parameters for Fast Mixing

Author: Justin Domke, Xianghang Liu

Abstract: Inference in general Ising models is difficult, due to high treewidth making tree-based algorithms intractable. Moreover, when interactions are strong, Gibbs sampling may take exponential time to converge to the stationary distribution. We present an algorithm to project Ising model parameters onto a parameter set that is guaranteed to be fast mixing, under several divergences. We find that Gibbs sampling using the projected parameters is more accurate than with the original parameters when interaction strengths are strong and when limited time is available for sampling.

5 0.82665211 189 nips-2013-Message Passing Inference with Chemical Reaction Networks

Author: Nils E. Napp, Ryan P. Adams

Abstract: Recent work on molecular programming has explored new possibilities for computational abstractions with biomolecules, including logic gates, neural networks, and linear systems. In the future such abstractions might enable nanoscale devices that can sense and control the world at a molecular scale. Just as in macroscale robotics, it is critical that such devices can learn about their environment and reason under uncertainty. At this small scale, systems are typically modeled as chemical reaction networks. In this work, we develop a procedure that can take arbitrary probabilistic graphical models, represented as factor graphs over discrete random variables, and compile them into chemical reaction networks that implement inference. In particular, we show that marginalization based on sum-product message passing can be implemented in terms of reactions between chemical species whose concentrations represent probabilities. We show algebraically that the steady-state concentrations of these species correspond to the marginal distributions of the random variables in the graph and validate the results in simulations. As with standard sum-product inference, this procedure yields exact results for tree-structured graphs, and approximate solutions for loopy graphs.

6 0.78319073 180 nips-2013-Low-rank matrix reconstruction and clustering via approximate message passing

7 0.78093553 62 nips-2013-Causal Inference on Time Series using Restricted Structural Equation Models

8 0.76544344 128 nips-2013-Generalized Method-of-Moments for Rank Aggregation

9 0.75770289 11 nips-2013-A New Convex Relaxation for Tensor Completion

10 0.73944837 169 nips-2013-Learning to Prune in Metric and Non-Metric Spaces

11 0.73181301 151 nips-2013-Learning Chordal Markov Networks by Constraint Satisfaction

12 0.7218793 77 nips-2013-Correlations strike back (again): the case of associative memory retrieval

13 0.71994591 313 nips-2013-Stochastic Majorization-Minimization Algorithms for Large-Scale Optimization

14 0.71310949 20 nips-2013-Accelerating Stochastic Gradient Descent using Predictive Variance Reduction

15 0.69886434 311 nips-2013-Stochastic Convex Optimization with Multiple Objectives

16 0.69222349 283 nips-2013-Robust Sparse Principal Component Regression under the High Dimensional Elliptical Model

17 0.67263174 113 nips-2013-Exact and Stable Recovery of Pairwise Interaction Tensors

18 0.66987878 100 nips-2013-Dynamic Clustering via Asymptotics of the Dependent Dirichlet Process Mixture

19 0.66784811 54 nips-2013-Bayesian optimization explains human active search

20 0.66701609 239 nips-2013-Optimistic policy iteration and natural actor-critic: A unifying view and a non-optimality result