nips nips2013 nips2013-193 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Mehrdad Mahdavi, Lijun Zhang, Rong Jin
Abstract: It is well known that the optimal convergence rate for stochastic optimization of smooth functions is O(1/√T), which is the same as stochastic optimization of Lipschitz continuous convex functions. This is in contrast to optimizing smooth functions using full gradients, which yields a convergence rate of O(1/T^2). In this work, we consider a new setup for optimizing smooth functions, termed Mixed Optimization, which allows access to both a stochastic oracle and a full gradient oracle. Our goal is to significantly improve the convergence rate of stochastic optimization of smooth functions by having an additional small number of accesses to the full gradient oracle. We show that, with O(ln T) calls to the full gradient oracle and O(T) calls to the stochastic oracle, the proposed mixed optimization algorithm is able to achieve an optimization error of O(1/T).
Reference: text
sentIndex sentText sentNum sentScore
1 It is well known that the optimal convergence rate for stochastic optimization of smooth functions is O(1/√T), which is the same as stochastic optimization of Lipschitz continuous convex functions. [sent-2, score-0.864]
2 This is in contrast to optimizing smooth functions using full gradients, which yields a convergence rate of O(1/T^2). [sent-3, score-0.426]
3 In this work, we consider a new setup for optimizing smooth functions, termed Mixed Optimization, which allows access to both a stochastic oracle and a full gradient oracle. [sent-4, score-0.863]
4 Our goal is to significantly improve the convergence rate of stochastic optimization of smooth functions by having an additional small number of accesses to the full gradient oracle. [sent-5, score-0.972]
5 We show that, with O(ln T) calls to the full gradient oracle and O(T) calls to the stochastic oracle, the proposed mixed optimization algorithm is able to achieve an optimization error of O(1/T). [sent-6, score-1.072]
6 In this study, we focus on learning problems for which the loss function gi(w) is smooth. [sent-10, score-0.34]
7 Examples of smooth loss functions include least squares with gi(w) = (yi − ⟨w, xi⟩)^2 and logistic regression with gi(w) = log(1 + exp(−yi⟨w, xi⟩)). [sent-11, score-0.805]
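For concreteness, here is a minimal sketch of these two losses and their gradients (our own illustration in Python; the function names and array-based signatures are assumptions, not the paper's notation):

```python
import numpy as np

def squared_loss(w, x, y):
    # g_i(w) = (y_i - <w, x_i>)^2; beta-smooth with beta = 2 * ||x||^2
    r = y - np.dot(w, x)
    return r * r

def squared_loss_grad(w, x, y):
    return -2.0 * (y - np.dot(w, x)) * x

def logistic_loss(w, x, y):
    # g_i(w) = log(1 + exp(-y_i <w, x_i>)); beta-smooth with beta = ||x||^2 / 4
    return np.logaddexp(0.0, -y * np.dot(w, x))  # numerically stable log(1 + e^z)

def logistic_loss_grad(w, x, y):
    return -y * x / (1.0 + np.exp(y * np.dot(w, x)))
```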
8 Since the regularization is enforced through the restricted domain W, we did not introduce an ℓ2 regularizer λ∥w∥^2/2 into the optimization problem, and as a result we do not assume the loss function to be strongly convex. [sent-12, score-0.263]
9 We note that a small ℓ2 regularizer does NOT improve the convergence rate of stochastic optimization. [sent-13, score-0.355]
10 More specifically, the convergence rate of stochastically optimizing an ℓ2 regularized loss function remains O(1/√T) when λ = O(1/√T) [11, Theorem 1], a scenario that is often encountered in real-world applications. [sent-14, score-0.136]
11 A preliminary approach for solving the optimization problem in (1) is the batch gradient descent (GD) algorithm [16]. [sent-15, score-0.296]
12 It starts with some initial point, and iteratively updates the solution using the equation wt+1 = Π_W(wt − η∇G(wt)), where Π_W(·) is the orthogonal projection onto the convex domain W. [sent-16, score-0.14]
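A minimal sketch of this update, assuming a Euclidean-ball domain W = {w : ∥w∥ ≤ R} as a stand-in for the abstract convex domain W:

```python
import numpy as np

def project_ball(w, radius):
    # Orthogonal projection Pi_W onto W = {w : ||w|| <= radius}
    norm = np.linalg.norm(w)
    return w if norm <= radius else (radius / norm) * w

def batch_gd(full_grad, w0, eta, T, radius):
    # w_{t+1} = Pi_W(w_t - eta * grad G(w_t)); each step averages over all n examples
    w = w0.copy()
    for _ in range(T):
        w = project_ball(w - eta * full_grad(w), radius)
    return w
```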
13 It has been shown that for smooth objective functions, the convergence rate of standard GD is O(1/T) [16], and can be improved to O(1/T^2) by an accelerated GD algorithm [15, 16, 18]. [sent-17, score-0.354]
14 The main shortcoming of the GD method is its high cost in computing the full gradient ∇G(wt) when the number of training examples is large. [sent-18, score-0.251]
15 Stochastic gradient descent (SGD) [3, 13, 21] alleviates this limitation of GD by sampling one (or a small set of) examples and computing a stochastic (sub)gradient at each iteration based on the sampled examples. [sent-19, score-0.375]
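A matching SGD sketch under the same ball-domain assumption (reusing numpy and project_ball from the sketch above); each step touches one sampled example instead of all n:

```python
def sgd(sample_grad, n, w0, eta, T, radius, seed=0):
    # sample_grad(w, i) returns the (sub)gradient of g_i at w
    rng = np.random.default_rng(seed)
    w = w0.copy()
    for _ in range(T):
        i = rng.integers(n)  # one example per iteration: cheap, but noisy
        w = project_ball(w - eta * sample_grad(w, i), radius)
    return w
```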
16 While SGD enjoys a high computational efficiency per iteration, it suffers from a slow convergence rate for optimizing smooth functions. [sent-23, score-0.317]
17 The optimization error cannot be decreased at a rate better than O(1/√T), which is significantly worse than GD, which uses the full gradients for updating the solutions; this limitation also holds when the target function is smooth. [sent-25, score-0.241]
18 In addition, as we can see from Table 1, for general Lipschitz functions, SGD exhibits the same convergence rate as that for the smooth functions, implying that smoothness is essentially not very useful and cannot be exploited in stochastic optimization. [sent-26, score-0.534]
19 It is the variance in stochastic gradients that makes the convergence rate O(1/√T) unimprovable in the smooth setting [14, 1]. [sent-28, score-0.489]
20 In this study, we are interested in designing an efficient algorithm that is in the same spirit of SGD but can effectively leverage the smoothness of the loss function to achieve a significantly faster convergence rate. [sent-29, score-0.212]
21 To this end, we consider a new setup for optimization that allows us to interpolate between stochastic and deterministic gradient descent methods. [sent-30, score-0.47]
22 We refer to this new setting as mixed optimization in order to distinguish it from both stochastic and full gradient optimization models. [sent-32, score-0.647]
23 The key question we examined in this study is: Is it possible to improve the convergence rate for stochastic optimization of smooth functions by having a small number of calls to the full gradient oracle Of ? [sent-33, score-1.137]
24 MixedGrad builds on multistage methods [11] and operates in epochs, but involves novel ingredients so as to obtain an O(1/T) rate for smooth losses. [sent-36, score-0.222]
25 In particular, we form a sequence of strongly convex objective functions to be optimized at each epoch and decrease the amount of regularization and shrink the domain as the algorithm proceeds. [sent-37, score-0.468]
26 The full gradient oracle Of is only called at the beginning of each epoch. [sent-38, score-0.488]
27 In contrast, MixedGrad is an alternation of deterministic and stochastic gradient steps, with different frequencies for each type of step. [sent-40, score-0.366]
28 Our result for mixed optimization is useful for the scenario where the full gradient of the objective function can be computed relatively efficiently, although it is still significantly more expensive than computing a stochastic gradient. [sent-41, score-0.631]
29 An example of such a scenario is distributed computing, where the computation of full gradients can be sped up by running it in parallel on many machines, with each machine containing a relatively small subset of the entire training data. [sent-42, score-0.162]
30 Of course, the latency due to the communication between machines will result in an additional cost for computing the full gradient in a distributed fashion. [sent-43, score-0.251]
31 We begin in Section 2 by briefly reviewing the literature on deterministic and stochastic optimization. [sent-45, score-0.174]
32 Section 4 describes the MixedGrad algorithm and states the main result on its convergence rate. [sent-47, score-0.08]
33 1 The convergence rate can be improved to O(1/T ) when the structure of the objective function is provided. [sent-50, score-0.19]
34 We note that the stochastic oracle assumed in our study is slightly stronger than the stochastic gradient oracle as it returns the sampled function instead of the stochastic gradient. [sent-51, score-1.108]
35 2 More Related Work. Deterministic Smooth Optimization: The convergence rate of gradient based methods usually depends on the analytical properties of the objective function to be optimized. [sent-52, score-0.366]
36 When the objective function is strongly convex and smooth, it is well known that a simple GD method can achieve a linear convergence rate [5]. [sent-53, score-0.317]
37 For a non-smooth Lipschitz-continuous function, the optimal rate for the first order method is only O(1/√T) [16]. [sent-54, score-0.087]
38 Although the O(1/√T) rate is not improvable in general, several recent studies are able to improve this rate to O(1/T) by exploiting the special structure of the objective function [18, 17]. [sent-55, score-0.202]
39 In the full gradient based convex optimization, smoothness is a highly desirable property. [sent-56, score-0.386]
40 It has been shown that a simple GD achieves a convergence rate of O(1/T) when the objective function is smooth, which can be further improved to O(1/T^2) by using the accelerated gradient methods [15, 18, 16]. [sent-57, score-0.398]
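For concreteness, a sketch of Nesterov's accelerated method in its standard unconstrained form (our own illustration; [15, 18, 16] treat more general constrained settings):

```python
def nesterov_agd(full_grad, w0, beta, T):
    # Attains the O(1/T^2) rate for beta-smooth convex objectives
    w, z = w0.copy(), w0.copy()
    for t in range(1, T + 1):
        w_next = z - full_grad(z) / beta                   # gradient step at lookahead z
        z = w_next + (t - 1.0) / (t + 2.0) * (w_next - w)  # momentum extrapolation
        w = w_next
    return w
```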
41 Stochastic Smooth Optimization: Unlike the optimization methods based on full gradients, the smoothness assumption was not exploited by most stochastic optimization methods. [sent-58, score-0.484]
42 In fact, it was shown in [14] that the O(1/√T) convergence rate for stochastic optimization cannot be improved even when the objective function is smooth. [sent-59, score-0.425]
43 This classical result is further confirmed by the recent studies of composite bounds for the first order optimization methods [2, 12]. [sent-60, score-0.094]
44 The smoothness of the objective function is exploited extensively in mini-batch stochastic optimization [7, 8], where the goal is not to improve the convergence rate but to reduce the variance in stochastic gradients and consequently the number of times for updating the solutions [24]. [sent-61, score-0.845]
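A minimal illustration of the mini-batch idea (our own sketch, not the algorithms of [7, 8]): averaging b independent stochastic gradients keeps the expectation but divides the variance by b.

```python
def minibatch_sgd_step(sample_grad, n, w, eta, b, radius, rng):
    # Reuses project_ball from the earlier sketch
    idx = rng.integers(n, size=b)
    g = sum(sample_grad(w, i) for i in idx) / b  # variance shrinks by a factor of b
    return project_ball(w - eta * g, radius)
```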
45 We assume the objective function G(w) defined in (1) to be the average of n convex loss functions. [sent-66, score-0.143]
46 Besides convexity of individual functions, we will also assume that each gi (w) is β-smooth as formally defined below [16]. [sent-70, score-0.333]
47 A differentiable loss function f(w) is said to be β-smooth with respect to a norm ∥·∥ if it holds that f(w) ≤ f(w′) + ⟨∇f(w′), w − w′⟩ + (β/2)∥w − w′∥² for all w, w′ ∈ W. The smoothness assumption also implies ⟨∇f(w) − ∇f(w′), w − w′⟩ ≤ β∥w − w′∥², and is equivalent to ∇f(w) being β-Lipschitz continuous. [sent-72, score-0.133]
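A quick numerical sanity check of this inequality for the logistic loss (our own illustration; β = ∥x∥²/4 is the standard smoothness constant of the logistic loss, which the text does not state):

```python
import numpy as np

rng = np.random.default_rng(1)
x, y = rng.normal(size=5), 1.0
beta = np.dot(x, x) / 4.0  # smoothness constant of the logistic loss

f = lambda w: np.logaddexp(0.0, -y * np.dot(w, x))
grad = lambda w: -y * x / (1.0 + np.exp(y * np.dot(w, x)))

for _ in range(1000):
    w, wp = rng.normal(size=5), rng.normal(size=5)
    rhs = f(wp) + np.dot(grad(wp), w - wp) + 0.5 * beta * np.dot(w - wp, w - wp)
    assert f(w) <= rhs + 1e-9  # f(w) <= f(w') + <grad f(w'), w - w'> + (beta/2)||w - w'||^2
```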
48 The goal of stochastic optimization is to use a bounded number T of oracle calls and compute some w̄ ∈ W such that the optimization error, G(w̄) − G(w∗), is as small as possible. [sent-74, score-0.546]
49 In the mixed optimization model considered in this study, we first relax the stochastic oracle Os by assuming that it will return a randomly sampled loss function gi(w), instead of the gradient ∇gi(w), for a given solution w (see footnote 3). [sent-75, score-1.387]
50 Second, we assume that the learner also has access to the full gradient oracle Of. [sent-76, score-0.518]
51 Our goal is to significantly improve the convergence rate of stochastic gradient descent (SGD) by making a small number of calls to the full gradient oracle Of . [sent-77, score-1.126]
52 In particular, we show that by having only O(log T ) accesses to the full gradient oracle and O(T ) accesses to the stochastic oracle, we can tolerate the noise in stochastic gradients and attain an O(1/T ) convergence rate for optimizing smooth functions. [sent-78, score-1.441]
53 3 The audience may feel that this relaxation of stochastic oracle could provide significantly more information, and second order methods such as Online Newton [10] may be applied to achieve O(1/T ) convergence. [sent-79, score-0.408]
54 We note (i) the proposed algorithm is a first order method, and (ii) although the Online Newton method yields a regret bound of O(1/T), its convergence rate for optimization can be as low as O(1/√T) due to the concentration bound for martingales. [sent-80, score-0.245]
55 In addition, the Online Newton method is only applicable to exponentially concave functions, not arbitrary smooth loss functions. [sent-81, score-0.173]
56 3 Algorithm 1 MixedGrad. Input: step size η1, domain size ∆1, number of iterations T1 for the first epoch, number of epochs m, regularization parameter λ1, and shrinking parameter γ > 1. 1: Initialize w̄1 = 0. 2: for k = 1, ..., m do [sent-82, score-0.121]
57 3: Construct the domain Wk = {w : w + w̄k ∈ W, ∥w∥ ≤ ∆k}. 4: Call the full gradient oracle Of for ∇G(w̄k). 5: Compute ḡk = λk w̄k + ∇G(w̄k) = λk w̄k + (1/n) Σ_{i=1}^n ∇gi(w̄k). 6: Initialize w1^k = 0. 7: for t = 1, ... [sent-85, score-3.081]
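A compact Python sketch of the epoch structure (a minimal reading of the listing above, not the authors' code: the listing is truncated in the extraction, so the inner stochastic step is an assumption reconstructed from the definitions of ḡk and g̃i^k given below, and the projection onto Wk is simplified to the ball constraint ∥w∥ ≤ ∆k):

```python
import numpy as np

def mixed_grad(grads, n, w_bar, eta, delta, T1, m, lam, gamma=2.0, seed=0):
    # grads(i, w): gradient of g_i at w; the full gradient oracle O_f is
    # simulated by averaging all n of them once per epoch.
    rng = np.random.default_rng(seed)
    T = T1
    for k in range(m):
        g_full = np.mean([grads(i, w_bar) for i in range(n)], axis=0)  # oracle O_f
        g_bar = lam * w_bar + g_full                                   # step 5
        w = np.zeros_like(w_bar)
        avg = np.zeros_like(w_bar)
        for _ in range(int(T)):                                        # oracle O_s
            i = rng.integers(n)
            g_tilde = grads(i, w + w_bar) - grads(i, w_bar)  # variance-reduced part
            w = w - eta * (lam * w + g_bar + g_tilde)
            norm = np.linalg.norm(w)                 # project onto ||w|| <= delta_k
            if norm > delta:
                w *= delta / norm
            avg += w
        w_bar = w_bar + avg / T                      # update by the epoch average
        lam, delta, eta, T = lam / gamma, delta / gamma, eta / gamma, T * gamma**2
    return w_bar
```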
58 It follows the epoch gradient descent algorithm proposed in [11] for stochastically minimizing strongly convex functions and divides the optimization process into m epochs, but involves novel ingredients so as to obtain an O(1/T ) convergence rate. [sent-94, score-0.773]
59 The key idea is to introduce an ℓ2 regularizer into the objective function to make it strongly convex, and gradually reduce the amount of regularization over the epochs. [sent-95, score-0.16]
60 Let w̄k be the solution obtained before the kth epoch, which is initialized to be 0 for the first epoch. [sent-100, score-0.658]
61 By introducing the ℓ2 regularizer, the objective function in (2) becomes strongly convex, making it possible to exploit the technique for stochastic optimization of strongly convex functions in order to improve the convergence rate. [sent-102, score-0.538]
62 By removing the constant term λk∥w̄k∥²/2 from the objective function in (2), we obtain the following optimization problem for the kth epoch:

min_{w∈Wk} Fk(w) = (λk/2)∥w∥² + λk⟨w, w̄k⟩ + (1/n) Σ_{i=1}^n gi(w + w̄k),   (3)

where Wk = {w : w + w̄k ∈ W, ∥w∥ ≤ ∆k}. [sent-106, score-2.322]
63 Here ḡk = λk w̄k + (1/n) Σ_{i=1}^n ∇gi(w̄k). The main reason for using g̃i^k(w) instead of ∇gi(w) is to tolerate the variance in the stochastic gradients. [sent-108, score-1.417]
64 To see this, from the smoothness assumption of gi(w) we obtain the following inequality for the norm of g̃i^k(w): ∥g̃i^k(w)∥ = ∥∇gi(w + w̄k) − ∇gi(w̄k)∥ ≤ β∥w∥. [sent-109, score-2.131]
65 At the end of each epoch, we compute the average solution ŵk over the iterates of the epoch, and update the solution from w̄k to w̄k+1 = w̄k + ŵk. [sent-112, score-2.264]
66 Similar to the epoch gradient descent algorithm [11], we increase the number of iterations by a constant factor γ² at every epoch, i.e., Tk+1 = γ²Tk. [sent-113, score-0.432]
67 In order to perform the stochastic gradient updating given in (5), we need to compute the vector ḡk at the beginning of the kth epoch, which requires access to the full gradient oracle Of. [sent-116, score-1.063]
68 It is easy to count that the number of accesses to the full gradient oracle Of is m, and the number of accesses to the stochastic oracle Os is

T = Σ_{i=1}^m T1 γ^{2(i−1)} = T1 (γ^{2m} − 1)/(γ² − 1). [sent-117, score-1.11]
69 Thus, if the total number of accesses to the stochastic gradient oracle is T, the number of accesses to the full gradient oracle required by Algorithm 1 is O(ln T), consistent with our goal of making a small number of calls to the full gradient oracle. [sent-118, score-1.563]
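A quick check of this count with illustrative numbers: for γ = 2 the epoch lengths grow as T1·4^(k−1), so the full-gradient budget m scales like ln T.

```python
import math

T1, gamma = 300, 2
for m in (5, 10, 15):
    T = T1 * (gamma ** (2 * m) - 1) // (gamma ** 2 - 1)  # total stochastic calls
    print(m, T, round(math.log(T), 1))  # m grows like O(ln T)
```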
70 The theorem below shows that for smooth objective functions, by having O(ln T) accesses to the full gradient oracle Of and O(T) accesses to the stochastic oracle Os, the MixedGrad algorithm achieves an optimization error of O(1/T). [sent-119, score-1.226]
71 Set γ = 2, λ1 = 16β, T1 = 300 ln(m/δ), η1 = 1/(2β√(3T1)), and ∆1 = R. [sent-122, score-0.099]
72 Let w̄m+1 be the solution returned by Algorithm 1 after m epochs, with m = O(ln T) calls to the full gradient oracle Of and T calls to the stochastic oracle Os. [sent-124, score-1.115]
73 Let w∗^k and w∗^{k+1} be the optimal solutions that minimize Fk(w) and Fk+1(w), respectively, and let ŵk be the average solution obtained at the end of the kth epoch of the MixedGrad algorithm. [sent-132, score-0.367]
74 By setting the step size ηk = 1/(2β√(3Tk)), we have, with a probability 1 − 2δ,

∥w∗^{k+1}∥ ≤ ∆k/γ   and   Fk(ŵk) − min_{w∈Wk} Fk(w) ≤ λk∆k²/(2γ⁴),

provided that δ ≤ e^{−9/2} and Tk ≥ (300γ⁸β²/λk²) ln(1/δ). [sent-134, score-0.099]
75 Let w∗^m be the optimal solution that minimizes Fm(w) and let w∗^{m+1} be the optimal solution obtained in the last epoch. [sent-138, score-0.126]
76 By combining the above inequalities, we obtain

(1/n) Σ_{i=1}^n gi(w̄m+1) ≤ Fm(w∗^m) + λ1∆1²/γ^{3m+1} + 2λ1∆1²/γ^{2m−2}. [sent-141, score-0.299]
77 Since w∗^m minimizes Fm(w), for any w∗ ∈ arg min G(w), we have

Fm(w∗^m) ≤ Fm(w∗ − w̄m) = (1/n) Σ_{i=1}^n gi(w∗) + (λ1/(2γ^{m−1})) (∥w∗ − w̄m∥² + 2⟨w∗ − w̄m, w̄m⟩).   (6) [sent-143, score-0.998]
78 Thus, the key to bounding |Fm(w∗^m) − G(w∗)| is to bound ∥w∗ − w̄m∥. [sent-144, score-0.227]
79 For this sequence of solutions, Theorem 2 will hold deterministically as we deploy the full gradient for updating, i. [sent-150, score-0.251]
80 Since w∗ is the limit of the sequence {w̄k}_{k=m+1}^∞ and ∥ŵk∥ ≤ ∆k for any k ≥ m + 1, we have

∥w∗ − w̄m∥ ≤ Σ_{i=m+1}^∞ ∥ŵi∥ ≤ Σ_{k=m+1}^∞ ∆k ≤ ∆1/(γ^m (1 − γ^{−1})) ≤ 2∆1/γ^m,

where the last step follows from the condition γ ≥ 2. [sent-154, score-0.227]
81 1 Proof of Theorem 2 For the convenience of discussion, we drop the subscript k for epoch just to simplify our notation. [sent-158, score-0.214]
82 Let w̄ = w̄k be the solution obtained before the start of the epoch k, and let w̄′ = w̄k+1 be the solution obtained after running through the kth epoch. [sent-160, score-0.888]
83 We denote by F(w) and F ′ (w) the objective functions Fk (w) and Fk+1 (w). [sent-161, score-0.075]
84 They are given by

F(w) = (λ/2)∥w∥² + λ⟨w, w̄⟩ + (1/n) Σ_{i=1}^n gi(w + w̄),   (8)
F′(w) = (λ/(2γ))∥w∥² + (λ/γ)⟨w, w̄′⟩ + (1/n) Σ_{i=1}^n gi(w + w̄′).   (9)

Let w∗ = w∗^k and w∗′ = w∗^{k+1} be the optimal solutions that minimize F(w) and F′(w) over the domains Wk and Wk+1, respectively. [sent-162, score-0.684]
85 With a probability 1 − 2δ, we have

BT ≤ β∆² (ln(1/δ) + √(2T ln(1/δ)))   and   CT ≤ 2β∆² (ln(1/δ) + √(2T ln(1/δ))). [sent-173, score-0.396]
86 Thus, when T ≥ [300γ⁸β² ln(1/δ)]/λ², we have, with a probability 1 − 2δ, ∥w̄T+1∥ ≤ ∆/γ and |F(w̄T+1) − F(w∗)| ≤ (λ/(2γ⁴))∆², where w̄T+1 denotes the average solution after the T inner iterations of the epoch. [sent-175, score-0.099]
87 6 Conclusions and Open Questions. We presented a new paradigm of optimization, termed mixed optimization, that aims to improve the convergence rate of stochastic optimization by making a small number of calls to the full gradient oracle. [sent-180, score-0.872]
88 We proposed the MixedGrad algorithm and showed that it is able to achieve an O(1/T) convergence rate by accessing the stochastic and full gradient oracles O(T) and O(log T) times, respectively. [sent-181, score-0.599]
89 We showed that the MixedGrad algorithm is able to exploit the smoothness of the function, which is believed to be not very useful in stochastic optimization. [sent-182, score-0.231]
90 In the future, we would like to examine the optimality of our algorithm, namely whether it is possible to achieve a better convergence rate for stochastic optimization of smooth functions using O(ln T) accesses to the full gradient oracle. [sent-183, score-0.951]
91 Lastly, it is very interesting to check whether an O(1/T^2) rate could be achieved by an accelerated method in the mixed optimization scenario, and whether linear convergence rates could be achieved in the strongly-convex case. [sent-185, score-0.342]
92 Information-theoretic lower bounds on the oracle complexity of stochastic convex optimization. [sent-197, score-0.436]
93 Mirror descent and nonlinear projected subgradient methods for convex optimization. [sent-202, score-0.103]
94 Sample size selection in optimization methods for machine learning. [sent-230, score-0.078]
95 Beyond the regret minimization barrier: an optimal algorithm for stochastic strongly-convex optimization. [sent-261, score-0.193]
96 A method of solving a convex programming problem with convergence rate O(1/k²). [sent-287, score-0.21]
97 Making gradient descent optimal for strongly convex stochastic optimization. [sent-307, score-0.503]
98 A stochastic gradient method with an exponential convergence rate for finite training sets. [sent-315, score-0.482]
99 Stochastic gradient descent for non-smooth optimization: Convergence results and optimal averaging schemes. [sent-331, score-0.236]
100 O(log T) projections for stochastic optimization of smooth and strongly convex functions. [sent-338, score-0.477]
wordName wordTfidf (topN-words)
[('wk', 0.544), ('gi', 0.299), ('wt', 0.241), ('wm', 0.227), ('git', 0.224), ('oracle', 0.218), ('epoch', 0.194), ('ixed', 0.176), ('gradient', 0.176), ('os', 0.162), ('stochastic', 0.157), ('accesses', 0.133), ('rad', 0.132), ('smooth', 0.132), ('fm', 0.129), ('fk', 0.12), ('gd', 0.105), ('ln', 0.099), ('calls', 0.095), ('gk', 0.094), ('mixed', 0.083), ('sgd', 0.083), ('convergence', 0.08), ('optimization', 0.078), ('kth', 0.078), ('full', 0.075), ('smoothness', 0.074), ('rate', 0.069), ('tk', 0.067), ('convex', 0.061), ('gradients', 0.051), ('access', 0.049), ('strongly', 0.049), ('epochs', 0.045), ('ct', 0.044), ('bt', 0.043), ('domain', 0.043), ('descent', 0.042), ('loss', 0.041), ('objective', 0.041), ('stochastically', 0.038), ('solution', 0.036), ('shamir', 0.036), ('optimizing', 0.036), ('functions', 0.034), ('convexity', 0.034), ('accelerated', 0.032), ('consequentially', 0.032), ('epoches', 0.032), ('lipschitz', 0.03), ('newton', 0.029), ('lemma', 0.029), ('regularizer', 0.026), ('regularization', 0.026), ('oracles', 0.025), ('solutions', 0.025), ('returns', 0.025), ('lemmas', 0.024), ('tolerate', 0.024), ('improve', 0.023), ('exploited', 0.022), ('processors', 0.022), ('updating', 0.021), ('br', 0.021), ('ingredients', 0.021), ('proof', 0.021), ('scenario', 0.021), ('shrink', 0.02), ('iterations', 0.02), ('subscript', 0.02), ('termed', 0.02), ('beginning', 0.019), ('shrinks', 0.019), ('jin', 0.019), ('lectures', 0.019), ('minimizes', 0.018), ('regret', 0.018), ('norm', 0.018), ('gradually', 0.018), ('optimal', 0.018), ('achieve', 0.017), ('deterministic', 0.017), ('composite', 0.016), ('end', 0.016), ('theorem', 0.016), ('online', 0.016), ('paradigm', 0.016), ('hybrid', 0.016), ('logt', 0.016), ('nemirovsky', 0.016), ('alternation', 0.016), ('audience', 0.016), ('friedlander', 0.016), ('lijun', 0.016), ('soviet', 0.016), ('rmative', 0.016), ('hazan', 0.016), ('goal', 0.015), ('yi', 0.015), ('speeded', 0.015)]
simIndex simValue paperId paperTitle
same-paper 1 1.0000006 193 nips-2013-Mixed Optimization for Smooth Functions
Author: Mehrdad Mahdavi, Lijun Zhang, Rong Jin
Abstract: It is well known that the optimal convergence rate for stochastic optimization of smooth functions is O(1/√T), which is the same as stochastic optimization of Lipschitz continuous convex functions. This is in contrast to optimizing smooth functions using full gradients, which yields a convergence rate of O(1/T^2). In this work, we consider a new setup for optimizing smooth functions, termed Mixed Optimization, which allows access to both a stochastic oracle and a full gradient oracle. Our goal is to significantly improve the convergence rate of stochastic optimization of smooth functions by having an additional small number of accesses to the full gradient oracle. We show that, with O(ln T) calls to the full gradient oracle and O(T) calls to the stochastic oracle, the proposed mixed optimization algorithm is able to achieve an optimization error of O(1/T).
2 0.47956669 175 nips-2013-Linear Convergence with Condition Number Independent Access of Full Gradients
Author: Lijun Zhang, Mehrdad Mahdavi, Rong Jin
Abstract: For smooth and strongly convex optimization, the optimal iteration complexity of the gradient-based algorithm is O(√κ log(1/ǫ)), where κ is the condition number. In the case that the optimization problem is ill-conditioned, we need to evaluate a large number of full gradients, which could be computationally expensive. In this paper, we propose to remove the dependence on the condition number by allowing the algorithm to access stochastic gradients of the objective function. To this end, we present a novel algorithm named Epoch Mixed Gradient Descent (EMGD) that is able to utilize two kinds of gradients. A distinctive step in EMGD is the mixed gradient descent, where we use a combination of the full and stochastic gradients to update the intermediate solution. Theoretical analysis shows that EMGD is able to find an ǫ-optimal solution by computing O(log(1/ǫ)) full gradients and O(κ² log(1/ǫ)) stochastic gradients.
3 0.28819865 311 nips-2013-Stochastic Convex Optimization with Multiple Objectives
Author: Mehrdad Mahdavi, Tianbao Yang, Rong Jin
Abstract: In this paper, we are interested in the development of efficient algorithms for convex optimization problems in the simultaneous presence of multiple objectives and stochasticity in the first-order information. We cast the stochastic multiple objective optimization problem into a constrained optimization problem by choosing one function as the objective and try to bound other objectives by appropriate thresholds. We first examine a two-stage exploration-exploitation based algorithm which first approximates the stochastic objectives by sampling and then solves a constrained stochastic optimization problem by the projected gradient method. This method attains a suboptimal convergence rate even under strong assumptions on the objectives. Our second approach is an efficient primal-dual stochastic algorithm. It leverages the theory of the Lagrangian method in constrained optimization and attains the optimal convergence rate of O(1/√T) in high probability for general Lipschitz continuous objectives.
4 0.20352426 236 nips-2013-Optimal Neural Population Codes for High-dimensional Stimulus Variables
Author: Zhuo Wang, Alan Stocker, Daniel Lee
Abstract: In many neural systems, information about stimulus variables is often represented in a distributed manner by means of a population code. It is generally assumed that the responses of the neural population are tuned to the stimulus statistics, and most prior work has investigated the optimal tuning characteristics of one or a small number of stimulus variables. In this work, we investigate the optimal tuning for diffeomorphic representations of high-dimensional stimuli. We analytically derive the solution that minimizes the L2 reconstruction loss. We compared our solution with other well-known criteria such as maximal mutual information. Our solution suggests that the optimal weights do not necessarily decorrelate the inputs, and the optimal nonlinearity differs from the conventional equalization solution. Results illustrating these optimal representations are shown for some input distributions that may be relevant for understanding the coding of perceptual pathways. 1
5 0.20188348 239 nips-2013-Optimistic policy iteration and natural actor-critic: A unifying view and a non-optimality result
Author: Paul Wagner
Abstract: Approximate dynamic programming approaches to the reinforcement learning problem are often categorized into greedy value function methods and value-based policy gradient methods. As our first main result, we show that an important subset of the latter methodology is, in fact, a limiting special case of a general formulation of the former methodology; optimistic policy iteration encompasses not only most of the greedy value function methods but also natural actor-critic methods, and permits one to directly interpolate between them. The resulting continuum adjusts the strength of the Markov assumption in policy improvement and, as such, can be seen as dual in spirit to the continuum in TD(λ)-style algorithms in policy evaluation. As our second main result, we show for a substantial subset of softgreedy value function approaches that, while having the potential to avoid policy oscillation and policy chattering, this subset can never converge toward an optimal policy, except in a certain pathological case. Consequently, in the context of approximations (either in state estimation or in value function representation), the majority of greedy value function methods seem to be deemed to suffer either from the risk of oscillation/chattering or from the presence of systematic sub-optimality. 1
6 0.17012689 20 nips-2013-Accelerating Stochastic Gradient Descent using Predictive Variance Reduction
7 0.12736234 2 nips-2013-(Nearly) Optimal Algorithms for Private Online Learning in Full-information and Bandit Settings
8 0.12618284 333 nips-2013-Trading Computation for Communication: Distributed Stochastic Dual Coordinate Ascent
9 0.12429318 345 nips-2013-Variance Reduction for Stochastic Gradient Optimization
10 0.12027103 222 nips-2013-On the Linear Convergence of the Proximal Gradient Method for Trace Norm Regularization
11 0.11336499 313 nips-2013-Stochastic Majorization-Minimization Algorithms for Large-Scale Optimization
12 0.11220211 89 nips-2013-Dimension-Free Exponentiated Gradient
13 0.10767023 257 nips-2013-Projected Natural Actor-Critic
14 0.10616685 29 nips-2013-Adaptive Submodular Maximization in Bandit Setting
15 0.093458645 229 nips-2013-Online Learning of Nonparametric Mixture Models via Sequential Variational Approximation
16 0.088639431 111 nips-2013-Estimation, Optimization, and Parallelism when Data is Sparse
17 0.082981363 153 nips-2013-Learning Feature Selection Dependencies in Multi-task Learning
18 0.082434863 56 nips-2013-Better Approximation and Faster Algorithm Using the Proximal Average
19 0.07708697 271 nips-2013-Regularized M-estimators with nonconvexity: Statistical and algorithmic theory for local optima
20 0.077007674 230 nips-2013-Online Learning with Costly Features and Labels
topicId topicWeight
[(0, 0.185), (1, -0.046), (2, 0.129), (3, -0.071), (4, -0.009), (5, 0.067), (6, -0.248), (7, 0.165), (8, 0.328), (9, 0.129), (10, -0.024), (11, 0.221), (12, 0.046), (13, -0.254), (14, -0.03), (15, 0.111), (16, -0.097), (17, 0.076), (18, -0.045), (19, 0.056), (20, 0.074), (21, 0.022), (22, -0.046), (23, -0.021), (24, -0.016), (25, 0.013), (26, -0.067), (27, 0.029), (28, -0.047), (29, -0.043), (30, -0.013), (31, -0.016), (32, -0.011), (33, -0.028), (34, -0.064), (35, 0.026), (36, 0.017), (37, 0.045), (38, -0.022), (39, 0.055), (40, -0.02), (41, 0.004), (42, 0.031), (43, 0.019), (44, 0.053), (45, -0.03), (46, -0.07), (47, 0.042), (48, -0.05), (49, 0.034)]
simIndex simValue paperId paperTitle
same-paper 1 0.97947627 193 nips-2013-Mixed Optimization for Smooth Functions
Author: Mehrdad Mahdavi, Lijun Zhang, Rong Jin
Abstract: It is well known that the optimal convergence rate for stochastic optimization of smooth functions is O(1/√T), which is the same as stochastic optimization of Lipschitz continuous convex functions. This is in contrast to optimizing smooth functions using full gradients, which yields a convergence rate of O(1/T^2). In this work, we consider a new setup for optimizing smooth functions, termed Mixed Optimization, which allows access to both a stochastic oracle and a full gradient oracle. Our goal is to significantly improve the convergence rate of stochastic optimization of smooth functions by having an additional small number of accesses to the full gradient oracle. We show that, with O(ln T) calls to the full gradient oracle and O(T) calls to the stochastic oracle, the proposed mixed optimization algorithm is able to achieve an optimization error of O(1/T).
2 0.93827468 175 nips-2013-Linear Convergence with Condition Number Independent Access of Full Gradients
Author: Lijun Zhang, Mehrdad Mahdavi, Rong Jin
Abstract: For smooth and strongly convex optimization, the optimal iteration complexity of the gradient-based algorithm is O(√κ log(1/ǫ)), where κ is the condition number. In the case that the optimization problem is ill-conditioned, we need to evaluate a large number of full gradients, which could be computationally expensive. In this paper, we propose to remove the dependence on the condition number by allowing the algorithm to access stochastic gradients of the objective function. To this end, we present a novel algorithm named Epoch Mixed Gradient Descent (EMGD) that is able to utilize two kinds of gradients. A distinctive step in EMGD is the mixed gradient descent, where we use a combination of the full and stochastic gradients to update the intermediate solution. Theoretical analysis shows that EMGD is able to find an ǫ-optimal solution by computing O(log(1/ǫ)) full gradients and O(κ² log(1/ǫ)) stochastic gradients.
3 0.88071924 311 nips-2013-Stochastic Convex Optimization with Multiple Objectives
Author: Mehrdad Mahdavi, Tianbao Yang, Rong Jin
Abstract: In this paper, we are interested in the development of efficient algorithms for convex optimization problems in the simultaneous presence of multiple objectives and stochasticity in the first-order information. We cast the stochastic multiple objective optimization problem into a constrained optimization problem by choosing one function as the objective and try to bound other objectives by appropriate thresholds. We first examine a two-stage exploration-exploitation based algorithm which first approximates the stochastic objectives by sampling and then solves a constrained stochastic optimization problem by the projected gradient method. This method attains a suboptimal convergence rate even under strong assumptions on the objectives. Our second approach is an efficient primal-dual stochastic algorithm. It leverages the theory of the Lagrangian method in constrained optimization and attains the optimal convergence rate of O(1/√T) in high probability for general Lipschitz continuous objectives.
4 0.83183819 20 nips-2013-Accelerating Stochastic Gradient Descent using Predictive Variance Reduction
Author: Rie Johnson, Tong Zhang
Abstract: Stochastic gradient descent is popular for large scale optimization but has slow convergence asymptotically due to the inherent variance. To remedy this problem, we introduce an explicit variance reduction method for stochastic gradient descent which we call stochastic variance reduced gradient (SVRG). For smooth and strongly convex functions, we prove that this method enjoys the same fast convergence rate as those of stochastic dual coordinate ascent (SDCA) and Stochastic Average Gradient (SAG). However, our analysis is significantly simpler and more intuitive. Moreover, unlike SDCA or SAG, our method does not require the storage of gradients, and thus is more easily applicable to complex problems such as some structured prediction problems and neural network learning. 1
5 0.68823254 333 nips-2013-Trading Computation for Communication: Distributed Stochastic Dual Coordinate Ascent
Author: Tianbao Yang
Abstract: We present and study a distributed optimization algorithm by employing a stochastic dual coordinate ascent method. Stochastic dual coordinate ascent methods enjoy strong theoretical guarantees and often have better performances than stochastic gradient descent methods in optimizing regularized loss minimization problems. There has been little effort to study them in a distributed framework. We make progress along this line by presenting a distributed stochastic dual coordinate ascent algorithm in a star network, with an analysis of the tradeoff between computation and communication. We verify our analysis by experiments on real data sets. Moreover, we compare the proposed algorithm with distributed stochastic gradient descent methods and distributed alternating direction methods of multipliers for optimizing SVMs in the same distributed framework, and observe competitive performances.
6 0.60383755 313 nips-2013-Stochastic Majorization-Minimization Algorithms for Large-Scale Optimization
7 0.55211163 19 nips-2013-Accelerated Mini-Batch Stochastic Dual Coordinate Ascent
8 0.53912115 345 nips-2013-Variance Reduction for Stochastic Gradient Optimization
9 0.49664521 2 nips-2013-(Nearly) Optimal Algorithms for Private Online Learning in Full-information and Bandit Settings
10 0.45388085 111 nips-2013-Estimation, Optimization, and Parallelism when Data is Sparse
11 0.4133437 257 nips-2013-Projected Natural Actor-Critic
12 0.40979919 236 nips-2013-Optimal Neural Population Codes for High-dimensional Stimulus Variables
13 0.40247527 214 nips-2013-On Algorithms for Sparse Multi-factor NMF
14 0.39372802 222 nips-2013-On the Linear Convergence of the Proximal Gradient Method for Trace Norm Regularization
15 0.39254278 89 nips-2013-Dimension-Free Exponentiated Gradient
16 0.34605455 271 nips-2013-Regularized M-estimators with nonconvexity: Statistical and algorithmic theory for local optima
17 0.34347528 198 nips-2013-More Effective Distributed ML via a Stale Synchronous Parallel Parameter Server
18 0.3424077 26 nips-2013-Adaptive Market Making via Online Learning
19 0.33878091 249 nips-2013-Polar Operators for Structured Sparse Estimation
20 0.33833057 239 nips-2013-Optimistic policy iteration and natural actor-critic: A unifying view and a non-optimality result
topicId topicWeight
[(16, 0.027), (33, 0.164), (34, 0.047), (41, 0.419), (49, 0.029), (56, 0.095), (70, 0.018), (85, 0.025), (89, 0.018), (93, 0.032), (95, 0.014), (99, 0.021)]
simIndex simValue paperId paperTitle
1 0.91381198 257 nips-2013-Projected Natural Actor-Critic
Author: Philip S. Thomas, William C. Dabney, Stephen Giguere, Sridhar Mahadevan
Abstract: Natural actor-critics form a popular class of policy search algorithms for finding locally optimal policies for Markov decision processes. In this paper we address a drawback of natural actor-critics that limits their real-world applicability—their lack of safety guarantees. We present a principled algorithm for performing natural gradient descent over a constrained domain. In the context of reinforcement learning, this allows for natural actor-critic algorithms that are guaranteed to remain within a known safe region of policy space. While deriving our class of constrained natural actor-critic algorithms, which we call Projected Natural ActorCritics (PNACs), we also elucidate the relationship between natural gradient descent and mirror descent. 1
same-paper 2 0.85858536 193 nips-2013-Mixed Optimization for Smooth Functions
Author: Mehrdad Mahdavi, Lijun Zhang, Rong Jin
Abstract: It is well known that the optimal convergence rate for stochastic optimization of smooth functions is O(1/√T), which is the same as stochastic optimization of Lipschitz continuous convex functions. This is in contrast to optimizing smooth functions using full gradients, which yields a convergence rate of O(1/T^2). In this work, we consider a new setup for optimizing smooth functions, termed Mixed Optimization, which allows access to both a stochastic oracle and a full gradient oracle. Our goal is to significantly improve the convergence rate of stochastic optimization of smooth functions by having an additional small number of accesses to the full gradient oracle. We show that, with O(ln T) calls to the full gradient oracle and O(T) calls to the stochastic oracle, the proposed mixed optimization algorithm is able to achieve an optimization error of O(1/T).
3 0.8511619 189 nips-2013-Message Passing Inference with Chemical Reaction Networks
Author: Nils E. Napp, Ryan P. Adams
Abstract: Recent work on molecular programming has explored new possibilities for computational abstractions with biomolecules, including logic gates, neural networks, and linear systems. In the future such abstractions might enable nanoscale devices that can sense and control the world at a molecular scale. Just as in macroscale robotics, it is critical that such devices can learn about their environment and reason under uncertainty. At this small scale, systems are typically modeled as chemical reaction networks. In this work, we develop a procedure that can take arbitrary probabilistic graphical models, represented as factor graphs over discrete random variables, and compile them into chemical reaction networks that implement inference. In particular, we show that marginalization based on sum-product message passing can be implemented in terms of reactions between chemical species whose concentrations represent probabilities. We show algebraically that the steady state concentration of these species correspond to the marginal distributions of the random variables in the graph and validate the results in simulations. As with standard sum-product inference, this procedure yields exact results for tree-structured graphs, and approximate solutions for loopy graphs.
4 0.8176387 258 nips-2013-Projecting Ising Model Parameters for Fast Mixing
Author: Justin Domke, Xianghang Liu
Abstract: Inference in general Ising models is difficult, due to high treewidth making treebased algorithms intractable. Moreover, when interactions are strong, Gibbs sampling may take exponential time to converge to the stationary distribution. We present an algorithm to project Ising model parameters onto a parameter set that is guaranteed to be fast mixing, under several divergences. We find that Gibbs sampling using the projected parameters is more accurate than with the original parameters when interaction strengths are strong and when limited time is available for sampling.
5 0.76261622 175 nips-2013-Linear Convergence with Condition Number Independent Access of Full Gradients
Author: Lijun Zhang, Mehrdad Mahdavi, Rong Jin
Abstract: For smooth and strongly convex optimization, the optimal iteration complexity of the gradient-based algorithm is O(√κ log(1/ǫ)), where κ is the condition number. In the case that the optimization problem is ill-conditioned, we need to evaluate a large number of full gradients, which could be computationally expensive. In this paper, we propose to remove the dependence on the condition number by allowing the algorithm to access stochastic gradients of the objective function. To this end, we present a novel algorithm named Epoch Mixed Gradient Descent (EMGD) that is able to utilize two kinds of gradients. A distinctive step in EMGD is the mixed gradient descent, where we use a combination of the full and stochastic gradients to update the intermediate solution. Theoretical analysis shows that EMGD is able to find an ǫ-optimal solution by computing O(log(1/ǫ)) full gradients and O(κ² log(1/ǫ)) stochastic gradients.
6 0.75560117 62 nips-2013-Causal Inference on Time Series using Restricted Structural Equation Models
7 0.73293364 180 nips-2013-Low-rank matrix reconstruction and clustering via approximate message passing
8 0.68616503 11 nips-2013-A New Convex Relaxation for Tensor Completion
9 0.66676056 77 nips-2013-Correlations strike back (again): the case of associative memory retrieval
10 0.61112434 20 nips-2013-Accelerating Stochastic Gradient Descent using Predictive Variance Reduction
11 0.59771973 311 nips-2013-Stochastic Convex Optimization with Multiple Objectives
12 0.56115264 113 nips-2013-Exact and Stable Recovery of Pairwise Interaction Tensors
13 0.55392551 239 nips-2013-Optimistic policy iteration and natural actor-critic: A unifying view and a non-optimality result
14 0.55365711 54 nips-2013-Bayesian optimization explains human active search
15 0.54255342 184 nips-2013-Marginals-to-Models Reducibility
16 0.53983748 301 nips-2013-Sparse Additive Text Models with Low Rank Background
17 0.53658658 333 nips-2013-Trading Computation for Communication: Distributed Stochastic Dual Coordinate Ascent
18 0.53348523 236 nips-2013-Optimal Neural Population Codes for High-dimensional Stimulus Variables
19 0.5333221 295 nips-2013-Simultaneous Rectification and Alignment via Robust Recovery of Low-rank Tensors
20 0.53324372 176 nips-2013-Linear decision rule as aspiration for simple decision heuristics