nips nips2012 nips2012-134 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Andre Wibisono, Martin J. Wainwright, Michael I. Jordan, John C. Duchi
Abstract: We consider derivative-free algorithms for stochastic optimization problems that use only noisy function values rather than gradients, analyzing their finite-sample convergence rates. We show that if pairs of function values are available, algorithms that use gradient estimates based on random perturbations suffer a factor of at most √d in convergence rate over traditional stochastic gradient methods, where d is the problem dimension. We complement our algorithmic development with information-theoretic lower bounds on the minimax convergence rate of such problems, which show that our bounds are sharp with respect to all problem-dependent quantities: they cannot be improved by more than constant factors.
Reference: text
sentIndex sentText sentNum sentScore
1 Abstract: We consider derivative-free algorithms for stochastic optimization problems that use only noisy function values rather than gradients, analyzing their finite-sample convergence rates. [sent-6, score-0.416]
2 We show that if pairs of function values are available, algorithms that use gradient estimates based on random perturbations suffer a factor of at most √d in convergence rate over traditional stochastic gradient methods, where d is the problem dimension. [sent-7, score-0.678]
3 We complement our algorithmic development with information-theoretic lower bounds on the minimax convergence rate of such problems, which show that our bounds are sharp with respect to all problem-dependent quantities: they cannot be improved by more than constant factors. [sent-8, score-0.764]
4 1 Introduction. Derivative-free optimization schemes have a long history in optimization (see, for example, the book by Spall [21]), and they have the clearly desirable property of never requiring explicit gradient calculations. [sent-9, score-0.358]
5 Classical techniques in stochastic and non-stochastic optimization, including Kiefer-Wolfowitz-type procedures [e. [sent-10, score-0.182]
6 Despite the long history and recent renewed interest in such procedures, an understanding of their finite-sample convergence rates remains elusive. [sent-17, score-0.276]
7 Our focus is on the convergence rates of algorithms that observe only stochastic realizations of the function values f (θ). [sent-19, score-0.45]
8 [13], who build on this approach, applying it to bandit convex optimization problems. [sent-22, score-0.341]
9 The convergence rates in these works are (retrospectively) sub-optimal [20, 2]: Agarwal et al. [sent-23, score-0.306]
10 [2] provide algorithms that achieve convergence rates (ignoring logarithmic factors) of O(poly(d)/√k), where poly(d) is a polynomial in the dimension d, for stochastic algorithms receiving only single function values, but (as the authors themselves note) the algorithms are quite complicated. [sent-24, score-0.499]
11 Some of the difficulties inherent in optimization using only a single function evaluation can be alleviated when the function F (·; x) can be evaluated at two points, as noted independently by Agarwal et al. [sent-25, score-0.172]
12 The insight is that for small u, the quantity (F (θ + uZ; x) − F (θ; x))/u approximates a directional derivative of F (θ; x) and can thus be used in first-order optimization schemes. [sent-27, score-0.204]
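A minimal sketch of this two-point construction (the function names and the quadratic test objective below are illustrative assumptions, not from the paper):

```python
import numpy as np

def two_point_grad(F, theta, x, u, z):
    # Estimate: z * (F(theta + u*z; x) - F(theta; x)) / u, which for small u
    # approximates the directional derivative <grad F(theta; x), z> times z.
    return z * (F(theta + u * z, x) - F(theta, x)) / u

# Illustrative smooth instance (an assumption): F(theta; x) = 0.5 * ||theta - x||_2^2,
# whose gradient is theta - x.
rng = np.random.default_rng(0)
d = 5
theta, x = rng.standard_normal(d), rng.standard_normal(d)
F = lambda th, xx: 0.5 * np.sum((th - xx) ** 2)
z = rng.standard_normal(d)
g_hat = two_point_grad(F, theta, x, u=1e-6, z=z)
print(np.allclose(g_hat, np.dot(theta - x, z) * z, atol=1e-4))  # True for small u
```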
13 Such two-sample-based gradient estimators allow simpler analyses, with sharper convergence rates [1, 20], than algorithms that have access to only a single function evaluation in each iteration. [sent-28, score-0.432]
14 In the current paper, we take this line of work further, finding the optimal rate of convergence for procedures that are only able to obtain function evaluations, F (·; X), for samples X. [sent-29, score-0.226]
15 In this setting, we analyze stochastic gradient and mirror-descent procedures [27, 18, 6, 19] that construct gradient estimators using the two-point observations Y t . [sent-33, score-0.494]
16 By a careful analysis of the dimension dependence of certain random perturbation schemes, we show that the convergence rate attained by our stochastic gradient methods is roughly a factor of √d worse than that attained by stochastic methods that observe the full gradient ∇F(θ; X). [sent-34, score-1.022]
17 Under appropriate conditions, our convergence rates are a factor of √d better than those attained by Agarwal et al. [sent-35, score-0.386]
18 In addition, though we present our results in the framework of stochastic optimization, our analysis applies to (two-point) bandit online convex optimization problems [13, 5, 1], and we consequently obtain the sharpest rates for such problems. [sent-37, score-0.722]
19 Finally, we show that the convergence rates we provide are tight—meaning sharp to within constant factors—by using information-theoretic techniques for constructing lower bounds on statistical estimators. [sent-38, score-0.556]
20 2 Algorithms. Stochastic mirror descent methods are a class of stochastic gradient methods for solving the problem min_{θ∈Θ} f(θ). [sent-39, score-0.682]
21 They are based on a proximal function ψ, which is a differentiable convex function defined over Θ that is assumed (w.l.o.g. [sent-40, score-0.211]
22 by scaling) to be 1-strongly convex with respect to the norm ‖·‖ over Θ. [sent-44, score-0.192]
23 The mirror descent (MD) method proceeds in a sequence of iterations that we index by t, updating the parameter vector θt ∈ Θ using stochastic gradient information to form θt+1 . [sent-46, score-0.682]
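A minimal sketch of one such iteration, assuming the Euclidean proximal function ψ(θ) = (1/2)‖θ‖₂², under which the update (4) reduces to a projected stochastic gradient step; the ball constraint Θ = {‖θ‖₂ ≤ R} is an illustrative choice, not the paper's code:

```python
import numpy as np

def md_step_euclidean(theta, g, step, R):
    # With psi(theta) = 0.5 * ||theta||_2^2, mirror descent is a gradient step
    # followed by Euclidean projection onto Theta = {||theta||_2 <= R}.
    theta_next = theta - step * g
    nrm = np.linalg.norm(theta_next)
    return theta_next if nrm <= R else theta_next * (R / nrm)
```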
24 The proximal function ψ is strongly convex with respect to the norm ‖·‖. [sent-52, score-0.301]
Our second assumption is standard for almost all first-order stochastic gradient methods [19, 24, 20], and it holds whenever the functions F(·; x) are G-Lipschitz with respect to the norm ‖·‖. [sent-54, score-0.473]
26 We use ‖·‖_* to denote the dual norm to ‖·‖, and let g : Θ × X → R^d denote a measurable subgradient selection for the functions F; that is, g(θ; x) ∈ ∂F(θ; x) with E[g(θ; X)] ∈ ∂f(θ). [sent-55, score-0.251]
27 When Assumptions A and B hold, the convergence rate of stochastic mirror descent methods is well understood [6, 19, Section 2. [sent-58, score-0.713]
28 according to P, set g^t = g(θ^t; X^t), and let θ^t be generated by the mirror descent iteration (4) with stepsize α(t) = α/√t. [sent-63, score-0.383]
For the remainder of this section, we explore the use of function difference information to obtain subgradient estimates that can be used in mirror descent methods to achieve statements similar to the convergence guarantee (5). [sent-65, score-0.619]
2.1 Two-point gradient estimates and general convergence rates. In this section, we show—under a reasonable additional assumption—how to use two samples of the random function values F(θ; X) to construct nearly unbiased estimators of the gradient ∇f(θ) of the expected function f. [sent-67, score-0.688]
g^t = Z^t (F(θ^t + u_t Z^t; X^t) − F(θ^t; X^t)) / u_t. (6) We then apply the mirror descent update (4) to the quantity g^t. [sent-72, score-0.542]
Intuitively, for u_t small enough in the construction (6), the vector g^t should be a nearly unbiased estimator of the gradient ∇f(θ). [sent-78, score-0.435]
There is a function L : X → R+ such that for (P-almost every) x ∈ X, the function F(·; x) has L(x)-Lipschitz continuous gradient with respect to the norm ‖·‖, and the quantity L(P)² := E[L(X)²] < ∞. [sent-81, score-0.295]
34 Furthermore, for appropriate random vectors Z, we can also show that g t has small norm, which yields better convergence rates for mirror descent-type methods. [sent-83, score-0.532]
In order to study the convergence of mirror descent methods using the estimator (6), we make the following additional assumption on the distribution µ. [sent-85, score-0.644]
36 Let {ut } ⊂ R+ be a non-increasing sequence of positive numbers, and let θt be generated according to the mirror descent update (4) using the gradient estimator (6). [sent-95, score-0.608]
37 The proof of Theorem 1 requires some technical care—we never truly receive unbiased gradients— and it builds on convergence proofs developed in the analysis of online and stochastic convex optimization [27, 19, 1, 12, 20] to achieve bounds of the form (5). [sent-97, score-0.838]
By using Assumption C, we see that for small enough u_t, the gradient estimator g^t from (6) is close (in expectation with respect to X^t) to f′(θ^t; Z^t) Z^t, which is an unbiased estimate of ∇f(θ^t). [sent-100, score-0.398]
39 Assumption C allows us to bound the moments of the gradient estimator g t . [sent-101, score-0.257]
By carefully showing that the errors in g^t, viewed as an estimator of ∇f(θ^t), scale with u_t, we give an analysis similar to that used to derive the bound (5) and obtain Theorem 1. [sent-102, score-0.24]
In addition, the convergence rate of the method is independent of the Lipschitz continuity constant L(P) of the instantaneous gradients ∇F(·; X); the penalty for nearly non-smooth objective functions comes into the bound only as a second-order term. [sent-109, score-0.417]
Additionally, though we have presented our results as convergence guarantees for stochastic optimization problems, an inspection of our analysis in Appendix A.1 [sent-114, score-0.381]
shows that we obtain (expected) regret bounds for bandit online convex optimization problems [e. [sent-115, score-0.56]
2.2 Examples and corollaries. In this section, we provide examples of random sampling strategies that give direct convergence rate estimates for the mirror descent algorithm with subgradient samples (6). [sent-119, score-0.726]
For each corollary, we specify the norm ‖·‖, proximal function ψ, and distribution µ, verify that Assumptions A, B, C, and D hold, and then apply Theorem 1 to obtain a convergence rate. [sent-120, score-0.336]
We begin with a corollary that describes the convergence rate of our algorithm when the expected function f is Lipschitz continuous with respect to the Euclidean norm ‖·‖₂. [sent-121, score-0.366]
Given the proximal function ψ(θ) := (1/2)‖θ‖₂², suppose that E[‖g(θ; X)‖₂²] ≤ G² and that µ is uniform on the surface of the ℓ₂-ball of radius √d. [sent-123, score-0.247]
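A sketch of the resulting two-point method under these assumptions (uniform direction on the √d-sphere, stepsize α/√t, Euclidean projection, averaged iterate); the test objective and constants are illustrative, not from the paper:

```python
import numpy as np

def sphere_sample(d, rng):
    # Uniform draw from the surface of the l2-ball of radius sqrt(d).
    z = rng.standard_normal(d)
    return np.sqrt(d) * z / np.linalg.norm(z)

def two_point_mirror_descent(F, sample_x, theta0, R, alpha, k, rng):
    theta, d = theta0.astype(float), theta0.shape[0]
    avg = np.zeros_like(theta)
    for t in range(1, k + 1):
        x, z = sample_x(rng), sphere_sample(d, rng)
        u = 1.0 / t  # non-increasing perturbation sequence u_t
        g = z * (F(theta + u * z, x) - F(theta, x)) / u  # estimator (6)
        theta = theta - (alpha / np.sqrt(t)) * g          # Euclidean MD step
        nrm = np.linalg.norm(theta)
        if nrm > R:
            theta *= R / nrm                              # project onto the R-ball
        avg += (theta - avg) / t                          # running average, theta_hat(k)
    return avg

# Illustrative use: f(theta) = E[0.5 * ||theta - X||_2^2] with X ~ 0.1 * N(0, I),
# minimized at theta = 0 inside the ball of radius 2.
rng = np.random.default_rng(1)
F = lambda th, x: 0.5 * np.sum((th - x) ** 2)
theta_hat = two_point_mirror_descent(F, lambda r: 0.1 * r.standard_normal(4),
                                     np.ones(4), R=2.0, alpha=0.5, k=5000, rng=rng)
print(theta_hat)  # close to the minimizer 0
```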
48 The rate provided by Corollary 1 is the fastest derived to date for zero-order stochastic optimization using two function evaluations. [sent-128, score-0.294]
[1] and Nesterov [20] achieve rates of convergence of order RGd/√k. [sent-130, score-0.276]
Nonetheless, Assumption C does not require a uniform bound on the Lipschitz constant L(X) of the gradients ∇F(·; X); moreover, the convergence rate of the method is essentially independent of L(P). [sent-132, score-0.38]
51 In high-dimensional scenarios, appropriate choices for the proximal function ψ yield better scaling on the norm of the gradients [18, 14, 19, 12]. [sent-133, score-0.337]
This scaling is more palatable than dependence on Euclidean norms applied to the gradient vectors, which may be a factor of √d larger. [sent-135, score-0.156]
There are universal constants C1 < 20e and C2 < 10e such that E[f(θ(k))] − f(θ*) ≤ C1 (RG√(d log d)/√k) max{α, α⁻¹} + C2 (RG√(d log d)/k)(αu² + u log k). [sent-142, score-0.262]
Proof: The proof of this corollary is somewhat involved. [sent-143, score-0.179]
First, we recall [18, 7, Appendix 1] that our choice of ψ is strongly convex with respect to the norm ‖·‖_p. [sent-145, score-0.192]
As a consequence, we see that we may take the norm ‖·‖ = ‖·‖₁ and the dual norm ‖·‖_* = ‖·‖_∞, and E[‖⟨g, Z⟩Z‖_q²] ≤ e² E[‖⟨g, Z⟩Z‖_∞²]. [sent-147, score-0.213]
Corollary 2 attains a convergence rate that scales with dimension as √(d log d). [sent-159, score-0.236]
58 This dependence on dimension is much worse than that of (stochastic) mirror descent using full gradient information [18, 19]. [sent-160, score-0.572]
The additional dependence on d suggests that while O(1/ε²) iterations are required to achieve ε-optimization accuracy for mirror descent methods (ignoring logarithmic factors), the two-point method requires O(d/ε²) iterations to obtain the same accuracy. [sent-161, score-0.463]
60 In the next section, we show that this dependence is sharp: except for logarithmic factors, no algorithm can attain better convergence rates, including the problem-dependent constants R and G. [sent-163, score-0.257]
3 Lower bounds on zero-order optimization. We turn to providing lower bounds on the rate of convergence for any method that receives random function values. [sent-164, score-0.581]
For our lower bounds, we fix a norm ‖·‖ on R^d and as usual let ‖·‖_* denote its dual norm. [sent-165, score-0.176]
We assume that Θ = {θ ∈ R^d : ‖θ‖ ≤ R} is the norm ball of radius R. [sent-166, score-0.16]
64 We study all optimization methods that receive function values of random convex functions, building on the analysis of stochastic gradient methods by Agarwal et al. [sent-167, score-0.576]
The classes of functions over which we prove our lower bounds consist of those satisfying Assumption B; that is, for a given Lipschitz constant G > 0, we consider optimization problems over the set F_G. [sent-173, score-0.408]
We are thus interested in its expected value, and we define the minimax error ε*_k(F_G, Θ) := inf_{A∈A_k} sup_{(F,P)∈F_G} E[ε_k(A, F, P, Θ)], (8) where the expectation is over the observations (Y^1, . [sent-177, score-0.223]
3.1 Lower bounds and optimality. In this section, we give lower bounds on the minimax rate of optimization for a few specific settings. [sent-182, score-0.66]
68 We present our main results, then recall Corollaries 1 and 2, which imply we have attained the minimax rates of convergence for zero-order (stochastic) optimization schemes. [sent-183, score-0.634]
We begin by providing minimax lower bounds when the expected function f(θ) = E[F(θ; X)] is Lipschitz continuous with respect to the ℓ₂-norm. [sent-185, score-0.35]
Let Θ = {θ ∈ R^d : ‖θ‖₂ ≤ R} and F_G consist of pairs (F, P) for which the subgradient mapping g satisfies E_P[‖g(θ; X)‖₂²] ≤ G² for θ ∈ Θ. [sent-188, score-0.193]
ε*_k(F_G, Θ) ≥ c RG√d/√k. Combining the lower bounds provided by Proposition 1 with our algorithmic scheme in Section 2 shows that our analysis is essentially sharp, since Corollary 1 provides an upper bound for the minimax optimality of RG√d/√k. [sent-190, score-0.421]
72 The stochastic gradient descent algorithm (4) coupled with the sampling strategy (6) is thus optimal for stochastic problems with two-point feedback. [sent-191, score-0.604]
Now we investigate the minimax rates at which it is possible to solve stochastic convex optimization problems whose objectives are Lipschitz continuous with respect to the ℓ₁-norm. [sent-192, score-0.697]
ε*_k(F_G, Θ) ≥ c GR√d/√k. We may again consider the optimality of our mirror descent algorithms, recalling Corollary 2. [sent-199, score-0.453]
In this case, the MD algorithm (4) with the choice ψ(θ) = (1/(2(p−1)))‖θ‖_p², where p = 1 + 1/log d, implies that there exist universal constants c and C such that c GR√d/√k ≤ ε*_k(F_G, Θ) ≤ C GR√(d log d)/√k for the problem class described in Proposition 2. [sent-200, score-0.213]
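A sketch of the unconstrained mirror step for this choice of ψ, via the mirror map ∇ψ and its inverse ∇ψ*; this is a hedged illustration under the stated ψ (the Bregman projection onto Θ is omitted, and the conjugate formulas are derived, not taken from the paper):

```python
import numpy as np

def grad_psi(theta, p):
    # Gradient of psi(theta) = ||theta||_p^2 / (2(p - 1)).
    nrm = np.linalg.norm(theta, ord=p)
    if nrm == 0.0:
        return np.zeros_like(theta)
    return nrm ** (2 - p) * np.sign(theta) * np.abs(theta) ** (p - 1) / (p - 1)

def grad_psi_star(y, p):
    # Gradient of the conjugate psi*(y) = (p - 1) * ||y||_q^2 / 2, with q = p/(p - 1);
    # this inverts grad_psi.
    q = p / (p - 1)
    nrm = np.linalg.norm(y, ord=q)
    if nrm == 0.0:
        return np.zeros_like(y)
    return (p - 1) * nrm ** (2 - q) * np.sign(y) * np.abs(y) ** (q - 1)

def md_step_pnorm(theta, g, step, d):
    p = 1 + 1 / np.log(d)  # the choice of p from Corollary 2
    return grad_psi_star(grad_psi(theta, p) - step * g, p)
```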
When full gradient information is available, that is, one has access to the subgradient selection g(θ; X), the √d factors appearing in the lower bounds in Propositions 1 and 2 are not present [3]. [sent-203, score-0.475]
The √d factors similarly disappear from the convergence rates in Corollaries 1 and 2 when one uses g^t = g(θ^t; X^t) in the mirror descent updates (4); said differently, the constant s(d) = 1 in Theorem 1 [19, 6]. [sent-204, score-0.739]
As noted in Section 2, our lower bounds consequently show that in addition to dependence on the radius R and second moment G² in the case when gradients are available [3], all algorithms must suffer an additional O(√d) penalty in convergence rate. [sent-205, score-0.589]
79 This suggests that for high-dimensional problems, it is preferable to use full gradient information if possible, even when the cost of obtaining the gradients is somewhat high. [sent-206, score-0.313]
3.2 Proofs of lower bounds. We sketch proofs for our lower bounds on the minimax error (8), which are based on the framework introduced by Agarwal et al. [sent-208, score-0.585]
Define f_v(θ) = E_{P_v}[F_v(θ; X)], and let θ*_v ∈ argmin_{θ∈Θ} f_v(θ). [sent-212, score-0.816]
[3], we define the separation between two functions as ρ(f_v, f_w) := inf_{θ∈Θ} [f_v(θ) + f_w(θ) − f_v(θ*_v) − f_w(θ*_w)], and the minimal separation of the set V (this may depend on the accuracy parameter δ) as ρ*(V) := min{ρ(f_v, f_w) : v, w ∈ V, v ≠ w}. [sent-215, score-1.475]
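For the linear instances f_v(θ) = δ⟨θ, v⟩ used below, this separation has a simple closed form on the ℓ₂-ball; a small numerical check under illustrative choices of δ, R, and unit vectors v, w (an assumption-laden sketch, not from the paper):

```python
import numpy as np

# On Theta = {||theta||_2 <= R}, f_v(theta) = delta * <theta, v> with ||v||_2 = 1
# is minimized at theta*_v = -R * v, with value -delta * R. Hence
#   rho(f_v, f_w) = inf_theta delta * <theta, v + w> + 2 * delta * R
#                 = delta * R * (2 - ||v + w||_2).
delta, R, d = 0.1, 1.0, 8
rng = np.random.default_rng(2)
v = rng.standard_normal(d); v /= np.linalg.norm(v)
w = rng.standard_normal(d); w /= np.linalg.norm(w)

closed_form = delta * R * (2 - np.linalg.norm(v + w))

# Check: the infimum over the ball is attained at theta = -R * (v + w) / ||v + w||_2.
theta_inf = -R * (v + w) / np.linalg.norm(v + w)
check = delta * np.dot(theta_inf, v + w) + 2 * delta * R
print(np.isclose(closed_form, check))  # True
```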
Using Lemma 4, we can obtain a lower bound on the minimax optimization error whenever the instantaneous objective functions are of the form F(θ; x) = ⟨θ, x⟩. [sent-243, score-0.392]
The inequality (12) can give concrete lower bounds on the minimax optimization error. [sent-245, score-0.451]
By construction f_v(θ) = δ⟨θ, v⟩, which allows us to give lower bounds on the minimal separation ρ*(V) for the choices of the norm constraint ‖θ‖ ≤ R in Propositions 1 and 2. [sent-254, score-0.756]
For Proposition 1, an argument using the probabilistic method implies that there are universal constants c₁, c₂ > 0 for which there is a 1/2-packing V of the ℓ₂-sphere of radius 1 with size at least |V| ≥ exp(c₁d) and such that (1/|V|) Σ_{v∈V} vvᵀ ⪯ c₂ I_{d×d}/d. [sent-258, score-0.244]
By the linearity of f_v, we find ρ(f_v, f_w) ≥ δR/16, and setting σ² = G²/(2d) and δ as in the choice (11) implies that E[‖X‖₂²] ≤ G². [sent-259, score-0.562]
4 Discussion. We have analyzed algorithms for stochastic optimization problems that use only random function values—as opposed to gradient computations—to minimize an objective function. [sent-268, score-0.435]
89 As our development of minimax lower bounds shows, the algorithms we present, which build on those proposed by Agarwal et al. [sent-269, score-0.38]
90 [1] and Nesterov [20], are optimal: their convergence rates cannot be improved (in a minimax sense) by more than numerical constant factors. [sent-270, score-0.486]
91 As a consequence of our results, we have attained sharp rates for bandit online convex optimization problems with multi-point feedback. [sent-271, score-0.766]
92 We have also shown that there is a necessary sharp transition in convergence rates between stochastic gradient algorithms and algorithms that compute only function values. [sent-272, score-0.649]
93 This result highlights the advantages of using gradient information when it is available, but we recall that there are many applications in which gradients are not available. [sent-273, score-0.255]
94 Finally, one question that this work leaves open, and which we are actively attempting to address, is whether our convergence rates extend to non-smooth optimization problems. [sent-274, score-0.377]
95 Optimal algorithms for online convex optimization with multi-point bandit feedback. [sent-287, score-0.405]
96 Information-theoretic lower bounds on the oracle complexity of convex optimization. [sent-309, score-0.275]
97 Mirror descent and nonlinear projected subgradient methods for convex optimization. [sent-332, score-0.328]
98 The ordered subsets mirror descent optimization method with applications to tomography. [sent-338, score-0.484]
99 Online convex optimization in the bandit setting: gradient descent without a gradient. [sent-379, score-0.624]
100 Dual averaging methods for regularized stochastic learning and online optimization. [sent-440, score-0.207]
wordName wordTfidf (topN-words)
[('fv', 0.408), ('fg', 0.343), ('mirror', 0.256), ('rg', 0.178), ('minimax', 0.177), ('pv', 0.158), ('gradient', 0.156), ('agarwal', 0.148), ('stochastic', 0.143), ('rates', 0.139), ('bandit', 0.138), ('convergence', 0.137), ('descent', 0.127), ('fw', 0.123), ('bounds', 0.12), ('zz', 0.119), ('ut', 0.11), ('proximal', 0.109), ('rd', 0.107), ('convex', 0.102), ('optimization', 0.101), ('gradients', 0.099), ('subgradient', 0.099), ('norm', 0.09), ('corollary', 0.089), ('nesterov', 0.082), ('attained', 0.08), ('universal', 0.075), ('sharp', 0.074), ('lipschitz', 0.07), ('nemirovski', 0.07), ('gr', 0.07), ('proposition', 0.07), ('radius', 0.07), ('estimator', 0.069), ('ep', 0.069), ('online', 0.064), ('unbiased', 0.063), ('defer', 0.062), ('md', 0.061), ('twenty', 0.059), ('flaxman', 0.059), ('packing', 0.059), ('somewhat', 0.058), ('corollaries', 0.057), ('assumption', 0.055), ('directional', 0.054), ('lower', 0.053), ('uz', 0.052), ('ak', 0.051), ('rate', 0.05), ('log', 0.049), ('appendix', 0.049), ('quantity', 0.049), ('player', 0.049), ('factors', 0.047), ('logarithmic', 0.047), ('separation', 0.046), ('inf', 0.046), ('perturbation', 0.044), ('receive', 0.044), ('lemma', 0.043), ('fano', 0.043), ('sketches', 0.041), ('noted', 0.041), ('constants', 0.04), ('exponentiated', 0.04), ('optimality', 0.039), ('procedures', 0.039), ('choices', 0.039), ('surface', 0.039), ('nearly', 0.037), ('army', 0.037), ('theorem', 0.037), ('consist', 0.037), ('chapter', 0.037), ('poly', 0.036), ('suffer', 0.036), ('propositions', 0.035), ('adversary', 0.035), ('problems', 0.035), ('dependence', 0.033), ('url', 0.033), ('consequence', 0.033), ('constant', 0.033), ('dual', 0.033), ('receiving', 0.033), ('proofs', 0.032), ('bound', 0.032), ('proof', 0.032), ('siam', 0.031), ('recalling', 0.031), ('linearity', 0.031), ('cov', 0.031), ('realizations', 0.031), ('et', 0.03), ('multiplier', 0.03), ('functions', 0.029), ('uniform', 0.029), ('care', 0.029)]
simIndex simValue paperId paperTitle
same-paper 1 1.0000012 134 nips-2012-Finite Sample Convergence Rates of Zero-Order Stochastic Optimization Methods
Author: Andre Wibisono, Martin J. Wainwright, Michael I. Jordan, John C. Duchi
Abstract: We consider derivative-free algorithms for stochastic optimization problems that use only noisy function values rather than gradients, analyzing their finite-sample convergence rates. We show that if pairs of function values are available, algorithms that use gradient estimates based on random perturbations suffer a factor of at most √d in convergence rate over traditional stochastic gradient methods, where d is the problem dimension. We complement our algorithmic development with information-theoretic lower bounds on the minimax convergence rate of such problems, which show that our bounds are sharp with respect to all problem-dependent quantities: they cannot be improved by more than constant factors.
2 0.22249562 325 nips-2012-Stochastic optimization and sparse statistical recovery: Optimal algorithms for high dimensions
Author: Alekh Agarwal, Sahand Negahban, Martin J. Wainwright
Abstract: We develop and analyze stochastic optimization algorithms for problems in which the expected loss is strongly convex, and the optimum is (approximately) sparse. Previous approaches are able to exploit only one of these two structures, yielding a O(d/T ) convergence rate for strongly convex objectives in d dimensions and O(√(s(log d)/T)) convergence rate when the optimum is s-sparse. Our algorithm is based on successively solving a series of ℓ1 -regularized optimization problems using Nesterov’s dual averaging algorithm. We establish that the error of our solution after T iterations is at most O(s(log d)/T ), with natural extensions to approximate sparsity. Our results apply to locally Lipschitz losses including the logistic, exponential, hinge and least-squares losses. By recourse to statistical minimax results, we show that our convergence rates are optimal up to constants. The effectiveness of our approach is also confirmed in numerical simulations where we compare to several baselines on a least-squares regression problem.
3 0.16689122 275 nips-2012-Privacy Aware Learning
Author: Martin J. Wainwright, Michael I. Jordan, John C. Duchi
Abstract: We study statistical risk minimization problems under a version of privacy in which the data is kept confidential even from the learner. In this local privacy framework, we establish sharp upper and lower bounds on the convergence rates of statistical estimation procedures. As a consequence, we exhibit a precise tradeoff between the amount of privacy the data preserves and the utility, measured by convergence rate, of any statistical estimator. 1
4 0.16352074 20 nips-2012-A Stochastic Gradient Method with an Exponential Convergence Rate for Finite Training Sets
Author: Nicolas L. Roux, Mark Schmidt, Francis R. Bach
Abstract: We propose a new stochastic gradient method for optimizing the sum of a finite set of smooth functions, where the sum is strongly convex. While standard stochastic gradient methods converge at sublinear rates for this problem, the proposed method incorporates a memory of previous gradient values in order to achieve a linear convergence rate. In a machine learning context, numerical experiments indicate that the new algorithm can dramatically outperform standard algorithms, both in terms of optimizing the training error and reducing the test error quickly. 1
5 0.16156597 338 nips-2012-The Perturbed Variation
Author: Maayan Harel, Shie Mannor
Abstract: We introduce a new discrepancy score between two distributions that gives an indication on their similarity. While much research has been done to determine if two samples come from exactly the same distribution, much less research considered the problem of determining if two finite samples come from similar distributions. The new score gives an intuitive interpretation of similarity; it optimally perturbs the distributions so that they best fit each other. The score is defined between distributions, and can be efficiently estimated from samples. We provide convergence bounds of the estimated score, and develop hypothesis testing procedures that test if two data sets come from similar distributions. The statistical power of this procedures is presented in simulations. We also compare the score’s capacity to detect similarity with that of other known measures on real data. 1
6 0.15403453 216 nips-2012-Mirror Descent Meets Fixed Share (and feels no regret)
7 0.13255389 324 nips-2012-Stochastic Gradient Descent with Only One Projection
8 0.12523238 301 nips-2012-Scaled Gradients on Grassmann Manifolds for Matrix Completion
9 0.12353576 282 nips-2012-Proximal Newton-type methods for convex optimization
10 0.12150612 164 nips-2012-Iterative Thresholding Algorithm for Sparse Inverse Covariance Estimation
11 0.11688185 6 nips-2012-A Convex Formulation for Learning Scale-Free Networks via Submodular Relaxation
12 0.11571461 241 nips-2012-No-Regret Algorithms for Unconstrained Online Convex Optimization
13 0.1125403 285 nips-2012-Query Complexity of Derivative-Free Optimization
14 0.11081002 263 nips-2012-Optimal Regularized Dual Averaging Methods for Stochastic Optimization
15 0.10127676 84 nips-2012-Convergence Rate Analysis of MAP Coordinate Minimization Algorithms
16 0.099009939 254 nips-2012-On the Sample Complexity of Robust PCA
17 0.096690074 292 nips-2012-Regularized Off-Policy TD-Learning
18 0.093262926 16 nips-2012-A Polynomial-time Form of Robust Regression
19 0.090775505 199 nips-2012-Link Prediction in Graphs with Autoregressive Features
20 0.090008423 293 nips-2012-Relax and Randomize : From Value to Algorithms
topicId topicWeight
[(0, 0.237), (1, -0.005), (2, 0.204), (3, -0.046), (4, 0.121), (5, 0.109), (6, -0.023), (7, -0.005), (8, 0.025), (9, -0.012), (10, 0.096), (11, 0.101), (12, -0.118), (13, -0.014), (14, -0.05), (15, 0.085), (16, -0.077), (17, -0.144), (18, 0.014), (19, 0.015), (20, 0.022), (21, -0.078), (22, 0.103), (23, 0.015), (24, -0.081), (25, -0.067), (26, 0.003), (27, 0.038), (28, -0.011), (29, -0.038), (30, 0.13), (31, -0.068), (32, -0.056), (33, -0.047), (34, -0.029), (35, 0.074), (36, 0.128), (37, -0.041), (38, 0.064), (39, 0.017), (40, 0.055), (41, -0.161), (42, 0.02), (43, -0.009), (44, -0.038), (45, -0.014), (46, 0.014), (47, 0.057), (48, 0.033), (49, 0.074)]
simIndex simValue paperId paperTitle
same-paper 1 0.96576166 134 nips-2012-Finite Sample Convergence Rates of Zero-Order Stochastic Optimization Methods
Author: Andre Wibisono, Martin J. Wainwright, Michael I. Jordan, John C. Duchi
Abstract: We consider derivative-free algorithms for stochastic optimization problems that use only noisy function values rather than gradients, analyzing their finite-sample convergence rates. We show that if pairs of function values are available, algorithms that use gradient estimates based on random perturbations suffer a factor of at most √d in convergence rate over traditional stochastic gradient methods, where d is the problem dimension. We complement our algorithmic development with information-theoretic lower bounds on the minimax convergence rate of such problems, which show that our bounds are sharp with respect to all problem-dependent quantities: they cannot be improved by more than constant factors.
2 0.87355191 325 nips-2012-Stochastic optimization and sparse statistical recovery: Optimal algorithms for high dimensions
Author: Alekh Agarwal, Sahand Negahban, Martin J. Wainwright
Abstract: We develop and analyze stochastic optimization algorithms for problems in which the expected loss is strongly convex, and the optimum is (approximately) sparse. Previous approaches are able to exploit only one of these two structures, yielding a O(d/T ) convergence rate for strongly convex objectives in d dimensions and O(√(s(log d)/T)) convergence rate when the optimum is s-sparse. Our algorithm is based on successively solving a series of ℓ1 -regularized optimization problems using Nesterov’s dual averaging algorithm. We establish that the error of our solution after T iterations is at most O(s(log d)/T ), with natural extensions to approximate sparsity. Our results apply to locally Lipschitz losses including the logistic, exponential, hinge and least-squares losses. By recourse to statistical minimax results, we show that our convergence rates are optimal up to constants. The effectiveness of our approach is also confirmed in numerical simulations where we compare to several baselines on a least-squares regression problem.
3 0.84079474 263 nips-2012-Optimal Regularized Dual Averaging Methods for Stochastic Optimization
Author: Xi Chen, Qihang Lin, Javier Pena
Abstract: This paper considers a wide spectrum of regularized stochastic optimization problems where both the loss function and regularizer can be non-smooth. We develop a novel algorithm based on the regularized dual averaging (RDA) method, that can simultaneously achieve the optimal convergence rates for both convex and strongly convex loss. In particular, for strongly convex loss, it achieves the optimal rate of O(1/N + 1/N²) for N iterations, which improves the rate O(log N/N) for previous regularized dual averaging algorithms. In addition, our method constructs the final solution directly from the proximal mapping instead of averaging of all previous iterates. For widely used sparsity-inducing regularizers (e.g., ℓ1-norm), it has the advantage of encouraging sparser solutions. We further develop a multistage extension using the proposed algorithm as a subroutine, which achieves the uniformly-optimal rate O(1/N + exp{−N}) for strongly convex loss.
4 0.78231055 20 nips-2012-A Stochastic Gradient Method with an Exponential Convergence Rate for Finite Training Sets
Author: Nicolas L. Roux, Mark Schmidt, Francis R. Bach
Abstract: We propose a new stochastic gradient method for optimizing the sum of a finite set of smooth functions, where the sum is strongly convex. While standard stochastic gradient methods converge at sublinear rates for this problem, the proposed method incorporates a memory of previous gradient values in order to achieve a linear convergence rate. In a machine learning context, numerical experiments indicate that the new algorithm can dramatically outperform standard algorithms, both in terms of optimizing the training error and reducing the test error quickly. 1
5 0.7741766 285 nips-2012-Query Complexity of Derivative-Free Optimization
Author: Ben Recht, Kevin G. Jamieson, Robert Nowak
Abstract: This paper provides lower bounds on the convergence rate of Derivative Free Optimization (DFO) with noisy function evaluations, exposing a fundamental and unavoidable gap between the performance of algorithms with access to gradients and those with access to only function evaluations. However, there are situations in which DFO is unavoidable, and for such situations we propose a new DFO algorithm that is proved to be near optimal for the class of strongly convex objective functions. A distinctive feature of the algorithm is that it uses only Boolean-valued function comparisons, rather than function evaluations. This makes the algorithm useful in an even wider range of applications, such as optimization based on paired comparisons from human subjects, for example. We also show that regardless of whether DFO is based on noisy function evaluations or Boolean-valued function comparisons, the convergence rate is the same. 1
6 0.75575507 76 nips-2012-Communication-Efficient Algorithms for Statistical Optimization
7 0.69479012 161 nips-2012-Interpreting prediction markets: a stochastic approach
8 0.68376273 217 nips-2012-Mixability in Statistical Learning
9 0.63801426 241 nips-2012-No-Regret Algorithms for Unconstrained Online Convex Optimization
10 0.62528193 275 nips-2012-Privacy Aware Learning
11 0.60525513 16 nips-2012-A Polynomial-time Form of Robust Regression
12 0.58506542 29 nips-2012-Accelerated Training for Matrix-norm Regularization: A Boosting Approach
13 0.57648623 131 nips-2012-Feature Clustering for Accelerating Parallel Coordinate Descent
14 0.55867994 110 nips-2012-Efficient Reinforcement Learning for High Dimensional Linear Quadratic Systems
15 0.54024988 301 nips-2012-Scaled Gradients on Grassmann Manifolds for Matrix Completion
16 0.53758591 139 nips-2012-Fused sparsity and robust estimation for linear models with unknown variance
17 0.53357786 206 nips-2012-Majorization for CRFs and Latent Likelihoods
18 0.52748746 174 nips-2012-Learning Halfspaces with the Zero-One Loss: Time-Accuracy Tradeoffs
19 0.51989955 216 nips-2012-Mirror Descent Meets Fixed Share (and feels no regret)
20 0.51859796 27 nips-2012-A quasi-Newton proximal splitting method
topicId topicWeight
[(0, 0.027), (21, 0.022), (36, 0.014), (38, 0.192), (42, 0.394), (54, 0.014), (74, 0.027), (76, 0.115), (80, 0.081), (92, 0.039)]
simIndex simValue paperId paperTitle
1 0.8718912 289 nips-2012-Recognizing Activities by Attribute Dynamics
Author: Weixin Li, Nuno Vasconcelos
Abstract: In this work, we consider the problem of modeling the dynamic structure of human activities in the attributes space. A video sequence is first represented in a semantic feature space, where each feature encodes the probability of occurrence of an activity attribute at a given time. A generative model, denoted the binary dynamic system (BDS), is proposed to learn both the distribution and dynamics of different activities in this space. The BDS is a non-linear dynamic system, which extends both the binary principal component analysis (PCA) and classical linear dynamic systems (LDS), by combining binary observation variables with a hidden Gauss-Markov state process. In this way, it integrates the representation power of semantic modeling with the ability of dynamic systems to capture the temporal structure of time-varying processes. An algorithm for learning BDS parameters, inspired by a popular LDS learning method from dynamic textures, is proposed. A similarity measure between BDSs, which generalizes the Binet-Cauchy kernel for LDS, is then introduced and used to design activity classifiers. The proposed method is shown to outperform similar classifiers derived from the kernel dynamic system (KDS) and state-of-the-art approaches for dynamics-based or attribute-based action recognition.
same-paper 2 0.86147028 134 nips-2012-Finite Sample Convergence Rates of Zero-Order Stochastic Optimization Methods
Author: Andre Wibisono, Martin J. Wainwright, Michael I. Jordan, John C. Duchi
Abstract: We consider derivative-free algorithms for stochastic optimization problems that use only noisy function values rather than gradients, analyzing their finite-sample convergence rates. We show that if pairs of function values are available, algorithms that use gradient estimates based on random perturbations suffer a factor of at most √d in convergence rate over traditional stochastic gradient methods, where d is the problem dimension. We complement our algorithmic development with information-theoretic lower bounds on the minimax convergence rate of such problems, which show that our bounds are sharp with respect to all problem-dependent quantities: they cannot be improved by more than constant factors.
3 0.86078775 242 nips-2012-Non-linear Metric Learning
Author: Dor Kedem, Stephen Tyree, Fei Sha, Gert R. Lanckriet, Kilian Q. Weinberger
Abstract: In this paper, we introduce two novel metric learning algorithms, χ2 -LMNN and GB-LMNN, which are explicitly designed to be non-linear and easy-to-use. The two approaches achieve this goal in fundamentally different ways: χ2 -LMNN inherits the computational benefits of a linear mapping from linear metric learning, but uses a non-linear χ2 -distance to explicitly capture similarities within histogram data sets; GB-LMNN applies gradient-boosting to learn non-linear mappings directly in function space and takes advantage of this approach’s robustness, speed, parallelizability and insensitivity towards the single additional hyperparameter. On various benchmark data sets, we demonstrate these methods not only match the current state-of-the-art in terms of kNN classification error, but in the case of χ2 -LMNN, obtain best results in 19 out of 20 learning settings. 1
4 0.85166979 243 nips-2012-Non-parametric Approximate Dynamic Programming via the Kernel Method
Author: Nikhil Bhat, Vivek Farias, Ciamac C. Moallemi
Abstract: This paper presents a novel non-parametric approximate dynamic programming (ADP) algorithm that enjoys graceful approximation and sample complexity guarantees. In particular, we establish both theoretically and computationally that our proposal can serve as a viable alternative to state-of-the-art parametric ADP algorithms, freeing the designer from carefully specifying an approximation architecture. We accomplish this by developing a kernel-based mathematical program for ADP. Via a computational study on a controlled queueing network, we show that our procedure is competitive with parametric ADP approaches. 1
5 0.85051763 219 nips-2012-Modelling Reciprocating Relationships with Hawkes Processes
Author: Charles Blundell, Jeff Beck, Katherine A. Heller
Abstract: We present a Bayesian nonparametric model that discovers implicit social structure from interaction time-series data. Social groups are often formed implicitly, through actions among members of groups. Yet many models of social networks use explicitly declared relationships to infer social structure. We consider a particular class of Hawkes processes, a doubly stochastic point process, that is able to model reciprocity between groups of individuals. We then extend the Infinite Relational Model by using these reciprocating Hawkes processes to parameterise its edges, making events associated with edges co-dependent through time. Our model outperforms general, unstructured Hawkes processes as well as structured Poisson process-based models at predicting verbal and email turn-taking, and military conflicts among nations. 1
6 0.70297199 98 nips-2012-Dimensionality Dependent PAC-Bayes Margin Bound
7 0.68474603 285 nips-2012-Query Complexity of Derivative-Free Optimization
8 0.6719408 275 nips-2012-Privacy Aware Learning
9 0.669635 263 nips-2012-Optimal Regularized Dual Averaging Methods for Stochastic Optimization
10 0.65716404 162 nips-2012-Inverse Reinforcement Learning through Structured Classification
11 0.65360469 325 nips-2012-Stochastic optimization and sparse statistical recovery: Optimal algorithms for high dimensions
12 0.64507473 358 nips-2012-Value Pursuit Iteration
13 0.64491421 15 nips-2012-A Polylog Pivot Steps Simplex Algorithm for Classification
14 0.64237732 281 nips-2012-Provable ICA with Unknown Gaussian Noise, with Implications for Gaussian Mixtures and Autoencoders
15 0.640917 265 nips-2012-Parametric Local Metric Learning for Nearest Neighbor Classification
16 0.63043404 226 nips-2012-Multiclass Learning Approaches: A Theoretical Comparison with Implications
17 0.6294536 217 nips-2012-Mixability in Statistical Learning
18 0.62768346 292 nips-2012-Regularized Off-Policy TD-Learning
19 0.62480187 324 nips-2012-Stochastic Gradient Descent with Only One Projection
20 0.62137884 73 nips-2012-Coding efficiency and detectability of rate fluctuations with non-Poisson neuronal firing