nips nips2013 nips2013-137 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Josip Djolonga, Andreas Krause, Volkan Cevher
Abstract: Many applications in machine learning require optimizing unknown functions defined over a high-dimensional space from noisy samples that are expensive to obtain. We address this notoriously hard challenge, under the assumptions that the function varies only along some low-dimensional subspace and is smooth (i.e., it has a low norm in a Reproducing Kernel Hilbert Space). In particular, we present the SI-BO algorithm, which leverages recent low-rank matrix recovery techniques to learn the underlying subspace of the unknown function and applies Gaussian Process Upper Confidence sampling for optimization of the function. We carefully calibrate the exploration–exploitation tradeoff by allocating the sampling budget to subspace estimation and function optimization, and obtain the first subexponential cumulative regret bounds and convergence rates for Bayesian optimization in high dimensions under noisy observations. Numerical results demonstrate the effectiveness of our approach in difficult scenarios. 1
Reference: text
sentIndex sentText sentNum sentScore
1 Abstract Many applications in machine learning require optimizing unknown functions defined over a high-dimensional space from noisy samples that are expensive to obtain. [sent-5, score-0.178]
2 We address this notoriously hard challenge, under the assumptions that the function varies only along some low-dimensional subspace and is smooth (i.e., it has a low norm in a Reproducing Kernel Hilbert Space). [sent-6, score-0.408]
3 In particular, we present the SI-BO algorithm, which leverages recent low-rank matrix recovery techniques to learn the underlying subspace of the unknown function and applies Gaussian Process Upper Confidence sampling for optimization of the function. [sent-9, score-0.522]
4 We carefully calibrate the exploration–exploitation tradeoff by allocating the sampling budget to subspace estimation and function optimization, and obtain the first subexponential cumulative regret bounds and convergence rates for Bayesian optimization in high-dimensions under noisy observations. [sent-10, score-1.232]
5 1 Introduction The optimization of non-linear functions whose evaluation may be noisy and expensive is a challenge that has important applications in sciences and engineering. [sent-12, score-0.216]
6 One approach to this notoriously hard problem takes a Bayesian perspective, which uses the predictive uncertainty in order to trade exploration (gathering data for reducing model uncertainty) and exploitation (focusing sampling near likely optima), and is often called Bayesian Optimization (BO). [sent-13, score-0.289]
7 For example, the cost function of neural networks effectively varies only along a few dimensions [2]. [sent-19, score-0.091]
8 To this end, we propose an algorithm that learns a low dimensional, not necessarily axis-aligned, subspace and then applies Bayesian optimization on this estimated subspace. [sent-21, score-0.416]
9 In particular, our SI-BO approach combines low-rank matrix recovery with Gaussian Process Upper Confidence Bound sampling in a carefully calibrated manner. [sent-22, score-0.22]
10 We theoretically analyze its performance, and prove bounds on its cumulative regret. [sent-23, score-0.213]
11 To the best of our knowledge, we prove the first subexponential bounds for Bayesian optimization in high dimensions under noisy observations. [sent-24, score-0.369]
12 In contrast to existing approaches, which have an exponential dependence on the ambient dimension, our bounds have in fact polynomial dependence on the dimension. [sent-25, score-0.14]
13 Moreover, our performance guarantees depend explicitly on what we could have achieved if we had known the subspace in advance. [sent-26, score-0.327]
14 Exploration–exploitation tradeoffs were originally studied in the context of finite multi-armed bandits [8]. [sent-28, score-0.092]
15 A more recent algorithm that enjoys theoretical bounds for functions sampled from a Gaussian Process (GP), or belonging to some Reproducing Kernel Hilbert Space (RKHS), is GP-UCB [12]. [sent-30, score-0.151]
16 The use of GPs to negotiate exploration–exploitation tradeoffs originated in the areas of response surface and Bayesian optimization, for which there are a number of approaches (cf. [sent-31, score-0.174]
17 Bandit algorithms that exploit low-dimensional structure of the function appeared first for the linear setting, where under sparsity assumptions one can obtain bounds, which depend only weakly on the ambient dimension [16, 17]. [sent-33, score-0.094]
18 They provide bounds on the simple regret under noiseless observations, while we also analyze the cumulative regret and allow noisy observations. [sent-36, score-1.039]
19 Also, unless the low-dimensional space is of dimension 1, our bounds on the simple regret improve on theirs. [sent-37, score-0.451]
20 In [7] the authors approximate functions that live on low-dimensional subspaces using low-rank recovery and analysis techniques. [sent-38, score-0.163]
21 While providing uniform approximation guarantees, their approach is not tailored towards exploration–exploitation tradeoffs, and does not achieve sublinear cumulative regret. [sent-39, score-0.219]
22 Our specific contributions in this paper can be summarized as follows: • We introduce the SI-BO algorithm for Bayesian bandit optimization in high dimensions, admitting a large family of kernel functions. [sent-41, score-0.215]
23 Our algorithm is a natural but non-trivial fusion of modern low-rank subspace approximation tools with GP optimization methods. [sent-42, score-0.421]
24 • We derive theoretical guarantees on SI-BO’s cumulative and simple regret in high dimensions with noise. [sent-43, score-0.479]
25 To the best of our knowledge, these are the first theoretical results on the sample complexity and regret rates that are subexponential in the ambient dimension. [sent-44, score-0.551]
26 As a performance metric, we consider the regret, which tells us how much better we could have done in round t had we known x∗, or formally rt = f(x∗) − f(xt). [sent-52, score-0.108]
27 Hence, a natural quantity to consider is the cumulative regret, defined as RT = Σ_{t=1}^T rt. [sent-55, score-0.587]
28 One can also consider the simple regret, defined as ST = min_{t=1,…,T} rt, measuring the quality of the best solution found so far. [sent-56, score-0.108]
29 We will give bounds on the more challenging notion of cumulative regret, which also bounds the simple regret via ST ≤ RT /T . [sent-57, score-0.633]
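As a small illustration of these definitions, the sketch below computes the instantaneous, cumulative, and running simple regret from a sequence of observed values f(xt); the function name and interface are our own, not from the paper.

```python
import numpy as np

def regrets(f_star, f_values):
    """Instantaneous, cumulative and running simple regret from the observed values f(x_t)."""
    r = f_star - np.asarray(f_values, dtype=float)   # r_t = f(x*) - f(x_t)
    R = np.cumsum(r)                                  # R_T = sum of r_t for t <= T
    S = np.minimum.accumulate(r)                      # S_T = min of r_t for t <= T
    return r, R, S                                    # note: S[-1] <= R[-1] / len(r)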
30 Formally, we suppose that there exists some function g : R^k → [0, 1] and a matrix A ∈ R^{k×d} with orthogonal rows so that f(x) = g(Ax). [sent-65, score-0.106]
31 To be able to recover the subspace we also need the condition that g has Lipschitz continuous second order derivatives and a full rank Hessian at 0, which is satisfied for many functions [7]. [sent-72, score-0.434]
32 In addition to the low-dimensional subspace assumption, we also assume that g is smooth. [sent-74, score-0.327]
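To make this setting concrete, here is a minimal sketch (in Python, our choice) of such an objective: a smooth g on R^k composed with a random matrix A that has orthonormal rows. The particular g, the dimensions, and the seed are illustrative assumptions, not taken from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
d, k = 100, 2                                   # ambient and effective dimension (illustrative)

# A in R^{k x d} with orthonormal rows: QR-factorize a random d x k matrix and transpose Q
A = np.linalg.qr(rng.standard_normal((d, k)))[0].T

def g(z):
    """A smooth low-dimensional function with values in (0, 1]."""
    return float(np.exp(-np.sum(z ** 2)))

def f(x):
    """The high-dimensional objective; in the bandit setting it is only available via a noisy oracle."""
    return g(A @ x)
```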
33 for t ← 1 to T − mX(mΦ + 1) do: zt ← arg max_z µt(z) + βt^{1/2} σt(z); yt ← f(Â^T zt) + noise. [sent-78, score-0.12]
34 The RKHS for some positive semidefinite kernel κ(·, ·) can be constructed by completing the set of functions of the form Σ_{i=1}^n αi κ(xi, ·) under a suitable inner product. [sent-81, score-0.116]
35 We wish to maximize f : B^d(1 + ε) → [0, 1], where f(x) = g(Ax) for some matrix A ∈ R^{k×d} with orthogonal rows and g belongs to some RKHS Hκ. [sent-86, score-0.106]
36 The kernel κ is isotropic (κ(x, x′) depends only on ‖x − x′‖2), and κ is continuous, integrable and has a Fourier transform Fκ that is isotropic and radially non-increasing. [sent-88, score-0.287]
37 The oracle noise is Gaussian with zero mean and known variance σ². [sent-94, score-0.099]
38 3 The SI-BO Algorithm The SI-BO algorithm performs two separate exploration and exploitation stages: (1) subspace identification (SI), i.e. [sent-95, score-0.542]
39 estimating the subspace on which the function is supported, and then (2) Bayesian optimization (BO), in order to optimize the function on the learned subspace. [sent-97, score-0.432]
40 A key challenge here is to carefully allocate samples between these phases. [sent-98, score-0.17]
41 Given the (noisy) oracle for f, we first evaluate the function at several suitably chosen points and then use a low-rank recovery algorithm to compute a matrix Â that spans a subspace well aligned with the one generated by the true matrix A. [sent-101, score-0.582]
42 We learn A using the approach from [7], which reduces the learning problem to that of low rank matrix recovery. [sent-106, score-0.111]
43 Thus, sampling the finite difference f(ξ + εϕ) − f(ξ) provides (up to the curvature error E(ξ, ε, ϕ), and sampling noise) information about the one-dimensional subspace spanned by A^T ∇g(Aξ). [sent-117, score-0.391]
44 Further, to infer the full k-dimensional subspace A, we need to consider at least mX ≥ k centers. [sent-119, score-0.327]
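The sketch below shows one way such measurements could be collected: random centers on the unit sphere and random ±1/√mΦ directions, following the sampling scheme described here; the helper names, the noiseless oracle, and the loop structure are our own simplifications.

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_sphere_centers(m_x, d):
    """m_x centers drawn uniformly at random from the unit sphere S^{d-1}."""
    C = rng.standard_normal((m_x, d))
    return C / np.linalg.norm(C, axis=1, keepdims=True)

def finite_difference_measurements(f, centers, m_phi, eps):
    """Directional finite differences Y[i, j] ~ phi_i^T grad f(xi_j), using m_x * (m_phi + 1) queries.

    Since grad f(xi) = A^T grad g(A xi), the columns of the noiseless measurement
    matrix all lie in the k-dimensional row space of A.
    """
    d = centers.shape[1]
    Phi = rng.choice([-1.0, 1.0], size=(m_phi, d)) / np.sqrt(m_phi)
    Y = np.empty((m_phi, len(centers)))
    for j, xi in enumerate(centers):
        f_xi = f(xi)                      # one query per center ...
        for i, phi in enumerate(Phi):     # ... plus m_phi queries for the perturbed points
            Y[i, j] = (f(xi + eps * phi) - f_xi) / eps
    return Phi, Y
```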
45 For concreteness, we choose the Dantzig Selector (DS, [24]), which recovers low-rank matrices via minimize_M ‖M‖∗ subject to ‖A∗(y − A(M))‖ ≤ λ, (2) where y − A(M) is the measurement residual, ‖·‖∗ is the nuclear norm and ‖·‖ is the spectral norm. [sent-123, score-0.13]
46 The DS will successfully recover a matrix X̂ close to the true solution in the Frobenius norm, and moreover this distance decreases linearly with λ. [sent-124, score-0.093]
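One way to prototype this recovery step is with a generic convex solver; the sketch below states the nuclear-norm Dantzig Selector for measurements Y ≈ Φ X using cvxpy (our tooling choice, not the paper's) and then reads the estimated subspace off the top left singular vectors of the recovered matrix.

```python
import numpy as np
import cvxpy as cp

def dantzig_selector_subspace(Y, Phi, lam, k):
    """Recover a low-rank M from Y ~ Phi @ M, then return a (k, d) basis of its column space.

    Program (2): minimize ||M||_*  subject to  sigma_max( Phi^T (Y - Phi @ M) ) <= lam.
    """
    d, m_x = Phi.shape[1], Y.shape[1]
    M = cp.Variable((d, m_x))
    residual = Phi.T @ (Y - Phi @ M)
    problem = cp.Problem(cp.Minimize(cp.normNuc(M)), [cp.sigma_max(residual) <= lam])
    problem.solve()
    U, _, _ = np.linalg.svd(M.value, full_matrices=False)
    return U[:, :k].T          # rows span the estimated subspace A_hat
```

With Phi, Y from the previous sketch, a call such as A_hat = dantzig_selector_subspace(Y, Phi, lam=1e-3, k=2) would give a (k, d) estimate of the hidden subspace; the value of λ here is purely illustrative.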
47 As shown in [7], choosing the centers C uniformly at random from the unit sphere S^{d−1}, choosing each direction vector uniformly at random from {±1/√mΦ}^k, and—in the case of noisy observations, resampling f repeatedly—suffices to obtain an accurate X̂ w.h.p. [sent-125, score-0.098]
48 Because the DS will find a matrix X̂ close to X, due to a result by Wedin [25] we know that the learned subspace will be close to the true one. [sent-131, score-0.364]
49 Concretely, we use GP-UCB [12], because it exhibits state-of-the-art empirical performance, and enjoys strong theoretical bounds for the cumulative regret. [sent-134, score-0.256]
50 In order to trade exploration and exploitation, the GP-UCB algorithm computes, for each point z, a score that combines the predictive mean that we have inferred for that point with its variance, which quantifies the uncertainty in our estimate. [sent-137, score-0.122]
51 Here, B is an upper bound on the squared RKHS norm of the function that we optimize, δ is an upper bound on the failure probability and γt depends on the kernel [12]: cf. [sent-139, score-0.141]
52 The algorithm then greedily maximizes the UCB score above. [sent-141, score-0.408]
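A minimal version of this acquisition loop on a fixed candidate grid might look as follows; we use scikit-learn's GP regressor, and the βt schedule is left as a user-supplied callable, which is a simplification of the calibrated choice from [12].

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import Matern

def gp_ucb(objective, Z, T, beta, noise_var=1e-2):
    """Minimal GP-UCB loop: Z is an (n, k) array of candidate points, beta(t) the confidence scaling."""
    gp = GaussianProcessRegressor(kernel=Matern(length_scale=0.5, nu=2.5), alpha=noise_var)
    zs, ys = [], []
    for t in range(1, T + 1):
        if zs:
            gp.fit(np.array(zs), np.array(ys))
        mu, sigma = gp.predict(Z, return_std=True)       # the GP prior is used before any data
        z = Z[np.argmax(mu + np.sqrt(beta(t)) * sigma)]  # greedy maximization of the UCB score
        zs.append(z)
        ys.append(objective(z))
    return np.array(zs), np.array(ys)
```

On the learned subspace the objective would be z ↦ f(Â^T z), e.g. objective = lambda z: f(A_hat.T @ z) with the names from the earlier sketches.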
53 The last ingredient that we need is theory on how to pick σ̂ so that it bounds the noise during the execution of GP-UCB w.h.p. [sent-145, score-0.17]
55 Figure 1: A 2-dimensional function f(x, y) varying along a 1-dimensional subspace and its projections on different subspaces. [sent-156, score-0.327]
56 This bound, intuitively, relates the approximation quality λ of the subspace to the quantities mΦ , mX as well as the step size ε. [sent-161, score-0.364]
57 A crucial choice in our algorithm is how to allocate samples (by choosing mΦ and mX appropriately) to the tasks of subspace learning and function optimization. [sent-163, score-0.42]
58 We now analyze both phases, and determine how to split the queries in order to optimize the cumulative regret bounds. [sent-164, score-0.527]
59 Let us first consider the regret incurred in the second phase, in the ideal (but unrealistic) case that the subspace is estimated exactly (i.e., Â = A). [sent-165, score-0.721]
60 In [12] sublinear bounds for γT have been computed for several popular kernels. [sent-171, score-0.123]
61 What happens if the subspace Â is estimated incorrectly? [sent-178, score-0.327]
62 Moreover, since we optimize the considered f̂ instead of f, additional regret of up to η = ‖f − f̂‖∞ per sample may be incurred. [sent-184, score-0.394]
63 We now state a general result that formalizes these insights by bounding the cumulative regret in terms of the samples allocated to subspace learning, and the subspace approximation quality. [sent-187, score-1.219]
64 Lemma 1 Assume that we spend 0 < n ≤ T samples to learn the subspace such that ‖f − f̂‖∞ ≤ η, ‖ĝ‖ ≤ B and the error is bounded by σ̂, each w.h.p. [sent-188, score-0.376]
65 If we run the GP-UCB algorithm for the remaining T − n steps with the suggested σ̂ and δ/4, then the following bound on the cumulative regret holds w.h.p. [sent-191, score-0.479]
66 where RUCB(T, ĝ, κ) is the regret of GP-UCB when run for T steps using ĝ and kernel κ (see footnote 5). [sent-206, score-0.428]
67 Lemma 1 breaks down the regret in terms of the approximation error incurred by subspace misestimation, and the optimization error incurred by the resulting increased complexity ‖ĝ‖²_{Hκ} ≤ B̂. [sent-207, score-0.539]
68 We now analyze these effects, and then prove our main regret bounds. [sent-208, score-0.343]
69 A notion that will prove to be very helpful for analyzing both the approximation precision η and the norm of ĝ is the set of angles between the subspaces defined by A and Â. [sent-210, score-0.188]
70 We define the vector of cosines between the spanned subspaces, cos Θ(A, Â), to be equal to the singular values of AÂ^T. [sent-213, score-0.241]
71 Analogously, sin Θ(A, Â)_i = (1 − cos Θ(A, Â)_i^2)^{1/2}. [sent-214, score-0.09]
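These angles are straightforward to compute numerically; the sketch below returns the cosines and sines of the principal angles between the row spaces of A and Â (both assumed to have orthonormal rows), with our own naming.

```python
import numpy as np

def principal_angle_cos_sin(A, A_hat):
    """cos Θ(A, A_hat) as the singular values of A @ A_hat^T, and the corresponding sines."""
    cos_theta = np.linalg.svd(A @ A_hat.T, compute_uv=False)
    sin_theta = np.sqrt(np.clip(1.0 - cos_theta ** 2, 0.0, None))
    return cos_theta, sin_theta
```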
72 Lemma 3 If g ∈ Hκ for a kernel that is isotropic with a radially non-increasing Fourier transform and ĝ(x) = g(AÂ^T x) for some A, Â with orthogonal rows, then for C = C2 √(2k) (1 + ε), [sent-220, score-0.262]
73 ‖f − f̂‖∞ ≤ C ‖sin Θ(A, Â)‖2 and ‖ĝ‖²_{Hκ} ≤ |prod cos Θ(A, Â)|^{−1} ‖g‖²_{Hκ}. (5) Here, we use the notation prod x = Π_{i=1}^d xi to denote the product of the elements of a vector. [sent-221, score-0.184]
74 By decreasing the angles we tackle both issues: the approximation error η = ‖f − f̂‖∞ is reduced and the norm of ĝ gets closer to the one of g. [sent-222, score-0.125]
75 Hence, ĝ will not be in the RKHS only if M̂ is rank deficient, as dimensions are collapsed. [sent-225, score-0.094]
76 We now present our main bounds on the cumulative regret. [sent-227, score-0.213]
77 As it turns out, subspace learning is substantially harder in the case of noisy observations. [sent-230, score-0.395]
78 The following theorem guarantees that in this setting, for non-linear kernels, we have a regret dominated by GP-UCB, which is of order Ω∗(√(T γT)), as it is usually exponential in k. [sent-236, score-0.373]
79 Theorem 4 If the observations are noiseless we can pick mX = O(kd log 1/δ), ε = k2. [sent-237, score-0.432]
80 5 Because the noise parameter σ̂ depends on T, we have to slightly change the bounds from [12], as we have a term of order O(log T + log(1/δ)); cf. [sent-239, score-0.125]
81 In the noiseless case, it suffices to increase the number of directions mΦ and decrease the step size ε in estimating the finite differences. [sent-245, score-0.104]
82 Nevertheless, we are able to obtain cumulative regret bounds (and thereby the first convergence guarantees and rates) for this setting, which only polynomially depend on d. [sent-249, score-0.588]
83 Unfortunately, the dependence on T is now weaker than that in the noiseless setting (Theorem 4), and the regret due to the subspace learning might dominate that of GP-UCB. [sent-250, score-0.742]
84 Underfitting k leads to additional errors that are well-controlled in low-rank subspace estimation [24]. [sent-262, score-0.327]
85 5 Experiments The main intent of our experiments is to provide a proof of concept, confirming that SI-BO not only provides the first subexponential regret bounds in theory, but also empirically obtains low average regret for Bayesian optimization in high dimensions. [sent-264, score-0.89]
86 To optimize the UCB score we sampled on a grid on the low dimensional subspace. [sent-271, score-0.149]
87 We generate random 2-dimensional samples from a GP with a Matérn kernel with smoothness parameter ν = 5/2, length scale 1/2 and signal variance σf² = 1. [sent-276, score-0.134]
88 The samples are “hidden” in a random 2-dimensional subspace in 100 dimensions. [sent-277, score-0.376]
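This synthetic setup can be reproduced approximately as follows: draw one GP sample with a Matérn(ν = 5/2, length scale 1/2) kernel on a finite grid in [−1, 1]^2 and hide it in a random 2-dimensional subspace of R^100. The grid resolution, the nearest-grid-point lookup, and the seed are our simplifications.

```python
import numpy as np
from sklearn.gaussian_process.kernels import Matern

rng = np.random.default_rng(0)
d, k = 100, 2

# one GP sample on a grid, Matern(nu=5/2, length_scale=1/2), unit signal variance
axis = np.linspace(-1.0, 1.0, 30)
grid = np.stack(np.meshgrid(axis, axis), axis=-1).reshape(-1, k)
K = Matern(length_scale=0.5, nu=2.5)(grid)
g_sample = rng.multivariate_normal(np.zeros(len(grid)), K + 1e-8 * np.eye(len(grid)))

# hide the 2-dimensional sample in a random subspace of R^100
A = np.linalg.qr(rng.standard_normal((d, k)))[0].T

def f_hidden(x):
    """Nearest-grid-point surrogate for the hidden low-dimensional GP sample."""
    z = A @ x
    return g_sample[np.argmin(np.sum((grid - z) ** 2, axis=1))]
```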
89 We show both the average regret and simple regret (i.e., the quality of the best solution found so far). [sent-286, score-0.686]
90 We find that although SI-BO spends a total of mX (mΦ + 1) samples to learn the subspace and thus incurs [sent-289, score-0.376]
91 Our SI-BO approach outperforms the natural benchmarks in terms of cumulative regret, and competes well with the unrealistic ExactUCB approach that knows the true subspace A. [sent-314, score-0.497]
92 much regret during this phase, learning the subspace pays off, both for average and simple regret, and SI-BO ultimately outperforms the baseline methods on both data sets. [sent-315, score-0.67]
93 This demonstrates the value of accurate subspace estimation for Bayesian optimization in high dimensions. [sent-316, score-0.384]
94 6 Conclusion We have addressed the problem of optimizing high-dimensional functions from noisy and expensive samples. [sent-325, score-0.164]
95 We presented the SI-BO algorithm, which tackles this challenge under the assumption that the objective varies only along a low-dimensional subspace, and has low norm in a suitable RKHS. [sent-326, score-0.224]
96 By fusing modern techniques for low-rank matrix recovery and Bayesian bandit optimization in a carefully calibrated manner, it addresses the exploration–exploitation dilemma, and enjoys cumulative regret bounds that depend only polynomially on the ambient dimension. [sent-327, score-1.009]
97 Information-theoretic regret bounds for Gaussian process optimization in the bandit setting. [sent-402, score-0.585]
98 A tutorial on Bayesian optimization of expensive cost functions, with application to active user modeling and hierarchical reinforcement learning. [sent-409, score-0.154]
99 Joint optimization and variable selection of high-dimensional Gaussian processes. [sent-437, score-0.092]
100 Tight oracle inequalities for low-rank matrix recovery from a minimal number of noisy random measurements. [sent-478, score-0.225]
wordName wordTfidf (topN-words)
[('ucb', 0.374), ('regret', 0.343), ('subspace', 0.327), ('mx', 0.315), ('rkhs', 0.174), ('aat', 0.16), ('cumulative', 0.136), ('exploitation', 0.127), ('gab', 0.115), ('randomh', 0.115), ('randoms', 0.115), ('rucb', 0.115), ('subexponential', 0.115), ('rt', 0.108), ('bo', 0.105), ('snf', 0.101), ('ax', 0.1), ('cos', 0.09), ('gp', 0.089), ('exploration', 0.088), ('kernel', 0.085), ('bounds', 0.077), ('radially', 0.076), ('bandit', 0.073), ('noiseless', 0.072), ('recovery', 0.069), ('noisy', 0.068), ('bayesian', 0.067), ('ambient', 0.063), ('isotropic', 0.063), ('subspaces', 0.063), ('bd', 0.063), ('ds', 0.06), ('tyagi', 0.057), ('xds', 0.057), ('optimization', 0.057), ('norm', 0.056), ('mat', 0.056), ('dimensions', 0.052), ('incurred', 0.051), ('oracle', 0.051), ('cosines', 0.051), ('gait', 0.051), ('ys', 0.05), ('rbf', 0.049), ('samples', 0.049), ('noise', 0.048), ('optimize', 0.048), ('tradeoffs', 0.047), ('carefully', 0.047), ('prod', 0.047), ('sublinear', 0.046), ('pick', 0.045), ('bandits', 0.045), ('krause', 0.045), ('allocate', 0.044), ('cevher', 0.044), ('enjoys', 0.043), ('notoriously', 0.042), ('rank', 0.042), ('resample', 0.04), ('selector', 0.04), ('varies', 0.039), ('dantzig', 0.038), ('orthogonal', 0.038), ('approximation', 0.037), ('rk', 0.037), ('matrix', 0.037), ('eth', 0.037), ('singular', 0.037), ('zt', 0.036), ('lipschitz', 0.035), ('hilbert', 0.035), ('calibrated', 0.035), ('gaussian', 0.035), ('dimensional', 0.035), ('compact', 0.035), ('score', 0.034), ('derivatives', 0.034), ('unrealistic', 0.034), ('polynomially', 0.032), ('angles', 0.032), ('sampling', 0.032), ('directions', 0.032), ('low', 0.032), ('sd', 0.032), ('aligned', 0.031), ('erc', 0.031), ('dimension', 0.031), ('functions', 0.031), ('rows', 0.031), ('runs', 0.03), ('rates', 0.03), ('analogously', 0.03), ('kernels', 0.03), ('challenge', 0.03), ('expensive', 0.03), ('suitably', 0.03), ('acknowledges', 0.03), ('centers', 0.03)]
simIndex simValue paperId paperTitle
same-paper 1 1.0000007 137 nips-2013-High-Dimensional Gaussian Process Bandits
Author: Josip Djolonga, Andreas Krause, Volkan Cevher
Abstract: Many applications in machine learning require optimizing unknown functions defined over a high-dimensional space from noisy samples that are expensive to obtain. We address this notoriously hard challenge, under the assumptions that the function varies only along some low-dimensional subspace and is smooth (i.e., it has a low norm in a Reproducing Kernel Hilbert Space). In particular, we present the SI-BO algorithm, which leverages recent low-rank matrix recovery techniques to learn the underlying subspace of the unknown function and applies Gaussian Process Upper Confidence sampling for optimization of the function. We carefully calibrate the exploration–exploitation tradeoff by allocating the sampling budget to subspace estimation and function optimization, and obtain the first subexponential cumulative regret bounds and convergence rates for Bayesian optimization in high dimensions under noisy observations. Numerical results demonstrate the effectiveness of our approach in difficult scenarios. 1
2 0.26859877 106 nips-2013-Eluder Dimension and the Sample Complexity of Optimistic Exploration
Author: Dan Russo, Benjamin Van Roy
Abstract: This paper considers the sample complexity of the multi-armed bandit with dependencies among the arms. Some of the most successful algorithms for this problem use the principle of optimism in the face of uncertainty to guide exploration. The clearest example of this is the class of upper confidence bound (UCB) algorithms, but recent work has shown that a simple posterior sampling algorithm, sometimes called Thompson sampling, can be analyzed in the same manner as optimistic approaches. In this paper, we develop a regret bound that holds for both classes of algorithms. This bound applies broadly and can be specialized to many model classes. It depends on a new notion we refer to as the eluder dimension, which measures the degree of dependence among action rewards. Compared to UCB algorithm regret bounds for specific model classes, our general bound matches the best available for linear models and is stronger than the best available for generalized linear models. 1
3 0.20522219 292 nips-2013-Sequential Transfer in Multi-armed Bandit with Finite Set of Models
Author: Mohammad Gheshlaghi azar, Alessandro Lazaric, Emma Brunskill
Abstract: Learning from prior tasks and transferring that experience to improve future performance is critical for building lifelong learning agents. Although results in supervised and reinforcement learning show that transfer may significantly improve the learning performance, most of the literature on transfer is focused on batch learning tasks. In this paper we study the problem of sequential transfer in online learning, notably in the multi–armed bandit framework, where the objective is to minimize the total regret over a sequence of tasks by transferring knowledge from prior tasks. We introduce a novel bandit algorithm based on a method-of-moments approach for estimating the possible tasks and derive regret bounds for it. 1
4 0.1667693 230 nips-2013-Online Learning with Costly Features and Labels
Author: Navid Zolghadr, Gabor Bartok, Russell Greiner, András György, Csaba Szepesvari
Abstract: This paper introduces the online probing problem: In each round, the learner is able to purchase the values of a subset of feature values. After the learner uses this information to come up with a prediction for the given round, he then has the option of paying to see the loss function that he is evaluated against. Either way, the learner pays for both the errors of his predictions and also whatever he chooses to observe, including the cost of observing the loss function for the given round and the cost of the observed features. We consider two variations of this problem, depending on whether the learner can observe the label for free or not. We provide algorithms and upper and lower bounds on the regret for both variants. We show that a positive cost for observing the label significantly increases the regret of the problem. 1
5 0.16324966 259 nips-2013-Provable Subspace Clustering: When LRR meets SSC
Author: Yu-Xiang Wang, Huan Xu, Chenlei Leng
Abstract: Sparse Subspace Clustering (SSC) and Low-Rank Representation (LRR) are both considered as the state-of-the-art methods for subspace clustering. The two methods are fundamentally similar in that both are convex optimizations exploiting the intuition of “Self-Expressiveness”. The main difference is that SSC minimizes the vector 1 norm of the representation matrix to induce sparsity while LRR minimizes nuclear norm (aka trace norm) to promote a low-rank structure. Because the representation matrix is often simultaneously sparse and low-rank, we propose a new algorithm, termed Low-Rank Sparse Subspace Clustering (LRSSC), by combining SSC and LRR, and develops theoretical guarantees of when the algorithm succeeds. The results reveal interesting insights into the strength and weakness of SSC and LRR and demonstrate how LRSSC can take the advantages of both methods in preserving the “Self-Expressiveness Property” and “Graph Connectivity” at the same time. 1
6 0.1618557 224 nips-2013-On the Sample Complexity of Subspace Learning
7 0.15386623 91 nips-2013-Dirty Statistical Models
8 0.14864537 231 nips-2013-Online Learning with Switching Costs and Other Adaptive Adversaries
9 0.14700234 89 nips-2013-Dimension-Free Exponentiated Gradient
10 0.1446752 325 nips-2013-The Pareto Regret Frontier
11 0.14420085 116 nips-2013-Fantope Projection and Selection: A near-optimal convex relaxation of sparse PCA
12 0.14267036 253 nips-2013-Prior-free and prior-dependent regret bounds for Thompson Sampling
13 0.13878028 227 nips-2013-Online Learning in Markov Decision Processes with Adversarially Chosen Transition Probability Distributions
14 0.13739033 338 nips-2013-Two-Target Algorithms for Infinite-Armed Bandits with Bernoulli Rewards
15 0.13635956 2 nips-2013-(Nearly) Optimal Algorithms for Private Online Learning in Full-information and Bandit Settings
16 0.13250209 286 nips-2013-Robust learning of low-dimensional dynamics from large neural ensembles
17 0.13247848 179 nips-2013-Low-Rank Matrix and Tensor Completion via Adaptive Sampling
18 0.12584588 233 nips-2013-Online Robust PCA via Stochastic Optimization
19 0.11879069 188 nips-2013-Memory Limited, Streaming PCA
20 0.11738504 1 nips-2013-(More) Efficient Reinforcement Learning via Posterior Sampling
topicId topicWeight
[(0, 0.288), (1, -0.068), (2, 0.241), (3, -0.06), (4, -0.05), (5, -0.073), (6, 0.056), (7, -0.044), (8, -0.161), (9, -0.047), (10, -0.08), (11, 0.115), (12, -0.089), (13, 0.058), (14, 0.127), (15, 0.013), (16, -0.067), (17, 0.028), (18, -0.037), (19, -0.136), (20, 0.011), (21, -0.038), (22, -0.04), (23, -0.015), (24, 0.106), (25, 0.041), (26, -0.049), (27, -0.052), (28, 0.011), (29, 0.048), (30, -0.051), (31, -0.019), (32, 0.061), (33, -0.006), (34, -0.06), (35, 0.08), (36, 0.113), (37, -0.013), (38, 0.118), (39, -0.075), (40, 0.031), (41, 0.018), (42, -0.134), (43, 0.056), (44, 0.092), (45, 0.066), (46, 0.042), (47, 0.035), (48, -0.008), (49, -0.043)]
simIndex simValue paperId paperTitle
same-paper 1 0.9470042 137 nips-2013-High-Dimensional Gaussian Process Bandits
Author: Josip Djolonga, Andreas Krause, Volkan Cevher
Abstract: Many applications in machine learning require optimizing unknown functions defined over a high-dimensional space from noisy samples that are expensive to obtain. We address this notoriously hard challenge, under the assumptions that the function varies only along some low-dimensional subspace and is smooth (i.e., it has a low norm in a Reproducible Kernel Hilbert Space). In particular, we present the SI-BO algorithm, which leverages recent low-rank matrix recovery techniques to learn the underlying subspace of the unknown function and applies Gaussian Process Upper Confidence sampling for optimization of the function. We carefully calibrate the exploration–exploitation tradeoff by allocating the sampling budget to subspace estimation and function optimization, and obtain the first subexponential cumulative regret bounds and convergence rates for Bayesian optimization in high-dimensions under noisy observations. Numerical results demonstrate the effectiveness of our approach in difficult scenarios. 1
2 0.66011661 224 nips-2013-On the Sample Complexity of Subspace Learning
Author: Alessandro Rudi, Guillermo D. Canas, Lorenzo Rosasco
Abstract: A large number of algorithms in machine learning, from principal component analysis (PCA), and its non-linear (kernel) extensions, to more recent spectral embedding and support estimation methods, rely on estimating a linear subspace from samples. In this paper we introduce a general formulation of this problem and derive novel learning error estimates. Our results rely on natural assumptions on the spectral properties of the covariance operator associated to the data distribution, and hold for a wide class of metrics between subspaces. As special cases, we discuss sharp error estimates for the reconstruction properties of PCA and spectral support estimation. Key to our analysis is an operator theoretic approach that has broad applicability to spectral learning methods. 1
3 0.65395492 259 nips-2013-Provable Subspace Clustering: When LRR meets SSC
Author: Yu-Xiang Wang, Huan Xu, Chenlei Leng
Abstract: Sparse Subspace Clustering (SSC) and Low-Rank Representation (LRR) are both considered as the state-of-the-art methods for subspace clustering. The two methods are fundamentally similar in that both are convex optimizations exploiting the intuition of “Self-Expressiveness”. The main difference is that SSC minimizes the vector 1 norm of the representation matrix to induce sparsity while LRR minimizes nuclear norm (aka trace norm) to promote a low-rank structure. Because the representation matrix is often simultaneously sparse and low-rank, we propose a new algorithm, termed Low-Rank Sparse Subspace Clustering (LRSSC), by combining SSC and LRR, and develops theoretical guarantees of when the algorithm succeeds. The results reveal interesting insights into the strength and weakness of SSC and LRR and demonstrate how LRSSC can take the advantages of both methods in preserving the “Self-Expressiveness Property” and “Graph Connectivity” at the same time. 1
4 0.64317435 106 nips-2013-Eluder Dimension and the Sample Complexity of Optimistic Exploration
Author: Dan Russo, Benjamin Van Roy
Abstract: This paper considers the sample complexity of the multi-armed bandit with dependencies among the arms. Some of the most successful algorithms for this problem use the principle of optimism in the face of uncertainty to guide exploration. The clearest example of this is the class of upper confidence bound (UCB) algorithms, but recent work has shown that a simple posterior sampling algorithm, sometimes called Thompson sampling, can be analyzed in the same manner as optimistic approaches. In this paper, we develop a regret bound that holds for both classes of algorithms. This bound applies broadly and can be specialized to many model classes. It depends on a new notion we refer to as the eluder dimension, which measures the degree of dependence among action rewards. Compared to UCB algorithm regret bounds for specific model classes, our general bound matches the best available for linear models and is stronger than the best available for generalized linear models. 1
5 0.63582903 292 nips-2013-Sequential Transfer in Multi-armed Bandit with Finite Set of Models
Author: Mohammad Gheshlaghi azar, Alessandro Lazaric, Emma Brunskill
Abstract: Learning from prior tasks and transferring that experience to improve future performance is critical for building lifelong learning agents. Although results in supervised and reinforcement learning show that transfer may significantly improve the learning performance, most of the literature on transfer is focused on batch learning tasks. In this paper we study the problem of sequential transfer in online learning, notably in the multi–armed bandit framework, where the objective is to minimize the total regret over a sequence of tasks by transferring knowledge from prior tasks. We introduce a novel bandit algorithm based on a method-of-moments approach for estimating the possible tasks and derive regret bounds for it. 1
6 0.60140836 91 nips-2013-Dirty Statistical Models
7 0.59999084 270 nips-2013-Regret based Robust Solutions for Uncertain Markov Decision Processes
8 0.58579755 253 nips-2013-Prior-free and prior-dependent regret bounds for Thompson Sampling
9 0.56831491 159 nips-2013-Learning Prices for Repeated Auctions with Strategic Buyers
10 0.5504387 1 nips-2013-(More) Efficient Reinforcement Learning via Posterior Sampling
11 0.54800808 112 nips-2013-Estimation Bias in Multi-Armed Bandit Algorithms for Search Advertising
12 0.5371027 230 nips-2013-Online Learning with Costly Features and Labels
13 0.53421015 116 nips-2013-Fantope Projection and Selection: A near-optimal convex relaxation of sparse PCA
14 0.52758819 103 nips-2013-Efficient Exploration and Value Function Generalization in Deterministic Systems
15 0.52372622 325 nips-2013-The Pareto Regret Frontier
16 0.52258354 297 nips-2013-Sketching Structured Matrices for Faster Nonlinear Regression
17 0.5172407 330 nips-2013-Thompson Sampling for 1-Dimensional Exponential Family Bandits
18 0.50262934 89 nips-2013-Dimension-Free Exponentiated Gradient
19 0.49733624 188 nips-2013-Memory Limited, Streaming PCA
20 0.49459729 231 nips-2013-Online Learning with Switching Costs and Other Adaptive Adversaries
topicId topicWeight
[(2, 0.02), (16, 0.029), (19, 0.027), (33, 0.178), (34, 0.116), (41, 0.027), (49, 0.035), (53, 0.147), (56, 0.191), (70, 0.022), (85, 0.038), (89, 0.045), (93, 0.033), (95, 0.028)]
simIndex simValue paperId paperTitle
1 0.95369995 205 nips-2013-Multisensory Encoding, Decoding, and Identification
Author: Aurel A. Lazar, Yevgeniy Slutskiy
Abstract: We investigate a spiking neuron model of multisensory integration. Multiple stimuli from different sensory modalities are encoded by a single neural circuit comprised of a multisensory bank of receptive fields in cascade with a population of biophysical spike generators. We demonstrate that stimuli of different dimensions can be faithfully multiplexed and encoded in the spike domain and derive tractable algorithms for decoding each stimulus from the common pool of spikes. We also show that the identification of multisensory processing in a single neuron is dual to the recovery of stimuli encoded with a population of multisensory neurons, and prove that only a projection of the circuit onto input stimuli can be identified. We provide an example of multisensory integration using natural audio and video and discuss the performance of the proposed decoding and identification algorithms. 1
same-paper 2 0.92436492 137 nips-2013-High-Dimensional Gaussian Process Bandits
Author: Josip Djolonga, Andreas Krause, Volkan Cevher
Abstract: Many applications in machine learning require optimizing unknown functions defined over a high-dimensional space from noisy samples that are expensive to obtain. We address this notoriously hard challenge, under the assumptions that the function varies only along some low-dimensional subspace and is smooth (i.e., it has a low norm in a Reproducing Kernel Hilbert Space). In particular, we present the SI-BO algorithm, which leverages recent low-rank matrix recovery techniques to learn the underlying subspace of the unknown function and applies Gaussian Process Upper Confidence sampling for optimization of the function. We carefully calibrate the exploration–exploitation tradeoff by allocating the sampling budget to subspace estimation and function optimization, and obtain the first subexponential cumulative regret bounds and convergence rates for Bayesian optimization in high dimensions under noisy observations. Numerical results demonstrate the effectiveness of our approach in difficult scenarios. 1
3 0.89917552 75 nips-2013-Convex Two-Layer Modeling
Author: Özlem Aslan, Hao Cheng, Xinhua Zhang, Dale Schuurmans
Abstract: Latent variable prediction models, such as multi-layer networks, impose auxiliary latent variables between inputs and outputs to allow automatic inference of implicit features useful for prediction. Unfortunately, such models are difficult to train because inference over latent variables must be performed concurrently with parameter optimization—creating a highly non-convex problem. Instead of proposing another local training method, we develop a convex relaxation of hidden-layer conditional models that admits global training. Our approach extends current convex modeling approaches to handle two nested nonlinearities separated by a non-trivial adaptive latent layer. The resulting methods are able to acquire two-layer models that cannot be represented by any single-layer model over the same features, while improving training quality over local heuristics. 1
4 0.88881093 68 nips-2013-Confidence Intervals and Hypothesis Testing for High-Dimensional Statistical Models
Author: Adel Javanmard, Andrea Montanari
Abstract: Fitting high-dimensional statistical models often requires the use of non-linear parameter estimation procedures. As a consequence, it is generally impossible to obtain an exact characterization of the probability distribution of the parameter estimates. This in turn implies that it is extremely challenging to quantify the uncertainty associated with a certain parameter estimate. Concretely, no commonly accepted procedure exists for computing classical measures of uncertainty and statistical significance as confidence intervals or p-values. We consider here a broad class of regression problems, and propose an efficient algorithm for constructing confidence intervals and p-values. The resulting confidence intervals have nearly optimal size. When testing for the null hypothesis that a certain parameter is vanishing, our method has nearly optimal power. Our approach is based on constructing a ‘de-biased’ version of regularized Mestimators. The new construction improves over recent work in the field in that it does not assume a special structure on the design matrix. Furthermore, proofs are remarkably simple. We test our method on a diabetes prediction problem. 1
5 0.8869527 249 nips-2013-Polar Operators for Structured Sparse Estimation
Author: Xinhua Zhang, Yao-Liang Yu, Dale Schuurmans
Abstract: Structured sparse estimation has become an important technique in many areas of data analysis. Unfortunately, these estimators normally create computational difficulties that entail sophisticated algorithms. Our first contribution is to uncover a rich class of structured sparse regularizers whose polar operator can be evaluated efficiently. With such an operator, a simple conditional gradient method can then be developed that, when combined with smoothing and local optimization, significantly reduces training time vs. the state of the art. We also demonstrate a new reduction of polar to proximal maps that enables more efficient latent fused lasso. 1
6 0.8831104 224 nips-2013-On the Sample Complexity of Subspace Learning
8 0.8820501 228 nips-2013-Online Learning of Dynamic Parameters in Social Networks
9 0.88165534 28 nips-2013-Adaptive Step-Size for Policy Gradient Methods
10 0.88155317 74 nips-2013-Convex Tensor Decomposition via Structured Schatten Norm Regularization
11 0.88129699 344 nips-2013-Using multiple samples to learn mixture models
12 0.88125044 177 nips-2013-Local Privacy and Minimax Bounds: Sharp Rates for Probability Estimation
13 0.88119471 355 nips-2013-Which Space Partitioning Tree to Use for Search?
14 0.88107061 116 nips-2013-Fantope Projection and Selection: A near-optimal convex relaxation of sparse PCA
15 0.88012612 23 nips-2013-Active Learning for Probabilistic Hypotheses Using the Maximum Gibbs Error Criterion
16 0.87863535 230 nips-2013-Online Learning with Costly Features and Labels
17 0.8784371 318 nips-2013-Structured Learning via Logistic Regression
18 0.87710035 292 nips-2013-Sequential Transfer in Multi-armed Bandit with Finite Set of Models
19 0.87595952 218 nips-2013-On Sampling from the Gibbs Distribution with Random Maximum A-Posteriori Perturbations
20 0.8759048 80 nips-2013-Data-driven Distributionally Robust Polynomial Optimization