nips nips2013 nips2013-106 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Dan Russo, Benjamin Van Roy
Abstract: This paper considers the sample complexity of the multi-armed bandit with dependencies among the arms. Some of the most successful algorithms for this problem use the principle of optimism in the face of uncertainty to guide exploration. The clearest example of this is the class of upper confidence bound (UCB) algorithms, but recent work has shown that a simple posterior sampling algorithm, sometimes called Thompson sampling, can be analyzed in the same manner as optimistic approaches. In this paper, we develop a regret bound that holds for both classes of algorithms. This bound applies broadly and can be specialized to many model classes. It depends on a new notion we refer to as the eluder dimension, which measures the degree of dependence among action rewards. Compared to UCB algorithm regret bounds for specific model classes, our general bound matches the best available for linear models and is stronger than the best available for generalized linear models. 1
Reference: text
sentIndex sentText sentNum sentScore
1 edu Abstract This paper considers the sample complexity of the multi-armed bandit with dependencies among the arms. [sent-3, score-0.172]
2 The clearest example of this is the class of upper confidence bound (UCB) algorithms, but recent work has shown that a simple posterior sampling algorithm, sometimes called Thompson sampling, can be analyzed in the same manner as optimistic approaches. [sent-5, score-0.233]
3 In this paper, we develop a regret bound that holds for both classes of algorithms. [sent-6, score-0.271]
4 It depends on a new notion we refer to as the eluder dimension, which measures the degree of dependence among action rewards. [sent-8, score-0.705]
5 Compared to UCB algorithm regret bounds for specific model classes, our general bound matches the best available for linear models and is stronger than the best available for generalized linear models. [sent-9, score-0.444]
6 This is the essence of what we call the eluder dimension. [sent-13, score-0.297]
7 We show this notion controls the rate at which algorithms using optimistic exploration converge to optimality. [sent-14, score-0.206]
8 We consider an optimization problem faced by an agent who is uncertain about how her actions influence performance. [sent-15, score-0.283]
9 The agent selects actions sequentially, and upon each action observes a reward. [sent-16, score-0.569]
10 A reward function governs the mean reward of each action. [sent-17, score-0.386]
11 As rewards are observed the agent learns about the reward function, and this allows her to improve behavior. [sent-18, score-0.299]
12 Good performance requires adaptively sampling actions in a way that strikes an effective balance between exploring poorly understood actions and exploiting previously acquired knowledge to attain high rewards. [sent-19, score-0.475]
13 Unless the agent has prior knowledge of the structure of the mean payoff function, she can only learn to attain near optimal performance by exhaustively sampling each possible action. [sent-20, score-0.258]
14 Problems of this form are often referred to as multi-armed bandit (MAB) problems with dependent arms. [sent-22, score-0.172]
15 A notable example is the “linear bandit” problem, where actions are described by a finite number of features and the reward function is linear in these features. [sent-23, score-0.397]
16 Several researchers have studied algorithms for such problems and established theoretical guarantees that have no dependence on the number of actions [1, 2, 3]. [sent-24, score-0.226]
17 Instead, their bounds depend on the linear dimension of the class of reward functions. [sent-25, score-0.47]
18 In this paper, we assume that the reward function lies in a known but otherwise arbitrary class of uniformly bounded real-valued functions, and provide theoretical guarantees that depend on more general measures of the complexity of the class of functions. [sent-26, score-0.353]
19 The aforementioned papers on the linear bandit problem study UCB algorithms [1, 2, 3]. [sent-30, score-0.223]
20 Other authors have studied UCB algorithms in cases where the reward function is Lipschitz continuous [4, 5], sampled from a Gaussian process [6], or takes the form of a generalized [7] or sparse [8] linear model. [sent-31, score-0.28]
21 More generally, there is an immense literature on this approach to balancing between exploration and exploitation, including work on bandits with independent arms [9, 10, 11, 12], reinforcement learning [13, 14], and Monte Carlo Tree Search [15]. [sent-32, score-0.198]
22 Recently, a simple posterior sampling algorithm called Thompson sampling was shown to share a close connection with UCB algorithms [16]. [sent-33, score-0.174]
23 A recent paper considers the application of this algorithm to a linear contextual bandit problem [23]. [sent-38, score-0.282]
24 There is work that provides general bounds for contextual bandit problems where the context space is allowed to be infinite, but the action space is small (see e. [sent-40, score-0.601]
25 Our model captures contextual bandits as a special case, but we emphasize problem instances with large or infinite action sets, and where the goal is to learn without sampling every possible action. [sent-43, score-0.557]
26 They provide bounds based on a new notion of dimension, but unfortunately this notion does not provide a guarantee for the algorithms we consider. [sent-46, score-0.213]
27 We provide bounds on expected regret over a time horizon T that are, up to a logarithmic factor, of order √( dim_E(F, T⁻²) · log N(F, T⁻², ‖·‖_∞) · T ), where the first factor is the eluder dimension and the second is the log-covering number. [sent-47, score-0.412]
28 This quantity depends on the class of reward functions F through two measures of complexity. [sent-48, score-0.322]
29 The first measures the growth rate of the covering numbers of F, and is closely related to measures of complexity that are common in the supervised learning literature. [sent-50, score-0.184]
30 The second measure, the eluder dimension, is a new notion we introduce. [sent-52, score-0.362]
31 This captures how effectively the value of unobserved actions can be inferred from observed samples. [sent-53, score-0.24]
32 We also discuss why notions of dimension common to the supervised learning literature are insufficient for our purposes. [sent-55, score-0.186]
33 Finally, we show that our more general result, when specialized to linear models, recovers the strongest known regret bound and, in the case of generalized linear models, yields a bound stronger than that established in prior literature. [sent-56, score-0.492]
34 2 Problem Formulation We consider a model involving a set of actions A and a set of real-valued functions F = {fρ : A → R| ρ ∈ Θ}, indexed by a parameter that takes values from an index set Θ. [sent-57, score-0.213]
35 A random variable θ indexes the true reward function fθ . [sent-59, score-0.193]
36 At each time t, the agent is presented with a possibly random subset 𝒜_t ⊆ 𝒜 of actions and selects an action A_t ∈ 𝒜_t, after which she observes a reward R_t. [sent-60, score-0.585]
37 We denote by H_t = (𝒜_1, A_1, R_1, . . . , 𝒜_{t−1}, A_{t−1}, R_{t−1}, 𝒜_t) the history of observations available to the agent when choosing an action A_t. [sent-64, score-0.369]
38 The agent employs a policy π = {πt |t ∈ N}, which is a deterministic sequence of functions, each mapping the history Ht to a probability distribution over actions A. [sent-65, score-0.339]
39 The action A_t is selected by sampling from the distribution π_t(H_t), so that P(A_t ∈ · | H_t) = π_t(H_t). [sent-67, score-0.35]
40 In other words, the realized reward is the mean-reward value corrupted by zero-mean noise. [sent-69, score-0.193]
41 We let A*_t ∈ argmax_{a∈𝒜_t} f_θ(a) denote the optimal action at time t. [sent-79, score-0.309]
42 The T-period regret is the random variable R(T, π) = Σ_{t=1}^{T} [f_θ(A*_t) − f_θ(A_t)], where the actions {A_t : t ∈ N} are selected according to π. [sent-80, score-0.368]
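As a concrete illustration of this regret definition, here is a minimal sketch (not from the paper; the names f_theta, action_sets and chosen_actions are hypothetical) that accumulates the per-period gaps for finite action sets:

```python
def t_period_regret(f_theta, action_sets, chosen_actions):
    """Realized T-period regret: sum_t [f_theta(A*_t) - f_theta(A_t)].

    f_theta is the true mean-reward function, action_sets[t] is the set of
    actions offered at time t, and chosen_actions[t] is the action the
    policy actually played; all names are illustrative, not from the paper.
    """
    regret = 0.0
    for offered, played in zip(action_sets, chosen_actions):
        best_value = max(f_theta(a) for a in offered)  # value of A*_t
        regret += best_value - f_theta(played)
    return regret
```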
43 We sometimes study expected regret E[R(T, π)], where the expectation is taken over the prior distribution of θ, the reward noise, and the algorithm’s internal randomization. [sent-81, score-0.384]
44 Similarly, we study conditional expected regret E [R(T, π) | θ], which integrates over all randomness in the system except for θ. [sent-83, score-0.191]
45 The contextual multi-armed bandit model is a special case of the formulation presented above. [sent-86, score-0.255]
46 In particular, the expected reward at time t is given by fθ (a, Xt ). [sent-88, score-0.193]
47 In particular, one can define the set of actions to be the set of state-action pairs 𝒜 := {(x, a) : x ∈ 𝒳, a ∈ A(x)}, and the set of available actions at time t to be 𝒜_t = {(X_t, a) : a ∈ A(X_t)}. [sent-90, score-0.354]
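This reduction can be made concrete with a small sketch; the helper below is illustrative only and assumes the context and its admissible action set are available as plain Python objects:

```python
def lifted_action_set(context, admissible_actions):
    """Contextual bandit as a bandit with time-varying action sets: the set
    offered at time t is A_t = {(X_t, a) : a in A(X_t)}.

    context stands for X_t and admissible_actions for A(X_t); both names
    are illustrative placeholders.
    """
    return [(context, a) for a in admissible_actions]
```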
48 3 Algorithms We will establish performance bounds for two classes of algorithms: Thompson sampling and UCB algorithms. [sent-91, score-0.201]
49 At each time t, such an algorithm selects the action A_t ∈ argmax_{a∈𝒜_t} sup_{f∈F_t} f(a), where sup_{f∈F_t} f(a) is an optimistic estimate of f_θ(a) representing the greatest value that is statistically plausible at time t. [sent-99, score-0.474]
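A minimal sketch of this optimistic selection rule, assuming a finite action set and a finite stand-in for the confidence set F_t (with an infinite class the inner sup becomes an optimization problem), might look as follows:

```python
def ucb_action(action_set, confidence_set):
    """Optimistic selection A_t in argmax_{a in A_t} sup_{f in F_t} f(a).

    confidence_set is a finite list of candidate reward functions standing
    in for F_t; with an infinite class the inner sup would itself be an
    optimization problem rather than a max over a list.
    """
    def optimistic_value(a):
        return max(f(a) for f in confidence_set)  # sup_{f in F_t} f(a)

    return max(action_set, key=optimistic_value)
```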
50 Thompson sampling: The Thompson sampling algorithm simply samples each action according to the probability it is optimal. [sent-106, score-0.35]
51 In particular, the algorithm applies action sampling distributions π^{TS}_t(H_t) = P(A*_t ∈ · | H_t), where A*_t is a random variable that satisfies A*_t ∈ argmax_{a∈𝒜_t} f_θ(a). [sent-107, score-0.396]
52 Practical implementations typically operate by, at each time t, sampling an index θ̂_t ∈ Θ from the posterior distribution P(θ ∈ · | H_t) and then selecting an action A_t ∈ argmax_{a∈𝒜_t} f_{θ̂_t}(a). [sent-108, score-0.396]
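A sketch of one step of this practical implementation is given below; sample_posterior and the parameterized reward function reward_fn are placeholders for whatever model-specific machinery maintains P(θ ∈ · | H_t):

```python
def thompson_sampling_step(action_set, sample_posterior, reward_fn):
    """One step of Thompson sampling: draw theta_hat ~ P(theta in . | H_t)
    and play A_t in argmax_{a in A_t} f_{theta_hat}(a).

    sample_posterior and reward_fn(theta, a) are placeholders for a
    model-specific posterior sampler and parameterized mean-reward function.
    """
    theta_hat = sample_posterior()  # one draw from the posterior over theta
    return max(action_set, key=lambda a: reward_fn(theta_hat, a))
```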
53 In particular, there is a known feature mapping φ : 𝒜 → R^d such that an action a yields expected reward f_θ(a) = ⟨φ(a), θ⟩. [sent-110, score-0.456]
54 The algorithm uses the upper confidence bound U_t(a) := max_{θ∈Θ_t} ⟨φ(a), θ⟩ = ⟨φ(a), θ̂_t⟩ + √β_t ‖φ(a)‖_{Φ_t⁻¹}. The term ‖φ(a)‖_{Φ_t⁻¹} captures the amount of previous exploration in the direction φ(a), and causes the “uncertainty bonus” √(β d log t) ‖φ(a)‖_{Φ_t⁻¹} to diminish as the number of observations increases. [sent-113, score-0.18]
55 We consider a linear reward function f_θ(a) = ⟨φ(a), θ⟩ and assume the reward noise R_t − f_θ(A_t) is normally distributed and independent from (H_t, A_t, θ). [sent-116, score-0.413]
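For this linear setting, a hedged sketch of the resulting UCB rule is shown below; the exact confidence width β_t used in the paper's analysis is not reproduced, and the regularized Gram matrix Phi and estimate theta_hat are assumed to be maintained elsewhere:

```python
import numpy as np

def linear_ucb_action(features, Phi, theta_hat, beta):
    """Linear UCB sketch: score each action by <phi(a), theta_hat> plus an
    uncertainty bonus proportional to ||phi(a)||_{Phi^{-1}}.

    features is a list of feature vectors phi(a), Phi the regularized Gram
    matrix lambda*I + sum_s phi(A_s) phi(A_s)^T, theta_hat a least-squares
    estimate, and beta a confidence width; the exact beta appearing in the
    paper's confidence sets is not reproduced here.
    """
    Phi_inv = np.linalg.inv(Phi)
    scores = [phi @ theta_hat + np.sqrt(beta * (phi @ Phi_inv @ phi))
              for phi in features]
    return int(np.argmax(scores))  # index of the selected action
```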
56 4 Notions of Dimension Recently, there has been a great deal of interest in the development of regret bounds for linear UCB algorithms [1, 2, 3, 26]. [sent-120, score-0.301]
57 An interesting feature of these bounds is that they have no dependence on the number of actions in 𝒜, and instead depend only on the linear dimension of the set of functions F. [sent-122, score-0.477]
58 Our goal is to provide bounds that depend on more general measures of the complexity of the class of functions. [sent-123, score-0.204]
59 This section introduces a new notion, the eluder dimension, on which our bounds will depend. [sent-124, score-0.38]
60 First, we highlight why common notions from statistical learning theory do not suffice when it comes to multi–armed bandit problems. [sent-125, score-0.245]
61 With the actions and parameters indexed by {1, . . . , n}, it is easy to see that the regret of any algorithm grows linearly with n. [sent-141, score-0.219]
62 For large n, until θ is discovered, each sampled action is unlikely to reveal much about θ and learning therefore takes very long. [sent-142, score-0.286]
63 Consider the closely related supervised learning problem in which at each time an action Ã_t is sampled uniformly at random and f_θ(Ã_t) is observed. [sent-143, score-0.305]
64 To highlight conceptual differences between the eluder dimension and the VC dimension, we will now define VC dimension in a way analogous to how we will define eluder dimension. [sent-149, score-0.848]
65 An action a is VC-independent of Ã ⊆ 𝒜 if for any f, f′ ∈ F there exists some f″ ∈ F that agrees with f at a and with f′ on Ã; that is, f″(a) = f(a) and f″(ã) = f′(ã) for all ã ∈ Ã. [sent-152, score-0.263]
66 By this definition, an action a is said to be VC-dependent on Ã if knowing the values f ∈ F takes on Ã could restrict the set of possible values at a. [sent-154, score-0.263]
67 This notion of independence is intimately related to the VC dimension of a class of functions. [sent-155, score-0.204]
68 The VC dimension of a class of binary-valued functions with domain 𝒜 is the largest cardinality of a set Ã ⊆ 𝒜 such that every a ∈ Ã is VC-independent of Ã \ {a}. [sent-158, score-0.175]
69 In the above example, any two actions are VC-dependent because knowing the label fθ (a) of one action could completely determine the value of the other action. [sent-159, score-0.44]
70 However, this only happens if the sampled action has label 1. [sent-160, score-0.286]
71 2 Defining Eluder Dimension Here we define the eluder dimension of a class of functions, which plays a key role in our results. [sent-164, score-0.436]
72 Intuitively, an action a is independent of {a_1, . . . , a_n} if two functions that make similar predictions at {a_1, . . . , a_n} can nevertheless differ significantly in their predictions at a. [sent-175, score-0.263]
73 The above definition measures the “similarity” of predictions at ε-scale, and asks whether two functions that make similar predictions at {a_1, . . . , a_n} must also make similar predictions at a. [sent-182, score-0.216]
74 Recall that a vector space has dimension d if and only if d is the length of the longest sequence of elements such that each element is linearly independent or equivalently, 0-independent of its predecessors. [sent-189, score-0.187]
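For very small finite classes, the ε-independence test and the resulting dimension can be computed by brute force. The sketch below is illustrative only: it follows the definitions above but ignores the refinement that each element need only be ε′-independent of its predecessors for some ε′ ≥ ε, and its cost is exponential in the number of actions:

```python
import itertools
import math

def eps_dependent(a, predecessors, function_class, eps):
    """a is eps-dependent on predecessors w.r.t. the finite function_class
    if every pair f, g that agree to within eps (in the l2 sense) on the
    predecessors also agree to within eps at a."""
    for f, g in itertools.combinations(function_class, 2):
        gap = math.sqrt(sum((f(x) - g(x)) ** 2 for x in predecessors))
        if gap <= eps and abs(f(a) - g(a)) > eps:
            return False
    return True

def eluder_dimension(actions, function_class, eps):
    """Length of the longest sequence of actions in which each element is
    eps-independent of its predecessors, by exhaustive search (only
    feasible for toy examples)."""
    best = 0
    for seq in itertools.permutations(actions):
        length = 0
        for i, a in enumerate(seq):
            if eps_dependent(a, seq[:i], function_class, eps):
                break
            length += 1
        best = max(best, length)
    return best
```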
75 5 Confidence Bounds and Regret Decompositions A key to our analysis is the recent observation [16] that the regret of both Thompson sampling and a UCB algorithm can be decomposed in terms of confidence sets. [sent-192, score-0.278]
76 Define the width of a subset F̃ ⊂ F at an action a ∈ 𝒜 by w_F̃(a) = sup_{f∈F̃} f(a) − inf_{f∈F̃} f(a). [sent-193, score-0.351]
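A one-line sketch of this width, for a finite stand-in for the confidence set, is:

```python
def width(confidence_set, a):
    """w_F(a) = sup_{f in F} f(a) - inf_{f in F} f(a), for a finite list of
    candidate reward functions standing in for a confidence set."""
    values = [f(a) for f in confidence_set]
    return max(values) - min(values)
```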
77 If the confidence sets F_t are constructed to contain f_θ with high probability, this proposition essentially bounds regret in terms of the sum of widths Σ_{t=1}^{T} w_{F_t}(A_t). [sent-198, score-0.375]
78 In this sense, the decomposition bounds regret only in terms of uncertainty about the actions A_1, . . . , A_T. [sent-199, score-0.476]
79 As actions are sampled, the value of fθ (·) at those actions is learned accurately, and hence we expect that the width wFt (·) of the confidence sets should diminish over time. [sent-202, score-0.472]
80 It is worth noting that the regret bound of the UCB algorithm π^{F_{1:∞}} depends on the specific confidence sets {F_t : t ∈ N} used by the algorithm, whereas the bound of π^{TS} applies for any sequence of confidence sets. [sent-203, score-0.349]
81 Then, in Section 7 we give a worst-case bound on the sum Σ_{t=1}^{T} w_{F_t}(A_t) in terms of the eluder dimension of the class of functions F. [sent-207, score-0.551]
82 When combined with Proposition 1, this analysis provides regret bounds for both Thompson sampling and a UCB algorithm. [sent-208, score-0.361]
83 When an action At is sampled, the ellipsoid shrinks in the direction φ(At ). [sent-232, score-0.323]
84 Here the explicit geometric structure of the confidence set implies that the width wFt shrinks not only at At but also at any other action whose feature vector is not orthogonal to φ(At ). [sent-233, score-0.334]
85 For a general class of functions, the situation is much subtler, and we need to measure the way in which the width at each action can be reduced by sampling other actions. [sent-235, score-0.435]
86 The following result uses our new notion of dimension to bound the number of times the width of the confidence interval for a selected action At can exceed a threshold. [sent-236, score-0.523]
87 Recall that, for a sequence of confidence sets {F_t : t ∈ N}, we denote by π^{F_{1:∞}} the UCB algorithm that chooses an action A_t ∈ argmax_{a∈𝒜_t} sup_{f∈F_t} f(a) at each time t. [sent-244, score-0.369]
88 We establish bounds that are, up to a logarithmic factor, of order √( dim_E(F, T⁻²) · log N(F, T⁻², ‖·‖_∞) · T ), where the first factor is the eluder dimension and the second is the log-covering number. [sent-245, score-0.221]
89 The second measure of complexity, the eluder dimension, measures the extent to which the reward value at one action can be inferred by sampling other actions. [sent-249, score-0.92]
90 For any T ∈ N, E[R(T, π^{TS})] ≤ B(F, T, T⁻¹) + 2C. The next two examples show how the regret bounds of Propositions 4 and 5 specialize to d-dimensional linear and generalized linear models. [sent-256, score-0.402]
91 For each of these examples Θ ⊂ Rd and each action is associated with a known feature vector φ(a). [sent-257, score-0.263]
92 Propositions 4 and 5 therefore yield O(d log(1/α_T^F) √T) regret bounds. [sent-263, score-0.191]
93 Then, log N(F, ε, ‖·‖_∞) = Õ(d log(h/ε)) and dim_E(F, ε) = Õ(d r² log(h/ε)), and Propositions 4 and 5 yield O(rd log(h/α_T^F) √T) regret bounds. [sent-268, score-0.229]
94 To our knowledge, this bound is a slight improvement over the strongest regret bound available for any algorithm in this setting. [sent-269, score-0.319]
95 9 Conclusion In this paper, we have analyzed two algorithms, Thompson sampling and a UCB algorithm, in a very general framework, and developed regret bounds that depend on a new notion of dimension. [sent-272, score-0.454]
96 In constructing these bounds, we have identified two factors that control the hardness of a particular multi-armed bandit problem. [sent-273, score-0.172]
97 First, an agent’s ability to quickly attain near-optimal performance depends on the extent to which the reward value at one action can be inferred by sampling other actions. [sent-274, score-0.603]
98 However, in order to select an action the agent must make inferences about many possible actions, and an error in its evaluation of any one could result in large regret. [sent-275, score-0.369]
99 Information-theoretic regret bounds for Gaussian process optimization in the bandit setting. [sent-320, score-0.446]
100 Analysis of Thompson sampling for the multi-armed bandit problem. [sent-409, score-0.259]
wordName wordTfidf (topN-words)
[('ucb', 0.313), ('thompson', 0.301), ('eluder', 0.297), ('wft', 0.264), ('action', 0.263), ('ft', 0.247), ('dime', 0.233), ('reward', 0.193), ('regret', 0.191), ('actions', 0.177), ('ht', 0.175), ('dence', 0.174), ('bandit', 0.172), ('ftls', 0.144), ('agent', 0.106), ('maxa', 0.102), ('dimension', 0.1), ('bandits', 0.087), ('sampling', 0.087), ('bounds', 0.083), ('contextual', 0.083), ('vc', 0.08), ('mab', 0.07), ('con', 0.069), ('proposition', 0.068), ('notion', 0.065), ('szepesv', 0.064), ('optimistic', 0.058), ('rt', 0.058), ('exploration', 0.055), ('measures', 0.054), ('bound', 0.049), ('rusmevichientong', 0.048), ('agrawal', 0.047), ('width', 0.046), ('arg', 0.046), ('stanford', 0.044), ('notions', 0.044), ('filippi', 0.042), ('sup', 0.042), ('supervised', 0.042), ('discrepancy', 0.041), ('class', 0.039), ('diminish', 0.039), ('goto', 0.039), ('log', 0.038), ('captures', 0.037), ('generalized', 0.037), ('russo', 0.037), ('ddimensional', 0.037), ('garivier', 0.037), ('functions', 0.036), ('predictions', 0.036), ('capp', 0.035), ('ellipsoid', 0.035), ('attain', 0.034), ('covering', 0.034), ('sets', 0.033), ('optimism', 0.032), ('longest', 0.032), ('nondecreasing', 0.032), ('rd', 0.032), ('ak', 0.032), ('classes', 0.031), ('amin', 0.031), ('payoff', 0.031), ('increment', 0.031), ('stronger', 0.03), ('strongest', 0.03), ('worst', 0.03), ('lipschitz', 0.03), ('specialized', 0.029), ('history', 0.029), ('highlight', 0.029), ('arms', 0.028), ('depend', 0.028), ('ai', 0.028), ('controls', 0.028), ('propositions', 0.028), ('munos', 0.028), ('linearly', 0.028), ('reinforcement', 0.028), ('sequence', 0.027), ('linear', 0.027), ('dependence', 0.026), ('inferred', 0.026), ('shrinks', 0.025), ('broadly', 0.025), ('ts', 0.025), ('lemma', 0.025), ('uncertainty', 0.025), ('conceptual', 0.025), ('papers', 0.024), ('xt', 0.024), ('selects', 0.023), ('arxiv', 0.023), ('cumulative', 0.023), ('sampled', 0.023), ('established', 0.023), ('measurable', 0.022)]
simIndex simValue paperId paperTitle
same-paper 1 0.99999982 106 nips-2013-Eluder Dimension and the Sample Complexity of Optimistic Exploration
Author: Dan Russo, Benjamin Van Roy
Abstract: This paper considers the sample complexity of the multi-armed bandit with dependencies among the arms. Some of the most successful algorithms for this problem use the principle of optimism in the face of uncertainty to guide exploration. The clearest example of this is the class of upper confidence bound (UCB) algorithms, but recent work has shown that a simple posterior sampling algorithm, sometimes called Thompson sampling, can be analyzed in the same manner as optimistic approaches. In this paper, we develop a regret bound that holds for both classes of algorithms. This bound applies broadly and can be specialized to many model classes. It depends on a new notion we refer to as the eluder dimension, which measures the degree of dependence among action rewards. Compared to UCB algorithm regret bounds for specific model classes, our general bound matches the best available for linear models and is stronger than the best available for generalized linear models. 1
2 0.32434514 253 nips-2013-Prior-free and prior-dependent regret bounds for Thompson Sampling
Author: Sebastien Bubeck, Che-Yu Liu
Abstract: We consider the stochastic multi-armed bandit problem with a prior distribution on the reward distributions. We are interested in studying prior-free and priordependent regret bounds, very much in the same spirit than the usual distributionfree and distribution-dependent bounds for the non-Bayesian stochastic bandit. We first show that Thompson Sampling attains an optimal prior-free bound in the sense that for any prior distribution its Bayesian regret is bounded from above by √ 14 nK. This result is unimprovable in the sense that there exists a prior distribution such that any algorithm has a Bayesian regret bounded from below by √ 1 20 nK. We also study the case of priors for the setting of Bubeck et al. [2013] (where the optimal mean is known as well as a lower bound on the smallest gap) and we show that in this case the regret of Thompson Sampling is in fact uniformly bounded over time, thus showing that Thompson Sampling can greatly take advantage of the nice properties of these priors. 1
3 0.27406028 103 nips-2013-Efficient Exploration and Value Function Generalization in Deterministic Systems
Author: Zheng Wen, Benjamin Van Roy
Abstract: We consider the problem of reinforcement learning over episodes of a finitehorizon deterministic system and as a solution propose optimistic constraint propagation (OCP), an algorithm designed to synthesize efficient exploration and value function generalization. We establish that when the true value function Q⇤ lies within the hypothesis class Q, OCP selects optimal actions over all but at most dimE [Q] episodes, where dimE denotes the eluder dimension. We establish further efficiency and asymptotic performance guarantees that apply even if Q⇤ does not lie in Q, for the special case where Q is the span of pre-specified indicator functions over disjoint sets. 1
4 0.26859877 137 nips-2013-High-Dimensional Gaussian Process Bandits
Author: Josip Djolonga, Andreas Krause, Volkan Cevher
Abstract: Many applications in machine learning require optimizing unknown functions defined over a high-dimensional space from noisy samples that are expensive to obtain. We address this notoriously hard challenge, under the assumptions that the function varies only along some low-dimensional subspace and is smooth (i.e., it has a low norm in a Reproducible Kernel Hilbert Space). In particular, we present the SI-BO algorithm, which leverages recent low-rank matrix recovery techniques to learn the underlying subspace of the unknown function and applies Gaussian Process Upper Confidence sampling for optimization of the function. We carefully calibrate the exploration–exploitation tradeoff by allocating the sampling budget to subspace estimation and function optimization, and obtain the first subexponential cumulative regret bounds and convergence rates for Bayesian optimization in high-dimensions under noisy observations. Numerical results demonstrate the effectiveness of our approach in difficult scenarios. 1
5 0.24412963 330 nips-2013-Thompson Sampling for 1-Dimensional Exponential Family Bandits
Author: Nathaniel Korda, Emilie Kaufmann, Remi Munos
Abstract: Thompson Sampling has been demonstrated in many complex bandit models, however the theoretical guarantees available for the parametric multi-armed bandit are still limited to the Bernoulli case. Here we extend them by proving asymptotic optimality of the algorithm using the Jeffreys prior for 1-dimensional exponential family bandits. Our proof builds on previous work, but also makes extensive use of closed forms for Kullback-Leibler divergence and Fisher information (through the Jeffreys prior) available in an exponential family. This allow us to give a finite time exponential concentration inequality for posterior distributions on exponential families that may be of interest in its own right. Moreover our analysis covers some distributions for which no optimistic algorithm has yet been proposed, including heavy-tailed exponential families. 1
6 0.24114515 50 nips-2013-Bayesian Mixture Modelling and Inference based Thompson Sampling in Monte-Carlo Tree Search
7 0.23527104 231 nips-2013-Online Learning with Switching Costs and Other Adaptive Adversaries
8 0.22319154 89 nips-2013-Dimension-Free Exponentiated Gradient
9 0.21688434 2 nips-2013-(Nearly) Optimal Algorithms for Private Online Learning in Full-information and Bandit Settings
10 0.21046102 278 nips-2013-Reward Mapping for Transfer in Long-Lived Agents
11 0.20534599 240 nips-2013-Optimization, Learning, and Games with Predictable Sequences
12 0.19753507 292 nips-2013-Sequential Transfer in Multi-armed Bandit with Finite Set of Models
13 0.18761253 1 nips-2013-(More) Efficient Reinforcement Learning via Posterior Sampling
14 0.17228775 7 nips-2013-A Gang of Bandits
15 0.16618331 338 nips-2013-Two-Target Algorithms for Infinite-Armed Bandits with Bernoulli Rewards
16 0.15210047 125 nips-2013-From Bandits to Experts: A Tale of Domination and Independence
17 0.15137532 273 nips-2013-Reinforcement Learning in Robust Markov Decision Processes
18 0.13618405 227 nips-2013-Online Learning in Markov Decision Processes with Adversarially Chosen Transition Probability Distributions
19 0.13504077 250 nips-2013-Policy Shaping: Integrating Human Feedback with Reinforcement Learning
20 0.12697838 32 nips-2013-Aggregating Optimistic Planning Trees for Solving Markov Decision Processes
topicId topicWeight
[(0, 0.254), (1, -0.291), (2, 0.159), (3, -0.171), (4, 0.006), (5, -0.141), (6, 0.182), (7, -0.146), (8, -0.12), (9, -0.006), (10, 0.007), (11, 0.15), (12, -0.056), (13, -0.071), (14, -0.045), (15, 0.107), (16, 0.001), (17, 0.035), (18, 0.051), (19, -0.101), (20, -0.031), (21, 0.106), (22, 0.009), (23, -0.021), (24, 0.06), (25, 0.007), (26, -0.018), (27, -0.164), (28, -0.002), (29, 0.057), (30, 0.049), (31, -0.121), (32, -0.045), (33, 0.094), (34, -0.052), (35, -0.059), (36, 0.047), (37, 0.002), (38, 0.113), (39, -0.028), (40, 0.11), (41, -0.095), (42, 0.013), (43, 0.049), (44, 0.028), (45, -0.032), (46, -0.044), (47, 0.049), (48, 0.046), (49, 0.022)]
simIndex simValue paperId paperTitle
same-paper 1 0.94914263 106 nips-2013-Eluder Dimension and the Sample Complexity of Optimistic Exploration
Author: Dan Russo, Benjamin Van Roy
Abstract: This paper considers the sample complexity of the multi-armed bandit with dependencies among the arms. Some of the most successful algorithms for this problem use the principle of optimism in the face of uncertainty to guide exploration. The clearest example of this is the class of upper confidence bound (UCB) algorithms, but recent work has shown that a simple posterior sampling algorithm, sometimes called Thompson sampling, can be analyzed in the same manner as optimistic approaches. In this paper, we develop a regret bound that holds for both classes of algorithms. This bound applies broadly and can be specialized to many model classes. It depends on a new notion we refer to as the eluder dimension, which measures the degree of dependence among action rewards. Compared to UCB algorithm regret bounds for specific model classes, our general bound matches the best available for linear models and is stronger than the best available for generalized linear models. 1
2 0.75235325 103 nips-2013-Efficient Exploration and Value Function Generalization in Deterministic Systems
Author: Zheng Wen, Benjamin Van Roy
Abstract: We consider the problem of reinforcement learning over episodes of a finitehorizon deterministic system and as a solution propose optimistic constraint propagation (OCP), an algorithm designed to synthesize efficient exploration and value function generalization. We establish that when the true value function Q⇤ lies within the hypothesis class Q, OCP selects optimal actions over all but at most dimE [Q] episodes, where dimE denotes the eluder dimension. We establish further efficiency and asymptotic performance guarantees that apply even if Q⇤ does not lie in Q, for the special case where Q is the span of pre-specified indicator functions over disjoint sets. 1
3 0.73969793 253 nips-2013-Prior-free and prior-dependent regret bounds for Thompson Sampling
Author: Sebastien Bubeck, Che-Yu Liu
Abstract: We consider the stochastic multi-armed bandit problem with a prior distribution on the reward distributions. We are interested in studying prior-free and priordependent regret bounds, very much in the same spirit than the usual distributionfree and distribution-dependent bounds for the non-Bayesian stochastic bandit. We first show that Thompson Sampling attains an optimal prior-free bound in the sense that for any prior distribution its Bayesian regret is bounded from above by √ 14 nK. This result is unimprovable in the sense that there exists a prior distribution such that any algorithm has a Bayesian regret bounded from below by √ 1 20 nK. We also study the case of priors for the setting of Bubeck et al. [2013] (where the optimal mean is known as well as a lower bound on the smallest gap) and we show that in this case the regret of Thompson Sampling is in fact uniformly bounded over time, thus showing that Thompson Sampling can greatly take advantage of the nice properties of these priors. 1
4 0.73358041 1 nips-2013-(More) Efficient Reinforcement Learning via Posterior Sampling
Author: Ian Osband, Dan Russo, Benjamin Van Roy
Abstract: Most provably-efficient reinforcement learning algorithms introduce optimism about poorly-understood states and actions to encourage exploration. We study an alternative approach for efficient exploration: posterior sampling for reinforcement learning (PSRL). This algorithm proceeds in repeated episodes of known duration. At the start of each episode, PSRL updates a prior distribution over Markov decision processes and takes one sample from this posterior. PSRL then follows the policy that is optimal for this sample during the episode. The algorithm is conceptually simple, computationally efficient and allows an √ agent to encode prior knowledge ˜ in a natural way. We establish an O(τ S AT ) bound on expected regret, where T is time, τ is the episode length and S and A are the cardinalities of the state and action spaces. This bound is one of the first for an algorithm not based on optimism, and close to the state of the art for any reinforcement learning algorithm. We show through simulation that PSRL significantly outperforms existing algorithms with similar regret bounds. 1
5 0.65447098 330 nips-2013-Thompson Sampling for 1-Dimensional Exponential Family Bandits
Author: Nathaniel Korda, Emilie Kaufmann, Remi Munos
Abstract: Thompson Sampling has been demonstrated in many complex bandit models, however the theoretical guarantees available for the parametric multi-armed bandit are still limited to the Bernoulli case. Here we extend them by proving asymptotic optimality of the algorithm using the Jeffreys prior for 1-dimensional exponential family bandits. Our proof builds on previous work, but also makes extensive use of closed forms for Kullback-Leibler divergence and Fisher information (through the Jeffreys prior) available in an exponential family. This allow us to give a finite time exponential concentration inequality for posterior distributions on exponential families that may be of interest in its own right. Moreover our analysis covers some distributions for which no optimistic algorithm has yet been proposed, including heavy-tailed exponential families. 1
6 0.63962168 270 nips-2013-Regret based Robust Solutions for Uncertain Markov Decision Processes
7 0.61723572 292 nips-2013-Sequential Transfer in Multi-armed Bandit with Finite Set of Models
8 0.61370796 231 nips-2013-Online Learning with Switching Costs and Other Adaptive Adversaries
9 0.60113943 50 nips-2013-Bayesian Mixture Modelling and Inference based Thompson Sampling in Monte-Carlo Tree Search
10 0.58667171 278 nips-2013-Reward Mapping for Transfer in Long-Lived Agents
11 0.55326486 137 nips-2013-High-Dimensional Gaussian Process Bandits
12 0.54231381 32 nips-2013-Aggregating Optimistic Planning Trees for Solving Markov Decision Processes
13 0.53159541 338 nips-2013-Two-Target Algorithms for Infinite-Armed Bandits with Bernoulli Rewards
14 0.53008163 89 nips-2013-Dimension-Free Exponentiated Gradient
15 0.52179527 250 nips-2013-Policy Shaping: Integrating Human Feedback with Reinforcement Learning
16 0.51037955 273 nips-2013-Reinforcement Learning in Robust Markov Decision Processes
17 0.49204236 7 nips-2013-A Gang of Bandits
18 0.48418573 2 nips-2013-(Nearly) Optimal Algorithms for Private Online Learning in Full-information and Bandit Settings
19 0.45476747 125 nips-2013-From Bandits to Experts: A Tale of Domination and Independence
20 0.44238129 323 nips-2013-Synthesizing Robust Plans under Incomplete Domain Models
topicId topicWeight
[(16, 0.019), (33, 0.146), (34, 0.071), (41, 0.01), (49, 0.026), (56, 0.542), (85, 0.052), (89, 0.028), (93, 0.014), (95, 0.013)]
simIndex simValue paperId paperTitle
1 0.9963215 199 nips-2013-More data speeds up training time in learning halfspaces over sparse vectors
Author: Amit Daniely, Nati Linial, Shai Shalev-Shwartz
Abstract: The increased availability of data in recent years has led several authors to ask whether it is possible to use data as a computational resource. That is, if more data is available, beyond the sample complexity limit, is it possible to use the extra examples to speed up the computation time required to perform the learning task? We give the first positive answer to this question for a natural supervised learning problem — we consider agnostic PAC learning of halfspaces over 3-sparse vectors in {−1, 1, 0}n . This class is inefficiently learnable using O n/ 2 examples. Our main contribution is a novel, non-cryptographic, methodology for establishing computational-statistical gaps, which allows us to show that, under a widely believed assumption that refuting random 3CNF formulas is hard, it is impossible to efficiently learn this class using only O n/ 2 examples. We further show that under stronger hardness assumptions, even O n1.499 / 2 examples do not suffice. On the other hand, we show a new algorithm that learns this class efficiently ˜ using Ω n2 / 2 examples. This formally establishes the tradeoff between sample and computational complexity for a natural supervised learning problem. 1
2 0.99572325 31 nips-2013-Adaptivity to Local Smoothness and Dimension in Kernel Regression
Author: Samory Kpotufe, Vikas Garg
Abstract: We present the first result for kernel regression where the procedure adapts locally at a point x to both the unknown local dimension of the metric space X and the unknown H¨ lder-continuity of the regression function at x. The result holds with o high probability simultaneously at all points x in a general metric space X of unknown structure. 1
3 0.98885983 108 nips-2013-Error-Minimizing Estimates and Universal Entry-Wise Error Bounds for Low-Rank Matrix Completion
Author: Franz Kiraly, Louis Theran
Abstract: We propose a general framework for reconstructing and denoising single entries of incomplete and noisy entries. We describe: effective algorithms for deciding if and entry can be reconstructed and, if so, for reconstructing and denoising it; and a priori bounds on the error of each entry, individually. In the noiseless case our algorithm is exact. For rank-one matrices, the new algorithm is fast, admits a highly-parallel implementation, and produces an error minimizing estimate that is qualitatively close to our theoretical and the state-of-the-art Nuclear Norm and OptSpace methods. 1
same-paper 4 0.98409158 106 nips-2013-Eluder Dimension and the Sample Complexity of Optimistic Exploration
Author: Dan Russo, Benjamin Van Roy
Abstract: This paper considers the sample complexity of the multi-armed bandit with dependencies among the arms. Some of the most successful algorithms for this problem use the principle of optimism in the face of uncertainty to guide exploration. The clearest example of this is the class of upper confidence bound (UCB) algorithms, but recent work has shown that a simple posterior sampling algorithm, sometimes called Thompson sampling, can be analyzed in the same manner as optimistic approaches. In this paper, we develop a regret bound that holds for both classes of algorithms. This bound applies broadly and can be specialized to many model classes. It depends on a new notion we refer to as the eluder dimension, which measures the degree of dependence among action rewards. Compared to UCB algorithm regret bounds for specific model classes, our general bound matches the best available for linear models and is stronger than the best available for generalized linear models. 1
5 0.98289043 155 nips-2013-Learning Hidden Markov Models from Non-sequence Data via Tensor Decomposition
Author: Tzu-Kuo Huang, Jeff Schneider
Abstract: Learning dynamic models from observed data has been a central issue in many scientific studies or engineering tasks. The usual setting is that data are collected sequentially from trajectories of some dynamical system operation. In quite a few modern scientific modeling tasks, however, it turns out that reliable sequential data are rather difficult to gather, whereas out-of-order snapshots are much easier to obtain. Examples include the modeling of galaxies, chronic diseases such Alzheimer’s, or certain biological processes. Existing methods for learning dynamic model from non-sequence data are mostly based on Expectation-Maximization, which involves non-convex optimization and is thus hard to analyze. Inspired by recent advances in spectral learning methods, we propose to study this problem from a different perspective: moment matching and spectral decomposition. Under that framework, we identify reasonable assumptions on the generative process of non-sequence data, and propose learning algorithms based on the tensor decomposition method [2] to provably recover firstorder Markov models and hidden Markov models. To the best of our knowledge, this is the first formal guarantee on learning from non-sequence data. Preliminary simulation results confirm our theoretical findings. 1
6 0.96836245 17 nips-2013-A multi-agent control framework for co-adaptation in brain-computer interfaces
7 0.95604694 3 nips-2013-A* Lasso for Learning a Sparse Bayesian Network Structure for Continuous Variables
8 0.95326596 269 nips-2013-Regression-tree Tuning in a Streaming Setting
9 0.94254977 103 nips-2013-Efficient Exploration and Value Function Generalization in Deterministic Systems
10 0.92418551 227 nips-2013-Online Learning in Markov Decision Processes with Adversarially Chosen Transition Probability Distributions
11 0.91503966 29 nips-2013-Adaptive Submodular Maximization in Bandit Setting
12 0.90678239 42 nips-2013-Auditing: Active Learning with Outcome-Dependent Query Costs
13 0.89642996 2 nips-2013-(Nearly) Optimal Algorithms for Private Online Learning in Full-information and Bandit Settings
14 0.89214957 235 nips-2013-Online learning in episodic Markovian decision processes by relative entropy policy search
15 0.8915903 231 nips-2013-Online Learning with Switching Costs and Other Adaptive Adversaries
16 0.89087284 66 nips-2013-Computing the Stationary Distribution Locally
17 0.88960254 230 nips-2013-Online Learning with Costly Features and Labels
18 0.87733209 1 nips-2013-(More) Efficient Reinforcement Learning via Posterior Sampling
19 0.87587112 32 nips-2013-Aggregating Optimistic Planning Trees for Solving Markov Decision Processes
20 0.8704496 125 nips-2013-From Bandits to Experts: A Tale of Domination and Independence