nips nips2013 nips2013-103 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Zheng Wen, Benjamin Van Roy
Abstract: We consider the problem of reinforcement learning over episodes of a finite-horizon deterministic system and as a solution propose optimistic constraint propagation (OCP), an algorithm designed to synthesize efficient exploration and value function generalization. We establish that when the true value function Q⇤ lies within the hypothesis class Q, OCP selects optimal actions over all but at most dimE [Q] episodes, where dimE denotes the eluder dimension. We establish further efficiency and asymptotic performance guarantees that apply even if Q⇤ does not lie in Q, for the special case where Q is the span of pre-specified indicator functions over disjoint sets. 1
Reference: text
sentIndex sentText sentNum sentScore
1 Abstract We consider the problem of reinforcement learning over episodes of a finite-horizon deterministic system and as a solution propose optimistic constraint propagation (OCP), an algorithm designed to synthesize efficient exploration and value function generalization. [sent-3, score-1.048]
2 We establish that when the true value function Q⇤ lies within the hypothesis class Q, OCP selects optimal actions over all but at most dimE [Q] episodes, where dimE denotes the eluder dimension. [sent-4, score-0.525]
3 We establish further efficiency and asymptotic performance guarantees that apply even if Q⇤ does not lie in Q, for the special case where Q is the span of pre-specified indicator functions over disjoint sets. [sent-5, score-0.248]
4 1 Introduction A growing body of work on efficient reinforcement learning provides algorithms with guarantees on sample and computational efficiency [13, 6, 2, 22, 4, 9]. [sent-6, score-0.383]
5 This literature highlights the point that an effective exploration scheme is critical to the design of any efficient reinforcement learning algorithm. [sent-7, score-0.429]
6 In particular, popular exploration schemes such as ε-greedy, Boltzmann, and knowledge gradient can require learning times that grow exponentially in the number of states and/or the planning horizon. [sent-8, score-0.228]
7 The aforementioned literature focuses on tabula rasa learning; that is, algorithms aim to learn with little or no prior knowledge about transition probabilities and rewards. [sent-9, score-0.4]
8 Such algorithms require learning times that grow at least linearly with the number of states. [sent-10, score-0.037]
9 Despite the valuable insights that have been generated through their design and analysis, these algorithms are of limited practical import because state spaces in most contexts of practical interest are enormous. [sent-11, score-0.318]
10 There is a need for algorithms that generalize from past experience in order to learn how to make effective decisions in reasonable time. [sent-12, score-0.117]
11 There has been much work on reinforcement learning algorithms that generalize (see, e.g., …). [sent-13, score-0.293]
12 Most of these algorithms do not come with statistical or computational efficiency guarantees, though there are a few noteworthy exceptions, which we now discuss. [sent-16, score-0.05]
13 Though interesting results have been produced in this line of work, each entails quite restrictive assumptions or does not make strong guarantees. [sent-18, score-0.036]
14 An algorithm is proposed in [12] that fits a factored model to observed data and makes decisions based on the fitted model. [sent-20, score-0.098]
15 The authors establish a sample complexity bound that is polynomial in the number of model parameters rather than the number of states, but the algorithm is computationally intractable because of the difficulty of solving factored MDPs. [sent-21, score-0.164]
16 Though this result is theoretically interesting, for most model classes of interest, the ε-covering-number is enormous since it typically grows exponentially in the number of free parameters. [sent-23, score-0.134]
17 Another recent paper [17] establishes a regret bound for an algorithm that applies to problems with continuous state spaces and Hölder-continuous rewards and transition kernels. [sent-24, score-0.437]
18 Though the results represent an interesting contribution to the literature, a couple of features of the regret bound weaken its practical implications. [sent-25, score-0.335]
19 First, regret grows linearly with the Hölder constant of the transition kernel, which for most contexts of practical relevance grows exponentially in the number of state variables. [sent-26, score-0.629]
20 Second, the dependence on time becomes arbitrarily close to linear as the dimension of the state space grows. [sent-27, score-0.108]
21 Reinforcement learning in linear systems with quadratic cost is treated in [1]. [sent-28, score-0.038]
22 The method proposed is shown to realize regret that grows with the square root of time. [sent-29, score-0.199]
23 Here, there are efficiency guarantees that scale gracefully with the number of state variables, but only under sparsity and other technical assumptions. [sent-32, score-0.221]
24 The most popular approach to generalization in the applied reinforcement learning literature involves fitting parameterized value functions. [sent-33, score-0.361]
25 Such approaches relate closely to supervised learning in that they learn functions from state to value, though a difference is that value is influenced by action and observed only through delayed feedback. [sent-34, score-0.399]
26 One advantage over model learning approaches is that, given a fitted value function, decisions can be made without solving a potentially intractable control problem. [sent-35, score-0.053]
27 We see this as a promising direction, though there currently is a lack of theoretical results that provide attractive bounds on learning time with value function generalization. [sent-36, score-0.05]
28 A relevant paper in this line of research is [15], which studies efficient reinforcement learning with value function generalization in the KWIK framework (see [16]) and reduces the efficient reinforcement learning problem to the efficient KWIK online regression problem. [sent-37, score-0.621]
29 Thus, though the result of [15] is interesting, it does not provide a provably efficient algorithm. [sent-39, score-0.083]
30 An important challenge that remains is to couple exploration and value function generalization in a provably effective way, and in particular, to establish sample and computational efficiency guarantees that scale gracefully with the planning horizon and model complexity. [sent-40, score-0.549]
31 To start with a simple context, we restrict our attention to deterministic systems that evolve over finite time horizons, and we consider episodic learning, in which an agent repeatedly interacts with the same system. [sent-42, score-0.278]
32 As a solution to the problem, we propose optimistic constraint propagation (OCP), a computationally efficient reinforcement learning algorithm designed to synthesize efficient exploration and value function generalization. [sent-43, score-0.749]
33 We establish that when the true value function Q⇤ lies within the hypothesis class Q, OCP selects optimal actions over all but at most dimE [Q] episodes. [sent-44, score-0.406]
34 Here, dimE denotes the eluder dimension, which quantifies complexity of the hypothesis class. [sent-45, score-0.227]
35 A corollary of this result is that regret is bounded by a function that is constant over time and linear in the problem horizon and eluder dimension. [sent-46, score-0.368]
36 To put our aforementioned result in perspective, it is useful to relate it to other lines of work. [sent-47, score-0.104]
37 Consider first the broad area of reinforcement learning algorithms that fit value functions, such as SARSA [19]. [sent-48, score-0.293]
38 Even with the most commonly used sort of hypothesis class Q, which is made up of linear combinations of fixed basis functions, and even when the hypothesis class contains the true value function Q⇤ , there are no guarantees that these algorithms will efficiently learn to make near-optimal decisions. [sent-49, score-0.347]
39 On the other hand, our result implies that OCP attains near-optimal performance in time that scales linearly with the number of basis functions. [sent-50, score-0.044]
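As a concrete illustration of this sort of hypothesis class, a candidate state-action value function that is a linear combination of fixed basis functions can be written as Qθ,t (x, a) = θᵀφt (x, a) for a fixed feature map φ. The sketch below only illustrates that representation and is not code from the paper; the name make_linear_q, the numpy-based interface, and the example feature map are assumptions.

```python
import numpy as np

# Sketch of a linear hypothesis class: a candidate Q-function is a linear
# combination of fixed basis functions, Q_theta(t, x, a) = theta . phi(t, x, a).
# The feature map `phi` is user-supplied; everything here is illustrative.

def make_linear_q(theta, phi):
    theta = np.asarray(theta, dtype=float)
    def q(t, x, a):
        return float(theta @ np.asarray(phi(t, x, a), dtype=float))
    return q

# Example with indicator features over three disjoint groups of states,
# in the spirit of the indicator-function hypothesis class in the abstract.
q = make_linear_q([1.0, 0.5, -0.2], lambda t, x, a: [x == 0, x == 1, x == 2])
```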
40 Now consider the more specialized context of a deterministic linear system with quadratic cost and a finite time horizon. [sent-51, score-0.162]
41 The analysis of [1] can be leveraged to produce regret bounds that scale exponentially in the number of state variables. [sent-52, score-0.309]
42 On the other hand, using a hypothesis space Q consisting of quadratic functions of state-action pairs, the results of this paper show that OCP behaves near optimally within time that scales quadratically in the number of state and action variables. [sent-53, score-0.457]
43 We also establish efficiency and asymptotic performance guarantees that apply to agnostic reinforcement learning, where Q⇤ does not necessarily lie in Q. [sent-54, score-0.544]
44 In particular, we consider the case where Q is the span of pre-specified indicator functions over disjoint sets. [sent-55, score-0.065]
45 Our results here add to the literature on agnostic reinforcement learning with such a hypothesis class [21, 25, 7, 26]. [sent-56, score-0.541]
46 Prior work in this area has produced interesting algorithms and insights, as well as bounds on performance loss associated with potential limits of convergence, but no convergence or efficiency guarantees. [sent-57, score-0.036]
47 2 Reinforcement Learning in Deterministic Systems In this paper, we consider an episodic reinforcement learning (RL) problem in which an agent repeatedly interacts with a discrete-time finite-horizon deterministic system, and refer to each interaction as an episode. [sent-58, score-0.571]
48 The system is identified by a sextuple M = (S, A, H, F, R, S), where S is the state space, A is the action space, H is the horizon, F is a system function, R is a reward function and S is a sequence of states. [sent-59, score-0.541]
49 If action a ∈ A is selected while the system is in state x ∈ S at period t = 0, 1, · · · , H − 1, a reward of Rt (x, a) is realized; furthermore, if t < H − 1, the state transitions to Ft (x, a). [sent-60, score-0.72]
50 Each episode terminates at period H − 1, and then a new episode begins. [sent-61, score-0.75]
51 The initial state of episode j is the jth element of S. [sent-62, score-0.481]
52 To represent the history of actions and observations over multiple episodes, we will often index variables by both episode and period. [sent-63, score-0.387]
53 For example, xj,t and aj,t denote the state and action at period t of episode j, where j = 0, 1, · · · and t = 0, 1, · · · , H − 1. [sent-64, score-0.75]
54 To count the total number of steps since the agent started learning, we say period t in episode j is time jH + t. [sent-65, score-0.499]
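To make the interaction protocol above concrete, here is a minimal sketch of a finite-horizon deterministic system and the episodic loop. It is not the paper's code; the names DeterministicSystem and run_episode and the list-of-callables representation of F and R are assumptions made for illustration.

```python
# Minimal sketch of the episodic deterministic setting described above.
# Names and the interface are illustrative assumptions, not from the paper.

class DeterministicSystem:
    def __init__(self, F, R, H, initial_states):
        self.F = F                            # F[t](x, a): next state, defined for t < H - 1
        self.R = R                            # R[t](x, a): reward at period t
        self.H = H                            # horizon
        self.initial_states = initial_states  # sequence S of initial states, one per episode

def run_episode(system, policy, j):
    """Run episode j with a policy given as a list of functions mu[t]: state -> action."""
    x = system.initial_states[j]
    total_reward = 0.0
    trajectory = []
    for t in range(system.H):
        a = policy[t](x)
        r = system.R[t](x, a)
        trajectory.append((x, a, r))
        total_reward += r
        if t < system.H - 1:
            x = system.F[t](x, a)             # deterministic transition
    return total_reward, trajectory
```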
55 A policy µ = (µ0 , . . . , µH−1 ) is a sequence of functions, each mapping S to A. [sent-69, score-0.044]
56 For each policy µ, define a value function Vtµ (x) = Σ_{τ=t}^{H−1} Rτ (xτ , aτ ), where xt = x, xτ+1 = Fτ (xτ , aτ ), and aτ = µτ (xτ ). [sent-70, score-0.063]
57 A policy µ⇤ is said to be optimal if V µ⇤ = V ⇤ . [sent-72, score-0.063]
58 Note that this restriction incurs no loss of generality when the action space is finite. [sent-74, score-0.203]
59 Then, a policy µ⇤ is optimal if µ⇤t (x) ∈ arg maxa∈A Q⇤t (x, a) for all (x, t). [sent-76, score-0.063]
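For small finite problems, Q⇤ , V ⇤ and an optimal policy can be computed by backward induction over periods, using Q⇤t (x, a) = Rt (x, a) + V ⇤t+1 (Ft (x, a)) and V ⇤t (x) = maxa Q⇤t (x, a). The sketch below assumes finite lists of states and actions and the F, R representation from the previous sketch; it is a standard dynamic-programming illustration, not an algorithm from the paper.

```python
# Backward-induction sketch for Q* and V* in a finite deterministic system.
# Assumes finite lists `states` and `actions`, with F[t] mapping into `states`.

def backward_induction(states, actions, F, R, H):
    V = {(H, x): 0.0 for x in states}          # V*_H = 0 by convention
    Q = {}
    for t in reversed(range(H)):
        for x in states:
            for a in actions:
                nxt = F[t](x, a) if t < H - 1 else None
                future = V[(t + 1, nxt)] if nxt is not None else 0.0
                Q[(t, x, a)] = R[t](x, a) + future
            V[(t, x)] = max(Q[(t, x, a)] for a in actions)
    # Greedy (hence optimal) policy: mu*_t(x) in argmax_a Q*_t(x, a).
    policy = [
        (lambda tt: (lambda x: max(actions, key=lambda a: Q[(tt, x, a)])))(t)
        for t in range(H)
    ]
    return Q, V, policy
```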
60 In each episode, the reinforcement learning algorithm realizes reward R(j) = Σ_{t=0}^{H−1} Rt (xj,t , aj,t ). [sent-78, score-0.12]
61 Note that R(j) ≤ V0⇤ (xj,0 ) for each episode j. [sent-79, score-0.062]
62 One way to quantify performance of a reinforcement learning algorithm is in terms of the number of episodes JL for which R(j) < V0⇤ (xj,0 ) − ε, where ε ≥ 0 is a pre-specified performance loss threshold. [sent-80, score-0.468]
63 If the reward function R is bounded, with |Rt (x, a)| ≤ R̄ for all (x, a, t), then this also implies a bound on regret over episodes experienced prior to time T , defined by Regret(T ) = Σ_{j=0}^{⌊T /H⌋−1} (V0⇤ (xj,0 ) − R(j) ). [sent-81, score-0.482]
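Both performance measures just defined are straightforward to compute from the per-episode rewards R(j) and the optimal initial values V0⇤ (xj,0 ); a minimal sketch, with illustrative function names:

```python
# Sketch of the two performance measures defined above; names are illustrative.

def count_poor_episodes(episode_rewards, optimal_values, eps=0.0):
    """Number of episodes j with R(j) < V*_0(x_{j,0}) - eps."""
    return sum(1 for r, v in zip(episode_rewards, optimal_values) if r < v - eps)

def regret(episode_rewards, optimal_values, T, H):
    """Regret(T): sum of V*_0(x_{j,0}) - R(j) over episodes j = 0, ..., floor(T/H) - 1."""
    n = T // H  # number of episodes experienced prior to time T (T, H integers)
    return sum(v - r for r, v in zip(episode_rewards[:n], optimal_values[:n]))
```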
64 Specifically, OCP takes as input the state space S, the action space A, the horizon H, and a hypothesis class Q of candidates for Q⇤ . [sent-84, score-0.557]
65 The algorithm maintains a sequence of subsets of Q and a sequence of scalar “upper bounds”, which summarize constraints that past experience suggests for ruling out hypotheses. [sent-85, score-0.213]
66 Each constraint in this sequence is specified by a state x ∈ S, an action a ∈ A, a period t = 0, . . . , H − 1, and a restriction on the value Qt (x, a). [sent-86, score-0.563]
67 Given a sequence C = (C1 , . . . , C|C| ) of such constraints and upper bounds U = (U1 , . . . , U|C| ), a set QC is defined constructively by Algorithm 1. [sent-94, score-0.098] [sent-97, score-0.037]
69 Note that if the constraints do not conflict then QC = C1 ∩ · · · ∩ C|C| . [sent-98, score-0.061]
70 When constraints do conflict, priority is assigned first based on upper bound, with smaller upper bound preferred, and then, in the event of ties in upper bound, based on position in the sequence, with more recent experience preferred. [sent-99, score-0.27]
71 [Fragment of Algorithm 1, which constructs QC : . . . then QC ← QC ∩ Cτ end if; end for; if {u′ ∈ U : u′ > u} = ∅ then return QC end if; u ← min{u′ ∈ U : u′ > u}; end while] [sent-101, score-0.082]
72 OCP, presented below as Algorithm 2, at each time t computes for the current state xj,t and each action a the greatest state-action value Qt (xj,t , a) among functions in QC and selects an action that attains the maximum. [sent-102, score-0.699]
73 In other words, an action is chosen based on the most optimistic feasible outcome subject to constraints. [sent-103, score-0.334]
74 The subsequent reward and state transition give rise to a new constraint that is used to update C. [sent-104, score-0.325]
75 Note that the update of C is postponed until one episode is completed. [sent-105, score-0.345]
76 In the agnostic case, where Q⇤ may not lie in Q, new constraints can be inconsistent with previous constraints, in which case selected previous constraints are relaxed as determined by Algorithm 1. [sent-108, score-0.235]
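The following rough sketch illustrates only the optimistic action-selection step just described, specialized to a finite hypothesis class represented as an explicit list of candidate Q-functions. The interval form of the constraints, the naive filtering used to stand in for QC , and the fallback when every candidate is ruled out are simplifying assumptions; this is not the paper's Algorithm 1 or 2, which in particular resolve conflicting constraints by the upper-bound priority rule described above.

```python
# Rough sketch of OCP's optimistic action selection for a finite hypothesis
# class, given as a list of candidate Q-functions q(t, x, a).  The interval
# constraints and the naive filtering below are simplifying stand-ins for the
# set QC of Algorithm 1; conflicting constraints are handled here by a crude
# fallback rather than the paper's upper-bound priority rule.

def feasible_candidates(hypotheses, constraints):
    """Keep candidates consistent with every constraint (x, a, t, lo, hi)."""
    return [q for q in hypotheses
            if all(lo <= q(t, x, a) <= hi for (x, a, t, lo, hi) in constraints)]

def optimistic_action(hypotheses, constraints, x, t, actions):
    """Pick an action with the greatest optimistic feasible value Q_t(x, a)."""
    qc = feasible_candidates(hypotheses, constraints) or list(hypotheses)
    return max(actions, key=lambda a: max(q(t, x, a) for q in qc))
```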
77 Let us briefly discuss several contexts of practical relevance and/or theoretical interest in which OCP can be applied. [sent-109, score-0.138]
78 With finite state and action spaces, Q⇤ can be represented as a vector, and without special prior knowledge, it is natural to let Q = ℜ^{|S|·|A|·H} . [sent-111, score-0.348]
79 Consider the aforementioned example, but suppose that we have prior knowledge that Q⇤ lies in a particular polytope. [sent-113, score-0.142]
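As a concrete illustration of the tabular special case mentioned above (finite S and A with Q = ℜ^{|S|·|A|·H}), a hypothesis is simply a real array indexed by (t, x, a). A minimal sketch, assuming numpy and integer-coded states and actions:

```python
import numpy as np

# Tabular hypothesis: with finite S and A and no special prior knowledge, a
# candidate Q-function is just a real array of shape (H, |S|, |A|).
# Integer labels for states and actions are an assumption for illustration.

def make_tabular_q(H, num_states, num_actions):
    table = np.zeros((H, num_states, num_actions))
    def q(t, x, a):
        return table[t, x, a]
    return table, q
```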
wordName wordTfidf (topN-words)
[('ocp', 0.406), ('episode', 0.311), ('qc', 0.31), ('reinforcement', 0.293), ('action', 0.203), ('episodes', 0.175), ('dime', 0.159), ('regret', 0.15), ('qt', 0.14), ('kwik', 0.135), ('optimistic', 0.131), ('period', 0.128), ('rt', 0.123), ('eluder', 0.119), ('hypothesis', 0.108), ('state', 0.108), ('exploration', 0.103), ('horizon', 0.099), ('rasa', 0.09), ('tabula', 0.09), ('reward', 0.086), ('establish', 0.085), ('constraint', 0.08), ('actions', 0.076), ('propagation', 0.076), ('ciency', 0.075), ('vt', 0.075), ('deterministic', 0.074), ('agnostic', 0.068), ('aforementioned', 0.066), ('synthesize', 0.066), ('experience', 0.064), ('contexts', 0.063), ('policy', 0.063), ('jth', 0.062), ('constraints', 0.061), ('episodic', 0.06), ('gracefully', 0.06), ('ict', 0.06), ('agent', 0.06), ('selects', 0.059), ('rewards', 0.055), ('decisions', 0.053), ('guarantees', 0.053), ('exponentially', 0.051), ('interacts', 0.051), ('transition', 0.051), ('system', 0.05), ('though', 0.05), ('grows', 0.049), ('lie', 0.045), ('factored', 0.045), ('therein', 0.045), ('sequence', 0.044), ('couple', 0.044), ('tted', 0.044), ('attains', 0.044), ('relevance', 0.041), ('end', 0.041), ('wen', 0.04), ('zhengwen', 0.04), ('import', 0.04), ('appended', 0.04), ('ble', 0.04), ('jh', 0.04), ('spaces', 0.039), ('lies', 0.039), ('class', 0.039), ('ft', 0.039), ('quadratic', 0.038), ('relate', 0.038), ('upper', 0.037), ('prior', 0.037), ('bvr', 0.037), ('weaken', 0.037), ('constructively', 0.037), ('nitehorizon', 0.037), ('sarsa', 0.037), ('transitions', 0.037), ('grow', 0.037), ('body', 0.037), ('planning', 0.037), ('interesting', 0.036), ('generalization', 0.035), ('enormous', 0.034), ('propagates', 0.034), ('postponed', 0.034), ('realizes', 0.034), ('horizons', 0.034), ('bound', 0.034), ('practical', 0.034), ('literature', 0.033), ('span', 0.033), ('repeatedly', 0.033), ('provably', 0.033), ('stanford', 0.033), ('focuses', 0.033), ('lder', 0.033), ('ph', 0.033), ('disjoint', 0.032)]
simIndex simValue paperId paperTitle
same-paper 1 1.0000007 103 nips-2013-Efficient Exploration and Value Function Generalization in Deterministic Systems
Author: Zheng Wen, Benjamin Van Roy
Abstract: We consider the problem of reinforcement learning over episodes of a finite-horizon deterministic system and as a solution propose optimistic constraint propagation (OCP), an algorithm designed to synthesize efficient exploration and value function generalization. We establish that when the true value function Q⇤ lies within the hypothesis class Q, OCP selects optimal actions over all but at most dimE [Q] episodes, where dimE denotes the eluder dimension. We establish further efficiency and asymptotic performance guarantees that apply even if Q⇤ does not lie in Q, for the special case where Q is the span of pre-specified indicator functions over disjoint sets. 1
2 0.27406028 106 nips-2013-Eluder Dimension and the Sample Complexity of Optimistic Exploration
Author: Dan Russo, Benjamin Van Roy
Abstract: This paper considers the sample complexity of the multi-armed bandit with dependencies among the arms. Some of the most successful algorithms for this problem use the principle of optimism in the face of uncertainty to guide exploration. The clearest example of this is the class of upper confidence bound (UCB) algorithms, but recent work has shown that a simple posterior sampling algorithm, sometimes called Thompson sampling, can be analyzed in the same manner as optimistic approaches. In this paper, we develop a regret bound that holds for both classes of algorithms. This bound applies broadly and can be specialized to many model classes. It depends on a new notion we refer to as the eluder dimension, which measures the degree of dependence among action rewards. Compared to UCB algorithm regret bounds for specific model classes, our general bound matches the best available for linear models and is stronger than the best available for generalized linear models. 1
3 0.23943385 1 nips-2013-(More) Efficient Reinforcement Learning via Posterior Sampling
Author: Ian Osband, Dan Russo, Benjamin Van Roy
Abstract: Most provably-efficient reinforcement learning algorithms introduce optimism about poorly-understood states and actions to encourage exploration. We study an alternative approach for efficient exploration: posterior sampling for reinforcement learning (PSRL). This algorithm proceeds in repeated episodes of known duration. At the start of each episode, PSRL updates a prior distribution over Markov decision processes and takes one sample from this posterior. PSRL then follows the policy that is optimal for this sample during the episode. The algorithm is conceptually simple, computationally efficient and allows an agent to encode prior knowledge in a natural way. We establish an Õ(τ S√(AT)) bound on expected regret, where T is time, τ is the episode length and S and A are the cardinalities of the state and action spaces. This bound is one of the first for an algorithm not based on optimism, and close to the state of the art for any reinforcement learning algorithm. We show through simulation that PSRL significantly outperforms existing algorithms with similar regret bounds. 1
4 0.23869917 235 nips-2013-Online learning in episodic Markovian decision processes by relative entropy policy search
Author: Alexander Zimin, Gergely Neu
Abstract: We study the problem of online learning in finite episodic Markov decision processes (MDPs) where the loss function is allowed to change between episodes. The natural performance measure in this learning problem is the regret defined as the difference between the total loss of the best stationary policy and the total loss suffered by the learner. We assume that the learner is given access to a finite action space A and the state space X has a layered structure with L layers, so that state transitions are only possible between consecutive layers. We describe a variant of the recently proposed Relative Entropy Policy Search algorithm and show that its regret after T episodes is 2 L|X ||A|T log(|X ||A|/L) in the bandit setting and 2L T log(|X ||A|/L) in the full information setting, given that the learner has perfect knowledge of the transition probabilities of the underlying MDP. These guarantees largely improve previously known results under much milder assumptions and cannot be significantly improved under general assumptions. 1
5 0.21412706 273 nips-2013-Reinforcement Learning in Robust Markov Decision Processes
Author: Shiau Hong Lim, Huan Xu, Shie Mannor
Abstract: An important challenge in Markov decision processes is to ensure robustness with respect to unexpected or adversarial system behavior while taking advantage of well-behaving parts of the system. We consider a problem setting where some unknown parts of the state space can have arbitrary transitions while other parts are purely stochastic. We devise an algorithm that is adaptive to potentially adversarial behavior and show that it achieves similar regret bounds as the purely stochastic case. 1
7 0.16283727 32 nips-2013-Aggregating Optimistic Planning Trees for Solving Markov Decision Processes
8 0.1536918 250 nips-2013-Policy Shaping: Integrating Human Feedback with Reinforcement Learning
9 0.14203168 50 nips-2013-Bayesian Mixture Modelling and Inference based Thompson Sampling in Monte-Carlo Tree Search
10 0.13716681 278 nips-2013-Reward Mapping for Transfer in Long-Lived Agents
11 0.13312626 292 nips-2013-Sequential Transfer in Multi-armed Bandit with Finite Set of Models
12 0.13273847 125 nips-2013-From Bandits to Experts: A Tale of Domination and Independence
13 0.12130663 239 nips-2013-Optimistic policy iteration and natural actor-critic: A unifying view and a non-optimality result
14 0.12110037 29 nips-2013-Adaptive Submodular Maximization in Bandit Setting
15 0.11948133 150 nips-2013-Learning Adaptive Value of Information for Structured Prediction
16 0.11881201 24 nips-2013-Actor-Critic Algorithms for Risk-Sensitive MDPs
17 0.11537497 231 nips-2013-Online Learning with Switching Costs and Other Adaptive Adversaries
18 0.11397549 28 nips-2013-Adaptive Step-Size for Policy Gradient Methods
19 0.10889183 338 nips-2013-Two-Target Algorithms for Infinite-Armed Bandits with Bernoulli Rewards
20 0.10858692 230 nips-2013-Online Learning with Costly Features and Labels
topicId topicWeight
[(0, 0.215), (1, -0.29), (2, 0.032), (3, -0.037), (4, -0.017), (5, -0.093), (6, 0.076), (7, -0.065), (8, -0.078), (9, -0.017), (10, 0.009), (11, 0.045), (12, 0.032), (13, -0.042), (14, -0.005), (15, 0.048), (16, 0.055), (17, -0.038), (18, 0.045), (19, -0.113), (20, -0.1), (21, 0.041), (22, -0.015), (23, 0.026), (24, 0.074), (25, 0.049), (26, -0.095), (27, -0.145), (28, -0.029), (29, 0.055), (30, 0.002), (31, -0.003), (32, -0.049), (33, 0.125), (34, -0.117), (35, -0.073), (36, -0.091), (37, 0.036), (38, 0.03), (39, -0.036), (40, 0.032), (41, -0.022), (42, -0.025), (43, 0.066), (44, 0.043), (45, 0.02), (46, -0.056), (47, -0.036), (48, -0.052), (49, -0.026)]
simIndex simValue paperId paperTitle
same-paper 1 0.95203757 103 nips-2013-Efficient Exploration and Value Function Generalization in Deterministic Systems
Author: Zheng Wen, Benjamin Van Roy
Abstract: We consider the problem of reinforcement learning over episodes of a finite-horizon deterministic system and as a solution propose optimistic constraint propagation (OCP), an algorithm designed to synthesize efficient exploration and value function generalization. We establish that when the true value function Q⇤ lies within the hypothesis class Q, OCP selects optimal actions over all but at most dimE [Q] episodes, where dimE denotes the eluder dimension. We establish further efficiency and asymptotic performance guarantees that apply even if Q⇤ does not lie in Q, for the special case where Q is the span of pre-specified indicator functions over disjoint sets. 1
2 0.86021596 1 nips-2013-(More) Efficient Reinforcement Learning via Posterior Sampling
Author: Ian Osband, Dan Russo, Benjamin Van Roy
Abstract: Most provably-efficient reinforcement learning algorithms introduce optimism about poorly-understood states and actions to encourage exploration. We study an alternative approach for efficient exploration: posterior sampling for reinforcement learning (PSRL). This algorithm proceeds in repeated episodes of known duration. At the start of each episode, PSRL updates a prior distribution over Markov decision processes and takes one sample from this posterior. PSRL then follows the policy that is optimal for this sample during the episode. The algorithm is conceptually simple, computationally efficient and allows an agent to encode prior knowledge in a natural way. We establish an Õ(τ S√(AT)) bound on expected regret, where T is time, τ is the episode length and S and A are the cardinalities of the state and action spaces. This bound is one of the first for an algorithm not based on optimism, and close to the state of the art for any reinforcement learning algorithm. We show through simulation that PSRL significantly outperforms existing algorithms with similar regret bounds. 1
3 0.78028661 270 nips-2013-Regret based Robust Solutions for Uncertain Markov Decision Processes
Author: Asrar Ahmed, Pradeep Varakantham, Yossiri Adulyasak, Patrick Jaillet
Abstract: In this paper, we seek robust policies for uncertain Markov Decision Processes (MDPs). Most robust optimization approaches for these problems have focussed on the computation of maximin policies which maximize the value corresponding to the worst realization of the uncertainty. Recent work has proposed minimax regret as a suitable alternative to the maximin objective for robust optimization. However, existing algorithms for handling minimax regret are restricted to models with uncertainty over rewards only. We provide algorithms that employ sampling to improve across multiple dimensions: (a) Handle uncertainties over both transition and reward models; (b) Dependence of model uncertainties across state, action pairs and decision epochs; (c) Scalability and quality bounds. Finally, to demonstrate the empirical effectiveness of our sampling approaches, we provide comparisons against benchmark algorithms on two domains from literature. We also provide a Sample Average Approximation (SAA) analysis to compute a posteriori error bounds.
4 0.77557868 106 nips-2013-Eluder Dimension and the Sample Complexity of Optimistic Exploration
Author: Dan Russo, Benjamin Van Roy
Abstract: This paper considers the sample complexity of the multi-armed bandit with dependencies among the arms. Some of the most successful algorithms for this problem use the principle of optimism in the face of uncertainty to guide exploration. The clearest example of this is the class of upper confidence bound (UCB) algorithms, but recent work has shown that a simple posterior sampling algorithm, sometimes called Thompson sampling, can be analyzed in the same manner as optimistic approaches. In this paper, we develop a regret bound that holds for both classes of algorithms. This bound applies broadly and can be specialized to many model classes. It depends on a new notion we refer to as the eluder dimension, which measures the degree of dependence among action rewards. Compared to UCB algorithm regret bounds for specific model classes, our general bound matches the best available for linear models and is stronger than the best available for generalized linear models. 1
5 0.7615782 273 nips-2013-Reinforcement Learning in Robust Markov Decision Processes
Author: Shiau Hong Lim, Huan Xu, Shie Mannor
Abstract: An important challenge in Markov decision processes is to ensure robustness with respect to unexpected or adversarial system behavior while taking advantage of well-behaving parts of the system. We consider a problem setting where some unknown parts of the state space can have arbitrary transitions while other parts are purely stochastic. We devise an algorithm that is adaptive to potentially adversarial behavior and show that it achieves similar regret bounds as the purely stochastic case. 1
6 0.73934406 235 nips-2013-Online learning in episodic Markovian decision processes by relative entropy policy search
7 0.71973783 32 nips-2013-Aggregating Optimistic Planning Trees for Solving Markov Decision Processes
8 0.66126209 50 nips-2013-Bayesian Mixture Modelling and Inference based Thompson Sampling in Monte-Carlo Tree Search
9 0.64443797 250 nips-2013-Policy Shaping: Integrating Human Feedback with Reinforcement Learning
10 0.63361168 323 nips-2013-Synthesizing Robust Plans under Incomplete Domain Models
11 0.62591189 227 nips-2013-Online Learning in Markov Decision Processes with Adversarially Chosen Transition Probability Distributions
12 0.54886073 278 nips-2013-Reward Mapping for Transfer in Long-Lived Agents
13 0.54228473 253 nips-2013-Prior-free and prior-dependent regret bounds for Thompson Sampling
14 0.52481979 231 nips-2013-Online Learning with Switching Costs and Other Adaptive Adversaries
15 0.5085488 347 nips-2013-Variational Planning for Graph-based MDPs
16 0.50417292 125 nips-2013-From Bandits to Experts: A Tale of Domination and Independence
17 0.5009371 322 nips-2013-Symbolic Opportunistic Policy Iteration for Factored-Action MDPs
18 0.48808873 230 nips-2013-Online Learning with Costly Features and Labels
19 0.48748761 292 nips-2013-Sequential Transfer in Multi-armed Bandit with Finite Set of Models
20 0.47334722 140 nips-2013-Improved and Generalized Upper Bounds on the Complexity of Policy Iteration
topicId topicWeight
[(1, 0.017), (13, 0.172), (16, 0.028), (33, 0.098), (34, 0.108), (41, 0.053), (49, 0.04), (56, 0.271), (70, 0.023), (85, 0.043), (89, 0.024), (93, 0.028), (95, 0.014)]
simIndex simValue paperId paperTitle
1 0.91812474 78 nips-2013-Curvature and Optimal Algorithms for Learning and Minimizing Submodular Functions
Author: Rishabh K. Iyer, Stefanie Jegelka, Jeff A. Bilmes
Abstract: We investigate three related and important problems connected to machine learning: approximating a submodular function everywhere, learning a submodular function (in a PAC-like setting [28]), and constrained minimization of submodular functions. We show that the complexity of all three problems depends on the “curvature” of the submodular function, and provide lower and upper bounds that refine and improve previous results [2, 6, 8, 27]. Our proof techniques are fairly generic. We either use a black-box transformation of the function (for approximation and learning), or a transformation of algorithms to use an appropriate surrogate function (for minimization). Curiously, curvature has been known to influence approximations for submodular maximization [3, 29], but its effect on minimization, approximation and learning has hitherto been open. We complete this picture, and also support our theoretical claims by empirical results. 1
same-paper 2 0.91702366 103 nips-2013-Efficient Exploration and Value Function Generalization in Deterministic Systems
Author: Zheng Wen, Benjamin Van Roy
Abstract: We consider the problem of reinforcement learning over episodes of a finite-horizon deterministic system and as a solution propose optimistic constraint propagation (OCP), an algorithm designed to synthesize efficient exploration and value function generalization. We establish that when the true value function Q⇤ lies within the hypothesis class Q, OCP selects optimal actions over all but at most dimE [Q] episodes, where dimE denotes the eluder dimension. We establish further efficiency and asymptotic performance guarantees that apply even if Q⇤ does not lie in Q, for the special case where Q is the span of pre-specified indicator functions over disjoint sets. 1
3 0.8706798 29 nips-2013-Adaptive Submodular Maximization in Bandit Setting
Author: Victor Gabillon, Branislav Kveton, Zheng Wen, Brian Eriksson, S. Muthukrishnan
Abstract: Maximization of submodular functions has wide applications in machine learning and artificial intelligence. Adaptive submodular maximization has been traditionally studied under the assumption that the model of the world, the expected gain of choosing an item given previously selected items and their states, is known. In this paper, we study the setting where the expected gain is initially unknown, and it is learned by interacting repeatedly with the optimized function. We propose an efficient algorithm for solving our problem and prove that its expected cumulative regret increases logarithmically with time. Our regret bound captures the inherent property of submodular maximization, earlier mistakes are more costly than later ones. We refer to our approach as Optimistic Adaptive Submodular Maximization (OASM) because it trades off exploration and exploitation based on the optimism in the face of uncertainty principle. We evaluate our method on a preference elicitation problem and show that non-trivial K-step policies can be learned from just a few hundred interactions with the problem. 1
4 0.86478472 17 nips-2013-A multi-agent control framework for co-adaptation in brain-computer interfaces
Author: Josh S. Merel, Roy Fox, Tony Jebara, Liam Paninski
Abstract: In a closed-loop brain-computer interface (BCI), adaptive decoders are used to learn parameters suited to decoding the user’s neural response. Feedback to the user provides information which permits the neural tuning to also adapt. We present an approach to model this process of co-adaptation between the encoding model of the neural signal and the decoding algorithm as a multi-agent formulation of the linear quadratic Gaussian (LQG) control problem. In simulation we characterize how decoding performance improves as the neural encoding and adaptive decoder optimize, qualitatively resembling experimentally demonstrated closed-loop improvement. We then propose a novel, modified decoder update rule which is aware of the fact that the encoder is also changing and show it can improve simulated co-adaptation dynamics. Our modeling approach offers promise for gaining insights into co-adaptation as well as improving user learning of BCI control in practical settings.
5 0.85992903 3 nips-2013-A* Lasso for Learning a Sparse Bayesian Network Structure for Continuous Variables
Author: Jing Xiang, Seyoung Kim
Abstract: We address the problem of learning a sparse Bayesian network structure for continuous variables in a high-dimensional space. The constraint that the estimated Bayesian network structure must be a directed acyclic graph (DAG) makes the problem challenging because of the huge search space of network structures. Most previous methods were based on a two-stage approach that prunes the search space in the first stage and then searches for a network structure satisfying the DAG constraint in the second stage. Although this approach is effective in a lowdimensional setting, it is difficult to ensure that the correct network structure is not pruned in the first stage in a high-dimensional setting. In this paper, we propose a single-stage method, called A* lasso, that recovers the optimal sparse Bayesian network structure by solving a single optimization problem with A* search algorithm that uses lasso in its scoring system. Our approach substantially improves the computational efficiency of the well-known exact methods based on dynamic programming. We also present a heuristic scheme that further improves the efficiency of A* lasso without significantly compromising the quality of solutions. We demonstrate our approach on data simulated from benchmark Bayesian networks and real data. 1
6 0.85750353 319 nips-2013-Submodular Optimization with Submodular Cover and Submodular Knapsack Constraints
7 0.85531354 269 nips-2013-Regression-tree Tuning in a Streaming Setting
8 0.85482222 155 nips-2013-Learning Hidden Markov Models from Non-sequence Data via Tensor Decomposition
9 0.85041404 106 nips-2013-Eluder Dimension and the Sample Complexity of Optimistic Exploration
10 0.85038531 108 nips-2013-Error-Minimizing Estimates and Universal Entry-Wise Error Bounds for Low-Rank Matrix Completion
11 0.84915888 235 nips-2013-Online learning in episodic Markovian decision processes by relative entropy policy search
12 0.84605604 227 nips-2013-Online Learning in Markov Decision Processes with Adversarially Chosen Transition Probability Distributions
13 0.83676046 32 nips-2013-Aggregating Optimistic Planning Trees for Solving Markov Decision Processes
14 0.83537209 199 nips-2013-More data speeds up training time in learning halfspaces over sparse vectors
15 0.83314013 170 nips-2013-Learning with Invariance via Linear Functionals on Reproducing Kernel Hilbert Space
16 0.82893008 2 nips-2013-(Nearly) Optimal Algorithms for Private Online Learning in Full-information and Bandit Settings
17 0.82693946 249 nips-2013-Polar Operators for Structured Sparse Estimation
18 0.82680005 230 nips-2013-Online Learning with Costly Features and Labels
19 0.8259384 31 nips-2013-Adaptivity to Local Smoothness and Dimension in Kernel Regression
20 0.82548821 1 nips-2013-(More) Efficient Reinforcement Learning via Posterior Sampling