nips nips2013 nips2013-103 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Zheng Wen, Benjamin Van Roy
Abstract: We consider the problem of reinforcement learning over episodes of a finite-horizon deterministic system and as a solution propose optimistic constraint propagation (OCP), an algorithm designed to synthesize efficient exploration and value function generalization. We establish that when the true value function Q⇤ lies within the hypothesis class Q, OCP selects optimal actions over all but at most dimE [Q] episodes, where dimE denotes the eluder dimension. We establish further efficiency and asymptotic performance guarantees that apply even if Q⇤ does not lie in Q, for the special case where Q is the span of pre-specified indicator functions over disjoint sets. 1
Reference: text
sentIndex sentText sentNum sentScore
1 Abstract We consider the problem of reinforcement learning over episodes of a finite-horizon deterministic system and as a solution propose optimistic constraint propagation (OCP), an algorithm designed to synthesize efficient exploration and value function generalization. [sent-3, score-1.048]
2 We establish that when the true value function Q⇤ lies within the hypothesis class Q, OCP selects optimal actions over all but at most dimE [Q] episodes, where dimE denotes the eluder dimension. [sent-4, score-0.525]
3 We establish further efficiency and asymptotic performance guarantees that apply even if Q⇤ does not lie in Q, for the special case where Q is the span of pre-specified indicator functions over disjoint sets. [sent-5, score-0.248]
4 1 Introduction A growing body of work on efficient reinforcement learning provides algorithms with guarantees on sample and computational efficiency [13, 6, 2, 22, 4, 9]. [sent-6, score-0.383]
5 This literature highlights the point that an effective exploration scheme is critical to the design of any efficient reinforcement learning algorithm. [sent-7, score-0.429]
6 In particular, popular exploration schemes such as ε-greedy, Boltzmann, and knowledge gradient can require learning times that grow exponentially in the number of states and/or the planning horizon. [sent-8, score-0.228]
7 The aforementioned literature focuses on tabula rasa learning; that is, algorithms aim to learn with little or no prior knowledge about transition probabilities and rewards. [sent-9, score-0.4]
8 Such algorithms require learning times that grow at least linearly with the number of states. [sent-10, score-0.037]
9 Despite the valuable insights that have been generated through their design and analysis, these algorithms are of limited practical import because state spaces in most contexts of practical interest are enormous. [sent-11, score-0.318]
10 There is a need for algorithms that generalize from past experience in order to learn how to make effective decisions in reasonable time. [sent-12, score-0.117]
11 There has been much work on reinforcement learning algorithms that generalize (see, e.g., …). [sent-13, score-0.293]
12 Most of these algorithms do not come with statistical or computational efficiency guarantees, though there are a few noteworthy exceptions, which we now discuss. [sent-16, score-0.05]
13 Though interesting results have been produced in this line of work, each entails quite restrictive assumptions or does not make strong guarantees. [sent-18, score-0.036]
14 An algorithm is proposed in [12] that fits a factored model to observed data and makes decisions based on the fitted model. [sent-20, score-0.098]
15 The authors establish a sample complexity bound that is polynomial in the number of model parameters rather than the number of states, but the algorithm is computationally intractable because of the difficulty of solving factored MDPs. [sent-21, score-0.164]
16 Though this result is theoretically interesting, for most model classes of interest, the ε-covering-number is enormous since it typically grows exponentially in the number of free parameters. [sent-23, score-0.134]
17 Another recent paper [17] establishes a regret bound for an algorithm that applies to problems with continuous state spaces and Hölder-continuous rewards and transition kernels. [sent-24, score-0.437]
18 Though the results represent an interesting contribution to the literature, a couple of features of the regret bound weaken its practical implications. [sent-25, score-0.335]
19 First, regret grows linearly with the Hölder constant of the transition kernel, which for most contexts of practical relevance grows exponentially in the number of state variables. [sent-26, score-0.629]
20 Second, the dependence on time becomes arbitrarily close to linear as the dimension of the state space grows. [sent-27, score-0.108]
21 Reinforcement learning in linear systems with quadratic cost is treated in [1]. [sent-28, score-0.038]
22 The method proposed is shown to realize regret that grows with the square root of time. [sent-29, score-0.199]
23 Here, there are efficiency guarantees that scale gracefully with the number of state variables, but only under sparsity and other technical assumptions. [sent-32, score-0.221]
24 The most popular approach to generalization in the applied reinforcement learning literature involves fitting parameterized value functions. [sent-33, score-0.361]
25 Such approaches relate closely to supervised learning in that they learn functions from state to value, though a difference is that value is influenced by action and observed only through delayed feedback. [sent-34, score-0.399]
26 One advantage over model learning approaches is that, given a fitted value function, decisions can be made without solving a potentially intractable control problem. [sent-35, score-0.053]
27 We see this as a promising direction, though there currently is a lack of theoretical results that provide attractive bounds on learning time with value function generalization. [sent-36, score-0.05]
28 A relevant paper in this line of research is [15], which studies efficient reinforcement learning with value function generalization in the KWIK framework (see [16]) and reduces the efficient reinforcement learning problem to the efficient KWIK online regression problem. [sent-37, score-0.621]
29 Thus, though the result of [15] is interesting, it does not provide a provably efficient algorithm. [sent-39, score-0.083]
30 An important challenge that remains is to couple exploration and value function generalization in a provably effective way, and in particular, to establish sample and computational efficiency guarantees that scale gracefully with the planning horizon and model complexity. [sent-40, score-0.549]
31 To start with a simple context, we restrict our attention to deterministic systems that evolve over finite time horizons, and we consider episodic learning, in which an agent repeatedly interacts with the same system. [sent-42, score-0.278]
32 As a solution to the problem, we propose optimistic constraint propagation (OCP), a computationally efficient reinforcement learning algorithm designed to synthesize efficient exploration and value function generalization. [sent-43, score-0.749]
33 We establish that when the true value function Q⇤ lies within the hypothesis class Q, OCP selects optimal actions over all but at most dimE [Q] episodes. [sent-44, score-0.406]
34 Here, dimE denotes the eluder dimension, which quantifies complexity of the hypothesis class. [sent-45, score-0.227]
35 A corollary of this result is that regret is bounded by a function that is constant over time and linear in the problem horizon and eluder dimension. [sent-46, score-0.368]
36 To put our aforementioned result in perspective, it is useful to relate it to other lines of work. [sent-47, score-0.104]
37 Consider first the broad area of reinforcement learning algorithms that fit value functions, such as SARSA [19]. [sent-48, score-0.293]
38 Even with the most commonly used sort of hypothesis class Q, which is made up of linear combinations of fixed basis functions, and even when the hypothesis class contains the true value function Q⇤ , there are no guarantees that these algorithms will efficiently learn to make near-optimal decisions. [sent-49, score-0.347]
39 On the other hand, our result implies that OCP attains near-optimal performance in time that scales linearly with the number of basis functions. [sent-50, score-0.044]
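As a concrete illustration of this sort of hypothesis class, a candidate state-action value function that is a linear combination of fixed basis functions can be written as Qθ,t (x, a) = θᵀφt (x, a) for a fixed feature map φ. The sketch below only illustrates that representation and is not code from the paper; the name make_linear_q, the numpy-based interface, and the example feature map are assumptions.

```python
import numpy as np

# Sketch of a linear hypothesis class: a candidate Q-function is a linear
# combination of fixed basis functions, Q_theta(t, x, a) = theta . phi(t, x, a).
# The feature map `phi` is user-supplied; everything here is illustrative.

def make_linear_q(theta, phi):
    theta = np.asarray(theta, dtype=float)
    def q(t, x, a):
        return float(theta @ np.asarray(phi(t, x, a), dtype=float))
    return q

# Example with indicator features over three disjoint groups of states,
# in the spirit of the indicator-function hypothesis class in the abstract.
q = make_linear_q([1.0, 0.5, -0.2], lambda t, x, a: [x == 0, x == 1, x == 2])
```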
40 Now consider the more specialized context of a deterministic linear system with quadratic cost and a finite time horizon. [sent-51, score-0.162]
41 The analysis of [1] can be leveraged to produce regret bounds that scale exponentially in the number of state variables. [sent-52, score-0.309]
42 On the other hand, using a hypothesis space Q consisting of quadratic functions of state-action pairs, the results of this paper show that OCP behaves near optimally within time that scales quadratically in the number of state and action variables. [sent-53, score-0.457]
43 We also establish efficiency and asymptotic performance guarantees that apply to agnostic reinforcement learning, where Q⇤ does not necessarily lie in Q. [sent-54, score-0.544]
44 In particular, we consider the case where Q is the span of pre-specified indicator functions over disjoint sets. [sent-55, score-0.065]
45 Our results here add to the literature on agnostic reinforcement learning with such a hypothesis class [21, 25, 7, 26]. [sent-56, score-0.541]
46 Prior work in this area has produced interesting algorithms and insights, as well as bounds on performance loss associated with potential limits of convergence, but no convergence or efficiency guarantees. [sent-57, score-0.036]
47 2 Reinforcement Learning in Deterministic Systems In this paper, we consider an episodic reinforcement learning (RL) problem in which an agent repeatedly interacts with a discrete-time finite-horizon deterministic system, and refer to each interaction as an episode. [sent-58, score-0.571]
48 The system is identified by a sextuple M = (S, A, H, F, R, S), where S is the state space, A is the action space, H is the horizon, F is a system function, R is a reward function and S is a sequence of states. [sent-59, score-0.541]
49 If action a ∈ A is selected while the system is in state x ∈ S at period t = 0, 1, · · · , H − 1, a reward of Rt (x, a) is realized; furthermore, if t < H − 1, the state transitions to Ft (x, a). [sent-60, score-0.72]
50 Each episode terminates at period H − 1, and then a new episode begins. [sent-61, score-0.75]
51 The initial state of episode j is the jth element of S. [sent-62, score-0.481]
52 To represent the history of actions and observations over multiple episodes, we will often index variables by both episode and period. [sent-63, score-0.387]
53 For example, xj,t and aj,t denote the state and action at period t of episode j, where j = 0, 1, · · · and t = 0, 1, · · · , H − 1. [sent-64, score-0.75]
54 To count the total number of steps since the agent started learning, we say period t in episode j is time jH + t. [sent-65, score-0.499]
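To make the interaction protocol above concrete, here is a minimal sketch of a finite-horizon deterministic system and the episodic loop. It is not the paper's code; the names DeterministicSystem and run_episode and the list-of-callables representation of F and R are assumptions made for illustration.

```python
# Minimal sketch of the episodic deterministic setting described above.
# Names and the interface are illustrative assumptions, not from the paper.

class DeterministicSystem:
    def __init__(self, F, R, H, initial_states):
        self.F = F                            # F[t](x, a): next state, defined for t < H - 1
        self.R = R                            # R[t](x, a): reward at period t
        self.H = H                            # horizon
        self.initial_states = initial_states  # sequence S of initial states, one per episode

def run_episode(system, policy, j):
    """Run episode j with a policy given as a list of functions mu[t]: state -> action."""
    x = system.initial_states[j]
    total_reward = 0.0
    trajectory = []
    for t in range(system.H):
        a = policy[t](x)
        r = system.R[t](x, a)
        trajectory.append((x, a, r))
        total_reward += r
        if t < system.H - 1:
            x = system.F[t](x, a)             # deterministic transition
    return total_reward, trajectory
```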
55 A policy µ = (µ0 , . . . , µH−1 ) is a sequence of functions, each mapping S to A. [sent-69, score-0.044]
56 For each policy µ, define a value function Vtµ (x) = Σ_{τ=t}^{H−1} Rτ (xτ , aτ ), where xt = x, xτ+1 = Fτ (xτ , aτ ), and aτ = µτ (xτ ). [sent-70, score-0.063]
57 A policy µ⇤ is said to be optimal if V µ⇤ = V ⇤ . [sent-72, score-0.063]
58 Note that this restriction incurs no loss of generality when the action space is finite. [sent-74, score-0.203]
59 Then, a policy µ⇤ is optimal if µ⇤t (x) ∈ arg maxa∈A Q⇤t (x, a) for all (x, t). [sent-76, score-0.063]
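For small finite problems, Q⇤ , V ⇤ and an optimal policy can be computed by backward induction over periods, using Q⇤t (x, a) = Rt (x, a) + V ⇤t+1 (Ft (x, a)) and V ⇤t (x) = maxa Q⇤t (x, a). The sketch below assumes finite lists of states and actions and the F, R representation from the previous sketch; it is a standard dynamic-programming illustration, not an algorithm from the paper.

```python
# Backward-induction sketch for Q* and V* in a finite deterministic system.
# Assumes finite lists `states` and `actions`, with F[t] mapping into `states`.

def backward_induction(states, actions, F, R, H):
    V = {(H, x): 0.0 for x in states}          # V*_H = 0 by convention
    Q = {}
    for t in reversed(range(H)):
        for x in states:
            for a in actions:
                nxt = F[t](x, a) if t < H - 1 else None
                future = V[(t + 1, nxt)] if nxt is not None else 0.0
                Q[(t, x, a)] = R[t](x, a) + future
            V[(t, x)] = max(Q[(t, x, a)] for a in actions)
    # Greedy (hence optimal) policy: mu*_t(x) in argmax_a Q*_t(x, a).
    policy = [
        (lambda tt: (lambda x: max(actions, key=lambda a: Q[(tt, x, a)])))(t)
        for t in range(H)
    ]
    return Q, V, policy
```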
60 In each episode, the reinforcement learning algorithm realizes reward R(j) = Σ_{t=0}^{H−1} Rt (xj,t , aj,t ). [sent-78, score-0.12]
61 Note that R(j) ≤ V0⇤ (xj,0 ) for each episode j. [sent-79, score-0.062]
62 One way to quantify performance of a reinforcement learning algorithm is in terms of the number of episodes JL for which R(j) < V0⇤ (xj,0 ) − ε, where ε ≥ 0 is a pre-specified performance loss threshold. [sent-80, score-0.468]
63 If the reward function R is bounded, with |Rt (x, a)| ≤ R̄ for all (x, a, t), then this also implies a bound on regret over episodes experienced prior to time T , defined by Regret(T ) = Σ_{j=0}^{⌊T /H⌋−1} (V0⇤ (xj,0 ) − R(j) ). [sent-81, score-0.482]
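Both performance measures just defined are straightforward to compute from the per-episode rewards R(j) and the optimal initial values V0⇤ (xj,0 ); a minimal sketch, with illustrative function names:

```python
# Sketch of the two performance measures defined above; names are illustrative.

def count_poor_episodes(episode_rewards, optimal_values, eps=0.0):
    """Number of episodes j with R(j) < V*_0(x_{j,0}) - eps."""
    return sum(1 for r, v in zip(episode_rewards, optimal_values) if r < v - eps)

def regret(episode_rewards, optimal_values, T, H):
    """Regret(T): sum of V*_0(x_{j,0}) - R(j) over episodes j = 0, ..., floor(T/H) - 1."""
    n = T // H  # number of episodes experienced prior to time T (T, H integers)
    return sum(v - r for r, v in zip(episode_rewards[:n], optimal_values[:n]))
```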
64 Specifically, OCP takes as input the state space S, the action space A, the horizon H, and a hypothesis class Q of candidates for Q⇤ . [sent-84, score-0.557]
65 The algorithm maintains a sequence of subsets of Q and a sequence of scalar “upper bounds”, which summarize constraints that past experience suggests for ruling out hypotheses. [sent-85, score-0.213]
66 Each constraint in this sequence is specified by a state x ∈ S, an action a ∈ A, a period t = 0, . . . , H − 1, and a restriction on the value Qt (x, a). [sent-86, score-0.563]
67 Given a sequence C = (C1 , . . . , C|C| ) of such constraints and upper bounds U = (U1 , . . . , U|C| ), a set QC is defined constructively by Algorithm 1. [sent-94, score-0.098] [sent-97, score-0.037]
69 Note that if the constraints do not conflict then QC = C1 ∩ · · · ∩ C|C| . [sent-98, score-0.061]
70 When constraints do conflict, priority is assigned first based on upper bound, with smaller upper bound preferred, and then, in the event of ties in upper bound, based on position in the sequence, with more recent experience preferred. [sent-99, score-0.27]
71 [Fragment of Algorithm 1, which constructs QC : . . . then QC ← QC ∩ Cτ end if; end for; if {u′ ∈ U : u′ > u} = ∅ then return QC end if; u ← min{u′ ∈ U : u′ > u}; end while] [sent-101, score-0.082]
72 OCP, presented below as Algorithm 2, at each time t computes for the current state xj,t and each action a the greatest state-action value Qt (xj,t , a) among functions in QC and selects an action that attains the maximum. [sent-102, score-0.699]
73 In other words, an action is chosen based on the most optimistic feasible outcome subject to constraints. [sent-103, score-0.334]
74 The subsequent reward and state transition give rise to a new constraint that is used to update C. [sent-104, score-0.325]
75 Note that the update of C is postponed until one episode is completed. [sent-105, score-0.345]
76 In the agnostic case, where Q⇤ may not lie in Q, new constraints can be inconsistent with previous constraints, in which case selected previous constraints are relaxed as determined by Algorithm 1. [sent-108, score-0.235]
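The following rough sketch illustrates only the optimistic action-selection step just described, specialized to a finite hypothesis class represented as an explicit list of candidate Q-functions. The interval form of the constraints, the naive filtering used to stand in for QC , and the fallback when every candidate is ruled out are simplifying assumptions; this is not the paper's Algorithm 1 or 2, which in particular resolve conflicting constraints by the upper-bound priority rule described above.

```python
# Rough sketch of OCP's optimistic action selection for a finite hypothesis
# class, given as a list of candidate Q-functions q(t, x, a).  The interval
# constraints and the naive filtering below are simplifying stand-ins for the
# set QC of Algorithm 1; conflicting constraints are handled here by a crude
# fallback rather than the paper's upper-bound priority rule.

def feasible_candidates(hypotheses, constraints):
    """Keep candidates consistent with every constraint (x, a, t, lo, hi)."""
    return [q for q in hypotheses
            if all(lo <= q(t, x, a) <= hi for (x, a, t, lo, hi) in constraints)]

def optimistic_action(hypotheses, constraints, x, t, actions):
    """Pick an action with the greatest optimistic feasible value Q_t(x, a)."""
    qc = feasible_candidates(hypotheses, constraints) or list(hypotheses)
    return max(actions, key=lambda a: max(q(t, x, a) for q in qc))
```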
77 Let us briefly discuss several contexts of practical relevance and/or theoretical interest in which OCP can be applied. [sent-109, score-0.138]
78 With finite state and action spaces, Q⇤ can be represented as a vector, and without special prior knowledge, it is natural to let Q = ℜ^{|S|·|A|·H} . [sent-111, score-0.348]
79 Consider the aforementioned example, but suppose that we have prior knowledge that Q⇤ lies in a particular polytope. [sent-113, score-0.142]
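As a concrete illustration of the tabular special case mentioned above (finite S and A with Q = ℜ^{|S|·|A|·H}), a hypothesis is simply a real array indexed by (t, x, a). A minimal sketch, assuming numpy and integer-coded states and actions:

```python
import numpy as np

# Tabular hypothesis: with finite S and A and no special prior knowledge, a
# candidate Q-function is just a real array of shape (H, |S|, |A|).
# Integer labels for states and actions are an assumption for illustration.

def make_tabular_q(H, num_states, num_actions):
    table = np.zeros((H, num_states, num_actions))
    def q(t, x, a):
        return table[t, x, a]
    return table, q
```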
wordName wordTfidf (topN-words)
[('ocp', 0.406), ('episode', 0.311), ('qc', 0.31), ('reinforcement', 0.293), ('action', 0.203), ('episodes', 0.175), ('dime', 0.159), ('regret', 0.15), ('qt', 0.14), ('kwik', 0.135), ('optimistic', 0.131), ('period', 0.128), ('rt', 0.123), ('eluder', 0.119), ('hypothesis', 0.108), ('state', 0.108), ('exploration', 0.103), ('horizon', 0.099), ('rasa', 0.09), ('tabula', 0.09), ('reward', 0.086), ('establish', 0.085), ('constraint', 0.08), ('actions', 0.076), ('propagation', 0.076), ('ciency', 0.075), ('vt', 0.075), ('deterministic', 0.074), ('agnostic', 0.068), ('aforementioned', 0.066), ('synthesize', 0.066), ('experience', 0.064), ('contexts', 0.063), ('policy', 0.063), ('jth', 0.062), ('constraints', 0.061), ('episodic', 0.06), ('gracefully', 0.06), ('ict', 0.06), ('agent', 0.06), ('selects', 0.059), ('rewards', 0.055), ('decisions', 0.053), ('guarantees', 0.053), ('exponentially', 0.051), ('interacts', 0.051), ('transition', 0.051), ('system', 0.05), ('though', 0.05), ('grows', 0.049), ('lie', 0.045), ('factored', 0.045), ('therein', 0.045), ('sequence', 0.044), ('couple', 0.044), ('tted', 0.044), ('attains', 0.044), ('relevance', 0.041), ('end', 0.041), ('wen', 0.04), ('zhengwen', 0.04), ('import', 0.04), ('appended', 0.04), ('ble', 0.04), ('jh', 0.04), ('spaces', 0.039), ('lies', 0.039), ('class', 0.039), ('ft', 0.039), ('quadratic', 0.038), ('relate', 0.038), ('upper', 0.037), ('prior', 0.037), ('bvr', 0.037), ('weaken', 0.037), ('constructively', 0.037), ('nitehorizon', 0.037), ('sarsa', 0.037), ('transitions', 0.037), ('grow', 0.037), ('body', 0.037), ('planning', 0.037), ('interesting', 0.036), ('generalization', 0.035), ('enormous', 0.034), ('propagates', 0.034), ('postponed', 0.034), ('realizes', 0.034), ('horizons', 0.034), ('bound', 0.034), ('practical', 0.034), ('literature', 0.033), ('span', 0.033), ('repeatedly', 0.033), ('provably', 0.033), ('stanford', 0.033), ('focuses', 0.033), ('lder', 0.033), ('ph', 0.033), ('disjoint', 0.032)]
simIndex simValue paperId paperTitle
same-paper 1 1.0000007 103 nips-2013-Efficient Exploration and Value Function Generalization in Deterministic Systems
Author: Zheng Wen, Benjamin Van Roy
Abstract: We consider the problem of reinforcement learning over episodes of a finite-horizon deterministic system and as a solution propose optimistic constraint propagation (OCP), an algorithm designed to synthesize efficient exploration and value function generalization. We establish that when the true value function Q⇤ lies within the hypothesis class Q, OCP selects optimal actions over all but at most dimE [Q] episodes, where dimE denotes the eluder dimension. We establish further efficiency and asymptotic performance guarantees that apply even if Q⇤ does not lie in Q, for the special case where Q is the span of pre-specified indicator functions over disjoint sets. 1
2 0.27406028 106 nips-2013-Eluder Dimension and the Sample Complexity of Optimistic Exploration
Author: Dan Russo, Benjamin Van Roy
Abstract: This paper considers the sample complexity of the multi-armed bandit with dependencies among the arms. Some of the most successful algorithms for this problem use the principle of optimism in the face of uncertainty to guide exploration. The clearest example of this is the class of upper confidence bound (UCB) algorithms, but recent work has shown that a simple posterior sampling algorithm, sometimes called Thompson sampling, can be analyzed in the same manner as optimistic approaches. In this paper, we develop a regret bound that holds for both classes of algorithms. This bound applies broadly and can be specialized to many model classes. It depends on a new notion we refer to as the eluder dimension, which measures the degree of dependence among action rewards. Compared to UCB algorithm regret bounds for specific model classes, our general bound matches the best available for linear models and is stronger than the best available for generalized linear models. 1
3 0.23943385 1 nips-2013-(More) Efficient Reinforcement Learning via Posterior Sampling
Author: Ian Osband, Dan Russo, Benjamin Van Roy
Abstract: Most provably-efficient reinforcement learning algorithms introduce optimism about poorly-understood states and actions to encourage exploration. We study an alternative approach for efficient exploration: posterior sampling for reinforcement learning (PSRL). This algorithm proceeds in repeated episodes of known duration. At the start of each episode, PSRL updates a prior distribution over Markov decision processes and takes one sample from this posterior. PSRL then follows the policy that is optimal for this sample during the episode. The algorithm is conceptually simple, computationally efficient and allows an agent to encode prior knowledge in a natural way. We establish an Õ(τ S√(AT)) bound on expected regret, where T is time, τ is the episode length and S and A are the cardinalities of the state and action spaces. This bound is one of the first for an algorithm not based on optimism, and close to the state of the art for any reinforcement learning algorithm. We show through simulation that PSRL significantly outperforms existing algorithms with similar regret bounds. 1
4 0.23869917 235 nips-2013-Online learning in episodic Markovian decision processes by relative entropy policy search
Author: Alexander Zimin, Gergely Neu
Abstract: We study the problem of online learning in finite episodic Markov decision processes (MDPs) where the loss function is allowed to change between episodes. The natural performance measure in this learning problem is the regret defined as the difference between the total loss of the best stationary policy and the total loss suffered by the learner. We assume that the learner is given access to a finite action space A and the state space X has a layered structure with L layers, so that state transitions are only possible between consecutive layers. We describe a variant of the recently proposed Relative Entropy Policy Search algorithm and show that its regret after T episodes is 2 L|X ||A|T log(|X ||A|/L) in the bandit setting and 2L T log(|X ||A|/L) in the full information setting, given that the learner has perfect knowledge of the transition probabilities of the underlying MDP. These guarantees largely improve previously known results under much milder assumptions and cannot be significantly improved under general assumptions. 1
5 0.21412706 273 nips-2013-Reinforcement Learning in Robust Markov Decision Processes
Author: Shiau Hong Lim, Huan Xu, Shie Mannor
Abstract: An important challenge in Markov decision processes is to ensure robustness with respect to unexpected or adversarial system behavior while taking advantage of well-behaving parts of the system. We consider a problem setting where some unknown parts of the state space can have arbitrary transitions while other parts are purely stochastic. We devise an algorithm that is adaptive to potentially adversarial behavior and show that it achieves similar regret bounds as the purely stochastic case. 1
7 0.16283727 32 nips-2013-Aggregating Optimistic Planning Trees for Solving Markov Decision Processes
8 0.1536918 250 nips-2013-Policy Shaping: Integrating Human Feedback with Reinforcement Learning
9 0.14203168 50 nips-2013-Bayesian Mixture Modelling and Inference based Thompson Sampling in Monte-Carlo Tree Search
10 0.13716681 278 nips-2013-Reward Mapping for Transfer in Long-Lived Agents
11 0.13312626 292 nips-2013-Sequential Transfer in Multi-armed Bandit with Finite Set of Models
12 0.13273847 125 nips-2013-From Bandits to Experts: A Tale of Domination and Independence
13 0.12130663 239 nips-2013-Optimistic policy iteration and natural actor-critic: A unifying view and a non-optimality result
14 0.12110037 29 nips-2013-Adaptive Submodular Maximization in Bandit Setting
15 0.11948133 150 nips-2013-Learning Adaptive Value of Information for Structured Prediction
16 0.11881201 24 nips-2013-Actor-Critic Algorithms for Risk-Sensitive MDPs
17 0.11537497 231 nips-2013-Online Learning with Switching Costs and Other Adaptive Adversaries
18 0.11397549 28 nips-2013-Adaptive Step-Size for Policy Gradient Methods
19 0.10889183 338 nips-2013-Two-Target Algorithms for Infinite-Armed Bandits with Bernoulli Rewards
20 0.10858692 230 nips-2013-Online Learning with Costly Features and Labels
topicId topicWeight
[(0, 0.215), (1, -0.29), (2, 0.032), (3, -0.037), (4, -0.017), (5, -0.093), (6, 0.076), (7, -0.065), (8, -0.078), (9, -0.017), (10, 0.009), (11, 0.045), (12, 0.032), (13, -0.042), (14, -0.005), (15, 0.048), (16, 0.055), (17, -0.038), (18, 0.045), (19, -0.113), (20, -0.1), (21, 0.041), (22, -0.015), (23, 0.026), (24, 0.074), (25, 0.049), (26, -0.095), (27, -0.145), (28, -0.029), (29, 0.055), (30, 0.002), (31, -0.003), (32, -0.049), (33, 0.125), (34, -0.117), (35, -0.073), (36, -0.091), (37, 0.036), (38, 0.03), (39, -0.036), (40, 0.032), (41, -0.022), (42, -0.025), (43, 0.066), (44, 0.043), (45, 0.02), (46, -0.056), (47, -0.036), (48, -0.052), (49, -0.026)]
simIndex simValue paperId paperTitle
same-paper 1 0.95203757 103 nips-2013-Efficient Exploration and Value Function Generalization in Deterministic Systems
Author: Zheng Wen, Benjamin Van Roy
Abstract: We consider the problem of reinforcement learning over episodes of a finite-horizon deterministic system and as a solution propose optimistic constraint propagation (OCP), an algorithm designed to synthesize efficient exploration and value function generalization. We establish that when the true value function Q⇤ lies within the hypothesis class Q, OCP selects optimal actions over all but at most dimE [Q] episodes, where dimE denotes the eluder dimension. We establish further efficiency and asymptotic performance guarantees that apply even if Q⇤ does not lie in Q, for the special case where Q is the span of pre-specified indicator functions over disjoint sets. 1
2 0.86021596 1 nips-2013-(More) Efficient Reinforcement Learning via Posterior Sampling
Author: Ian Osband, Dan Russo, Benjamin Van Roy
Abstract: Most provably-efficient reinforcement learning algorithms introduce optimism about poorly-understood states and actions to encourage exploration. We study an alternative approach for efficient exploration: posterior sampling for reinforcement learning (PSRL). This algorithm proceeds in repeated episodes of known duration. At the start of each episode, PSRL updates a prior distribution over Markov decision processes and takes one sample from this posterior. PSRL then follows the policy that is optimal for this sample during the episode. The algorithm is conceptually simple, computationally efficient and allows an agent to encode prior knowledge in a natural way. We establish an Õ(τ S√(AT)) bound on expected regret, where T is time, τ is the episode length and S and A are the cardinalities of the state and action spaces. This bound is one of the first for an algorithm not based on optimism, and close to the state of the art for any reinforcement learning algorithm. We show through simulation that PSRL significantly outperforms existing algorithms with similar regret bounds. 1
3 0.78028661 270 nips-2013-Regret based Robust Solutions for Uncertain Markov Decision Processes
Author: Asrar Ahmed, Pradeep Varakantham, Yossiri Adulyasak, Patrick Jaillet
Abstract: In this paper, we seek robust policies for uncertain Markov Decision Processes (MDPs). Most robust optimization approaches for these problems have focussed on the computation of maximin policies which maximize the value corresponding to the worst realization of the uncertainty. Recent work has proposed minimax regret as a suitable alternative to the maximin objective for robust optimization. However, existing algorithms for handling minimax regret are restricted to models with uncertainty over rewards only. We provide algorithms that employ sampling to improve across multiple dimensions: (a) Handle uncertainties over both transition and reward models; (b) Dependence of model uncertainties across state, action pairs and decision epochs; (c) Scalability and quality bounds. Finally, to demonstrate the empirical effectiveness of our sampling approaches, we provide comparisons against benchmark algorithms on two domains from literature. We also provide a Sample Average Approximation (SAA) analysis to compute a posteriori error bounds.
4 0.77557868 106 nips-2013-Eluder Dimension and the Sample Complexity of Optimistic Exploration
Author: Dan Russo, Benjamin Van Roy
Abstract: This paper considers the sample complexity of the multi-armed bandit with dependencies among the arms. Some of the most successful algorithms for this problem use the principle of optimism in the face of uncertainty to guide exploration. The clearest example of this is the class of upper confidence bound (UCB) algorithms, but recent work has shown that a simple posterior sampling algorithm, sometimes called Thompson sampling, can be analyzed in the same manner as optimistic approaches. In this paper, we develop a regret bound that holds for both classes of algorithms. This bound applies broadly and can be specialized to many model classes. It depends on a new notion we refer to as the eluder dimension, which measures the degree of dependence among action rewards. Compared to UCB algorithm regret bounds for specific model classes, our general bound matches the best available for linear models and is stronger than the best available for generalized linear models. 1
5 0.7615782 273 nips-2013-Reinforcement Learning in Robust Markov Decision Processes
Author: Shiau Hong Lim, Huan Xu, Shie Mannor
Abstract: An important challenge in Markov decision processes is to ensure robustness with respect to unexpected or adversarial system behavior while taking advantage of well-behaving parts of the system. We consider a problem setting where some unknown parts of the state space can have arbitrary transitions while other parts are purely stochastic. We devise an algorithm that is adaptive to potentially adversarial behavior and show that it achieves similar regret bounds as the purely stochastic case. 1
6 0.73934406 235 nips-2013-Online learning in episodic Markovian decision processes by relative entropy policy search
7 0.71973783 32 nips-2013-Aggregating Optimistic Planning Trees for Solving Markov Decision Processes
8 0.66126209 50 nips-2013-Bayesian Mixture Modelling and Inference based Thompson Sampling in Monte-Carlo Tree Search
9 0.64443797 250 nips-2013-Policy Shaping: Integrating Human Feedback with Reinforcement Learning
10 0.63361168 323 nips-2013-Synthesizing Robust Plans under Incomplete Domain Models
11 0.62591189 227 nips-2013-Online Learning in Markov Decision Processes with Adversarially Chosen Transition Probability Distributions
12 0.54886073 278 nips-2013-Reward Mapping for Transfer in Long-Lived Agents
13 0.54228473 253 nips-2013-Prior-free and prior-dependent regret bounds for Thompson Sampling
14 0.52481979 231 nips-2013-Online Learning with Switching Costs and Other Adaptive Adversaries
15 0.5085488 347 nips-2013-Variational Planning for Graph-based MDPs
16 0.50417292 125 nips-2013-From Bandits to Experts: A Tale of Domination and Independence
17 0.5009371 322 nips-2013-Symbolic Opportunistic Policy Iteration for Factored-Action MDPs
18 0.48808873 230 nips-2013-Online Learning with Costly Features and Labels
19 0.48748761 292 nips-2013-Sequential Transfer in Multi-armed Bandit with Finite Set of Models
20 0.47334722 140 nips-2013-Improved and Generalized Upper Bounds on the Complexity of Policy Iteration
topicId topicWeight
[(1, 0.017), (13, 0.172), (16, 0.028), (33, 0.098), (34, 0.108), (41, 0.053), (49, 0.04), (56, 0.271), (70, 0.023), (85, 0.043), (89, 0.024), (93, 0.028), (95, 0.014)]
simIndex simValue paperId paperTitle
1 0.91812474 78 nips-2013-Curvature and Optimal Algorithms for Learning and Minimizing Submodular Functions
Author: Rishabh K. Iyer, Stefanie Jegelka, Jeff A. Bilmes
Abstract: We investigate three related and important problems connected to machine learning: approximating a submodular function everywhere, learning a submodular function (in a PAC-like setting [28]), and constrained minimization of submodular functions. We show that the complexity of all three problems depends on the “curvature” of the submodular function, and provide lower and upper bounds that refine and improve previous results [2, 6, 8, 27]. Our proof techniques are fairly generic. We either use a black-box transformation of the function (for approximation and learning), or a transformation of algorithms to use an appropriate surrogate function (for minimization). Curiously, curvature has been known to influence approximations for submodular maximization [3, 29], but its effect on minimization, approximation and learning has hitherto been open. We complete this picture, and also support our theoretical claims by empirical results. 1
same-paper 2 0.91702366 103 nips-2013-Efficient Exploration and Value Function Generalization in Deterministic Systems
Author: Zheng Wen, Benjamin Van Roy
Abstract: We consider the problem of reinforcement learning over episodes of a finite-horizon deterministic system and as a solution propose optimistic constraint propagation (OCP), an algorithm designed to synthesize efficient exploration and value function generalization. We establish that when the true value function Q⇤ lies within the hypothesis class Q, OCP selects optimal actions over all but at most dimE [Q] episodes, where dimE denotes the eluder dimension. We establish further efficiency and asymptotic performance guarantees that apply even if Q⇤ does not lie in Q, for the special case where Q is the span of pre-specified indicator functions over disjoint sets. 1
3 0.8706798 29 nips-2013-Adaptive Submodular Maximization in Bandit Setting
Author: Victor Gabillon, Branislav Kveton, Zheng Wen, Brian Eriksson, S. Muthukrishnan
Abstract: Maximization of submodular functions has wide applications in machine learning and artificial intelligence. Adaptive submodular maximization has been traditionally studied under the assumption that the model of the world, the expected gain of choosing an item given previously selected items and their states, is known. In this paper, we study the setting where the expected gain is initially unknown, and it is learned by interacting repeatedly with the optimized function. We propose an efficient algorithm for solving our problem and prove that its expected cumulative regret increases logarithmically with time. Our regret bound captures the inherent property of submodular maximization, earlier mistakes are more costly than later ones. We refer to our approach as Optimistic Adaptive Submodular Maximization (OASM) because it trades off exploration and exploitation based on the optimism in the face of uncertainty principle. We evaluate our method on a preference elicitation problem and show that non-trivial K-step policies can be learned from just a few hundred interactions with the problem. 1
4 0.86478472 17 nips-2013-A multi-agent control framework for co-adaptation in brain-computer interfaces
Author: Josh S. Merel, Roy Fox, Tony Jebara, Liam Paninski
Abstract: In a closed-loop brain-computer interface (BCI), adaptive decoders are used to learn parameters suited to decoding the user’s neural response. Feedback to the user provides information which permits the neural tuning to also adapt. We present an approach to model this process of co-adaptation between the encoding model of the neural signal and the decoding algorithm as a multi-agent formulation of the linear quadratic Gaussian (LQG) control problem. In simulation we characterize how decoding performance improves as the neural encoding and adaptive decoder optimize, qualitatively resembling experimentally demonstrated closed-loop improvement. We then propose a novel, modified decoder update rule which is aware of the fact that the encoder is also changing and show it can improve simulated co-adaptation dynamics. Our modeling approach offers promise for gaining insights into co-adaptation as well as improving user learning of BCI control in practical settings.
5 0.85992903 3 nips-2013-A* Lasso for Learning a Sparse Bayesian Network Structure for Continuous Variables
Author: Jing Xiang, Seyoung Kim
Abstract: We address the problem of learning a sparse Bayesian network structure for continuous variables in a high-dimensional space. The constraint that the estimated Bayesian network structure must be a directed acyclic graph (DAG) makes the problem challenging because of the huge search space of network structures. Most previous methods were based on a two-stage approach that prunes the search space in the first stage and then searches for a network structure satisfying the DAG constraint in the second stage. Although this approach is effective in a lowdimensional setting, it is difficult to ensure that the correct network structure is not pruned in the first stage in a high-dimensional setting. In this paper, we propose a single-stage method, called A* lasso, that recovers the optimal sparse Bayesian network structure by solving a single optimization problem with A* search algorithm that uses lasso in its scoring system. Our approach substantially improves the computational efficiency of the well-known exact methods based on dynamic programming. We also present a heuristic scheme that further improves the efficiency of A* lasso without significantly compromising the quality of solutions. We demonstrate our approach on data simulated from benchmark Bayesian networks and real data. 1
6 0.85750353 319 nips-2013-Submodular Optimization with Submodular Cover and Submodular Knapsack Constraints
7 0.85531354 269 nips-2013-Regression-tree Tuning in a Streaming Setting
8 0.85482222 155 nips-2013-Learning Hidden Markov Models from Non-sequence Data via Tensor Decomposition
9 0.85041404 106 nips-2013-Eluder Dimension and the Sample Complexity of Optimistic Exploration
10 0.85038531 108 nips-2013-Error-Minimizing Estimates and Universal Entry-Wise Error Bounds for Low-Rank Matrix Completion
11 0.84915888 235 nips-2013-Online learning in episodic Markovian decision processes by relative entropy policy search
12 0.84605604 227 nips-2013-Online Learning in Markov Decision Processes with Adversarially Chosen Transition Probability Distributions
13 0.83676046 32 nips-2013-Aggregating Optimistic Planning Trees for Solving Markov Decision Processes
14 0.83537209 199 nips-2013-More data speeds up training time in learning halfspaces over sparse vectors
15 0.83314013 170 nips-2013-Learning with Invariance via Linear Functionals on Reproducing Kernel Hilbert Space
16 0.82893008 2 nips-2013-(Nearly) Optimal Algorithms for Private Online Learning in Full-information and Bandit Settings
17 0.82693946 249 nips-2013-Polar Operators for Structured Sparse Estimation
18 0.82680005 230 nips-2013-Online Learning with Costly Features and Labels
19 0.8259384 31 nips-2013-Adaptivity to Local Smoothness and Dimension in Kernel Regression
20 0.82548821 1 nips-2013-(More) Efficient Reinforcement Learning via Posterior Sampling