nips nips2012 nips2012-259 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Ronald Ortner, Daniil Ryabko
Abstract: We derive sublinear regret bounds for undiscounted reinforcement learning in continuous state space. The proposed algorithm combines state aggregation with the use of upper confidence bounds for implementing optimism in the face of uncertainty. Besides the existence of an optimal policy which satisfies the Poisson equation, the only assumptions made are Hölder continuity of rewards and transition probabilities.
Reference: text
sentIndex sentText sentNum sentScore
1 Abstract We derive sublinear regret bounds for undiscounted reinforcement learning in continuous state space. [sent-4, score-0.77]
2 The proposed algorithm combines state aggregation with the use of upper confidence bounds for implementing optimism in the face of uncertainty. [sent-5, score-0.353]
3 Besides the existence of an optimal policy which satisfies the Poisson equation, the only assumptions made are Hölder continuity of rewards and transition probabilities. [sent-6, score-0.613]
4 1 Introduction: Real-world problems usually demand continuous state or action spaces, and one of the challenges for reinforcement learning is to deal with such continuous domains. [sent-7, score-0.496]
5 Often such similarities can be formalized as Lipschitz or, more generally, Hölder continuity of reward and transition functions. [sent-9, score-0.472]
6 The simplest continuous reinforcement learning problem is the 1-dimensional continuum-armed bandit, where the learner has to choose arms from a bounded interval. [sent-10, score-0.265]
7 Bounds on the regret with respect to an optimal policy under the assumption that the reward function is Hölder continuous have been given in [15, 4]. [sent-11, score-0.751]
8 That way, the regret suffered by the algorithm consists of the loss by aggregation (which can be bounded using Hölder continuity) plus the regret the algorithm incurs in the discretized setting. [sent-13, score-0.878]
9 While the continuous bandit case has been investigated in detail, in the general case of continuous state Markov decision processes (MDPs) a lot of work is confined to rather particular settings, primarily with respect to the considered transition model. [sent-15, score-0.495]
10 In the simplest case, the transition function is considered to be deterministic as in [19], and mistake bounds for the respective discounted setting have been derived in [6]. [sent-16, score-0.302]
11 Another common assumption is that transition functions are linear functions of state and action plus some noise. [sent-17, score-0.348]
12 For such settings sample complexity bounds have been given in [23, 7], while Õ(√T) bounds for the regret after T steps are shown in [1]. [sent-18, score-0.485]
13 However, there is also some research considering more general transition dynamics under the assumption that close states behave similarly, as will be considered here. [sent-19, score-0.198]
14 Thus, [13] considers PAC-learning for continuous reinforcement learning in metric state spaces, when generative sampling is possible. [sent-21, score-0.32]
15 Here we suggest a learning algorithm for undiscounted reinforcement learning in continuous state space. [sent-25, score-0.384]
16 For our algorithm we derive regret bounds of Õ(T^((2+α)/(2+2α))) for MDPs with 1-dimensional state space and Hölder-continuous rewards and transition probabilities with parameter α. [sent-28, score-0.864]
17 These bounds also straightforwardly generalize to dimension d, where the regret is bounded by Õ(T^((2d+α)/(2d+2α))). [sent-29, score-0.452]
18 Thus, in particular, if rewards and transition probabilities are Lipschitz, the regret is bounded by Õ(T^((2d+1)/(2d+2))) in dimension d and Õ(T^(3/4)) in dimension 1. [sent-30, score-0.71]
19 As far as we know, these are the first regret bounds for a general undiscounted continuous reinforcement learning setting. [sent-32, score-0.673]
20 We are given a Markov decision process (MDP) M with state space S = [0, 1]^d and finite action space A. [sent-34, score-0.196]
21 The random rewards in state s under action a are assumed to be bounded in [0, 1] with mean r(s, a). [sent-38, score-0.391]
22 The transition probability distribution in state s under action a is denoted by p(·|s, a). [sent-39, score-0.348]
23 We will make the natural assumption that rewards and transition probabilities are similar in close states. [sent-40, score-0.381]
24 More precisely, we assume that rewards and transition probabilities are Hölder continuous. [sent-41, score-0.528]
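For concreteness, Hölder continuity here takes the usual form (our paraphrase of Assumptions 1 and 2, with Hölder constant L and exponent α; the precise formulation is the one in the paper): |r(s, a) − r(s', a)| ≤ L·|s − s'|^α and ‖p(·|s, a) − p(·|s', a)‖_1 ≤ L·|s − s'|^α for all s, s' ∈ S and all a ∈ A.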
25 We also assume existence of an optimal policy π ∗ : S → A which gives optimal average reward ρ∗ = ρ∗ (M ) on M independent of the initial state. [sent-47, score-0.24]
26 Intuitively, the bias is the difference in accumulated rewards when starting in a different state. [sent-55, score-0.256]
27 Under Assumptions 1 and 2, the bias of the optimal policy is bounded. [sent-62, score-0.194]
28 Consequently, it makes sense to define the bias span H(M) of a continuous state MDP M satisfying Assumptions 1 and 2 to be H(M) := sup_s λ_π∗(s) − inf_s λ_π∗(s). [sent-63, score-0.3]
29 Note that since inf_s λ_π∗(s) ≤ 0 by definition of the bias, the bias function λ_π∗ is upper bounded by H(M). [sent-64, score-0.154]
30 We are interested in algorithms which can compete with the optimal policy π∗ and measure their performance by the regret (after T steps) defined as T·ρ∗(M) − Σ_{t=1}^T r_t, where r_t is the random reward obtained by the algorithm at step t. [sent-65, score-0.575]
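As a trivial illustration of this performance measure, a helper we add ourselves (ρ∗ and the collected reward trace are assumed to be given):

def empirical_regret(rho_star, rewards):
    # regret after T steps: T * rho_star minus the total reward actually collected
    return rho_star * len(rewards) - sum(rewards)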
31 Indeed, within T steps no canonical or even bias optimal policy (cf. [sent-66, score-0.194]
32 It maintains a set of plausible MDPs M. Algorithm 1 (the UCCRL algorithm). Input: State space S = [0, 1], action space A, confidence parameter δ > 0, aggregation parameter n ∈ N, upper bound H on the bias span, Lipschitz parameters L, α. [sent-69, score-0.357]
33 Set t := 1, and observe the initial state s1 and interval I(s1). [sent-74, score-0.153]
34 For episodes k = 1, 2, . . . do: Let N_k(I_j, a) be the number of times action a has been chosen in a state in I_j prior to episode k, and v_k(I_j, a) the respective counts in episode k. [sent-78, score-0.615]
35 Initialize episode k: Set the start time of episode k, tk := t. [sent-79, score-0.247]
36 Compute estimates r̂_k(s, a) and p̂_k^agg(I_i|s, a) for rewards and transition probabilities, using all samples from states in the same interval I(s), respectively. [sent-80, score-0.902]
37 (4) Execute policy π̃_k: while v_k(I(s_t), π̃_k(s_t)) < max{1, N_k(I(s_t), π̃_k(s_t))} do: Choose action a_t = π̃_k(s_t), obtain reward r_t, and observe next state s_{t+1}. [sent-83, score-0.67]
38 end while; end for. UCCRL chooses optimistically an MDP M̃ ∈ M and a policy π̃ such that the average reward ρ_π̃(M̃) is maximized, cf. [sent-85, score-0.24]
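To make the episode structure concrete, here is a minimal Python sketch of the UCCRL loop (all names are ours; env.reset()/env.step() is an assumed interface; the optimistic planning step of (4) is left as a stub since its efficient computation is stated below to be an open problem; this is an illustration, not the paper's implementation):

import numpy as np

def interval_index(s, n):
    # Map a state s in [0, 1] to the (0-based) index of its aggregation interval I_1, ..., I_n.
    return min(int(s * n), n - 1)

def uccrl_sketch(env, A, n, T, optimistic_planner):
    # env.reset() -> state in [0, 1]; env.step(a) -> (reward, next_state)   (assumed interface)
    N = np.zeros((n, A))             # samples per (interval, action) before the current episode
    r_sum = np.zeros((n, A))         # accumulated rewards per (interval, action)
    p_counts = np.zeros((n, A, n))   # observed transitions between intervals
    s = env.reset()
    t = 1
    while t <= T:
        # Start an episode: aggregated estimates from all samples gathered so far
        visits = np.maximum(1.0, N)
        r_hat = r_sum / visits
        p_hat = p_counts / visits[:, :, None]
        policy = optimistic_planner(r_hat, p_hat, N, t)   # maps interval index -> action (stub)
        v = np.zeros((n, A))                              # within-episode counts
        while t <= T:
            i = interval_index(s, n)
            a = policy[i]
            if v[i, a] >= max(1, N[i, a]):
                break                                     # doubling criterion ends the episode
            reward, s_next = env.step(a)
            r_sum[i, a] += reward
            p_counts[i, a, interval_index(s_next, n)] += 1
            v[i, a] += 1
            s = s_next
            t += 1
        N += v                                            # fold episode counts into the totals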
39 Whereas for UCRL2 and REGAL the set of plausible MDPs is defined by confidence intervals for rewards and transition probabilities for each individual state-action pair, for UCCRL we assume an MDP to be plausible if its aggregated rewards and transition probabilities are within a certain range. [sent-87, score-0.983]
40 This range is defined by the aggregation error (determined by the assumed Hölder continuity) and respective confidence intervals, cf. [sent-88, score-0.271]
41 Correspondingly, the estimates for rewards and transition probabilities for some state-action pair (s, a) are calculated from all sampled values of action a in states close to s. [sent-90, score-0.623]
42 More precisely, for the aggregation UCCRL partitions the state space into intervals I_1 := [0, 1/n], I_k := ((k−1)/n, k/n] for k = 2, 3, . . . , n. [sent-91, score-0.253]
43 The corresponding aggregated transition probabilities are defined by p^agg(I_j|s, a) := ∫_{I_j} p(ds'|s, a). [sent-95, score-0.701]
44 (5) Ij Generally, for a (transition) probability distribution p(·) over S we write pagg (·) for the aggregated probability distribution with respect to {I1 , I2 . [sent-96, score-0.473]
45 The estimates r̂(s, a) and p̂^agg(·|s, a) are calculated from all samples of action a in states in I(s), the interval I_j containing s. [sent-103, score-0.632]
46 Like UCRL2 and REGAL, UCCRL proceeds in episodes in which the chosen policy remains fixed. [sent-105, score-0.253]
47 Episodes are terminated when the number of times an action has been sampled from some interval Ij has been doubled. [sent-106, score-0.155]
48 Since all states in the same interval I_j have the same confidence intervals, finding the optimal pair M̃_k, π̃_k in (4) is equivalent to finding the respective optimistic discretized MDP M̃_k^agg and an optimal policy π̃_k^agg on M̃_k^agg. [sent-108, score-0.935]
49 Then π̃_k can be set to be the extension of π̃_k^agg to S, that is, π̃_k(s) := π̃_k^agg(I(s)) for all s. [sent-109, score-0.349]
50 However, due to the constraint on the bias, even in this finite case efficient computation of M̃_k^agg and π̃_k^agg is still an open problem. [sent-110, score-0.55]
51 The REGAL.C algorithm [5] selects the optimistic MDP and optimal policy in the same way as UCCRL. [sent-112, score-0.384]
52 While the algorithm presented here is the first modification of UCRL2 to continuous reinforcement learning problems, there are similar adaptations to online aggregation [21] and learning in finite state MDPs with some additional similarity structure known to the learner [22]. [sent-113, score-0.425]
53 Let M be an MDP with continuous state space [0, 1], A actions, rewards and transition probabilities satisfying Assumptions 1 and 2, and bias span upper bounded by H. [sent-116, score-0.755]
54 Then with probability 1 − δ, the regret of UCCRL (run with input parameters n and H) after T steps is upper bounded by const · nH · √(AT log(T/δ)) + const · HLn^(−α) · T. (6) [sent-117, score-0.474]
55 Therefore, setting n = T^(1/(2+2α)) gives regret upper bounded by const · HL · √(A log(T/δ)) · T^((2+α)/(2+2α)). [sent-118, score-0.437]
56 With no known upper bound on the bias span, guessing H by log T one still obtains an upper bound on the regret of Õ(T^((2+α)/(2+2α))). [sent-119, score-0.546]
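As a small numeric illustration of these parameter choices, a helper we add ourselves (it only evaluates the exponents quoted above and is not part of the paper):

def discretization_and_exponent(T, alpha, d=1):
    # number of intervals per dimension as in Theorem 4 (d = 1) and its d-dimensional analogue,
    # together with the exponent of the resulting regret bound O~(T**exponent)
    n = int(round(T ** (1.0 / (2 * d + 2 * alpha))))
    exponent = (2 * d + alpha) / (2 * d + 2 * alpha)
    return n, exponent

# Lipschitz case (alpha = 1) in dimension 1: exponent 3/4, as stated in the introduction
print(discretization_and_exponent(10**6, alpha=1.0, d=1))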
57 Intuitively, the second term in the regret bound of (6) is the discretization error, while the first term corresponds to the regret on the discretized MDP. [sent-120, score-0.722]
58 Then choosing n = T^(1/(2d+2α)) bounds the regret by Õ(T^((2d+α)/(2d+2α))). [sent-124, score-0.386]
59 If these parameters are not known, then one can still obtain sublinear regret bounds, albeit with worse dependence on T. [sent-132, score-0.386]
60 Then one selects the model with the highest reward and uses it for a period of τ_0 time steps (exploitation), while checking that its average reward stays within (6) of what was obtained in the exploration phase. [sent-136, score-0.252]
61 If the average reward does not pass this test, then the model with the second-best average reward is selected, and so on. [sent-137, score-0.252]
62 Optimizing over the parameters τ_i and τ'_i as done in [17], and increasing the number J of considered parameter values, one can obtain regret bounds of Õ(T^((2+2α)/(2+3α))), or Õ(T^(4/5)) in the Lipschitz case. [sent-140, score-0.386]
63 Recently, the results of [17] have been improved [18], and it seems that a similar analysis gives improved regret bounds for the case of unknown Hölder parameters as well. [sent-143, score-0.533]
64 The following is a complementary lower bound on the regret for continuous state reinforcement learning. [sent-144, score-0.632]
65 For any A, H > 1 and any reinforcement learning algorithm there is a continuous state reinforcement learning problem with A actions and bias span H satisfying Assumption 1 such that the algorithm suffers regret of Ω(√(HAT)). [sent-146, score-0.912]
66 Consider the following reinforcement learning problem with state space [0, 1]. [sent-148, score-0.243]
67 The state space is partitioned into n intervals Ij of equal size. [sent-149, score-0.18]
68 The transition probabilities for each action a are, on each of the intervals I_j, concentrated and uniformly distributed on the same interval I_j. [sent-150, score-0.466]
69 The rewards on each interval I_j are also constant for each a and are chosen as in the lower bounds for a multi-armed bandit problem [3] with nA arms. [sent-151, score-0.4]
70 That is, giving only one arm slightly higher reward, it is known [3] that regret of Ω(√(nAT)) can be forced upon any algorithm on the respective bandit problem. [sent-152, score-0.43]
71 Adding another action giving no reward and distributing equally over the whole state space, the bias span of the problem is n and the regret Ω(√(HAT)). [sent-153, score-0.735]
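A rough generative sketch of this hard instance (our own illustration: the Bernoulli rewards, the gap parameter eps, and all names are assumptions; the extra no-reward action is indexed A here, so the sketch has A + 1 actions, and the precise reward gaps of the bandit lower bound in [3] are not reproduced):

import numpy as np

def make_lower_bound_instance(n, A, eps, seed=0):
    rng = np.random.default_rng(seed)
    best = (int(rng.integers(n)), int(rng.integers(A)))   # the single slightly better (interval, action) pair
    def step(s, a):
        j = min(int(s * n), n - 1)                        # interval containing s
        if a == A:                                        # extra action: no reward, uniform over [0, 1]
            return 0.0, float(rng.uniform(0.0, 1.0))
        mean = 0.5 + (eps if (j, a) == best else 0.0)     # reward mean is constant on each interval
        reward = float(rng.random() < mean)
        s_next = float(rng.uniform(j / n, (j + 1) / n))   # transitions stay inside the same interval
        return reward, s_next
    return step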
72 However, the transition probabilities are piecewise constant (and hence Lipschitz) and known to the learner. [sent-156, score-0.228]
73 Actually, it is straightforward to deal with piecewise Hölder continuous rewards and transition probabilities where the finitely many points of discontinuity are known to the learner. [sent-157, score-0.605]
74 If one makes sure that the intervals of the discretized state space do not contain any discontinuities, it is easy to adapt UCCRL and Theorem 4 accordingly. [sent-158, score-0.222]
75 The bounds of Theorems 4 and 8 cannot be directly compared to bounds for the continuous-armed bandit problem [15, 4, 16, 8], because the latter is not a special case of learning MDPs with continuous state space (and rather corresponds to a continuous action space). [sent-160, score-0.64]
76 Thus, in particular one cannot freely sample an arbitrary state of the state space as assumed in continuous-armed bandits. [sent-161, score-0.194]
77 5 Proof of Theorem 4: For the proof of the main theorem we adapt the proof of the regret bounds for finite MDPs in [11] and [5]. [sent-162, score-0.386]
78 Some of the necessary adaptations made are similar to techniques used for showing regret bounds for other modifications of the original UCRL2 algorithm [21, 22], which however only considered finite-state MDPs. [sent-164, score-0.418]
79 1 Splitting into Episodes: Let v_k(s, a) be the number of times action a has been chosen in episode k when being in state s, and denote the total number of episodes by m. [sent-166, score-0.624]
80 Then setting ∆_k := Σ_{s,a} v_k(s, a)(ρ∗ − r(s, a)), with probability at least 1 − δ/(12T^(5/4)) the regret of UCCRL after T steps is upper bounded by (cf. [sent-167, score-0.571]
81 (7) Failing Confidence Intervals: Next, we consider the regret incurred when the true MDP M is not contained in the set of plausible MDPs M_k. [sent-171, score-0.335]
82 Thus, fix a state-action pair (s, a), and recall that r̂(s, a) and p̂^agg(·|s, a) are the estimates for rewards and transition probabilities calculated from all samples of state-action pairs contained in the same interval I(s). [sent-172, score-0.868]
83 Now assume that at step t there have been N > 0 samples of action a in states in I(s) and that in the i-th sample a transition from state s_i ∈ I(s) to state s'_i has been observed (i = 1, . . . , N). [sent-173, score-0.581]
84 Pr( |r̂(s, a) − E[r̂(s, a)]| ≥ √(7 log(2nAt/δ) / (2N)) ) ≤ δ/(60nAt^7). (8) Concerning the transition probabilities, we have for a suitable x ∈ {−1, 1}^n: ‖p̂^agg(·|s, a) − E[p̂^agg(·|s, a)]‖_1 = Σ_{j=1}^n |p̂^agg(I_j|s, a) − E[p̂^agg(I_j|s, a)]| = Σ_{j=1}^n (p̂^agg(I_j|s, a) − E[p̂^agg(I_j|s, a)]) · x(I_j) = (1/N) Σ_{i=1}^N ( x(I(s'_i)) − ∫ p(ds'|s_i, a) · x(I(s')) ). [sent-179, score-1.445]
85 Pr( Σ_{i=1}^N X_i ≥ √(56nN log(2At/δ)) ) ≤ δ/(2^n · 20nAt^7). A union bound over all sequences x ∈ {−1, 1}^n then yields from (9) that Pr( ‖p̂^agg(·|s, a) − E[p̂^agg(·|s, a)]‖_1 ≥ √(56n log(2At/δ) / N) ) ≤ δ/(20nAt^7). (10) [sent-183, score-0.534]
86 Another union bound over all t possible values for N, all n intervals, and all actions shows that the confidence intervals in (8) and (10) hold at time t with probability at least 1 − δ/(15t^6) for the actual counts N(I(s), a) and all state-action pairs (s, a). [sent-184, score-0.224]
87 Similarly, ‖E[p̂^agg(·|s, a)] − p^agg(·|s, a)‖_1 < Ln^(−α). [sent-188, score-0.153]
88 Together with (8) and (10) this shows that with probability at least 1 − δ/(15t^6), for all state-action pairs (s, a): |r̂(s, a) − r(s, a)| < Ln^(−α) + √(7 log(2nAt/δ) / (2 max{1, N(I(s), a)})) and ‖p̂^agg(·|s, a) − p^agg(·|s, a)‖_1 < Ln^(−α) + √(56n log(2At/δ) / max{1, N(I(s), a)}). [sent-189, score-0.894]
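To make the shape of these confidence widths concrete, a small helper we add ourselves (the constants mirror the reconstructed inequalities above and should be treated as illustrative):

import math

def reward_conf_width(L, alpha, n, N, A, t, delta):
    # aggregation error + Hoeffding-style term for the aggregated reward estimate
    return L * n ** (-alpha) + math.sqrt(7 * math.log(2 * n * A * t / delta) / (2 * max(1, N)))

def transition_conf_width(L, alpha, n, N, A, t, delta):
    # aggregation error + L1 deviation term for the aggregated transition estimate
    return L * n ** (-alpha) + math.sqrt(56 * n * math.log(2 * A * t / delta) / max(1, N))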
89 Since max{1, N_k(I_j, a)} ≤ t_k ≤ T, setting τ_k := t_{k+1} − t_k to be the length of episode k, we have ∆_k ≤ Σ_s v_k(s, π̃_k(s)) (ρ̃∗_k − r̃_k(s, π̃_k(s))) + 2Ln^(−α) τ_k + Σ_{j=1}^n Σ_a v_k(I_j, a) √(14 log(2nAT/δ) / max{1, N_k(I_j, a)}). (16) [sent-196, score-0.78]
90 The True Transition Functions: Now ‖p̃_k^agg(·|s, a) − p^agg(·|s, a)‖_1 ≤ ‖p̃_k^agg(·|s, a) − p̂_k^agg(·|s, a)‖_1 + ‖p̂_k^agg(·|s, a) − p^agg(·|s, a)‖_1 can be bounded by (3) and (12), because we assume M̃_k, M ∈ M_k. [sent-200, score-2.838]
91 This yields that with probability at least 1 − δ/(12T^(5/4)), Σ_{k=1}^m ∆_k 1_{M∈M_k} ≤ 2HLn^(−α) T + 4H √(14n log(2AT/δ)) Σ_{k=1}^m Σ_{j=1}^n Σ_{a∈A} v_k(I_j, a)/√(max{1, N_k(I_j, a)}) + H √((5/2) T log(8T/δ)) + 2Ln^(−α) T + √(14n log(2AT/δ)) Σ_{k=1}^m Σ_{j=1}^n Σ_{a∈A} v_k(I_j, a)/√(max{1, N_k(I_j, a)}) + HnA log_2(8T/(nA)). (19) [sent-211, score-0.498]
92 As in [11], one can show that Σ_k Σ_{j=1}^n Σ_{a∈A} v_k(I_j, a)/√(max{1, N_k(I_j, a)}) ≤ (√2 + 1) √(nAT), and we get from (19) after some simplifications that with probability ≥ 1 − δ/(12T^(5/4)), Σ_{k=1}^m ∆_k 1_{M∈M_k} ≤ H √((5/2) T log(8T/δ)) + HnA log_2(8T/(nA)) + (4H + 1) √(14n log(2AT/δ)) (√2 + 1) √(nAT) + 2(H + 1) Ln^(−α) T. [sent-215, score-0.288]
93 6 Outlook: We think that a generalization of our results to continuous action space should not pose any major problems. [sent-219, score-0.176]
94 The assumption of Hölder continuity is an obvious, yet not the only possible, assumption one can make about the transition probabilities and reward functions. [sent-221, score-0.548]
95 A more general problem is to assume a set F of functions, find a way to measure the “size” of F, and derive regret bounds depending on this size of F. [sent-222, score-0.386]
96 REGAL: A regularization based algorithm for reinforcement learning in weakly communicating MDPs. [sent-245, score-0.171]
97 Adaptive-resolution reinforcement learning with polynomial exploration in deterministic domains. [sent-249, score-0.197]
98 Optimal regret bounds for selecting the state representation in reinforcement learning. [sent-306, score-0.629]
99 Adaptive aggregation for reinforcement learning in average reward Markov decision processes. [sent-316, score-0.345]
100 Tree based discretization for continuous state space reinforcement learning. [sent-336, score-0.401]
wordName wordTfidf (topN-words)
[('pagg', 0.431), ('regret', 0.287), ('uccr', 0.274), ('mk', 0.265), ('agg', 0.235), ('vk', 0.21), ('ij', 0.186), ('rewards', 0.153), ('transition', 0.152), ('lder', 0.147), ('reinforcement', 0.146), ('episodes', 0.139), ('ds', 0.13), ('reward', 0.126), ('policy', 0.114), ('action', 0.099), ('bounds', 0.099), ('ortner', 0.098), ('state', 0.097), ('mdps', 0.096), ('bandit', 0.092), ('tk', 0.089), ('st', 0.086), ('intervals', 0.083), ('discretization', 0.081), ('daniil', 0.08), ('bias', 0.08), ('episode', 0.079), ('hna', 0.078), ('regal', 0.078), ('stk', 0.078), ('continuous', 0.077), ('mdp', 0.077), ('probabilities', 0.076), ('ronald', 0.074), ('aggregation', 0.073), ('ln', 0.069), ('nk', 0.067), ('rk', 0.064), ('undiscounted', 0.064), ('interval', 0.056), ('auer', 0.055), ('optimism', 0.052), ('exploration', 0.051), ('respective', 0.051), ('nat', 0.051), ('peter', 0.05), ('na', 0.049), ('plausible', 0.048), ('continuity', 0.047), ('span', 0.046), ('states', 0.046), ('dence', 0.045), ('si', 0.045), ('discretized', 0.042), ('bounded', 0.042), ('aggregated', 0.042), ('modelselection', 0.039), ('log', 0.039), ('lipschitz', 0.039), ('remark', 0.038), ('const', 0.037), ('poisson', 0.037), ('optimistic', 0.035), ('csaba', 0.035), ('maxim', 0.035), ('simo', 0.035), ('actions', 0.033), ('munos', 0.033), ('michael', 0.032), ('upper', 0.032), ('bility', 0.032), ('leoben', 0.032), ('bernard', 0.032), ('ryabko', 0.032), ('adaptations', 0.032), ('jean', 0.03), ('hat', 0.03), ('maillard', 0.028), ('max', 0.028), ('szepesv', 0.028), ('pk', 0.028), ('hern', 0.027), ('pr', 0.027), ('horizon', 0.027), ('mi', 0.027), ('guessing', 0.026), ('appendix', 0.026), ('communicating', 0.025), ('bandits', 0.025), ('bound', 0.025), ('nicol', 0.025), ('nicholas', 0.025), ('chapter', 0.024), ('straightforwardly', 0.024), ('martingale', 0.024), ('rt', 0.024), ('accumulated', 0.023), ('con', 0.023), ('kearns', 0.023)]
simIndex simValue paperId paperTitle
same-paper 1 1.0000002 259 nips-2012-Online Regret Bounds for Undiscounted Continuous Reinforcement Learning
Author: Ronald Ortner, Daniil Ryabko
Abstract: We derive sublinear regret bounds for undiscounted reinforcement learning in continuous state space. The proposed algorithm combines state aggregation with the use of upper confidence bounds for implementing optimism in the face of uncertainty. Besides the existence of an optimal policy which satisfies the Poisson equation, the only assumptions made are Hölder continuity of rewards and transition probabilities.
Author: Bruno Scherrer, Boris Lesner
Abstract: We consider infinite-horizon stationary γ-discounted Markov Decision Processes, for which it is known that there exists a stationary optimal policy. Using Value and Policy Iteration with some error at each iteration, it is well-known that one 2γ can compute stationary policies that are (1−γ)2 -optimal. After arguing that this guarantee is tight, we develop variations of Value and Policy Iteration for com2γ puting non-stationary policies that can be up to 1−γ -optimal, which constitutes a significant improvement in the usual situation when γ is close to 1. Surprisingly, this shows that the problem of “computing near-optimal non-stationary policies” is much simpler than that of “computing near-optimal stationary policies”. 1
3 0.19861935 295 nips-2012-Risk-Aversion in Multi-armed Bandits
Author: Amir Sani, Alessandro Lazaric, Rémi Munos
Abstract: Stochastic multi–armed bandits solve the Exploration–Exploitation dilemma and ultimately maximize the expected reward. Nonetheless, in many practical problems, maximizing the expected reward is not the most desirable objective. In this paper, we introduce a novel setting based on the principle of risk–aversion where the objective is to compete against the arm with the best risk–return trade–off. This setting proves to be more difficult than the standard multi-arm bandit setting due in part to an exploration risk which introduces a regret associated to the variability of an algorithm. Using variance as a measure of risk, we define two algorithms, investigate their theoretical guarantees, and report preliminary empirical results. 1
4 0.16611272 122 nips-2012-Exploration in Model-based Reinforcement Learning by Empirically Estimating Learning Progress
Author: Manuel Lopes, Tobias Lang, Marc Toussaint, Pierre-yves Oudeyer
Abstract: Formal exploration approaches in model-based reinforcement learning estimate the accuracy of the currently learned model without consideration of the empirical prediction error. For example, PAC-MDP approaches such as R- MAX base their model certainty on the amount of collected data, while Bayesian approaches assume a prior over the transition dynamics. We propose extensions to such approaches which drive exploration solely based on empirical estimates of the learner’s accuracy and learning progress. We provide a “sanity check” theoretical analysis, discussing the behavior of our extensions in the standard stationary finite state-action case. We then provide experimental studies demonstrating the robustness of these exploration measures in cases of non-stationary environments or where original approaches are misled by wrong domain assumptions. 1
5 0.16383962 348 nips-2012-Tractable Objectives for Robust Policy Optimization
Author: Katherine Chen, Michael Bowling
Abstract: Robust policy optimization acknowledges that risk-aversion plays a vital role in real-world decision-making. When faced with uncertainty about the effects of actions, the policy that maximizes expected utility over the unknown parameters of the system may also carry with it a risk of intolerably poor performance. One might prefer to accept lower utility in expectation in order to avoid, or reduce the likelihood of, unacceptable levels of utility under harmful parameter realizations. In this paper, we take a Bayesian approach to parameter uncertainty, but unlike other methods avoid making any distributional assumptions about the form of this uncertainty. Instead we focus on identifying optimization objectives for which solutions can be efficiently approximated. We introduce percentile measures: a very general class of objectives for robust policy optimization, which encompasses most existing approaches, including ones known to be intractable. We then introduce a broad subclass of this family for which robust policies can be approximated efficiently. Finally, we frame these objectives in the context of a two-player, zero-sum, extensive-form game and employ a no-regret algorithm to approximate an optimal policy, with computation only polynomial in the number of states and actions of the MDP. 1
6 0.16198003 241 nips-2012-No-Regret Algorithms for Unconstrained Online Convex Optimization
7 0.14330155 173 nips-2012-Learned Prioritization for Trading Off Accuracy and Speed
8 0.14198148 38 nips-2012-Algorithms for Learning Markov Field Policies
9 0.14009349 162 nips-2012-Inverse Reinforcement Learning through Structured Classification
10 0.1340386 108 nips-2012-Efficient Bayes-Adaptive Reinforcement Learning using Sample-Based Search
11 0.13086568 353 nips-2012-Transferring Expectations in Model-based Reinforcement Learning
12 0.12936766 344 nips-2012-Timely Object Recognition
13 0.12702557 160 nips-2012-Imitation Learning by Coaching
14 0.12370677 149 nips-2012-Hierarchical Optimistic Region Selection driven by Curiosity
15 0.11940207 216 nips-2012-Mirror Descent Meets Fixed Share (and feels no regret)
16 0.11855543 110 nips-2012-Efficient Reinforcement Learning for High Dimensional Linear Quadratic Systems
17 0.11380376 245 nips-2012-Nonparametric Bayesian Inverse Reinforcement Learning for Multiple Reward Functions
18 0.10442943 88 nips-2012-Cost-Sensitive Exploration in Bayesian Reinforcement Learning
19 0.10114289 297 nips-2012-Robustness and risk-sensitivity in Markov decision processes
20 0.095137432 234 nips-2012-Multiresolution analysis on the symmetric group
topicId topicWeight
[(0, 0.186), (1, -0.338), (2, 0.05), (3, 0.007), (4, -0.007), (5, 0.047), (6, 0.002), (7, 0.012), (8, -0.035), (9, 0.019), (10, 0.116), (11, 0.06), (12, -0.111), (13, -0.087), (14, -0.119), (15, 0.038), (16, -0.02), (17, 0.034), (18, -0.049), (19, -0.037), (20, 0.061), (21, -0.072), (22, -0.011), (23, -0.002), (24, 0.025), (25, 0.042), (26, 0.005), (27, -0.036), (28, -0.033), (29, 0.036), (30, -0.041), (31, 0.01), (32, 0.055), (33, -0.006), (34, 0.044), (35, -0.05), (36, -0.033), (37, 0.076), (38, -0.048), (39, 0.023), (40, -0.002), (41, 0.02), (42, 0.023), (43, -0.06), (44, 0.007), (45, -0.019), (46, 0.005), (47, -0.016), (48, -0.008), (49, 0.054)]
simIndex simValue paperId paperTitle
same-paper 1 0.95733273 259 nips-2012-Online Regret Bounds for Undiscounted Continuous Reinforcement Learning
Author: Ronald Ortner, Daniil Ryabko
Abstract: We derive sublinear regret bounds for undiscounted reinforcement learning in continuous state space. The proposed algorithm combines state aggregation with the use of upper confidence bounds for implementing optimism in the face of uncertainty. Besides the existence of an optimal policy which satisfies the Poisson equation, the only assumptions made are Hölder continuity of rewards and transition probabilities.
2 0.6928224 295 nips-2012-Risk-Aversion in Multi-armed Bandits
Author: Amir Sani, Alessandro Lazaric, Rémi Munos
Abstract: Stochastic multi–armed bandits solve the Exploration–Exploitation dilemma and ultimately maximize the expected reward. Nonetheless, in many practical problems, maximizing the expected reward is not the most desirable objective. In this paper, we introduce a novel setting based on the principle of risk–aversion where the objective is to compete against the arm with the best risk–return trade–off. This setting proves to be more difficult than the standard multi-arm bandit setting due in part to an exploration risk which introduces a regret associated to the variability of an algorithm. Using variance as a measure of risk, we define two algorithms, investigate their theoretical guarantees, and report preliminary empirical results. 1
Author: Bruno Scherrer, Boris Lesner
Abstract: We consider infinite-horizon stationary γ-discounted Markov Decision Processes, for which it is known that there exists a stationary optimal policy. Using Value and Policy Iteration with some error at each iteration, it is well-known that one 2γ can compute stationary policies that are (1−γ)2 -optimal. After arguing that this guarantee is tight, we develop variations of Value and Policy Iteration for com2γ puting non-stationary policies that can be up to 1−γ -optimal, which constitutes a significant improvement in the usual situation when γ is close to 1. Surprisingly, this shows that the problem of “computing near-optimal non-stationary policies” is much simpler than that of “computing near-optimal stationary policies”. 1
4 0.65979326 122 nips-2012-Exploration in Model-based Reinforcement Learning by Empirically Estimating Learning Progress
Author: Manuel Lopes, Tobias Lang, Marc Toussaint, Pierre-yves Oudeyer
Abstract: Formal exploration approaches in model-based reinforcement learning estimate the accuracy of the currently learned model without consideration of the empirical prediction error. For example, PAC-MDP approaches such as R- MAX base their model certainty on the amount of collected data, while Bayesian approaches assume a prior over the transition dynamics. We propose extensions to such approaches which drive exploration solely based on empirical estimates of the learner’s accuracy and learning progress. We provide a “sanity check” theoretical analysis, discussing the behavior of our extensions in the standard stationary finite state-action case. We then provide experimental studies demonstrating the robustness of these exploration measures in cases of non-stationary environments or where original approaches are misled by wrong domain assumptions. 1
5 0.64943802 348 nips-2012-Tractable Objectives for Robust Policy Optimization
Author: Katherine Chen, Michael Bowling
Abstract: Robust policy optimization acknowledges that risk-aversion plays a vital role in real-world decision-making. When faced with uncertainty about the effects of actions, the policy that maximizes expected utility over the unknown parameters of the system may also carry with it a risk of intolerably poor performance. One might prefer to accept lower utility in expectation in order to avoid, or reduce the likelihood of, unacceptable levels of utility under harmful parameter realizations. In this paper, we take a Bayesian approach to parameter uncertainty, but unlike other methods avoid making any distributional assumptions about the form of this uncertainty. Instead we focus on identifying optimization objectives for which solutions can be efficiently approximated. We introduce percentile measures: a very general class of objectives for robust policy optimization, which encompasses most existing approaches, including ones known to be intractable. We then introduce a broad subclass of this family for which robust policies can be approximated efficiently. Finally, we frame these objectives in the context of a two-player, zero-sum, extensive-form game and employ a no-regret algorithm to approximate an optimal policy, with computation only polynomial in the number of states and actions of the MDP. 1
6 0.64342844 61 nips-2012-Best Arm Identification: A Unified Approach to Fixed Budget and Fixed Confidence
7 0.64143413 350 nips-2012-Trajectory-Based Short-Sighted Probabilistic Planning
8 0.63409549 110 nips-2012-Efficient Reinforcement Learning for High Dimensional Linear Quadratic Systems
9 0.62375844 88 nips-2012-Cost-Sensitive Exploration in Bayesian Reinforcement Learning
10 0.59270567 38 nips-2012-Algorithms for Learning Markov Field Policies
11 0.59198898 108 nips-2012-Efficient Bayes-Adaptive Reinforcement Learning using Sample-Based Search
12 0.58993399 160 nips-2012-Imitation Learning by Coaching
13 0.58888 241 nips-2012-No-Regret Algorithms for Unconstrained Online Convex Optimization
14 0.5885464 173 nips-2012-Learned Prioritization for Trading Off Accuracy and Speed
15 0.58775514 162 nips-2012-Inverse Reinforcement Learning through Structured Classification
16 0.58104867 250 nips-2012-On-line Reinforcement Learning Using Incremental Kernel-Based Stochastic Factorization
17 0.56679112 364 nips-2012-Weighted Likelihood Policy Search with Model Selection
18 0.56470829 216 nips-2012-Mirror Descent Meets Fixed Share (and feels no regret)
19 0.56154853 149 nips-2012-Hierarchical Optimistic Region Selection driven by Curiosity
20 0.55728269 353 nips-2012-Transferring Expectations in Model-based Reinforcement Learning
topicId topicWeight
[(0, 0.019), (21, 0.042), (31, 0.212), (36, 0.043), (38, 0.12), (42, 0.021), (53, 0.013), (54, 0.103), (55, 0.014), (74, 0.023), (76, 0.117), (80, 0.1), (92, 0.078)]
simIndex simValue paperId paperTitle
same-paper 1 0.81080753 259 nips-2012-Online Regret Bounds for Undiscounted Continuous Reinforcement Learning
Author: Ronald Ortner, Daniil Ryabko
Abstract: We derive sublinear regret bounds for undiscounted reinforcement learning in continuous state space. The proposed algorithm combines state aggregation with the use of upper confidence bounds for implementing optimism in the face of uncertainty. Besides the existence of an optimal policy which satisfies the Poisson equation, the only assumptions made are Hölder continuity of rewards and transition probabilities.
Author: Stefan Habenschuss, Johannes Bill, Bernhard Nessler
Abstract: Recent spiking network models of Bayesian inference and unsupervised learning frequently assume either inputs to arrive in a special format or employ complex computations in neuronal activation functions and synaptic plasticity rules. Here we show in a rigorous mathematical treatment how homeostatic processes, which have previously received little attention in this context, can overcome common theoretical limitations and facilitate the neural implementation and performance of existing models. In particular, we show that homeostatic plasticity can be understood as the enforcement of a ’balancing’ posterior constraint during probabilistic inference and learning with Expectation Maximization. We link homeostatic dynamics to the theory of variational inference, and show that nontrivial terms, which typically appear during probabilistic inference in a large class of models, drop out. We demonstrate the feasibility of our approach in a spiking WinnerTake-All architecture of Bayesian inference and learning. Finally, we sketch how the mathematical framework can be extended to richer recurrent network architectures. Altogether, our theory provides a novel perspective on the interplay of homeostatic processes and synaptic plasticity in cortical microcircuits, and points to an essential role of homeostasis during inference and learning in spiking networks. 1
3 0.78693473 119 nips-2012-Entropy Estimations Using Correlated Symmetric Stable Random Projections
Author: Ping Li, Cun-hui Zhang
Abstract: Methods for efficiently estimating Shannon entropy of data streams have important applications in learning, data mining, and network anomaly detections (e.g., the DDoS attacks). For nonnegative data streams, the method of Compressed Counting (CC) [11, 13] based on maximally-skewed stable random projections can provide accurate estimates of the Shannon entropy using small storage. However, CC is no longer applicable when entries of data streams can be below zero, which is a common scenario when comparing two streams. In this paper, we propose an algorithm for entropy estimation in general data streams which allow negative entries. In our method, the Shannon entropy is approximated by the finite difference of two correlated frequency moments estimated from correlated samples of symmetric stable random variables. Interestingly, the estimator for the moment we recommend for entropy estimation barely has bounded variance itself, whereas the common geometric mean estimator (which has bounded higher-order moments) is not sufficient for entropy estimation. Our experiments confirm that this method is able to well approximate the Shannon entropy using small storage.
4 0.72595292 236 nips-2012-Near-Optimal MAP Inference for Determinantal Point Processes
Author: Jennifer Gillenwater, Alex Kulesza, Ben Taskar
Abstract: Determinantal point processes (DPPs) have recently been proposed as computationally efficient probabilistic models of diverse sets for a variety of applications, including document summarization, image search, and pose estimation. Many DPP inference operations, including normalization and sampling, are tractable; however, finding the most likely configuration (MAP), which is often required in practice for decoding, is NP-hard, so we must resort to approximate inference. This optimization problem, which also arises in experimental design and sensor placement, involves finding the largest principal minor of a positive semidefinite matrix. Because the objective is log-submodular, greedy algorithms have been used in the past with some empirical success; however, these methods only give approximation guarantees in the special case of monotone objectives, which correspond to a restricted class of DPPs. In this paper we propose a new algorithm for approximating the MAP problem based on continuous techniques for submodular function maximization. Our method involves a novel continuous relaxation of the log-probability function, which, in contrast to the multilinear extension used for general submodular functions, can be evaluated and differentiated exactly and efficiently. We obtain a practical algorithm with a 1/4-approximation guarantee for a more general class of non-monotone DPPs; our algorithm also extends to MAP inference under complex polytope constraints, making it possible to combine DPPs with Markov random fields, weighted matchings, and other models. We demonstrate that our approach outperforms standard and recent methods on both synthetic and real-world data. 1
5 0.72371984 177 nips-2012-Learning Invariant Representations of Molecules for Atomization Energy Prediction
Author: Grégoire Montavon, Katja Hansen, Siamac Fazli, Matthias Rupp, Franziska Biegler, Andreas Ziehe, Alexandre Tkatchenko, Anatole V. Lilienfeld, Klaus-Robert Müller
Abstract: The accurate prediction of molecular energetics in chemical compound space is a crucial ingredient for rational compound design. The inherently graph-like, non-vectorial nature of molecular data gives rise to a unique and difficult machine learning problem. In this paper, we adopt a learning-from-scratch approach where quantum-mechanical molecular energies are predicted directly from the raw molecular geometry. The study suggests a benefit from setting flexible priors and enforcing invariance stochastically rather than structurally. Our results improve the state-of-the-art by a factor of almost three, bringing statistical methods one step closer to chemical accuracy. 1
6 0.70971751 287 nips-2012-Random function priors for exchangeable arrays with applications to graphs and relational data
7 0.70860326 115 nips-2012-Efficient high dimensional maximum entropy modeling via symmetric partition functions
8 0.68996412 108 nips-2012-Efficient Bayes-Adaptive Reinforcement Learning using Sample-Based Search
9 0.68926144 344 nips-2012-Timely Object Recognition
10 0.68782079 173 nips-2012-Learned Prioritization for Trading Off Accuracy and Speed
11 0.6809836 331 nips-2012-Symbolic Dynamic Programming for Continuous State and Observation POMDPs
12 0.67915076 292 nips-2012-Regularized Off-Policy TD-Learning
13 0.67887646 70 nips-2012-Clustering by Nonnegative Matrix Factorization Using Graph Random Walk
14 0.67872465 64 nips-2012-Calibrated Elastic Regularization in Matrix Completion
15 0.6773473 38 nips-2012-Algorithms for Learning Markov Field Policies
16 0.67687643 83 nips-2012-Controlled Recognition Bounds for Visual Learning and Exploration
17 0.67562431 162 nips-2012-Inverse Reinforcement Learning through Structured Classification
18 0.67556196 122 nips-2012-Exploration in Model-based Reinforcement Learning by Empirically Estimating Learning Progress
19 0.67491436 252 nips-2012-On Multilabel Classification and Ranking with Partial Feedback
20 0.67458928 353 nips-2012-Transferring Expectations in Model-based Reinforcement Learning