nips nips2012 nips2012-255 knowledge-graph by maker-knowledge-mining

255 nips-2012-On the Use of Non-Stationary Policies for Stationary Infinite-Horizon Markov Decision Processes


Source: pdf

Author: Bruno Scherrer, Boris Lesner

Abstract: We consider infinite-horizon stationary γ-discounted Markov Decision Processes, for which it is known that there exists a stationary optimal policy. Using Value and Policy Iteration with some error ε at each iteration, it is well-known that one can compute stationary policies that are 2γε/(1 − γ)²-optimal. After arguing that this guarantee is tight, we develop variations of Value and Policy Iteration for computing non-stationary policies that can be up to 2γε/(1 − γ)-optimal, which constitutes a significant improvement in the usual situation when γ is close to 1. Surprisingly, this shows that the problem of “computing near-optimal non-stationary policies” is much simpler than that of “computing near-optimal stationary policies”. 1

Reference: text


Summary: the most important sentences generated by the tfidf model

sentIndex sentText sentNum sentScore

1 Abstract We consider infinite-horizon stationary γ-discounted Markov Decision Processes, for which it is known that there exists a stationary optimal policy. [sent-5, score-0.327]

2 Using Value and Policy Iteration with some error ε at each iteration, it is well-known that one can compute stationary policies that are 2γε/(1 − γ)²-optimal. [sent-6, score-0.442]

3 After arguing that this guarantee is tight, we develop variations of Value and Policy Iteration for computing non-stationary policies that can be up to 2γε/(1 − γ)-optimal, which constitutes a significant improvement in the usual situation when γ is close to 1. [sent-7, score-0.345]

4 Surprisingly, this shows that the problem of “computing near-optimal non-stationary policies” is much simpler than that of “computing near-optimal stationary policies”. [sent-8, score-0.153]

5 At each iteration k, the term εk accounts for a possible approximation of the Bellman operator (for AVI) or the evaluation of vπk (for API). [sent-10, score-0.089]

6 Under this assumption, it is well-known that both algorithms share the following performance bound (see [25, 11, 4] for AVI and [4] for API): Theorem 1. [sent-12, score-0.059]

7 any policy πk in G(vk−1) instead of the optimal policy π∗ satisfies lim sup_{k→∞} ‖v∗ − vπk‖∞ ≤ 2γε/(1 − γ)². [sent-15, score-1.048]

8 The constant 2γ/(1 − γ)² can be very big, in particular when γ is close to 1, and consequently the above bound is commonly believed to be conservative for practical applications. [sent-16, score-0.059]

9 Indeed, the bound (and the (1 − γ)² constant) are tight for API [4, Example 6. [sent-18, score-0.106]

10 4], and we will show in Section 3 – to our knowledge, this has never been argued in the literature – that it is also tight for AVI. [sent-19, score-0.047]

11 1 Even though the theory of optimal control states that there exists a stationary policy that is optimal, the main contribution of our paper is to show that looking for a non-stationary policy (instead of a stationary one) may lead to a much better performance bound. [sent-20, score-1.364]

12 In Section 4, we will show how to deduce such a non-stationary policy from a run of AVI. [sent-21, score-0.529]

13 In Section 5, we will describe two original policy iteration variations that compute non-stationary policies. [sent-22, score-0.569]

14 For all these algorithms, we will prove that we have a performance bound that can be reduced down to 2γε/(1 − γ). [sent-23, score-0.079]

15 This is a factor 1/(1 − γ) better than the standard bound of Theorem 1, which is significant when γ is close to 1. [sent-24, score-0.059]
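
As a rough numerical illustration (ours, not taken from the paper): with γ = 0.99 and an error level ε = 0.1, the two asymptotic guarantees differ by two orders of magnitude,

\[
\frac{2\gamma\varepsilon}{(1-\gamma)^2} = \frac{2(0.99)(0.1)}{(0.01)^2} = 1980,
\qquad
\frac{2\gamma\varepsilon}{1-\gamma} = \frac{2(0.99)(0.1)}{0.01} = 19.8 .
\]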

16 Surprisingly, this will show that the problem of “computing near-optimal non-stationary policies” is much simpler than that of “computing near-optimal stationary policies”. [sent-25, score-0.153]

17 A stationary deterministic policy π : S → A maps states to actions. [sent-28, score-0.7]

18 We write rπ(s) = r(s, π(s)) and Pπ(ds′|s) = P(ds′|s, π(s)) for the immediate reward and the stochastic kernel associated with policy π. [sent-29, score-0.543]

19 The value vπ of a policy π is a function mapping states to the expected discounted sum of rewards received when following π from any state: for all s ∈ S, vπ(s) = E[∑_{t=0}^{∞} γ^t rπ(st) | s0 = s, st+1 ∼ Pπ(·|st)]. [sent-30, score-0.579]

20 It is well-known that vπ can be characterized as the unique fixed point of the linear Bellman operator associated to a policy π: Tπ : v → rπ + γPπ v. [sent-32, score-0.532]
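
Since Tπ is affine, vπ can also be computed exactly in the tabular case by a linear solve. The following is a minimal sketch of both facts (ours, with hypothetical array names r_pi and P_pi), not code from the paper:

    import numpy as np

    def bellman_op(v, r_pi, P_pi, gamma):
        # Apply the linear Bellman operator T_pi : v -> r_pi + gamma * P_pi v.
        return r_pi + gamma * P_pi @ v

    def evaluate_policy(r_pi, P_pi, gamma):
        # v_pi is the unique fixed point of T_pi, i.e. the solution of (I - gamma P_pi) v = r_pi.
        S = len(r_pi)
        return np.linalg.solve(np.eye(S) - gamma * P_pi, r_pi)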

21 Similarly, the Bellman optimality operator T : v → maxπ Tπ v has as unique fixed point the optimal value v∗ = maxπ vπ . [sent-33, score-0.068]

22 A policy π is said to be greedy w.r.t. a value function v if Tπ v = T v; the set of such greedy policies is written G(v). [sent-37, score-0.352]

23 Finally, a policy π∗ is optimal, with value vπ∗ = v∗ , iff π∗ ∈ G(v∗ ), or equivalently Tπ∗ v∗ = v∗ . [sent-38, score-0.527]

24 Though it is known [24, 4] that there always exists a deterministic stationary policy that is optimal, we will, in this article, consider non-stationary policies and now introduce related notations. [sent-39, score-0.964]

25 Given a sequence π1, . . . , πk of k stationary policies (this sequence will be clear in the context we describe later), for any 1 ≤ m ≤ k we will denote by πk,m the periodic non-stationary policy that takes the first action according to πk, the second according to πk−1, . [sent-43, score-1.107]

26 Formally, this can be written as πk,m = πk πk−1 · · · πk−m+1 πk πk−1 · · · πk−m+1 · · · It is straightforward to show that the value vπk,m of this periodic non-stationary policy πk,m is the unique fixed point of the following operator: Tk,m = Tπk Tπk−1 · · · Tπk−m+1 . [sent-47, score-0.595]
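
Under the same tabular assumptions as the sketch above, the fixed point of Tk,m can be approximated by repeatedly applying the m operators in the right order (Tk,m is a γ^m-contraction); this is our own illustration, not the paper's code:

    import numpy as np

    def evaluate_periodic(cycle, gamma, sweeps=500):
        # `cycle` lists (r_pi, P_pi) for pi_k, pi_{k-1}, ..., pi_{k-m+1}, i.e. in the
        # order in which the periodic policy pi_{k,m} executes them.
        S = len(cycle[0][0])
        v = np.zeros(S)
        for _ in range(sweeps):
            # One application of T_{k,m} = T_{pi_k} T_{pi_{k-1}} ... T_{pi_{k-m+1}}:
            # the innermost operator T_{pi_{k-m+1}} acts on v first.
            for r_pi, P_pi in reversed(cycle):
                v = r_pi + gamma * P_pi @ v
        return v  # approximately v_{pi_{k,m}}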

27 3 Tightness of the performance bound of Theorem 1 The bound of Theorem 1 is tight for API in the sense that there exists an MDP [4, Example 6. [sent-50, score-0.165]

28 In state 1 there is only one self-loop action with zero reward; for each state i > 1 there are two possible choices: either move to state i − 1 with zero reward, or stay (with reward ri = −2ε(γ − γ^i)/(1 − γ), as shown in Figure 1). [sent-62, score-0.256]

29 Figure 1: The deterministic MDP for which the bound of Theorem 1 is tight for Value and Policy Iteration. [sent-68, score-0.106]

30 Clearly the optimal policy in all states i > 1 is to move to i − 1, and the optimal value function v∗ is 0 in all states. [sent-70, score-0.626]

31 Starting with v0 = v∗, we are going to show that for all iterations k ≥ 1 it is possible to have a policy πk+1 ∈ G(vk) which moves in every state but k + 1, and thus is such that vπk+1(k + 1) = rk+1/(1 − γ) = −2ε(γ − γ^{k+1})/(1 − γ)², which meets the bound of Theorem 1 when k tends to infinity. [sent-71, score-0.658]

32 To do so, we assume that the following approximation errors are made at each iteration k > 0: εk(i) = −ε if i = k, εk(i) = ε if i = k + 1, and εk(i) = 0 otherwise. [sent-72, score-0.063]

33 With this error, we are now going to prove by induction on k that for all k ≥ 1, vk(i) = −γ^{k−1} ε if i < k, vk(i) = rk/2 − ε if i = k, vk(i) = −(rk/2 − ε) if i = k + 1, and vk(i) = 0 otherwise. [sent-73, score-0.654]

34 Since v0 = 0 the best action is clearly to move in every state i ≥ 2, which gives v1 = v0 + ε1 = ε1 and establishes the claim for k = 1. [sent-74, score-0.142]

35 Assuming that our induction claim holds for k, we now show that it also holds for k + 1. [sent-75, score-0.069]

36 For the move action, write q^m_k for its action-value function. [sent-76, score-0.19]

37 For all i > 1 we have q^m_k(i) = 0 + γ vk(i − 1), hence q^m_k(i) = γ(−γ^{k−1} ε) = −γ^k ε if i = 2, . . . , k, and q^m_k(i) = γ(rk/2 − ε) = rk+1/2 if i = k + 1. [sent-77, score-0.158]

38 Further, q^m_k(i) = −γ(rk/2 − ε) = −rk+1/2 if i = k + 2, and q^m_k(i) = 0 otherwise. For the stay action, write q^s_k for its action-value function. [sent-81, score-0.364]

39 For all i > 0 we have q^s_k(i) = ri + γ vk(i), hence q^s_k(i) = ri + γ(−γ^{k−1} ε) = ri − γ^k ε if i = 1, . . . , k − 1. [sent-82, score-0.244]

40 Moreover, q^s_k(i) = rk + γ(rk/2 − ε) = rk + rk+1/2 if i = k. [sent-85, score-0.201]

41 Also, q^s_k(i) = rk+1 − γ(rk/2 − ε) = rk+1 − rk+1/2 = rk+1/2 if i = k + 1, q^s_k(i) = rk+2 + γ·0 = rk+2 if i = k + 2, and q^s_k(i) = ri otherwise. First, only the stay action is available in state 1, hence, since r1 = 0 and εk+1(1) = 0, we have vk+1(1) = q^s_k(1) + εk+1(1) = −γ^k ε, as desired. [sent-86, score-0.299]

42 Second, since ri < 0 for all i > 1 we have q^m_k(i) > q^s_k(i) for all these states but k + 1, where q^m_k(k + 1) = q^s_k(k + 1) = rk+1/2. [sent-87, score-0.7]

43 Using the fact that vk+1 = max(q^m_k, q^s_k) + εk+1 gives the result for vk+1. [sent-88, score-0.158]
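
The construction can be checked numerically; the sketch below is ours (not from the paper) and assumes the stay rewards ri = −2ε(γ − γ^i)/(1 − γ) read off Figure 1. It runs AVI with the adversarial errors εk and verifies the loss incurred by a greedy policy that breaks the tie in favour of "stay":

    import numpy as np

    gamma, eps, N = 0.9, 0.1, 40
    i = np.arange(1, N + 1)                                    # states 1..N (a long chain)
    r_stay = -2 * eps * (gamma - gamma ** i) / (1 - gamma)     # stay reward, r_1 = 0

    v = np.zeros(N)                                            # v_0 = v_* = 0
    for k in range(1, N - 1):
        # Action-values with respect to v_{k-1}.
        q_stay = r_stay + gamma * v
        q_move = np.concatenate(([q_stay[0]], gamma * v[:-1]))  # state 1 has only "stay"
        if k >= 2:
            # pi_k in G(v_{k-1}) may break the tie at state k in favour of "stay",
            # then loops there forever, so its loss at state k is -r_k / (1 - gamma).
            assert np.isclose(q_stay[k - 1], q_move[k - 1])
            loss = -r_stay[k - 1] / (1 - gamma)
            assert np.isclose(loss, 2 * eps * (gamma - gamma ** k) / (1 - gamma) ** 2)
        # AVI step with error: v_k = T v_{k-1} + eps_k.
        v = np.maximum(q_stay, q_move)
        v[k - 1] -= eps                                        # eps_k(k)   = -eps
        v[k] += eps                                            # eps_k(k+1) = +eps

    print("loss approaches", 2 * gamma * eps / (1 - gamma) ** 2)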

44 Since Example 1 shows that the bound of Theorem 1 is tight, improving the performance bounds requires modifying the algorithms. [sent-90, score-0.087]

45 The following sections of the paper show that considering non-stationary policies instead of stationary policies is an interesting path to follow. [sent-91, score-0.746]

46 4 Deducing a non-stationary policy from AVI While AVI (Equation (1)) is usually considered as generating a sequence of values v0 , v1 , . [sent-92, score-0.527]

47 , vk−1 , it also implicitly produces a sequence1 of policies π1 , π2 , . [sent-95, score-0.289]

48 Instead of outputting only the last policy πk , we here simply propose to output the periodic non-stationary policy πk,m that loops over the last m generated policies. [sent-102, score-1.152]
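
A minimal tabular sketch of this output rule (ours, with hypothetical inputs P of shape (A, S, S) and r of shape (A, S), not the paper's code); the only change to standard AVI is that the greedy policies are stored and the last m of them are returned as a cycle:

    import numpy as np

    def avi_nonstationary(P, r, gamma, k, m):
        A, S, _ = P.shape
        v = np.zeros(S)                             # v_0
        policies = []
        for _ in range(k):
            q = r + gamma * P @ v                   # q[a, s] w.r.t. the current value
            policies.append(np.argmax(q, axis=0))   # some pi_j in G(v_{j-1})
            v = np.max(q, axis=0)                   # exact backup; add noise here to model eps_j
        # pi_{k,m}: act with pi_k first, then pi_{k-1}, ..., pi_{k-m+1}, and repeat.
        cycle = policies[-m:][::-1]
        return cycle                                # action at time t in state s: cycle[t % m][s]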

49 The following theorem shows that it is indeed a good idea. [sent-103, score-0.045]

50 For all iterations k and m such that 1 ≤ m ≤ k, the loss of running the non-stationary policy πk,m instead of the optimal policy π∗ satisfies: ‖v∗ − vπk,m‖∞ ≤ (2/(1 − γ^m)) [ ((γ − γ^k)/(1 − γ)) ε + γ^k ‖v∗ − v0‖∞ ]. [sent-105, score-1.111]

51 When m = 1 and k tends to infinity, one exactly recovers the result of Theorem 1. [sent-106, score-0.052]

52 For general m, this new bound is a factor (1 − γ^m)/(1 − γ) better than the standard bound of Theorem 1. [sent-107, score-0.118]

53 The choice that optimizes the bound, m = k, which consists in looping over all the policies generated from the very start, leads to the following bound: ‖v∗ − vπk,k‖∞ ≤ 2(γ − γ^k)ε/((1 − γ)(1 − γ^k)) + 2γ^k ‖v∗ − v0‖∞/(1 − γ^k), which tends to 2γε/(1 − γ) when k tends to ∞. [sent-108, score-0.41]

54 The rest of the section is devoted to the proof of Theorem 2. [sent-109, score-0.031]

55 We are now ready to prove the main result of this section. [sent-117, score-0.039]

56 Using the fact that T is a contraction in max-norm, we have: ‖v∗ − vk‖∞ = ‖v∗ − T vk−1 − εk‖∞ ≤ ‖T v∗ − T vk−1‖∞ + ε ≤ γ‖v∗ − vk−1‖∞ + ε. [sent-119, score-0.479]

57 1 A given sequence of value functions may induce many sequences of policies since more than one greedy policy may exist for one particular value function. [sent-120, score-0.9]

58 Our results hold for all such possible choices of greedy policies. [sent-121, score-0.042]

59 Then, by induction on k, we have that for all k ≥ 1, ‖v∗ − vk‖∞ ≤ γ^k ‖v∗ − v0‖∞ + ((1 − γ^k)/(1 − γ)) ε. [sent-122, score-0.531]
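
Spelling out the induction (our reconstruction of the omitted step), the contraction inequality above unrolls as

\[
\|v_* - v_k\|_\infty \le \gamma\|v_* - v_{k-1}\|_\infty + \varepsilon
\le \dots \le
\gamma^k\|v_* - v_0\|_\infty + (1 + \gamma + \dots + \gamma^{k-1})\,\varepsilon
= \gamma^k\|v_* - v_0\|_\infty + \frac{1-\gamma^k}{1-\gamma}\,\varepsilon .
\]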

60 API algorithms for computing non-stationary policies We now present similar results that have a Policy Iteration flavour. [sent-124, score-0.289]

61 Unlike in the previous section where only the output of AVI needed to be changed, improving the bound for an API-like algorithm is slightly more involved. [sent-125, score-0.059]

62 In this section, we describe and analyze two API algorithms that output non-stationary policies with improved performance bounds. [sent-126, score-0.289]

63 The algorithm loops over all the policies π1, . . . , πk generated from the very start: vk ← vπk,k + εk, πk+1 ← any element of G(vk), where the initial (stationary) policy π1,1 is chosen arbitrarily. [sent-130, score-0.985]

64 Thus, iteration after iteration, the nonstationary policy πk,k is made of more and more stationary policies, and this is why we refer to it as having a growing period. [sent-131, score-0.78]
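
A tabular sketch of this growing-period loop (ours, not the paper's implementation; the evaluation step is done approximately by sweeping the composed operator, and the error term εk is simply omitted):

    import numpy as np

    def api_growing_period(P, r, gamma, iters, eval_sweeps=200):
        A, S, _ = P.shape
        states = np.arange(S)
        policies = [np.zeros(S, dtype=int)]            # pi_1 chosen arbitrarily
        for _ in range(iters):
            v = np.zeros(S)                            # approximate v_{pi_{k,k}}
            for _ in range(eval_sweeps):
                for pi in reversed(policies):          # apply T_{pi_1} first, ..., T_{pi_k} last
                    v = r[pi, states] + gamma * P[pi, states] @ v
            q = r + gamma * P @ v
            policies.insert(0, np.argmax(q, axis=0))   # pi_{k+1}: any element of G(v_k)
        return policies                                # newest first: cycle pi_k, pi_{k-1}, ..., pi_1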

65 We can prove the following performance bound for this algorithm: Theorem 3. [sent-132, score-0.079]

66 After k iterations, the loss of running the non-stationary policy πk,k instead of the optimal policy π∗ satisfies: ‖v∗ − vπk,k‖∞ ≤ (2(γ − γ^k)/(1 − γ)) ε + γ^{k−1} ‖v∗ − vπ1,1‖∞ + 2(k − 1)γ^k Vmax. When k tends to infinity, this bound tends to 2γε/(1 − γ), a factor 1/(1 − γ) better than the original API bound. [sent-133, score-1.211]

67 Finally, by induction on k, we obtain: ‖v∗ − vπk,k‖∞ ≤ (2(γ − γ^k)/(1 − γ)) ε + γ^{k−1} ‖v∗ − vπ1,1‖∞ + 2(k − 1)γ^k Vmax. [sent-138, score-0.052]

68 Unlike the previous API algorithm, the non-stationary policy πk,m here only involves the last m greedy stationary policies instead of all of them, and is thus of fixed period. [sent-142, score-1.022]
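
The fixed-period variant differs only in keeping a sliding window of the last m greedy policies; a hypothetical sketch mirroring the previous one (again ours, not the paper's code):

    import numpy as np

    def api_fixed_period(P, r, gamma, iters, m, eval_sweeps=200):
        A, S, _ = P.shape
        states = np.arange(S)
        policies = [np.zeros(S, dtype=int)] * m        # an arbitrary initial periodic policy
        for _ in range(iters):
            v = np.zeros(S)                            # approximate value of the current periodic policy
            for _ in range(eval_sweeps):
                for pi in reversed(policies):
                    v = r[pi, states] + gamma * P[pi, states] @ v
            q = r + gamma * P @ v
            policies = [np.argmax(q, axis=0)] + policies[:-1]   # keep only the last m policies
        return policies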

69 For all m, for all k ≥ m, the loss of running the non-stationary policy πk,m instead of the optimal policy π∗ satisfies: ‖v∗ − vπk,m‖∞ ≤ γ^{k−m} ‖v∗ − vπm,m‖∞ + 2(γ − γ^{k+1−m})ε/((1 − γ)(1 − γ^m)). [sent-145, score-1.048]

70 When m = 1 and k tends to infinity, we recover exactly the bound of Theorem 1. [sent-146, score-0.111]

71 When m > 1 and k tends to infinity, this bound coincides with that of Theorem 2 for our non-stationary version of AVI: it is a factor (1 − γ^m)/(1 − γ) better than the standard bound of Theorem 1. [sent-147, score-0.188]

72 1−γ The rest of this section develops the proof of this performance bound. [sent-148, score-0.031]

73 A central argument of our proof is the following lemma, which shows that similarly to the standard API, our new algorithm has an (approximate) policy improvement property. [sent-149, score-0.554]

74 At each iteration of the algorithm, the value vπk+1,m of the non-stationary policy πk+1,m = πk+1 πk . [sent-151, score-0.59]

75 cannot be much worse than the value vπ′k,m of the non-stationary policy π′k,m = πk−m+1 πk . [sent-160, score-0.527]

76 The policy π′k,m differs from πk+1,m in that every m steps, it chooses the oldest policy πk−m+1 instead of the newest one πk+1. [sent-170, score-1.027]

77 Also π′k,m is related to πk,m as follows: π′k,m takes the first action according to πk−m+1 and then runs πk,m ; equivalently, since πk,m loops over πk πk−1 . [sent-171, score-0.108]

78 When there is no error (when ε = 0), this shows that the new policy πk+1,m is better than a “rotation” of πk,m. [sent-175, score-0.506]

79 When m = 1, πk+1,m = πk+1 and π′k,m = πk, and we thus recover the well-known (approximate) policy improvement theorem for standard API (see for instance [4, Lemma 6. [sent-176, score-0.551]

80 Since π′k,m takes the first action with respect to πk−m+1 and then runs πk,m, we have vπ′k,m = Tπk−m+1 vπk,m. [sent-179, score-0.07]

81 from which we deduce that: vπ′k,m − vπk+1,m ≤ (I − Γk+1,m)^{−1} γ(Pπk+1 − Pπk−m+1) εk, and the result follows by using the facts that ‖εk‖∞ ≤ ε and ‖(I − Γk+1,m)^{−1}‖∞ = 1/(1 − γ^m). [sent-181, score-0.052]

82 We are now ready to prove the main result of this section. [sent-182, score-0.039]

83 Consider the policy π′k,m defined in Lemma 2. [sent-185, score-0.506]

84 By using the facts that v∗ ≥ vπk,m, v∗ ≥ vπk+1,m and Lemma 2, we get ‖v∗ − vπk+1,m‖∞ ≤ γ‖v∗ − vπk,m‖∞ + 2γε + γ^m (2γε)/(1 − γ^m) = γ‖v∗ − vπk,m‖∞ + 2γε/(1 − γ^m). [sent-187, score-0.052]

85 Finally, we obtain by induction that for all k ≥ m, ‖v∗ − vπk,m‖∞ ≤ γ^{k−m} ‖v∗ − vπm,m‖∞ + 2(γ − γ^{k+1−m})ε/((1 − γ)(1 − γ^m)). [sent-188, score-0.052]
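
Unrolling the recursion just established, ‖v∗ − vπk+1,m‖∞ ≤ γ‖v∗ − vπk,m‖∞ + 2γε/(1 − γ^m), from k = m (our reconstruction of the omitted step) gives

\[
\|v_*-v_{\pi_{k,m}}\|_\infty
\le \gamma^{k-m}\|v_*-v_{\pi_{m,m}}\|_\infty
+ \big(1+\gamma+\dots+\gamma^{k-m-1}\big)\,\frac{2\gamma\varepsilon}{1-\gamma^m}
= \gamma^{k-m}\|v_*-v_{\pi_{m,m}}\|_\infty
+ \frac{2(\gamma-\gamma^{k+1-m})\,\varepsilon}{(1-\gamma)(1-\gamma^m)} .
\]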

86 6 Discussion, conclusion and future work We recalled in Theorem 1 the standard performance bound when computing an approximately optimal stationary policy with the standard AVI and API algorithms. [sent-189, score-0.739]

87 From a bibliographical point of view, it is the work of [14] that made us think that non-stationary policies may lead to better performance bounds. [sent-191, score-0.289]

88 In that work, the author considers problems with a finite-horizon T for which one computes non-stationary policies with performance bounds in O(T), and infinite-horizon problems for which one computes stationary policies with performance bounds in O(1/(1 − γ)²). [sent-192, score-0.787]

89 Using the informal equivalence of the horizons T ≈ 1/(1 − γ), one sees that non-stationary policies look better than stationary policies. [sent-193, score-0.477]

90 In [14], non-stationary policies are only computed in the context of finite-horizon (and thus non-stationary) problems; the fact that nonstationary policies can also be useful in an infinite-horizon stationary context is to our knowledge completely new. [sent-194, score-0.773]

91 The best performance improvements are obtained when our algorithms consider periodic nonstationary policies of which the period grows to infinity, and thus require an infinite memory, which may look like a practical limitation. [sent-195, score-0.443]

92 In practice, it is easy to see that by choosing m = 1/(1 − γ), that is, a memory that scales linearly with the horizon (and thus the difficulty) of the problem, one can get a performance bound of² 2γε/((1 − e^{−1})(1 − γ)) ≤ 3.164 γε/(1 − γ). [sent-197, score-0.075]
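
One way to recover this constant (our reconstruction of the argument): since ln(1 − x) ≤ −x, choosing m = 1/(1 − γ) gives γ^m ≤ e^{−1}, hence

\[
\frac{2\gamma\varepsilon}{(1-\gamma)(1-\gamma^m)}
\le \frac{2\gamma\varepsilon}{(1-e^{-1})(1-\gamma)}
\approx \frac{3.164\,\gamma\varepsilon}{1-\gamma} .
\]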

93 We conjecture that our asymptotic bound of 2γε/(1 − γ), and the non-asymptotic bounds of Theorems 2 and 4, are tight. [sent-199, score-0.106]

94 The actual proof of this conjecture is left for future work. [sent-200, score-0.05]

95 Important recent works in the literature study performance bounds when the errors are controlled in Lp norms instead of the max-norm [19, 20, 21, 1, 8, 18, 17], which is natural when supervised learning algorithms are used to approximate the evaluation steps of AVI and API. [sent-201, score-0.068]

96 Since our proofs are based on componentwise bounds like those of the pioneering works on this topic [19, 20], we believe that the extension of our analysis to Lp norms is straightforward. [sent-202, score-0.059]

97 Learning near-optimal policies with Bellman-residual minimization based fitted policy iteration and a single sample path. [sent-208, score-0.873]

98 Approximate policy iteration: a survey and some new methods. [sent-220, score-0.506]

99 Error propagation for approximate policy and value iteration (extended version). [sent-266, score-0.615]

100 Finite time bounds for sampling based fitted value iteration. [sent-337, score-0.049]


similar papers computed by tfidf model

tfidf for this paper:

wordName wordTfidf (topN-words)

[('policy', 0.506), ('vk', 0.479), ('api', 0.412), ('policies', 0.289), ('avi', 0.269), ('qk', 0.158), ('stationary', 0.153), ('tk', 0.115), ('vmax', 0.111), ('rk', 0.085), ('action', 0.07), ('periodic', 0.068), ('iteration', 0.063), ('bound', 0.059), ('induction', 0.052), ('tends', 0.052), ('facts', 0.052), ('bellman', 0.051), ('stay', 0.048), ('ghavamzadeh', 0.047), ('lazaric', 0.047), ('tight', 0.047), ('theorem', 0.045), ('munos', 0.044), ('ri', 0.043), ('greedy', 0.042), ('nonstationary', 0.042), ('lemma', 0.039), ('arguing', 0.039), ('loops', 0.038), ('szepesv', 0.038), ('reward', 0.037), ('nity', 0.037), ('mdp', 0.034), ('move', 0.032), ('proof', 0.031), ('farahmand', 0.03), ('jmlr', 0.029), ('bounds', 0.028), ('reinforcement', 0.028), ('discounted', 0.027), ('operator', 0.026), ('lp', 0.026), ('tightness', 0.026), ('ds', 0.026), ('rmax', 0.025), ('approximate', 0.025), ('period', 0.025), ('states', 0.025), ('dynamic', 0.024), ('deduce', 0.023), ('state', 0.023), ('decision', 0.022), ('factored', 0.022), ('value', 0.021), ('finite', 0.021), ('optimal', 0.021), ('guestrin', 0.021), ('sequence', 0.021), ('inria', 0.02), ('prove', 0.02), ('ready', 0.019), ('conjecture', 0.019), ('look', 0.019), ('discount', 0.019), ('st', 0.018), ('tted', 0.018), ('coincides', 0.018), ('rotation', 0.018), ('going', 0.018), ('conference', 0.017), ('france', 0.017), ('lagoudakis', 0.017), ('puting', 0.017), ('gabillon', 0.017), ('lauderdale', 0.017), ('looping', 0.017), ('boris', 0.017), ('last', 0.017), ('koller', 0.017), ('claim', 0.017), ('big', 0.017), ('argument', 0.017), ('programming', 0.016), ('satis', 0.016), ('growing', 0.016), ('azar', 0.016), ('asian', 0.016), ('wiering', 0.016), ('scherrer', 0.016), ('bruno', 0.016), ('petrik', 0.016), ('horizons', 0.016), ('tsitsiklis', 0.016), ('memory', 0.016), ('deterministic', 0.016), ('residual', 0.015), ('instead', 0.015), ('arti', 0.015), ('biasing', 0.015)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 1.0 255 nips-2012-On the Use of Non-Stationary Policies for Stationary Infinite-Horizon Markov Decision Processes

Author: Bruno Scherrer, Boris Lesner

Abstract: We consider infinite-horizon stationary γ-discounted Markov Decision Processes, for which it is known that there exists a stationary optimal policy. Using Value and Policy Iteration with some error ε at each iteration, it is well-known that one can compute stationary policies that are 2γε/(1 − γ)²-optimal. After arguing that this guarantee is tight, we develop variations of Value and Policy Iteration for computing non-stationary policies that can be up to 2γε/(1 − γ)-optimal, which constitutes a significant improvement in the usual situation when γ is close to 1. Surprisingly, this shows that the problem of “computing near-optimal non-stationary policies” is much simpler than that of “computing near-optimal stationary policies”. 1

2 0.3236964 38 nips-2012-Algorithms for Learning Markov Field Policies

Author: Abdeslam Boularias, Jan R. Peters, Oliver B. Kroemer

Abstract: We use a graphical model for representing policies in Markov Decision Processes. This new representation can easily incorporate domain knowledge in the form of a state similarity graph that loosely indicates which states are supposed to have similar optimal actions. A bias is then introduced into the policy search process by sampling policies from a distribution that assigns high probabilities to policies that agree with the provided state similarity graph, i.e. smoother policies. This distribution corresponds to a Markov Random Field. We also present forward and inverse reinforcement learning algorithms for learning such policy distributions. We illustrate the advantage of the proposed approach on two problems: cart-balancing with swing-up, and teaching a robot to grasp unknown objects. 1

3 0.26629806 3 nips-2012-A Bayesian Approach for Policy Learning from Trajectory Preference Queries

Author: Aaron Wilson, Alan Fern, Prasad Tadepalli

Abstract: We consider the problem of learning control policies via trajectory preference queries to an expert. In particular, the agent presents an expert with short runs of a pair of policies originating from the same state and the expert indicates which trajectory is preferred. The agent’s goal is to elicit a latent target policy from the expert with as few queries as possible. To tackle this problem we propose a novel Bayesian model of the querying process and introduce two methods that exploit this model to actively select expert queries. Experimental results on four benchmark problems indicate that our model can effectively learn policies from trajectory preference queries and that active query selection can be substantially more efficient than random selection. 1

4 0.26597294 348 nips-2012-Tractable Objectives for Robust Policy Optimization

Author: Katherine Chen, Michael Bowling

Abstract: Robust policy optimization acknowledges that risk-aversion plays a vital role in real-world decision-making. When faced with uncertainty about the effects of actions, the policy that maximizes expected utility over the unknown parameters of the system may also carry with it a risk of intolerably poor performance. One might prefer to accept lower utility in expectation in order to avoid, or reduce the likelihood of, unacceptable levels of utility under harmful parameter realizations. In this paper, we take a Bayesian approach to parameter uncertainty, but unlike other methods avoid making any distributional assumptions about the form of this uncertainty. Instead we focus on identifying optimization objectives for which solutions can be efficiently approximated. We introduce percentile measures: a very general class of objectives for robust policy optimization, which encompasses most existing approaches, including ones known to be intractable. We then introduce a broad subclass of this family for which robust policies can be approximated efficiently. Finally, we frame these objectives in the context of a two-player, zero-sum, extensive-form game and employ a no-regret algorithm to approximate an optimal policy, with computation only polynomial in the number of states and actions of the MDP. 1

5 0.24069692 162 nips-2012-Inverse Reinforcement Learning through Structured Classification

Author: Edouard Klein, Matthieu Geist, Bilal Piot, Olivier Pietquin

Abstract: This paper adresses the inverse reinforcement learning (IRL) problem, that is inferring a reward for which a demonstrated expert behavior is optimal. We introduce a new algorithm, SCIRL, whose principle is to use the so-called feature expectation of the expert as the parameterization of the score function of a multiclass classifier. This approach produces a reward function for which the expert policy is provably near-optimal. Contrary to most of existing IRL algorithms, SCIRL does not require solving the direct RL problem. Moreover, with an appropriate heuristic, it can succeed with only trajectories sampled according to the expert behavior. This is illustrated on a car driving simulator. 1

6 0.23216219 160 nips-2012-Imitation Learning by Coaching

7 0.22511537 173 nips-2012-Learned Prioritization for Trading Off Accuracy and Speed

8 0.2195354 259 nips-2012-Online Regret Bounds for Undiscounted Continuous Reinforcement Learning

9 0.21179424 358 nips-2012-Value Pursuit Iteration

10 0.19827212 364 nips-2012-Weighted Likelihood Policy Search with Model Selection

11 0.19241923 344 nips-2012-Timely Object Recognition

12 0.17324062 234 nips-2012-Multiresolution analysis on the symmetric group

13 0.16648792 296 nips-2012-Risk Aversion in Markov Decision Processes via Near Optimal Chernoff Bounds

14 0.15355588 243 nips-2012-Non-parametric Approximate Dynamic Programming via the Kernel Method

15 0.14446773 21 nips-2012-A Unifying Perspective of Parametric Policy Search Methods for Markov Decision Processes

16 0.1421452 122 nips-2012-Exploration in Model-based Reinforcement Learning by Empirically Estimating Learning Progress

17 0.12879819 88 nips-2012-Cost-Sensitive Exploration in Bayesian Reinforcement Learning

18 0.11586288 108 nips-2012-Efficient Bayes-Adaptive Reinforcement Learning using Sample-Based Search

19 0.10014346 297 nips-2012-Robustness and risk-sensitivity in Markov decision processes

20 0.096228756 51 nips-2012-Bayesian Hierarchical Reinforcement Learning


similar papers computed by lsi model

lsi for this paper:

topicId topicWeight

[(0, 0.161), (1, -0.473), (2, -0.014), (3, -0.092), (4, -0.043), (5, 0.058), (6, 0.006), (7, -0.033), (8, 0.056), (9, -0.027), (10, 0.031), (11, -0.001), (12, -0.005), (13, 0.035), (14, 0.048), (15, -0.033), (16, 0.092), (17, -0.01), (18, 0.069), (19, 0.015), (20, -0.056), (21, -0.0), (22, 0.072), (23, -0.011), (24, -0.003), (25, 0.075), (26, -0.064), (27, 0.045), (28, -0.113), (29, 0.019), (30, -0.021), (31, 0.006), (32, -0.031), (33, -0.054), (34, 0.085), (35, -0.065), (36, -0.058), (37, 0.101), (38, -0.055), (39, 0.113), (40, -0.006), (41, 0.009), (42, -0.089), (43, 0.028), (44, 0.032), (45, -0.077), (46, -0.042), (47, 0.109), (48, 0.004), (49, -0.035)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.97500211 255 nips-2012-On the Use of Non-Stationary Policies for Stationary Infinite-Horizon Markov Decision Processes

Author: Bruno Scherrer, Boris Lesner

Abstract: We consider infinite-horizon stationary γ-discounted Markov Decision Processes, for which it is known that there exists a stationary optimal policy. Using Value and Policy Iteration with some error ε at each iteration, it is well-known that one can compute stationary policies that are 2γε/(1 − γ)²-optimal. After arguing that this guarantee is tight, we develop variations of Value and Policy Iteration for computing non-stationary policies that can be up to 2γε/(1 − γ)-optimal, which constitutes a significant improvement in the usual situation when γ is close to 1. Surprisingly, this shows that the problem of “computing near-optimal non-stationary policies” is much simpler than that of “computing near-optimal stationary policies”. 1

2 0.78389442 38 nips-2012-Algorithms for Learning Markov Field Policies

Author: Abdeslam Boularias, Jan R. Peters, Oliver B. Kroemer

Abstract: We use a graphical model for representing policies in Markov Decision Processes. This new representation can easily incorporate domain knowledge in the form of a state similarity graph that loosely indicates which states are supposed to have similar optimal actions. A bias is then introduced into the policy search process by sampling policies from a distribution that assigns high probabilities to policies that agree with the provided state similarity graph, i.e. smoother policies. This distribution corresponds to a Markov Random Field. We also present forward and inverse reinforcement learning algorithms for learning such policy distributions. We illustrate the advantage of the proposed approach on two problems: cart-balancing with swing-up, and teaching a robot to grasp unknown objects. 1

3 0.75463843 348 nips-2012-Tractable Objectives for Robust Policy Optimization

Author: Katherine Chen, Michael Bowling

Abstract: Robust policy optimization acknowledges that risk-aversion plays a vital role in real-world decision-making. When faced with uncertainty about the effects of actions, the policy that maximizes expected utility over the unknown parameters of the system may also carry with it a risk of intolerably poor performance. One might prefer to accept lower utility in expectation in order to avoid, or reduce the likelihood of, unacceptable levels of utility under harmful parameter realizations. In this paper, we take a Bayesian approach to parameter uncertainty, but unlike other methods avoid making any distributional assumptions about the form of this uncertainty. Instead we focus on identifying optimization objectives for which solutions can be efficiently approximated. We introduce percentile measures: a very general class of objectives for robust policy optimization, which encompasses most existing approaches, including ones known to be intractable. We then introduce a broad subclass of this family for which robust policies can be approximated efficiently. Finally, we frame these objectives in the context of a two-player, zero-sum, extensive-form game and employ a no-regret algorithm to approximate an optimal policy, with computation only polynomial in the number of states and actions of the MDP. 1

4 0.72613418 3 nips-2012-A Bayesian Approach for Policy Learning from Trajectory Preference Queries

Author: Aaron Wilson, Alan Fern, Prasad Tadepalli

Abstract: We consider the problem of learning control policies via trajectory preference queries to an expert. In particular, the agent presents an expert with short runs of a pair of policies originating from the same state and the expert indicates which trajectory is preferred. The agent’s goal is to elicit a latent target policy from the expert with as few queries as possible. To tackle this problem we propose a novel Bayesian model of the querying process and introduce two methods that exploit this model to actively select expert queries. Experimental results on four benchmark problems indicate that our model can effectively learn policies from trajectory preference queries and that active query selection can be substantially more efficient than random selection. 1

5 0.71070445 364 nips-2012-Weighted Likelihood Policy Search with Model Selection

Author: Tsuyoshi Ueno, Kohei Hayashi, Takashi Washio, Yoshinobu Kawahara

Abstract: Reinforcement learning (RL) methods based on direct policy search (DPS) have been actively discussed to achieve an efficient approach to complicated Markov decision processes (MDPs). Although they have brought much progress in practical applications of RL, there still remains an unsolved problem in DPS related to model selection for the policy. In this paper, we propose a novel DPS method, weighted likelihood policy search (WLPS), where a policy is efficiently learned through the weighted likelihood estimation. WLPS naturally connects DPS to the statistical inference problem and thus various sophisticated techniques in statistics can be applied to DPS problems directly. Hence, by following the idea of the information criterion, we develop a new measurement for model comparison in DPS based on the weighted log-likelihood.

6 0.70605671 350 nips-2012-Trajectory-Based Short-Sighted Probabilistic Planning

7 0.69955057 160 nips-2012-Imitation Learning by Coaching

8 0.68445951 173 nips-2012-Learned Prioritization for Trading Off Accuracy and Speed

9 0.68109387 162 nips-2012-Inverse Reinforcement Learning through Structured Classification

10 0.66282701 21 nips-2012-A Unifying Perspective of Parametric Policy Search Methods for Markov Decision Processes

11 0.64760834 243 nips-2012-Non-parametric Approximate Dynamic Programming via the Kernel Method

12 0.61081141 88 nips-2012-Cost-Sensitive Exploration in Bayesian Reinforcement Learning

13 0.60660511 259 nips-2012-Online Regret Bounds for Undiscounted Continuous Reinforcement Learning

14 0.59885079 296 nips-2012-Risk Aversion in Markov Decision Processes via Near Optimal Chernoff Bounds

15 0.57911181 358 nips-2012-Value Pursuit Iteration

16 0.57237321 122 nips-2012-Exploration in Model-based Reinforcement Learning by Empirically Estimating Learning Progress

17 0.56632978 297 nips-2012-Robustness and risk-sensitivity in Markov decision processes

18 0.55378145 108 nips-2012-Efficient Bayes-Adaptive Reinforcement Learning using Sample-Based Search

19 0.52270204 250 nips-2012-On-line Reinforcement Learning Using Incremental Kernel-Based Stochastic Factorization

20 0.45678321 51 nips-2012-Bayesian Hierarchical Reinforcement Learning


similar papers computed by lda model

lda for this paper:

topicId topicWeight

[(0, 0.013), (21, 0.012), (36, 0.017), (38, 0.181), (42, 0.029), (53, 0.014), (54, 0.091), (55, 0.011), (57, 0.176), (74, 0.07), (76, 0.108), (80, 0.099), (92, 0.062)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.85503 255 nips-2012-On the Use of Non-Stationary Policies for Stationary Infinite-Horizon Markov Decision Processes

Author: Bruno Scherrer, Boris Lesner

Abstract: We consider infinite-horizon stationary γ-discounted Markov Decision Processes, for which it is known that there exists a stationary optimal policy. Using Value and Policy Iteration with some error ε at each iteration, it is well-known that one can compute stationary policies that are 2γε/(1 − γ)²-optimal. After arguing that this guarantee is tight, we develop variations of Value and Policy Iteration for computing non-stationary policies that can be up to 2γε/(1 − γ)-optimal, which constitutes a significant improvement in the usual situation when γ is close to 1. Surprisingly, this shows that the problem of “computing near-optimal non-stationary policies” is much simpler than that of “computing near-optimal stationary policies”. 1

2 0.84629142 241 nips-2012-No-Regret Algorithms for Unconstrained Online Convex Optimization

Author: Brendan Mcmahan, Matthew Streeter

Abstract: Some of the most compelling applications of online convex optimization, including online prediction and classification, are unconstrained: the natural feasible set is R^n. Existing algorithms fail to achieve sub-linear regret in this setting unless constraints on the comparator point x̊ are known in advance. We present algorithms that, without such prior knowledge, offer near-optimal regret bounds with respect to any choice of x̊. In particular, regret with respect to x̊ = 0 is constant. We then prove lower bounds showing that our guarantees are near-optimal in this setting. 1

3 0.80142903 162 nips-2012-Inverse Reinforcement Learning through Structured Classification

Author: Edouard Klein, Matthieu Geist, Bilal Piot, Olivier Pietquin

Abstract: This paper adresses the inverse reinforcement learning (IRL) problem, that is inferring a reward for which a demonstrated expert behavior is optimal. We introduce a new algorithm, SCIRL, whose principle is to use the so-called feature expectation of the expert as the parameterization of the score function of a multiclass classifier. This approach produces a reward function for which the expert policy is provably near-optimal. Contrary to most of existing IRL algorithms, SCIRL does not require solving the direct RL problem. Moreover, with an appropriate heuristic, it can succeed with only trajectories sampled according to the expert behavior. This is illustrated on a car driving simulator. 1

4 0.79060447 108 nips-2012-Efficient Bayes-Adaptive Reinforcement Learning using Sample-Based Search

Author: Arthur Guez, David Silver, Peter Dayan

Abstract: Bayesian model-based reinforcement learning is a formally elegant approach to learning optimal behaviour under model uncertainty, trading off exploration and exploitation in an ideal way. Unfortunately, finding the resulting Bayes-optimal policies is notoriously taxing, since the search space becomes enormous. In this paper we introduce a tractable, sample-based method for approximate Bayesoptimal planning which exploits Monte-Carlo tree search. Our approach outperformed prior Bayesian model-based RL algorithms by a significant margin on several well-known benchmark problems – because it avoids expensive applications of Bayes rule within the search tree by lazily sampling models from the current beliefs. We illustrate the advantages of our approach by showing it working in an infinite state space domain which is qualitatively out of reach of almost all previous work in Bayesian exploration. 1

5 0.78989542 38 nips-2012-Algorithms for Learning Markov Field Policies

Author: Abdeslam Boularias, Jan R. Peters, Oliver B. Kroemer

Abstract: We use a graphical model for representing policies in Markov Decision Processes. This new representation can easily incorporate domain knowledge in the form of a state similarity graph that loosely indicates which states are supposed to have similar optimal actions. A bias is then introduced into the policy search process by sampling policies from a distribution that assigns high probabilities to policies that agree with the provided state similarity graph, i.e. smoother policies. This distribution corresponds to a Markov Random Field. We also present forward and inverse reinforcement learning algorithms for learning such policy distributions. We illustrate the advantage of the proposed approach on two problems: cart-balancing with swing-up, and teaching a robot to grasp unknown objects. 1

6 0.78956997 83 nips-2012-Controlled Recognition Bounds for Visual Learning and Exploration

7 0.78799677 287 nips-2012-Random function priors for exchangeable arrays with applications to graphs and relational data

8 0.78651595 344 nips-2012-Timely Object Recognition

9 0.78526896 177 nips-2012-Learning Invariant Representations of Molecules for Atomization Energy Prediction

10 0.78488153 160 nips-2012-Imitation Learning by Coaching

11 0.78033602 173 nips-2012-Learned Prioritization for Trading Off Accuracy and Speed

12 0.78003073 122 nips-2012-Exploration in Model-based Reinforcement Learning by Empirically Estimating Learning Progress

13 0.77977359 259 nips-2012-Online Regret Bounds for Undiscounted Continuous Reinforcement Learning

14 0.77946281 292 nips-2012-Regularized Off-Policy TD-Learning

15 0.77819979 17 nips-2012-A Scalable CUR Matrix Decomposition Algorithm: Lower Time Complexity and Tighter Bound

16 0.77805591 15 nips-2012-A Polylog Pivot Steps Simplex Algorithm for Classification

17 0.77670038 115 nips-2012-Efficient high dimensional maximum entropy modeling via symmetric partition functions

18 0.77662909 230 nips-2012-Multiple Choice Learning: Learning to Produce Multiple Structured Outputs

19 0.77582943 178 nips-2012-Learning Label Trees for Probabilistic Modelling of Implicit Feedback

20 0.77544552 199 nips-2012-Link Prediction in Graphs with Autoregressive Features