nips nips2013 nips2013-140 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Bruno Scherrer
Abstract: Given a Markov Decision Process (MDP) with n states and m actions per state, we study the number of iterations needed by Policy Iteration (PI) algorithms to converge to the optimal γ-discounted policy. We consider two variations of PI: Howard’s PI, which changes the actions in all states with a positive advantage, and Simplex-PI, which only changes the action in the state with maximal advantage. We show that Howard’s PI terminates after at most n(m − 1)⌈ 1/(1−γ) log(1/(1−γ)) ⌉ = O( (nm/(1−γ)) log(1/(1−γ)) ) iterations, improving by a factor O(log n) a result by [3], while Simplex-PI terminates after at most n²(m − 1)( 1 + (2/(1−γ)) log(1/(1−γ)) ) = O( (n²m/(1−γ)) log(1/(1−γ)) ) iterations, improving by a factor O(log n) a result by [11]. Under some structural assumptions of the MDP, we then consider bounds that are independent of the discount factor γ: given a measure of the maximal transient time τ_t and the maximal time τ_r to revisit states in recurrent classes under all policies, we show that Simplex-PI terminates after at most n²(m − 1)( ⌈ τ_r log(n τ_r) ⌉ + ⌈ τ_r log(n τ_t) ⌉ ) [ (m − 1)⌈ n τ_t log(n τ_t) ⌉ + ⌈ n τ_t log(n² τ_t) ⌉ ] iterations. This generalizes a recent result for deterministic MDPs by [8], in which τ_t ≤ n and τ_r ≤ n.
Reference: text
sentIndex sentText sentNum sentScore
1 Abstract Given a Markov Decision Process (MDP) with n states and m actions per state, we study the number of iterations needed by Policy Iteration (PI) algorithms to converge to the optimal γ-discounted policy. [sent-3, score-0.282]
2 We consider two variations of PI: Howard’s PI that changes the actions in all states with a positive advantage, and Simplex-PI that only changes the action in the state with maximal advantage. [sent-4, score-0.314]
3 This generalizes a recent result for deterministic MDPs by [8], in which τ_t ≤ n and τ_r ≤ n. [sent-8, score-0.083]
4 1 Introduction We consider a discrete-time dynamic system whose state transition depends on a control. [sent-11, score-0.069]
5 The control a ∈ A specifies the transition probability p_ij(a) = P(i_{t+1} = j | i_t = i, a_t = a) to the next state j. [sent-16, score-0.126]
6 In this context, we look for a stationary deterministic policy (a function π : X → A that maps states into controls) [footnote 1: in the works of [11, 8, 3] that we reference, the integer “m” denotes the total number of actions, that is nm with our notation] [sent-18, score-0.427]
7 that maximizes the expected discounted sum of rewards from any state i, called the value of policy π at state i: v_π(i) := E[ Σ_{k=0}^{∞} γ^k r(i_k, a_k, i_{k+1}) | i_0 = i, ∀k ≥ 0, a_k = π(i_k), i_{k+1} ∼ P(·|i_k, a_k) ], where γ ∈ (0, 1) is a discount factor. [sent-20, score-0.653]
8 The optimal value starting from state i is defined as v*(i) := max_π v_π(i). [sent-22, score-0.098]
9 For any policy π, we write P_π for the n × n stochastic matrix whose elements are p_ij(π(i)) and r_π the vector whose components are Σ_j p_ij(π(i)) r(i, π(i), j). [sent-23, score-0.318]
10 It is also well known that v* satisfies the following Bellman equation: v* = max_π (r_π + γ P_π v*) = max_π T_π v*, where the max operator is componentwise. [sent-26, score-0.075]
11 For any value vector v, we say that a policy π is greedy with respect to the value v if it satisfies: π ∈ argmax_{π′} T_{π′} v, or equivalently T_π v = T v. [sent-28, score-0.32]
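The following Python sketch (not code from the paper; the array names, the (m, n, n) tensor layout for transitions and rewards, and the use of NumPy are assumptions made here for illustration) shows how the quantities just defined can be computed for a small finite MDP: the value v_π = (I − γ P_π)^{-1} r_π of a fixed policy, and a greedy policy in G(v) for a given value vector v.

import numpy as np

def evaluate_policy(P, R, policy, gamma):
    # v_pi solves v_pi = r_pi + gamma * P_pi v_pi, i.e. v_pi = (I - gamma P_pi)^{-1} r_pi
    n = P.shape[1]
    P_pi = P[policy, np.arange(n), :]                         # row i is p_{i,.}(pi(i))
    r_pi = np.sum(P_pi * R[policy, np.arange(n), :], axis=1)  # r_pi(i) = sum_j p_ij(pi(i)) r(i, pi(i), j)
    return np.linalg.solve(np.eye(n) - gamma * P_pi, r_pi)

def greedy_policy(P, R, v, gamma):
    # a policy in G(v): in each state, pick an action maximizing (T_a v)(i)
    q = np.einsum('aij,aij->ai', P, R) + gamma * np.einsum('aij,j->ai', P, v)  # q[a, i] = (T_a v)(i)
    return np.argmax(q, axis=0)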
12 With some slight abuse of notation, we write G(v) for any policy that is greedy with respect to v. [sent-29, score-0.291]
13 The notions of optimal value function and greedy policies are fundamental to optimal control because of the following property: any policy π* that is greedy with respect to the optimal value v* is an optimal policy and its value v_{π*} is equal to v*. [sent-30, score-0.746]
14 We call the set of switchable states of π the following set S_π = {i : a_π(i) > 0}. [sent-33, score-0.1]
15 For any non-empty subset Y of S_π, we denote switch(π, Y) a policy satisfying: ∀i, switch(π, Y)(i) = G(v_π)(i) if i ∈ Y, and switch(π, Y)(i) = π(i) if i ∉ Y. [sent-35, score-0.24]
16 This lemma is the foundation of the well-known iterative procedure, called Policy Iteration (PI), that generates a sequence of policies (π_k) as follows. [sent-40, score-0.2]
17 The choice for the subsets Y_k leads to different variations of PI. [sent-42, score-0.106]
18 In this paper we will focus on two specific variations. Footnote 2: Restricting our attention to stationary deterministic policies is not a limitation. [sent-43, score-0.206]
19 Indeed, for the optimality criterion to be defined soon, it can be shown that there exists at least one stationary deterministic policy that is optimal [9]. [sent-44, score-0.369]
20 • When for all iterations k, Y_k = S_{π_k}, that is, one switches the actions in all states with positive advantage with respect to π_k, the above algorithm is known as Howard’s PI; it can then be seen that π_{k+1} ∈ G(v_{π_k}). [sent-45, score-0.27]
21 • When for all k, Y_k is a singleton containing a state i_k ∈ argmax_i a_{π_k}(i), that is, if we only switch one action in the state with maximal advantage with respect to π_k, we will call it Simplex-PI. [sent-46, score-0.319]
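The two variations differ only in the set Y_k of states whose action is switched at iteration k. A minimal Python sketch of both switching rules follows (illustrative only, not code from the paper; the (m, n, n) tensor layout for P and R and the numerical tolerance are assumptions made here):

import numpy as np

def policy_iteration(P, R, gamma, policy, rule="howard", max_iter=10000, tol=1e-12):
    # P, R: arrays of shape (m, n, n); policy: integer array of shape (n,)
    n = P.shape[1]
    policy = np.array(policy, dtype=int)
    for it in range(max_iter):
        P_pi = P[policy, np.arange(n), :]
        r_pi = np.sum(P_pi * R[policy, np.arange(n), :], axis=1)
        v_pi = np.linalg.solve(np.eye(n) - gamma * P_pi, r_pi)   # value of the current policy
        q = np.einsum('aij,aij->ai', P, R) + gamma * (P @ v_pi)  # q[a, i] = (T_a v_pi)(i)
        adv = q.max(axis=0) - v_pi                               # advantage a_pi(i) in each state
        if np.all(adv <= tol):
            return policy, it                                    # pi is greedy w.r.t. its own value: optimal
        if rule == "howard":
            Y = np.where(adv > tol)[0]                           # S_pi: all states with positive advantage
        else:                                                    # "simplex": one state of maximal advantage
            Y = np.array([int(np.argmax(adv))])
        policy = policy.copy()
        policy[Y] = q[:, Y].argmax(axis=0)                       # switch(pi, Y)
    return policy, max_iter

For instance, policy_iteration(P, R, 0.9, np.zeros(P.shape[1], dtype=int), rule="simplex") would run Simplex-PI from the all-zeros policy.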
22 Since it generates a sequence of policies with increasing values, any variation of PI converges to the optimal policy in a number of iterations that is smaller than the total number of policies m^n. [sent-47, score-0.574]
23 The aim of this paper is to discuss existing and provide new upper bounds on the number of iterations required by Howard’s PI and Simplex-PI that are much sharper than m^n. [sent-50, score-0.135]
24 In the next sections, we describe some known results—see [11] for a recent and comprehensive review—about the number of iterations required by Howard’s PI and Simplex-PI, along with some of our original improvements and extensions. [sent-51, score-0.085]
25 2 Bounds with respect to a Fixed Discount Factor γ < 1. A key observation for both algorithms, which will be central to the results we are about to discuss, is that the sequence they generate satisfies some contraction property (footnote 5). [sent-52, score-0.122]
26 The sequence (‖v* − v_{π_k}‖∞)_{k≥0} built by Howard’s PI is contracting with coefficient γ. [sent-56, score-0.132]
27 The sequence (1^T(v* − v_{π_k}))_{k≥0} built by Simplex-PI is contracting with coefficient 1 − (1−γ)/n. [sent-58, score-0.132]
28 Though this observation is widely known for Howard’s PI, it was to our knowledge never mentioned explicitly in the literature for Simplex-PI. [sent-59, score-0.089]
29 These contraction properties have the following immediate consequence. [sent-60, score-0.077]
30 Let V_max = max_π ‖r_π‖∞ / (1−γ) be an upper bound on ‖v_π‖∞ for all policies π. [sent-62, score-0.119]
31 In order to get an ε-optimal policy, that is, a policy π_k satisfying ‖v* − v_{π_k}‖∞ ≤ ε, Howard’s PI requires at most ⌈ log(V_max/ε) / (1−γ) ⌉ iterations, while Simplex-PI requires at most ⌈ n log(nV_max/ε) / (1−γ) ⌉ iterations. [sent-63, score-0.264]
32 These bounds depend on the precision term ε, which means that Howard’s PI and Simplex-PI are weakly polynomial for a fixed discount factor γ. [sent-64, score-0.348]
33 An important breakthrough was recently achieved by [11], who proved that one can remove the dependency with respect to ε, and thus show that Howard’s PI and Simplex-PI are strongly polynomial for a fixed discount factor γ. [sent-65, score-0.262]
34 Simplex-PI and Howard’s PI both terminate after at most n(m − 1) ⌈ n/(1−γ) log(n²/(1−γ)) ⌉ iterations. [sent-67, score-0.076]
35 Footnote 5: A sequence of non-negative numbers (x_k)_{k≥0} is contracting with coefficient α if and only if for all k ≥ 0, x_{k+1} ≤ α x_k. [sent-72, score-0.132]
36 For Howard’s PI, ‖v* − v_{π_0}‖∞ ≤ V_max, and γ^k V_max ≤ ε as soon as k ≥ log(V_max/ε)/(1−γ); for Simplex-PI, ‖v* − v_{π_0}‖_1 ≤ n V_max. [sent-77, score-0.076]
37 For Simplex-PI, we then use (1 − (1−γ)/n)^k n V_max ≤ ε, which holds as soon as k ≥ n log(nV_max/ε)/(1−γ), and the conclusion follows. The proof is based on the fact that PI corresponds to the simplex algorithm in a linear programming formulation of the MDP problem. [sent-78, score-0.169]
38 Howard’s PI terminates after at most (nm + 1) ⌈ 1/(1−γ) log(n/(1−γ)) ⌉ iterations. [sent-81, score-0.252]
39 Our first two results, that are consequences of the contraction properties (Lemmas 2 and 3), are stated in the following theorems. [sent-82, score-0.077]
40 Howard’s PI terminates after at most n(m − 1) ⌈ 1/(1−γ) log(1/(1−γ)) ⌉ iterations. [sent-84, score-0.252]
41 Simplex-PI terminates after at most n(m − 1) ⌈ n/(1−γ) log(n/(1−γ)) ⌉ iterations. [sent-86, score-0.252]
42 Our result for Simplex-PI is only very slightly better (by a factor 2) than that of [11], and uses a proof that is more direct. [sent-88, score-0.089]
43 Using a more refined argument, we managed to also improve the bound for Simplex-PI by a factor O(log n). [sent-89, score-0.059]
44 Simplex-PI terminates after at most n²(m − 1) ( 1 + (2/(1−γ)) log(1/(1−γ)) ) iterations. [sent-91, score-0.252]
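To give a feel for the magnitudes involved, the following short computation evaluates the three γ-dependent bounds above for example values n = 10, m = 2, γ = 0.9 (the numbers are purely illustrative and are not taken from the paper):

from math import ceil, log

n, m, gamma = 10, 2, 0.9
h = 1.0 / (1.0 - gamma)                                          # 1/(1-gamma), about 10 here
bound_howard    = n * (m - 1) * ceil(h * log(h))                 # n(m-1) * ceil(1/(1-g) log 1/(1-g))
bound_simplex_a = n * (m - 1) * ceil(n * h * log(n * h))         # n(m-1) * ceil(n/(1-g) log n/(1-g))
bound_simplex_b = n * n * (m - 1) * (1 + 2 * h * log(h))         # n^2(m-1) (1 + 2/(1-g) log 1/(1-g))
print(bound_howard, bound_simplex_a, bound_simplex_b)            # roughly 240, 4610 and 4705 here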
45 Compared to Howard’s PI, our bound for Simplex-PI is a factor O(n) larger. [sent-92, score-0.059]
46 It is easy to see that the linear dependency of the bound for Howard’s PI with respect to n is optimal. [sent-95, score-0.09]
47 We conjecture that the linear dependency of both bounds with respect to m is also optimal. [sent-96, score-0.098]
48 The dependency with respect to the term 1/(1−γ) may be improved, but removing it is impossible for Howard’s PI and very unlikely for Simplex-PI. [sent-97, score-0.088]
49 [2] describes an MDP for which Howard’s PI requires an exponential (in n) number of iterations for γ = 1 and [5] argued that this holds also when γ is in the vicinity of 1. [sent-98, score-0.117]
50 Though a similar result does not seem to exist for Simplex-PI in the literature, [7] consider four variations of PI that all switch one action per iteration, and show through specifically designed MDPs that they may require an exponential (in n) number of iterations when γ = 1. [sent-99, score-0.251]
51 On this topic, [8] recently showed the following result for deterministic MDPs. [sent-101, score-0.083]
52 If the MDP is deterministic, then Simplex-PI terminates after at most O(n^5 m^2 log^2 n) iterations. [sent-103, score-0.176]
53 Given a policy π of a deterministic MDP, states are either on cycles or on paths induced by π. [sent-104, score-0.479]
54 The core of the proof relies on the following lemmas that altogether show that cycles are created regularly and that significant progress is made every time a new cycle appears; in other words, significant progress is made regularly. [sent-105, score-0.273]
55 If the MDP is deterministic, after at most nm ⌈ 2(n − 1) log n ⌉ iterations, either Simplex-PI finishes or a new cycle appears. [sent-107, score-0.13]
56 Indeed, these observations suffice to prove that Simplex-PI terminates after O(n^4 m^2 log(n/(1−γ))) = Õ(n^4 m^2) iterations. [sent-110, score-0.288]
57 Completely removing the dependency with respect to the discount factor γ (the term in O(log(1/(1−γ)))) requires careful extra work described in [8], which incurs an extra term of order O(n log n). [sent-111, score-0.184]
58 For any policy π and state i, we trivially have x_π(i) ∈ (1, n/(1−γ)). [sent-113, score-0.29]
59 The proof exploits the fact that x_π(i) belongs to the set (1, n) when i is on a path of π, while x_π(i) belongs to the set (1/(1−γ), n/(1−γ)) when i is on a cycle of π. [sent-114, score-0.107]
60 Given a policy π of a stochastic MDP, states are either in recurrent classes or transient classes (these two categories respectively generalize those of cycles and paths). [sent-116, score-0.826]
61 Let τ_t ≥ 1 and τ_r ≥ 1 be the smallest constants such that for all policies π and all states i, (1 ≤) x_π(i) ≤ τ_t if i is transient for π, and n/((1−γ)τ_r) ≤ x_π(i) ≤ n/(1−γ) if i is recurrent for π. [sent-119, score-0.578]
62 τ_t (resp. τ_r) can be seen as a measure of the time needed to leave transient states (resp. [sent-121, score-0.293]
63 the time needed to revisit states in recurrent classes). [sent-122, score-0.316]
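Whether a state is transient or recurrent under a fixed policy π is a property of the induced chain P_π alone: recurrent states are exactly those lying in a closed (bottom) strongly connected component of the transition graph. The sketch below classifies states this way; it is illustrative only (the function name, the use of SciPy, and the tolerance are assumptions made here, and it does not compute τ_t or τ_r, which depend on the quantity x_π defined in the paper):

import numpy as np
from scipy.sparse.csgraph import connected_components

def recurrent_and_transient(P_pi, tol=1e-12):
    # P_pi: (n, n) row-stochastic matrix of the chain induced by a fixed policy
    support = (P_pi > tol).astype(int)
    n_comp, labels = connected_components(support, directed=True, connection='strong')
    closed = np.ones(n_comp, dtype=bool)            # a component is closed if no edge leaves it
    for i, j in zip(*np.nonzero(support)):
        if labels[i] != labels[j]:
            closed[labels[i]] = False
    recurrent = np.where(closed[labels])[0]         # states in closed components are recurrent
    transient = np.where(~closed[labels])[0]        # all other states are transient
    return recurrent, transient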
64 If the MDP satisfies Assumption 1, after at most n [ (m − 1)⌈ n τ_t log(n τ_t) ⌉ + ⌈ n τ_t log(n² τ_t) ⌉ ] iterations, either Simplex-PI finishes or a new recurrent class appears. [sent-127, score-0.274]
65 If the MDP is aperiodic and irreducible, and thus admits a stationary distribution ν_π for any policy π, one can see that 1/τ_r = min_{π,i} ν_π(i). [sent-129, score-0.285]
66 If the MDP satisfies Assumption 1, when Simplex-PI moves from π to π′ where π′ involves a new recurrent class, we have 1^T(v_{π*} − v_{π′}) ≤ (1 − 1/τ_r) 1^T(v_{π*} − v_π). [sent-131, score-0.214]
67 From these generalized observations, we can deduce the following original result. [sent-132, score-0.069]
68 If the MDP satisfies Assumption 1, then Simplex-PI terminates after at most n²(m − 1) ( ⌈ τ_r log(n τ_r) ⌉ + ⌈ τ_r log(n τ_t) ⌉ ) [ (m − 1)⌈ n τ_t log(n τ_t) ⌉ + ⌈ n τ_t log(n² τ_t) ⌉ ] iterations. [sent-135, score-0.176]
69 This new result is a strict generalization of the result for deterministic MDPs. [sent-137, score-0.083]
70 Indeed, in the deterministic case, we have τ_t ≤ n and τ_r ≤ n, and it is easy to see that Lemmas 6, 7 and Theorem 7 respectively imply Lemmas 4, 5 and Theorem 6. [sent-138, score-0.083]
71 An immediate consequence of the above result is that Simplex-PI is strongly polynomial for sets of MDPs that are much larger than the deterministic MDPs mentioned in Theorem 6. [sent-139, score-0.25]
72 For any family of MDPs indexed by n and m such that τ_t and τ_r are polynomial functions of n and m, Simplex-PI terminates after a number of steps that is polynomial in n and m. [sent-141, score-0.272]
73 Unfortunately, and as quickly mentioned by [8], the line of analysis developed for Simplex-PI does not seem to adapt easily to Howard’s PI, because simultaneously switching several actions can interfere in a way that the policy improvement turns out to be small. [sent-144, score-0.437]
74 If the MDP satisfies Assumption 1, after at most nm ⌈ τ_t log(n τ_t) ⌉ iterations, either Howard’s PI finishes or a new recurrent class appears. [sent-151, score-0.265]
75 In a recent deterministic example due to [4] to show that Howard’s PI may require of the order of n² iterations, new cycles are created every single iteration, but the sequence of values satisfies (footnote 9), for all iterations k < n²/4 + n/4 and states i, v*(i) − v_{π_{k+1}}(i) ≥ [1 − (2/n)^k] (v*(i) − v_{π_k}(i)). [sent-154, score-0.405]
76 Contrary to Lemma 5, as k grows, the amount of contraction gets (exponentially) smaller and smaller. [sent-155, score-0.077]
77 Compared with Simplex-PI, this suggests that Howard’s PI may suffer from subtle specific pathologies. [sent-156, score-0.062]
78 In fact, the problem of determining the number of iterations required by Howard’s PI has been challenging for almost 30 years. [sent-157, score-0.085]
79 In the simplest (deterministic) case, the question is still open: the currently best known lower bound is the O(n²) bound by [4] we have just mentioned, while the best known upper bound is O(m^n / n) (valid for all MDPs) due to [6]. [sent-159, score-0.069]
80 Footnote 9: This MDP has an even number of states n = 2p. [sent-160, score-0.1]
81 The policies generated by Howard’s PI have values v_{π_k}(i) ∈ (p^{N−k−1}, p^{N−k}). [sent-163, score-0.096]
82 We deduce that for all iterations k and states i, (v*(i) − v_{π_{k+1}}(i)) / (v*(i) − v_{π_k}(i)) ≥ (1 + p^{−k−2}) / (1 + p^{−k}) = 1 − (p^{−k} − p^{−k−2}) / (1 + p^{−k}) ≥ 1 − p^{−k}(1 − p^{−2}) ≥ 1 − p^{−k}. [sent-164, score-0.254]
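A quick numerical check of the chain of inequalities in footnote 9 (purely illustrative; the value of p is an arbitrary choice made here):

p = 5                                   # arbitrary example; the construction uses n = 2p states
for k in range(20):
    ratio  = (1 + p ** -(k + 2)) / (1 + p ** -k)
    middle = 1 - (p ** -k - p ** -(k + 2)) / (1 + p ** -k)
    assert abs(ratio - middle) < 1e-12                      # the algebraic rewriting holds
    assert ratio >= 1 - p ** -k * (1 - p ** -2) >= 1 - p ** -k  # and so do the two lower bounds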
83 The state space X can be partitioned into two sets T and R such that for all policies π, the states of T are transient and those of R are recurrent. [sent-167, score-0.459]
84 For an MDP satisfying Assumptions 1-2, suppose Howard’s PI moves from π to π′ and that π′ involves a new recurrent class. [sent-170, score-0.238]
85 And we can deduce the following original bound (which also applies to Simplex-PI). [sent-172, score-0.092]
86 If the MDP satisfies Assumptions 1-2, then Howard’s PI terminates after at most n(m − 1) ( ⌈ τ_t log(n τ_t) ⌉ + ⌈ τ_r log(n τ_r) ⌉ ) iterations, while Simplex-PI terminates after at most n(m − 1) ( ⌈ n τ_t log(n τ_t) ⌉ + ⌈ τ_r log(n τ_r) ⌉ ) iterations. [sent-175, score-0.656]
87 It implies that the algorithms converge on the recurrent states independently of the transient states, and thus the analysis can be decomposed into two phases: 1) the convergence on recurrent states and then 2) the convergence on transient states (given that recurrent states do not change anymore). [sent-177, score-1.353]
88 Furthermore, the analysis of the second phase (convergence on transient states) is similar to that of the discounted case of Theorems 3 and 4. [sent-179, score-0.238]
89 Since v_{π*} − v_{π_k} is non-negative, we can take the max norm and get: ‖v_{π*} − v_{π_k}‖∞ ≤ γ ‖v_{π*} − v_{π_{k−1}}‖∞. [sent-182, score-0.057]
90 B Contraction property for Simplex-PI (Proof of Lemma 3) By using the fact that {v_π = T_π v_π ⇔ v_π = (I − γ P_π)^{-1} r_π}, we have that for all pairs of policies π and π′. [sent-183, score-0.096]
91 (1) On the one hand, by using this lemma and the fact that {T_{π_{k+1}} v_{π_k} − v_{π_k} ≥ 0}, we have for any k: v_{π_{k+1}} − v_{π_k} = (I − γ P_{π_{k+1}})^{-1} (T_{π_{k+1}} v_{π_k} − v_{π_k}) ≥ T_{π_{k+1}} v_{π_k} − v_{π_k}, which implies that 1^T(v_{π_{k+1}} − v_{π_k}) ≥ 1^T(T_{π_{k+1}} v_{π_k} − v_{π_k}). [sent-185, score-0.085]
92 From Equation (4), we deduce that for all k, v*(s_0) − [T_{π_k} v*](s_0) ≤ ‖v* − T_{π_k} v*‖∞ ≤ γ^k ‖(I − γ P_{π_0})^{-1} (v* − T_{π_0} v*)‖∞ ≤ (γ^k/(1−γ)) ‖v* − T_{π_0} v*‖∞ = (γ^k/(1−γ)) (v*(s_0) − [T_{π_0} v*](s_0)). [sent-191, score-0.069]
93 As a consequence, the action π_k(s_0) must be different from π_0(s_0) when γ^k/(1−γ) < 1, that is, for all values of k satisfying k ≥ k* = ⌈ log(1/(1−γ)) / (1−γ) ⌉ > log(1/(1−γ)) / log(1/γ). [sent-192, score-0.293]
94 In other words, if some policy π is not optimal, then one of its non-optimal actions will be eliminated for good after at most k* iterations. [sent-193, score-0.326]
95 By repeating this argument, one can eliminate all non-optimal actions (there are at most n(m − 1) of them), and the result follows. [sent-194, score-0.059]
96 Strategy iteration is strongly polynomial for 2-player turn-based stochastic games with a constant discount factor. [sent-214, score-0.197]
97 The complexity of policy iteration is exponential for discounted Markov decision processes. [sent-228, score-0.392]
98 On the complexity of the policy improvement algorithm for Markov decision processes. [sent-239, score-0.309]
99 The simplex method is strongly polynomial for deterministic Markov decision processes. [sent-244, score-0.27]
100 The simplex and policy-iteration methods are strongly polynomial for the Markov decision problem with a fixed discount rate. [sent-257, score-0.268]
wordName wordTfidf (topN-words)
[('howard', 0.666), ('pi', 0.359), ('policy', 0.24), ('mdp', 0.221), ('transient', 0.193), ('recurrent', 0.189), ('terminates', 0.176), ('states', 0.1), ('vmax', 0.098), ('policies', 0.096), ('mentionned', 0.089), ('lemma', 0.085), ('iterations', 0.085), ('deterministic', 0.083), ('discount', 0.081), ('nm', 0.077), ('contraction', 0.077), ('log', 0.076), ('yk', 0.07), ('deduce', 0.069), ('lemmas', 0.069), ('mdps', 0.066), ('switch', 0.062), ('actions', 0.059), ('nishes', 0.059), ('coe', 0.059), ('ik', 0.058), ('cycles', 0.056), ('cycle', 0.054), ('contracting', 0.054), ('proof', 0.053), ('maxs', 0.051), ('state', 0.05), ('erent', 0.048), ('polynomial', 0.048), ('discounted', 0.045), ('nvmax', 0.044), ('decision', 0.043), ('action', 0.043), ('dependency', 0.041), ('lim', 0.04), ('simplex', 0.04), ('pij', 0.039), ('iteration', 0.038), ('factor', 0.036), ('su', 0.036), ('vicinity', 0.032), ('variations', 0.032), ('bounds', 0.031), ('hansen', 0.031), ('maximal', 0.03), ('strongly', 0.03), ('ak', 0.029), ('seem', 0.029), ('max', 0.029), ('pn', 0.028), ('non', 0.028), ('satis', 0.027), ('stationary', 0.027), ('revisit', 0.027), ('eliminated', 0.027), ('markov', 0.026), ('respect', 0.026), ('di', 0.026), ('equation', 0.025), ('xk', 0.025), ('theorem', 0.025), ('greedy', 0.025), ('moves', 0.025), ('assumption', 0.025), ('created', 0.024), ('satisfying', 0.024), ('classes', 0.024), ('bellman', 0.023), ('bound', 0.023), ('france', 0.023), ('appendix', 0.022), ('facts', 0.022), ('structural', 0.021), ('unlikely', 0.021), ('partitioned', 0.02), ('isaac', 0.02), ('lorraine', 0.02), ('interfere', 0.02), ('zeitschrift', 0.02), ('sequence', 0.019), ('es', 0.019), ('optimal', 0.019), ('mn', 0.019), ('transition', 0.019), ('control', 0.018), ('icalp', 0.018), ('manage', 0.018), ('aperiodic', 0.018), ('tor', 0.018), ('operator', 0.017), ('open', 0.017), ('material', 0.017), ('regularly', 0.017), ('scherrer', 0.017)]
simIndex simValue paperId paperTitle
same-paper 1 0.99999982 140 nips-2013-Improved and Generalized Upper Bounds on the Complexity of Policy Iteration
Author: Bruno Scherrer
Abstract: Given a Markov Decision Process (MDP) with n states and m actions per state, we study the number of iterations needed by Policy Iteration (PI) algorithms to converge to the optimal γ-discounted policy. We consider two variations of PI: Howard’s PI, which changes the actions in all states with a positive advantage, and Simplex-PI, which only changes the action in the state with maximal advantage. We show that Howard’s PI terminates after at most n(m − 1)⌈ 1/(1−γ) log(1/(1−γ)) ⌉ = O( (nm/(1−γ)) log(1/(1−γ)) ) iterations, improving by a factor O(log n) a result by [3], while Simplex-PI terminates after at most n²(m − 1)( 1 + (2/(1−γ)) log(1/(1−γ)) ) = O( (n²m/(1−γ)) log(1/(1−γ)) ) iterations, improving by a factor O(log n) a result by [11]. Under some structural assumptions of the MDP, we then consider bounds that are independent of the discount factor γ: given a measure of the maximal transient time τ_t and the maximal time τ_r to revisit states in recurrent classes under all policies, we show that Simplex-PI terminates after at most n²(m − 1)( ⌈ τ_r log(n τ_r) ⌉ + ⌈ τ_r log(n τ_t) ⌉ ) [ (m − 1)⌈ n τ_t log(n τ_t) ⌉ + ⌈ n τ_t log(n² τ_t) ⌉ ] iterations. This generalizes a recent result for deterministic MDPs by [8], in which τ_t ≤ n and τ_r ≤ n.
2 0.22266586 28 nips-2013-Adaptive Step-Size for Policy Gradient Methods
Author: Matteo Pirotta, Marcello Restelli, Luca Bascetta
Abstract: In the last decade, policy gradient methods have significantly grown in popularity in the reinforcement–learning field. In particular, they have been largely employed in motor control and robotic applications, thanks to their ability to cope with continuous state and action domains and partial observable problems. Policy gradient researches have been mainly focused on the identification of effective gradient directions and the proposal of efficient estimation algorithms. Nonetheless, the performance of policy gradient methods is determined not only by the gradient direction, since convergence properties are strongly influenced by the choice of the step size: small values imply slow convergence rate, while large values may lead to oscillations or even divergence of the policy parameters. Step–size value is usually chosen by hand tuning and still little attention has been paid to its automatic selection. In this paper, we propose to determine the learning rate by maximizing a lower bound to the expected performance gain. Focusing on Gaussian policies, we derive a lower bound that is second–order polynomial of the step size, and we show how a simplified version of such lower bound can be maximized when the gradient is estimated from trajectory samples. The properties of the proposed approach are empirically evaluated in a linear–quadratic regulator problem. 1
3 0.21611799 239 nips-2013-Optimistic policy iteration and natural actor-critic: A unifying view and a non-optimality result
Author: Paul Wagner
Abstract: Approximate dynamic programming approaches to the reinforcement learning problem are often categorized into greedy value function methods and value-based policy gradient methods. As our first main result, we show that an important subset of the latter methodology is, in fact, a limiting special case of a general formulation of the former methodology; optimistic policy iteration encompasses not only most of the greedy value function methods but also natural actor-critic methods, and permits one to directly interpolate between them. The resulting continuum adjusts the strength of the Markov assumption in policy improvement and, as such, can be seen as dual in spirit to the continuum in TD(λ)-style algorithms in policy evaluation. As our second main result, we show for a substantial subset of softgreedy value function approaches that, while having the potential to avoid policy oscillation and policy chattering, this subset can never converge toward an optimal policy, except in a certain pathological case. Consequently, in the context of approximations (either in state estimation or in value function representation), the majority of greedy value function methods seem to be deemed to suffer either from the risk of oscillation/chattering or from the presence of systematic sub-optimality. 1
4 0.16832037 273 nips-2013-Reinforcement Learning in Robust Markov Decision Processes
Author: Shiau Hong Lim, Huan Xu, Shie Mannor
Abstract: An important challenge in Markov decision processes is to ensure robustness with respect to unexpected or adversarial system behavior while taking advantage of well-behaving parts of the system. We consider a problem setting where some unknown parts of the state space can have arbitrary transitions while other parts are purely stochastic. We devise an algorithm that is adaptive to potentially adversarial behavior and show that it achieves similar regret bounds as the purely stochastic case. 1
Author: Yasin Abbasi, Peter Bartlett, Varun Kanade, Yevgeny Seldin, Csaba Szepesvari
Abstract: We study the problem of online learning Markov Decision Processes (MDPs) when both the transition distributions and loss functions are chosen by an adversary. We present an algorithm that, under a mixing assumption, achieves O(√(T log |Π|) + log |Π|) regret with respect to a comparison set of policies Π. The regret is independent of the size of the state and action spaces. When expectations over sample paths can be computed efficiently and the comparison set Π has polynomial size, this algorithm is efficient. We also consider the episodic adversarial online shortest path problem. Here, in each episode an adversary may choose a weighted directed acyclic graph with an identified start and finish node. The goal of the learning algorithm is to choose a path that minimizes the loss while traversing from the start to finish node. At the end of each episode the loss function (given by weights on the edges) is revealed to the learning algorithm. The goal is to minimize regret with respect to a fixed policy for selecting paths. This problem is a special case of the online MDP problem. It was shown that for randomly chosen graphs and adversarial losses, the problem can be efficiently solved. We show that it also can be efficiently solved for adversarial graphs and randomly chosen losses. When both graphs and losses are adversarially chosen, we show that designing efficient algorithms for the adversarial online shortest path problem (and hence for the adversarial MDP problem) is as hard as learning parity with noise, a notoriously difficult problem that has been used to design efficient cryptographic schemes. Finally, we present an efficient algorithm whose regret scales linearly with the number of distinct graphs. 1
6 0.15875049 348 nips-2013-Variational Policy Search via Trajectory Optimization
7 0.15273553 347 nips-2013-Variational Planning for Graph-based MDPs
8 0.14566089 24 nips-2013-Actor-Critic Algorithms for Risk-Sensitive MDPs
9 0.13952589 32 nips-2013-Aggregating Optimistic Planning Trees for Solving Markov Decision Processes
10 0.13744472 241 nips-2013-Optimizing Instructional Policies
11 0.13731724 322 nips-2013-Symbolic Opportunistic Policy Iteration for Factored-Action MDPs
12 0.12801975 23 nips-2013-Active Learning for Probabilistic Hypotheses Using the Maximum Gibbs Error Criterion
13 0.12373927 257 nips-2013-Projected Natural Actor-Critic
14 0.12363046 79 nips-2013-DESPOT: Online POMDP Planning with Regularization
15 0.12312374 1 nips-2013-(More) Efficient Reinforcement Learning via Posterior Sampling
16 0.12119008 50 nips-2013-Bayesian Mixture Modelling and Inference based Thompson Sampling in Monte-Carlo Tree Search
17 0.11502854 150 nips-2013-Learning Adaptive Value of Information for Structured Prediction
18 0.11250609 29 nips-2013-Adaptive Submodular Maximization in Bandit Setting
19 0.10583612 235 nips-2013-Online learning in episodic Markovian decision processes by relative entropy policy search
20 0.09758848 55 nips-2013-Bellman Error Based Feature Generation using Random Projections on Sparse Spaces
topicId topicWeight
[(0, 0.179), (1, -0.251), (2, -0.099), (3, 0.128), (4, -0.013), (5, 0.02), (6, -0.066), (7, 0.019), (8, -0.008), (9, 0.012), (10, 0.104), (11, -0.027), (12, 0.023), (13, 0.036), (14, 0.076), (15, -0.032), (16, 0.027), (17, -0.015), (18, -0.029), (19, 0.04), (20, 0.011), (21, 0.007), (22, -0.046), (23, 0.002), (24, -0.042), (25, -0.009), (26, 0.018), (27, -0.032), (28, -0.071), (29, -0.056), (30, -0.051), (31, -0.029), (32, -0.029), (33, 0.06), (34, -0.035), (35, 0.005), (36, -0.032), (37, 0.084), (38, 0.003), (39, -0.057), (40, 0.01), (41, 0.055), (42, -0.107), (43, -0.012), (44, -0.051), (45, -0.018), (46, 0.09), (47, 0.032), (48, -0.124), (49, -0.024)]
simIndex simValue paperId paperTitle
same-paper 1 0.95630962 140 nips-2013-Improved and Generalized Upper Bounds on the Complexity of Policy Iteration
Author: Bruno Scherrer
Abstract: Given a Markov Decision Process (MDP) with n states and m actions per state, we study the number of iterations needed by Policy Iteration (PI) algorithms to converge to the optimal γ-discounted policy. We consider two variations of PI: Howard’s PI, which changes the actions in all states with a positive advantage, and Simplex-PI, which only changes the action in the state with maximal advantage. We show that Howard’s PI terminates after at most n(m − 1)⌈ 1/(1−γ) log(1/(1−γ)) ⌉ = O( (nm/(1−γ)) log(1/(1−γ)) ) iterations, improving by a factor O(log n) a result by [3], while Simplex-PI terminates after at most n²(m − 1)( 1 + (2/(1−γ)) log(1/(1−γ)) ) = O( (n²m/(1−γ)) log(1/(1−γ)) ) iterations, improving by a factor O(log n) a result by [11]. Under some structural assumptions of the MDP, we then consider bounds that are independent of the discount factor γ: given a measure of the maximal transient time τ_t and the maximal time τ_r to revisit states in recurrent classes under all policies, we show that Simplex-PI terminates after at most n²(m − 1)( ⌈ τ_r log(n τ_r) ⌉ + ⌈ τ_r log(n τ_t) ⌉ ) [ (m − 1)⌈ n τ_t log(n τ_t) ⌉ + ⌈ n τ_t log(n² τ_t) ⌉ ] iterations. This generalizes a recent result for deterministic MDPs by [8], in which τ_t ≤ n and τ_r ≤ n.
2 0.78778511 322 nips-2013-Symbolic Opportunistic Policy Iteration for Factored-Action MDPs
Author: Aswin Raghavan, Roni Khardon, Alan Fern, Prasad Tadepalli
Abstract: This paper addresses the scalability of symbolic planning under uncertainty with factored states and actions. Our first contribution is a symbolic implementation of Modified Policy Iteration (MPI) for factored actions that views policy evaluation as policy-constrained value iteration (VI). Unfortunately, a naïve approach to enforce policy constraints can lead to large memory requirements, sometimes making symbolic MPI worse than VI. We address this through our second and main contribution, symbolic Opportunistic Policy Iteration (OPI), which is a novel convergent algorithm lying between VI and MPI, that applies policy constraints if it does not increase the size of the value function representation, and otherwise performs VI backups. We also give a memory bounded version of this algorithm allowing a space-time tradeoff. Empirical results show significantly improved scalability over state-of-the-art symbolic planners. 1
3 0.74708861 28 nips-2013-Adaptive Step-Size for Policy Gradient Methods
Author: Matteo Pirotta, Marcello Restelli, Luca Bascetta
Abstract: In the last decade, policy gradient methods have significantly grown in popularity in the reinforcement–learning field. In particular, they have been largely employed in motor control and robotic applications, thanks to their ability to cope with continuous state and action domains and partial observable problems. Policy gradient researches have been mainly focused on the identification of effective gradient directions and the proposal of efficient estimation algorithms. Nonetheless, the performance of policy gradient methods is determined not only by the gradient direction, since convergence properties are strongly influenced by the choice of the step size: small values imply slow convergence rate, while large values may lead to oscillations or even divergence of the policy parameters. Step–size value is usually chosen by hand tuning and still little attention has been paid to its automatic selection. In this paper, we propose to determine the learning rate by maximizing a lower bound to the expected performance gain. Focusing on Gaussian policies, we derive a lower bound that is second–order polynomial of the step size, and we show how a simplified version of such lower bound can be maximized when the gradient is estimated from trajectory samples. The properties of the proposed approach are empirically evaluated in a linear–quadratic regulator problem. 1
4 0.72897267 347 nips-2013-Variational Planning for Graph-based MDPs
Author: Qiang Cheng, Qiang Liu, Feng Chen, Alex Ihler
Abstract: Markov Decision Processes (MDPs) are extremely useful for modeling and solving sequential decision making problems. Graph-based MDPs provide a compact representation for MDPs with large numbers of random variables. However, the complexity of exactly solving a graph-based MDP usually grows exponentially in the number of variables, which limits their application. We present a new variational framework to describe and solve the planning problem of MDPs, and derive both exact and approximate planning algorithms. In particular, by exploiting the graph structure of graph-based MDPs, we propose a factored variational value iteration algorithm in which the value function is first approximated by the multiplication of local-scope value functions, then solved by minimizing a Kullback-Leibler (KL) divergence. The KL divergence is optimized using the belief propagation algorithm, with complexity exponential in only the cluster size of the graph. Experimental comparison on different models shows that our algorithm outperforms existing approximation algorithms at finding good policies. 1
5 0.70888382 239 nips-2013-Optimistic policy iteration and natural actor-critic: A unifying view and a non-optimality result
Author: Paul Wagner
Abstract: Approximate dynamic programming approaches to the reinforcement learning problem are often categorized into greedy value function methods and value-based policy gradient methods. As our first main result, we show that an important subset of the latter methodology is, in fact, a limiting special case of a general formulation of the former methodology; optimistic policy iteration encompasses not only most of the greedy value function methods but also natural actor-critic methods, and permits one to directly interpolate between them. The resulting continuum adjusts the strength of the Markov assumption in policy improvement and, as such, can be seen as dual in spirit to the continuum in TD(λ)-style algorithms in policy evaluation. As our second main result, we show for a substantial subset of softgreedy value function approaches that, while having the potential to avoid policy oscillation and policy chattering, this subset can never converge toward an optimal policy, except in a certain pathological case. Consequently, in the context of approximations (either in state estimation or in value function representation), the majority of greedy value function methods seem to be deemed to suffer either from the risk of oscillation/chattering or from the presence of systematic sub-optimality. 1
6 0.70089304 273 nips-2013-Reinforcement Learning in Robust Markov Decision Processes
7 0.66754639 38 nips-2013-Approximate Dynamic Programming Finally Performs Well in the Game of Tetris
8 0.63235921 79 nips-2013-DESPOT: Online POMDP Planning with Regularization
9 0.63163269 241 nips-2013-Optimizing Instructional Policies
10 0.62679023 257 nips-2013-Projected Natural Actor-Critic
11 0.62378377 55 nips-2013-Bellman Error Based Feature Generation using Random Projections on Sparse Spaces
12 0.62217331 270 nips-2013-Regret based Robust Solutions for Uncertain Markov Decision Processes
13 0.61574954 32 nips-2013-Aggregating Optimistic Planning Trees for Solving Markov Decision Processes
14 0.60806245 24 nips-2013-Actor-Critic Algorithms for Risk-Sensitive MDPs
15 0.59643066 348 nips-2013-Variational Policy Search via Trajectory Optimization
16 0.5739131 29 nips-2013-Adaptive Submodular Maximization in Bandit Setting
17 0.56169164 165 nips-2013-Learning from Limited Demonstrations
18 0.55477607 1 nips-2013-(More) Efficient Reinforcement Learning via Posterior Sampling
20 0.52954847 50 nips-2013-Bayesian Mixture Modelling and Inference based Thompson Sampling in Monte-Carlo Tree Search
topicId topicWeight
[(16, 0.015), (21, 0.012), (33, 0.101), (34, 0.103), (41, 0.02), (49, 0.012), (56, 0.15), (70, 0.026), (85, 0.396), (89, 0.025), (93, 0.018), (95, 0.017)]
simIndex simValue paperId paperTitle
1 0.93332088 272 nips-2013-Regularized Spectral Clustering under the Degree-Corrected Stochastic Blockmodel
Author: Tai Qin, Karl Rohe
Abstract: Spectral clustering is a fast and popular algorithm for finding clusters in networks. Recently, Chaudhuri et al. [1] and Amini et al. [2] proposed inspired variations on the algorithm that artificially inflate the node degrees for improved statistical performance. The current paper extends the previous statistical estimation results to the more canonical spectral clustering algorithm in a way that removes any assumption on the minimum degree and provides guidance on the choice of the tuning parameter. Moreover, our results show how the “star shape” in the eigenvectors–a common feature of empirical networks–can be explained by the Degree-Corrected Stochastic Blockmodel and the Extended Planted Partition model, two statistical models that allow for highly heterogeneous degrees. Throughout, the paper characterizes and justifies several of the variations of the spectral clustering algorithm in terms of these models. 1
2 0.91290325 7 nips-2013-A Gang of Bandits
Author: Nicolò Cesa-Bianchi, Claudio Gentile, Giovanni Zappella
Abstract: Multi-armed bandit problems formalize the exploration-exploitation trade-offs arising in several industrially relevant applications, such as online advertisement and, more generally, recommendation systems. In many cases, however, these applications have a strong social component, whose integration in the bandit algorithm could lead to a dramatic performance increase. For instance, content may be served to a group of users by taking advantage of an underlying network of social relationships among them. In this paper, we introduce novel algorithmic approaches to the solution of such networked bandit problems. More specifically, we design and analyze a global recommendation strategy which allocates a bandit algorithm to each network node (user) and allows it to “share” signals (contexts and payoffs) with the neighboring nodes. We then derive two more scalable variants of this strategy based on different ways of clustering the graph nodes. We experimentally compare the algorithm and its variants to state-of-the-art methods for contextual bandits that do not use the relational information. Our experiments, carried out on synthetic and real-world datasets, show a consistent increase in prediction performance obtained by exploiting the network structure. 1
3 0.85387135 332 nips-2013-Tracking Time-varying Graphical Structure
Author: Erich Kummerfeld, David Danks
Abstract: Structure learning algorithms for graphical models have focused almost exclusively on stable environments in which the underlying generative process does not change; that is, they assume that the generating model is globally stationary. In real-world environments, however, such changes often occur without warning or signal. Real-world data often come from generating models that are only locally stationary. In this paper, we present LoSST, a novel, heuristic structure learning algorithm that tracks changes in graphical model structure or parameters in a dynamic, real-time manner. We show by simulation that the algorithm performs comparably to batch-mode learning when the generating graphical structure is globally stationary, and significantly better when it is only locally stationary. 1
same-paper 4 0.85079205 140 nips-2013-Improved and Generalized Upper Bounds on the Complexity of Policy Iteration
Author: Bruno Scherrer
Abstract: Given a Markov Decision Process (MDP) with n states and m actions per state, we study the number of iterations needed by Policy Iteration (PI) algorithms to converge to the optimal γ-discounted policy. We consider two variations of PI: Howard’s PI, which changes the actions in all states with a positive advantage, and Simplex-PI, which only changes the action in the state with maximal advantage. We show that Howard’s PI terminates after at most n(m − 1)⌈ 1/(1−γ) log(1/(1−γ)) ⌉ = O( (nm/(1−γ)) log(1/(1−γ)) ) iterations, improving by a factor O(log n) a result by [3], while Simplex-PI terminates after at most n²(m − 1)( 1 + (2/(1−γ)) log(1/(1−γ)) ) = O( (n²m/(1−γ)) log(1/(1−γ)) ) iterations, improving by a factor O(log n) a result by [11]. Under some structural assumptions of the MDP, we then consider bounds that are independent of the discount factor γ: given a measure of the maximal transient time τ_t and the maximal time τ_r to revisit states in recurrent classes under all policies, we show that Simplex-PI terminates after at most n²(m − 1)( ⌈ τ_r log(n τ_r) ⌉ + ⌈ τ_r log(n τ_t) ⌉ ) [ (m − 1)⌈ n τ_t log(n τ_t) ⌉ + ⌈ n τ_t log(n² τ_t) ⌉ ] iterations. This generalizes a recent result for deterministic MDPs by [8], in which τ_t ≤ n and τ_r ≤ n.
5 0.84451735 168 nips-2013-Learning to Pass Expectation Propagation Messages
Author: Nicolas Heess, Daniel Tarlow, John Winn
Abstract: Expectation Propagation (EP) is a popular approximate posterior inference algorithm that often provides a fast and accurate alternative to sampling-based methods. However, while the EP framework in theory allows for complex non-Gaussian factors, there is still a significant practical barrier to using them within EP, because doing so requires the implementation of message update operators, which can be difficult and require hand-crafted approximations. In this work, we study the question of whether it is possible to automatically derive fast and accurate EP updates by learning a discriminative model (e.g., a neural network or random forest) to map EP message inputs to EP message outputs. We address the practical concerns that arise in the process, and we provide empirical analysis on several challenging and diverse factors, indicating that there is a space of factors where this approach appears promising. 1
6 0.74813455 89 nips-2013-Dimension-Free Exponentiated Gradient
7 0.71952587 206 nips-2013-Near-Optimal Entrywise Sampling for Data Matrices
8 0.70724803 196 nips-2013-Modeling Overlapping Communities with Node Popularities
9 0.69040555 232 nips-2013-Online PCA for Contaminated Data
10 0.67725182 104 nips-2013-Efficient Online Inference for Bayesian Nonparametric Relational Models
11 0.67356014 125 nips-2013-From Bandits to Experts: A Tale of Domination and Independence
12 0.66432017 289 nips-2013-Scalable kernels for graphs with continuous attributes
13 0.66220289 285 nips-2013-Robust Transfer Principal Component Analysis with Rank Constraints
14 0.64329022 29 nips-2013-Adaptive Submodular Maximization in Bandit Setting
15 0.63419592 2 nips-2013-(Nearly) Optimal Algorithms for Private Online Learning in Full-information and Bandit Settings
16 0.63078082 282 nips-2013-Robust Multimodal Graph Matching: Sparse Coding Meets Graph Matching
17 0.62739593 25 nips-2013-Adaptive Anonymity via $b$-Matching
19 0.61609161 316 nips-2013-Stochastic blockmodel approximation of a graphon: Theory and consistent estimation
20 0.61192471 79 nips-2013-DESPOT: Online POMDP Planning with Regularization