nips nips2013 nips2013-140 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Bruno Scherrer
Abstract: Given a Markov Decision Process (MDP) with n states and m actions per state, we study the number of iterations needed by Policy Iteration (PI) algorithms to converge to the optimal γ-discounted policy. We consider two variations of PI: Howard’s PI, which changes the actions in all states with a positive advantage, and Simplex-PI, which only changes the action in the state with maximal advantage. We show that Howard’s PI terminates after at most n(m − 1)⌈ 1/(1−γ) log(1/(1−γ)) ⌉ = O( (nm/(1−γ)) log(1/(1−γ)) ) iterations, improving by a factor O(log n) a result by [3], while Simplex-PI terminates after at most n²(m − 1)( 1 + (2/(1−γ)) log(1/(1−γ)) ) = O( (n²m/(1−γ)) log(1/(1−γ)) ) iterations, improving by a factor O(log n) a result by [11]. Under some structural assumptions of the MDP, we then consider bounds that are independent of the discount factor γ: given a measure of the maximal transient time τ_t and the maximal time τ_r to revisit states in recurrent classes under all policies, we show that Simplex-PI terminates after at most n²(m − 1)( ⌈ τ_r log(n τ_r) ⌉ + ⌈ τ_r log(n τ_t) ⌉ ) [ (m − 1)⌈ n τ_t log(n τ_t) ⌉ + ⌈ n τ_t log(n² τ_t) ⌉ ] iterations. This generalizes a recent result for deterministic MDPs by [8], in which τ_t ≤ n and τ_r ≤ n.
Reference: text
sentIndex sentText sentNum sentScore
1 Abstract Given a Markov Decision Process (MDP) with n states and m actions per state, we study the number of iterations needed by Policy Iteration (PI) algorithms to converge to the optimal γ-discounted policy. [sent-3, score-0.282]
2 We consider two variations of PI: Howard’s PI that changes the actions in all states with a positive advantage, and Simplex-PI that only changes the action in the state with maximal advantage. [sent-4, score-0.314]
3 This generalizes a recent result for deterministic MDPs by [8], in which τ_t ≤ n and τ_r ≤ n. [sent-8, score-0.083]
4 1 Introduction We consider a discrete-time dynamic system whose state transition depends on a control. [sent-11, score-0.069]
5 The control a ∈ A specifies the transition probability p_ij(a) = P(i_{t+1} = j | i_t = i, a_t = a) to the next state j. [sent-16, score-0.126]
6 In this context, we look for a stationary deterministic policy (a function π : X → A that maps states into controls) [footnote 1: in the works of [11, 8, 3] that we reference, the integer “m” denotes the total number of actions, that is nm with our notation] [sent-18, score-0.427]
7 that maximizes the expected discounted sum of rewards from any state i, called the value of policy π at state i: v_π(i) := E[ Σ_{k=0}^{∞} γ^k r(i_k, a_k, i_{k+1}) | i_0 = i, ∀k ≥ 0, a_k = π(i_k), i_{k+1} ∼ P(·|i_k, a_k) ], where γ ∈ (0, 1) is a discount factor. [sent-20, score-0.653]
8 The optimal value starting from state i is defined as v*(i) := max_π v_π(i). [sent-22, score-0.098]
9 For any policy π, we write P_π for the n × n stochastic matrix whose elements are p_ij(π(i)) and r_π the vector whose components are Σ_j p_ij(π(i)) r(i, π(i), j). [sent-23, score-0.318]
10 It is also well known that v* satisfies the following Bellman equation: v* = max_π (r_π + γ P_π v*) = max_π T_π v*, where the max operator is componentwise. [sent-26, score-0.075]
11 For any value vector v, we say that a policy π is greedy with respect to the value v if it satisfies: π ∈ argmax_{π′} T_{π′} v, or equivalently T_π v = T v. [sent-28, score-0.32]
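The following Python sketch (not code from the paper; the array names, the (m, n, n) tensor layout for transitions and rewards, and the use of NumPy are assumptions made here for illustration) shows how the quantities just defined can be computed for a small finite MDP: the value v_π = (I − γ P_π)^{-1} r_π of a fixed policy, and a greedy policy in G(v) for a given value vector v.

import numpy as np

def evaluate_policy(P, R, policy, gamma):
    # v_pi solves v_pi = r_pi + gamma * P_pi v_pi, i.e. v_pi = (I - gamma P_pi)^{-1} r_pi
    n = P.shape[1]
    P_pi = P[policy, np.arange(n), :]                         # row i is p_{i,.}(pi(i))
    r_pi = np.sum(P_pi * R[policy, np.arange(n), :], axis=1)  # r_pi(i) = sum_j p_ij(pi(i)) r(i, pi(i), j)
    return np.linalg.solve(np.eye(n) - gamma * P_pi, r_pi)

def greedy_policy(P, R, v, gamma):
    # a policy in G(v): in each state, pick an action maximizing (T_a v)(i)
    q = np.einsum('aij,aij->ai', P, R) + gamma * np.einsum('aij,j->ai', P, v)  # q[a, i] = (T_a v)(i)
    return np.argmax(q, axis=0)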
12 With some slight abuse of notation, we write G(v) for any policy that is greedy with respect to v. [sent-29, score-0.291]
13 The notions of optimal value function and greedy policies are fundamental to optimal control because of the following property: any policy π* that is greedy with respect to the optimal value v* is an optimal policy and its value v_{π*} is equal to v*. [sent-30, score-0.746]
14 We call the set of switchable states of π the following set S_π = {i : a_π(i) > 0}. [sent-33, score-0.1]
15 For any non-empty subset Y of S_π, we denote switch(π, Y) a policy satisfying: ∀i, switch(π, Y)(i) = G(v_π)(i) if i ∈ Y, and switch(π, Y)(i) = π(i) if i ∉ Y. [sent-35, score-0.24]
16 This lemma is the foundation of the well-known iterative procedure, called Policy Iteration (PI), that generates a sequence of policies (π_k) as follows. [sent-40, score-0.2]
17 The choice for the subsets Y_k leads to different variations of PI. [sent-42, score-0.106]
18 In this paper we will focus on two specific variations. Footnote 2: Restricting our attention to stationary deterministic policies is not a limitation. [sent-43, score-0.206]
19 Indeed, for the optimality criterion to be defined soon, it can be shown that there exists at least one stationary deterministic policy that is optimal [9]. [sent-44, score-0.369]
20 • When for all iterations k, Y_k = S_{π_k}, that is, one switches the actions in all states with positive advantage with respect to π_k, the above algorithm is known as Howard’s PI; it can then be seen that π_{k+1} ∈ G(v_{π_k}). [sent-45, score-0.27]
21 • When for all k, Y_k is a singleton containing a state i_k ∈ argmax_i a_{π_k}(i), that is, if we only switch one action in the state with maximal advantage with respect to π_k, we will call it Simplex-PI. [sent-46, score-0.319]
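The two variations differ only in the set Y_k of states whose action is switched at iteration k. A minimal Python sketch of both switching rules follows (illustrative only, not code from the paper; the (m, n, n) tensor layout for P and R and the numerical tolerance are assumptions made here):

import numpy as np

def policy_iteration(P, R, gamma, policy, rule="howard", max_iter=10000, tol=1e-12):
    # P, R: arrays of shape (m, n, n); policy: integer array of shape (n,)
    n = P.shape[1]
    policy = np.array(policy, dtype=int)
    for it in range(max_iter):
        P_pi = P[policy, np.arange(n), :]
        r_pi = np.sum(P_pi * R[policy, np.arange(n), :], axis=1)
        v_pi = np.linalg.solve(np.eye(n) - gamma * P_pi, r_pi)   # value of the current policy
        q = np.einsum('aij,aij->ai', P, R) + gamma * (P @ v_pi)  # q[a, i] = (T_a v_pi)(i)
        adv = q.max(axis=0) - v_pi                               # advantage a_pi(i) in each state
        if np.all(adv <= tol):
            return policy, it                                    # pi is greedy w.r.t. its own value: optimal
        if rule == "howard":
            Y = np.where(adv > tol)[0]                           # S_pi: all states with positive advantage
        else:                                                    # "simplex": one state of maximal advantage
            Y = np.array([int(np.argmax(adv))])
        policy = policy.copy()
        policy[Y] = q[:, Y].argmax(axis=0)                       # switch(pi, Y)
    return policy, max_iter

For instance, policy_iteration(P, R, 0.9, np.zeros(P.shape[1], dtype=int), rule="simplex") would run Simplex-PI from the all-zeros policy.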
22 Since it generates a sequence of policies with increasing values, any variation of PI converges to the optimal policy in a number of iterations that is smaller than the total number of policies m^n. [sent-47, score-0.574]
23 The aim of this paper is to discuss existing and provide new upper bounds on the number of iterations required by Howard’s PI and Simplex-PI that are much sharper than m^n. [sent-50, score-0.135]
24 In the next sections, we describe some known results—see [11] for a recent and comprehensive review—about the number of iterations required by Howard’s PI and Simplex-PI, along with some of our original improvements and extensions. [sent-51, score-0.085]
25 2 Bounds with respect to a Fixed Discount Factor γ < 1. A key observation for both algorithms, which will be central to the results we are about to discuss, is that the sequence they generate satisfies some contraction property (footnote 5). [sent-52, score-0.122]
26 The sequence (‖v* − v_{π_k}‖∞)_{k≥0} built by Howard’s PI is contracting with coefficient γ. [sent-56, score-0.132]
27 The sequence (1^T(v* − v_{π_k}))_{k≥0} built by Simplex-PI is contracting with coefficient 1 − (1−γ)/n. [sent-58, score-0.132]
28 Though this observation is widely known for Howard’s PI, it was to our knowledge never mentioned explicitly in the literature for Simplex-PI. [sent-59, score-0.089]
29 These contraction properties have the following immediate consequence. [sent-60, score-0.077]
30 Let V_max = max_π ‖r_π‖∞ / (1−γ) be an upper bound on ‖v_π‖∞ for all policies π. [sent-62, score-0.119]
31 In order to get an ε-optimal policy, that is, a policy π_k satisfying ‖v* − v_{π_k}‖∞ ≤ ε, Howard’s PI requires at most ⌈ log(V_max/ε) / (1−γ) ⌉ iterations, while Simplex-PI requires at most ⌈ n log(nV_max/ε) / (1−γ) ⌉ iterations. [sent-63, score-0.264]
32 These bounds depend on the precision term ε, which means that Howard’s PI and Simplex-PI are weakly polynomial for a fixed discount factor γ. [sent-64, score-0.348]
33 An important breakthrough was recently achieved by [11], who proved that one can remove the dependency with respect to ε, and thus show that Howard’s PI and Simplex-PI are strongly polynomial for a fixed discount factor γ. [sent-65, score-0.262]
34 Simplex-PI and Howard’s PI both terminate after at most n(m − 1) ⌈ n/(1−γ) log(n²/(1−γ)) ⌉ iterations. [sent-67, score-0.076]
35 Footnote 5: A sequence of non-negative numbers (x_k)_{k≥0} is contracting with coefficient α if and only if for all k ≥ 0, x_{k+1} ≤ α x_k. [sent-72, score-0.132]
36 For Howard’s PI, ‖v* − v_{π_0}‖∞ ≤ V_max, and γ^k V_max ≤ ε as soon as k ≥ log(V_max/ε)/(1−γ); for Simplex-PI, ‖v* − v_{π_0}‖_1 ≤ n V_max. [sent-77, score-0.076]
37 For Simplex-PI, we then use (1 − (1−γ)/n)^k n V_max ≤ ε, which holds as soon as k ≥ n log(nV_max/ε)/(1−γ), and the conclusion follows. The proof is based on the fact that PI corresponds to the simplex algorithm in a linear programming formulation of the MDP problem. [sent-78, score-0.169]
38 Howard’s PI terminates after at most (nm + 1) ⌈ 1/(1−γ) log(n/(1−γ)) ⌉ iterations. [sent-81, score-0.252]
39 Our first two results, that are consequences of the contraction properties (Lemmas 2 and 3), are stated in the following theorems. [sent-82, score-0.077]
40 Howard’s PI terminates after at most n(m − 1) ⌈ 1/(1−γ) log(1/(1−γ)) ⌉ iterations. [sent-84, score-0.252]
41 Simplex-PI terminates after at most n(m − 1) ⌈ n/(1−γ) log(n/(1−γ)) ⌉ iterations. [sent-86, score-0.252]
42 Our result for Simplex-PI is only very slightly better (by a factor 2) than that of [11], and uses a proof that is more direct. [sent-88, score-0.089]
43 Using a more refined argument, we managed to also improve the bound for Simplex-PI by a factor O(log n). [sent-89, score-0.059]
44 Simplex-PI terminates after at most n²(m − 1) ( 1 + (2/(1−γ)) log(1/(1−γ)) ) iterations. [sent-91, score-0.252]
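To give a feel for the magnitudes involved, the following short computation evaluates the three γ-dependent bounds above for example values n = 10, m = 2, γ = 0.9 (the numbers are purely illustrative and are not taken from the paper):

from math import ceil, log

n, m, gamma = 10, 2, 0.9
h = 1.0 / (1.0 - gamma)                                          # 1/(1-gamma), about 10 here
bound_howard    = n * (m - 1) * ceil(h * log(h))                 # n(m-1) * ceil(1/(1-g) log 1/(1-g))
bound_simplex_a = n * (m - 1) * ceil(n * h * log(n * h))         # n(m-1) * ceil(n/(1-g) log n/(1-g))
bound_simplex_b = n * n * (m - 1) * (1 + 2 * h * log(h))         # n^2(m-1) (1 + 2/(1-g) log 1/(1-g))
print(bound_howard, bound_simplex_a, bound_simplex_b)            # roughly 240, 4610 and 4705 here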
45 Compared to Howard’s PI, our bound for Simplex-PI is a factor O(n) larger. [sent-92, score-0.059]
46 It is easy to see that the linear dependency of the bound for Howard’s PI with respect to n is optimal. [sent-95, score-0.09]
47 We conjecture that the linear dependency of both bounds with respect to m is also optimal. [sent-96, score-0.098]
48 The dependency with respect to the term 1/(1−γ) may be improved, but removing it is impossible for Howard’s PI and very unlikely for Simplex-PI. [sent-97, score-0.088]
49 [2] describes an MDP for which Howard’s PI requires an exponential (in n) number of iterations for γ = 1 and [5] argued that this holds also when γ is in the vicinity of 1. [sent-98, score-0.117]
50 Though a similar result does not seem to exist for Simplex-PI in the literature, [7] consider four variations of PI that all switch one action per iteration, and show through specifically designed MDPs that they may require an exponential (in n) number of iterations when γ = 1. [sent-99, score-0.251]
51 On this topic, [8] recently showed the following result for deterministic MDPs. [sent-101, score-0.083]
52 If the MDP is deterministic, then Simplex-PI terminates after at most O(n^5 m^2 log^2 n) iterations. [sent-103, score-0.176]
53 Given a policy π of a deterministic MDP, states are either on cycles or on paths induced by π. [sent-104, score-0.479]
54 The core of the proof relies on the following lemmas that altogether show that cycles are created regularly and that significant progress is made every time a new cycle appears; in other words, significant progress is made regularly. [sent-105, score-0.273]
55 If the MDP is deterministic, after at most nm ⌈ 2(n − 1) log n ⌉ iterations, either Simplex-PI finishes or a new cycle appears. [sent-107, score-0.13]
56 Indeed, these observations suffice to prove that Simplex-PI terminates after O(n^4 m^2 log(n/(1−γ))) = Õ(n^4 m^2) iterations. [sent-110, score-0.288]
57 Completely removing the dependency with respect to the discount factor γ (the term in O(log(1/(1−γ)))) requires careful extra work described in [8], which incurs an extra term of order O(n log n). [sent-111, score-0.184]
58 For any policy π and state i, we trivially have x_π(i) ∈ (1, n/(1−γ)). [sent-113, score-0.29]
59 The proof exploits the fact that x_π(i) belongs to the set (1, n) when i is on a path of π, while x_π(i) belongs to the set (1/(1−γ), n/(1−γ)) when i is on a cycle of π. [sent-114, score-0.107]
60 Given a policy π of a stochastic MDP, states are either in recurrent classes or transient classes (these two categories respectively generalize those of cycles and paths). [sent-116, score-0.826]
61 Let τ_t ≥ 1 and τ_r ≥ 1 be the smallest constants such that for all policies π and all states i, (1 ≤) x_π(i) ≤ τ_t if i is transient for π, and n/((1−γ)τ_r) ≤ x_π(i) ≤ n/(1−γ) if i is recurrent for π. [sent-119, score-0.578]
62 τ_t (resp. τ_r) can be seen as a measure of the time needed to leave transient states (resp. [sent-121, score-0.293]
63 the time needed to revisit states in recurrent classes). [sent-122, score-0.316]
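Whether a state is transient or recurrent under a fixed policy π is a property of the induced chain P_π alone: recurrent states are exactly those lying in a closed (bottom) strongly connected component of the transition graph. The sketch below classifies states this way; it is illustrative only (the function name, the use of SciPy, and the tolerance are assumptions made here, and it does not compute τ_t or τ_r, which depend on the quantity x_π defined in the paper):

import numpy as np
from scipy.sparse.csgraph import connected_components

def recurrent_and_transient(P_pi, tol=1e-12):
    # P_pi: (n, n) row-stochastic matrix of the chain induced by a fixed policy
    support = (P_pi > tol).astype(int)
    n_comp, labels = connected_components(support, directed=True, connection='strong')
    closed = np.ones(n_comp, dtype=bool)            # a component is closed if no edge leaves it
    for i, j in zip(*np.nonzero(support)):
        if labels[i] != labels[j]:
            closed[labels[i]] = False
    recurrent = np.where(closed[labels])[0]         # states in closed components are recurrent
    transient = np.where(~closed[labels])[0]        # all other states are transient
    return recurrent, transient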
64 If the MDP satisfies Assumption 1, after at most n [ (m − 1)⌈ n τ_t log(n τ_t) ⌉ + ⌈ n τ_t log(n² τ_t) ⌉ ] iterations, either Simplex-PI finishes or a new recurrent class appears. [sent-127, score-0.274]
65 If the MDP is aperiodic and irreducible, and thus admits a stationary distribution ν_π for any policy π, one can see that 1/τ_r = min_{π,i} ν_π(i). [sent-129, score-0.285]
66 If the MDP satisfies Assumption 1, when Simplex-PI moves from π to π′ where π′ involves a new recurrent class, we have 1^T(v_{π*} − v_{π′}) ≤ (1 − 1/τ_r) 1^T(v_{π*} − v_π). [sent-131, score-0.214]
67 From these generalized observations, we can deduce the following original result. [sent-132, score-0.069]
68 If the MDP satisfies Assumption 1, then Simplex-PI terminates after at most n²(m − 1) ( ⌈ τ_r log(n τ_r) ⌉ + ⌈ τ_r log(n τ_t) ⌉ ) [ (m − 1)⌈ n τ_t log(n τ_t) ⌉ + ⌈ n τ_t log(n² τ_t) ⌉ ] iterations. [sent-135, score-0.176]
69 This new result is a strict generalization of the result for deterministic MDPs. [sent-137, score-0.083]
70 Indeed, in the deterministic case, we have τ_t ≤ n and τ_r ≤ n, and it is easy to see that Lemmas 6, 7 and Theorem 7 respectively imply Lemmas 4, 5 and Theorem 6. [sent-138, score-0.083]
71 An immediate consequence of the above result is that Simplex-PI is strongly polynomial for sets of MDPs that are much larger than the deterministic MDPs mentioned in Theorem 6. [sent-139, score-0.25]
72 For any family of MDPs indexed by n and m such that τ_t and τ_r are polynomial functions of n and m, Simplex-PI terminates after a number of steps that is polynomial in n and m. [sent-141, score-0.272]
73 Unfortunately, and as quickly mentioned by [8], the line of analysis developed for Simplex-PI does not seem to adapt easily to Howard’s PI, because simultaneously switching several actions can interfere in a way that the policy improvement turns out to be small. [sent-144, score-0.437]
74 If the MDP satisfies Assumption 1, after at most nm ⌈ τ_t log(n τ_t) ⌉ iterations, either Howard’s PI finishes or a new recurrent class appears. [sent-151, score-0.265]
75 In a recent deterministic example due to [4] to show that Howard’s PI may require of the order of n² iterations, new cycles are created every single iteration, but the sequence of values satisfies (footnote 9), for all iterations k < n²/4 + n/4 and states i, v*(i) − v_{π_{k+1}}(i) ≥ [1 − (2/n)^k] (v*(i) − v_{π_k}(i)). [sent-154, score-0.405]
76 Contrary to Lemma 5, as k grows, the amount of contraction gets (exponentially) smaller and smaller. [sent-155, score-0.077]
77 Compared with Simplex-PI, this suggests that Howard’s PI may suffer from subtle specific pathologies. [sent-156, score-0.062]
78 In fact, the problem of determining the number of iterations required by Howard’s PI has been challenging for almost 30 years. [sent-157, score-0.085]
79 In the simplest (deterministic) case, the question is still open: the currently best known lower bound is the O(n²) bound by [4] we have just mentioned, while the best known upper bound is O(m^n / n) (valid for all MDPs) due to [6]. [sent-159, score-0.069]
80 Footnote 9: This MDP has an even number of states n = 2p. [sent-160, score-0.1]
81 The policies generated by Howard’s PI have values v_{π_k}(i) ∈ (p^{N−k−1}, p^{N−k}). [sent-163, score-0.096]
82 We deduce that for all iterations k and states i, (v*(i) − v_{π_{k+1}}(i)) / (v*(i) − v_{π_k}(i)) ≥ (1 + p^{−k−2}) / (1 + p^{−k}) = 1 − (p^{−k} − p^{−k−2}) / (1 + p^{−k}) ≥ 1 − p^{−k}(1 − p^{−2}) ≥ 1 − p^{−k}. [sent-164, score-0.254]
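A quick numerical check of the chain of inequalities in footnote 9 (purely illustrative; the value of p is an arbitrary choice made here):

p = 5                                   # arbitrary example; the construction uses n = 2p states
for k in range(20):
    ratio  = (1 + p ** -(k + 2)) / (1 + p ** -k)
    middle = 1 - (p ** -k - p ** -(k + 2)) / (1 + p ** -k)
    assert abs(ratio - middle) < 1e-12                      # the algebraic rewriting holds
    assert ratio >= 1 - p ** -k * (1 - p ** -2) >= 1 - p ** -k  # and so do the two lower bounds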
83 The state space X can be partitioned into two sets T and R such that for all policies π, the states of T are transient and those of R are recurrent. [sent-167, score-0.459]
84 For an MDP satisfying Assumptions 1-2, suppose Howard’s PI moves from π to π′ and that π′ involves a new recurrent class. [sent-170, score-0.238]
85 And we can deduce the following original bound (which also applies to Simplex-PI). [sent-172, score-0.092]
86 If the MDP satisfies Assumptions 1-2, then Howard’s PI terminates after at most n(m − 1) ( ⌈ τ_t log(n τ_t) ⌉ + ⌈ τ_r log(n τ_r) ⌉ ) iterations, while Simplex-PI terminates after at most n(m − 1) ( ⌈ n τ_t log(n τ_t) ⌉ + ⌈ τ_r log(n τ_r) ⌉ ) iterations. [sent-175, score-0.656]
87 It implies that the algorithms converge on the recurrent states independently of the transient states, and thus the analysis can be decomposed into two phases: 1) the convergence on recurrent states and then 2) the convergence on transient states (given that recurrent states do not change anymore). [sent-177, score-1.353]
88 Furthermore, the analysis of the second phase (convergence on transient states) is similar to that of the discounted case of Theorems 3 and 4. [sent-179, score-0.238]
89 Since v_{π*} − v_{π_k} is non-negative, we can take the max norm and get: ‖v_{π*} − v_{π_k}‖∞ ≤ γ ‖v_{π*} − v_{π_{k−1}}‖∞. [sent-182, score-0.057]
90 B Contraction property for Simplex-PI (Proof of Lemma 3) By using the fact that {v_π = T_π v_π ⇔ v_π = (I − γ P_π)^{-1} r_π}, we have that for all pairs of policies π and π′. [sent-183, score-0.096]
91 (1) On the one hand, by using this lemma and the fact that {T_{π_{k+1}} v_{π_k} − v_{π_k} ≥ 0}, we have for any k: v_{π_{k+1}} − v_{π_k} = (I − γ P_{π_{k+1}})^{-1} (T_{π_{k+1}} v_{π_k} − v_{π_k}) ≥ T_{π_{k+1}} v_{π_k} − v_{π_k}, which implies that 1^T(v_{π_{k+1}} − v_{π_k}) ≥ 1^T(T_{π_{k+1}} v_{π_k} − v_{π_k}). [sent-185, score-0.085]
92 From Equation (4), we deduce that for all k, v*(s_0) − [T_{π_k} v*](s_0) ≤ ‖v* − T_{π_k} v*‖∞ ≤ γ^k ‖(I − γ P_{π_0})^{-1} (v* − T_{π_0} v*)‖∞ ≤ (γ^k/(1−γ)) ‖v* − T_{π_0} v*‖∞ = (γ^k/(1−γ)) (v*(s_0) − [T_{π_0} v*](s_0)). [sent-191, score-0.069]
93 As a consequence, the action π_k(s_0) must be different from π_0(s_0) when γ^k/(1−γ) < 1, that is, for all values of k satisfying k ≥ k* = ⌈ log(1/(1−γ)) / (1−γ) ⌉ > log(1/(1−γ)) / log(1/γ). [sent-192, score-0.293]
94 In other words, if some policy π is not optimal, then one of its non-optimal actions will be eliminated for good after at most k* iterations. [sent-193, score-0.326]
95 By repeating this argument, one can eliminate all non-optimal actions (there are at most n(m − 1) of them), and the result follows. [sent-194, score-0.059]
96 Strategy iteration is strongly polynomial for 2-player turn-based stochastic games with a constant discount factor. [sent-214, score-0.197]
97 The complexity of policy iteration is exponential for discounted Markov decision processes. [sent-228, score-0.392]
98 On the complexity of the policy improvement algorithm for Markov decision processes. [sent-239, score-0.309]
99 The simplex method is strongly polynomial for deterministic Markov decision processes. [sent-244, score-0.27]
100 The simplex and policy-iteration methods are strongly polynomial for the Markov decision problem with a fixed discount rate. [sent-257, score-0.268]
wordName wordTfidf (topN-words)
[('howard', 0.666), ('pi', 0.359), ('policy', 0.24), ('mdp', 0.221), ('transient', 0.193), ('recurrent', 0.189), ('terminates', 0.176), ('states', 0.1), ('vmax', 0.098), ('policies', 0.096), ('mentionned', 0.089), ('lemma', 0.085), ('iterations', 0.085), ('deterministic', 0.083), ('discount', 0.081), ('nm', 0.077), ('contraction', 0.077), ('log', 0.076), ('yk', 0.07), ('deduce', 0.069), ('lemmas', 0.069), ('mdps', 0.066), ('switch', 0.062), ('actions', 0.059), ('nishes', 0.059), ('coe', 0.059), ('ik', 0.058), ('cycles', 0.056), ('cycle', 0.054), ('contracting', 0.054), ('proof', 0.053), ('maxs', 0.051), ('state', 0.05), ('erent', 0.048), ('polynomial', 0.048), ('discounted', 0.045), ('nvmax', 0.044), ('decision', 0.043), ('action', 0.043), ('dependency', 0.041), ('lim', 0.04), ('simplex', 0.04), ('pij', 0.039), ('iteration', 0.038), ('factor', 0.036), ('su', 0.036), ('vicinity', 0.032), ('variations', 0.032), ('bounds', 0.031), ('hansen', 0.031), ('maximal', 0.03), ('strongly', 0.03), ('ak', 0.029), ('seem', 0.029), ('max', 0.029), ('pn', 0.028), ('non', 0.028), ('satis', 0.027), ('stationary', 0.027), ('revisit', 0.027), ('eliminated', 0.027), ('markov', 0.026), ('respect', 0.026), ('di', 0.026), ('equation', 0.025), ('xk', 0.025), ('theorem', 0.025), ('greedy', 0.025), ('moves', 0.025), ('assumption', 0.025), ('created', 0.024), ('satisfying', 0.024), ('classes', 0.024), ('bellman', 0.023), ('bound', 0.023), ('france', 0.023), ('appendix', 0.022), ('facts', 0.022), ('structural', 0.021), ('unlikely', 0.021), ('partitioned', 0.02), ('isaac', 0.02), ('lorraine', 0.02), ('interfere', 0.02), ('zeitschrift', 0.02), ('sequence', 0.019), ('es', 0.019), ('optimal', 0.019), ('mn', 0.019), ('transition', 0.019), ('control', 0.018), ('icalp', 0.018), ('manage', 0.018), ('aperiodic', 0.018), ('tor', 0.018), ('operator', 0.017), ('open', 0.017), ('material', 0.017), ('regularly', 0.017), ('scherrer', 0.017)]
simIndex simValue paperId paperTitle
same-paper 1 0.99999982 140 nips-2013-Improved and Generalized Upper Bounds on the Complexity of Policy Iteration
Author: Bruno Scherrer
Abstract: Given a Markov Decision Process (MDP) with n states and m actions per state, we study the number of iterations needed by Policy Iteration (PI) algorithms to converge to the optimal γ-discounted policy. We consider two variations of PI: Howard’s PI, which changes the actions in all states with a positive advantage, and Simplex-PI, which only changes the action in the state with maximal advantage. We show that Howard’s PI terminates after at most n(m − 1)⌈ 1/(1−γ) log(1/(1−γ)) ⌉ = O( (nm/(1−γ)) log(1/(1−γ)) ) iterations, improving by a factor O(log n) a result by [3], while Simplex-PI terminates after at most n²(m − 1)( 1 + (2/(1−γ)) log(1/(1−γ)) ) = O( (n²m/(1−γ)) log(1/(1−γ)) ) iterations, improving by a factor O(log n) a result by [11]. Under some structural assumptions of the MDP, we then consider bounds that are independent of the discount factor γ: given a measure of the maximal transient time τ_t and the maximal time τ_r to revisit states in recurrent classes under all policies, we show that Simplex-PI terminates after at most n²(m − 1)( ⌈ τ_r log(n τ_r) ⌉ + ⌈ τ_r log(n τ_t) ⌉ ) [ (m − 1)⌈ n τ_t log(n τ_t) ⌉ + ⌈ n τ_t log(n² τ_t) ⌉ ] iterations. This generalizes a recent result for deterministic MDPs by [8], in which τ_t ≤ n and τ_r ≤ n.
2 0.22266586 28 nips-2013-Adaptive Step-Size for Policy Gradient Methods
Author: Matteo Pirotta, Marcello Restelli, Luca Bascetta
Abstract: In the last decade, policy gradient methods have significantly grown in popularity in the reinforcement–learning field. In particular, they have been largely employed in motor control and robotic applications, thanks to their ability to cope with continuous state and action domains and partial observable problems. Policy gradient researches have been mainly focused on the identification of effective gradient directions and the proposal of efficient estimation algorithms. Nonetheless, the performance of policy gradient methods is determined not only by the gradient direction, since convergence properties are strongly influenced by the choice of the step size: small values imply slow convergence rate, while large values may lead to oscillations or even divergence of the policy parameters. Step–size value is usually chosen by hand tuning and still little attention has been paid to its automatic selection. In this paper, we propose to determine the learning rate by maximizing a lower bound to the expected performance gain. Focusing on Gaussian policies, we derive a lower bound that is second–order polynomial of the step size, and we show how a simplified version of such lower bound can be maximized when the gradient is estimated from trajectory samples. The properties of the proposed approach are empirically evaluated in a linear–quadratic regulator problem. 1
3 0.21611799 239 nips-2013-Optimistic policy iteration and natural actor-critic: A unifying view and a non-optimality result
Author: Paul Wagner
Abstract: Approximate dynamic programming approaches to the reinforcement learning problem are often categorized into greedy value function methods and value-based policy gradient methods. As our first main result, we show that an important subset of the latter methodology is, in fact, a limiting special case of a general formulation of the former methodology; optimistic policy iteration encompasses not only most of the greedy value function methods but also natural actor-critic methods, and permits one to directly interpolate between them. The resulting continuum adjusts the strength of the Markov assumption in policy improvement and, as such, can be seen as dual in spirit to the continuum in TD(λ)-style algorithms in policy evaluation. As our second main result, we show for a substantial subset of softgreedy value function approaches that, while having the potential to avoid policy oscillation and policy chattering, this subset can never converge toward an optimal policy, except in a certain pathological case. Consequently, in the context of approximations (either in state estimation or in value function representation), the majority of greedy value function methods seem to be deemed to suffer either from the risk of oscillation/chattering or from the presence of systematic sub-optimality. 1
4 0.16832037 273 nips-2013-Reinforcement Learning in Robust Markov Decision Processes
Author: Shiau Hong Lim, Huan Xu, Shie Mannor
Abstract: An important challenge in Markov decision processes is to ensure robustness with respect to unexpected or adversarial system behavior while taking advantage of well-behaving parts of the system. We consider a problem setting where some unknown parts of the state space can have arbitrary transitions while other parts are purely stochastic. We devise an algorithm that is adaptive to potentially adversarial behavior and show that it achieves similar regret bounds as the purely stochastic case. 1
Author: Yasin Abbasi, Peter Bartlett, Varun Kanade, Yevgeny Seldin, Csaba Szepesvari
Abstract: We study the problem of online learning Markov Decision Processes (MDPs) when both the transition distributions and loss functions are chosen by an adversary. We present an algorithm that, under a mixing assumption, achieves O(√(T log |Π|) + log |Π|) regret with respect to a comparison set of policies Π. The regret is independent of the size of the state and action spaces. When expectations over sample paths can be computed efficiently and the comparison set Π has polynomial size, this algorithm is efficient. We also consider the episodic adversarial online shortest path problem. Here, in each episode an adversary may choose a weighted directed acyclic graph with an identified start and finish node. The goal of the learning algorithm is to choose a path that minimizes the loss while traversing from the start to finish node. At the end of each episode the loss function (given by weights on the edges) is revealed to the learning algorithm. The goal is to minimize regret with respect to a fixed policy for selecting paths. This problem is a special case of the online MDP problem. It was shown that for randomly chosen graphs and adversarial losses, the problem can be efficiently solved. We show that it also can be efficiently solved for adversarial graphs and randomly chosen losses. When both graphs and losses are adversarially chosen, we show that designing efficient algorithms for the adversarial online shortest path problem (and hence for the adversarial MDP problem) is as hard as learning parity with noise, a notoriously difficult problem that has been used to design efficient cryptographic schemes. Finally, we present an efficient algorithm whose regret scales linearly with the number of distinct graphs. 1
6 0.15875049 348 nips-2013-Variational Policy Search via Trajectory Optimization
7 0.15273553 347 nips-2013-Variational Planning for Graph-based MDPs
8 0.14566089 24 nips-2013-Actor-Critic Algorithms for Risk-Sensitive MDPs
9 0.13952589 32 nips-2013-Aggregating Optimistic Planning Trees for Solving Markov Decision Processes
10 0.13744472 241 nips-2013-Optimizing Instructional Policies
11 0.13731724 322 nips-2013-Symbolic Opportunistic Policy Iteration for Factored-Action MDPs
12 0.12801975 23 nips-2013-Active Learning for Probabilistic Hypotheses Using the Maximum Gibbs Error Criterion
13 0.12373927 257 nips-2013-Projected Natural Actor-Critic
14 0.12363046 79 nips-2013-DESPOT: Online POMDP Planning with Regularization
15 0.12312374 1 nips-2013-(More) Efficient Reinforcement Learning via Posterior Sampling
16 0.12119008 50 nips-2013-Bayesian Mixture Modelling and Inference based Thompson Sampling in Monte-Carlo Tree Search
17 0.11502854 150 nips-2013-Learning Adaptive Value of Information for Structured Prediction
18 0.11250609 29 nips-2013-Adaptive Submodular Maximization in Bandit Setting
19 0.10583612 235 nips-2013-Online learning in episodic Markovian decision processes by relative entropy policy search
20 0.09758848 55 nips-2013-Bellman Error Based Feature Generation using Random Projections on Sparse Spaces
topicId topicWeight
[(0, 0.179), (1, -0.251), (2, -0.099), (3, 0.128), (4, -0.013), (5, 0.02), (6, -0.066), (7, 0.019), (8, -0.008), (9, 0.012), (10, 0.104), (11, -0.027), (12, 0.023), (13, 0.036), (14, 0.076), (15, -0.032), (16, 0.027), (17, -0.015), (18, -0.029), (19, 0.04), (20, 0.011), (21, 0.007), (22, -0.046), (23, 0.002), (24, -0.042), (25, -0.009), (26, 0.018), (27, -0.032), (28, -0.071), (29, -0.056), (30, -0.051), (31, -0.029), (32, -0.029), (33, 0.06), (34, -0.035), (35, 0.005), (36, -0.032), (37, 0.084), (38, 0.003), (39, -0.057), (40, 0.01), (41, 0.055), (42, -0.107), (43, -0.012), (44, -0.051), (45, -0.018), (46, 0.09), (47, 0.032), (48, -0.124), (49, -0.024)]
simIndex simValue paperId paperTitle
same-paper 1 0.95630962 140 nips-2013-Improved and Generalized Upper Bounds on the Complexity of Policy Iteration
Author: Bruno Scherrer
Abstract: Given a Markov Decision Process (MDP) with n states and m actions per state, we study the number of iterations needed by Policy Iteration (PI) algorithms to converge to the optimal γ-discounted policy. We consider two variations of PI: Howard’s PI, which changes the actions in all states with a positive advantage, and Simplex-PI, which only changes the action in the state with maximal advantage. We show that Howard’s PI terminates after at most n(m − 1)⌈ 1/(1−γ) log(1/(1−γ)) ⌉ = O( (nm/(1−γ)) log(1/(1−γ)) ) iterations, improving by a factor O(log n) a result by [3], while Simplex-PI terminates after at most n²(m − 1)( 1 + (2/(1−γ)) log(1/(1−γ)) ) = O( (n²m/(1−γ)) log(1/(1−γ)) ) iterations, improving by a factor O(log n) a result by [11]. Under some structural assumptions of the MDP, we then consider bounds that are independent of the discount factor γ: given a measure of the maximal transient time τ_t and the maximal time τ_r to revisit states in recurrent classes under all policies, we show that Simplex-PI terminates after at most n²(m − 1)( ⌈ τ_r log(n τ_r) ⌉ + ⌈ τ_r log(n τ_t) ⌉ ) [ (m − 1)⌈ n τ_t log(n τ_t) ⌉ + ⌈ n τ_t log(n² τ_t) ⌉ ] iterations. This generalizes a recent result for deterministic MDPs by [8], in which τ_t ≤ n and τ_r ≤ n.
2 0.78778511 322 nips-2013-Symbolic Opportunistic Policy Iteration for Factored-Action MDPs
Author: Aswin Raghavan, Roni Khardon, Alan Fern, Prasad Tadepalli
Abstract: This paper addresses the scalability of symbolic planning under uncertainty with factored states and actions. Our first contribution is a symbolic implementation of Modified Policy Iteration (MPI) for factored actions that views policy evaluation as policy-constrained value iteration (VI). Unfortunately, a naïve approach to enforce policy constraints can lead to large memory requirements, sometimes making symbolic MPI worse than VI. We address this through our second and main contribution, symbolic Opportunistic Policy Iteration (OPI), which is a novel convergent algorithm lying between VI and MPI, that applies policy constraints if it does not increase the size of the value function representation, and otherwise performs VI backups. We also give a memory bounded version of this algorithm allowing a space-time tradeoff. Empirical results show significantly improved scalability over state-of-the-art symbolic planners. 1
3 0.74708861 28 nips-2013-Adaptive Step-Size for Policy Gradient Methods
Author: Matteo Pirotta, Marcello Restelli, Luca Bascetta
Abstract: In the last decade, policy gradient methods have significantly grown in popularity in the reinforcement–learning field. In particular, they have been largely employed in motor control and robotic applications, thanks to their ability to cope with continuous state and action domains and partial observable problems. Policy gradient researches have been mainly focused on the identification of effective gradient directions and the proposal of efficient estimation algorithms. Nonetheless, the performance of policy gradient methods is determined not only by the gradient direction, since convergence properties are strongly influenced by the choice of the step size: small values imply slow convergence rate, while large values may lead to oscillations or even divergence of the policy parameters. Step–size value is usually chosen by hand tuning and still little attention has been paid to its automatic selection. In this paper, we propose to determine the learning rate by maximizing a lower bound to the expected performance gain. Focusing on Gaussian policies, we derive a lower bound that is second–order polynomial of the step size, and we show how a simplified version of such lower bound can be maximized when the gradient is estimated from trajectory samples. The properties of the proposed approach are empirically evaluated in a linear–quadratic regulator problem. 1
4 0.72897267 347 nips-2013-Variational Planning for Graph-based MDPs
Author: Qiang Cheng, Qiang Liu, Feng Chen, Alex Ihler
Abstract: Markov Decision Processes (MDPs) are extremely useful for modeling and solving sequential decision making problems. Graph-based MDPs provide a compact representation for MDPs with large numbers of random variables. However, the complexity of exactly solving a graph-based MDP usually grows exponentially in the number of variables, which limits their application. We present a new variational framework to describe and solve the planning problem of MDPs, and derive both exact and approximate planning algorithms. In particular, by exploiting the graph structure of graph-based MDPs, we propose a factored variational value iteration algorithm in which the value function is first approximated by the multiplication of local-scope value functions, then solved by minimizing a Kullback-Leibler (KL) divergence. The KL divergence is optimized using the belief propagation algorithm, with complexity exponential in only the cluster size of the graph. Experimental comparison on different models shows that our algorithm outperforms existing approximation algorithms at finding good policies. 1
5 0.70888382 239 nips-2013-Optimistic policy iteration and natural actor-critic: A unifying view and a non-optimality result
Author: Paul Wagner
Abstract: Approximate dynamic programming approaches to the reinforcement learning problem are often categorized into greedy value function methods and value-based policy gradient methods. As our first main result, we show that an important subset of the latter methodology is, in fact, a limiting special case of a general formulation of the former methodology; optimistic policy iteration encompasses not only most of the greedy value function methods but also natural actor-critic methods, and permits one to directly interpolate between them. The resulting continuum adjusts the strength of the Markov assumption in policy improvement and, as such, can be seen as dual in spirit to the continuum in TD(λ)-style algorithms in policy evaluation. As our second main result, we show for a substantial subset of softgreedy value function approaches that, while having the potential to avoid policy oscillation and policy chattering, this subset can never converge toward an optimal policy, except in a certain pathological case. Consequently, in the context of approximations (either in state estimation or in value function representation), the majority of greedy value function methods seem to be deemed to suffer either from the risk of oscillation/chattering or from the presence of systematic sub-optimality. 1
6 0.70089304 273 nips-2013-Reinforcement Learning in Robust Markov Decision Processes
7 0.66754639 38 nips-2013-Approximate Dynamic Programming Finally Performs Well in the Game of Tetris
8 0.63235921 79 nips-2013-DESPOT: Online POMDP Planning with Regularization
9 0.63163269 241 nips-2013-Optimizing Instructional Policies
10 0.62679023 257 nips-2013-Projected Natural Actor-Critic
11 0.62378377 55 nips-2013-Bellman Error Based Feature Generation using Random Projections on Sparse Spaces
12 0.62217331 270 nips-2013-Regret based Robust Solutions for Uncertain Markov Decision Processes
13 0.61574954 32 nips-2013-Aggregating Optimistic Planning Trees for Solving Markov Decision Processes
14 0.60806245 24 nips-2013-Actor-Critic Algorithms for Risk-Sensitive MDPs
15 0.59643066 348 nips-2013-Variational Policy Search via Trajectory Optimization
16 0.5739131 29 nips-2013-Adaptive Submodular Maximization in Bandit Setting
17 0.56169164 165 nips-2013-Learning from Limited Demonstrations
18 0.55477607 1 nips-2013-(More) Efficient Reinforcement Learning via Posterior Sampling
20 0.52954847 50 nips-2013-Bayesian Mixture Modelling and Inference based Thompson Sampling in Monte-Carlo Tree Search
topicId topicWeight
[(16, 0.015), (21, 0.012), (33, 0.101), (34, 0.103), (41, 0.02), (49, 0.012), (56, 0.15), (70, 0.026), (85, 0.396), (89, 0.025), (93, 0.018), (95, 0.017)]
simIndex simValue paperId paperTitle
1 0.93332088 272 nips-2013-Regularized Spectral Clustering under the Degree-Corrected Stochastic Blockmodel
Author: Tai Qin, Karl Rohe
Abstract: Spectral clustering is a fast and popular algorithm for finding clusters in networks. Recently, Chaudhuri et al. [1] and Amini et al. [2] proposed inspired variations on the algorithm that artificially inflate the node degrees for improved statistical performance. The current paper extends the previous statistical estimation results to the more canonical spectral clustering algorithm in a way that removes any assumption on the minimum degree and provides guidance on the choice of the tuning parameter. Moreover, our results show how the “star shape” in the eigenvectors–a common feature of empirical networks–can be explained by the Degree-Corrected Stochastic Blockmodel and the Extended Planted Partition model, two statistical models that allow for highly heterogeneous degrees. Throughout, the paper characterizes and justifies several of the variations of the spectral clustering algorithm in terms of these models. 1
2 0.91290325 7 nips-2013-A Gang of Bandits
Author: Nicolò Cesa-Bianchi, Claudio Gentile, Giovanni Zappella
Abstract: Multi-armed bandit problems formalize the exploration-exploitation trade-offs arising in several industrially relevant applications, such as online advertisement and, more generally, recommendation systems. In many cases, however, these applications have a strong social component, whose integration in the bandit algorithm could lead to a dramatic performance increase. For instance, content may be served to a group of users by taking advantage of an underlying network of social relationships among them. In this paper, we introduce novel algorithmic approaches to the solution of such networked bandit problems. More specifically, we design and analyze a global recommendation strategy which allocates a bandit algorithm to each network node (user) and allows it to “share” signals (contexts and payoffs) with the neighboring nodes. We then derive two more scalable variants of this strategy based on different ways of clustering the graph nodes. We experimentally compare the algorithm and its variants to state-of-the-art methods for contextual bandits that do not use the relational information. Our experiments, carried out on synthetic and real-world datasets, show a consistent increase in prediction performance obtained by exploiting the network structure. 1
3 0.85387135 332 nips-2013-Tracking Time-varying Graphical Structure
Author: Erich Kummerfeld, David Danks
Abstract: Structure learning algorithms for graphical models have focused almost exclusively on stable environments in which the underlying generative process does not change; that is, they assume that the generating model is globally stationary. In real-world environments, however, such changes often occur without warning or signal. Real-world data often come from generating models that are only locally stationary. In this paper, we present LoSST, a novel, heuristic structure learning algorithm that tracks changes in graphical model structure or parameters in a dynamic, real-time manner. We show by simulation that the algorithm performs comparably to batch-mode learning when the generating graphical structure is globally stationary, and significantly better when it is only locally stationary. 1
same-paper 4 0.85079205 140 nips-2013-Improved and Generalized Upper Bounds on the Complexity of Policy Iteration
Author: Bruno Scherrer
Abstract: Given a Markov Decision Process (MDP) with n states and m actions per state, we study the number of iterations needed by Policy Iteration (PI) algorithms to converge to the optimal γ-discounted policy. We consider two variations of PI: Howard’s PI, which changes the actions in all states with a positive advantage, and Simplex-PI, which only changes the action in the state with maximal advantage. We show that Howard’s PI terminates after at most n(m − 1)⌈ 1/(1−γ) log(1/(1−γ)) ⌉ = O( (nm/(1−γ)) log(1/(1−γ)) ) iterations, improving by a factor O(log n) a result by [3], while Simplex-PI terminates after at most n²(m − 1)( 1 + (2/(1−γ)) log(1/(1−γ)) ) = O( (n²m/(1−γ)) log(1/(1−γ)) ) iterations, improving by a factor O(log n) a result by [11]. Under some structural assumptions of the MDP, we then consider bounds that are independent of the discount factor γ: given a measure of the maximal transient time τ_t and the maximal time τ_r to revisit states in recurrent classes under all policies, we show that Simplex-PI terminates after at most n²(m − 1)( ⌈ τ_r log(n τ_r) ⌉ + ⌈ τ_r log(n τ_t) ⌉ ) [ (m − 1)⌈ n τ_t log(n τ_t) ⌉ + ⌈ n τ_t log(n² τ_t) ⌉ ] iterations. This generalizes a recent result for deterministic MDPs by [8], in which τ_t ≤ n and τ_r ≤ n.
5 0.84451735 168 nips-2013-Learning to Pass Expectation Propagation Messages
Author: Nicolas Heess, Daniel Tarlow, John Winn
Abstract: Expectation Propagation (EP) is a popular approximate posterior inference algorithm that often provides a fast and accurate alternative to sampling-based methods. However, while the EP framework in theory allows for complex non-Gaussian factors, there is still a significant practical barrier to using them within EP, because doing so requires the implementation of message update operators, which can be difficult and require hand-crafted approximations. In this work, we study the question of whether it is possible to automatically derive fast and accurate EP updates by learning a discriminative model (e.g., a neural network or random forest) to map EP message inputs to EP message outputs. We address the practical concerns that arise in the process, and we provide empirical analysis on several challenging and diverse factors, indicating that there is a space of factors where this approach appears promising. 1
6 0.74813455 89 nips-2013-Dimension-Free Exponentiated Gradient
7 0.71952587 206 nips-2013-Near-Optimal Entrywise Sampling for Data Matrices
8 0.70724803 196 nips-2013-Modeling Overlapping Communities with Node Popularities
9 0.69040555 232 nips-2013-Online PCA for Contaminated Data
10 0.67725182 104 nips-2013-Efficient Online Inference for Bayesian Nonparametric Relational Models
11 0.67356014 125 nips-2013-From Bandits to Experts: A Tale of Domination and Independence
12 0.66432017 289 nips-2013-Scalable kernels for graphs with continuous attributes
13 0.66220289 285 nips-2013-Robust Transfer Principal Component Analysis with Rank Constraints
14 0.64329022 29 nips-2013-Adaptive Submodular Maximization in Bandit Setting
15 0.63419592 2 nips-2013-(Nearly) Optimal Algorithms for Private Online Learning in Full-information and Bandit Settings
16 0.63078082 282 nips-2013-Robust Multimodal Graph Matching: Sparse Coding Meets Graph Matching
17 0.62739593 25 nips-2013-Adaptive Anonymity via $b$-Matching
19 0.61609161 316 nips-2013-Stochastic blockmodel approximation of a graphon: Theory and consistent estimation
20 0.61192471 79 nips-2013-DESPOT: Online POMDP Planning with Regularization