jmlr jmlr2012 jmlr2012-34 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Mohammad Gheshlaghi Azar, Vicenç Gómez, Hilbert J. Kappen
Abstract: In this paper, we propose a novel policy iteration method, called dynamic policy programming (DPP), to estimate the optimal policy in the infinite-horizon Markov decision processes. DPP is an incremental algorithm that forces a gradual change in policy update. This allows us to prove finite-iteration and asymptotic ℓ∞ -norm performance-loss bounds in the presence of approximation/estimation error which depend on the average accumulated error as opposed to the standard bounds which are expressed in terms of the supremum of the errors. The dependency on the average error is important in problems with limited number of samples per iteration, for which the average of the errors can be significantly smaller in size than the supremum of the errors. Based on these theoretical results, we prove that a sampling-based variant of DPP (DPP-RL) asymptotically converges to the optimal policy. Finally, we illustrate numerically the applicability of these results on some benchmark problems and compare the performance of the approximate variants of DPP with some existing reinforcement learning (RL) methods. Keywords: approximate dynamic programming, reinforcement learning, Markov decision processes, Monte-Carlo methods, function approximation
Reference: text
sentIndex sentText sentNum sentScore
1 DPP is an incremental algorithm that forces a gradual change in policy update. [sent-12, score-0.377]
2 Introduction Many problems in robotics, operations research and process control can be represented as a control problem that can be solved by finding the optimal policy using dynamic programming (DP). [sent-18, score-0.56]
3 Examples of such approximate dynamic programming (ADP) methods are approximate policy iteration (API) and approximate value iteration (AVI) (Bertsekas, 2007; Lagoudakis and Parr, 2003; Perkins and Precup, 2003; de Farias and Van Roy, 2000). [sent-22, score-0.63]
4 , 1983), which explicitly consider two interacting processes, policy gradient methods (Baxter and Bartlett, 2001; Sutton et al. [sent-25, score-0.377]
5 the state-action pair (x, a) and πk is the control policy at iteration k. [sent-36, score-0.442]
6 In this paper, we propose a new incremental policy-iteration algorithm called dynamic policy programming (DPP). [sent-48, score-0.479]
7 We also introduce a new RL algorithm based on the DPP update rule, called DPP-RL, and prove that it converges to the optimal policy with a convergence rate of order 1/√k. [sent-52, score-0.484]
8 This rate of convergence leads to a PAC (“probably approximately correct”) sample-complexity bound of order O(1/((1 − γ)⁶ε²)) to find an ε-optimal policy with high probability, which is superior to the best existing result of standard Q-learning (Even-Dar and Mansour, 2003). [sent-53, score-0.42]
9 In the case of API, εk is the policy evaluation error (see Farahmand et al. [sent-57, score-0.377]
10 (1983), since both methods make use of an approximation of the optimal policy by means of action preferences and soft-max policy. [sent-60, score-0.57]
11 However, DPP uses a different update rule which is only expressed in terms of the action preferences and does not rely on the estimate of the value function to criticize the control policy. [sent-61, score-0.262]
12 The contribution of this work is mainly theoretical, and focused on the problem of estimating the optimal policy in an infinite-horizon MDP. [sent-62, score-0.404]
13 The transition P is a probability kernel over the next state upon taking action a from state x, which we shall denote by P(·|x, a). [sent-87, score-0.22]
14 A reward r(x, a) ∈ R is associated with each state x and action a. [sent-89, score-0.217]
15 A Markovian policy kernel determines the distribution of the control action given the current state. [sent-94, score-0.514]
16 The policy is called stationary if the distribution of the control action is independent of time. [sent-95, score-0.514]
17 A policy is called deterministic if for any state x ∈ X there exists some action a such that π(a|x) = 1. [sent-97, score-0.527]
18 Given the policy π, its corresponding value function V π : X → R denotes the expected total discounted reward in each state x, when the action is chosen by policy π, which we denote by V π (x). [sent-98, score-0.971]
19 Therefore, we introduce Qπ : Z → R as the expected total discounted reward upon choosing action a from state x and then following policy π, which we shall denote by Qπ (x, a). [sent-100, score-0.594]
20 The goal is to find a policy π∗ that attains the optimal value function, V∗(x) = sup_π Vπ(x), at all states x ∈ X. [sent-102, score-0.43]
21 In other words, for any two real-valued action-value functions Q and Q′ and every policy π, we have ‖Tπ Q − Tπ Q′‖ ≤ γ‖Q − Q′‖. [sent-111, score-0.377]
22 ‖TQ − TQ′‖ ≤ γ‖Q − Q′‖. The policy distribution π defines a right-linear operator Pπ· as (Pπ Q)(x, a) = ∑_{(y,b)∈Z} π(b|y)P(y|x, a)Q(y, b), ∀(x, a) ∈ Z. [sent-112, score-0.411]
23 We note that for every Q : Z → R, V : X → R and policy π, we have (π[Q + V])(x) = (πQ)(x) + V(x), ∀x ∈ X, and (Pπ[Q + V])(x, a) = (Pπ Q)(x, a) + (PV)(x, a), ∀(x, a) ∈ Z. [sent-116, score-0.377]
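For reference, the operators appearing in the fragments above can be collected in one place. The rendering below is the standard one for discounted MDPs and is reconstructed from context rather than quoted from the paper:

```latex
% Standard operators for a discounted MDP (X, A, P, r, gamma), consistent with
% the extracts above; reconstructed here, not copied verbatim from the paper.
\begin{align*}
(P^{\pi} Q)(x,a) &= \sum_{(y,b)\in\mathcal{Z}} \pi(b\mid y)\, P(y\mid x,a)\, Q(y,b),\\
(T^{\pi} Q)(x,a) &= r(x,a) + \gamma\, (P^{\pi} Q)(x,a),\\
(T Q)(x,a)       &= r(x,a) + \gamma \sum_{y\in\mathcal{X}} P(y\mid x,a)\, \max_{b\in\mathcal{A}} Q(y,b),\\
\|T^{\pi} Q - T^{\pi} Q'\| &\le \gamma\,\|Q - Q'\|, \qquad
\|T Q - T Q'\| \le \gamma\,\|Q - Q'\|.
\end{align*}
```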
24 We first present the dynamic policy programming (DPP) algorithm in Section 3. [sent-123, score-0.479]
25 1 Algorithm DPP is a policy iteration algorithm which represents the policy πk in terms of some action preference numbers Ψk (Sutton and Barto, 1998, Chapter 2. [sent-128, score-0.902]
26 Starting at Ψ0, DPP iterates the action preferences of all state-action pairs (x, a) ∈ Z through the DPP operator O (the pseudo-code of DPP is presented in Algorithm 1): Ψk+1(x, a) = (OΨk)(x, a) = Ψk(x, a) − (Mη Ψk)(x) + r(x, a) + γ(P Mη Ψk)(x, a), where Mη denotes the softmax operator. [sent-130, score-0.2]
27 The control policy πk is then computed as a function of Ψk at each iteration k: πk(a|x) = exp(ηΨk(x, a)) / ∑_{b∈A} exp(ηΨk(x, b)), ∀(x, a) ∈ Z. [sent-132, score-0.442]
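A minimal tabular sketch of the update rule and soft-max policy quoted in the two extracts above; the NumPy layout and function names are mine, not the paper's Algorithm 1:

```python
import numpy as np

def softmax_policy(psi, eta):
    """pi_k(a|x) proportional to exp(eta * Psi_k(x, a)), normalized over actions."""
    z = eta * (psi - psi.max(axis=1, keepdims=True))   # shift for numerical stability
    p = np.exp(z)
    return p / p.sum(axis=1, keepdims=True)

def dpp_iteration(psi, r, P, gamma, eta):
    """One sweep of the DPP operator O over all state-action pairs.

    psi : (n_states, n_actions) action preferences Psi_k
    r   : (n_states, n_actions) rewards
    P   : (n_states, n_actions, n_states) transition kernel P(y | x, a)
    """
    pi = softmax_policy(psi, eta)
    m_psi = (pi * psi).sum(axis=1)               # (M_eta Psi_k)(x): soft-max average over actions
    p_m_psi = np.einsum('xay,y->xa', P, m_psi)   # (P M_eta Psi_k)(x, a)
    return psi - m_psi[:, None] + r + gamma * p_m_psi
```

Iterating `psi = dpp_iteration(psi, r, P, gamma, eta)` and reading the policy off with `softmax_policy(psi, eta)` reproduces the tabular scheme described here.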
28 Also, assume that Ψ0 is uniformly bounded by Vmax for all (x, a) ∈ Z; then the following inequality holds for the policy induced by DPP at iteration k ≥ 0: ‖Q∗ − Qπk‖ ≤ 2γ(4Vmax + log(|A|)/η) / ((1 − γ)²(k + 1)). [sent-140, score-0.441]
29 Note that the DPP algorithm converges to the optimal policy for every η > 0 and choosing a different η only changes the rate of convergence. [sent-143, score-0.436]
30 The best rate of convergence is achieved by setting η = ∞, for which the softmax policy and the softmax operator Mη are replaced with the greedy policy and the max-operator M, respectively. [sent-144, score-0.809]
31 In words, the policy induced by DPP asymptotically converges to the optimal policy π∗ . [sent-151, score-0.862]
32 The following corollary shows that there exists a unique limit for the action preferences as k → ∞ if the optimal policy π∗ is unique. [sent-152, score-0.57]
33 Assume that the optimal policy π∗ is unique and let Ψk(x, a), for all (x, a) ∈ Z, be the action preference after k iterations of DPP. [sent-154, score-0.552]
34 Notice that the assumption on the uniqueness of the optimal policy π∗ is not required for the main result of this section (Theorem 2). [sent-158, score-0.404]
35 Also, the fact that in Corollary 4 the action preferences of sub-optimal actions tend to −∞ is the natural consequence of the convergence of πk to the optimal policy π∗ , which forces the probability of the sub-optimal actions to be 0. [sent-159, score-0.649]
36 Also, to compute the optimal policy by DPP, explicit knowledge of the model is required. [sent-162, score-0.404]
37 Instead, it may be possible to simulate the state transitions by Monte-Carlo sampling and then estimate the optimal policy using these samples. [sent-164, score-0.495]
38 Then the following inequality holds for the policy induced by approximate DPP at round k: ‖Q∗ − Qπk‖ ≤ 2γ(4Vmax + log(|A|)/η) / ((1 − γ)²(k + 1)) + 1/((1 − γ)(k + 1)) ∑_{j=0}^{k} γ^(k−j) ‖E_j‖. [sent-182, score-0.428]
39 noise and asymptotically converges to the optimal policy whereas there is no guarantee, in this case, for the convergence of API and AVI to the optimal solution. [sent-202, score-0.507]
40 We prove, in the next subsection, that DPP-RL satisfies this assumption and, therefore, asymptotically converges to the optimal policy (see Theorem 9). [sent-211, score-0.459]
41 2 Reinforcement Learning with Dynamic Policy Programming To compute the optimal policy by DPP one needs an explicit knowledge of the model. [sent-213, score-0.404]
42 The optimal policy can then be learned using these samples. [sent-215, score-0.404]
43 Based on Equation 9, we estimate the optimal policy by iterating some initial Ψ0 through the DPP-RL update rule, where at each iteration we draw yk for every (x, a) ∈ Z. [sent-219, score-0.499]
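The extracts do not preserve Equation 9 itself; the sketch below assumes the natural empirical version of the DPP operator, in which the expectation (P Mη Ψk)(x, a) is replaced by Mη Ψk(yk) at a single sampled next state yk ∼ P(·|x, a). Treat it as an illustration of the idea, not the paper's exact update:

```python
import numpy as np

def dpp_rl_update(psi, r, sample_next_state, gamma, eta, rng):
    """One synchronous DPP-RL-style sweep: each (x, a) uses one sampled next state.

    psi : (n_states, n_actions) action preferences Psi_k
    r   : (n_states, n_actions) rewards
    sample_next_state(x, a, rng) -> y drawn from P(.|x, a)  (generative model)
    """
    z = eta * (psi - psi.max(axis=1, keepdims=True))
    pi = np.exp(z) / np.exp(z).sum(axis=1, keepdims=True)   # soft-max policy pi_k
    m_psi = (pi * psi).sum(axis=1)                          # (M_eta Psi_k)(x)
    new_psi = np.empty_like(psi)
    n_states, n_actions = psi.shape
    for x in range(n_states):
        for a in range(n_actions):
            y = sample_next_state(x, a, rng)                # Monte-Carlo next-state sample
            # sampled analogue of Psi_k(x,a) - (M_eta Psi_k)(x) + r(x,a) + gamma (P M_eta Psi_k)(x,a)
            new_psi[x, a] = psi[x, a] - m_psi[x] + r[x, a] + gamma * m_psi[y]
    return new_psi
```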
44 However, the new algorithm still converges to the optimal policy since one can show that the errors associated with approximating Equation 5 are asymptotically averaged out by DPP-RL, as postulated by Corollary 6. [sent-224, score-0.459]
45 Assume that the initial action-value function Ψ0 is uniformly bounded by Vmax and πk is the policy induced by Ψk after k iterations of DPP-RL. [sent-232, score-0.441]
46 We also prove the following result on the convergence rate of DPP-RL to the optimal policy by making use of the result of Theorem 5: Theorem 10 (Finite-time high-probability loss-bound of DPP-RL) Let Assumption 1 hold and k be a positive integer and 0 < δ < 1. [sent-238, score-0.404]
47 Algorithm 3 presents the sampling-based approximate dynamic policy programming (SADPP) algorithm, in which we rely on Equation 10 to approximate the DPP operator at each iteration. [sent-291, score-0.584]
48 We compare it with a synchronous variant of Q-learning. Algorithm 3 (SADPP): sampling-based approximate dynamic policy programming; Input: θ0, η, γ, α, K and N; for k = 0, 1, 2, ... [sent-296, score-0.504]
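Only the inputs of Algorithm 3 survive in this extract, so the following is a hedged sketch of one plausible SADPP-style step: fitting a linear model θ to empirical DPP targets by ridge (ℓ2-regularized) regression over N sampled transitions. The feature maps and the exact form of the target are assumptions, not the paper's Equation 10:

```python
import numpy as np

def boltzmann_average(psi, eta):
    """(M_eta Psi)(x): soft-max-weighted average of Psi(x, .) over actions."""
    w = np.exp(eta * (psi - psi.max(axis=-1, keepdims=True)))
    w /= w.sum(axis=-1, keepdims=True)
    return (w * psi).sum(axis=-1)

def sadpp_style_step(theta, phi_xa, phi_x_all, phi_y_all, r, gamma, eta, alpha):
    """Illustrative regularized regression step (not the paper's exact Equation 10).

    phi_xa    : (N, d)       features of the sampled pairs (x_n, a_n)
    phi_x_all : (N, |A|, d)  features of every action at x_n
    phi_y_all : (N, |A|, d)  features of every action at the sampled next state y_n
    r         : (N,)         sampled rewards
    alpha     : ridge (l2) regularization coefficient
    """
    psi_xa = phi_xa @ theta                             # Psi_k(x_n, a_n) under the linear model
    m_x = boltzmann_average(phi_x_all @ theta, eta)     # (M_eta Psi_k)(x_n)
    m_y = boltzmann_average(phi_y_all @ theta, eta)     # (M_eta Psi_k)(y_n)
    targets = psi_xa - m_x + r + gamma * m_y            # empirical DPP operator applied to Psi_k
    A = phi_xa.T @ phi_xa + alpha * np.eye(phi_xa.shape[1])
    return np.linalg.solve(A, phi_xa.T @ targets)       # theta_{k+1}
```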
49 , 2008) and regularized least-squares policy iteration (REG-LSPI) (Farahmand et al. [sent-308, score-0.415]
50 The transition probability from an interior state xk to any other state xl is inversely proportional to the distance in the direction of the selected action. [sent-319, score-0.223]
51 States x1 and x2500 are the two absorbing states and state xk is an example of an interior state. [sent-328, score-0.211]
52 P(xl|xk, a) = n(xl, a, xk) / ∑_{xm∈X} n(xm, a, xk). Transitions to an absorbing state have associated reward 1 and transitions to any interior state have associated reward −1. [sent-333, score-0.359]
53 The optimal policy corresponding to this problem is to reach the closest absorbing state as soon as possible. [sent-334, score-0.517]
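A sketch of how inverse-distance transition probabilities of this kind could be constructed for the linear chain (states indexed 0, …, 2499 here); the specific weighting n(·) ∝ 1/distance is an assumption consistent with the description above, not the paper's exact definition:

```python
import numpy as np

def chain_transition_matrix(n_states=2500, absorbing=(0, 2499)):
    """P[a, x, y]: probability of moving from x to y under action a in {-1, +1}."""
    P = np.zeros((2, n_states, n_states))
    for ai, direction in enumerate((-1, +1)):
        for x in range(n_states):
            if x in absorbing:
                P[ai, x, x] = 1.0                       # absorbing end states self-loop
                continue
            # candidate next states lie in the direction of the chosen action
            targets = range(x - 1, -1, -1) if direction == -1 else range(x + 1, n_states)
            weights = np.array([1.0 / abs(y - x) for y in targets])   # assumed n(y, a, x)
            P[ai, x, list(targets)] = weights / weights.sum()
    return P
```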
54 In this problem, however, there is only one absorbing state (corresponding to the lock-opened state) with an associated reward of 1. [sent-340, score-0.22]
55 Otherwise, if at some state xk , k < 2500, action −1 is taken, the lock automatically resets to some previous state xl , l < k randomly (in the original problem, the reset state is always the initial state x1 ). [sent-345, score-0.434]
56 The transition probability upon taking the wrong action −1 from state xk to state xl is P(xl|xk, −1), which is, as before, inversely proportional to the distance between the states. [sent-348, score-0.333]
57 From xk any previous state is reachable with transition probability (arrow thickness) proportional to the inverse of the distance to xk . [sent-353, score-0.237]
58 Although the state space of the grid world is of the same size as the previous two problems, |X| = 2500, the joint action-state space is larger, |Z| = 10⁴. [sent-359, score-0.22]
59 This means that both the top-left absorbing state and the central state have the least possible reward (−1), and that the remaining absorbing states have rewards that increase proportionally to the distance to the state in the bottom-right corner (but remain negative). [sent-364, score-0.426]
60 The transition probabilities are defined in the following way: taking action a from any nonabsorbing state x results in a one-step transition in the direction of action a with probability 0. [sent-365, score-0.32]
61 The resulting optimal policy is to survive in the grid as long as possible by avoiding both the absorbing walls and the center of the grid. [sent-368, score-0.533]
62 1 Experimental Setup and Results For consistency with the theoretical results, we evaluate the performance of all algorithms in terms of the ℓ∞-norm error ‖Q∗ − Qπk‖ of the action-value function obtained by the policy πk induced at iteration k. [sent-372, score-0.441]
63 For consistency with the theoretical results, we use as the error measure the distance between the optimal action-value function and the value function of the policy induced by the algorithms. [sent-389, score-0.43]
64 , 2009): It can be regarded as a Monte-Carlo sampling implementation of approximate policy iteration (API) with action-state representation (see also Lagoudakis and Parr, 2003). [sent-471, score-0.461]
65 We also fix an upper bound for the states, xmax = 10, and modify the problem definition such that if the next state y happens to be outside of the domain [0, xmax] then the product is replaced immediately, and a new state is drawn as if action a = 1 were chosen in the previous time step. [sent-490, score-0.212]
66 We measure the performance loss of the algorithms in terms of the difference between the optimal action a∗ and the action selected by the algorithms. [sent-491, score-0.247]
67 6), which makes use of a separate structure to store the value function (critic) and the control policy (actor). [sent-544, score-0.404]
68 In PGAC, the actor updates the parameterized policy in the direction of the (natural) gradient of performance, provided by the critic. [sent-547, score-0.377]
69 The parameter η in DPP is reminiscent of the learning step β in PGAC methods, since it influences the rate of change of the policy and in this sense may play a similar role as the learning step β in PGAC (Konda and Tsitsiklis, 2003; Peters and Schaal, 2008). [sent-551, score-0.377]
70 This is in contrast to our previously mentioned result on the convergence of DPP-RL in Theorem 10, which guarantees that, regardless of the value of η and γ, DPP-RL always converges to the optimal policy with a rate of order 1/√k. [sent-566, score-0.457]
71 Like DPP-RL, SQL converges to the optimal policy with a rate of convergence of order 1/√k. [sent-583, score-0.457]
72 Also, the natural policy gradient method can be seen as a relative entropy method, in which the second-order Taylor expansion of the relative-entropy between the distribution of the states is considered as the metric for policy improvement (Bagnell and Schneider, 2003). [sent-591, score-0.78]
73 Another relevant study is relative entropy policy search (REPS) (Daniel et al. [sent-592, score-0.377]
74 , 2010) which relies on the idea of minimizing the relative entropy to control the size of policy update. [sent-594, score-0.404]
75 Discussion and Future Work We have presented a new approach, dynamic policy programming (DPP), to compute the optimal policy in infinite-horizon discounted-reward MDPs. [sent-598, score-0.883]
76 We have theoretically proven the convergence of DPP to the optimal policy for the tabular case. [sent-599, score-0.425]
77 However, the latter can be easily derived for QL from the inequality ‖Q∗ − Qπk‖ ≤ 1/(1 − γ) ‖Q∗ − Qk‖, where πk is the greedy policy w. [sent-603, score-0.377]
78 We have proven that DPP-RL converges to the optimal policy with a rate of 1/√k. [sent-610, score-0.436]
79 These methods include, for instance, PGAC methods that use sequences of samples to learn the value function of the current policy (Peters and Schaal, 2008; Konda and Tsitsiklis, 2003; Sutton et al. [sent-623, score-0.399]
80 One idea is to estimate Mη Ψk(x) by Monte-Carlo simulation (MacKay, 2003, Chapter 29), since Mη Ψk(x) is the expected value of Ψk(x, a) under the soft-max policy πk. [sent-632, score-0.377]
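A minimal sketch of that estimator: sample actions from the soft-max policy and average Ψk(x, a). In the tabular case the expectation could of course be computed exactly, so this only illustrates the Monte-Carlo idea; the function and variable names are mine:

```python
import numpy as np

def monte_carlo_m_eta(psi_x, eta, n_samples, rng):
    """Estimate (M_eta Psi)(x) = E_{a ~ pi(.|x)}[Psi(x, a)] by sampling actions.

    psi_x : (n_actions,) action preferences at a single state x
    """
    z = eta * (psi_x - psi_x.max())
    pi_x = np.exp(z) / np.exp(z).sum()                  # soft-max policy pi_k(.|x)
    actions = rng.choice(len(psi_x), size=n_samples, p=pi_x)
    return psi_x[actions].mean()                        # Monte-Carlo estimate

# usage sketch
rng = np.random.default_rng(0)
print(monte_carlo_m_eta(np.array([0.1, 0.5, -0.2]), eta=2.0, n_samples=1000, rng=rng))
```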
81 Consider the relative entropy between the policy π and some baseline policy π̄: g^π_π̄(x) ≜ KL(π(·|x) ‖ π̄(·|x)) = ∑_{a∈A} π(a|x) log(π(a|x)/π̄(a|x)), ∀x ∈ X. [sent-648, score-0.779]
82 the state transition probability distribution P and the policy π. [sent-654, score-0.447]
83 Equation 13 is a modified version of Equation 2 where, in addition to maximizing the expected reward, the optimal policy π̄∗ also minimizes the distance to the baseline policy π̄. [sent-657, score-0.806]
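Equation 13 itself is not preserved in these extracts; an objective of the general kind described here (expected reward penalized by 1/η times the KL divergence from the baseline policy) can be sketched as follows. This is a plausible form written for illustration, not necessarily the paper's exact equation:

```latex
% Sketch of a KL-regularized control objective of the kind described above:
% maximize discounted reward while staying close to the baseline policy \bar\pi.
\bar V^{*}(x) \;=\; \max_{\pi}\;
\mathbb{E}\!\left[\sum_{t\ge 0}\gamma^{t}\Big(r(x_t,a_t)
  - \tfrac{1}{\eta}\,\mathrm{KL}\big(\pi(\cdot\mid x_t)\,\big\|\,\bar\pi(\cdot\mid x_t)\big)\Big)
  \,\Big|\; x_0 = x\right].
```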
84 The optimal policy π̄∗ is a function of the base policy π̄, the optimal value function V̄∗ and the state transition probability P. [sent-671, score-0.501]
85 The idea to further improve the policy towards π∗ is to replace the baseline policy π̄ with the newly computed policy of Equation 14. [sent-675, score-1.131]
86 The new policy can be regarded as a new baseline policy, and the process can be repeated. [sent-676, score-0.377]
87 This leads to a double-loop algorithm to find the optimal policy π∗ , where the outer-loop and the inner-loop would consist of a policy update, Equation 14, and a value function update, Equation 18, respectively. [sent-677, score-0.781]
88 Replacing Lη with Mη is motivated by the following relation between these two operators: |Lη Ψ(x) − Mη Ψ(x)| = (1/η) Hπ(x) ≤ log(|A|)/η, ∀x ∈ X, (25) where Hπ(x) is the entropy of the policy distribution π obtained by plugging Ψ into Equation A. [sent-689, score-0.377]
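For reference, the standard definitions of the two operators that are consistent with the identity in Equation 25 are given below (reconstructed from context, not quoted from the paper):

```latex
% Boltzmann soft-max (M_eta) and log-partition (L_eta) operators; with these
% definitions, L_eta Psi(x) - M_eta Psi(x) = (1/eta) H_pi(x), matching Eq. 25.
(\mathrm{M}_{\eta}\Psi)(x) = \sum_{a\in\mathcal{A}}
   \frac{\exp(\eta\,\Psi(x,a))}{\sum_{b\in\mathcal{A}}\exp(\eta\,\Psi(x,b))}\,\Psi(x,a),
\qquad
(\mathrm{L}_{\eta}\Psi)(x) = \frac{1}{\eta}\log\!\sum_{a\in\mathcal{A}}\exp(\eta\,\Psi(x,a)).
```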
89 1 Proof of Theorem 2 In this subsection we establish a rate of convergence for the value function of the policy induced by DPP. [sent-696, score-0.424]
90 Here, Qπk denotes the action-values under the policy πk and πk is the policy induced by DPP at step k. [sent-698, score-0.78]
91 The sequence {Qk} is obtained by iterating the initial Q0 = Ψ0 through the following recursion: Qk = ((k − 1)/k) Tπk−1 Qk−1 + (1/k) Tπk−1 Q0, (27) where πk is the policy induced by the kth iterate of DPP. [sent-710, score-0.403]
92 This, combined with the assumption that the optimal policy is unique, completes the proof. [sent-731, score-0.404]
93 Proof of Theorem 5 This section provides a formal theoretical analysis of the performance of dynamic policy programming in the presence of approximation. [sent-733, score-0.503]
94 Our goal is to establish an ℓ∞-norm performance-loss bound for the policy induced by approximate DPP. [sent-738, score-0.45]
95 Here, Qπk denotes the action-value function of the policy πk and πk is the soft-max policy associated with Ψk. [sent-740, score-0.754]
96 Lemma 15 relates Ψk with Qk: Lemma 15 Let k be a positive integer and let πk denote the policy induced by approximate DPP at iteration k. [sent-750, score-0.466]
97 Based on Lemma 15, one can express the policy induced by DPP, πk, in terms of Qk and Q0: Lemma 16 For all (x, a) ∈ Z: πk(a|x) = exp(η(kQk(x, a) + Q0(x, a))) / ∑_{b∈A} exp(η(kQk(x, b) + Q0(x, b))). [sent-758, score-0.448]
98 1 Proof of Theorem 9 In this subsection, we provide the proof of Theorem 9 which guarantees that DPP-RL asymptotically converges to the optimal policy w. [sent-802, score-0.459]
99 Error propagation for approximate policy and value iteration. [sent-977, score-0.402]
100 Least-squares lambda policy iteration: Bias-variance trade-off in control problems. [sent-1152, score-0.404]
wordName wordTfidf (topN-words)
[('dpp', 0.594), ('policy', 0.377), ('qk', 0.324), ('sadpp', 0.211), ('kqk', 0.172), ('ql', 0.147), ('appen', 0.125), ('heshlaghi', 0.125), ('omez', 0.125), ('rogramming', 0.125), ('zar', 0.125), ('action', 0.11), ('szepesv', 0.108), ('olicy', 0.107), ('pv', 0.106), ('rfqi', 0.106), ('ek', 0.095), ('vmax', 0.084), ('dynamic', 0.08), ('absorbing', 0.073), ('xk', 0.072), ('reward', 0.067), ('tk', 0.063), ('farahmand', 0.062), ('reinforcement', 0.061), ('munos', 0.061), ('avi', 0.059), ('peters', 0.059), ('preferences', 0.056), ('tq', 0.056), ('mdp', 0.055), ('bellman', 0.052), ('lock', 0.051), ('exp', 0.045), ('rl', 0.044), ('vi', 0.043), ('mansour', 0.042), ('equation', 0.042), ('xl', 0.041), ('state', 0.04), ('azar', 0.04), ('pgac', 0.04), ('supremum', 0.039), ('lim', 0.038), ('iteration', 0.038), ('foreach', 0.036), ('ri', 0.035), ('api', 0.035), ('tsitsiklis', 0.035), ('operator', 0.034), ('sutton', 0.034), ('converges', 0.032), ('bertsekas', 0.031), ('xn', 0.031), ('transition', 0.03), ('grid', 0.03), ('deduce', 0.03), ('yk', 0.03), ('barto', 0.029), ('actions', 0.029), ('konda', 0.028), ('asymptotic', 0.028), ('pac', 0.027), ('optimal', 0.027), ('update', 0.027), ('control', 0.027), ('induced', 0.026), ('ktqk', 0.026), ('sql', 0.026), ('walls', 0.026), ('states', 0.026), ('accumulated', 0.026), ('lemma', 0.026), ('ghavamzadeh', 0.025), ('approximate', 0.025), ('singh', 0.025), ('log', 0.025), ('presence', 0.024), ('hp', 0.023), ('reachable', 0.023), ('asymptotically', 0.023), ('watkins', 0.023), ('tted', 0.022), ('programming', 0.022), ('samples', 0.022), ('recursion', 0.022), ('bound', 0.022), ('yn', 0.022), ('convergence', 0.021), ('pr', 0.021), ('sampling', 0.021), ('rely', 0.021), ('rule', 0.021), ('martingale', 0.021), ('parr', 0.02), ('actionvalue', 0.02), ('farias', 0.02), ('gheshlaghi', 0.02), ('lfa', 0.02), ('maei', 0.02)]
simIndex simValue paperId paperTitle
same-paper 1 1.0000006 34 jmlr-2012-Dynamic Policy Programming
Author: Mohammad Gheshlaghi Azar, Vicenç Gómez, Hilbert J. Kappen
Abstract: In this paper, we propose a novel policy iteration method, called dynamic policy programming (DPP), to estimate the optimal policy in the infinite-horizon Markov decision processes. DPP is an incremental algorithm that forces a gradual change in policy update. This allows us to prove finite-iteration and asymptotic ℓ∞ -norm performance-loss bounds in the presence of approximation/estimation error which depend on the average accumulated error as opposed to the standard bounds which are expressed in terms of the supremum of the errors. The dependency on the average error is important in problems with limited number of samples per iteration, for which the average of the errors can be significantly smaller in size than the supremum of the errors. Based on these theoretical results, we prove that a sampling-based variant of DPP (DPP-RL) asymptotically converges to the optimal policy. Finally, we illustrate numerically the applicability of these results on some benchmark problems and compare the performance of the approximate variants of DPP with some existing reinforcement learning (RL) methods. Keywords: approximate dynamic programming, reinforcement learning, Markov decision processes, Monte-Carlo methods, function approximation
2 0.16145511 51 jmlr-2012-Integrating a Partial Model into Model Free Reinforcement Learning
Author: Aviv Tamar, Dotan Di Castro, Ron Meir
Abstract: In reinforcement learning an agent uses online feedback from the environment in order to adaptively select an effective policy. Model free approaches address this task by directly mapping environmental states to actions, while model based methods attempt to construct a model of the environment, followed by a selection of optimal actions based on that model. Given the complementary advantages of both approaches, we suggest a novel procedure which augments a model free algorithm with a partial model. The resulting hybrid algorithm switches between a model based and a model free mode, depending on the current state and the agent’s knowledge. Our method relies on a novel definition for a partially known model, and an estimator that incorporates such knowledge in order to reduce uncertainty in stochastic approximation iterations. We prove that such an approach leads to improved policy evaluation whenever environmental knowledge is available, without compromising performance when such knowledge is absent. Numerical simulations demonstrate the effectiveness of the approach on policy gradient and Q-learning algorithms, and its usefulness in solving a call admission control problem. Keywords: reinforcement learning, temporal difference, stochastic approximation, markov decision processes, hybrid model based model free algorithms
3 0.14322579 46 jmlr-2012-Finite-Sample Analysis of Least-Squares Policy Iteration
Author: Alessandro Lazaric, Mohammad Ghavamzadeh, Rémi Munos
Abstract: In this paper, we report a performance bound for the widely used least-squares policy iteration (LSPI) algorithm. We first consider the problem of policy evaluation in reinforcement learning, that is, learning the value function of a fixed policy, using the least-squares temporal-difference (LSTD) learning method, and report finite-sample analysis for this algorithm. To do so, we first derive a bound on the performance of the LSTD solution evaluated at the states generated by the Markov chain and used by the algorithm to learn an estimate of the value function. This result is general in the sense that no assumption is made on the existence of a stationary distribution for the Markov chain. We then derive generalization bounds in the case when the Markov chain possesses a stationary distribution and is β-mixing. Finally, we analyze how the error at each policy evaluation step is propagated through the iterations of a policy iteration method, and derive a performance bound for the LSPI algorithm. Keywords: Markov decision processes, reinforcement learning, least-squares temporal-difference, least-squares policy iteration, generalization bounds, finite-sample analysis
4 0.096758373 58 jmlr-2012-Linear Fitted-Q Iteration with Multiple Reward Functions
Author: Daniel J. Lizotte, Michael Bowling, Susan A. Murphy
Abstract: We present a general and detailed development of an algorithm for finite-horizon fitted-Q iteration with an arbitrary number of reward signals and linear value function approximation using an arbitrary number of state features. This includes a detailed treatment of the 3-reward function case using triangulation primitives from computational geometry and a method for identifying globally dominated actions. We also present an example of how our methods can be used to construct a realworld decision aid by considering symptom reduction, weight gain, and quality of life in sequential treatments for schizophrenia. Finally, we discuss future directions in which to take this work that will further enable our methods to make a positive impact on the field of evidence-based clinical decision support. Keywords: reinforcement learning, dynamic programming, decision making, linear regression, preference elicitation
5 0.078056417 116 jmlr-2012-Transfer in Reinforcement Learning via Shared Features
Author: George Konidaris, Ilya Scheidwasser, Andrew Barto
Abstract: We present a framework for transfer in reinforcement learning based on the idea that related tasks share some common features, and that transfer can be achieved via those shared features. The framework attempts to capture the notion of tasks that are related but distinct, and provides some insight into when transfer can be usefully applied to a problem sequence and when it cannot. We apply the framework to the knowledge transfer problem, and show that an agent can learn a portable shaping function from experience in a sequence of tasks to significantly improve performance in a later related task, even given a very brief training period. We also apply the framework to skill transfer, to show that agents can learn portable skills across a sequence of tasks that significantly improve performance on later related tasks, approaching the performance of agents given perfectly learned problem-specific skills. Keywords: reinforcement learning, transfer, shaping, skills
6 0.068521358 86 jmlr-2012-Optimistic Bayesian Sampling in Contextual-Bandit Problems
7 0.066916704 41 jmlr-2012-Exploration in Relational Domains for Model-based Reinforcement Learning
8 0.047956806 67 jmlr-2012-Minimax-Optimal Rates For Sparse Additive Models Over Kernel Classes Via Convex Programming
9 0.043953538 50 jmlr-2012-Human Gesture Recognition on Product Manifolds
10 0.041117031 18 jmlr-2012-An Improved GLMNET for L1-regularized Logistic Regression
11 0.039999235 56 jmlr-2012-Learning Linear Cyclic Causal Models with Latent Variables
12 0.036054522 112 jmlr-2012-Structured Sparsity via Alternating Direction Methods
13 0.03561509 17 jmlr-2012-An Active Learning Algorithm for Ranking from Pairwise Preferences with an Almost Optimal Query Complexity
14 0.03560916 27 jmlr-2012-Conditional Likelihood Maximisation: A Unifying Framework for Information Theoretic Feature Selection
15 0.035232406 26 jmlr-2012-Coherence Functions with Applications in Large-Margin Classification Methods
16 0.034926396 93 jmlr-2012-Quantum Set Intersection and its Application to Associative Memory
17 0.030709282 99 jmlr-2012-Restricted Strong Convexity and Weighted Matrix Completion: Optimal Bounds with Noise
18 0.030474972 96 jmlr-2012-Refinement of Operator-valued Reproducing Kernels
19 0.029372791 59 jmlr-2012-Linear Regression With Random Projections
20 0.028759416 91 jmlr-2012-Plug-in Approach to Active Learning
topicId topicWeight
[(0, -0.154), (1, -0.01), (2, 0.025), (3, -0.179), (4, 0.057), (5, -0.23), (6, -0.231), (7, -0.172), (8, 0.028), (9, 0.164), (10, -0.204), (11, 0.134), (12, 0.063), (13, -0.121), (14, 0.053), (15, -0.062), (16, 0.034), (17, -0.008), (18, -0.091), (19, 0.132), (20, -0.0), (21, -0.028), (22, 0.102), (23, 0.064), (24, -0.067), (25, -0.086), (26, 0.083), (27, -0.18), (28, 0.075), (29, -0.028), (30, 0.024), (31, -0.063), (32, 0.024), (33, -0.072), (34, 0.094), (35, -0.08), (36, -0.059), (37, 0.087), (38, -0.038), (39, 0.012), (40, 0.003), (41, 0.025), (42, -0.11), (43, -0.073), (44, -0.002), (45, -0.129), (46, 0.112), (47, 0.047), (48, -0.039), (49, -0.075)]
simIndex simValue paperId paperTitle
same-paper 1 0.93951517 34 jmlr-2012-Dynamic Policy Programming
Author: Mohammad Gheshlaghi Azar, Vicenç Gómez, Hilbert J. Kappen
Abstract: In this paper, we propose a novel policy iteration method, called dynamic policy programming (DPP), to estimate the optimal policy in the infinite-horizon Markov decision processes. DPP is an incremental algorithm that forces a gradual change in policy update. This allows us to prove finite-iteration and asymptotic ℓ∞ -norm performance-loss bounds in the presence of approximation/estimation error which depend on the average accumulated error as opposed to the standard bounds which are expressed in terms of the supremum of the errors. The dependency on the average error is important in problems with limited number of samples per iteration, for which the average of the errors can be significantly smaller in size than the supremum of the errors. Based on these theoretical results, we prove that a sampling-based variant of DPP (DPP-RL) asymptotically converges to the optimal policy. Finally, we illustrate numerically the applicability of these results on some benchmark problems and compare the performance of the approximate variants of DPP with some existing reinforcement learning (RL) methods. Keywords: approximate dynamic programming, reinforcement learning, Markov decision processes, Monte-Carlo methods, function approximation
2 0.77811015 46 jmlr-2012-Finite-Sample Analysis of Least-Squares Policy Iteration
Author: Alessandro Lazaric, Mohammad Ghavamzadeh, Rémi Munos
Abstract: In this paper, we report a performance bound for the widely used least-squares policy iteration (LSPI) algorithm. We first consider the problem of policy evaluation in reinforcement learning, that is, learning the value function of a fixed policy, using the least-squares temporal-difference (LSTD) learning method, and report finite-sample analysis for this algorithm. To do so, we first derive a bound on the performance of the LSTD solution evaluated at the states generated by the Markov chain and used by the algorithm to learn an estimate of the value function. This result is general in the sense that no assumption is made on the existence of a stationary distribution for the Markov chain. We then derive generalization bounds in the case when the Markov chain possesses a stationary distribution and is β-mixing. Finally, we analyze how the error at each policy evaluation step is propagated through the iterations of a policy iteration method, and derive a performance bound for the LSPI algorithm. Keywords: Markov decision processes, reinforcement learning, least-squares temporal-difference, least-squares policy iteration, generalization bounds, finite-sample analysis
3 0.6025089 51 jmlr-2012-Integrating a Partial Model into Model Free Reinforcement Learning
Author: Aviv Tamar, Dotan Di Castro, Ron Meir
Abstract: In reinforcement learning an agent uses online feedback from the environment in order to adaptively select an effective policy. Model free approaches address this task by directly mapping environmental states to actions, while model based methods attempt to construct a model of the environment, followed by a selection of optimal actions based on that model. Given the complementary advantages of both approaches, we suggest a novel procedure which augments a model free algorithm with a partial model. The resulting hybrid algorithm switches between a model based and a model free mode, depending on the current state and the agent’s knowledge. Our method relies on a novel definition for a partially known model, and an estimator that incorporates such knowledge in order to reduce uncertainty in stochastic approximation iterations. We prove that such an approach leads to improved policy evaluation whenever environmental knowledge is available, without compromising performance when such knowledge is absent. Numerical simulations demonstrate the effectiveness of the approach on policy gradient and Q-learning algorithms, and its usefulness in solving a call admission control problem. Keywords: reinforcement learning, temporal difference, stochastic approximation, markov decision processes, hybrid model based model free algorithms
4 0.31539902 86 jmlr-2012-Optimistic Bayesian Sampling in Contextual-Bandit Problems
Author: Benedict C. May, Nathan Korda, Anthony Lee, David S. Leslie
Abstract: In sequential decision problems in an unknown environment, the decision maker often faces a dilemma over whether to explore to discover more about the environment, or to exploit current knowledge. We address the exploration-exploitation dilemma in a general setting encompassing both standard and contextualised bandit problems. The contextual bandit problem has recently resurfaced in attempts to maximise click-through rates in web based applications, a task with significant commercial interest. In this article we consider an approach of Thompson (1933) which makes use of samples from the posterior distributions for the instantaneous value of each action. We extend the approach by introducing a new algorithm, Optimistic Bayesian Sampling (OBS), in which the probability of playing an action increases with the uncertainty in the estimate of the action value. This results in better directed exploratory behaviour. We prove that, under unrestrictive assumptions, both approaches result in optimal behaviour with respect to the average reward criterion of Yang and Zhu (2002). We implement OBS and measure its performance in simulated Bernoulli bandit and linear regression domains, and also when tested with the task of personalised news article recommendation on a Yahoo! Front Page Today Module data set. We find that OBS performs competitively when compared to recently proposed benchmark algorithms and outperforms Thompson’s method throughout. Keywords: multi-armed bandits, contextual bandits, exploration-exploitation, sequential allocation, Thompson sampling
5 0.3147366 58 jmlr-2012-Linear Fitted-Q Iteration with Multiple Reward Functions
Author: Daniel J. Lizotte, Michael Bowling, Susan A. Murphy
Abstract: We present a general and detailed development of an algorithm for finite-horizon fitted-Q iteration with an arbitrary number of reward signals and linear value function approximation using an arbitrary number of state features. This includes a detailed treatment of the 3-reward function case using triangulation primitives from computational geometry and a method for identifying globally dominated actions. We also present an example of how our methods can be used to construct a realworld decision aid by considering symptom reduction, weight gain, and quality of life in sequential treatments for schizophrenia. Finally, we discuss future directions in which to take this work that will further enable our methods to make a positive impact on the field of evidence-based clinical decision support. Keywords: reinforcement learning, dynamic programming, decision making, linear regression, preference elicitation
6 0.30963066 112 jmlr-2012-Structured Sparsity via Alternating Direction Methods
7 0.28531289 116 jmlr-2012-Transfer in Reinforcement Learning via Shared Features
8 0.28372467 41 jmlr-2012-Exploration in Relational Domains for Model-based Reinforcement Learning
9 0.22856691 93 jmlr-2012-Quantum Set Intersection and its Application to Associative Memory
10 0.22076137 17 jmlr-2012-An Active Learning Algorithm for Ranking from Pairwise Preferences with an Almost Optimal Query Complexity
11 0.21067305 18 jmlr-2012-An Improved GLMNET for L1-regularized Logistic Regression
12 0.20737745 56 jmlr-2012-Learning Linear Cyclic Causal Models with Latent Variables
13 0.20200135 59 jmlr-2012-Linear Regression With Random Projections
15 0.18406878 96 jmlr-2012-Refinement of Operator-valued Reproducing Kernels
16 0.18104842 99 jmlr-2012-Restricted Strong Convexity and Weighted Matrix Completion: Optimal Bounds with Noise
17 0.17738348 111 jmlr-2012-Structured Sparsity and Generalization
18 0.17370069 27 jmlr-2012-Conditional Likelihood Maximisation: A Unifying Framework for Information Theoretic Feature Selection
19 0.16847868 26 jmlr-2012-Coherence Functions with Applications in Large-Margin Classification Methods
20 0.16050668 67 jmlr-2012-Minimax-Optimal Rates For Sparse Additive Models Over Kernel Classes Via Convex Programming
topicId topicWeight
[(7, 0.012), (16, 0.019), (21, 0.034), (26, 0.054), (29, 0.038), (35, 0.012), (49, 0.014), (56, 0.012), (57, 0.028), (75, 0.036), (79, 0.021), (92, 0.199), (96, 0.081), (97, 0.332)]
simIndex simValue paperId paperTitle
same-paper 1 0.75032151 34 jmlr-2012-Dynamic Policy Programming
Author: Mohammad Gheshlaghi Azar, Vicenç Gómez, Hilbert J. Kappen
Abstract: In this paper, we propose a novel policy iteration method, called dynamic policy programming (DPP), to estimate the optimal policy in the infinite-horizon Markov decision processes. DPP is an incremental algorithm that forces a gradual change in policy update. This allows us to prove finite-iteration and asymptotic ℓ∞ -norm performance-loss bounds in the presence of approximation/estimation error which depend on the average accumulated error as opposed to the standard bounds which are expressed in terms of the supremum of the errors. The dependency on the average error is important in problems with limited number of samples per iteration, for which the average of the errors can be significantly smaller in size than the supremum of the errors. Based on these theoretical results, we prove that a sampling-based variant of DPP (DPP-RL) asymptotically converges to the optimal policy. Finally, we illustrate numerically the applicability of these results on some benchmark problems and compare the performance of the approximate variants of DPP with some existing reinforcement learning (RL) methods. Keywords: approximate dynamic programming, reinforcement learning, Markov decision processes, Monte-Carlo methods, function approximation
2 0.56131059 46 jmlr-2012-Finite-Sample Analysis of Least-Squares Policy Iteration
Author: Alessandro Lazaric, Mohammad Ghavamzadeh, Rémi Munos
Abstract: In this paper, we report a performance bound for the widely used least-squares policy iteration (LSPI) algorithm. We first consider the problem of policy evaluation in reinforcement learning, that is, learning the value function of a fixed policy, using the least-squares temporal-difference (LSTD) learning method, and report finite-sample analysis for this algorithm. To do so, we first derive a bound on the performance of the LSTD solution evaluated at the states generated by the Markov chain and used by the algorithm to learn an estimate of the value function. This result is general in the sense that no assumption is made on the existence of a stationary distribution for the Markov chain. We then derive generalization bounds in the case when the Markov chain possesses a stationary distribution and is β-mixing. Finally, we analyze how the error at each policy evaluation step is propagated through the iterations of a policy iteration method, and derive a performance bound for the LSPI algorithm. Keywords: Markov decision processes, reinforcement learning, least-squares temporal-difference, least-squares policy iteration, generalization bounds, finite-sample analysis
3 0.55855107 67 jmlr-2012-Minimax-Optimal Rates For Sparse Additive Models Over Kernel Classes Via Convex Programming
Author: Garvesh Raskutti, Martin J. Wainwright, Bin Yu
Abstract: Sparse additive models are families of d-variate functions with the additive decomposition f ∗ = ∑ j∈S f j∗ , where S is an unknown subset of cardinality s ≪ d. In this paper, we consider the case where each univariate component function f j∗ lies in a reproducing kernel Hilbert space (RKHS), and analyze a method for estimating the unknown function f ∗ based on kernels combined with ℓ1 -type convex regularization. Working within a high-dimensional framework that allows both the dimension d and sparsity s to increase with n, we derive convergence rates in the L2 (P) and L2 (Pn ) norms over the class Fd,s,H of sparse additive models with each univariate function f j∗ in the unit ball of a univariate RKHS with bounded kernel function. We complement our upper bounds by deriving minimax lower bounds on the L2 (P) error, thereby showing the optimality of our method. Thus, we obtain optimal minimax rates for many interesting classes of sparse additive models, including polynomials, splines, and Sobolev classes. We also show that if, in contrast to our univariate conditions, the d-variate function class is assumed to be globally bounded, then much √ faster estimation rates are possible for any sparsity s = Ω( n), showing that global boundedness is a significant restriction in the high-dimensional setting. Keywords: sparsity, kernel, non-parametric, convex, minimax
Author: Alain Hauser, Peter Bühlmann
Abstract: The investigation of directed acyclic graphs (DAGs) encoding the same Markov property, that is the same conditional independence relations of multivariate observational distributions, has a long tradition; many algorithms exist for model selection and structure learning in Markov equivalence classes. In this paper, we extend the notion of Markov equivalence of DAGs to the case of interventional distributions arising from multiple intervention experiments. We show that under reasonable assumptions on the intervention experiments, interventional Markov equivalence defines a finer partitioning of DAGs than observational Markov equivalence and hence improves the identifiability of causal models. We give a graph theoretic criterion for two DAGs being Markov equivalent under interventions and show that each interventional Markov equivalence class can, analogously to the observational case, be uniquely represented by a chain graph called interventional essential graph (also known as CPDAG in the observational case). These are key insights for deriving a generalization of the Greedy Equivalence Search algorithm aimed at structure learning from interventional data. This new algorithm is evaluated in a simulation study. Keywords: causal inference, interventions, graphical model, Markov equivalence, greedy equivalence search
5 0.55613041 51 jmlr-2012-Integrating a Partial Model into Model Free Reinforcement Learning
Author: Aviv Tamar, Dotan Di Castro, Ron Meir
Abstract: In reinforcement learning an agent uses online feedback from the environment in order to adaptively select an effective policy. Model free approaches address this task by directly mapping environmental states to actions, while model based methods attempt to construct a model of the environment, followed by a selection of optimal actions based on that model. Given the complementary advantages of both approaches, we suggest a novel procedure which augments a model free algorithm with a partial model. The resulting hybrid algorithm switches between a model based and a model free mode, depending on the current state and the agent’s knowledge. Our method relies on a novel definition for a partially known model, and an estimator that incorporates such knowledge in order to reduce uncertainty in stochastic approximation iterations. We prove that such an approach leads to improved policy evaluation whenever environmental knowledge is available, without compromising performance when such knowledge is absent. Numerical simulations demonstrate the effectiveness of the approach on policy gradient and Q-learning algorithms, and its usefulness in solving a call admission control problem. Keywords: reinforcement learning, temporal difference, stochastic approximation, markov decision processes, hybrid model based model free algorithms
6 0.55224264 99 jmlr-2012-Restricted Strong Convexity and Weighted Matrix Completion: Optimal Bounds with Noise
7 0.55180633 59 jmlr-2012-Linear Regression With Random Projections
8 0.5314244 111 jmlr-2012-Structured Sparsity and Generalization
9 0.5244413 82 jmlr-2012-On the Necessity of Irrelevant Variables
10 0.52060145 73 jmlr-2012-Multi-task Regression using Minimal Penalties
11 0.51927966 68 jmlr-2012-Minimax Manifold Estimation
12 0.5163005 80 jmlr-2012-On Ranking and Generalization Bounds
13 0.51537389 8 jmlr-2012-A Primal-Dual Convergence Analysis of Boosting
14 0.51368439 29 jmlr-2012-Consistent Model Selection Criteria on High Dimensions
15 0.51208657 64 jmlr-2012-Manifold Identification in Dual Averaging for Regularized Stochastic Online Learning
16 0.51082408 7 jmlr-2012-A Multi-Stage Framework for Dantzig Selector and LASSO
17 0.5096311 71 jmlr-2012-Multi-Instance Learning with Any Hypothesis Class
18 0.50734073 13 jmlr-2012-Active Learning via Perfect Selective Classification
19 0.50697118 85 jmlr-2012-Optimal Distributed Online Prediction Using Mini-Batches
20 0.50672388 15 jmlr-2012-Algebraic Geometric Comparison of Probability Distributions