jmlr jmlr2012 jmlr2012-51 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Aviv Tamar, Dotan Di Castro, Ron Meir
Abstract: In reinforcement learning an agent uses online feedback from the environment in order to adaptively select an effective policy. Model free approaches address this task by directly mapping environmental states to actions, while model based methods attempt to construct a model of the environment, followed by a selection of optimal actions based on that model. Given the complementary advantages of both approaches, we suggest a novel procedure which augments a model free algorithm with a partial model. The resulting hybrid algorithm switches between a model based and a model free mode, depending on the current state and the agent’s knowledge. Our method relies on a novel definition for a partially known model, and an estimator that incorporates such knowledge in order to reduce uncertainty in stochastic approximation iterations. We prove that such an approach leads to improved policy evaluation whenever environmental knowledge is available, without compromising performance when such knowledge is absent. Numerical simulations demonstrate the effectiveness of the approach on policy gradient and Q-learning algorithms, and its usefulness in solving a call admission control problem. Keywords: reinforcement learning, temporal difference, stochastic approximation, markov decision processes, hybrid model based model free algorithms
Reference: text
sentIndex sentText sentNum sentScore
1 Given the complementary advantages of both approaches, we suggest a novel procedure which augments a model free algorithm with a partial model. [sent-12, score-0.215]
2 We prove that such an approach leads to improved policy evaluation whenever environmental knowledge is available, without compromising performance when such knowledge is absent. [sent-15, score-0.24]
3 Numerical simulations demonstrate the effectiveness of the approach on policy gradient and Q-learning algorithms, and its usefulness in solving a call admission control problem. [sent-16, score-0.277]
4 Keywords: reinforcement learning, temporal difference, stochastic approximation, markov decision processes, hybrid model based model free algorithms [sent-17, score-0.209]
5 Based on this model, a planning problem is solved where techniques from Dynamic Programming (Bertsekas, 2006) are applied in order to find the optimal policy function. [sent-23, score-0.179]
6 On the other hand, within the model free setting, the agent does not try to build a model of the MDP, but rather attempts to find the optimal policy by directly mapping environmental states to actions. [sent-24, score-0.381]
7 While it can be shown that both approaches, under mild conditions, asymptotically reach the same optimal policy on typical MDPs, it is known that each approach possesses distinct merits. [sent-27, score-0.179]
8 Model based methods often make better use of a limited amount of experience and thus achieve a better policy with fewer environmental interactions. [sent-28, score-0.24]
9 In this work we pursue such a hybrid approach applicable to cases where partial model information is available in a specific form which we term partially known MDP. [sent-34, score-0.216]
10 Our theoretical analysis focuses on a particular model free algorithm, the well-known TD(0) policy evaluation algorithm, and we prove that our hybrid method leads to improved performance, as long as sufficiently accurate partial knowledge is available. [sent-40, score-0.427]
11 In the first, we apply it to a policy gradient type algorithm, and investigate its performance in randomly generated MDPs. [sent-42, score-0.179]
12 As it turns out, our partially known MDP definition is a natural choice for describing plausible partial knowledge in such problems, and performance improvement is demonstrated for a Q-learning algorithm. [sent-44, score-0.218]
13 In this work we define a partially known MDP, and use this partial knowledge to improve the asymptotic performance of stochastic approximation type algorithms. [sent-56, score-0.282]
14 Thus, the partial model which we use only to reduce stochastic fluctuations may further be used to explore or exploit more efficiently. [sent-59, score-0.187]
15 In Section 4 we analyze the asymptotic fluctuations in a fixed step TD(0) algorithm with a partial model. [sent-64, score-0.208]
16 This model is then used to generate simulated trajectories which are fed to the same model free algorithm for further policy improvement. [sent-69, score-0.246]
17 (2006), a hybrid approach is proposed that combines policy search on an inaccurate model, with policy evaluations in the real environment. [sent-71, score-0.418]
18 In Section 5 we investigate the effects of inaccuracies in the partial model, and extend our results to inaccurate partially known MDPs. [sent-83, score-0.21]
19 Estimation of a Random Variable Mean with Partial Knowledge Our method of using partial knowledge in an SA algorithm is based on constructing a better estimator for the mean update at each step. [sent-87, score-0.204]
20 Our estimator can be any function of x, and of values and probability ratios in the partial knowledge set K. [sent-99, score-0.23]
21 (7) When no partial information is available, $\hat{\mu}$ seems like the most reasonable choice (actually, it can be shown that $\hat{\mu}$ is the only unbiased estimator in that case). [sent-121, score-0.23]
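To make the estimator comparison above concrete, here is a minimal sketch (not code from the paper) of the plain single-sample estimator and of an estimator that conditions on a partially known set K given only probability ratios; all names are illustrative.

```python
import random

def sample_estimator(f, x):
    """Plain estimator of E[f(X)] from a single observed sample x (the unbiased choice)."""
    return f(x)

def partial_knowledge_estimator(f, x, known_weights):
    """If the observed sample lands in the known set K, replace f(x) by the conditional
    mean of f over K, computed from probability ratios (weights need not be normalized);
    otherwise fall back to the plain sample estimator."""
    if x in known_weights:
        total = sum(known_weights.values())
        return sum(w * f(y) for y, w in known_weights.items()) / total
    return f(x)

# Tiny usage example: X uniform on {0, 1, 2, 3}, K = {0, 1} with known ratio 1:1.
f = lambda y: y ** 2
known = {0: 1.0, 1: 1.0}
x = random.choice([0, 1, 2, 3])
print(sample_estimator(f, x), partial_knowledge_estimator(f, x, known))
```

With exact ratios this conditioned estimator remains unbiased and its variance is never larger than that of the plain sample estimator, which is the property exploited throughout the paper.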
22 A Stochastic Approximation Algorithm with Partial Model Knowledge In this section we describe our method of endowing a model free RL algorithm with partial model knowledge. [sent-131, score-0.215]
23 Then, we consider a situation where partial knowledge of the environment model is available. [sent-133, score-0.196]
24 Each selected action u ∈ U at a state x ∈ X determines a stochastic transition to the next state y ∈ X with a probability Pu (y|x). [sent-150, score-0.18]
25 For each state x the agent receives a corresponding deterministic reward r(x), which is bounded by rmax , and depends only on the current state. [sent-151, score-0.22]
26 2 The agent maintains a policy function, $\mu_\theta(u|x)$, parametrized by a vector $\theta \in \mathbb{R}^L$, mapping a state x into a probability distribution over the actions U. [sent-152, score-0.311]
27 Under policy µθ , the environment and the agent induce a Markovian transition matrix, denoted by Pµθ , which we assume to be ergodic. [sent-153, score-0.349]
28 At time n, the current parameter value equals θn and the agent is in state xn . [sent-163, score-0.473]
29 It then chooses an action un according to µθn (u|xn ), observes xn+1 , and updates θn+1 according to some protocol. [sent-164, score-0.215]
30 The algorithms that we deal with in this paper are all cast in the following SA form: $\theta_{n+1} = \theta_n + \varepsilon_n F(\theta_n, x_n, u_n, x_{n+1})$, (8) where $\{\varepsilon_n\}$ are positive step sizes. [sent-170, score-0.553]
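As a rough illustration of iterations of the form (8), the sketch below runs a generic stochastic approximation loop; `policy`, `sample_next_state`, and `F` are caller-supplied placeholders (not names from the paper) standing for the agent's policy, the environment, and the algorithm-specific update function.

```python
import numpy as np

def run_sa(theta0, F, policy, sample_next_state, x0, step_sizes, n_steps):
    """Generic SA loop: theta_{n+1} = theta_n + eps_n * F(theta_n, x_n, u_n, x_{n+1})."""
    theta, x = np.asarray(theta0, dtype=float), x0
    for n in range(n_steps):
        u = policy(theta, x)              # draw u_n ~ mu_theta(. | x_n)
        x_next = sample_next_state(x, u)  # observe the stochastic transition
        theta = theta + step_sizes[n] * F(theta, x, u, x_next)
        x = x_next
    return theta
```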
31 2 Partial Model Based Algorithm A key observation obtained from examining Equations (8)-(9) is that $F(\theta_n, x_n, u_n, x_{n+1})$ in the SA algorithm is just the sample estimator (7) of $g(\theta_n, x_n, u_n)$, the mean update at each step. [sent-194, score-1.162]
32 The estimation variance in this case stems from the stochastic transition from xn to xn+1 . [sent-195, score-0.49]
33 In the following we assume that we have, prior to running the algorithm, some information about these transitions in the form of partial transition probability ratios. [sent-196, score-0.223]
34 Similarly to (9), iterate (12) can also be decomposed into a mean function $g^K(\theta^K_n, x_n, u_n)$ and a martingale difference noise $\delta M^K_n$: $\theta^K_{n+1} = \theta^K_n + \varepsilon_n\left(g^K(\theta^K_n, x_n, u_n) + \delta M^K_n\right)$, and by Lemma 1 we have $g^K(\theta, x, u) = g(\theta, x, u)$. [sent-205, score-1.146]
35 1 Definitions Throughout this section, we assume that the agent’s policy µ is deterministic and fixed, mapping a specific action to each state, denoted by u (x). [sent-234, score-0.208]
36 1 VALUE FUNCTION ESTIMATION Letting 0 < γ < 1 denote a discount factor, define the value function for state x under policy µ as the expected discounted return when starting from state x and executing policy µ: $V^\mu(x) \triangleq E\left[\sum_{t=0}^{\infty} \gamma^t r(x_t) \,\middle|\, x_0 = x\right]$. [sent-237, score-0.422]
37 Since in this section the policy µ is constant, from now on we omit the superscript µ in $V^\mu(x)$, and the subscript µ in $P_\mu$, $\pi_\mu$, and $\Pi_\mu$. [sent-238, score-0.179]
38 Since the policy is deterministic we drop the u subscript in the known set definition. [sent-258, score-0.179]
39 Using (12) and (13) we define IPM-TD(0): $\theta^K_{n+1} = \theta^K_n + \varepsilon\, d^K_n\, \phi(x_n)$, $d^K_n \triangleq r(x_n) + \gamma\left(1^K_{n+1} F^K_n + \bar{1}^K_{n+1}\, \phi(x_{n+1})^T \theta^K_n\right) - \phi(x_n)^T \theta^K_n$, (17) $F^K_n \triangleq \sum_{y\in K_{x_n}} P(y|x_n)\, \phi(y)^T \theta^K_n \,\Big/ \sum_{y\in K_{x_n}} P(y|x_n)$. [sent-259, score-0.89]
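The following sketch mirrors our reading of the IPM-TD(0) update (17): when the observed next state falls inside the known set, the sampled value $\phi(x_{n+1})^T\theta$ is replaced by its conditional mean over that set, computed from transition-probability ratios. The data structures (`known`, `P_ratio`) and all identifiers are assumptions made for illustration, not the authors' code.

```python
import numpy as np

def ipm_td0_step(theta, phi, x, x_next, reward, gamma, eps, known, P_ratio):
    """One IPM-TD(0) update for linear value-function approximation.
    known[x]      : known set K_x for state x (possibly empty)
    P_ratio[x, y] : any quantity proportional to P(y|x) for y in K_x
                    (only ratios inside the known set matter)."""
    K = known.get(x, ())
    if x_next in K:
        # Conditional mean of phi(y)^T theta over the known set, using ratios only.
        w = np.array([P_ratio[x, y] for y in K], dtype=float)
        v = np.array([phi(y) @ theta for y in K])
        next_value = w @ v / w.sum()
    else:
        next_value = phi(x_next) @ theta
    d = reward + gamma * next_value - phi(x) @ theta   # temporal difference d_n^K
    return theta + eps * d * phi(x)
```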
40 After establishing that the asymptotic trajectory (or, in other words the algorithmic ‘function’) of the algorithm remains intact, we shall now investigate whether adding the partial knowledge can be guaranteed to improve performance. [sent-261, score-0.24]
41 The remainder of this section is devoted to showing that integrating a partial model reduces the asymptotic MSE, namely $\lim_{n\to\infty} E\|\theta^K_n - \theta^*\|^2 < \lim_{n\to\infty} E\|\theta_n - \theta^*\|^2$, whenever the known set K is not null. [sent-267, score-0.306]
42 By Lemma 2, at each iteration step we are guaranteed (as long as our partial model is not null) a reduction in the noise variance. [sent-268, score-0.181]
43 For the IPM iteration (17) we have $\Sigma^K_0$ and $\Sigma^K_1$, where $d^K_n$ replaces $d_n$ in (20). [sent-278, score-0.345]
44 Proof For a table based case, θ∗ satisfies Bellman's equation for a fixed policy (Bertsekas and Tsitsiklis, 1996): $\theta^*(x) = r(x) + \gamma E\left[\theta^*(x') \mid x\right]$. (24) [sent-304, score-0.179]
45 Now, for every j we have $E\left[(d_n \phi(x_n))(d_{n+j} \phi(x_{n+j}))^T \,\middle|\, \theta_n = \theta_{n+j} = \theta^*\right] = E\left[(r(x_n) + \gamma\theta^*(x_{n+1}) - \theta^*(x_n))(r(x_{n+j}) + \gamma\theta^*(x_{n+j+1}) - \theta^*(x_{n+j}))\right] = E\left[E\left[(r(x_n) + \gamma\theta^*(x_{n+1}) - \theta^*(x_n))(r(x_{n+j}) + \gamma\theta^*(x_{n+j+1}) - \theta^*(x_{n+j})) \,\middle|\, x_n, \dots\right]\right]$ [sent-305, score-0.367]
46 Assuming that there is at least one state x ∈ X such that $P(K_x)\,\mathrm{Var}_K[\theta^*(x') \mid x] > 0$, the asymptotic MSE of the iterates satisfies $\lim_{n\to\infty} E\|\theta^K_n - \theta^*\|^2 = \lim_{n\to\infty} E\|\theta_n - \theta^*\|^2 - \delta_{\mathrm{MSE}}$, where $\delta_{\mathrm{MSE}}$ is given in (25) and $\delta_{\mathrm{MSE}} > 0$. [sent-320, score-0.19]
47 For a GARNET(10, 5, 10, 1) MDP, a random deterministic policy was chosen and its value function was evaluated using algorithm (17). [sent-343, score-0.179]
48 In Figure 1 (middle), the step size for an iteration with partial knowledge was set such that the asymptotic MSE would match that of the iteration without partial knowledge. [sent-346, score-0.422]
49 [Figure residue: legend labels pk = 0 and pk = 0.x] [sent-360, score-0.378]
50 This is no longer valid when the partial model is not accurate, as the inaccuracy induces a bias in $\hat{\mu}^K$. [sent-395, score-0.214]
51 We shall see that if the inaccuracy in the partial model is small enough, then this property can still be guaranteed. [sent-398, score-0.186]
52 2 Error Bound for IPM-TD(0) We now derive asymptotic error bounds for IPM-TD(0) with a constant step size ε, when the partial model is inaccurate. [sent-421, score-0.208]
53 In this problem, it is shown that values of the partially known MDP (11) capture meaningful physical quantities of the problem, thus, (11) may be seen as the natural representation for partial knowledge in such problems. [sent-489, score-0.183]
54 On the other hand, when the link is almost full, a clever policy might decide to save the available bandwidth for the more profitable requests, at the expense of rejecting the less profitable ones. [sent-495, score-0.217]
55 Thus, it is clear that a good policy should take into account both the bandwidth demand and profit of each request type, and its arrival frequency. [sent-496, score-0.27]
56 11 One approach to designing an admission policy is to formulate the problem as an MDP, for which an optimal policy is well defined, and solve it using RL approaches, as has been done by Marbach et al. [sent-502, score-0.456]
57 In the following we present this approach, and show that in this problem our partially known MDP definition emerges as a very natural representation for partial model knowledge. [sent-504, score-0.183]
58 The goal is to find the optimal policy with respect to the average reward η = E[r(x)]. [sent-530, score-0.264]
59 We note that a learning policy may be required even when the model is fully known, as finding the optimal policy is often an intractable problem. [sent-532, score-0.358]
60 4 PARTIALLY KNOWN MDP For this problem, a natural definition for partial model knowledge is through the arrival and departure rates α, β, namely $M_K \triangleq \{m : m \in 1, \dots$ [sent-544, score-0.201]
61 Nevertheless, the key point here is that in the ratios between transition probabilities the z terms cancel out; therefore the partial MDP definition (11) can be satisfied. [sent-552, score-0.222]
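A hypothetical helper showing the point just made: from the known arrival and departure rates alone one can write down any ratio of transition probabilities out of a state, because the unknown normalization z cancels. The event encoding and function names are ours, chosen only to convey the idea.

```python
def event_rate(x, event, alpha, beta):
    """Unnormalized rate of a transition event from state x, where x[m] is the number
    of active calls of type m; the true probability is this rate divided by an unknown
    constant z. 'event' is ('arrival', m) or ('departure', m)."""
    kind, m = event
    if kind == 'arrival':
        return alpha[m]
    if kind == 'departure':
        return x[m] * beta[m]
    raise ValueError(f"unknown event kind: {kind}")

def transition_ratio(x, event_a, event_b, alpha, beta):
    """Ratio P(event_a | x) / P(event_b | x); the z terms cancel out."""
    return event_rate(x, event_a, alpha, beta) / event_rate(x, event_b, alpha, beta)
```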
62 12 For each state-action pair, a Q value is maintained, and updated according to Qn+1 (xn , un ) = Qn (xn , un ) + εn r (xn , un ) + maxQn xn+1 , u′ − Qn (xn , un ) − ′ u 1 Qn (x, u) . [sent-557, score-0.744]
63 IPM Q-Learning was run with initial values $Q_0(x,u) = r(x,u)$ and a step size $\varepsilon_n = \gamma_0 / (\gamma_1 + v_n(x_n, u_n))$, where $v_n(x,u)$ denotes the number of visits to the state-action pair (x, u) up to time n. [sent-585, score-0.247]
64 The action selection policy while learning was ε-greedy, with ε = 0. [sent-587, score-0.208]
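A sketch of the tabular learning loop described in the last few sentences (visit-count step sizes and ε-greedy exploration), built around an average-reward Q-learning update of the kind quoted above; the exact form of the average-reward correction term, and every identifier, are our own guesses rather than the paper's code. The IPM variant would additionally replace the sampled max-term by its conditional mean over the known set whenever the observed next state falls inside it.

```python
import random
from collections import defaultdict

def epsilon_greedy(Q, x, actions, epsilon):
    """Greedy action with probability 1 - epsilon, uniform random action otherwise."""
    if random.random() < epsilon:
        return random.choice(actions)
    return max(actions, key=lambda u: Q[(x, u)])

def q_learning_step(Q, visits, x, u, reward, x_next, actions, gamma0, gamma1):
    """One tabular update with step size eps_n = gamma0 / (gamma1 + v_n(x, u))."""
    visits[(x, u)] += 1
    eps = gamma0 / (gamma1 + visits[(x, u)])
    # Crude average-reward correction: mean of the Q values stored so far.
    avg_q = sum(Q.values()) / max(len(Q), 1)
    target = reward + max(Q[(x_next, up)] for up in actions) - avg_q
    Q[(x, u)] += eps * (target - Q[(x, u)])

# Typical containers: Q = defaultdict(float), visits = defaultdict(int).
```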
65 The partial model for each experiment is represented by a single parameter k, such that the arrival and departure rates of all calls of type m ≤ k are known. [sent-589, score-0.201]
66 In the experiments, the agent maintains a stochastic policy function parametrized by $\theta \in \mathbb{R}^{L\cdot|U|}$, and given by $\mu_\theta(u|x) = e^{\theta^T \xi(x,u)} / \sum_{u'} e^{\theta^T \xi(x,u')}$, where the state-action feature vectors $\xi(x,u) \in \{0,1\}^{L\cdot|U|}$ are constructed from the state features $\phi(x)$ defined in Section 4. [sent-595, score-0.324]
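A small sketch of this Boltzmann (softmax) parametrization, with the state-action feature map ξ supplied by the caller; identifiers are illustrative only.

```python
import numpy as np

def softmax_policy(theta, x, actions, xi):
    """mu_theta(u|x) proportional to exp(theta^T xi(x, u))."""
    logits = np.array([theta @ xi(x, u) for u in actions])
    logits -= logits.max()                 # for numerical stability
    probs = np.exp(logits)
    return probs / probs.sum()

def sample_action(theta, x, actions, xi, rng=None):
    rng = rng or np.random.default_rng()
    idx = rng.choice(len(actions), p=softmax_policy(theta, x, actions, xi))
    return actions[idx]
```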
67 Average reward of the greedy policy is plotted vs. [sent-622, score-0.264]
68 Denote by $1^K$ an indicator function that equals 1 if $x_n$ belongs to $K_{x_{n-1},u_{n-1}}$ and 0 otherwise. [sent-629, score-0.367]
69 These results indicate that the variance reduction in each iteration (guaranteed by Lemma 2) resulted, on average, in a better estimation of the gradient ∇θ η, and therefore a better policy at each step. [sent-635, score-0.248]
70 In this work we have presented a general method of integrating partial environmental knowledge into a large class of model free algorithms. [sent-659, score-0.276]
71 In transfer learning or tutor learning settings, the partial model can come from an expert who has exact knowledge of a model that is partially similar. [sent-667, score-0.183]
72 An interesting possibility is to simultaneously gather information while adapting the policy using some model free algorithm. [sent-669, score-0.246]
73 Can we use this trajectory to construct an estimated partial MDP model, use it as in algorithm (12), and guarantee an improvement in the algorithm’s performance? [sent-671, score-0.215]
74 One may hope that by the time of the n-th update of θ we could use the n − 1 values of $x_i$ already observed to build a partial model for $x_n$ and, similarly to (12), use it to manipulate (49) in such a way that guarantees a performance improvement (in the estimation of m). [sent-679, score-0.574]
75 Finally, we note that the IPM method adds to the algorithm a computational cost of $O(K_{\max})$ evaluations of $F(\theta_n, x_n, u_n, x_{n+1})$ at each iteration. [sent-683, score-0.553]
76 However, if the computation of F (θn , xn , un , xn+1 ) is demanding, one may face a tradeoff between the performance of the resulting policy and the computational cost of obtaining it. [sent-685, score-0.732]
77 Proof of Lemma 5 Proof By the ergodicity of the Markov chain the joint probability for subsequent states is $\lim_{n\to\infty} P(x_n, x_{n+1}) = P(x_{n+1} \mid x_n)\, [\pi]_{x_n}$. [sent-699, score-0.416]
78 This will be done by introducing a ‘correction’ term Zn θn+1 = θn + εF (θn , xn , un , xn+1 ) + εZn , (51) where εZn is the vector of shortest Euclidean length needed to take θn + εF (θn , xn , un , xn+1 ) back to the constraint set H if it is not in H. [sent-741, score-1.106]
79 $F(\theta_n, x_n, u_n, x_{n+1})\, I_{\{|\theta_n - \theta^*| \le \rho\}}$ is uniformly integrable for small ρ > 0. [sent-795, score-0.553]
80 Proof $F(\theta_n, x_n, u_n, x_{n+1})$ is uniformly integrable since on every sample path $\theta_n$ is bounded (by the constraint), $r(x_n)$ is bounded by $r_{\max}$, and $\phi(x_n)$ is also bounded by definition. [sent-796, score-0.582]
81 Since this is true for every sample path, $F(\theta_n, x_n, u_n, x_{n+1})\, I_{\{|\theta_n - \theta^*| \le \rho\}}$ is uniformly integrable for all ρ. [sent-797, score-0.553]
82 4 For each K > 0, $\sup_n E\left[|F(\theta_n, x_n, u_n, x_{n+1})|^2 I_{\{|\theta_n - \theta^*| \le K\}}\right] \le K_1 E\left[V(\theta_n) + 1\right]$, where $K_1$ does not depend on K. [sent-832, score-0.553]
83 Proof Satisfying this requirement is immediate, since $F(\theta_n, x_n, u_n, x_{n+1})$ is bounded on every sample path. [sent-833, score-0.553]
84 This gives $\sum_{i=n}^{\infty} (1-\varepsilon)^{i-n} E_n\left[g(\theta, x_i, u_i) - \bar{g}(\theta)\right] \le \sum_{i=n}^{\infty} \left|E_n\left[g(\theta, x_i, u_i) - \bar{g}(\theta)\right]\right| \le \sum_{i=n}^{\infty} c\rho^{i-n}|\theta| = \frac{c|\theta|}{1-\rho}$, and $E|\Gamma_n(\theta_n)|^2 \le \varepsilon^2 E\left[\frac{c|\theta_n|}{1-\rho}\right]^2 \le \frac{\varepsilon^2 c^2}{(1-\rho)^2}\,|\theta^*|^2 = O(\varepsilon^2)$. [sent-839, score-0.19]
85 Proof We have $\Gamma_{n+1}(\theta_{n+1}) - \Gamma_{n+1}(\theta_n) = \varepsilon \sum_{i=n+1}^{\infty} (1-\varepsilon)^{i-n-1} E_{n+1}\left[g(\theta_{n+1}, x_i, u_i) - \bar{g}(\theta_{n+1})\right] - \varepsilon \sum_{i=n+1}^{\infty} (1-\varepsilon)^{i-n-1} E_{n+1}\left[g(\theta_n, x_i, u_i) - \bar{g}(\theta_n)\right] = \varepsilon \sum_{i=n+1}^{\infty} (1-\varepsilon)^{i-n-1} E_{n+1}\left[g(\theta_{n+1}, x_i, u_i) - g(\theta_n, x_i, u_i) - \left(\bar{g}(\theta_{n+1}) - \bar{g}(\theta_n)\right)\right]$. [sent-842, score-0.38]
86 We therefore have: $\left|E_{n+1}\left[g(\theta_{n+1}, x_i, u_i) - g(\theta_n, x_i, u_i)\right]\right| \le E_{n+1}\left[\left|g(\theta_{n+1}, x_i, u_i) - g(\theta_n, x_i, u_i)\right|\right] < k\varepsilon$, and similarly $\left|E_{n+1}\left[\bar{g}(\theta_{n+1}) - \bar{g}(\theta_n)\right]\right| < k\varepsilon$. [sent-844, score-0.38]
87 $\delta M_n - \delta M_n(\theta^*) = a(x_n)^T (\theta_n - \theta^*)\, b(x_n)$, where a and b are vector-valued functions of $x_n$. [sent-853, score-0.367]
88 1 The following equations hold: $\lim_{N\to\infty} \sup_n E\left|\sum_{j=n+N}^{\infty} E\left[F(\theta^*, x_j, u_j, x_{j+1}) \mid x_n, u_n\right]\right| = 0$ and $\lim_{N\to\infty} \sup_n E\left|\sum_{i=n+N}^{\infty} E\left[F(\theta^*, x_n, u_n, x_{n+1}) F(\theta^*, x_i, u_i, x_{i+1})^T \mid x_n, u_n\right]\right| = 0$. [sent-869, score-1.852]
89 The same goes for the covariance, since there exists some ρ′ < 1 and some matrix c′ such that $\left|E\left[F(\theta^*, x_n, u_n, x_{n+1}) F(\theta^*, x_i, u_i, x_{i+1})^T \mid x_n, u_n\right]\right| < c' \rho'^{\,i-n}$. [sent-871, score-1.201]
90 2 The sets $\left\{|F(\theta^*, x_n, u_n, x_{n+1})|^2\right\}$ and $\left\{\left|\sum_{j=i}^{\infty} E\left[F(\theta^*, x_j, u_j, x_{j+1}) \mid x_i, u_i\right]\right|^2\right\}$ are uniformly integrable. [sent-873, score-0.648]
91 Proof As was shown before, $F(\theta^*, x_n, u_n, x_{n+1})$ is bounded, and therefore $|F(\theta^*, x_n, u_n, x_{n+1})|^2$ is uniformly integrable. [sent-874, score-1.106]
92 1, for every i, $\left|\sum_{j=i}^{\infty} E\left[F(\theta^*, x_j, u_j, x_{j+1}) \mid x_i, u_i\right]\right| \le \frac{c}{1-\rho}$, which is bounded, and therefore $\left|\sum_{j=i}^{\infty} E\left[F(\theta^*, x_j, u_j, x_{j+1}) \mid x_i, u_i\right]\right|^2$ is uniformly integrable. [sent-876, score-0.19]
93 3 There is a matrix $\Sigma_0$ such that $\frac{1}{m} \sum_{j=n}^{n+m-1} E\left[F(\theta^*, x_j, u_j, x_{j+1}) F(\theta^*, x_j, u_j, x_{j+1})^T \mid x_n, u_n\right] - \Sigma_0 \to 0$ in probability as $n, m \to \infty$. [sent-878, score-0.553]
94 Proof Since the Markov chain is ergodic, by the law of large numbers this is satisfied by defining $\Sigma_0 = \lim_{n\to\infty} E\left[F(\theta^*, x_n, u_n, x_{n+1}) F(\theta^*, x_n, u_n, x_{n+1})^T\right]$. [sent-879, score-1.155]
95 4 There is a matrix $\Sigma_1$ such that $\frac{1}{m} \sum_{j=n}^{n+m-1} \sum_{k=j+1}^{\infty} E\left[F(\theta^*, x_j, u_j, x_{j+1}) F(\theta^*, x_k, u_k, x_{k+1})^T \mid x_n, u_n\right] - \Sigma_1 \to 0$ in probability as $n, m \to \infty$. [sent-881, score-0.553]
96 Proof Since the Markov chain is ergodic, by the law of large numbers this is satisfied by defining $\Sigma_1 = \lim_{n\to\infty} \sum_{j=1}^{\infty} E\left[F(\theta^*, x_n, u_n, x_{n+1}) F(\theta^*, x_{n+j}, u_{n+j}, x_{n+j+1})^T\right]$. [sent-882, score-0.602]
97 The set $\{\nabla_\theta g(\theta^*, x_n, u_n)\}$ is uniformly integrable. [sent-888, score-0.553]
98 Proof As was shown above, $\nabla_\theta g(\theta^*, x_n, u_n)$ is clearly bounded, and therefore uniformly integrable. [sent-889, score-0.553]
99 There is a Hurwitz matrix A such that $\frac{1}{m} \sum_{j=n}^{n+m-1} E\left[\nabla_\theta g^T(\theta^*, x_j, u_j) \mid x_n, u_n\right] - A \to 0$ in probability as $\varepsilon \to 0$ and $n, m \to \infty$. [sent-891, score-0.553]
100 Then, by the law of large numbers, we have $\frac{1}{m} \sum_{i=n}^{n+m-1} E\left[\nabla_\theta g^T(\theta^*, x_i, u_i) \mid x_n, u_n\right] - A \to 0$. [sent-894, score-0.648]
wordName wordTfidf (topN-words)
[('xn', 0.367), ('mdp', 0.216), ('mse', 0.205), ('rl', 0.201), ('tamar', 0.195), ('pk', 0.189), ('odel', 0.187), ('un', 0.186), ('policy', 0.179), ('astro', 0.177), ('eir', 0.177), ('ntegrating', 0.169), ('ipm', 0.164), ('kushner', 0.16), ('dn', 0.156), ('ek', 0.15), ('partial', 0.148), ('ode', 0.144), ('sa', 0.139), ('vark', 0.137), ('kxn', 0.133), ('ree', 0.13), ('kx', 0.122), ('td', 0.12), ('garnet', 0.106), ('pun', 0.106), ('admission', 0.098), ('var', 0.093), ('yin', 0.089), ('reward', 0.085), ('xx', 0.083), ('kmax', 0.082), ('agent', 0.074), ('ui', 0.071), ('free', 0.067), ('environmental', 0.061), ('asymptotic', 0.06), ('tsitsiklis', 0.058), ('estimator', 0.056), ('bertsekas', 0.055), ('arrival', 0.053), ('fa', 0.053), ('mn', 0.049), ('lim', 0.049), ('transition', 0.048), ('environment', 0.048), ('fnk', 0.046), ('en', 0.045), ('pu', 0.044), ('reinforcement', 0.041), ('iterate', 0.04), ('stochastic', 0.039), ('bandwidth', 0.038), ('inaccuracy', 0.038), ('variance', 0.036), ('borkar', 0.035), ('maxqn', 0.035), ('partially', 0.035), ('improvement', 0.035), ('marbach', 0.034), ('lyapunov', 0.034), ('zn', 0.034), ('iteration', 0.033), ('dl', 0.033), ('barto', 0.033), ('hybrid', 0.033), ('state', 0.032), ('trajectory', 0.032), ('uctuations', 0.031), ('customers', 0.03), ('dayan', 0.03), ('covk', 0.03), ('ergodic', 0.03), ('ml', 0.03), ('theorem', 0.03), ('rmax', 0.029), ('proof', 0.029), ('action', 0.029), ('temporal', 0.029), ('bias', 0.028), ('sutton', 0.028), ('markovian', 0.027), ('pkx', 0.027), ('schervish', 0.027), ('supe', 0.027), ('gk', 0.027), ('inaccurate', 0.027), ('transitions', 0.027), ('lemma', 0.026), ('service', 0.026), ('weakly', 0.026), ('unbiased', 0.026), ('converges', 0.026), ('ratios', 0.026), ('actions', 0.026), ('wiener', 0.025), ('zt', 0.025), ('xi', 0.024), ('ds', 0.024), ('horn', 0.024)]
simIndex simValue paperId paperTitle
same-paper 1 0.99999946 51 jmlr-2012-Integrating a Partial Model into Model Free Reinforcement Learning
Author: Aviv Tamar, Dotan Di Castro, Ron Meir
Abstract: In reinforcement learning an agent uses online feedback from the environment in order to adaptively select an effective policy. Model free approaches address this task by directly mapping environmental states to actions, while model based methods attempt to construct a model of the environment, followed by a selection of optimal actions based on that model. Given the complementary advantages of both approaches, we suggest a novel procedure which augments a model free algorithm with a partial model. The resulting hybrid algorithm switches between a model based and a model free mode, depending on the current state and the agent’s knowledge. Our method relies on a novel definition for a partially known model, and an estimator that incorporates such knowledge in order to reduce uncertainty in stochastic approximation iterations. We prove that such an approach leads to improved policy evaluation whenever environmental knowledge is available, without compromising performance when such knowledge is absent. Numerical simulations demonstrate the effectiveness of the approach on policy gradient and Q-learning algorithms, and its usefulness in solving a call admission control problem. Keywords: reinforcement learning, temporal difference, stochastic approximation, markov decision processes, hybrid model based model free algorithms
2 0.16145511 34 jmlr-2012-Dynamic Policy Programming
Author: Mohammad Gheshlaghi Azar, Vicenç Gómez, Hilbert J. Kappen
Abstract: In this paper, we propose a novel policy iteration method, called dynamic policy programming (DPP), to estimate the optimal policy in the infinite-horizon Markov decision processes. DPP is an incremental algorithm that forces a gradual change in policy update. This allows us to prove finite-iteration and asymptotic ℓ∞ -norm performance-loss bounds in the presence of approximation/estimation error which depend on the average accumulated error as opposed to the standard bounds which are expressed in terms of the supremum of the errors. The dependency on the average error is important in problems with limited number of samples per iteration, for which the average of the errors can be significantly smaller in size than the supremum of the errors. Based on these theoretical results, we prove that a sampling-based variant of DPP (DPP-RL) asymptotically converges to the optimal policy. Finally, we illustrate numerically the applicability of these results on some benchmark problems and compare the performance of the approximate variants of DPP with some existing reinforcement learning (RL) methods. Keywords: approximate dynamic programming, reinforcement learning, Markov decision processes, Monte-Carlo methods, function approximation
3 0.13441356 99 jmlr-2012-Restricted Strong Convexity and Weighted Matrix Completion: Optimal Bounds with Noise
Author: Sahand Negahban, Martin J. Wainwright
Abstract: We consider the matrix completion problem under a form of row/column weighted entrywise sampling, including the case of uniform entrywise sampling as a special case. We analyze the associated random observation operator, and prove that with high probability, it satisfies a form of restricted strong convexity with respect to weighted Frobenius norm. Using this property, we obtain as corollaries a number of error bounds on matrix completion in the weighted Frobenius norm under noisy sampling and for both exact and near low-rank matrices. Our results are based on measures of the “spikiness” and “low-rankness” of matrices that are less restrictive than the incoherence conditions imposed in previous work. Our technique involves an M-estimator that includes controls on both the rank and spikiness of the solution, and we establish non-asymptotic error bounds in weighted Frobenius norm for recovering matrices lying with ℓq -“balls” of bounded spikiness. Using information-theoretic methods, we show that no algorithm can achieve better estimates (up to a logarithmic factor) over these same sets, showing that our conditions on matrices and associated rates are essentially optimal. Keywords: matrix completion, collaborative filtering, convex optimization
4 0.10378312 46 jmlr-2012-Finite-Sample Analysis of Least-Squares Policy Iteration
Author: Alessandro Lazaric, Mohammad Ghavamzadeh, Rémi Munos
Abstract: In this paper, we report a performance bound for the widely used least-squares policy iteration (LSPI) algorithm. We first consider the problem of policy evaluation in reinforcement learning, that is, learning the value function of a fixed policy, using the least-squares temporal-difference (LSTD) learning method, and report finite-sample analysis for this algorithm. To do so, we first derive a bound on the performance of the LSTD solution evaluated at the states generated by the Markov chain and used by the algorithm to learn an estimate of the value function. This result is general in the sense that no assumption is made on the existence of a stationary distribution for the Markov chain. We then derive generalization bounds in the case when the Markov chain possesses a stationary distribution and is β-mixing. Finally, we analyze how the error at each policy evaluation step is propagated through the iterations of a policy iteration method, and derive a performance bound for the LSPI algorithm. Keywords: Markov decision processes, reinforcement learning, least-squares temporal-difference, least-squares policy iteration, generalization bounds, finite-sample analysis
5 0.095492341 116 jmlr-2012-Transfer in Reinforcement Learning via Shared Features
Author: George Konidaris, Ilya Scheidwasser, Andrew Barto
Abstract: We present a framework for transfer in reinforcement learning based on the idea that related tasks share some common features, and that transfer can be achieved via those shared features. The framework attempts to capture the notion of tasks that are related but distinct, and provides some insight into when transfer can be usefully applied to a problem sequence and when it cannot. We apply the framework to the knowledge transfer problem, and show that an agent can learn a portable shaping function from experience in a sequence of tasks to significantly improve performance in a later related task, even given a very brief training period. We also apply the framework to skill transfer, to show that agents can learn portable skills across a sequence of tasks that significantly improve performance on later related tasks, approaching the performance of agents given perfectly learned problem-specific skills. Keywords: reinforcement learning, transfer, shaping, skills
6 0.09007699 20 jmlr-2012-Analysis of a Random Forests Model
7 0.08728338 48 jmlr-2012-High-Dimensional Gaussian Graphical Model Selection: Walk Summability and Local Separation Criterion
8 0.073245361 76 jmlr-2012-Noise-Contrastive Estimation of Unnormalized Statistical Models, with Applications to Natural Image Statistics
9 0.070890814 29 jmlr-2012-Consistent Model Selection Criteria on High Dimensions
10 0.067721561 41 jmlr-2012-Exploration in Relational Domains for Model-based Reinforcement Learning
11 0.066405654 58 jmlr-2012-Linear Fitted-Q Iteration with Multiple Reward Functions
12 0.064234689 59 jmlr-2012-Linear Regression With Random Projections
13 0.061857205 1 jmlr-2012-A Case Study on Meta-Generalising: A Gaussian Processes Approach
14 0.059589159 86 jmlr-2012-Optimistic Bayesian Sampling in Contextual-Bandit Problems
15 0.055712979 57 jmlr-2012-Learning Symbolic Representations of Hybrid Dynamical Systems
16 0.055263147 54 jmlr-2012-Large-scale Linear Support Vector Regression
17 0.055168513 2 jmlr-2012-A Comparison of the Lasso and Marginal Regression
18 0.047436085 56 jmlr-2012-Learning Linear Cyclic Causal Models with Latent Variables
19 0.046707124 80 jmlr-2012-On Ranking and Generalization Bounds
20 0.043631863 73 jmlr-2012-Multi-task Regression using Minimal Penalties
topicId topicWeight
[(0, -0.219), (1, 0.063), (2, -0.019), (3, -0.218), (4, 0.071), (5, -0.218), (6, -0.247), (7, -0.152), (8, -0.014), (9, 0.076), (10, -0.169), (11, 0.237), (12, 0.095), (13, -0.036), (14, -0.033), (15, 0.024), (16, 0.052), (17, -0.08), (18, -0.053), (19, 0.094), (20, -0.127), (21, -0.091), (22, 0.046), (23, 0.018), (24, -0.09), (25, -0.046), (26, -0.036), (27, -0.026), (28, 0.015), (29, 0.095), (30, 0.005), (31, -0.034), (32, -0.072), (33, -0.01), (34, 0.008), (35, 0.087), (36, -0.089), (37, 0.006), (38, -0.082), (39, 0.069), (40, 0.029), (41, -0.06), (42, 0.087), (43, -0.04), (44, -0.084), (45, 0.07), (46, -0.069), (47, -0.043), (48, 0.042), (49, -0.005)]
simIndex simValue paperId paperTitle
same-paper 1 0.96055311 51 jmlr-2012-Integrating a Partial Model into Model Free Reinforcement Learning
Author: Aviv Tamar, Dotan Di Castro, Ron Meir
Abstract: In reinforcement learning an agent uses online feedback from the environment in order to adaptively select an effective policy. Model free approaches address this task by directly mapping environmental states to actions, while model based methods attempt to construct a model of the environment, followed by a selection of optimal actions based on that model. Given the complementary advantages of both approaches, we suggest a novel procedure which augments a model free algorithm with a partial model. The resulting hybrid algorithm switches between a model based and a model free mode, depending on the current state and the agent’s knowledge. Our method relies on a novel definition for a partially known model, and an estimator that incorporates such knowledge in order to reduce uncertainty in stochastic approximation iterations. We prove that such an approach leads to improved policy evaluation whenever environmental knowledge is available, without compromising performance when such knowledge is absent. Numerical simulations demonstrate the effectiveness of the approach on policy gradient and Q-learning algorithms, and its usefulness in solving a call admission control problem. Keywords: reinforcement learning, temporal difference, stochastic approximation, markov decision processes, hybrid model based model free algorithms
2 0.70286918 34 jmlr-2012-Dynamic Policy Programming
Author: Mohammad Gheshlaghi Azar, Vicenç Gómez, Hilbert J. Kappen
Abstract: In this paper, we propose a novel policy iteration method, called dynamic policy programming (DPP), to estimate the optimal policy in the infinite-horizon Markov decision processes. DPP is an incremental algorithm that forces a gradual change in policy update. This allows us to prove finite-iteration and asymptotic ℓ∞ -norm performance-loss bounds in the presence of approximation/estimation error which depend on the average accumulated error as opposed to the standard bounds which are expressed in terms of the supremum of the errors. The dependency on the average error is important in problems with limited number of samples per iteration, for which the average of the errors can be significantly smaller in size than the supremum of the errors. Based on these theoretical results, we prove that a sampling-based variant of DPP (DPP-RL) asymptotically converges to the optimal policy. Finally, we illustrate numerically the applicability of these results on some benchmark problems and compare the performance of the approximate variants of DPP with some existing reinforcement learning (RL) methods. Keywords: approximate dynamic programming, reinforcement learning, Markov decision processes, Monte-Carlo methods, function approximation
3 0.65100199 46 jmlr-2012-Finite-Sample Analysis of Least-Squares Policy Iteration
Author: Alessandro Lazaric, Mohammad Ghavamzadeh, Rémi Munos
Abstract: In this paper, we report a performance bound for the widely used least-squares policy iteration (LSPI) algorithm. We first consider the problem of policy evaluation in reinforcement learning, that is, learning the value function of a fixed policy, using the least-squares temporal-difference (LSTD) learning method, and report finite-sample analysis for this algorithm. To do so, we first derive a bound on the performance of the LSTD solution evaluated at the states generated by the Markov chain and used by the algorithm to learn an estimate of the value function. This result is general in the sense that no assumption is made on the existence of a stationary distribution for the Markov chain. We then derive generalization bounds in the case when the Markov chain possesses a stationary distribution and is β-mixing. Finally, we analyze how the error at each policy evaluation step is propagated through the iterations of a policy iteration method, and derive a performance bound for the LSPI algorithm. Keywords: Markov decision processes, reinforcement learning, least-squares temporal-difference, least-squares policy iteration, generalization bounds, finite-sample analysis
4 0.51642203 99 jmlr-2012-Restricted Strong Convexity and Weighted Matrix Completion: Optimal Bounds with Noise
Author: Sahand Negahban, Martin J. Wainwright
Abstract: We consider the matrix completion problem under a form of row/column weighted entrywise sampling, including the case of uniform entrywise sampling as a special case. We analyze the associated random observation operator, and prove that with high probability, it satisfies a form of restricted strong convexity with respect to weighted Frobenius norm. Using this property, we obtain as corollaries a number of error bounds on matrix completion in the weighted Frobenius norm under noisy sampling and for both exact and near low-rank matrices. Our results are based on measures of the “spikiness” and “low-rankness” of matrices that are less restrictive than the incoherence conditions imposed in previous work. Our technique involves an M-estimator that includes controls on both the rank and spikiness of the solution, and we establish non-asymptotic error bounds in weighted Frobenius norm for recovering matrices lying with ℓq -“balls” of bounded spikiness. Using information-theoretic methods, we show that no algorithm can achieve better estimates (up to a logarithmic factor) over these same sets, showing that our conditions on matrices and associated rates are essentially optimal. Keywords: matrix completion, collaborative filtering, convex optimization
5 0.45728847 116 jmlr-2012-Transfer in Reinforcement Learning via Shared Features
Author: George Konidaris, Ilya Scheidwasser, Andrew Barto
Abstract: We present a framework for transfer in reinforcement learning based on the idea that related tasks share some common features, and that transfer can be achieved via those shared features. The framework attempts to capture the notion of tasks that are related but distinct, and provides some insight into when transfer can be usefully applied to a problem sequence and when it cannot. We apply the framework to the knowledge transfer problem, and show that an agent can learn a portable shaping function from experience in a sequence of tasks to significantly improve performance in a later related task, even given a very brief training period. We also apply the framework to skill transfer, to show that agents can learn portable skills across a sequence of tasks that significantly improve performance on later related tasks, approaching the performance of agents given perfectly learned problem-specific skills. Keywords: reinforcement learning, transfer, shaping, skills
6 0.40891099 48 jmlr-2012-High-Dimensional Gaussian Graphical Model Selection: Walk Summability and Local Separation Criterion
7 0.38778657 29 jmlr-2012-Consistent Model Selection Criteria on High Dimensions
8 0.37821969 20 jmlr-2012-Analysis of a Random Forests Model
9 0.35928264 79 jmlr-2012-Oger: Modular Learning Architectures For Large-Scale Sequential Processing
10 0.34059244 59 jmlr-2012-Linear Regression With Random Projections
11 0.33501005 57 jmlr-2012-Learning Symbolic Representations of Hybrid Dynamical Systems
12 0.30066657 1 jmlr-2012-A Case Study on Meta-Generalising: A Gaussian Processes Approach
13 0.29636919 76 jmlr-2012-Noise-Contrastive Estimation of Unnormalized Statistical Models, with Applications to Natural Image Statistics
14 0.29133281 86 jmlr-2012-Optimistic Bayesian Sampling in Contextual-Bandit Problems
15 0.288719 54 jmlr-2012-Large-scale Linear Support Vector Regression
16 0.2877191 41 jmlr-2012-Exploration in Relational Domains for Model-based Reinforcement Learning
17 0.27068928 67 jmlr-2012-Minimax-Optimal Rates For Sparse Additive Models Over Kernel Classes Via Convex Programming
18 0.25737667 2 jmlr-2012-A Comparison of the Lasso and Marginal Regression
19 0.25560707 52 jmlr-2012-Iterative Reweighted Algorithms for Matrix Rank Minimization
20 0.2401576 38 jmlr-2012-Entropy Search for Information-Efficient Global Optimization
topicId topicWeight
[(21, 0.027), (26, 0.033), (29, 0.035), (35, 0.011), (57, 0.017), (69, 0.011), (75, 0.027), (77, 0.012), (79, 0.015), (92, 0.638), (96, 0.074)]
simIndex simValue paperId paperTitle
1 0.98746306 59 jmlr-2012-Linear Regression With Random Projections
Author: Odalric-Ambrym Maillard, Rémi Munos
Abstract: We investigate a method for regression that makes use of a randomly generated subspace GP ⊂ F (of finite dimension P) of a given large (possibly infinite) dimensional function space F , for example, L2 ([0, 1]d ; R). GP is defined as the span of P random features that are linear combinations of a basis functions of F weighted by random Gaussian i.i.d. coefficients. We show practical motivation for the use of this approach, detail the link that this random projections method share with RKHS and Gaussian objects theory and prove, both in deterministic and random design, approximation error bounds when searching for the best regression function in GP rather than in F , and derive excess risk bounds for a specific regression algorithm (least squares regression in GP ). This paper stresses the motivation to study such methods, thus the analysis developed is kept simple for explanations purpose and leaves room for future developments. Keywords: regression, random matrices, dimension reduction
2 0.98393607 99 jmlr-2012-Restricted Strong Convexity and Weighted Matrix Completion: Optimal Bounds with Noise
Author: Sahand Negahban, Martin J. Wainwright
Abstract: We consider the matrix completion problem under a form of row/column weighted entrywise sampling, including the case of uniform entrywise sampling as a special case. We analyze the associated random observation operator, and prove that with high probability, it satisfies a form of restricted strong convexity with respect to weighted Frobenius norm. Using this property, we obtain as corollaries a number of error bounds on matrix completion in the weighted Frobenius norm under noisy sampling and for both exact and near low-rank matrices. Our results are based on measures of the “spikiness” and “low-rankness” of matrices that are less restrictive than the incoherence conditions imposed in previous work. Our technique involves an M-estimator that includes controls on both the rank and spikiness of the solution, and we establish non-asymptotic error bounds in weighted Frobenius norm for recovering matrices lying with ℓq -“balls” of bounded spikiness. Using information-theoretic methods, we show that no algorithm can achieve better estimates (up to a logarithmic factor) over these same sets, showing that our conditions on matrices and associated rates are essentially optimal. Keywords: matrix completion, collaborative filtering, convex optimization
same-paper 3 0.97563595 51 jmlr-2012-Integrating a Partial Model into Model Free Reinforcement Learning
Author: Aviv Tamar, Dotan Di Castro, Ron Meir
Abstract: In reinforcement learning an agent uses online feedback from the environment in order to adaptively select an effective policy. Model free approaches address this task by directly mapping environmental states to actions, while model based methods attempt to construct a model of the environment, followed by a selection of optimal actions based on that model. Given the complementary advantages of both approaches, we suggest a novel procedure which augments a model free algorithm with a partial model. The resulting hybrid algorithm switches between a model based and a model free mode, depending on the current state and the agent’s knowledge. Our method relies on a novel definition for a partially known model, and an estimator that incorporates such knowledge in order to reduce uncertainty in stochastic approximation iterations. We prove that such an approach leads to improved policy evaluation whenever environmental knowledge is available, without compromising performance when such knowledge is absent. Numerical simulations demonstrate the effectiveness of the approach on policy gradient and Q-learning algorithms, and its usefulness in solving a call admission control problem. Keywords: reinforcement learning, temporal difference, stochastic approximation, markov decision processes, hybrid model based model free algorithms
Author: Alain Hauser, Peter Bühlmann
Abstract: The investigation of directed acyclic graphs (DAGs) encoding the same Markov property, that is the same conditional independence relations of multivariate observational distributions, has a long tradition; many algorithms exist for model selection and structure learning in Markov equivalence classes. In this paper, we extend the notion of Markov equivalence of DAGs to the case of interventional distributions arising from multiple intervention experiments. We show that under reasonable assumptions on the intervention experiments, interventional Markov equivalence defines a finer partitioning of DAGs than observational Markov equivalence and hence improves the identifiability of causal models. We give a graph theoretic criterion for two DAGs being Markov equivalent under interventions and show that each interventional Markov equivalence class can, analogously to the observational case, be uniquely represented by a chain graph called interventional essential graph (also known as CPDAG in the observational case). These are key insights for deriving a generalization of the Greedy Equivalence Search algorithm aimed at structure learning from interventional data. This new algorithm is evaluated in a simulation study. Keywords: causal inference, interventions, graphical model, Markov equivalence, greedy equivalence search
5 0.90011513 46 jmlr-2012-Finite-Sample Analysis of Least-Squares Policy Iteration
Author: Alessandro Lazaric, Mohammad Ghavamzadeh, Rémi Munos
Abstract: In this paper, we report a performance bound for the widely used least-squares policy iteration (LSPI) algorithm. We first consider the problem of policy evaluation in reinforcement learning, that is, learning the value function of a fixed policy, using the least-squares temporal-difference (LSTD) learning method, and report finite-sample analysis for this algorithm. To do so, we first derive a bound on the performance of the LSTD solution evaluated at the states generated by the Markov chain and used by the algorithm to learn an estimate of the value function. This result is general in the sense that no assumption is made on the existence of a stationary distribution for the Markov chain. We then derive generalization bounds in the case when the Markov chain possesses a stationary distribution and is β-mixing. Finally, we analyze how the error at each policy evaluation step is propagated through the iterations of a policy iteration method, and derive a performance bound for the LSPI algorithm. Keywords: Markov decision processes, reinforcement learning, least-squares temporal-difference, least-squares policy iteration, generalization bounds, finite-sample analysis
6 0.8747685 67 jmlr-2012-Minimax-Optimal Rates For Sparse Additive Models Over Kernel Classes Via Convex Programming
7 0.85123545 34 jmlr-2012-Dynamic Policy Programming
8 0.8065843 68 jmlr-2012-Minimax Manifold Estimation
9 0.79615772 111 jmlr-2012-Structured Sparsity and Generalization
10 0.7765016 29 jmlr-2012-Consistent Model Selection Criteria on High Dimensions
11 0.77462357 82 jmlr-2012-On the Necessity of Irrelevant Variables
12 0.7489534 15 jmlr-2012-Algebraic Geometric Comparison of Probability Distributions
13 0.74629283 73 jmlr-2012-Multi-task Regression using Minimal Penalties
14 0.74102527 96 jmlr-2012-Refinement of Operator-valued Reproducing Kernels
15 0.73732191 7 jmlr-2012-A Multi-Stage Framework for Dantzig Selector and LASSO
16 0.7302314 13 jmlr-2012-Active Learning via Perfect Selective Classification
17 0.72803384 80 jmlr-2012-On Ranking and Generalization Bounds
18 0.72260493 8 jmlr-2012-A Primal-Dual Convergence Analysis of Boosting
19 0.71802199 71 jmlr-2012-Multi-Instance Learning with Any Hypothesis Class
20 0.71393931 64 jmlr-2012-Manifold Identification in Dual Averaging for Regularized Stochastic Online Learning