jmlr jmlr2010 jmlr2010-79 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Thomas Jaksch, Ronald Ortner, Peter Auer
Abstract: For undiscounted reinforcement learning in Markov decision processes (MDPs) we consider the total regret of a learning algorithm with respect to an optimal policy. In order to describe the transition structure of an MDP we propose a new parameter: An MDP has diameter D if for any pair of states s, s′ there is a policy which moves from s to s′ in at most D steps (on average). We present a reinforcement learning algorithm with total regret Õ(DS√(AT)) after T steps for any unknown MDP with S states, A actions per state, and diameter D. A corresponding lower bound of Ω(√(DSAT)) on the total regret of any learning algorithm is given as well. These results are complemented by a sample complexity bound on the number of suboptimal steps taken by our algorithm. This bound can be used to achieve a (gap-dependent) regret bound that is logarithmic in T. Finally, we also consider a setting where the MDP is allowed to change a fixed number of ℓ times. We present a modification of our algorithm that is able to deal with this setting and show a regret bound of Õ(ℓ^(1/3) T^(2/3) D S √A). Keywords: undiscounted reinforcement learning, Markov decision process, regret, online learning, sample complexity
Reference: text
sentIndex sentText sentNum sentScore
1 In order to describe the transition structure of an MDP we propose a new parameter: An MDP has diameter D if for any pair of states s, s′ there is a policy which moves from s to s′ in at most D steps (on average). [sent-8, score-0.539]
2 We present a reinforcement learning algorithm with total regret Õ(DS√(AT)) after T steps for any unknown MDP with S states, A actions per state, and diameter D. [sent-9, score-0.594]
3 Thus, an MDP M and an algorithm A operating on M with initial state s constitute a stochastic process described by the states st visited at time step t, the actions at chosen by A at step t, and the rewards rt obtained (t ∈ N). [sent-30, score-0.527]
4 The limit ρ(M, A, s) := lim_{T→∞} (1/T) E[R(M, A, s, T)] is called the average reward and can be maximized by an appropriate stationary policy π : S → A which determines an optimal action for each state (see Puterman, 1994). [sent-34, score-0.597]
5 The diameter D is the time it takes to move from any state s to any other state s′ , using an appropriate policy for each pair of states s, s′ : Definition 1 Consider the stochastic process defined by a stationary policy π : S → A operating on an MDP M with initial state s. [sent-38, score-0.784]
6 ) In any case, a finite diameter seems necessary for interesting bounds on the regret of any algorithm with respect to an optimal policy. [sent-44, score-0.46]
7 Thus, compared to the simpler multi-armed bandit problem where each arm a is typically explored (log T)/g² times (depending on the gap g between the optimal reward and the reward for arm a)—see, for example, the regret bounds of Auer et al. [sent-46, score-0.787]
8 , 2002b) correspondingly translate into a regret bound of Θ(√(D|S||A|T)) for MDPs with diameter D. [sent-49, score-0.45]
9 The optimal average reward is the natural benchmark for a learning algorithm A, and we define the total regret of A after T steps as ∆(M, A, s, T) := T·ρ*(M) − R(M, A, s, T). [sent-56, score-0.509]
10 Unlike other parameters that have been proposed for various PAC and regret bounds, such as the mixing time (Kearns and Singh, 2002; Brafman and Tennenholtz, 2002) or the hitting time of an optimal policy (Tewari and Bartlett, 2008) (cf. [sent-61, score-0.532]
11 Their definition of regret measures the difference between the rewards of an optimal policy and the rewards of the learning algorithm along the trajectory taken by the learning algorithm. [sent-77, score-0.653]
12 In contrast, we are interested in the regret of the learning algorithm with respect to the rewards of the optimal policy along the trajectory of the optimal policy. [sent-78, score-0.672]
13 Then any policy which maximizes immediate rewards achieves 0 regret in the notion of Strehl and Littman. [sent-96, score-0.634]
14 But such a policy may not move to states where the optimal reward is obtained. [sent-97, score-0.457]
15 Theorem 5 For any algorithm A, any natural numbers S, A ≥ 10, D ≥ 20 log_A S, and T ≥ DSA, there is an MDP M with S states, A actions, and diameter D, such that for any initial state s ∈ S the expected regret of A after T steps is E[∆(M, A, s, T)] ≥ 0.015·√(DSAT). [sent-128, score-0.49]
16 , its transition probabilities and reward distributions) is allowed to change (ℓ − 1) times up to step T , such that the diameter is always at most D. [sent-133, score-0.47]
17 , the regret (now measured as the sum of missed rewards compared to the ℓ optimal policies in the periods during which the MDP remains constant) is upper bounded by 65 · ℓ^(1/3) T^(2/3) D S √(A log(T/δ)) with probability of at least 1 − δ. [sent-137, score-0.574]
18 The achieved regret bounds are O(√(ℓT log T)) in the first two mentioned papers, while Yu and Mannor (2009) derive regret bounds of O(ℓ log T) for a setting with side observations on past rewards in which the number of changes ℓ need not be known in advance. [sent-142, score-0.829]
19 In this setting, an upper bound of O(√T) on the regret against an optimal stationary policy (with the reward changes known in advance) is best possible and has been derived by Even-Dar et al. [sent-147, score-0.763]
20 (2009), who also show that for achieving sublinear regret it is essential that the changing rewards are chosen obliviously, as an opponent who chooses the rewards depending on the learner’s history may inflict linear loss on the learner. [sent-150, score-0.552]
21 While in the stochastic setting the average reward of an MDP is always maximized by a stationary policy π : S → A , in the nonstochastic setting obviously a dynamic policy adapted to the reward sequence would in general earn more than a stationary policy. [sent-152, score-0.877]
22 However, obviously no algorithm will be able to compete with the best dynamic policy for all possible reward sequences, so that—similar to the nonstochastic bandit problem, compare to Auer et al. [sent-153, score-0.484]
23 That is, if the rewards are not bounded in [0, 1] but taken from some interval [rmin , rmax ], the rewards can simply be normalized, so that the given regret bounds hold with additional factor (rmax − rmin ). [sent-158, score-0.572]
24 More precisely, UCRL2 (see Figure 1) proceeds in episodes and computes a new policy π̃_k only at the beginning of each episode k. [sent-164, score-0.682]
25 In Steps 2–3, UCRL2 computes estimates r̂_k(s, a) and p̂_k(s′|s, a) for the mean rewards and the transition probabilities from the observations made before episode k. [sent-166, score-0.64]
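As an illustration of Steps 2–3, the following Python sketch computes the empirical estimates and the confidence widths that define the set of plausible MDPs. It is not the authors' code; the array names (R_acc, P_counts, N) are made up here, and the constants inside the square roots are copied from the confidence terms quoted later in the text, so treat their exact form as an assumption of this sketch.

    import numpy as np

    def episode_estimates(R_acc, P_counts, N, t_k, delta):
        # R_acc[s, a]: accumulated rewards before episode k
        # P_counts[s, a, s2]: transition counts before episode k
        # N[s, a]: visits to (s, a) before episode k; t_k: first step of episode k
        S, A = N.shape
        N_plus = np.maximum(1, N)                     # max{1, N_k(s, a)}
        r_hat = R_acc / N_plus                        # empirical mean rewards
        p_hat = P_counts / N_plus[:, :, None]         # empirical transition probabilities
        # confidence widths for rewards and transitions (assumed form)
        d_r = np.sqrt(7.0 * np.log(2 * S * A * t_k / delta) / (2.0 * N_plus))
        d_p = np.sqrt(14.0 * S * np.log(2 * A * t_k / delta) / N_plus)
        # rows of p_hat for unvisited pairs stay at zero; the large width d_p
        # then makes every transition distribution plausible for them
        return r_hat, p_hat, d_r, d_p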
26 Episode k ends when a state s is visited in which the action a = π̃_k(s) induced by the current policy has been chosen in episode k equally often as before episode k. [sent-171, score-0.764]
27 The counts vk (s, a) keep track of these occurrences in episode k. [sent-173, score-0.489]
28 While value iteration typically calculates an optimal policy for a fixed MDP, we also need to select an optimistic MDP M̃_k that gives almost maximal optimal average reward among all plausible MDPs. [sent-176, score-0.545]
29 Then for each policy π⁺ on M⁺ there is an MDP M̃ ∈ M and a policy π̃ : S → A on M̃ such that the policies π⁺ and π̃ induce the same transition probabilities and mean rewards on the respective MDP. [sent-190, score-0.782]
30 Thus, finding an MDP M̃ ∈ M and a policy π̃ on M̃ such that ρ(M̃, π̃, s) = max_{M′∈M, π, s′} ρ(M′, π, s′) for all initial states s corresponds to finding an average reward optimal policy on M⁺. [sent-193, score-0.959]
31 Since the policy π̃_k is fixed for episode k, v_k(s, a) ≠ 0 only for a = π̃_k(s). [sent-195, score-0.692]
32 For all (s, a) in S × A initialize the state-action counts for episode k, vk (s, a) := 0. [sent-209, score-0.489]
33 For s, s′ ∈ S and a ∈ A set the observed accumulated rewards and the transition counts prior to episode k, R_k(s, a) := Σ_{τ=1}^{t_k−1} r_τ 1_{s_τ=s, a_τ=a} and P_k(s, a, s′) := #{τ < t_k : s_τ = s, a_τ = a, s_{τ+1} = s′}. [sent-212, score-0.658]
34 While v_k(s_t, π̃_k(s_t)) < max{1, N_k(s_t, π̃_k(s_t))} do: (a) Choose action a_t = π̃_k(s_t), obtain reward r_t, and observe next state s_{t+1}. [sent-220, score-0.657]
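The episode loop just described can be sketched as follows; the interface env.step(a) returning (reward, next_state), the attribute env.current_state, and the array policy mapping states to actions are assumptions of this sketch, not part of the paper.

    import numpy as np

    def run_episode(env, policy, N, t, T):
        # Follow the fixed policy until some pair (s, a) has been chosen in this
        # episode as often as before it (the doubling criterion stated above).
        v = np.zeros_like(N)                              # in-episode counts v_k(s, a)
        s = env.current_state
        history = []
        while t <= T and v[s, policy[s]] < max(1, N[s, policy[s]]):
            a = policy[s]
            r, s_next = env.step(a)                       # obtain reward r_t, observe s_{t+1}
            history.append((s, a, r, s_next))
            v[s, a] += 1
            s, t = s_next, t + 1
        return v, history, t                              # caller updates N += v afterwards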
35 The idea is to put as much transition probability as possible to the state with maximal value ui (s) at the expense of transition probabilities to states with small values ui (s). [sent-248, score-0.592]
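A minimal sketch of this inner maximization is given below; it assumes the empirical distribution p_hat_sa sums to one and that d is the transition confidence width for the pair at hand. The O(S²) reduction loop is a simplification for illustration, not the authors' implementation.

    import numpy as np

    def inner_max(p_hat_sa, u, d):
        # Shift as much probability as the L1-ball of radius d allows onto the
        # state with the largest value u, taking it away from low-value states.
        order = np.argsort(-u)                            # states by decreasing value u_i
        p = p_hat_sa.astype(float).copy()
        best = order[0]
        p[best] = min(1.0, p_hat_sa[best] + d / 2.0)      # raise the best target state
        for s in order[::-1]:                             # lowest-value states first
            excess = p.sum() - 1.0
            if excess <= 0:
                break
            p[s] = max(0.0, p[s] - excess)                # remove the surplus probability
        return p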
36 ) Indeed, extended value iteration always chooses a policy with aperiodic transition matrix: In each iteration there is a single fixed state s′_1 which is regarded as the “best” target state. [sent-265, score-0.553]
37 Further, stopping extended value iteration when max_{s∈S} {u_{i+1}(s) − u_i(s)} − min_{s∈S} {u_{i+1}(s) − u_i(s)} < ε, the greedy policy with respect to u_i is ε-optimal. [sent-272, score-0.606]
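Putting the pieces together, extended value iteration with this stopping rule can be sketched as follows, reusing inner_max from above. The cap min(1, r̂ + d_r) reflects that rewards lie in [0, 1]; the aperiodicity transformation discussed in the surrounding text is omitted, so this is an assumption-laden sketch rather than the exact procedure.

    import numpy as np

    def extended_value_iteration(r_hat, d_r, p_hat, d_p, eps):
        S, A = r_hat.shape
        u = np.zeros(S)
        while True:
            q = np.empty((S, A))
            for s in range(S):
                for a in range(A):
                    p_opt = inner_max(p_hat[s, a], u, d_p[s, a])            # optimistic transitions
                    q[s, a] = min(1.0, r_hat[s, a] + d_r[s, a]) + p_opt @ u  # optimistic reward
            u_next = q.max(axis=1)
            diff = u_next - u
            if diff.max() - diff.min() < eps:             # span of the value differences
                return q.argmax(axis=1), u_next           # greedy policy is eps-optimal
            u = u_next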
38 Recall that for a policy π the bias λ(s) in state s is basically the expected advantage in total reward (for T → ∞) of starting in state s over starting in the stationary distribution (the long term probability of being in a state) of π. [sent-274, score-0.528]
39 For a fixed policy π, the Poisson equation λ = r − ρ1 + P λ relates the bias vector λ to the average reward ρ, the mean reward vector r, and the transition matrix P . [sent-275, score-0.722]
40 As we will see in inequality (11) below, the so-called span maxs ui (s) − mins ui (s) of the vector ui is upper bounded by the diameter D, so that this also holds for the span of the bias vector λ of the optimal policy found by extended value iteration, that is, maxs λ(s) − mins λ(s) ≤ D. [sent-277, score-0.762]
41 Now returning to Step 5 of UCRL2, we stop value iteration when max_{s∈S} {u_{i+1}(s) − u_i(s)} − min_{s∈S} {u_{i+1}(s) − u_i(s)} < 1/√(t_k), (6) which guarantees by Theorem 7 that the greedy policy with respect to u_i is 1/√(t_k)-optimal. [sent-287, score-0.696]
42 Further, the regret is expressed as the sum of the regret accumulated in the individual episodes. [sent-292, score-0.6]
43 That is, we set the regret in episode k to be ∆_k := Σ_{s,a} v_k(s, a) (ρ* − r̄(s, a)), where v_k(s, a) now denotes the final counts of state-action pair (s, a) in episode k. [sent-293, score-1.248]
44 After this intermezzo, the regret of episodes for which the true MDP M ∈ Mk is examined in Section 4. [sent-298, score-0.567]
45 2 and split into v_k(P̃_k − I)w_k = v_k(P̃_k − P_k)w_k + v_k(P_k − I)w_k ≤ ‖v_k(P̃_k − P_k)‖_1 ‖w_k‖_∞ + v_k(P_k − I)w_k, where P_k is the true transition matrix (in M) of the policy applied in episode k. [sent-309, score-2.074]
46 3 concludes the analysis of episodes with M ∈ Mk by summing the individual regret terms over all episodes k with M ∈ Mk . [sent-317, score-0.865]
47 Denoting the number of episodes started up to step T by m, we have ∑m vk (s, a) = N(s, a) and ∑s,a N(s, a) = T . [sent-326, score-0.564]
48 4.2 Dealing with Failing Confidence Regions: Let us now consider the regret of episodes in which the set of plausible MDPs M_k does not contain the true MDP M, Σ_{k=1}^{m} ∆_k 1_{M∉M_k}. [sent-329, score-0.567]
49 By the stopping criterion for episode k we have (except for episodes where v_k(s, a) = 1 and N_k(s, a) = 0, when Σ_{s,a} v_k(s, a) = 1 ≤ t_k holds trivially) Σ_{s,a} v_k(s, a) ≤ Σ_{s,a} N_k(s, a) = t_k − 1. [sent-330, score-1.5]
50 4.3 Episodes with M ∈ M_k: Now we assume that M ∈ M_k and start by considering the regret in a single episode k. [sent-336, score-0.472]
51 The optimistic average reward ρ̃_k of the optimistically chosen policy π̃_k is essentially larger than the true optimal average reward ρ*, and thus it is sufficient to calculate by how much the optimistic average reward ρ̃_k overestimates the actual rewards of policy π̃_k. [sent-337, score-1.258]
52 Thus for the regret ∆_k accumulated in episode k we obtain ∆_k ≤ Σ_{s,a} v_k(s, a) (ρ* − r̄(s, a)) ≤ Σ_{s,a} v_k(s, a) (ρ̃_k − r̄(s, a)) + Σ_{s,a} v_k(s, a)/√(t_k). [sent-339, score-1.353]
53 As an important observation for our analysis, we find that for any iteration i the range of the state values is bounded by the diameter of the MDP M, that is, max ui (s) − min ui (s) ≤ D. [sent-344, score-0.477]
54 (11) To see this, observe that u_i(s) is the total expected i-step reward of an optimal non-stationary i-step policy starting in state s on the MDP M⁺ with extended action set (as considered for extended value iteration). [sent-345, score-0.685]
55 Now, if there were states s′ , s′′ with ui (s′′ ) − ui (s′ ) > D, then an improved value for ui (s′ ) could be achieved by the following nonstationary policy: First follow a policy which moves from s′ to s′′ most quickly, which takes at most D steps on average. [sent-347, score-0.595]
56 Since only D of the i rewards of the policy for s′′ are missed, this policy gives ui (s′ ) ≥ ui (s′′ ) − D, contradicting our assumption and thus proving (11). [sent-349, score-0.781]
57 of Puterman (1994), that when the convergence criterion (6) holds at iteration i, then |u_{i+1}(s) − u_i(s) − ρ̃_k| ≤ 1/√(t_k) (12) for all s ∈ S, where ρ̃_k is the average reward of the policy π̃_k chosen in this iteration on the optimistic MDP M̃_k. [sent-353, score-0.743]
58 Expanding u_{i+1}(s) according to (5), we get u_{i+1}(s) = r̃_k(s, π̃_k(s)) + Σ_{s′} p̃_k(s′|s, π̃_k(s)) · u_i(s′), and hence by (12), |ρ̃_k − r̃_k(s, π̃_k(s)) − (Σ_{s′} p̃_k(s′|s, π̃_k(s)) · u_i(s′) − u_i(s))| ≤ 1/√(t_k). (13) [sent-354, score-0.679]
59 Setting r̃_k := (r̃_k(s, π̃_k(s)))_s to be the (column) vector of rewards for policy π̃_k, P̃_k := (p̃_k(s′|s, π̃_k(s)))_{s,s′} the transition matrix of π̃_k on M̃_k, and v_k := (v_k(s, π̃_k(s)))_s the (row) vector of visit counts in episode k, [sent-355, score-1.354]
60 Since the rows of P̃_k sum to 1, we can replace u_i by w_k where we set w_k(s) := u_i(s) − (min_s u_i(s) + max_s u_i(s))/2, such that it follows from (11) that ‖w_k‖_∞ ≤ D/2. [sent-359, score-0.84]
61 Further, since we assume M ∈ M_k, |r̃_k(s, a) − r̄(s, a)| ≤ |r̃_k(s, a) − r̂_k(s, a)| + |r̄(s, a) − r̂_k(s, a)| is bounded according to (3), so that ∆_k ≤ v_k(P̃_k − I)w_k + 2 Σ_{s,a} v_k(s, a) √(7 log(2SAt_k/δ)/(2 max{1, N_k(s, a)})) + 2 Σ_{s,a} v_k(s, a)/√(t_k). (14) [sent-360, score-1.075]
62 Noting that max{1, N_k(s, a)} ≤ t_k ≤ T, we get from (14) that ∆_k ≤ v_k(P̃_k − I)w_k + (√(14 log(2SAT/δ)) + 2) Σ_{s,a} v_k(s, a)/√(max{1, N_k(s, a)}). (15) [sent-362, score-0.519]
63 4.3.2 The True Transition Matrix: Now we want to replace the transition matrix P̃_k of the policy π̃_k in the optimistic MDP M̃_k by the transition matrix P_k := (p(s′|s, π̃_k(s)))_{s,s′} of π̃_k in the true MDP M. [sent-364, score-0.556]
64 Thus, we write v_k(P̃_k − I)w_k = v_k(P̃_k − P_k + P_k − I)w_k = v_k(P̃_k − P_k)w_k + v_k(P_k − I)w_k. (16) [sent-365, score-1.544]
65 The intuition about the second term in (16) is that the counts of the state visits vk are relatively close to the stationary distribution µk of the transition matrix Pk , for which µk Pk = µk , such that vk Pk − I should be small. [sent-371, score-0.85]
66 Then for any episode k with M ∈ M_k, we have due to ‖w_k‖_∞ ≤ D/2 that v_k(P_k − I)w_k = Σ_{t=t_k}^{t_{k+1}−1} (p(·|s_t, a_t) − e_{s_t}) w_k = Σ_{t=t_k}^{t_{k+1}−1} (p(·|s_t, a_t) − e_{s_{t+1}}) w_k + w_k(s_{t_{k+1}}) − w_k(s_{t_k}) = Σ_{t=t_k}^{t_{k+1}−1} X_t + w_k(s_{t_{k+1}}) − w_k(s_{t_k}) ≤ Σ_{t=t_k}^{t_{k+1}−1} X_t + D. [sent-387, score-0.964]
67 Since E[X_t | s_1, a_1, ..., s_t, a_t] = 0, (X_t)_t is a sequence of martingale differences, and an application of Lemma 10 gives P(Σ_{t=1}^{T} X_t ≥ D √((5/4) · 2T · log(8T/δ))) ≤ (δ/(8T))^{5/4}. Since for the number of episodes we have m ≤ SA log_2(8T/(SA)), summing over all episodes k = 1, ..., m yields the claimed bound with probability at least 1 − δ/(12 T^{5/4}). [sent-392, score-0.788]
68 We say that an episode k is ε-bad if its average regret is more than ε, where the average regret of an episode of length ℓ_k is ∆_k/ℓ_k with ∆_k = Σ_{t=t_k}^{t_{k+1}−1} (ρ* − r_t). [sent-407, score-0.978]
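In code, flagging an ε-bad episode is a direct transcription of this definition; the sketch below is only an illustration, since ρ* is of course not available to the learner.

    def is_eps_bad(episode_rewards, rho_star, eps):
        # average regret of the episode compared with the optimal average reward
        ell_k = len(episode_rewards)
        delta_k = sum(rho_star - r for r in episode_rewards)
        return delta_k / ell_k > eps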
69 Proof The proof is an adaptation of the proof of Theorem 2 which gives an upper bound of O(DS√(L_ε A log(AT/δ))) on the regret ∆′_ε(s, T) in ε-bad episodes in terms of L_ε. [sent-411, score-0.649]
70 In order to bound the regret of a single episode with M ∈ M_k we follow the lines of the proof of Theorem 2 in Section 4. [sent-423, score-0.528]
71 Combining (15), (16), and (17) we have that ∆_k ≤ v_k(P_k − I)w_k + (2D√(14S log(2AT/δ)) + 2) Σ_{s,a} v_k(s, a)/√(max{1, N_k(s, a)}). [sent-425, score-0.716]
72 For the regret term Σ_{k∈K_ε} v_k(P_k − I)w_k 1_{M∈M_k} we use an argument similar to the one applied to obtain (18) in Section 4. [sent-428, score-0.577]
73 Proof of Theorem 4 Upper bounding L_ε in (32) by (33), we obtain for the regret ∆′_ε(s, T) accumulated in ε-bad episodes that ∆′_ε(s, T) ≤ 34² · D²S²A log(T/δ)/ε with probability at least 1 − 3δ. [sent-453, score-0.61]
74 Noting that the regret accumulated outside of ε-bad episodes is at most εT implies the first statement of the theorem. [sent-454, score-0.587]
75 What remains to do is to consider episodes k with expected average regret smaller than g/2 in which however a non-optimal policy π̃_k was chosen. [sent-456, score-0.79]
76 First, note that for each policy π there is a T_π such that for all T ≥ T_π the expected average reward after T steps is g/2-close to the average reward of π. [sent-457, score-0.605]
77 Thus, when a policy π is played in an episode of length ≥ T_π, either the episode is g/2-bad (in expectation) or the policy π is optimal. [sent-458, score-0.81]
78 Now we fix a state-action pair (s, a) and consider the episodes k in which the number of visits vk (s, a) in (s, a) is doubled. [sent-459, score-0.614]
79 The corresponding episode lengths ℓk (s, a) are not necessarily increasing, but the vk (s, a) are monotonically increasing, and obviously ℓk (s, a) ≥ vk (s, a). [sent-460, score-0.756]
80 Since the v_k(s, a) are at least doubled, it takes at most ⌈1 + log_2(max_{π:π(s)=a} T_π)⌉ episodes until ℓ_k(s, a) ≥ v_k(s, a) ≥ max_{π:π(s)=a} T_π, when any policy π with π(s) = a applied in episode k that is not g/2-bad (in expectation) will be optimal. [sent-461, score-1.256]
81 Consequently, as only episodes of length smaller than maxπ:π(s)=a Tπ have to be considered, the regret of episodes k where vk (s, a) < maxπ:π(s)=a Tπ is upper bounded by ⌈1 + log2 (maxπ:π(s)=a Tπ )⌉ maxπ:π(s)=a Tπ . [sent-462, score-1.15]
82 For this task we define the regret of an algorithm A up to step T with respect to the average reward ρ*(t) of an optimal policy at step t as ∆′(A, s, T) := Σ_{t=1}^{T} (ρ*(t) − r_t), where r_t is as usual the reward received by A at step t when starting in state s. [sent-657, score-1.016]
83 The total regret for the (at most ℓ) periods in which the MDP changes is bounded by ℓ times the maximal period length, while for each other period we have, by Theorem 2, a regret of order Õ(DS√A) times the square root of the period length. [sent-659, score-0.605]
84 Then the sum of the minimal expected transition times to states in U when starting in an initial state distributed according to d_0 is bounded as follows: T(U|d_0) := Σ_{s∈U} Σ_{s_0∈S} d_0(s_0) T*(s|s_0) ≥ min_{0≤n_k≤A^k, k≥0, Σ_k n_k=|U|} Σ_k k·n_k. [sent-714, score-0.447]
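The minimum on the right-hand side packs as many states as possible at small expected distances (at most A^k of them at distance k), so it can be evaluated greedily; the sketch below is a direct translation of that observation, not code from the paper. For example, min_expected_reach_sum(7, 2) evaluates to 0*1 + 1*2 + 2*4 = 10.

    def min_expected_reach_sum(num_states, A):
        # min sum_k k*n_k  subject to  0 <= n_k <= A**k  and  sum_k n_k = num_states
        total, k, remaining = 0, 0, num_states
        while remaining > 0:
            n_k = min(remaining, A ** k)      # place as many states as allowed at distance k
            total += k * n_k
            remaining -= n_k
            k += 1
        return total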
85 Proof of Theorem 14 Let a∗ (s0 , s) be the optimal action in state s0 for reaching state s, and let p(s|s0 , a) be the transition probability to state s when choosing action a in state s0 . [sent-728, score-0.566]
86 4 of Puterman (1994)—the main result on convergence of value iteration—needs this assumption only at one step, that is, to guarantee that the optimal policy identified at the end of the proof has aperiodic transition matrix. [sent-745, score-0.448]
87 3 of Puterman (1994) shows that value iteration eventually chooses only policies π that satisfy Pπ ρ∗ = ρ∗ , where Pπ is the transition matrix of π and ρ∗ is the optimal average reward vector. [sent-749, score-0.452]
88 More precisely, there is an i_0 such that for all i ≥ i_0, max_{π∈E} {r_π + P_π u_i} = max_{π:S→A} {r_π + P_π u_i}, where r_π is the reward vector of the policy π, and E := {π : S → A | P_π ρ* = ρ*}. [sent-750, score-0.642]
89 A Bound on the Number of Episodes: Since in each episode the total number of visits to at least one state-action pair doubles, the number of episodes m is logarithmic in T. [sent-805, score-0.531]
90 In each episode k < m there is a state-action pair (s, a) with vk (s, a) = Nk (s, a) (or vk (s, a) = 1, Nk (s, a) = 0). [sent-809, score-0.756]
91 Let K(s, a) be the number of episodes with vk (s, a) = Nk (s, a) and Nk (s, a) > 0. [sent-810, score-0.564]
92 If N(s, a) > 0, then v_k(s, a) = N_k(s, a) implies N_{k+1}(s, a) = 2N_k(s, a), so that N(s, a) = Σ_{k=1}^{m} v_k(s, a) ≥ 1 + Σ_{k: v_k(s,a)=N_k(s,a)} N_k(s, a) ≥ 1 + Σ_{i=1}^{K(s,a)} 2^{i−1} = 2^{K(s,a)}. (45) [sent-811, score-0.574]
93 Now, in each episode a state-action pair (s, a) is visited for which either N_k(s, a) = 0 or N_k(s, a) = v_k(s, a). [sent-814, score-0.492]
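The doubling argument can be checked numerically with the small simulation below; drawing (s, a) uniformly at random is only a stand-in for the actual trajectory, so the resulting count merely illustrates the logarithmic growth of m rather than reproducing the exact bound.

    import numpy as np

    def count_episodes(T, S, A, seed=0):
        rng = np.random.default_rng(seed)
        N = np.zeros((S, A), dtype=int)       # visits before the current episode
        v = np.zeros((S, A), dtype=int)       # visits inside the current episode
        m = 1
        for _ in range(T):
            s, a = rng.integers(S), rng.integers(A)
            v[s, a] += 1
            if v[s, a] >= max(1, N[s, a]):    # stopping criterion: some count doubles
                N += v
                v[:] = 0
                m += 1
        return m                              # compare with S * A * np.log2(T)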
94 Since z_n ≤ Z_{n−1} = Σ_{k=1}^{n−1} z_k and Z_{n−1} + z_n = Z_n, we further have (√2+1)√(Z_{n−1}) + z_n/√(Z_{n−1}) = √((√2+1)² Z_{n−1} + 2(√2+1) z_n + z_n²/Z_{n−1}) ≤ √((√2+1)² Z_{n−1} + 2(√2+1) z_n + z_n) = √((√2+1)² (Z_{n−1} + z_n)) = (√2+1)√(Z_n), which proves the lemma. [sent-830, score-0.551]
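The inequality proved here is easy to test numerically; the helper below assumes the convention that the denominators are max{1, Z_{k−1}} (i.e., Z_0 is taken as 1), which is how the lemma is applied to the visit counts.

    import numpy as np

    def check_sqrt_lemma(z):
        # returns (sum_k z_k / sqrt(max(1, Z_{k-1})), (sqrt(2)+1) * sqrt(Z_n))
        Z, lhs = 0.0, 0.0
        for z_k in z:
            lhs += z_k / np.sqrt(max(1.0, Z))
            Z += z_k
        return lhs, (np.sqrt(2) + 1.0) * np.sqrt(Z)

For instance, check_sqrt_lemma(np.random.rand(1000)) returns a left-hand side below the right-hand side, in line with the lemma (entries in [0, 1] automatically satisfy z_k ≤ max{1, Z_{k−1}}).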
95 Technical Details for the Proof of Theorem 4: Proof of (27) For a given index set K_ε of episodes we would like to bound the sum Σ_{k∈K_ε} Σ_{s,a} v_k(s, a)/√(max{1, N_k(s, a)}) = Σ_{s,a} Σ_{k=1}^{m} v_k(s, a)/√(max{1, N_k(s, a)}) 1_{k∈K_ε}. [sent-839, score-0.881]
96 In the following we show that the contribution of episodes ∈ K_ε that occur after step L_ε := Σ_{k∈K_ε} Σ_{s,a} v_k(s, a) is not larger than the missing contributions of the episodes ∉ K_ε. [sent-842, score-0.841]
97 Intuitively speaking, one may fill the episodes ∈ K_ε that occur after step L_ε into the gaps of episodes ∉ K_ε as Figure 6 suggests. [sent-843, score-0.554]
98 Shaded boxes stand for episodes ∈ K_ε, empty boxes for episodes ∉ K_ε. [sent-845, score-0.554]
99 The contribution of episodes after step L_ε can be “filled into the gaps” of episodes ∉ K_ε before step L_ε. [sent-846, score-0.554]
100 By (47) and due to d_k ≥ d_{m_ε} for k ≥ m_ε we have Σ_{k=1}^{m} (v_k/d_k) 1_{k∈K_ε} ≤ Σ_{k=1}^{m_ε−1} (v_k/d_k) 1_{k∈K_ε} + ((ℓ_ε − N_{m_ε})/d_{m_ε}) 1_{m_ε∈K_ε} + (1/d_{m_ε}) ((N_{m_ε+1} − ℓ_ε) 1_{m_ε∈K_ε} + Σ_{k=m_ε+1}^{m} v_k 1_{k∈K_ε}). (48) [sent-852, score-0.984]
wordName wordTfidf (topN-words)
[('mdp', 0.346), ('regret', 0.29), ('vk', 0.287), ('episodes', 0.277), ('crl', 0.242), ('auer', 0.227), ('policy', 0.223), ('punif', 0.189), ('reward', 0.182), ('episode', 0.182), ('st', 0.171), ('pk', 0.147), ('transition', 0.135), ('diameter', 0.13), ('jaksch', 0.124), ('rewards', 0.121), ('ka', 0.111), ('ui', 0.107), ('mdps', 0.106), ('nk', 0.104), ('action', 0.102), ('mk', 0.101), ('wk', 0.099), ('sa', 0.099), ('ea', 0.094), ('actions', 0.093), ('tk', 0.09), ('rtner', 0.086), ('einforcement', 0.086), ('egret', 0.086), ('eunif', 0.083), ('puterman', 0.077), ('ear', 0.077), ('zn', 0.073), ('tewari', 0.071), ('reinforcement', 0.063), ('ounds', 0.063), ('optimistic', 0.063), ('dsa', 0.059), ('policies', 0.057), ('state', 0.052), ('nm', 0.051), ('visits', 0.05), ('bandit', 0.05), ('loga', 0.047), ('aperiodic', 0.045), ('strehl', 0.045), ('pa', 0.044), ('log', 0.043), ('dk', 0.041), ('ortner', 0.041), ('zk', 0.04), ('iteration', 0.039), ('dence', 0.035), ('brafman', 0.035), ('mix', 0.035), ('earning', 0.034), ('rt', 0.034), ('states', 0.033), ('dm', 0.032), ('rk', 0.032), ('bartlett', 0.031), ('ti', 0.03), ('communicating', 0.03), ('undiscounted', 0.03), ('bound', 0.03), ('nonstochastic', 0.029), ('tennenholtz', 0.029), ('ds', 0.027), ('learner', 0.026), ('proof', 0.026), ('mannor', 0.025), ('mins', 0.025), ('periods', 0.025), ('theorem', 0.024), ('burnetas', 0.024), ('transient', 0.023), ('pac', 0.023), ('probabilities', 0.023), ('visited', 0.023), ('max', 0.023), ('claimed', 0.022), ('logarithmic', 0.022), ('kearns', 0.022), ('xt', 0.022), ('bounds', 0.021), ('summing', 0.021), ('counts', 0.02), ('lemma', 0.02), ('chooses', 0.02), ('martingale', 0.02), ('accumulated', 0.02), ('kakade', 0.02), ('stationary', 0.019), ('kl', 0.019), ('bounded', 0.019), ('optimal', 0.019), ('steps', 0.018), ('peter', 0.018), ('sham', 0.018)]
simIndex simValue paperId paperTitle
same-paper 1 0.99999976 79 jmlr-2010-Near-optimal Regret Bounds for Reinforcement Learning
Author: Thomas Jaksch, Ronald Ortner, Peter Auer
Abstract: For undiscounted reinforcement learning in Markov decision processes (MDPs) we consider the total regret of a learning algorithm with respect to an optimal policy. In order to describe the transition structure of an MDP we propose a new parameter: An MDP has diameter D if for any pair of states s, s′ there is a policy which moves from s to s′ in at most D steps (on average). √ ˜ We present a reinforcement learning algorithm with total regret O(DS AT ) after T steps for any unknown MDP with S states, A actions per state, and diameter D. A corresponding lower bound of √ Ω( DSAT ) on the total regret of any learning algorithm is given as well. These results are complemented by a sample complexity bound on the number of suboptimal steps taken by our algorithm. This bound can be used to achieve a (gap-dependent) regret bound that is logarithmic in T . Finally, we also consider a setting where the MDP is allowed to change a fixed number of ℓ times. We present a modification of our algorithm that is able to deal with this setting and show a √ ˜ regret bound of O(ℓ1/3 T 2/3 DS A). Keywords: undiscounted reinforcement learning, Markov decision process, regret, online learning, sample complexity
2 0.20563069 97 jmlr-2010-Regret Bounds and Minimax Policies under Partial Monitoring
Author: Jean-Yves Audibert, Sébastien Bubeck
Abstract: This work deals with four classical prediction settings, namely full information, bandit, label efficient and bandit label efficient as well as four different notions of regret: pseudo-regret, expected regret, high probability regret and tracking the best expert regret. We introduce a new forecaster, INF (Implicitly Normalized Forecaster) based on an arbitrary function ψ for which we propose a unified γ analysis of its pseudo-regret in the four games we consider. In particular, for ψ(x) = exp(ηx) + K , INF reduces to the classical exponentially weighted average forecaster and our analysis of the pseudo-regret recovers known results while for the expected regret we slightly tighten the bounds. γ η q On the other hand with ψ(x) = −x + K , which defines a new forecaster, we are able to remove the extraneous logarithmic factor in the pseudo-regret bounds for bandits games, and thus fill in a long open gap in the characterization of the minimax rate for the pseudo-regret in the bandit game. We also provide high probability bounds depending on the cumulative reward of the optimal action. Finally, we consider the stochastic bandit game, and prove that an appropriate modification of the upper confidence bound policy UCB1 (Auer et al., 2002a) achieves the distribution-free optimal rate while still having a distribution-dependent rate logarithmic in the number of plays. Keywords: Bandits (adversarial and stochastic), regret bound, minimax rate, label efficient, upper confidence bound (UCB) policy, online learning, prediction with limited feedback.
3 0.11669865 12 jmlr-2010-Analysis of Multi-stage Convex Relaxation for Sparse Regularization
Author: Tong Zhang
Abstract: We consider learning formulations with non-convex objective functions that often occur in practical applications. There are two approaches to this problem: • Heuristic methods such as gradient descent that only find a local minimum. A drawback of this approach is the lack of theoretical guarantee showing that the local minimum gives a good solution. • Convex relaxation such as L1 -regularization that solves the problem under some conditions. However it often leads to a sub-optimal solution in reality. This paper tries to remedy the above gap between theory and practice. In particular, we present a multi-stage convex relaxation scheme for solving problems with non-convex objective functions. For learning formulations with sparse regularization, we analyze the behavior of a specific multistage relaxation scheme. Under appropriate conditions, we show that the local solution obtained by this procedure is superior to the global solution of the standard L1 convex relaxation for learning sparse targets. Keywords: sparsity, non-convex optimization, convex relaxation, multi-stage convex relaxation
4 0.11218743 80 jmlr-2010-On-Line Sequential Bin Packing
Author: András György, Gábor Lugosi, György Ottucsàk
Abstract: We consider a sequential version of the classical bin packing problem in which items are received one by one. Before the size of the next item is revealed, the decision maker needs to decide whether the next item is packed in the currently open bin or the bin is closed and a new bin is opened. If the new item does not fit, it is lost. If a bin is closed, the remaining free space in the bin accounts for a loss. The goal of the decision maker is to minimize the loss accumulated over n periods. We present an algorithm that has a cumulative loss not much larger than any strategy in a finite class of reference strategies for any sequence of items. Special attention is payed to reference strategies that use a fixed threshold at each step to decide whether a new bin is opened. Some positive and negative results are presented for this case. Keywords: bin packing, on-line learning, prediction with expert advice
5 0.08833988 2 jmlr-2010-A Convergent Online Single Time Scale Actor Critic Algorithm
Author: Dotan Di Castro, Ron Meir
Abstract: Actor-Critic based approaches were among the first to address reinforcement learning in a general setting. Recently, these algorithms have gained renewed interest due to their generality, good convergence properties, and possible biological relevance. In this paper, we introduce an online temporal difference based actor-critic algorithm which is proved to converge to a neighborhood of a local maximum of the average reward. Linear function approximation is used by the critic in order estimate the value function, and the temporal difference signal, which is passed from the critic to the actor. The main distinguishing feature of the present convergence proof is that both the actor and the critic operate on a similar time scale, while in most current convergence proofs they are required to have very different time scales in order to converge. Moreover, the same temporal difference signal is used to update the parameters of both the actor and the critic. A limitation of the proposed approach, compared to results available for two time scale convergence, is that convergence is guaranteed only to a neighborhood of an optimal value, rather to an optimal value itself. The single time scale and identical temporal difference signal used by the actor and the critic, may provide a step towards constructing more biologically realistic models of reinforcement learning in the brain. Keywords: actor critic, single time scale convergence, temporal difference
6 0.079686105 31 jmlr-2010-Dual Averaging Methods for Regularized Stochastic Learning and Online Optimization
7 0.079193316 75 jmlr-2010-Mean Field Variational Approximation for Continuous-Time Bayesian Networks
8 0.076992311 4 jmlr-2010-A Generalized Path Integral Control Approach to Reinforcement Learning
9 0.067857884 81 jmlr-2010-On Finding Predictors for Arbitrary Families of Processes
10 0.066169649 95 jmlr-2010-Rademacher Complexities and Bounding the Excess Risk in Active Learning
11 0.061931189 76 jmlr-2010-Message-passing for Graph-structured Linear Programs: Proximal Methods and Rounding Schemes
12 0.059505332 37 jmlr-2010-Evolving Static Representations for Task Transfer
13 0.058098506 106 jmlr-2010-Stability Bounds for Stationary φ-mixing and β-mixing Processes
14 0.054588456 64 jmlr-2010-Learning Non-Stationary Dynamic Bayesian Networks
15 0.053149413 1 jmlr-2010-A Comparison of Optimization Methods and Software for Large-scale L1-regularized Linear Classification
16 0.051439665 51 jmlr-2010-Importance Sampling for Continuous Time Bayesian Networks
17 0.048096843 93 jmlr-2010-PyBrain
18 0.043541029 58 jmlr-2010-Kronecker Graphs: An Approach to Modeling Networks
19 0.041774422 109 jmlr-2010-Stochastic Composite Likelihood
20 0.040326715 25 jmlr-2010-Composite Binary Losses
topicId topicWeight
[(0, -0.164), (1, -0.121), (2, 0.072), (3, -0.205), (4, 0.056), (5, -0.088), (6, -0.025), (7, 0.173), (8, 0.192), (9, 0.196), (10, 0.317), (11, 0.049), (12, 0.107), (13, -0.206), (14, 0.076), (15, 0.193), (16, 0.129), (17, -0.027), (18, 0.024), (19, -0.041), (20, -0.102), (21, -0.064), (22, 0.086), (23, 0.133), (24, -0.043), (25, -0.019), (26, -0.099), (27, 0.086), (28, 0.035), (29, -0.05), (30, -0.01), (31, -0.108), (32, -0.116), (33, -0.033), (34, -0.053), (35, 0.08), (36, 0.014), (37, -0.092), (38, -0.0), (39, -0.076), (40, -0.018), (41, 0.11), (42, 0.043), (43, -0.005), (44, -0.096), (45, -0.071), (46, 0.003), (47, -0.029), (48, -0.014), (49, -0.034)]
simIndex simValue paperId paperTitle
same-paper 1 0.96561033 79 jmlr-2010-Near-optimal Regret Bounds for Reinforcement Learning
Author: Thomas Jaksch, Ronald Ortner, Peter Auer
Abstract: For undiscounted reinforcement learning in Markov decision processes (MDPs) we consider the total regret of a learning algorithm with respect to an optimal policy. In order to describe the transition structure of an MDP we propose a new parameter: An MDP has diameter D if for any pair of states s, s′ there is a policy which moves from s to s′ in at most D steps (on average). √ ˜ We present a reinforcement learning algorithm with total regret O(DS AT ) after T steps for any unknown MDP with S states, A actions per state, and diameter D. A corresponding lower bound of √ Ω( DSAT ) on the total regret of any learning algorithm is given as well. These results are complemented by a sample complexity bound on the number of suboptimal steps taken by our algorithm. This bound can be used to achieve a (gap-dependent) regret bound that is logarithmic in T . Finally, we also consider a setting where the MDP is allowed to change a fixed number of ℓ times. We present a modification of our algorithm that is able to deal with this setting and show a √ ˜ regret bound of O(ℓ1/3 T 2/3 DS A). Keywords: undiscounted reinforcement learning, Markov decision process, regret, online learning, sample complexity
2 0.6561361 97 jmlr-2010-Regret Bounds and Minimax Policies under Partial Monitoring
Author: Jean-Yves Audibert, Sébastien Bubeck
Abstract: This work deals with four classical prediction settings, namely full information, bandit, label efficient and bandit label efficient as well as four different notions of regret: pseudo-regret, expected regret, high probability regret and tracking the best expert regret. We introduce a new forecaster, INF (Implicitly Normalized Forecaster) based on an arbitrary function ψ for which we propose a unified γ analysis of its pseudo-regret in the four games we consider. In particular, for ψ(x) = exp(ηx) + K , INF reduces to the classical exponentially weighted average forecaster and our analysis of the pseudo-regret recovers known results while for the expected regret we slightly tighten the bounds. γ η q On the other hand with ψ(x) = −x + K , which defines a new forecaster, we are able to remove the extraneous logarithmic factor in the pseudo-regret bounds for bandits games, and thus fill in a long open gap in the characterization of the minimax rate for the pseudo-regret in the bandit game. We also provide high probability bounds depending on the cumulative reward of the optimal action. Finally, we consider the stochastic bandit game, and prove that an appropriate modification of the upper confidence bound policy UCB1 (Auer et al., 2002a) achieves the distribution-free optimal rate while still having a distribution-dependent rate logarithmic in the number of plays. Keywords: Bandits (adversarial and stochastic), regret bound, minimax rate, label efficient, upper confidence bound (UCB) policy, online learning, prediction with limited feedback.
3 0.43976742 2 jmlr-2010-A Convergent Online Single Time Scale Actor Critic Algorithm
Author: Dotan Di Castro, Ron Meir
Abstract: Actor-Critic based approaches were among the first to address reinforcement learning in a general setting. Recently, these algorithms have gained renewed interest due to their generality, good convergence properties, and possible biological relevance. In this paper, we introduce an online temporal difference based actor-critic algorithm which is proved to converge to a neighborhood of a local maximum of the average reward. Linear function approximation is used by the critic in order estimate the value function, and the temporal difference signal, which is passed from the critic to the actor. The main distinguishing feature of the present convergence proof is that both the actor and the critic operate on a similar time scale, while in most current convergence proofs they are required to have very different time scales in order to converge. Moreover, the same temporal difference signal is used to update the parameters of both the actor and the critic. A limitation of the proposed approach, compared to results available for two time scale convergence, is that convergence is guaranteed only to a neighborhood of an optimal value, rather to an optimal value itself. The single time scale and identical temporal difference signal used by the actor and the critic, may provide a step towards constructing more biologically realistic models of reinforcement learning in the brain. Keywords: actor critic, single time scale convergence, temporal difference
4 0.43433556 80 jmlr-2010-On-Line Sequential Bin Packing
Author: András György, Gábor Lugosi, György Ottucsàk
Abstract: We consider a sequential version of the classical bin packing problem in which items are received one by one. Before the size of the next item is revealed, the decision maker needs to decide whether the next item is packed in the currently open bin or the bin is closed and a new bin is opened. If the new item does not fit, it is lost. If a bin is closed, the remaining free space in the bin accounts for a loss. The goal of the decision maker is to minimize the loss accumulated over n periods. We present an algorithm that has a cumulative loss not much larger than any strategy in a finite class of reference strategies for any sequence of items. Special attention is payed to reference strategies that use a fixed threshold at each step to decide whether a new bin is opened. Some positive and negative results are presented for this case. Keywords: bin packing, on-line learning, prediction with expert advice
5 0.35886434 37 jmlr-2010-Evolving Static Representations for Task Transfer
Author: Phillip Verbancsics, Kenneth O. Stanley
Abstract: An important goal for machine learning is to transfer knowledge between tasks. For example, learning to play RoboCup Keepaway should contribute to learning the full game of RoboCup soccer. Previous approaches to transfer in Keepaway have focused on transforming the original representation to fit the new task. In contrast, this paper explores the idea that transfer is most effective if the representation is designed to be the same even across different tasks. To demonstrate this point, a bird’s eye view (BEV) representation is introduced that can represent different tasks on the same two-dimensional map. For example, both the 3 vs. 2 and 4 vs. 3 Keepaway tasks can be represented on the same BEV. Yet the problem is that a raw two-dimensional map is high-dimensional and unstructured. This paper shows how this problem is addressed naturally by an idea from evolutionary computation called indirect encoding, which compresses the representation by exploiting its geometry. The result is that the BEV learns a Keepaway policy that transfers without further learning or manipulation. It also facilitates transferring knowledge learned in a different domain, Knight Joust, into Keepaway. Finally, the indirect encoding of the BEV means that its geometry can be changed without altering the solution. Thus static representations facilitate several kinds of transfer.
6 0.32181418 12 jmlr-2010-Analysis of Multi-stage Convex Relaxation for Sparse Regularization
7 0.26471722 81 jmlr-2010-On Finding Predictors for Arbitrary Families of Processes
8 0.23125264 31 jmlr-2010-Dual Averaging Methods for Regularized Stochastic Learning and Online Optimization
9 0.22942883 75 jmlr-2010-Mean Field Variational Approximation for Continuous-Time Bayesian Networks
10 0.22364879 4 jmlr-2010-A Generalized Path Integral Control Approach to Reinforcement Learning
11 0.22100943 98 jmlr-2010-Regularized Discriminant Analysis, Ridge Regression and Beyond
12 0.21191463 106 jmlr-2010-Stability Bounds for Stationary φ-mixing and β-mixing Processes
13 0.21120851 95 jmlr-2010-Rademacher Complexities and Bounding the Excess Risk in Active Learning
14 0.20309608 64 jmlr-2010-Learning Non-Stationary Dynamic Bayesian Networks
15 0.19439191 93 jmlr-2010-PyBrain
16 0.18661144 14 jmlr-2010-Approximate Riemannian Conjugate Gradient Learning for Fixed-Form Variational Bayes
18 0.1727961 25 jmlr-2010-Composite Binary Losses
19 0.17183699 20 jmlr-2010-Chromatic PAC-Bayes Bounds for Non-IID Data: Applications to Ranking and Stationary β-Mixing Processes
20 0.1686206 76 jmlr-2010-Message-passing for Graph-structured Linear Programs: Proximal Methods and Rounding Schemes
topicId topicWeight
[(3, 0.026), (4, 0.011), (8, 0.016), (15, 0.015), (21, 0.03), (32, 0.053), (36, 0.063), (37, 0.028), (38, 0.491), (73, 0.011), (75, 0.091), (85, 0.046)]
simIndex simValue paperId paperTitle
same-paper 1 0.74099034 79 jmlr-2010-Near-optimal Regret Bounds for Reinforcement Learning
Author: Thomas Jaksch, Ronald Ortner, Peter Auer
Abstract: For undiscounted reinforcement learning in Markov decision processes (MDPs) we consider the total regret of a learning algorithm with respect to an optimal policy. In order to describe the transition structure of an MDP we propose a new parameter: An MDP has diameter D if for any pair of states s, s′ there is a policy which moves from s to s′ in at most D steps (on average). √ ˜ We present a reinforcement learning algorithm with total regret O(DS AT ) after T steps for any unknown MDP with S states, A actions per state, and diameter D. A corresponding lower bound of √ Ω( DSAT ) on the total regret of any learning algorithm is given as well. These results are complemented by a sample complexity bound on the number of suboptimal steps taken by our algorithm. This bound can be used to achieve a (gap-dependent) regret bound that is logarithmic in T . Finally, we also consider a setting where the MDP is allowed to change a fixed number of ℓ times. We present a modification of our algorithm that is able to deal with this setting and show a √ ˜ regret bound of O(ℓ1/3 T 2/3 DS A). Keywords: undiscounted reinforcement learning, Markov decision process, regret, online learning, sample complexity
Author: Constantin F. Aliferis, Alexander Statnikov, Ioannis Tsamardinos, Subramani Mani, Xenofon D. Koutsoukos
Abstract: We present an algorithmic framework for learning local causal structure around target variables of interest in the form of direct causes/effects and Markov blankets applicable to very large data sets with relatively small samples. The selected feature sets can be used for causal discovery and classification. The framework (Generalized Local Learning, or GLL) can be instantiated in numerous ways, giving rise to both existing state-of-the-art as well as novel algorithms. The resulting algorithms are sound under well-defined sufficient conditions. In a first set of experiments we evaluate several algorithms derived from this framework in terms of predictivity and feature set parsimony and compare to other local causal discovery methods and to state-of-the-art non-causal feature selection methods using real data. A second set of experimental evaluations compares the algorithms in terms of ability to induce local causal neighborhoods using simulated and resimulated data and examines the relation of predictivity with causal induction performance. Our experiments demonstrate, consistently with causal feature selection theory, that local causal feature selection methods (under broad assumptions encompassing appropriate family of distributions, types of classifiers, and loss functions) exhibit strong feature set parsimony, high predictivity and local causal interpretability. Although non-causal feature selection methods are often used in practice to shed light on causal relationships, we find that they cannot be interpreted causally even when they achieve excellent predictivity. Therefore we conclude that only local causal techniques should be used when insight into causal structure is sought. In a companion paper we examine in depth the behavior of GLL algorithms, provide extensions, and show
3 0.32623792 97 jmlr-2010-Regret Bounds and Minimax Policies under Partial Monitoring
Author: Jean-Yves Audibert, Sébastien Bubeck
Abstract: This work deals with four classical prediction settings, namely full information, bandit, label efficient and bandit label efficient as well as four different notions of regret: pseudo-regret, expected regret, high probability regret and tracking the best expert regret. We introduce a new forecaster, INF (Implicitly Normalized Forecaster) based on an arbitrary function ψ for which we propose a unified γ analysis of its pseudo-regret in the four games we consider. In particular, for ψ(x) = exp(ηx) + K , INF reduces to the classical exponentially weighted average forecaster and our analysis of the pseudo-regret recovers known results while for the expected regret we slightly tighten the bounds. γ η q On the other hand with ψ(x) = −x + K , which defines a new forecaster, we are able to remove the extraneous logarithmic factor in the pseudo-regret bounds for bandits games, and thus fill in a long open gap in the characterization of the minimax rate for the pseudo-regret in the bandit game. We also provide high probability bounds depending on the cumulative reward of the optimal action. Finally, we consider the stochastic bandit game, and prove that an appropriate modification of the upper confidence bound policy UCB1 (Auer et al., 2002a) achieves the distribution-free optimal rate while still having a distribution-dependent rate logarithmic in the number of plays. Keywords: Bandits (adversarial and stochastic), regret bound, minimax rate, label efficient, upper confidence bound (UCB) policy, online learning, prediction with limited feedback.
4 0.28316239 2 jmlr-2010-A Convergent Online Single Time Scale Actor Critic Algorithm
Author: Dotan Di Castro, Ron Meir
Abstract: Actor-Critic based approaches were among the first to address reinforcement learning in a general setting. Recently, these algorithms have gained renewed interest due to their generality, good convergence properties, and possible biological relevance. In this paper, we introduce an online temporal difference based actor-critic algorithm which is proved to converge to a neighborhood of a local maximum of the average reward. Linear function approximation is used by the critic in order estimate the value function, and the temporal difference signal, which is passed from the critic to the actor. The main distinguishing feature of the present convergence proof is that both the actor and the critic operate on a similar time scale, while in most current convergence proofs they are required to have very different time scales in order to converge. Moreover, the same temporal difference signal is used to update the parameters of both the actor and the critic. A limitation of the proposed approach, compared to results available for two time scale convergence, is that convergence is guaranteed only to a neighborhood of an optimal value, rather to an optimal value itself. The single time scale and identical temporal difference signal used by the actor and the critic, may provide a step towards constructing more biologically realistic models of reinforcement learning in the brain. Keywords: actor critic, single time scale convergence, temporal difference
5 0.26618075 4 jmlr-2010-A Generalized Path Integral Control Approach to Reinforcement Learning
Author: Evangelos Theodorou, Jonas Buchli, Stefan Schaal
Abstract: With the goal to generate more scalable algorithms with higher efficiency and fewer open parameters, reinforcement learning (RL) has recently moved towards combining classical techniques from optimal control and dynamic programming with modern learning techniques from statistical estimation theory. In this vein, this paper suggests to use the framework of stochastic optimal control with path integrals to derive a novel approach to RL with parameterized policies. While solidly grounded in value function estimation and optimal control based on the stochastic Hamilton-JacobiBellman (HJB) equations, policy improvements can be transformed into an approximation problem of a path integral which has no open algorithmic parameters other than the exploration noise. The resulting algorithm can be conceived of as model-based, semi-model-based, or even model free, depending on how the learning problem is structured. The update equations have no danger of numerical instabilities as neither matrix inversions nor gradient learning rates are required. Our new algorithm demonstrates interesting similarities with previous RL research in the framework of probability matching and provides intuition why the slightly heuristically motivated probability matching approach can actually perform well. Empirical evaluations demonstrate significant performance improvements over gradient-based policy learning and scalability to high-dimensional control problems. Finally, a learning experiment on a simulated 12 degree-of-freedom robot dog illustrates the functionality of our algorithm in a complex robot learning scenario. We believe that Policy Improvement with Path Integrals (PI2 ) offers currently one of the most efficient, numerically robust, and easy to implement algorithms for RL based on trajectory roll-outs. Keywords: stochastic optimal control, reinforcement learning, parameterized policies
6 0.26214638 37 jmlr-2010-Evolving Static Representations for Task Transfer
7 0.25806993 80 jmlr-2010-On-Line Sequential Bin Packing
8 0.25350073 89 jmlr-2010-PAC-Bayesian Analysis of Co-clustering and Beyond
9 0.25041544 106 jmlr-2010-Stability Bounds for Stationary φ-mixing and β-mixing Processes
10 0.24927013 75 jmlr-2010-Mean Field Variational Approximation for Continuous-Time Bayesian Networks
11 0.24659012 114 jmlr-2010-Unsupervised Supervised Learning I: Estimating Classification and Regression Errors without Labels
12 0.24564143 46 jmlr-2010-High Dimensional Inverse Covariance Matrix Estimation via Linear Programming
13 0.24544826 74 jmlr-2010-Maximum Relative Margin and Data-Dependent Regularization
14 0.2446012 103 jmlr-2010-Sparse Semi-supervised Learning Using Conjugate Functions
16 0.2444468 52 jmlr-2010-Incremental Sigmoid Belief Networks for Grammar Learning
17 0.24296118 63 jmlr-2010-Learning Instance-Specific Predictive Models
18 0.24221018 102 jmlr-2010-Semi-Supervised Novelty Detection
19 0.24179485 51 jmlr-2010-Importance Sampling for Continuous Time Bayesian Networks
20 0.2413995 69 jmlr-2010-Lp-Nested Symmetric Distributions