jmlr jmlr2010 jmlr2010-2 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Dotan Di Castro, Ron Meir
Abstract: Actor-Critic based approaches were among the first to address reinforcement learning in a general setting. Recently, these algorithms have gained renewed interest due to their generality, good convergence properties, and possible biological relevance. In this paper, we introduce an online temporal difference based actor-critic algorithm which is proved to converge to a neighborhood of a local maximum of the average reward. Linear function approximation is used by the critic in order to estimate the value function, and the temporal difference signal, which is passed from the critic to the actor. The main distinguishing feature of the present convergence proof is that both the actor and the critic operate on a similar time scale, while in most current convergence proofs they are required to have very different time scales in order to converge. Moreover, the same temporal difference signal is used to update the parameters of both the actor and the critic. A limitation of the proposed approach, compared to results available for two time scale convergence, is that convergence is guaranteed only to a neighborhood of an optimal value, rather than to an optimal value itself. The single time scale and identical temporal difference signal used by the actor and the critic may provide a step towards constructing more biologically realistic models of reinforcement learning in the brain. Keywords: actor critic, single time scale convergence, temporal difference
Reference: text
sentIndex sentText sentNum sentScore
1 Linear function approximation is used by the critic in order to estimate the value function, and the temporal difference signal, which is passed from the critic to the actor. [sent-10, score-0.898]
2 The main distinguishing feature of the present convergence proof is that both the actor and the critic operate on a similar time scale, while in most current convergence proofs they are required to have very different time scales in order to converge. [sent-11, score-0.982]
3 Moreover, the same temporal difference signal is used to update the parameters of both the actor and the critic. [sent-12, score-0.465]
4 The single time scale and identical temporal difference signal used by the actor and the critic may provide a step towards constructing more biologically realistic models of reinforcement learning in the brain. [sent-14, score-0.513]
5 Keywords: actor critic, single time scale convergence, temporal difference 1. [sent-15, score-0.487]
6 A well-known subclass of RL approaches consists of the so-called actor-critic (AC) algorithms (Sutton and Barto, 1998), where the agent is divided into two components, an actor and a critic. [sent-20, score-0.526]
7 The critic functions as a value estimator, whereas the actor attempts to select actions based on the value estimated by the critic. [sent-21, score-0.875]
8 Instead of maintaining a separate estimate for the value for each state (or state-action pair), the agent maintains a parametrized policy function. [sent-29, score-0.279]
9 A more general approach based on sensitivity analysis, which includes policy gradient methods as well as non-parametric average reward functions, has been discussed in depth in the recent manuscript by Cao (2007). [sent-33, score-0.268]
10 As far as we are aware, all the convergence results for these algorithms are based on two time scales, specifically, the actor is assumed to update its internal parameters on a much slower time scale than the one used by the critic. [sent-36, score-0.494]
11 The intuitive reason for this time scale separation is clear, since the actor improves its policy based on the critic’s estimates. [sent-37, score-0.554]
12 It can be expected that rapid change of the policy parameters may not allow the critic to effectively evaluate the value function, which may lead to instability when used by the actor in order to re-update its parameters. [sent-38, score-0.954]
13 1 Direct Policy Gradient Algorithms Direct policy gradient algorithms, employing agents which consist of an actor only, typically estimate a noisy gradient of the average reward, and are relatively close in their characteristics to AC algorithms. [sent-64, score-0.59]
14 The gradient estimate is based on an estimate of the state values which the actor estimates while interacting with the environment. [sent-67, score-0.489]
15 If the actor returns to a sequence of previously visited states, it re-estimates the states’ values, not taking into account its previous visits. [sent-68, score-0.434]
16 2 Actor Critic Algorithms As stated in Section 1, the convergence proofs of which we are aware for AC algorithms are based on two time scale stochastic approximation (Borkar, 1997), where the actor is assumed to operate on a time scale which is much slower than that used by the critic. [sent-77, score-0.516]
17 In two of their algorithms (Algorithms 3 and 6), parametrized policy based actors were used while the critic was based on a lookup table. [sent-79, score-0.561]
18 The information passed from the critic to the actor is the critic’s action-value function, and the critic’s basis functions, which are explicitly used by the actor. [sent-82, score-0.879]
19 A drawback of the algorithm is that the actor and the critic must share the information regarding the actor’s parameters. [sent-84, score-0.856]
20 In this work the actor uses a parametrized policy function while the critic uses a function approximation for the state evaluation. [sent-88, score-1.021]
21 The critic passes the TD(0) signal to the actor, and based on it the actor estimates the average reward gradient. [sent-89, score-1.431]
22 1 The Dynamics of the Environment and of the Actor We consider an agent, composed of an actor and a critic, interacting with an environment. [sent-101, score-0.434]
23 For each state x ∈ X the agent receives a corresponding reward r(x), which may be deterministic or random. [sent-105, score-0.259]
24 The average reward per stage of an agent which traverses an MC starting from an initial state x ∈ X is defined by J(x, θ) ≜ lim_{T→∞} E[ (1/T) ∑_{n=0}^{T−1} r(x_n) | x_0 = x, θ ], where E[·|θ] denotes the expectation under the probability measure P(θ), and x_n is the state at time n. [sent-133, score-0.588]
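To make the limit above concrete, here is a minimal finite-horizon Monte Carlo sketch of J(x, θ); the `reward` and `transition` callables are hypothetical placeholders for the MC dynamics induced by the policy parameter θ, not objects defined in the paper.

```python
def estimate_average_reward(reward, transition, x0, theta, T=100_000):
    """Finite-horizon estimate of J(x0, theta) = lim_T (1/T) sum_{n<T} r(x_n); a sketch only."""
    x, total = x0, 0.0
    for _ in range(T):
        total += reward(x)        # r(x_n), the reward at the current state
        x = transition(x, theta)  # sample x_{n+1} from the chain induced by mu(u|x, theta)
    return total / T
```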
25 It is easy to show that under Assumption 2, the average reward per stage can be expressed by η(θ) = lim_{T→∞} E[ (1/T) ∑_{n=0}^{T−1} r(x_n) | x_0 = x∗, θ ]. [sent-151, score-0.254]
26 Next, we define the differential value function of state x ∈ X which represents the average differential reward the agent receives upon starting from a state x and reaching the recurrent state x∗ for the first time. [sent-152, score-0.414]
27 3 The Critic’s Dynamics The critic maintains an estimate of the environmental state values. [sent-171, score-0.47]
28 We note that h(x, θ) is a function of θ and is induced by the actor policy µ(u|x, θ), while h̃(x, w) is a function of the state x ∈ X and a parameter w ∈ R^L, that is, a function of w. [sent-174, score-0.558]
29 The actor chooses an action, un , according to the parametrized policy µ(u|x, θ). [sent-182, score-0.624]
30 Using the TD signal, the critic improves its estimate of the environment state values while the actor improves its policy. [sent-184, score-0.925]
31 The algorithm is composed of three iterates, one for the actor and two for the critic. [sent-188, score-0.434]
32 The actor maintains the iterate of the parameter vector θ corresponding to the policy µ(u|x, θ), where its objective is to find the optimal value of θ, denoted by θ∗ , which maximizes η(θ). [sent-189, score-0.6]
33 The gradient with respect to θ of the average reward per stage can be expressed by ∇_θ η(θ) = ∑_{x,y∈X, u∈U} P(x, u, y, θ) ψ(x, u, θ) d(x, y, θ), (4) where P(x, u, y, θ) is the probability Pr(x_n = x, u_n = u, x_{n+1} = y) subject to the policy parameter θ. [sent-211, score-0.322]
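For readers who want ψ in closed form: ψ(x, u, θ) is the likelihood ratio ∇_θ µ(u|x, θ)/µ(u|x, θ) = ∇_θ log µ(u|x, θ) (see the expression for g_n(θ_n, x_n) further below). As a worked example under an assumed Gibbs (softmax) parametrization, which this excerpt does not itself specify, the likelihood ratio reduces to a feature difference:

```latex
% Assumed Gibbs (softmax) policy with features \xi(x,u); not a choice made in this excerpt.
\mu(u|x,\theta) = \frac{e^{\theta^\top \xi(x,u)}}{\sum_{u'} e^{\theta^\top \xi(x,u')}}
\quad\Longrightarrow\quad
\psi(x,u,\theta) = \nabla_\theta \log \mu(u|x,\theta)
= \xi(x,u) - \sum_{u'} \mu(u'|x,\theta)\, \xi(x,u').
```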
34 2 The Updates Performed by the Critic and the Actor We note that the following derivation regarding the critic is similar in some respects to the derivation in Section 6. [sent-213, score-0.422]
35 Thus, the explicit gradient descent procedure (8) is w_{n+1} = w_n − γ_n Φ′ Π(θ) (Φ w_n − h(θ)). [sent-255, score-0.319]
36 We denote by ĥ_n(x) the estimate of h(x, θ) at time step n; thus (9) becomes w_{n+1} = w_n + γ_n Φ′ Π(θ) (ĥ_n − Φ w_n). [sent-260, score-0.352]
37 This method devises an estimator which is based on previous estimates of h̃(w), that is, w_n, and is based also on the environmental reward r(x_n). [sent-268, score-0.431]
38 The idea of bootstrapping is apparent in (11): the predictor for the differential value of the state x_n at the (n+1)-th time step is based partially on the previous estimates through ĥ_n(x_{n+k+1}), and partially on new information, that is, the reward r(x_{n+m}). [sent-272, score-0.421]
39 In our setting, the non-discounted case, the analogous equations for the critic are w_{n+1} = w_n + γ_n d̃(x_n, x_{n+1}, w_n) e_n, d̃(x_n, x_{n+1}, w_n) = r(x_n) − η̃_n + h̃(x_{n+1}, w_n) − h̃(x_n, w_n), and e_n = λ e_{n−1} + φ(x_n). [sent-283, score-0.94]
40 Similarly to the critic, the actor executes a stochastic gradient ascent step in order to find a local maximum of the average reward per stage η(θ). [sent-285, score-0.636]
41 Therefore, θ_{n+1} = θ_n + γ_n ψ(x_n, u_n, θ_n) d̃_n(x_n, x_{n+1}, w_n), where ψ is defined in Section 4. [sent-286, score-0.341]
42 • An actor with a parametrized policy µ(u|x, θ) satisfying Assumptions 3 and 4. [sent-307, score-0.594]
43 • A critic with a linear basis for h̃(w), that is, {φ_i}_{i=1}^L, satisfying Assumption 5. [sent-308, score-0.443]
44 For step n = 0: • Initiate the critic and the actor variables: η̃_0 = 0, w_0 = 0, e_0 = 0, θ_0 = 0. [sent-312, score-0.856]
45 Critic: Calculate the estimated TD and eligibility trace: η̃_{n+1} = η̃_n + γ_n Γ_η (r(x_n) − η̃_n) (13), h̃(x, w_n) = w_n′ φ(x), d̃(x_n, x_{n+1}, w_n) = r(x_n) − η̃_n + h̃(x_{n+1}, w_n) − h̃(x_n, w_n), e_n = λ e_{n−1} + φ(x_n). [sent-316, score-1.221]
46 Set w_{n+1} = w_n + γ_n Γ_w d̃(x_n, x_{n+1}, w_n) e_n (14). Actor: set θ_{n+1} = θ_n + γ_n ψ(x_n, u_n, θ_n) d̃_n(x_n, x_{n+1}, w_n) (15), and project each component of w_{n+1} onto H (see Definition 6). [sent-317, score-0.956]
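To make iterates (13)-(15) concrete, the following is a minimal Python sketch of one step of Algorithm 1. The helper callables `features` (returning φ(x)), `score` (returning ψ(x, u, θ)), and the box bound `H_BOUND` used for the projection onto H are hypothetical placeholders, not definitions from the paper; only the update equations follow the text. Note that the critic and the actor share the same TD signal d and the same step size γ_n, which is the point of the single-time-scale algorithm.

```python
import numpy as np

def single_time_scale_step(x, u, r, x_next, w, theta, eta, e,
                           gamma_n, Gamma_eta, Gamma_w, lam,
                           features, score, H_BOUND):
    """One iteration of updates (13)-(15); a sketch, not the authors' implementation."""
    phi, phi_next = features(x), features(x_next)
    eta_new = eta + gamma_n * Gamma_eta * (r - eta)        # (13): average reward estimate
    d = r - eta + w @ phi_next - w @ phi                   # TD signal d~(x_n, x_{n+1}, w_n)
    e_new = lam * e + phi                                  # eligibility trace
    w_new = w + gamma_n * Gamma_w * d * e_new              # (14): critic, step size gamma_n
    theta_new = theta + gamma_n * score(x, u, theta) * d   # (15): actor, same step size gamma_n
    w_new = np.clip(w_new, -H_BOUND, H_BOUND)              # project onto H (box form is an assumption)
    return w_new, theta_new, eta_new, e_new
```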
47 The use of two time scales stems from the need of the critic to provide an accurate estimate of the state values, as in the work of Bhatnagar et al. [sent-361, score-0.467]
48 (2009), or the state-action values, as in the work of Konda and Tsitsiklis (2003) before the actor uses them. [sent-362, score-0.434]
49 We have γ_n^a = γ_n for the actor iterate, γ_n^{c,η} = Γ_η γ_n for the critic’s η̃_n iterate, and γ_n^{c,w} = Γ_w γ_n for the critic’s w iterate. [sent-364, score-0.434]
50 Due to the single time scale, Algorithm 1 has the potential to converge faster than algorithms based on two time scales, since both the actor and the critic may operate on the fast time scale. [sent-366, score-0.856]
51 (2009) and the present work, as compared to (Konda and Tsitsiklis, 2003), is the information passed from the critic to the actor. [sent-375, score-0.445]
52 For each state-action pair, the reward is distributed normally with the state’s reward as mean and variance σ^2. [sent-392, score-0.282]
53 Bhatnagar et al. (2009) used critic steps γ_n^{c,w} and γ_n^{c,η}, and actor steps γ_n^a, where γ_n^{c,w} = 100/(1000 + n^{2/3}), γ_n^{c,η} = 0. [sent-416, score-0.856]
54 Note that an agent with a uniform action selection policy will attain an average reward per stage of zero in these problems. [sent-430, score-0.363]
55 More concretely, using the present approach may lead to faster initial convergence due to the single time scale setting, which allows both the actor and the critic to evolve rapidly, while switching smoothly to a two time scales approach as in Bhatnagar et al. [sent-439, score-0.935]
56 Discussion and Future Work We have introduced an algorithm where the information passed from the critic to the actor is the temporal difference signal, while the critic applies a TD(λ) procedure. [sent-445, score-1.332]
57 A policy gradient approach was used in order to update the actor’s parameters, based on a critic using linear function approximation. [sent-446, score-0.549]
58 The main contribution of this work is a convergence proof in a situation where both the actor and the critic operate on a single time scale. [sent-447, score-0.503]
59 Recent evidence suggested that the dorsal and ventral striatum may implement the actor and the critic, respectively (Daw et al.). [sent-499, score-0.434]
60 Define the conditioned average iterate g_n(y_n, x_n) ≜ E[Y_n | F_n] and the corresponding martingale difference noise δM_n ≜ Y_n − E[Y_n | F_n]. [sent-584, score-0.501]
61 Thus, we can write the iteration as y_{n+1} = y_n + γ_n (g_n(y_n, x_n) + δM_n + Z_n), where Z_n is a reflection term which forces the iterate to the nearest point in the set H whenever the iterate leaves it (Kushner and Yin, 1997). [sent-585, score-0.357]
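A compact way to read the reflection term Z_n is that the noisy step is simply projected back onto H, with Z_n accounting for the correction; the sketch below uses a box-shaped H, which is an assumption made only for illustration.

```python
import numpy as np

def projected_sa_step(y, gamma_n, noisy_increment, h_low, h_high):
    """y_{n+1} = y_n + gamma_n (g_n(y_n, x_n) + dM_n + Z_n), with Z_n implicit in the projection."""
    y_free = y + gamma_n * noisy_increment(y)   # g_n(y_n, x_n) + delta M_n
    y_next = np.clip(y_free, h_low, h_high)     # nearest point of the box H
    z_n = (y_next - y_free) / gamma_n           # the reflection term of the text
    return y_next, z_n
```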
62 (b) gn (yn , x) is continuous in yn for each x and n. [sent-598, score-0.326]
63 (e) There are measurable and non-negative functions ρ_3(y) and ρ_{n4}(x) such that ||g_n(y_n, x)|| ≤ ρ_3(y) ρ_{n4}(x), where ρ_3(y) is bounded on each bounded y-set, and for each µ > 0 we have lim_{τ→0} lim_{n→∞} Pr( sup_{j≥n} ∑_{i=m(jτ)}^{m(jτ+τ)−1} γ_i ρ_{n4}(x_i) ≥ µ ) = 0. [sent-601, score-0.513]
64 Assumption 7.4 is reminiscent of ergodicity, which is used to replace the state-dependent function g_n(·, ·) with the state-independent function ḡ(·), whereas Assumption 7. [sent-607, score-0.252]
65 For future purposes, we express Algorithm 1 using the augmented parameter vector y_n ≜ (θ_n′, w_n′, η̃_n′)′, with θ_n ∈ R^K, w_n ∈ R^L, η̃_n ∈ R. [sent-621, score-0.49]
66 The corresponding sub-vectors of ḡ(y_n) will be denoted by ḡ(y_n) = (ḡ(θ_n)′, ḡ(w_n)′, ḡ(η̃_n)′)′ ∈ R^{K+L+1}, and similarly g_n(y_n, x_n) = (g_n(θ_n, x_n)′, g_n(w_n, x_n)′, g_n(η̃_n, x_n)′)′ ∈ R^{K+L+1}. [sent-623, score-1.664]
67 We begin by examining the components of g_n(y_n, x_n) and ḡ(y_n). [sent-624, score-0.416]
68 The iterate g_n(η̃_n, x_n) is g_n(η̃_n, x_n) = E[Γ_η (r(x_n) − η̃_n) | F_n] = Γ_η (r(x_n) − η̃_n), and since there is no dependence on x_n we have also ḡ(η̃_n) = Γ_η (η(θ) − η̃_n). [sent-625, score-1.068]
69 With some further algebra we can express this using (16): ḡ(w_n) = A(θ_n) w_n + b(θ_n) + G(θ_n)(η(θ_n) − η̃_n). [sent-628, score-0.29]
70 Assumption 7.3 requires the continuity of g_n(y_n, x_n) for each n and x_n. [sent-650, score-0.606]
71 Lemma 14 The function g_n(η̃_n, x_n) is a continuous function of η̃_n for each n and x_n. [sent-652, score-0.606]
72 Proof Since g_n(η̃_n, x_n) = Γ_η (r(x_n) − η̃_n), the claim follows. [sent-653, score-0.416]
73 Lemma 15 The function g_n(w_n, x_n) is a continuous function of η̃_n, w_n, and θ_n for each n and x_n. [sent-654, score-0.896]
74 Proof The function is g_n(w_n, x_n) = Γ_w ∑_{k=0}^{∞} λ^k φ(x_{n−k}) ( r(x_n) − η̃_n + ∑_{y∈X} P(y|x_n, θ_n) φ(y)′ w_n − φ(x_n)′ w_n ). [sent-655, score-0.996]
75 Thus it is continuous in θ_n by Assumption 3, and thus g_n(w_n, x_n) is continuous in η̃_n and θ_n, and the lemma follows. [sent-657, score-0.488]
76 Lemma 16 The function g_n(θ_n, x_n) is a continuous function of η̃_n, w_n, and θ_n for each n and x_n. [sent-658, score-0.896]
77 Proof By definition, the function g_n(θ_n, x_n) is g_n(θ_n, x_n) = E[ d̃(x_n, x_{n+1}, w_n) ψ(x_n, u_n, θ_n) | F_n ] = E[ (∇_θ µ(u_n|x_n, θ_n) / µ(u_n|x_n, θ_n)) ( r(x_n) − η̃_n + ∑_{y∈X} P(y|x_n, θ_n) φ(y)′ w_n − φ(x_n)′ w_n ) | F_n ]. Using similar arguments to Lemma 15 the claim holds. [sent-659, score-1.753]
78 where t_m is the m-th time the agent hits the state x∗. [sent-673, score-0.254]
79 Mathematically, we can define this series recursively by t_0 = 0, t_{m+1} = inf{n | x_n = x∗, n > t_m}, and T_m ≜ t_{m+1} − t_m. [sent-674, score-0.272]
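As a small illustration of this regenerative structure, the hitting times t_m and cycle lengths T_m can be read off a recorded state trajectory; the trajectory list and the recurrent state x_star below are hypothetical inputs.

```python
def regenerative_cycles(trajectory, x_star):
    """Return hitting times t_m of x_star (with t_0 = 0) and cycle lengths T_m = t_{m+1} - t_m."""
    hits = [0] + [n for n, x in enumerate(trajectory) if n > 0 and x == x_star]
    cycle_lengths = [t_next - t for t, t_next in zip(hits, hits[1:])]
    return hits, cycle_lengths
```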
80 Using Doob’s martingale inequality we have Pr( sup_{k≥n} ||∑_{m=ρ(n)}^{ρ(k)} ∑_{j∈T_m} γ_j (g_n(y, x_j) − ḡ(y))|| ≥ µ ) ≤ lim_{n→∞} E[ ||∑_{m=ρ(n)}^{∞} ∑_{j∈T_m} γ_j (g_n(y, x_j) − ḡ(y))||^2 ] / µ^2 ≤ lim_{n→∞} ∑_{m=ρ(n)}^{∞} γ̄_m^2 B_g B_T / µ^2 = 0. [sent-709, score-0.337]
81 Proof The claim is immediate from the fact that E[||δM_n||^2] = E[||Y_n − g_n(y_n, x_n)||^2] ≤ 2E[||Y_n||^2 + ||g_n(y_n, x_n)||^2], and from Lemma 11, Lemma 12, and Lemma 13. [sent-717, score-0.832]
82 If w_n is a martingale sequence then Pr( sup_{m≥0} |w_m| ≥ µ ) ≤ lim_{n→∞} E[|w_n|^2] / µ^2. [sent-724, score-0.329]
83 where ρ_3(y) is bounded on each bounded y-set, and for each µ > 0 we have lim_{τ→0} lim_{n→∞} Pr( sup_{j≥n} ∑_{i=m(jτ)}^{m(jτ+τ)−1} γ_i ρ_{n4}(x_i) ≥ µ ) = 0. [sent-725, score-0.287]
84 Lemma 19 If gn (y, x) is uniformly bounded for each y, x and n, then Assumption 7. [sent-729, score-0.287]
85 Thus lim_{τ→0} lim_{n→∞} Pr( sup_{j≥n} ∑_{i=m(jτ)}^{m(jτ+τ)−1} γ_i ρ_{n4}(x_i) ≥ µ ) ≤ lim_{τ→0} lim_{n→∞} Pr( sup_{j≥n} ∑_{i=m(jτ)}^{m(jτ+τ)−1} γ_i B ≥ µ ) = lim_{τ→0} lim_{n→∞} Pr( B sup_{j≥n} ∑_{i=m(jτ)}^{m(jτ+τ)−1} γ_i ≥ µ ) ≤ lim_{τ→0} Pr(Bτ ≥ µ) = 0. [sent-734, score-0.732]
86 Based on Lemma 19, we are left with proving that gn (y, x) is uniformly bounded. [sent-735, score-0.252]
87 Lemma 20 The function gn (y, x) is uniformly bounded for all n. [sent-737, score-0.287]
88 Proof We examine the components of gn (yn , xn ). [sent-738, score-0.416]
89 In (20) we showed that g_n(η̃_n, x_n) = Γ_η (r(x_n) − η̃_n). [sent-739, score-0.416]
90 Since both r(x_n) and η̃_n are bounded by Assumption 1 and Lemma 11 respectively, we have a uniform bound on g_n(η̃_n, x_n). [sent-740, score-0.451]
91 Recalling (21) we have ||g_n(w_n, x_n)|| = ||Γ_w ∑_{k=0}^{∞} λ^k φ(x_{n−k}) ( r(x_n) − η̃_n + ∑_{y∈X} P(y|x_n, θ_n) φ(y)′ w_n − φ(x_n)′ w_n )|| ≤ Γ_w (1/(1−λ)) B_φ (B_r + B_η + 2 B_φ B_w). [sent-741, score-0.996]
92 Finally, recalling (22) we have ||g_n(θ_n, x_n)|| = ||E[ d̃(x_n, x_{n+1}, w_n) ψ(x_n, u_n, θ_n) | F_n ]|| ≤ (B_r + B_η + 2 B_φ B_w) B_ψ. [sent-742, score-0.775]
93 Recalling (20) we have for g_n(η̃, x): ||g_n(η̃_1, x) − g_n(η̃_2, x)|| ≤ Γ_η ||η̃_2 − η̃_1||, thus (26) and (27) are satisfied for 1. [sent-756, score-0.678]
94 (b) Recalling (22) we have for g_n(θ, x): ||g_n(θ_1, x) − g_n(θ_2, x)|| = ||E[ d̃(x, y, w_1) ψ(x, u, θ_1) | F_n ] − E[ d̃(x, y, w_2) ψ(x, u, θ_2) | F_n ]|| ≤ ∑_{y∈X} B_w |P(y|x, θ_1) − P(y|x, θ_2)| B_ψ. [sent-760, score-0.678]
95 Since the matrix entries are uniformly bounded in θ, so is the matrix (d/dt A(θ))′ (d/dt A(θ)), and so is the largest eigenvalue of (d/dt A(θ))′ (d/dt A(θ)), which implies the uniform boundedness of ||d/dt A(θ)||_2. [sent-892, score-0.485]
96 Rearranging it we get (d/dt) X(t)^{−1} = −X(t)^{−1} ((d/dt) X(t)) X(t)^{−1}. [sent-895, score-0.308]
97 Using this identity yields ||(d/dt) A(θ)^{−1}||_2 = ||−A(θ)^{−1} ((d/dt) A(θ)) A(θ)^{−1}||_2 ≤ ||A(θ)^{−1}||_2^2 · ||(d/dt) A(θ)||_2 ≤ b_A^2 B_{A1}. [sent-896, score-0.385]
98 Examining the norm of (d/dt) w∗ yields ||(d/dt) w∗||_2 = ||(d/dt)(A(θ)^{−1} b(θ))||_2 = ||((d/dt) A(θ)^{−1}) b(θ) + A(θ)^{−1} (d/dt) b(θ)||_2 ≤ b_A^2 B_{A1} (1/(1−λ)) |X|^3 B_Φ B_r + b_A B_b̃ = B_{w1}. [sent-898, score-0.385]
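As a quick numerical sanity check of the matrix-calculus identity used in this derivation (d/dt X(t)^{−1} = −X(t)^{−1} (d/dt X(t)) X(t)^{−1}), one can compare a finite-difference derivative of the inverse with the closed form; the particular X(t) below is an arbitrary invertible example, not taken from the paper.

```python
import numpy as np

def check_inverse_derivative(t=0.3, eps=1e-6):
    """Finite-difference check of d/dt X(t)^{-1} = -X^{-1} (dX/dt) X^{-1}; arbitrary example X(t)."""
    X = lambda s: np.array([[2.0 + s, 0.5], [0.1 * s, 1.0 + s ** 2]])
    dX_dt = (X(t + eps) - X(t - eps)) / (2 * eps)
    lhs = (np.linalg.inv(X(t + eps)) - np.linalg.inv(X(t - eps))) / (2 * eps)
    rhs = -np.linalg.inv(X(t)) @ dX_dt @ np.linalg.inv(X(t))
    return np.max(np.abs(lhs - rhs))  # should be close to zero
```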
99 With some more algebra we have lim sup_{t→∞} ∑_{x,y∈X×X, u∈U} ||D_{(x,u,y)}(θ) (d(x, y, θ) − d̃(x, y, w))|| ≤ lim sup_{t→∞} ∑_{x,y∈X×X, u∈U} π(x) P(u|x, θ_n) P(y|x, u) ||ψ(x, u, θ_n)|| · |d(x, y, θ) − d̃(x, y, w)| ≤ B_Δη/Γ_η + 2 B_Δh1/Γ_w + ε_app/√(b_π) = B_Δtd1 + B_Δtd2 + B_Δtd3 ε_app. [sent-921, score-0.272]
100 Temporal difference based actor critic algorithms single time scale convergence and neural implementation. [sent-1021, score-0.916]
wordName wordTfidf (topN-words)
[('actor', 0.434), ('critic', 0.422), ('wn', 0.29), ('gn', 0.226), ('bhatnagar', 0.191), ('xn', 0.19), ('eir', 0.145), ('reward', 0.141), ('br', 0.14), ('ingle', 0.139), ('onvergent', 0.139), ('tm', 0.136), ('tsitsiklis', 0.136), ('astro', 0.124), ('td', 0.119), ('garnet', 0.119), ('ritic', 0.118), ('cale', 0.107), ('rl', 0.105), ('yn', 0.1), ('app', 0.099), ('policy', 0.098), ('lgorithm', 0.096), ('konda', 0.092), ('agent', 0.092), ('nline', 0.092), ('ime', 0.082), ('lim', 0.081), ('ode', 0.08), ('rk', 0.077), ('dt', 0.077), ('bw', 0.076), ('lemma', 0.072), ('supn', 0.066), ('fn', 0.056), ('sup', 0.055), ('un', 0.051), ('pr', 0.049), ('bertsekas', 0.047), ('jt', 0.047), ('iterate', 0.046), ('kushner', 0.045), ('bg', 0.043), ('environment', 0.043), ('parametrized', 0.041), ('trajectory', 0.041), ('marbach', 0.04), ('martingale', 0.039), ('ac', 0.039), ('boundedness', 0.039), ('sutton', 0.039), ('convergence', 0.038), ('recurrent', 0.037), ('en', 0.035), ('bounded', 0.035), ('differential', 0.033), ('yin', 0.033), ('barto', 0.033), ('stage', 0.032), ('proof', 0.031), ('temporal', 0.031), ('hn', 0.031), ('det', 0.03), ('bn', 0.029), ('gradient', 0.029), ('jl', 0.029), ('mdp', 0.029), ('ba', 0.028), ('mc', 0.028), ('assumption', 0.028), ('neighborhood', 0.027), ('eligibility', 0.026), ('mokkadem', 0.026), ('regenerative', 0.026), ('reinforcement', 0.026), ('uniformly', 0.026), ('state', 0.026), ('cm', 0.024), ('passed', 0.023), ('bt', 0.023), ('cao', 0.023), ('baxter', 0.023), ('maintains', 0.022), ('theorem', 0.022), ('scale', 0.022), ('satisfying', 0.021), ('bd', 0.021), ('iterates', 0.021), ('lyapunov', 0.02), ('basal', 0.02), ('daw', 0.02), ('dicastro', 0.02), ('dyn', 0.02), ('ganglia', 0.02), ('pelletier', 0.02), ('mn', 0.019), ('scales', 0.019), ('actions', 0.019), ('transition', 0.018), ('recalling', 0.018)]
simIndex simValue paperId paperTitle
same-paper 1 1.0000012 2 jmlr-2010-A Convergent Online Single Time Scale Actor Critic Algorithm
Author: Dotan Di Castro, Ron Meir
Abstract: Actor-Critic based approaches were among the first to address reinforcement learning in a general setting. Recently, these algorithms have gained renewed interest due to their generality, good convergence properties, and possible biological relevance. In this paper, we introduce an online temporal difference based actor-critic algorithm which is proved to converge to a neighborhood of a local maximum of the average reward. Linear function approximation is used by the critic in order to estimate the value function, and the temporal difference signal, which is passed from the critic to the actor. The main distinguishing feature of the present convergence proof is that both the actor and the critic operate on a similar time scale, while in most current convergence proofs they are required to have very different time scales in order to converge. Moreover, the same temporal difference signal is used to update the parameters of both the actor and the critic. A limitation of the proposed approach, compared to results available for two time scale convergence, is that convergence is guaranteed only to a neighborhood of an optimal value, rather than to an optimal value itself. The single time scale and identical temporal difference signal used by the actor and the critic may provide a step towards constructing more biologically realistic models of reinforcement learning in the brain. Keywords: actor critic, single time scale convergence, temporal difference
2 0.08833988 79 jmlr-2010-Near-optimal Regret Bounds for Reinforcement Learning
Author: Thomas Jaksch, Ronald Ortner, Peter Auer
Abstract: For undiscounted reinforcement learning in Markov decision processes (MDPs) we consider the total regret of a learning algorithm with respect to an optimal policy. In order to describe the transition structure of an MDP we propose a new parameter: An MDP has diameter D if for any pair of states s, s′ there is a policy which moves from s to s′ in at most D steps (on average). √ ˜ We present a reinforcement learning algorithm with total regret O(DS AT ) after T steps for any unknown MDP with S states, A actions per state, and diameter D. A corresponding lower bound of √ Ω( DSAT ) on the total regret of any learning algorithm is given as well. These results are complemented by a sample complexity bound on the number of suboptimal steps taken by our algorithm. This bound can be used to achieve a (gap-dependent) regret bound that is logarithmic in T . Finally, we also consider a setting where the MDP is allowed to change a fixed number of ℓ times. We present a modification of our algorithm that is able to deal with this setting and show a √ ˜ regret bound of O(ℓ1/3 T 2/3 DS A). Keywords: undiscounted reinforcement learning, Markov decision process, regret, online learning, sample complexity
3 0.07437814 87 jmlr-2010-Online Learning for Matrix Factorization and Sparse Coding
Author: Julien Mairal, Francis Bach, Jean Ponce, Guillermo Sapiro
Abstract: Sparse coding—that is, modelling data vectors as sparse linear combinations of basis elements—is widely used in machine learning, neuroscience, signal processing, and statistics. This paper focuses on the large-scale matrix factorization problem that consists of learning the basis set in order to adapt it to specific data. Variations of this problem include dictionary learning in signal processing, non-negative matrix factorization and sparse principal component analysis. In this paper, we propose to address these tasks with a new online optimization algorithm, based on stochastic approximations, which scales up gracefully to large data sets with millions of training samples, and extends naturally to various matrix factorization formulations, making it suitable for a wide range of learning problems. A proof of convergence is presented, along with experiments with natural images and genomic data demonstrating that it leads to state-of-the-art performance in terms of speed and optimization for both small and large data sets. Keywords: basis pursuit, dictionary learning, matrix factorization, online learning, sparse coding, sparse principal component analysis, stochastic approximations, stochastic optimization, nonnegative matrix factorization
4 0.071168616 4 jmlr-2010-A Generalized Path Integral Control Approach to Reinforcement Learning
Author: Evangelos Theodorou, Jonas Buchli, Stefan Schaal
Abstract: With the goal to generate more scalable algorithms with higher efficiency and fewer open parameters, reinforcement learning (RL) has recently moved towards combining classical techniques from optimal control and dynamic programming with modern learning techniques from statistical estimation theory. In this vein, this paper suggests to use the framework of stochastic optimal control with path integrals to derive a novel approach to RL with parameterized policies. While solidly grounded in value function estimation and optimal control based on the stochastic Hamilton-JacobiBellman (HJB) equations, policy improvements can be transformed into an approximation problem of a path integral which has no open algorithmic parameters other than the exploration noise. The resulting algorithm can be conceived of as model-based, semi-model-based, or even model free, depending on how the learning problem is structured. The update equations have no danger of numerical instabilities as neither matrix inversions nor gradient learning rates are required. Our new algorithm demonstrates interesting similarities with previous RL research in the framework of probability matching and provides intuition why the slightly heuristically motivated probability matching approach can actually perform well. Empirical evaluations demonstrate significant performance improvements over gradient-based policy learning and scalability to high-dimensional control problems. Finally, a learning experiment on a simulated 12 degree-of-freedom robot dog illustrates the functionality of our algorithm in a complex robot learning scenario. We believe that Policy Improvement with Path Integrals (PI2 ) offers currently one of the most efficient, numerically robust, and easy to implement algorithms for RL based on trajectory roll-outs. Keywords: stochastic optimal control, reinforcement learning, parameterized policies
5 0.066840217 75 jmlr-2010-Mean Field Variational Approximation for Continuous-Time Bayesian Networks
Author: Ido Cohn, Tal El-Hay, Nir Friedman, Raz Kupferman
Abstract: Continuous-time Bayesian networks is a natural structured representation language for multicomponent stochastic processes that evolve continuously over time. Despite the compact representation provided by this language, inference in such models is intractable even in relatively simple structured networks. We introduce a mean field variational approximation in which we use a product of inhomogeneous Markov processes to approximate a joint distribution over trajectories. This variational approach leads to a globally consistent distribution, which can be efficiently queried. Additionally, it provides a lower bound on the probability of observations, thus making it attractive for learning tasks. Here we describe the theoretical foundations for the approximation, an efficient implementation that exploits the wide range of highly optimized ordinary differential equations (ODE) solvers, experimentally explore characterizations of processes for which this approximation is suitable, and show applications to a large-scale real-world inference problem. Keywords: continuous time Markov processes, continuous time Bayesian networks, variational approximations, mean field approximation
6 0.053744659 81 jmlr-2010-On Finding Predictors for Arbitrary Families of Processes
7 0.045750439 97 jmlr-2010-Regret Bounds and Minimax Policies under Partial Monitoring
8 0.042363171 51 jmlr-2010-Importance Sampling for Continuous Time Bayesian Networks
9 0.040524159 101 jmlr-2010-Second-Order Bilinear Discriminant Analysis
10 0.040135421 12 jmlr-2010-Analysis of Multi-stage Convex Relaxation for Sparse Regularization
11 0.039301455 95 jmlr-2010-Rademacher Complexities and Bounding the Excess Risk in Active Learning
12 0.035908218 46 jmlr-2010-High Dimensional Inverse Covariance Matrix Estimation via Linear Programming
13 0.035786532 47 jmlr-2010-Hilbert Space Embeddings and Metrics on Probability Measures
14 0.035283834 37 jmlr-2010-Evolving Static Representations for Task Transfer
15 0.034215782 69 jmlr-2010-Lp-Nested Symmetric Distributions
16 0.033813544 109 jmlr-2010-Stochastic Composite Likelihood
17 0.033679966 65 jmlr-2010-Learning Translation Invariant Kernels for Classification
18 0.033123538 93 jmlr-2010-PyBrain
19 0.032733485 18 jmlr-2010-Bundle Methods for Regularized Risk Minimization
topicId topicWeight
[(0, -0.142), (1, -0.07), (2, 0.03), (3, -0.12), (4, -0.025), (5, -0.018), (6, -0.049), (7, 0.062), (8, 0.123), (9, 0.009), (10, 0.119), (11, 0.017), (12, 0.035), (13, -0.135), (14, 0.06), (15, 0.113), (16, 0.071), (17, -0.013), (18, -0.179), (19, -0.092), (20, 0.01), (21, 0.057), (22, -0.092), (23, -0.11), (24, 0.028), (25, 0.027), (26, -0.162), (27, -0.096), (28, 0.079), (29, 0.006), (30, -0.051), (31, 0.131), (32, -0.27), (33, 0.041), (34, 0.062), (35, -0.158), (36, -0.013), (37, -0.027), (38, 0.02), (39, 0.005), (40, -0.022), (41, 0.087), (42, -0.016), (43, 0.145), (44, -0.017), (45, -0.076), (46, 0.069), (47, -0.002), (48, -0.112), (49, -0.089)]
simIndex simValue paperId paperTitle
same-paper 1 0.94826609 2 jmlr-2010-A Convergent Online Single Time Scale Actor Critic Algorithm
Author: Dotan Di Castro, Ron Meir
Abstract: Actor-Critic based approaches were among the first to address reinforcement learning in a general setting. Recently, these algorithms have gained renewed interest due to their generality, good convergence properties, and possible biological relevance. In this paper, we introduce an online temporal difference based actor-critic algorithm which is proved to converge to a neighborhood of a local maximum of the average reward. Linear function approximation is used by the critic in order estimate the value function, and the temporal difference signal, which is passed from the critic to the actor. The main distinguishing feature of the present convergence proof is that both the actor and the critic operate on a similar time scale, while in most current convergence proofs they are required to have very different time scales in order to converge. Moreover, the same temporal difference signal is used to update the parameters of both the actor and the critic. A limitation of the proposed approach, compared to results available for two time scale convergence, is that convergence is guaranteed only to a neighborhood of an optimal value, rather to an optimal value itself. The single time scale and identical temporal difference signal used by the actor and the critic, may provide a step towards constructing more biologically realistic models of reinforcement learning in the brain. Keywords: actor critic, single time scale convergence, temporal difference
2 0.51980954 4 jmlr-2010-A Generalized Path Integral Control Approach to Reinforcement Learning
Author: Evangelos Theodorou, Jonas Buchli, Stefan Schaal
Abstract: With the goal to generate more scalable algorithms with higher efficiency and fewer open parameters, reinforcement learning (RL) has recently moved towards combining classical techniques from optimal control and dynamic programming with modern learning techniques from statistical estimation theory. In this vein, this paper suggests to use the framework of stochastic optimal control with path integrals to derive a novel approach to RL with parameterized policies. While solidly grounded in value function estimation and optimal control based on the stochastic Hamilton-JacobiBellman (HJB) equations, policy improvements can be transformed into an approximation problem of a path integral which has no open algorithmic parameters other than the exploration noise. The resulting algorithm can be conceived of as model-based, semi-model-based, or even model free, depending on how the learning problem is structured. The update equations have no danger of numerical instabilities as neither matrix inversions nor gradient learning rates are required. Our new algorithm demonstrates interesting similarities with previous RL research in the framework of probability matching and provides intuition why the slightly heuristically motivated probability matching approach can actually perform well. Empirical evaluations demonstrate significant performance improvements over gradient-based policy learning and scalability to high-dimensional control problems. Finally, a learning experiment on a simulated 12 degree-of-freedom robot dog illustrates the functionality of our algorithm in a complex robot learning scenario. We believe that Policy Improvement with Path Integrals (PI2 ) offers currently one of the most efficient, numerically robust, and easy to implement algorithms for RL based on trajectory roll-outs. Keywords: stochastic optimal control, reinforcement learning, parameterized policies
3 0.44931683 79 jmlr-2010-Near-optimal Regret Bounds for Reinforcement Learning
Author: Thomas Jaksch, Ronald Ortner, Peter Auer
Abstract: For undiscounted reinforcement learning in Markov decision processes (MDPs) we consider the total regret of a learning algorithm with respect to an optimal policy. In order to describe the transition structure of an MDP we propose a new parameter: An MDP has diameter D if for any pair of states s, s′ there is a policy which moves from s to s′ in at most D steps (on average). √ ˜ We present a reinforcement learning algorithm with total regret O(DS AT ) after T steps for any unknown MDP with S states, A actions per state, and diameter D. A corresponding lower bound of √ Ω( DSAT ) on the total regret of any learning algorithm is given as well. These results are complemented by a sample complexity bound on the number of suboptimal steps taken by our algorithm. This bound can be used to achieve a (gap-dependent) regret bound that is logarithmic in T . Finally, we also consider a setting where the MDP is allowed to change a fixed number of ℓ times. We present a modification of our algorithm that is able to deal with this setting and show a √ ˜ regret bound of O(ℓ1/3 T 2/3 DS A). Keywords: undiscounted reinforcement learning, Markov decision process, regret, online learning, sample complexity
4 0.3944391 87 jmlr-2010-Online Learning for Matrix Factorization and Sparse Coding
Author: Julien Mairal, Francis Bach, Jean Ponce, Guillermo Sapiro
Abstract: Sparse coding—that is, modelling data vectors as sparse linear combinations of basis elements—is widely used in machine learning, neuroscience, signal processing, and statistics. This paper focuses on the large-scale matrix factorization problem that consists of learning the basis set in order to adapt it to specific data. Variations of this problem include dictionary learning in signal processing, non-negative matrix factorization and sparse principal component analysis. In this paper, we propose to address these tasks with a new online optimization algorithm, based on stochastic approximations, which scales up gracefully to large data sets with millions of training samples, and extends naturally to various matrix factorization formulations, making it suitable for a wide range of learning problems. A proof of convergence is presented, along with experiments with natural images and genomic data demonstrating that it leads to state-of-the-art performance in terms of speed and optimization for both small and large data sets. Keywords: basis pursuit, dictionary learning, matrix factorization, online learning, sparse coding, sparse principal component analysis, stochastic approximations, stochastic optimization, nonnegative matrix factorization
5 0.34493655 37 jmlr-2010-Evolving Static Representations for Task Transfer
Author: Phillip Verbancsics, Kenneth O. Stanley
Abstract: An important goal for machine learning is to transfer knowledge between tasks. For example, learning to play RoboCup Keepaway should contribute to learning the full game of RoboCup soccer. Previous approaches to transfer in Keepaway have focused on transforming the original representation to fit the new task. In contrast, this paper explores the idea that transfer is most effective if the representation is designed to be the same even across different tasks. To demonstrate this point, a bird’s eye view (BEV) representation is introduced that can represent different tasks on the same two-dimensional map. For example, both the 3 vs. 2 and 4 vs. 3 Keepaway tasks can be represented on the same BEV. Yet the problem is that a raw two-dimensional map is high-dimensional and unstructured. This paper shows how this problem is addressed naturally by an idea from evolutionary computation called indirect encoding, which compresses the representation by exploiting its geometry. The result is that the BEV learns a Keepaway policy that transfers without further learning or manipulation. It also facilitates transferring knowledge learned in a different domain, Knight Joust, into Keepaway. Finally, the indirect encoding of the BEV means that its geometry can be changed without altering the solution. Thus static representations facilitate several kinds of transfer.
6 0.32845369 75 jmlr-2010-Mean Field Variational Approximation for Continuous-Time Bayesian Networks
7 0.31724623 101 jmlr-2010-Second-Order Bilinear Discriminant Analysis
8 0.27357209 113 jmlr-2010-Tree Decomposition for Large-Scale SVM Problems
9 0.26001558 19 jmlr-2010-Characterization, Stability and Convergence of Hierarchical Clustering Methods
10 0.25761265 93 jmlr-2010-PyBrain
11 0.25418496 47 jmlr-2010-Hilbert Space Embeddings and Metrics on Probability Measures
12 0.23836835 81 jmlr-2010-On Finding Predictors for Arbitrary Families of Processes
13 0.21212621 109 jmlr-2010-Stochastic Composite Likelihood
14 0.20731969 12 jmlr-2010-Analysis of Multi-stage Convex Relaxation for Sparse Regularization
15 0.19494142 35 jmlr-2010-Error-Correcting Output Codes Library
16 0.1854791 95 jmlr-2010-Rademacher Complexities and Bounding the Excess Risk in Active Learning
17 0.18369694 61 jmlr-2010-Learning From Crowds
18 0.17879923 97 jmlr-2010-Regret Bounds and Minimax Policies under Partial Monitoring
19 0.16662626 108 jmlr-2010-Stochastic Complexity and Generalization Error of a Restricted Boltzmann Machine in Bayesian Estimation
20 0.16653827 5 jmlr-2010-A Quasi-Newton Approach to Nonsmooth Convex Optimization Problems in Machine Learning
topicId topicWeight
[(1, 0.012), (3, 0.037), (4, 0.013), (5, 0.407), (8, 0.015), (21, 0.024), (32, 0.062), (36, 0.07), (37, 0.034), (38, 0.036), (75, 0.118), (81, 0.014), (85, 0.042)]
simIndex simValue paperId paperTitle
1 0.6888141 101 jmlr-2010-Second-Order Bilinear Discriminant Analysis
Author: Christoforos Christoforou, Robert Haralick, Paul Sajda, Lucas C. Parra
Abstract: Traditional analysis methods for single-trial classification of electro-encephalography (EEG) focus on two types of paradigms: phase-locked methods, in which the amplitude of the signal is used as the feature for classification, that is, event related potentials; and second-order methods, in which the feature of interest is the power of the signal, that is, event related (de)synchronization. The process of deciding which paradigm to use is ad hoc and is driven by assumptions regarding the underlying neural generators. Here we propose a method that provides an unified framework for the analysis of EEG, combining first and second-order spatial and temporal features based on a bilinear model. Evaluation of the proposed method on simulated data shows that the technique outperforms state-of-the art techniques for single-trial classification for a broad range of signal-to-noise ratios. Evaluations on human EEG—including one benchmark data set from the Brain Computer Interface (BCI) competition—show statistically significant gains in classification accuracy, with a reduction in overall classification error from 26%-28% to 19%. Keywords: regularization, classification, bilinear decomposition, neural signals, brain computer interface
same-paper 2 0.66599911 2 jmlr-2010-A Convergent Online Single Time Scale Actor Critic Algorithm
Author: Dotan Di Castro, Ron Meir
Abstract: Actor-Critic based approaches were among the first to address reinforcement learning in a general setting. Recently, these algorithms have gained renewed interest due to their generality, good convergence properties, and possible biological relevance. In this paper, we introduce an online temporal difference based actor-critic algorithm which is proved to converge to a neighborhood of a local maximum of the average reward. Linear function approximation is used by the critic in order to estimate the value function, and the temporal difference signal, which is passed from the critic to the actor. The main distinguishing feature of the present convergence proof is that both the actor and the critic operate on a similar time scale, while in most current convergence proofs they are required to have very different time scales in order to converge. Moreover, the same temporal difference signal is used to update the parameters of both the actor and the critic. A limitation of the proposed approach, compared to results available for two time scale convergence, is that convergence is guaranteed only to a neighborhood of an optimal value, rather than to an optimal value itself. The single time scale and identical temporal difference signal used by the actor and the critic may provide a step towards constructing more biologically realistic models of reinforcement learning in the brain. Keywords: actor critic, single time scale convergence, temporal difference
3 0.3526687 79 jmlr-2010-Near-optimal Regret Bounds for Reinforcement Learning
Author: Thomas Jaksch, Ronald Ortner, Peter Auer
Abstract: For undiscounted reinforcement learning in Markov decision processes (MDPs) we consider the total regret of a learning algorithm with respect to an optimal policy. In order to describe the transition structure of an MDP we propose a new parameter: An MDP has diameter D if for any pair of states s, s′ there is a policy which moves from s to s′ in at most D steps (on average). √ ˜ We present a reinforcement learning algorithm with total regret O(DS AT ) after T steps for any unknown MDP with S states, A actions per state, and diameter D. A corresponding lower bound of √ Ω( DSAT ) on the total regret of any learning algorithm is given as well. These results are complemented by a sample complexity bound on the number of suboptimal steps taken by our algorithm. This bound can be used to achieve a (gap-dependent) regret bound that is logarithmic in T . Finally, we also consider a setting where the MDP is allowed to change a fixed number of ℓ times. We present a modification of our algorithm that is able to deal with this setting and show a √ ˜ regret bound of O(ℓ1/3 T 2/3 DS A). Keywords: undiscounted reinforcement learning, Markov decision process, regret, online learning, sample complexity
4 0.35207009 89 jmlr-2010-PAC-Bayesian Analysis of Co-clustering and Beyond
Author: Yevgeny Seldin, Naftali Tishby
Abstract: We derive PAC-Bayesian generalization bounds for supervised and unsupervised learning models based on clustering, such as co-clustering, matrix tri-factorization, graphical models, graph clustering, and pairwise clustering.1 We begin with the analysis of co-clustering, which is a widely used approach to the analysis of data matrices. We distinguish among two tasks in matrix data analysis: discriminative prediction of the missing entries in data matrices and estimation of the joint probability distribution of row and column variables in co-occurrence matrices. We derive PAC-Bayesian generalization bounds for the expected out-of-sample performance of co-clustering-based solutions for these two tasks. The analysis yields regularization terms that were absent in the previous formulations of co-clustering. The bounds suggest that the expected performance of co-clustering is governed by a trade-off between its empirical performance and the mutual information preserved by the cluster variables on row and column IDs. We derive an iterative projection algorithm for finding a local optimum of this trade-off for discriminative prediction tasks. This algorithm achieved stateof-the-art performance in the MovieLens collaborative filtering task. Our co-clustering model can also be seen as matrix tri-factorization and the results provide generalization bounds, regularization terms, and new algorithms for this form of matrix factorization. The analysis of co-clustering is extended to tree-shaped graphical models, which can be used to analyze high dimensional tensors. According to the bounds, the generalization abilities of treeshaped graphical models depend on a trade-off between their empirical data fit and the mutual information that is propagated up the tree levels. We also formulate weighted graph clustering as a prediction problem: given a subset of edge weights we analyze the ability of graph clustering to predict the remaining edge weights. The analysis of co-clustering easily
5 0.35008791 46 jmlr-2010-High Dimensional Inverse Covariance Matrix Estimation via Linear Programming
Author: Ming Yuan
Abstract: This paper considers the problem of estimating a high dimensional inverse covariance matrix that can be well approximated by “sparse” matrices. Taking advantage of the connection between multivariate linear regression and entries of the inverse covariance matrix, we propose an estimating procedure that can effectively exploit such “sparsity”. The proposed method can be computed using linear programming and therefore has the potential to be used in very high dimensional problems. Oracle inequalities are established for the estimation error in terms of several operator norms, showing that the method is adaptive to different types of sparsity of the problem. Keywords: covariance selection, Dantzig selector, Gaussian graphical model, inverse covariance matrix, Lasso, linear programming, oracle inequality, sparsity
6 0.34389418 75 jmlr-2010-Mean Field Variational Approximation for Continuous-Time Bayesian Networks
7 0.34315428 74 jmlr-2010-Maximum Relative Margin and Data-Dependent Regularization
8 0.34229773 69 jmlr-2010-Lp-Nested Symmetric Distributions
9 0.34228936 51 jmlr-2010-Importance Sampling for Continuous Time Bayesian Networks
10 0.3411932 103 jmlr-2010-Sparse Semi-supervised Learning Using Conjugate Functions
11 0.33986428 106 jmlr-2010-Stability Bounds for Stationary φ-mixing and β-mixing Processes
12 0.33962157 114 jmlr-2010-Unsupervised Supervised Learning I: Estimating Classification and Regression Errors without Labels
13 0.33877325 105 jmlr-2010-Spectral Regularization Algorithms for Learning Large Incomplete Matrices
14 0.33751518 111 jmlr-2010-Topology Selection in Graphical Models of Autoregressive Processes
15 0.33721402 17 jmlr-2010-Bayesian Learning in Sparse Graphical Factor Models via Variational Mean-Field Annealing
16 0.33644372 63 jmlr-2010-Learning Instance-Specific Predictive Models
18 0.33496198 102 jmlr-2010-Semi-Supervised Novelty Detection
19 0.33491889 49 jmlr-2010-Hubs in Space: Popular Nearest Neighbors in High-Dimensional Data
20 0.33403531 87 jmlr-2010-Online Learning for Matrix Factorization and Sparse Coding