
75 jmlr-2006-Policy Gradient in Continuous Time

Source: pdf

Author: Rémi Munos

Abstract: Policy search is a method for approximately solving an optimal control problem by performing a parametric optimization search in a given class of parameterized policies. To apply a local optimization technique, such as a gradient method, we need to evaluate the sensitivity of the performance measure with respect to the policy parameters, the so-called policy gradient. This paper is concerned with the estimation of the policy gradient for continuous-time, deterministic state dynamics in a reinforcement learning framework, that is, when the decision maker does not have a model of the state dynamics. We show that the usual likelihood ratio methods used in discrete time fail to estimate the gradient because they are subject to variance explosion when the discretization time-step decreases to 0. We describe an alternative approach based on the approximation of the pathwise derivative, which leads to a policy gradient estimate that converges almost surely to the true gradient when the time-step tends to 0. The underlying idea starts with the derivation of an explicit representation of the policy gradient using pathwise derivation. This derivation makes use of the knowledge of the state dynamics. Then, in order to estimate the gradient from the observable data only, we use a stochastic policy to discretize the continuous deterministic system into a stochastic discrete process, which makes it possible to replace the unknown coefficients by quantities that depend solely on known data. We prove the almost sure convergence of this estimate to the true policy gradient when the discretization time-step goes to zero. The method is illustrated on two target problems, in discrete and continuous control spaces.

Keywords: optimal control, reinforcement learning, policy search, sensitivity analysis, parametric optimization, gradient estimate, likelihood ratio method, pathwise derivation

© 2006 Rémi Munos.

1. Introduction and Statement of the Problem

We consider an optimal control problem with continuous state $(x_t \in \mathbb{R}^d)_{t \ge 0}$ whose state dynamics is defined according to the controlled differential equation:

$$\frac{dx_t}{dt} = f(x_t, u_t), \qquad (1)$$

where the control $(u_t)_{t \ge 0}$ is a Lebesgue measurable function with values in a control space $U$. Note that the state dynamics $f$ may also depend on time, but we omit this dependency in the notation, for simplicity. We intend to maximize a functional $J$ that depends on the trajectory $(x_t)_{0 \le t \le T}$ over a finite-time horizon $T > 0$. For simplicity, in the paper, we illustrate the case of a terminal reward only:

$$J(x; (u_t)_{t \ge 0}) := r(x_T), \qquad (2)$$

where $r : \mathbb{R}^d \to \mathbb{R}$ is the reward function. Extension to the case of a general functional of the kind

$$J(x; (u_t)_{t \ge 0}) = \int_0^T r(t, x_t)\,dt + R(x_T), \qquad (3)$$

with $r$ and $R$ being current and terminal reward functions, would easily follow, as indicated in Remark 1.

The optimal control problem of finding a control $(u_t)_{t \ge 0}$ that maximizes the functional is replaced by a parametric optimization problem in which we search for a good feedback control law in a given class of parameterized policies $\{\pi_\alpha : [0, T] \times \mathbb{R}^d \to U\}_\alpha$, where $\alpha \in \mathbb{R}^m$ is the parameter. The control $u_t \in U$ (or action) at time $t$ is $u_t = \pi_\alpha(t, x_t)$, and we may write the dynamics of the resulting feedback system as

$$\frac{dx_t}{dt} = f_\alpha(x_t), \qquad (4)$$

where $f_\alpha(x_t) := f(x_t, \pi_\alpha(t, x_t))$. In the paper, we will make the assumption that $f_\alpha$ is $C^2$, with bounded derivatives. Let us define the performance measure $V(\alpha) := J(x; (\pi_\alpha(t, x_t))_{t \ge 0})$, where its dependency with respect to (w.r.t.) the parameter $\alpha$ is emphasized. One may also consider an average performance measure according to some distribution $\mu$ for the initial state: $V(\alpha) := E[J(x; (\pi_\alpha(t, x_t))_{t \ge 0}) \mid x \sim \mu]$.
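To make the objects of this section concrete, the following sketch simulates the closed-loop dynamics (4) with an explicit Euler scheme and evaluates both versions of the performance measure $V(\alpha)$. Everything specific in it is an assumption chosen for illustration and does not come from the paper: the scalar dynamics `f`, the two-parameter linear policy, the quadratic terminal reward, the Gaussian initial-state distribution standing in for $\mu$, and the Euler discretization itself.

```python
import random

def f(x, u):
    # Hypothetical scalar dynamics dx/dt = f(x, u); not taken from the paper.
    return -x + u

def policy(alpha, t, x):
    # Hypothetical feedback law pi_alpha(t, x) with parameter alpha in R^2.
    return alpha[0] + alpha[1] * x

def reward(x):
    # Hypothetical terminal reward r(x_T).
    return -(x - 1.0) ** 2

def terminal_state(alpha, x0, T=1.0, n_steps=1000):
    # Explicit Euler integration of the closed-loop system dx/dt = f_alpha(x).
    dt = T / n_steps
    x = x0
    for n in range(n_steps):
        u = policy(alpha, n * dt, x)
        x += dt * f(x, u)
    return x

def performance(alpha, x0):
    # V(alpha) = r(x_T) for a fixed initial state x0, as in equation (2).
    return reward(terminal_state(alpha, x0))

def average_performance(alpha, n_samples=200, seed=0):
    # Monte Carlo estimate of E[r(x_T) | x0 ~ mu], with mu assumed Gaussian.
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n_samples):
        total += performance(alpha, rng.gauss(0.0, 0.1))
    return total / n_samples

if __name__ == "__main__":
    alpha = (1.0, 0.0)
    print(performance(alpha, 0.0), average_performance(alpha))
```

The policy parameter enters only through `policy`, so the same loop can be reused for any other parameterized feedback law.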
In order to find a local maximum of $V(\alpha)$, one may perform a local search, such as a gradient ascent method

$$\alpha \leftarrow \alpha + \eta \nabla_\alpha V(\alpha), \qquad (5)$$

with an adequate step $\eta$ (see for example (Polyak, 1987; Kushner and Yin, 1997)). The computation of the gradient $\nabla_\alpha V(\alpha)$ is the object of this paper.

A first method would be to approximate the gradient by a finite-difference quotient for each of the $m$ components of the parameter:

$$\partial_{\alpha_i} V(\alpha) \simeq \frac{V(\alpha + \varepsilon e_i) - V(\alpha)}{\varepsilon},$$

for some small value of $\varepsilon$ (we use the notation $\partial_\alpha$ instead of $\nabla_\alpha$ to indicate that it is a single-dimensional derivative). This finite-difference method requires the simulation of $m + 1$ trajectories to compute an approximation of the true gradient. When the number of parameters is large, this may be computationally expensive. However, this simple method may be efficient if the number of parameters is relatively small. In the rest of the paper we will not consider this approach, and will aim at computing the gradient using one trajectory only.

Pathwise estimation of the gradient. We now illustrate that if the decision maker has access to a model of the state dynamics, then a pathwise derivation directly leads to the policy gradient. Indeed, let us define the gradient of the state with respect to the parameter: $z_t := \nabla_\alpha x_t$ (i.e. $z_t$ is defined as a $d \times m$ matrix whose $(i, j)$-component is the derivative of the $i$th component of $x_t$ w.r.t. $\alpha_j$). Our smoothness assumption on $f_\alpha$ allows us to differentiate the state dynamics (4) w.r.t. $\alpha$, which provides the dynamics of $(z_t)$:

$$\frac{dz_t}{dt} = \nabla_\alpha f_\alpha(x_t) + \nabla_x f_\alpha(x_t)\, z_t, \qquad (6)$$

where the coefficients $\nabla_\alpha f_\alpha$ and $\nabla_x f_\alpha$ are, respectively, the derivatives of $f$ w.r.t. the parameter (matrix of size $d \times m$) and the state (matrix of size $d \times d$). The initial condition for $z$ is $z_0 = 0$. When the reward function $r$ is smooth (i.e. continuously differentiable), one may apply a pathwise differentiation to derive a gradient formula (see e.g. (Bensoussan, 1988) or (Yang and Kushner, 1991) for an extension to the stochastic case):

$$\nabla_\alpha V(\alpha) = \nabla_x r(x_T)\, z_T. \qquad (7)$$

Remark 1 In the more general setting of a functional (3), the gradient is deduced (by linearity) from the above formula:

$$\nabla_\alpha V(\alpha) = \int_0^T \nabla_x r(t, x_t)\, z_t\, dt + \nabla_x R(x_T)\, z_T.$$

What is known to the agent? The decision maker (call it the agent) that intends to design a good controller for the dynamical system may or may not know a model of the state dynamics $f$. If the dynamics is known, the state gradient $z_t = \nabla_\alpha x_t$ may be computed from (6) along the trajectory, and the gradient of the performance measure w.r.t. the parameter $\alpha$ is deduced at time $T$ from (7), which makes it possible to perform the gradient ascent step (5).

However, in this paper we consider a Reinforcement Learning (Sutton and Barto, 1998) setting in which the state dynamics is unknown to the agent, but we still assume that the state is fully observable. The agent knows only the response of the system to its control. To be more precise, the information available to the agent at time $t$ is its own control policy $\pi_\alpha$ and the trajectory $(x_s)_{0 \le s \le t}$ up to time $t$. At time $T$, the agent receives the reward $r(x_T)$ and, in this paper, we assume that the gradient $\nabla r(x_T)$ is available to the agent.

From this point of view, it seems impossible to derive the state gradient $z_t$ from (6), since $\nabla_\alpha f$ and $\nabla_x f$ are unknown. The term $\nabla_x f(x_t)$ may be approximated by a least squares method from the observation of past states $(x_s)_{s \le t}$, as will be explained later in subsection 3.2. However, the term $\nabla_\alpha f(x_t)$ cannot be calculated analogously. In this paper, we introduce the idea of using stochastic policies to approximate the state $(x_t)$ and the state gradient $(z_t)$ by discrete-time stochastic processes $(X_t^\Delta)$ and $(Z_t^\Delta)$ (with $\Delta$ being some discretization time-step). We show how $Z_t^\Delta$ can be computed without the knowledge of $\nabla_\alpha f$, using only information available to the agent.
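When a model is available, the pathwise recipe of equations (4), (6) and (7) amounts to integrating $x_t$ and $z_t$ side by side. The sketch below does this for a scalar toy problem and compares the result with a finite-difference quotient. The dynamics $f(x, u) = -x + u$, the policy $\pi_\alpha(x) = \alpha \tanh(x)$, the reward $r(x) = -(x - 0.5)^2$ and the Euler scheme are all assumptions made for illustration, and the check uses a central difference rather than the one-sided quotient discussed above.

```python
import math

def rollout(alpha, x0=1.0, T=1.0, n_steps=1000):
    """Euler-integrate the state x_t and its sensitivity z_t = dx_t/dalpha.

    Assumed closed-loop dynamics: f_alpha(x) = -x + alpha * tanh(x).
    """
    dt = T / n_steps
    x, z = x0, 0.0                      # z_0 = 0, as in the text
    for _ in range(n_steps):
        th = math.tanh(x)
        f_val = -x + alpha * th
        df_dalpha = th                  # derivative of f_alpha w.r.t. alpha
        df_dx = -1.0 + alpha * (1.0 - th * th)   # derivative w.r.t. x
        # Equation (6): dz/dt = grad_alpha f + grad_x f * z
        x, z = x + dt * f_val, z + dt * (df_dalpha + df_dx * z)
    return x, z

def reward(x):
    return -(x - 0.5) ** 2              # hypothetical terminal reward

def reward_grad(x):
    return -2.0 * (x - 0.5)

def pathwise_gradient(alpha):
    # Equation (7): grad V = grad_x r(x_T) * z_T
    x_T, z_T = rollout(alpha)
    return reward_grad(x_T) * z_T

def finite_difference_gradient(alpha, eps=1e-5):
    # Central difference on V(alpha); needs two extra trajectories.
    v_plus = reward(rollout(alpha + eps)[0])
    v_minus = reward(rollout(alpha - eps)[0])
    return (v_plus - v_minus) / (2.0 * eps)

if __name__ == "__main__":
    print(pathwise_gradient(0.3), finite_difference_gradient(0.3))
```

Because the Euler update for `z` is exactly the derivative of the Euler update for `x`, the two estimates should agree up to the finite-difference error. The sketch relies on `df_dalpha` and `df_dx`, which is precisely the model knowledge that the reinforcement learning setting does not provide.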
We prove the convergence (with probability one) of the gradient estimate $\nabla_x r(X_T^\Delta)\, Z_T^\Delta$ derived from the stochastic processes to $\nabla_\alpha V(\alpha)$ when $\Delta \to 0$. Here, almost sure convergence is obtained using the concentration of measure phenomenon (Talagrand, 1996; Ledoux, 2001).

Figure 1: A trajectory $(X_{t_n}^\Delta)_{0 \le n \le N}$ and the state dynamics vector $f_\alpha$ of the continuous process $(x_t)_{0 \le t \le T}$.

Likelihood ratio method? It is worth mentioning that this strong convergence result contrasts with the usual likelihood ratio method (also called score method) in discrete time (see e.g. (Reiman and Weiss, 1986; Glynn, 1987) or more recently in the reinforcement learning literature (Williams, 1992; Sutton et al., 2000; Baxter and Bartlett, 2001; Marbach and Tsitsiklis, 2003)), for which the policy gradient estimate is subject to variance explosion when the discretization time-step $\Delta$ tends to 0. The intuitive reason for that problem lies in the fact that the number of decisions before getting the reward grows to infinity when $\Delta \to 0$ (the variance of likelihood ratio estimates usually growing linearly with the number of decisions).

Let us illustrate this problem on a simple two-dimensional process. Consider the deterministic continuous process $(x_t)_{0 \le t \le 1}$ defined by the state dynamics:

$$\frac{dx_t}{dt} = f_\alpha := \begin{pmatrix} \alpha \\ 1 - \alpha \end{pmatrix}, \qquad (8)$$

with $0 < \alpha < 1$ and initial condition $x_0 = (0\ 0)'$ (where $'$ denotes the transpose operator). The performance measure $V(\alpha)$ is the reward at the terminal state at time $T = 1$, with the reward function being the first coordinate of the state, $r((x\ y)') := x$. Thus $V(\alpha) = r(x_{T=1}) = \alpha$ and its derivative is $\nabla_\alpha V(\alpha) = 1$.
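The variance claim can be checked numerically on this example. The sketch below assumes a discretization into $N$ steps of length $\Delta = 1/N$ in which each step moves right with probability $\alpha$ and up otherwise, and it uses the standard score-function estimator $r(X_N) \sum_n \partial_\alpha \log \pi_\alpha(U_n)$. The function names, sample sizes and seed are illustrative choices, not part of the paper.

```python
import random

def likelihood_ratio_estimate(alpha, n_steps, rng):
    """One score-function estimate of dV/dalpha for the two-action walk."""
    dt = 1.0 / n_steps
    x = 0.0       # first coordinate of the state, which is also the reward
    score = 0.0   # running sum of d/dalpha log pi_alpha(action)
    for _ in range(n_steps):
        if rng.random() < alpha:
            # Action u1: move right by dt; d/dalpha log(alpha) = 1/alpha.
            x += dt
            score += 1.0 / alpha
        else:
            # Action u2: move up by dt; d/dalpha log(1 - alpha) = -1/(1 - alpha).
            score -= 1.0 / (1.0 - alpha)
    return x * score

def mean_and_variance(alpha, n_steps, n_runs=2000, seed=0):
    # Empirical mean and unbiased sample variance over independent runs.
    rng = random.Random(seed)
    samples = [likelihood_ratio_estimate(alpha, n_steps, rng)
               for _ in range(n_runs)]
    mean = sum(samples) / n_runs
    var = sum((s - mean) ** 2 for s in samples) / (n_runs - 1)
    return mean, var

if __name__ == "__main__":
    for n in (10, 100, 1000):
        m, v = mean_and_variance(0.5, n)
        print(n, round(m, 3), round(v, 1))
```

The estimator is unbiased, so the empirical mean should stay near the true gradient 1, while the empirical variance should grow roughly in proportion to $N$. For $\alpha = 1/2$ a direct calculation gives a variance of $N + 2 - 2/N$.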
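Let $(X_{t_n}^\Delta)_{0 \le n \le N} \in \mathbb{R}^2$ be a discrete-time stochastic process (the discrete times being $\{t_n = n\Delta\}_{n = 0, \dots, N}$ with the discretization time-step $\Delta = 1/N$) that starts from the initial state $X_0^\Delta = x_0 = (0\ 0)'$ and makes $N$ random moves of length $\Delta$ towards the right (action $u_1$) or the top (action $u_2$) (see Figure 1) according to the stochastic policy (i.e., the probability of choosing the actions in each state $x$) $\pi_\alpha(u_1 | x) = \alpha$, $\pi_\alpha(u_2 | x) = 1 - \alpha$. The process is thus defined according to the dynamics:

$$X_{t_{n+1}}^\Delta = X_{t_n}^\Delta + \begin{pmatrix} U_n \\ 1 - U_n \end{pmatrix} \Delta, \qquad (9)$$

where $(U_n)_{0 \le n < N}$ [...]

[...] and all $N > 0$), there exists a constant $C$ that does not depend on $N$ such that $d_n \le C/N$. Thus we may take $D^2 = C^2/N$. Now, from the previous paragraph, $\|E[X_N] - x_N\| \le e(N)$, with $e(N) \to 0$ when $N \to \infty$. This means that $\|h - E[h]\| + e(N) \ge \|X_N - x_N\|$, thus $P(\|h - E[h]\| \ge \varepsilon - e(N)) \ge P(\|X_N - x_N\| \ge \varepsilon)$, and we deduce from (31) that

$$P(\|X_N - x_N\| \ge \varepsilon) \le 2 e^{-N(\varepsilon - e(N))^2/(2C^2)}.$$

Thus, for all $\varepsilon > 0$, the series $\sum_{N \ge 0} P(\|X_N - x_N\| \ge \varepsilon)$ converges. Now, from the Borel-Cantelli lemma, we deduce that for all $\varepsilon > 0$, there exists $N_\varepsilon$ such that for all $N \ge N_\varepsilon$, $\|X_N - x_N\| < \varepsilon$, which proves the almost sure convergence of $X_N$ to $x_N$ as $N \to \infty$ (i.e. $X_T^\Delta \to x_T$ almost surely as $\Delta \to 0$).

Appendix C. Proof of Proposition 8

First, note that $Q_t = \overline{X X'} - \bar{X} \bar{X}'$ is a symmetric, non-negative matrix, since it may be rewritten as

$$\frac{1}{n_t} \sum_{s \in S(t)} (X_s^+ - \bar{X})(X_s^+ - \bar{X})'.$$

In solving the least squares problem (21), we deduce $b = \overline{\Delta X} - A \bar{X} \Delta$, thus

$$\min_{A, b} \frac{1}{n_t} \sum_{s \in S(t)} \left\| \Delta X_s - b - A\left(X_s + \tfrac{1}{2} \Delta X_s\right) \Delta \right\|^2 = \min_A \frac{1}{n_t} \sum_{s \in S(t)} \left\| \Delta X_s - \overline{\Delta X} - A (X_s^+ - \bar{X}) \Delta \right\|^2 \le \frac{1}{n_t} \sum_{s \in S(t)} \left\| \Delta X_s - \overline{\Delta X} - \nabla_x f(\bar{X}, u_t)(X_s^+ - \bar{X}) \Delta \right\|^2. \qquad (32)$$

Now, since $X_s = \bar{X} + O(\Delta)$, one may obtain as in (19) and (20) (by replacing $X_t$ by $\bar{X}$) that:

$$\Delta X_s - \overline{\Delta X} - \nabla_x f(\bar{X}, u_t)(X_s^+ - \bar{X}) \Delta = O(\Delta^3). \qquad (33)$$

We deduce from (32) and (33) that

$$\frac{1}{n_t} \sum_{s \in S(t)} \left\| \left( \widehat{\nabla_x f}(X_t, u_t) - \nabla_x f(\bar{X}, u_t) \right)(X_s^+ - \bar{X}) \Delta \right\|^2 = O(\Delta^6).$$

By developing each component,

$$\sum_{i=1}^d \left[ \widehat{\nabla_x f}(X_t, u_t) - \nabla_x f(\bar{X}, u_t) \right]_{\text{row } i} Q_t \left[ \widehat{\nabla_x f}(X_t, u_t) - \nabla_x f(\bar{X}, u_t) \right]'_{\text{row } i} = O(\Delta^4).$$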
Now, from the definition of $\nu(\Delta)$, for every vector $u \in \mathbb{R}^d$, $u' Q_t u \ge \nu(\Delta) \|u\|^2$, thus

$$\nu(\Delta) \left\| \widehat{\nabla_x f}(X_t, u_t) - \nabla_x f(\bar{X}, u_t) \right\|^2 = O(\Delta^4).$$

Condition (23) yields $\widehat{\nabla_x f}(X_t, u_t) = \nabla_x f(\bar{X}, u_t) + o(1)$, and since $\nabla_x f(X_t, u_t) = \nabla_x f(\bar{X}, u_t) + O(\Delta)$, we deduce

$$\lim_{\Delta \to 0} \widehat{\nabla_x f}(X_t, u_t) = \nabla_x f(X_t, u_t).$$

References

J. Baxter and P. L. Bartlett. Infinite-horizon gradient-based policy search. Journal of Artificial Intelligence Research, 15:319–350, 2001.
A. Bensoussan. Perturbation methods in optimal control. Wiley/Gauthier-Villars Series in Modern Applied Mathematics. John Wiley & Sons Ltd., Chichester, 1988. Translated from the French by C. Tomson.
A. Bogdanov. Optimal control of a double inverted pendulum on a cart. Technical report CSE-04006, CSEE, OGI School of Science and Engineering, OHSU, 2004.
P. W. Glynn. Likelihood ratio gradient estimation: an overview. In A. Thesen, H. Grant, and W. D. Kelton, editors, Proceedings of the 1987 Winter Simulation Conference, pages 366–375, 1987.
E. Gobet and R. Munos. Sensitivity analysis using Itô-Malliavin calculus and martingales. Application to stochastic optimal control. SIAM Journal on Control and Optimization, 43(5):1676–1713, 2005.
G. H. Golub and C. F. Van Loan. Matrix Computations, 3rd ed. Baltimore, MD: Johns Hopkins, 1996.
R. E. Kalman, P. L. Falb, and M. A. Arbib. Topics in Mathematical System Theory. New York: McGraw Hill, 1969.
P. E. Kloeden and E. Platen. Numerical Solutions of Stochastic Differential Equations. Springer-Verlag, 1995.
H. J. Kushner and G. Yin. Stochastic Approximation Algorithms and Applications. Springer-Verlag, Berlin and New York, 1997.
S. M. LaValle. Planning Algorithms. Cambridge University Press, 2006.
M. Ledoux. The concentration of measure phenomenon. American Mathematical Society, Providence, RI, 2001.
P. Marbach and J. N. Tsitsiklis. Approximate gradient methods in policy-space optimization of Markov reward processes. Journal of Discrete Event Dynamical Systems, 13:111–148, 2003.
B. T. Polyak. Introduction to Optimization. Optimization Software Inc., New York, 1987.
M. I. Reiman and A. Weiss. Sensitivity analysis via likelihood ratios. In J. Wilson, J. Henriksen, and S. Roberts, editors, Proceedings of the 1986 Winter Simulation Conference, pages 285–289, 1986.
R. S. Sutton and A. G. Barto. Reinforcement learning: An introduction. Bradford Book, 1998.
R. S. Sutton, D. McAllester, S. Singh, and Y. Mansour. Policy gradient methods for reinforcement learning with function approximation. Neural Information Processing Systems. MIT Press, pages 1057–1063, 2000.
M. Talagrand. A new look at independence. Annals of Probability, 24:1–34, 1996.
R. J. Williams. Simple statistical gradient-following algorithms for connectionist reinforcement learning. Machine Learning, 8:229–256, 1992.
J. Yang and H. J. Kushner. A Monte Carlo method for sensitivity analysis and parametric optimization of nonlinear stochastic systems. SIAM J. Control Optim., 29(5):1216–1249, 1991.

Reference: text

Summary: the most important sentences generated by tfidf model

sentIndex sentText sentNum sentScore

1 FR Centre de Mathématiques Appliquées Ecole Polytechnique 91128 Palaiseau, France Editor: Michael Littman Abstract Policy search is a method for approximately solving an optimal control problem by performing a parametric optimization search in a given class of parameterized policies. [sent-3, score-0.164]

2 In order to process a local optimization technique, such as a gradient method, we wish to evaluate the sensitivity of the performance measure with respect to the policy parameters, the so-called policy gradient. [sent-4, score-0.753]

3 This paper is concerned with the estimation of the policy gradient for continuous-time, deterministic state dynamics, in a reinforcement learning framework, that is, when the decision maker does not have a model of the state dynamics. [sent-5, score-0.784]

4 We show that usual likelihood ratio methods used in discrete-time, fail to proceed the gradient because they are subject to variance explosion when the discretization time-step decreases to 0. [sent-6, score-0.435]

5 We describe an alternative approach based on the approximation of the pathwise derivative, which leads to a policy gradient estimate that converges almost surely to the true gradient when the timestep tends to 0. [sent-7, score-0.922]

6 The underlying idea starts with the derivation of an explicit representation of the policy gradient using pathwise derivation. [sent-8, score-0.682]

7 This derivation makes use of the knowledge of the state dynamics. [sent-9, score-0.109]

8 Then, in order to estimate the gradient from the observable data only, we use a stochastic policy to discretize the continuous deterministic system into a stochastic discrete process, which enables to replace the unknown coefﬁcients by quantities that solely depend on known data. [sent-10, score-0.735]

9 We prove the almost sure convergence of this estimate to the true policy gradient when the discretization time-step goes to zero. [sent-11, score-0.576]

10 The method is illustrated on two target problems, in discrete and continuous control spaces. [sent-12, score-0.126]

11 Keywords: optimal control, reinforcement learning, policy search, sensitivity analysis, parametric optimization, gradient estimate, likelihood ratio method, pathwise derivation 1. [sent-13, score-1.018]

12 Note that the state-dynamics f may also depend on time, but we omit this dependency in the notation, for simplicity. [sent-15, score-0.025]

13 We intend to maximize a functional J that depends on the trajectory (xt )0≤t≤T over a ﬁnite-time horizon T > 0. [sent-16, score-0.146]

14 For simplicity, in the paper, we illustrate the case of a terminal reward c 2006 Rémi Munos. [sent-17, score-0.229]

15 M UNOS only: J(x; (ut )t≥0 ) := r(xT ), (2) where r : IRd → IR is the reward function. [sent-18, score-0.172]

16 Extension to the case of general functional of the kind J(x; (ut )t≥0 ) = Z T 0 r(t, xt )dt + R(xT ), (3) with r and R being current and terminal reward functions, would easily follow, as indicated in Remark 1. [sent-19, score-0.703]

17 The optimal control problem of ﬁnding a control (ut )t≥0 that maximizes the functional is replaced by a parametric optimization problem for which we search for a good feed-back control law in a given class of parameterized policies {πα : [0, T ] × IRd → U}α , where α ∈ IRm is the parameter. [sent-20, score-0.361]

18 The control ut ∈ U (or action) at time t is ut = πα (t, xt ), and we may write the dynamics of the resulting feed-back system as dxt = fα (xt ), (4) dt where fα (xt ) := f (x, πα (t, x)). [sent-21, score-1.72]

19 Let us deﬁne the performance measure V (α) := J(x; πα (t, xt )t≥0 ), where its dependency with respect to (w. [sent-23, score-0.472]

20 One may also consider an average performance measure according to some distribution µ for the initial state: V (α) := E[J(x; πα (t, xt )t≥0 )|x ∼ µ]. [sent-27, score-0.447]

21 In order to ﬁnd a local maximum of V (α), one may perform a local search, such as a gradient ascent method α ← α + η∇αV (α), (5) with an adequate step η (see for example (Polyak, 1987; Kushner and Yin, 1997)). [sent-28, score-0.272]

22 The computation of the gradient ∇αV (α) is the object of this paper. [sent-29, score-0.213]

23 A ﬁrst method would be to approximate the gradient by a ﬁnite-difference quotient for each of the m components of the parameter: V (α + εei ) −V (α) , ε for some small value of ε (we use the notation ∂α instead of ∇α to indicate that it is a singledimensional derivative). [sent-30, score-0.213]

24 This ﬁnite-difference method requires the simulation of m + 1 trajectories to compute an approximation of the true gradient. [sent-31, score-0.06]

25 In the rest of the paper we will not consider this approach, and will aim at computing the gradient using one trajectory only. [sent-34, score-0.309]

26 We now illustrate that if the decision-maker has access to a model of the state dynamics, then a pathwise derivation would directly lead to the policy gradient. [sent-36, score-0.54]

27 Indeed, let us deﬁne the gradient of the state with respect to the parameter: zt := ∇α xt (i. [sent-37, score-0.957]

28 zt is deﬁned as a d × m-matrix whose (i, j)-component is the derivative of the ith component of xt w. [sent-39, score-0.715]

29 Our smoothness assumption on fα allows to differentiate the state dynamics (4) w. [sent-43, score-0.307]

30 α, which provides the dynamics on (zt ): dzt = ∇α fα (xt ) + ∇x fα (xt )zt , dt (6) where the coefﬁcients ∇α fα and ∇x fα are, respectively, the derivatives of f w. [sent-46, score-0.301]

31 the parameter (matrix of size d × m) and the state (matrix of size d × d). [sent-49, score-0.071]

32 continuously differentiable), one may apply a pathwise differentiation to derive a gradient formula (see e. [sent-53, score-0.447]

33 (Bensoussan, 1988) or (Yang and Kushner, 1991) for an extension to the stochastic case): ∇αV (α) = ∇x r(xT )zT . [sent-55, score-0.089]

34 (7) Remark 1 In the more general setting of a functional (3), the gradient is deduced (by linearity) from the above formula: ∇αV (α) = Z T 0 ∇x r(t, xt )zt dt + ∇x R(xT )zT . [sent-56, score-0.832]

35 The decision maker (call it the agent) that intends to design a good controller for the dynamical system may or may not know a model of the state dynamics f . [sent-58, score-0.422]

36 In case the dynamics is known, the state gradient zt = ∇α xt may be computed from (6) along the trajectory and the gradient of the performance measure w. [sent-59, score-1.475]

37 the parameter α is deduced at time T from (7), which allows to perform the gradient ascent step (5). [sent-62, score-0.325]

38 However, in this paper we consider a Reinforcement Learning (Sutton and Barto, 1998) setting in which the state dynamics is unknown from the agent, but we still assume that the state is fully observable. [sent-63, score-0.351]

39 The agent knows only the response of the system to its control. [sent-64, score-0.115]

40 To be more precise, the available information to the agent at time t is its own control policy πα and the trajectory (xs )0≤s≤t up to time t. [sent-65, score-0.499]

41 At time T , the agent receives the reward r(xT ) and, in this paper, we assume that the gradient ∇r(xT ) is available to the agent. [sent-66, score-0.5]

42 From this point of view, it seems impossible to derive the state gradient zt from (6), since ∇α f and ∇x f are unknown. [sent-67, score-0.51]

43 The term ∇x f (xt ) may be approximated by a least squares method from the observation of past states (xs )s≤t , as this will be explained later on in subsection 3. [sent-68, score-0.023]

44 In this paper, we introduce the idea of using stochastic policies to approximate the state (xt ) and the state gradient (zt ) by discrete-time stochastic processes (Xt∆ ) and (Zt∆ ) (with ∆ being some discretization time-step). [sent-71, score-0.656]

45 ∆ ∆ We prove the convergence (with probability one) of the gradient estimate ∇x r(XT )ZT derived from the stochastic processes to ∇αV (α) when ∆ → 0. [sent-73, score-0.302]

46 Here, almost sure convergence is obtained using the concentration of measure phenomenon (Talagrand, 1996; Ledoux, 2001). [sent-74, score-0.088]

47 773 M UNOS y ∆ XT ∆ X t2 ∆ Xt 0 fα ∆ x Xt 1 Figure 1: A trajectory (Xt∆ )0≤n≤N and the state dynamics vector fα of the continuous process n (xt )0≤t≤T . [sent-75, score-0.402]

48 It is worth mentioning that this strong convergence result contrasts with the usual likelihood ratio method (also called score method) in discrete time (see e. [sent-77, score-0.179]

49 (Reiman and Weiss, 1986; Glynn, 1987) or more recently in the reinforcement learning literature (Williams, 1992; Sutton et al. [sent-79, score-0.122]

50 , 2000; Baxter and Bartlett, 2001; Marbach and Tsitsiklis, 2003)) for which the policy gradient estimate is subject to variance explosion when the discretization time-step ∆ tends to 0. [sent-80, score-0.564]

51 The intuitive reason for that problem lies in the fact that the number of decisions before getting the reward grows to inﬁnity when ∆ → 0 (the variance of likelihood ratio estimates being usually linear with the number of decisions). [sent-81, score-0.292]

52 Consider the deterministic continuous process (xt )0≤t≤1 deﬁned by the state dynamics: dxt = fα := dt α 1−α , (8) (0 < α < 1) with initial condition x0 = (0 0)′ (where ′ denotes the transpose operator). [sent-83, score-0.358]

53 The performance measure V (α) is the reward at the terminal state at time T = 1, with the reward function being the ﬁrst coordinate of the state r((x y)′ ) := x. [sent-84, score-0.543]

54 Thus V (α) = r(xT =1 ) = α and its derivative is ∇αV (α) = 1. [sent-85, score-0.042]

55 Let (Xt∆ )0≤n≤N ∈ IR2 be a discrete time stochastic process (the discrete times being {tn = n ∆ n∆}n=0. [sent-86, score-0.155]

56 N with the discretization time-step ∆ = 1/N) that starts from initial state X0 = x0 = (0 0)′ and makes N random moves of length ∆ towards the right (action u1 ) or the top (action u2 ) (see Figure 1) according to the stochastic policy (i. [sent-89, score-0.468]

57 , the probability of choosing the actions in each state x) πα (u1 |x) = α, πα (u2 |x) = 1 − α. [sent-91, score-0.071]

58 This means that ||h − E[h]|| + e(N) ≥ ||XN − xN ||, thus P(||h − E[h]|| ≥ ε + e(N)) ≥ P(||XN − xN || ≥ ε), and we deduce from (31) that 2 /(2C 2 ) P(||XN − xN || ≥ ε) ≤ 2e−N(ε+e(N)) . [sent-95, score-0.096]

59 Now, from Borel-Cantelli lemma, we deduce that for all ε > 0, there exists Nε such that for all N ≥ Nε , ||XN − xN || < ε, which ∆→0 proves the almost sure convergence of XN to xN as N → ∞ (i. [sent-97, score-0.151]

60 Proof of Proposition 8 ′ First, note that Qt = X X ′ − X X is a symmetric, non-negative matrix, since it may be rewritten as 1 nt ∑ (Xs+ − X)(Xs+ − X)′ . [sent-101, score-0.101]

61 s∈S(t) In solving the least squares problem (21), we deduce b = ∆X + AX∆, thus min A,b 1 1 ∑ ∆Xs − b −A(Xs+2 ∆Xs )∆ nt s∈S(t) ≤ 2 = min A 1 ∑ ∆Xs − ∆X − A(Xs+ − X)∆ nt s∈S(t) 1 ∑ ∆Xs− ∆X− ∇x f (X, ut )(Xs+− X)∆ nt s∈S(t) 2 2 . [sent-102, score-0.822]

62 (32) Now, since Xs = X + O(∆) one may obtain like in (19) and (20) (by replacing Xt by X) that: ∆Xs − ∆X − ∇x f (X, ut )(Xs+ − X)∆ = O(∆3 ). [sent-103, score-0.4]

63 (33) We deduce from (32) and (33) that 1 nt ∑ ∇x f (Xt , ut ) − ∇x f (X, ut ) (Xs+ − X)∆ 2 = O(∆6 ). [sent-104, score-0.997]

64 s∈S(t) By developing each component, d ∑ ∇x f (Xt , ut ) − ∇x f (X, ut ) i=1 row i Qt ∇x f (Xt , ut ) − ∇x f (X, ut ) ′ row i = O(∆4 ). [sent-105, score-1.6]

65 Now, from the deﬁnition of ν(∆), for all vector u ∈ IRd , u′ Qt u ≥ ν(∆)||u||2 , thus ν(∆)||∇x f (Xt , ut ) − ∇x f (X, ut )||2 = O(∆4 ). [sent-106, score-0.8]

66 Condition (23) yields ∇x f (Xt , ut ) = ∇x f (X, ut ) + o(1), and since ∇x f (Xt , ut ) = ∇x f (X, ut ) + O(∆), we deduce lim ∇x f (Xt , ut ) = ∇x f (Xt , ut ). [sent-107, score-2.496]

67 Optimal control of a double inverted pendulum on a cart. [sent-124, score-0.067]

68 Approximate gradient methods in policy-space optimization of Markov reward processes. [sent-182, score-0.409]

69 Policy gradient methods for reinforcement learning with function approximation. [sent-212, score-0.335]

70 A Monte Carlo method for sensitivity analysis and parametric optimization of nonlinear stochastic systems. [sent-228, score-0.235]

similar papers computed by tfidf model

tfidf for this paper:

wordName wordTfidf (topN-words)

[('xt', 0.447), ('ut', 0.4), ('xs', 0.273), ('zt', 0.226), ('policy', 0.221), ('gradient', 0.213), ('pathwise', 0.21), ('dynamics', 0.209), ('reward', 0.172), ('ird', 0.14), ('reinforcement', 0.122), ('agent', 0.115), ('dxt', 0.105), ('kushner', 0.105), ('nt', 0.101), ('trajectory', 0.096), ('deduce', 0.096), ('xn', 0.095), ('dt', 0.092), ('unos', 0.089), ('stochastic', 0.089), ('discretization', 0.087), ('sensitivity', 0.074), ('state', 0.071), ('radient', 0.07), ('control', 0.067), ('sutton', 0.067), ('olicy', 0.059), ('reiman', 0.059), ('winter', 0.059), ('ascent', 0.059), ('ime', 0.059), ('ontinuous', 0.059), ('terminal', 0.057), ('qt', 0.057), ('sure', 0.055), ('polytechnique', 0.053), ('marbach', 0.053), ('deduced', 0.053), ('munos', 0.053), ('maker', 0.049), ('parametric', 0.048), ('un', 0.046), ('likelihood', 0.046), ('ratio', 0.046), ('action', 0.044), ('explosion', 0.043), ('derivative', 0.042), ('baxter', 0.04), ('dynamical', 0.04), ('surely', 0.038), ('derivation', 0.038), ('simulation', 0.037), ('deterministic', 0.037), ('policies', 0.036), ('concentration', 0.033), ('differential', 0.033), ('discrete', 0.033), ('yang', 0.031), ('irm', 0.03), ('intends', 0.03), ('appliqu', 0.03), ('glynn', 0.03), ('gobet', 0.03), ('henriksen', 0.03), ('matiques', 0.03), ('palaiseau', 0.03), ('remi', 0.03), ('thesen', 0.03), ('ledoux', 0.03), ('talagrand', 0.03), ('mentioning', 0.03), ('providence', 0.03), ('decisions', 0.028), ('functional', 0.027), ('chichester', 0.027), ('differentiate', 0.027), ('kelton', 0.027), ('discretize', 0.027), ('timestep', 0.027), ('transpose', 0.027), ('roberts', 0.027), ('continuous', 0.026), ('dependency', 0.025), ('mi', 0.025), ('parameterized', 0.025), ('yin', 0.024), ('springerverlag', 0.024), ('contrasts', 0.024), ('littman', 0.024), ('mcallester', 0.024), ('connectionist', 0.024), ('differentiation', 0.024), ('optimization', 0.024), ('squares', 0.023), ('calculus', 0.023), ('controller', 0.023), ('horizon', 0.023), ('trajectories', 0.023), ('mcgraw', 0.023)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 1.0000001 75 jmlr-2006-Policy Gradient in Continuous Time

Author: Rémi Munos

2 0.1913501 35 jmlr-2006-Geometric Variance Reduction in Markov Chains: Application to Value Function and Gradient Estimation

Author: Rémi Munos

Abstract: We study a variance reduction technique for Monte Carlo estimation of functionals in Markov chains. The method is based on designing sequential control variates using successive approximations of the function of interest $V$. Regular Monte Carlo estimates have a variance of $O(1/N)$, where $N$ is the number of sample trajectories of the Markov chain. Here, we obtain a geometric variance reduction $O(\rho^N)$ (with $\rho < 1$) up to a threshold that depends on the approximation error $V - \mathcal{A}V$, where $\mathcal{A}$ is an approximation operator linear in the values. Thus, if $V$ belongs to the right approximation space (i.e. $\mathcal{A}V = V$), the variance decreases geometrically to zero. An immediate application is value function estimation in Markov chains, which may be used for policy evaluation in a policy iteration algorithm for solving Markov Decision Processes. Another important domain, for which variance reduction is highly needed, is gradient estimation, that is, computing the sensitivity $\partial_\alpha V$ of the performance measure $V$ with respect to some parameter $\alpha$ of the transition probabilities. For example, in policy parametric optimization, computing an estimate of the policy gradient is required to perform a gradient optimization method. We show that, using two approximations for the value function and the gradient, a geometric variance reduction is also achieved, up to a threshold that depends on the approximation errors of both of those representations.

3 0.19001009 96 jmlr-2006-Worst-Case Analysis of Selective Sampling for Linear Classification

Author: Nicolò Cesa-Bianchi, Claudio Gentile, Luca Zaniboni

Abstract: A selective sampling algorithm is a learning algorithm for classification that, based on the past observed data, decides whether to ask the label of each new instance to be classified. In this paper, we introduce a general technique for turning linear-threshold classification algorithms from the general additive family into randomized selective sampling algorithms. For the most popular algorithms in this family we derive mistake bounds that hold for individual sequences of examples. These bounds show that our semi-supervised algorithms can achieve, on average, the same accuracy as that of their fully supervised counterparts, but using fewer labels. Our theoretical results are corroborated by a number of experiments on real-world textual data. The outcome of these experiments is essentially predicted by our theoretical results: Our selective sampling algorithms tend to perform as well as the algorithms receiving the true label after each classification, while observing in practice substantially fewer labels. Keywords: selective sampling, semi-supervised learning, on-line learning, kernel algorithms, linear-threshold classifiers

4 0.18560642 70 jmlr-2006-Online Passive-Aggressive Algorithms

Author: Koby Crammer, Ofer Dekel, Joseph Keshet, Shai Shalev-Shwartz, Yoram Singer

Abstract: We present a family of margin-based online learning algorithms for various prediction tasks. In particular we derive and analyze algorithms for binary and multiclass categorization, regression, uniclass prediction and sequence prediction. The update steps of our different algorithms are all based on analytical solutions to simple constrained optimization problems. This unified view allows us to prove worst-case loss bounds for the different algorithms and for the various decision problems based on a single lemma. Our bounds on the cumulative loss of the algorithms are relative to the smallest loss that can be attained by any fixed hypothesis, and as such are applicable to both realizable and unrealizable settings. We demonstrate some of the merits of the proposed algorithms in a series of experiments with synthetic and real data sets.
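To make the phrase "analytical solutions to simple constrained optimization problems" concrete, here is a minimal sketch of the basic passive-aggressive update for binary classification with hinge loss: the new weight vector is the closest one to the old vector that gives the current example a margin of at least 1. The function name `pa_update` and the plain-list vector representation are assumptions for illustration, and only the simplest variant (no aggressiveness parameter) is shown.

```python
def pa_update(w, x, y):
    """One passive-aggressive step for a label y in {-1, +1} and feature vector x."""
    margin = y * sum(wi * xi for wi, xi in zip(w, x))
    loss = max(0.0, 1.0 - margin)          # hinge loss on the current example
    sq_norm = sum(xi * xi for xi in x)

    # Passive case: the margin constraint already holds (or x carries no information).
    if loss == 0.0 or sq_norm == 0.0:
        return list(w)

    # Aggressive case: closed-form step size of the projection onto {w : y <w, x> >= 1}.
    tau = loss / sq_norm
    return [wi + tau * y * xi for wi, xi in zip(w, x)]


if __name__ == "__main__":
    w = pa_update([0.0, 0.0], [1.0, 2.0], 1)
    print(w)  # the updated vector attains margin 1 on the example
```

After the step the example sits exactly on the margin, which is the smallest change to the weights that removes the loss on it.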

5 0.15885414 37 jmlr-2006-Incremental Algorithms for Hierarchical Classification

Author: Nicolò Cesa-Bianchi, Claudio Gentile, Luca Zaniboni

Abstract: We study the problem of classifying data in a given taxonomy when classifications associated with multiple and/or partial paths are allowed. We introduce a new algorithm that incrementally learns a linear-threshold classifier for each node of the taxonomy. A hierarchical classification is obtained by evaluating the trained node classifiers in a top-down fashion. To evaluate classifiers in our multipath framework, we define a new hierarchical loss function, the H-loss, capturing the intuition that whenever a classification mistake is made on a node of the taxonomy, then no loss should be charged for any additional mistake occurring in the subtree of that node. Making no assumptions on the mechanism generating the data instances, and assuming a linear noise model for the labels, we bound the H-loss of our on-line algorithm in terms of the H-loss of a reference classifier knowing the true parameters of the label-generating process. We show that, in expectation, the excess cumulative H-loss grows at most logarithmically in the length of the data sequence. Furthermore, our analysis reveals the precise dependence of the rate of convergence on the eigenstructure of the data each node observes. Our theoretical results are complemented by a number of experiments on textual corpora. In these experiments we show that, after only one epoch of training, our algorithm performs much better than Perceptron-based hierarchical classifiers, and reasonably close to a hierarchical support vector machine. Keywords: incremental algorithms, online learning, hierarchical classification, second order perceptron, support vector machines, regret bound, loss function

6 0.14608203 28 jmlr-2006-Estimating the "Wrong" Graphical Model: Benefits in the Computation-Limited Setting

7 0.13455091 7 jmlr-2006-A Simulation-Based Algorithm for Ergodic Control of Markov Chains Conditioned on Rare Events

8 0.13072564 20 jmlr-2006-Collaborative Multiagent Reinforcement Learning by Payoff Propagation

9 0.11869781 10 jmlr-2006-Action Elimination and Stopping Conditions for the Multi-Armed Bandit and Reinforcement Learning Problems

10 0.11817406 86 jmlr-2006-Step Size Adaptation in Reproducing Kernel Hilbert Space

11 0.11208703 74 jmlr-2006-Point-Based Value Iteration for Continuous POMDPs

12 0.075260736 19 jmlr-2006-Causal Graph Based Decomposition of Factored MDPs

13 0.068949819 81 jmlr-2006-Some Discriminant-Based PAC Algorithms

14 0.066602327 30 jmlr-2006-Evolutionary Function Approximation for Reinforcement Learning

15 0.05982434 57 jmlr-2006-Linear State-Space Models for Blind Source Separation

16 0.059801664 29 jmlr-2006-Estimation of Gradients and Coordinate Covariation in Classification

17 0.058228254 90 jmlr-2006-Superior Guarantees for Sequential Prediction and Lossless Compression via Alphabet Decomposition

18 0.054241512 45 jmlr-2006-Learning Coordinate Covariances via Gradients

19 0.045557197 32 jmlr-2006-Expectation Correction for Smoothed Inference in Switching Linear Dynamical Systems

20 0.037403405 38 jmlr-2006-Incremental Support Vector Learning: Analysis, Implementation and Applications (Special Topic on Machine Learning and Optimization)

similar papers computed by lsi model

lsi for this paper:

topicId topicWeight

[(0, 0.271), (1, 0.531), (2, -0.042), (3, -0.044), (4, 0.165), (5, -0.059), (6, -0.057), (7, 0.062), (8, 0.006), (9, 0.054), (10, -0.063), (11, 0.049), (12, 0.057), (13, 0.053), (14, -0.008), (15, 0.062), (16, -0.055), (17, -0.008), (18, -0.033), (19, 0.08), (20, 0.028), (21, 0.034), (22, 0.009), (23, 0.008), (24, 0.003), (25, 0.045), (26, -0.009), (27, 0.077), (28, -0.09), (29, 0.034), (30, -0.059), (31, -0.063), (32, 0.028), (33, 0.021), (34, -0.03), (35, 0.043), (36, 0.001), (37, -0.093), (38, -0.037), (39, 0.015), (40, 0.008), (41, 0.016), (42, -0.064), (43, -0.023), (44, -0.085), (45, 0.026), (46, 0.001), (47, 0.048), (48, 0.023), (49, -0.05)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.98733419 75 jmlr-2006-Policy Gradient in Continuous Time

Author: Rémi Munos

2 0.61479986 35 jmlr-2006-Geometric Variance Reduction in Markov Chains: Application to Value Function and Gradient Estimation

Author: Rémi Munos

3 0.59322035 70 jmlr-2006-Online Passive-Aggressive Algorithms

Author: Koby Crammer, Ofer Dekel, Joseph Keshet, Shai Shalev-Shwartz, Yoram Singer

4 0.57251495 96 jmlr-2006-Worst-Case Analysis of Selective Sampling for Linear Classification

Author: Nicolò Cesa-Bianchi, Claudio Gentile, Luca Zaniboni

5 0.52537745 74 jmlr-2006-Point-Based Value Iteration for Continuous POMDPs

Author: Josep M. Porta, Nikos Vlassis, Matthijs T.J. Spaan, Pascal Poupart

Abstract: We propose a novel approach to optimize Partially Observable Markov Decision Processes (POMDPs) defined on continuous spaces. To date, most algorithms for model-based POMDPs are restricted to discrete states, actions, and observations, but many real-world problems such as, for instance, robot navigation, are naturally defined on continuous spaces. In this work, we demonstrate that the value function for continuous POMDPs is convex in the beliefs over continuous state spaces, and piecewise-linear convex for the particular case of discrete observations and actions but still continuous states. We also demonstrate that continuous Bellman backups are contracting and isotonic, ensuring the monotonic convergence of value-iteration algorithms. Relying on those properties, we extend the PERSEUS algorithm, originally developed for discrete POMDPs, to work in continuous state spaces by representing the observation, transition, and reward models using Gaussian mixtures, and the beliefs using Gaussian mixtures or particle sets. With these representations, the integrals that appear in the Bellman backup can be computed in closed form and, therefore, the algorithm is computationally feasible. Finally, we further extend PERSEUS to deal with continuous action and observation sets by designing effective sampling approaches. Keywords: planning under uncertainty, partially observable Markov decision processes, continuous state space, continuous action space, continuous observation space, point-based value iteration

6 0.51315266 7 jmlr-2006-A Simulation-Based Algorithm for Ergodic Control of Markov Chains Conditioned on Rare Events

7 0.48945683 37 jmlr-2006-Incremental Algorithms for Hierarchical Classification

8 0.46376732 10 jmlr-2006-Action Elimination and Stopping Conditions for the Multi-Armed Bandit and Reinforcement Learning Problems

9 0.44449708 86 jmlr-2006-Step Size Adaptation in Reproducing Kernel Hilbert Space

10 0.43203434 28 jmlr-2006-Estimating the "Wrong" Graphical Model: Benefits in the Computation-Limited Setting

11 0.34616461 20 jmlr-2006-Collaborative Multiagent Reinforcement Learning by Payoff Propagation

12 0.27748114 30 jmlr-2006-Evolutionary Function Approximation for Reinforcement Learning

13 0.26328281 90 jmlr-2006-Superior Guarantees for Sequential Prediction and Lossless Compression via Alphabet Decomposition

14 0.26084813 19 jmlr-2006-Causal Graph Based Decomposition of Factored MDPs

15 0.25868693 57 jmlr-2006-Linear State-Space Models for Blind Source Separation

16 0.20512572 32 jmlr-2006-Expectation Correction for Smoothed Inference in Switching Linear Dynamical Systems

17 0.19470616 81 jmlr-2006-Some Discriminant-Based PAC Algorithms

18 0.19091436 29 jmlr-2006-Estimation of Gradients and Coordinate Covariation in Classification

19 0.17540839 4 jmlr-2006-A Linear Non-Gaussian Acyclic Model for Causal Discovery

20 0.16375154 45 jmlr-2006-Learning Coordinate Covariances via Gradients

similar papers computed by lda model

lda for this paper:

topicId topicWeight

[(8, 0.013), (36, 0.054), (45, 0.048), (50, 0.035), (63, 0.023), (67, 0.58), (68, 0.014), (78, 0.012), (81, 0.018), (90, 0.026), (91, 0.025), (96, 0.029)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.85559654 75 jmlr-2006-Policy Gradient in Continuous Time

Author: Rémi Munos

2 0.52781582 24 jmlr-2006-Consistency of Multiclass Empirical Risk Minimization Methods Based on Convex Loss

Author: Di-Rong Chen, Tao Sun

Abstract: The consistency of classification algorithms plays a central role in statistical learning theory. A consistent algorithm guarantees us that taking more samples essentially suffices to roughly reconstruct the unknown distribution. We consider the consistency of the ERM scheme over classes of combinations of very simple rules (base classifiers) in multiclass classification. Our approach is, under some mild conditions, to establish a quantitative relationship between classification errors and convex risks. In comparison with the related previous work, the feature of our result is that the conditions are mainly expressed in terms of the differences between some values of the convex function. Keywords: multiclass classification, classifier, consistency, empirical risk minimization, constrained comparison method, Tsybakov noise condition

3 0.29497331 35 jmlr-2006-Geometric Variance Reduction in Markov Chains: Application to Value Function and Gradient Estimation

Author: Rémi Munos

4 0.2841526 19 jmlr-2006-Causal Graph Based Decomposition of Factored MDPs

Author: Anders Jonsson, Andrew Barto

Abstract: We present Variable Influence Structure Analysis, or VISA, an algorithm that performs hierarchical decomposition of factored Markov decision processes. VISA uses a dynamic Bayesian network model of actions, and constructs a causal graph that captures relationships between state variables. In tasks with sparse causal graphs VISA exploits structure by introducing activities that cause the values of state variables to change. The result is a hierarchy of activities that together represent a solution to the original task. VISA performs state abstraction for each activity by ignoring irrelevant state variables and lower-level activities. In addition, we describe an algorithm for constructing compact models of the activities introduced. State abstraction and compact activity models enable VISA to apply efficient algorithms to solve the stand-alone subtask associated with each activity. Experimental results show that the decomposition introduced by VISA can significantly accelerate construction of an optimal, or near-optimal, policy. Keywords: Markov decision processes, hierarchical decomposition, state abstraction

5 0.23125798 10 jmlr-2006-Action Elimination and Stopping Conditions for the Multi-Armed Bandit and Reinforcement Learning Problems

Author: Eyal Even-Dar, Shie Mannor, Yishay Mansour

Abstract: We incorporate statistical confidence intervals in both the multi-armed bandit and the reinforcement learning problems. In the bandit problem we show that, given n arms, it suffices to pull the arms a total of O((n/ε²) log(1/δ)) times to find an ε-optimal arm with probability of at least 1 − δ. This bound matches the lower bound of Mannor and Tsitsiklis (2004) up to constants. We also devise action elimination procedures in reinforcement learning algorithms. We describe a framework that is based on learning the confidence interval around the value function or the Q-function and eliminating actions that are not optimal (with high probability). We provide both model-based and model-free variants of the elimination method. We further derive stopping conditions guaranteeing that the learned policy is approximately optimal with high probability. Simulations demonstrate a considerable speedup and added robustness over ε-greedy Q-learning.
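The confidence-interval elimination idea can be sketched for Bernoulli arms as follows. This is a simplified illustration, not the exact procedure or constants analyzed in the paper: the Hoeffding radius, the stopping rule, the simulated arms, and the name `successive_elimination` are assumptions made here. Every surviving arm is pulled once per round, and an arm is dropped once its empirical mean falls more than two radii below the current leader.

```python
import math
import random


def successive_elimination(means, epsilon, delta, seed=0):
    """Return (chosen arm index, total pulls) for simulated Bernoulli arms."""
    rng = random.Random(seed)
    n = len(means)
    active = list(range(n))
    sums = [0.0] * n   # cumulative reward per arm
    t = 0              # rounds so far; every active arm has exactly t pulls
    pulls = 0

    while len(active) > 1:
        t += 1
        for a in active:
            sums[a] += 1.0 if rng.random() < means[a] else 0.0
            pulls += 1

        # Hoeffding radius; a union bound over arms and rounds keeps the
        # total failure probability below delta.
        radius = math.sqrt(math.log(4.0 * n * t * t / delta) / (2.0 * t))
        best = max(sums[a] / t for a in active)

        # Drop arms that are confidently worse than the empirical leader.
        active = [a for a in active if sums[a] / t >= best - 2.0 * radius]

        # Stop once the empirical leader is epsilon-optimal with high probability.
        if 2.0 * radius <= epsilon:
            break

    # All active arms share the same pull count, so comparing sums is enough.
    return max(active, key=lambda a: sums[a]), pulls


if __name__ == "__main__":
    arm, pulls = successive_elimination([0.2, 0.5, 0.8], epsilon=0.1, delta=0.05)
    print("selected arm:", arm, "after", pulls, "pulls")
```

Clearly inferior arms leave the active set early, so most of the sampling budget goes to arms that are hard to tell apart from the best one.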

6 0.19461519 74 jmlr-2006-Point-Based Value Iteration for Continuous POMDPs

7 0.19244058 30 jmlr-2006-Evolutionary Function Approximation for Reinforcement Learning

8 0.17528924 7 jmlr-2006-A Simulation-Based Algorithm for Ergodic Control of Markov Chains Conditioned on Rare Events

9 0.17267656 20 jmlr-2006-Collaborative Multiagent Reinforcement Learning by Payoff Propagation

10 0.15872999 48 jmlr-2006-Learning Minimum Volume Sets

11 0.14630952 58 jmlr-2006-Lower Bounds and Aggregation in Density Estimation

12 0.13869549 73 jmlr-2006-Pattern Recognition for Conditionally Independent Data

13 0.13868164 17 jmlr-2006-Bounds for the Loss in Probability of Correct Classification Under Model Based Approximation

14 0.1373671 29 jmlr-2006-Estimation of Gradients and Coordinate Covariation in Classification

15 0.13708484 70 jmlr-2006-Online Passive-Aggressive Algorithms

16 0.13600518 9 jmlr-2006-Accurate Error Bounds for the Eigenvalues of the Kernel Matrix

17 0.13482247 52 jmlr-2006-Learning Spectral Clustering, With Application To Speech Separation

18 0.13471158 66 jmlr-2006-On Model Selection Consistency of Lasso

19 0.13354963 90 jmlr-2006-Superior Guarantees for Sequential Prediction and Lossless Compression via Alphabet Decomposition

20 0.13311243 28 jmlr-2006-Estimating the "Wrong" Graphical Model: Benefits in the Computation-Limited Setting