jmlr jmlr2010 jmlr2010-4 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Evangelos Theodorou, Jonas Buchli, Stefan Schaal
Abstract: With the goal to generate more scalable algorithms with higher efficiency and fewer open parameters, reinforcement learning (RL) has recently moved towards combining classical techniques from optimal control and dynamic programming with modern learning techniques from statistical estimation theory. In this vein, this paper suggests using the framework of stochastic optimal control with path integrals to derive a novel approach to RL with parameterized policies. While solidly grounded in value function estimation and optimal control based on the stochastic Hamilton-Jacobi-Bellman (HJB) equations, policy improvements can be transformed into an approximation problem of a path integral which has no open algorithmic parameters other than the exploration noise. The resulting algorithm can be conceived of as model-based, semi-model-based, or even model-free, depending on how the learning problem is structured. The update equations have no danger of numerical instabilities as neither matrix inversions nor gradient learning rates are required. Our new algorithm demonstrates interesting similarities with previous RL research in the framework of probability matching and provides intuition why the slightly heuristically motivated probability matching approach can actually perform well. Empirical evaluations demonstrate significant performance improvements over gradient-based policy learning and scalability to high-dimensional control problems. Finally, a learning experiment on a simulated 12 degree-of-freedom robot dog illustrates the functionality of our algorithm in a complex robot learning scenario. We believe that Policy Improvement with Path Integrals (PI2) offers currently one of the most efficient, numerically robust, and easy to implement algorithms for RL based on trajectory roll-outs. Keywords: stochastic optimal control, reinforcement learning, parameterized policies
Reference: text
sentIndex sentText sentNum sentScore
1 While solidly grounded in value function estimation and optimal control based on the stochastic Hamilton-Jacobi-Bellman (HJB) equations, policy improvements can be transformed into an approximation problem of a path integral which has no open algorithmic parameters other than the exploration noise. [sent-7, score-0.494]
2 In the spirit of these latter ideas, this paper addresses a new method of probabilistic reinforcement learning derived from the framework of stochastic optimal control and path integrals, based on the original work of Kappen (2007) and Broek et al. [sent-24, score-0.392]
3 In stochastic optimal control (Stengel, 1994), the goal is to find the controls $u_t$ that minimize the value function: $V(x_{t_i}) = V_{t_i} = \min_{u_{t_i:t_N}} E_{\tau_i}\left[R(\tau_i)\right]$, (2) [sent-48, score-0.614]
4 where the expectation $E_{\tau_i}[\cdot]$ is taken over all trajectories starting at $x_{t_i}$. [sent-49, score-0.475]
5 If we need to emphasize a particular time, we denote it by ti , which also simplifies a transition to discrete time notation later. [sent-53, score-0.395]
6 , 2008) is that, since the weight control matrix R is inversely proportional to the variance of the noise, a high variance control input implies cheap control cost, while small variance control inputs have high control cost. [sent-69, score-0.49]
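One compact way to write the coupling described in this sentence (a sketch of the assumption used in the derivation, with $\Sigma_\varepsilon$ denoting the covariance of the exploration noise and $\lambda$ the scalar from the exponential transformation of the value function) is:

```latex
% Assumed noise/cost coupling behind the sentence above: the control-cost
% metric R and the exploration-noise covariance are tied by the scalar lambda,
% so directions with large noise variance are exactly the cheap-to-control directions.
\lambda \, R^{-1} = \Sigma_\varepsilon
```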
7 Applying the Feynman-Kac theorem, the solution of (9) is: $\Psi_{t_i} = E_{\tau_i}\left[\Psi_{t_N}\, e^{-\frac{1}{\lambda}\int_{t_i}^{t_N} q_t\, dt}\right] = E_{\tau_i}\left[\exp\left(-\frac{1}{\lambda}\phi_{t_N} - \frac{1}{\lambda}\int_{t_i}^{t_N} q_t\, dt\right)\right]$. (10) [sent-77, score-1.314]
8 With a view towards a discrete time approximation, which will be needed for numerical implementations, the solution (10) can be formulated as: $\Psi_{t_i} = \lim_{dt \to 0} \int p(\tau_i \mid x_i)\, \exp\left(-\frac{1}{\lambda}\left[\phi_{t_N} + \sum_{j=i}^{N-1} q_{t_j}\, dt\right]\right) d\tau_i$, (11) [sent-79, score-0.395]
9 where $\tau_i = (x_{t_i}, \ldots, x_{t_N})$ is a sample path (or trajectory piece) starting at state $x_{t_i}$ and the term $p(\tau_i \mid x_i)$ is the probability of sample path $\tau_i$ conditioned on the start state $x_{t_i}$. [sent-84, score-1.351]
10 Since Equation (11) provides the exponentiated cost-to-go $\Psi_{t_i}$ in state $x_{t_i}$, the integration above is taken with respect to sample paths $\tau_i = (x_{t_i}, x_{t_{i+1}}, \ldots, x_{t_N})$. [sent-85, score-0.538]
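As a rough numerical illustration of (10)-(11), the sketch below estimates the exponentiated cost-to-go by sampling paths from the passive dynamics and averaging their exponentiated path costs. The Euler discretization, the callable names f, G, q, phi, and the scalar noise level are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

def estimate_exp_cost_to_go(x0, f, G, q, phi, lam, dt, N, n_samples, noise_std, rng):
    """Monte Carlo sketch of Psi(x_{t_i}) ~ E[exp(-(phi(x_{t_N}) + sum_j q(x_{t_j}) dt) / lam)],
    with sample paths drawn from the passive dynamics dx = f(x) dt + G(x) sqrt(dt) eps."""
    psi = np.empty(n_samples)
    for k in range(n_samples):
        x = np.array(x0, dtype=float)
        running_cost = 0.0
        for _ in range(N):
            running_cost += q(x) * dt
            eps = rng.normal(0.0, noise_std, size=G(x).shape[1])
            x = x + f(x) * dt + G(x) @ (np.sqrt(dt) * eps)
        psi[k] = np.exp(-(phi(x) + running_cost) / lam)
    return psi.mean()
```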
11 Subsequently, the passive dynamics term and the control transition matrix can be partitioned as $f_t = \begin{bmatrix} f^{(m)}_t \\ f^{(c)}_t \end{bmatrix}$ with $f^{(m)} \in \Re^{k \times 1}$, $f^{(c)} \in \Re^{l \times 1}$, and $G_t = \begin{bmatrix} 0_{k \times p} \\ G^{(c)}_t \end{bmatrix}$ with $G^{(c)}_t \in \Re^{l \times p}$. [sent-103, score-0.519]
12 The discretized state space representation of such systems is given as: $x_{t_{i+1}} = x_{t_i} + f_{t_i}\, dt + G_{t_i}\left(u_{t_i}\, dt + \sqrt{dt}\,\varepsilon_{t_i}\right)$, or, in partitioned vector form: $\begin{bmatrix} x^{(m)}_{t_{i+1}} \\ x^{(c)}_{t_{i+1}} \end{bmatrix} = \begin{bmatrix} x^{(m)}_{t_i} \\ x^{(c)}_{t_i} \end{bmatrix} + \begin{bmatrix} f^{(m)}_{t_i} \\ f^{(c)}_{t_i} \end{bmatrix} dt + \begin{bmatrix} 0_{k \times p} \\ G^{(c)}_{t_i} \end{bmatrix}\left(u_{t_i}\, dt + \sqrt{dt}\,\varepsilon_{t_i}\right)$. (12) [sent-104, score-2.396]
13 Essentially, the stochastic dynamics are partitioned into controlled equations, in which the state $x^{(c)}_{t_{i+1}}$ is directly actuated, and uncontrolled equations, in which the state $x^{(m)}_{t_{i+1}}$ is not directly actuated. [sent-105, score-0.605]
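A minimal sketch of one integration step of the partitioned form of (12), assuming (as in the equation above) that control and noise enter only through the directly actuated block; variable names are illustrative.

```python
import numpy as np

def partitioned_euler_step(x_m, x_c, f_m, f_c, G_c, u, eps, dt):
    """One Euler step of the partitioned dynamics (12): only the directly actuated
    block x_c is driven by the control u and the exploration noise eps."""
    x_m_next = x_m + f_m * dt                                       # uncontrolled part
    x_c_next = x_c + f_c * dt + G_c @ (u * dt + np.sqrt(dt) * eps)  # controlled part
    return x_m_next, x_c_next
```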
14 $p(\tau_i \mid x_{t_i}) = p\left(x_{t_{i+1}}, \ldots, x_{t_N} \mid x_{t_i}\right) = \prod_{j=i}^{N-1} p\left(x_{t_{j+1}} \mid x_{t_j}\right)$, where we exploited the fact that the start state $x_{t_i}$ of a trajectory is given and does not contribute to its probability. [sent-112, score-0.692]
15 For all practical purposes,2 the transition probability of the stochastic dynamics is reduced to the transition probability of the directly actuated part of the state: $p(\tau_i \mid x_{t_i}) = \prod_{j=i}^{N-1} p\left(x_{t_{j+1}} \mid x_{t_j}\right) \propto \prod_{j=i}^{N-1} p\left(x^{(c)}_{t_{j+1}} \mid x_{t_j}\right)$. [sent-114, score-0.471]
16 Combining (15) and (14), the probability of a sample path becomes $p(\tau_i \mid x_{t_i}) = \prod_{j=i}^{N-1} \frac{1}{\left((2\pi)^{l} |\Sigma_{t_j}|\right)^{1/2}} \exp\left(-\frac{1}{2}\sum_{j=i}^{N-1} \left\| x^{(c)}_{t_{j+1}} - x^{(c)}_{t_j} - f^{(c)}_{t_j}\, dt \right\|^2_{\Sigma_{t_j}^{-1}}\right)$. [sent-120, score-0.496]
17 Thus, we obtain: $p(\tau_i \mid x_{t_i}) \propto \exp\left(-\frac{1}{2\lambda}\sum_{j=i}^{N-1} \left\| \frac{x^{(c)}_{t_{j+1}} - x^{(c)}_{t_j}}{dt} - f^{(c)}_{t_j} \right\|^2_{H^{-1}_{t_j}} dt\right)$. [sent-122, score-0.962]
18 The path cost $\tilde{S}(\tau_i)$ is a generalized version of the path cost in Kappen (2005a) and Kappen (2007), which only considered systems with state independent control transition$^4$ $G_{t_i}$. [sent-140, score-0.852]
19 The equations in boxes (18), (19) and (20) form the solution for the generalized path integral stochastic optimal control problem. [sent-147, score-0.398]
20 Second, trajectories could be generated by a real system, and the noise $\varepsilon_i$ would be computed from the difference between the actual and the predicted system behavior, that is, $G^{(c)}_{t_i} \varepsilon_i = \dot{x}_{t_i} - \dot{\hat{x}}_{t_i} = \dot{x}_{t_i} - \left(f_{t_i} + G_{t_i} u_{t_i}\right)$. [sent-153, score-1.547]
21 Computing the prediction $\dot{\hat{x}}_{t_i}$ also requires a model of the system dynamics. [sent-154, score-0.382]
22 SYSTEMS WITH ONE-DIMENSIONAL DIRECTLY ACTUATED STATE: The generalized formulation of stochastic optimal control with path integrals in Table 1 can be applied to a variety of stochastic dynamical systems with different types of control transition matrices. [sent-164, score-0.557]
23 The control transition matrix thus becomes a row vector $G_{t_i} = g^{(c)T}_{t_i} \in \Re^{1 \times p}$. [sent-167, score-0.592]
24 According to (20), the local controls for such systems are expressed as follows: $u_L(\tau_i) = \frac{R^{-1} g^{(c)}_{t_i}}{g^{(c)T}_{t_i} R^{-1} g^{(c)}_{t_i}} \left(g^{(c)T}_{t_i} \varepsilon_{t_i} - b_{t_i}\right)$. [sent-168, score-1.999]
25 Since the directly actuated part of the state is 1D, the vector $x^{(c)}_{t_i}$ collapses into the scalar $x^{(c)}_{t_i}$, which appears in the partial differentiation above. [sent-180, score-1.035]
26 In the case that $g_{t_i}$ does not depend on $x^{(c)}_{t_i}$, the differentiation with respect to $x^{(c)}_{t_i}$ results in zero and the local controls simplify to: $u_L(\tau_i) = \frac{R^{-1} g^{(c)}_{t_i}\, g^{(c)T}_{t_i}}{g^{(c)T}_{t_i} R^{-1} g^{(c)}_{t_i}}\, \varepsilon_{t_i}$. [sent-181, score-1.426]
27 PARTIALLY ACTUATED STATE: The generalized formula of the local controls (20) was derived for the case where the control transition matrix is state dependent and its dimensionality is $G^{(c)}_t \in \Re^{l \times p}$ with $l < n$ and $p$ the dimensionality of the control. [sent-184, score-1.014]
28 Our generalized formulation allows a broader application of path integral control in areas like robotics and other control systems, where the control transition matrix is typically partitioned into directly and non-directly actuated states, and typically also state dependent. [sent-195, score-0.656]
29 Reinforcement Learning with Parameterized Policies: Equipped with the theoretical framework of stochastic optimal control with path integrals, we can now turn to its application to reinforcement learning with parameterized policies. [sent-204, score-0.427]
30 The path integral approach from the previous sections also follows the classical time-based optimal control strategy, as can be seen from the time dependent solution for optimal controls in (33). [sent-213, score-0.401]
31 (2008) and applied in Koeber and Peters (2008), where the stochastic policy $p(a_{t_i} \mid x_{t_i})$ is linearly parameterized as: $a_{t_i} = g^T_{t_i}(\theta + \varepsilon_{t_i})$, (25) with $g_{t_i}$ denoting a vector of basis functions and $\theta$ the parameter vector. [sent-228, score-0.685]
32 This policy has state dependent noise, which can contribute to faster learning as the signal-to-noise ratio becomes adaptive since it is a function of gti . [sent-229, score-0.601]
33 For Gaussian noise $\varepsilon$ the probability of an action is $p(a_{t_i} \mid x_{t_i}) = \mathcal{N}\left(\theta^T g_{t_i},\, \Sigma_{t_i}\right)$ with $\Sigma_{t_i} = g^T_{t_i} \Sigma_\varepsilon g_{t_i}$. [sent-231, score-0.935]
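A small sketch of drawing an action from the parameterized policy (25), together with the equivalent Gaussian view with state-dependent variance mentioned above; names and shapes are illustrative assumptions.

```python
import numpy as np

def sample_action(g, theta, Sigma_eps, rng):
    """Sample a ~ p(a|x) for the policy a = g^T (theta + eps), eps ~ N(0, Sigma_eps)."""
    eps = rng.multivariate_normal(np.zeros(len(theta)), Sigma_eps)
    a = g @ (theta + eps)
    # Equivalent Gaussian view: a ~ N(theta^T g, g^T Sigma_eps g), i.e. the effective
    # action noise (and hence the signal-to-noise ratio) depends on the basis vector g.
    mean, var = g @ theta, g @ Sigma_eps @ g
    return a, mean, var
```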
34 Comparing the policy formulation in (25) with the control term in (3), one recognizes that the control policy formulation (25) should fit into the framework of path integral optimal control. [sent-234, score-0.52]
35 In contrast to the previous example, the parameterized policy generates the desired trajectory in (29), and the differential equation for the desired trajectory is compatible with the path integral formalism. [sent-253, score-0.567]
36 What we would like to emphasize is that the control system’s structure is left to the creativity of its designer, and that path integral optimal control can be applied on various levels. [sent-254, score-0.416]
37 3, only the controlled differential equations of the entire control system contribute to the path integral formalism, that is, (28) in the first example, or (29) in the second example. [sent-256, score-0.405]
38 In the example of (28), the dynamics model of the control system needs to be known to apply path integral optimal control, as this is a controlled differential equation. [sent-259, score-0.451]
39 , 2003) as a special case of parameterized policies, which are expressed by the differential equations: $\frac{1}{\tau}\dot{z}_t = f_t + g^T_t(\theta + \varepsilon_t)$, $\frac{1}{\tau}\dot{y}_t = z_t$, $\frac{1}{\tau}\dot{x}_t = -\alpha x_t$, with $f_t = \alpha_z\left(\beta_z(g - y_t) - z_t\right)$. (30) [sent-266, score-0.429]
40 The state of the DMP is partitioned into the controlled part $x^{(c)}_t = y_t$ and the uncontrolled part $x^{(m)}_t = (x_t\ z_t)^T$. [sent-284, score-0.547]
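A compact sketch of one Euler step of the DMP equations (30); the default gains and the basis-function handle are illustrative assumptions rather than the paper's exact settings.

```python
import numpy as np

def dmp_step(z, y, x, theta, eps, goal, tau, dt, basis,
             alpha_z=25.0, beta_z=6.25, alpha_x=8.33):
    """One Euler step of the DMP (30): (1/tau) z' = f + g^T(theta + eps),
    (1/tau) y' = z, (1/tau) x' = -alpha_x x, with f = alpha_z (beta_z (goal - y) - z)."""
    g_t = basis(x)                              # vector of basis activations g_t
    f_t = alpha_z * (beta_z * (goal - y) - z)
    z_next = z + tau * (f_t + g_t @ (theta + eps)) * dt
    y_next = y + tau * z * dt
    x_next = x + tau * (-alpha_x * x) * dt
    return z_next, y_next, x_next
```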
41 The path cost for this parameterization takes the form $\tilde{S}(\tau_i) = \phi_{t_N} + \sum_{j=i}^{N-1} q_{t_j}\, dt + \cdots + \frac{\lambda}{2}\sum_{j=i}^{N-1} \log|H_{t_j}|$, with $M_{t_j} = \frac{R^{-1} g_{t_j} g^T_{t_j}}{g^T_{t_j} R^{-1} g_{t_j}}$. [sent-287, score-0.578]
42 The correction parameter vector $\delta\theta_{t_i}$ is defined as $\delta\theta_{t_i} = \int P(\tau_i)\, \frac{R^{-1} g_{t_i} g^T_{t_i}}{g^T_{t_i} R^{-1} g_{t_i}}\, \varepsilon_{t_i}\, d\tau_i$. (34) [sent-303, score-0.914]
43 It is important to note that $\theta^{(new)}_{t_i}$ is now time dependent, that is, for every time step $t_i$, a different optimal parameter vector is computed. [sent-304, score-1.294]
44 However, there would be a second update term due to the average over projected mean parameters θ from every time step—it should be noted that Mti is a projection matrix onto the range space of gti under the metric R−1 , such that a multiplication with Mti can only shrink the norm of θ. [sent-310, score-0.479]
45 However, this irrelevant component will not prevent us from reaching the optimal effective solution, that is, the solution that lies in the range space of gti . [sent-316, score-0.479]
46 Essentially, (35) computes a discrete probability at time ti of each trajectory roll-out with the help of the cost (36). [sent-321, score-0.497]
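A sketch of the discrete probability computation described in this sentence; shifting the costs by their minimum is a standard numerical-stability device assumed here, and it leaves the normalized probabilities unchanged.

```python
import numpy as np

def rollout_probabilities(S, lam):
    """P(tau_{i,k}) = exp(-S_k / lam) / sum_k' exp(-S_k' / lam) over the K roll-outs
    at one time step, where S is a length-K array of path costs S(tau_{i,k})."""
    S = np.asarray(S, dtype=float)
    e = np.exp(-(S - S.min()) / lam)   # subtracting min(S) rescales numerator and denominator alike
    return e / e.sum()
```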
47 To be precise, θ would be projected and continue shrinking until it lies in the intersection of all null spaces of the gti basis function—this null space can easily be of measure zero. [sent-323, score-0.457]
48 Related Work: In the next sections we discuss related work in the areas of stochastic optimal control and reinforcement learning and analyze the connections and differences with the PI2 algorithm and the generalized path integral control formulation. [sent-342, score-0.554]
49 (2008), the path integral formalism is extended for the stochastic optimal control of multi-agent systems. [sent-349, score-0.393]
50 25) – The basis function gti from the system dynamics (cf. [sent-361, score-0.533]
51 (N - 1), compute: $\delta\theta_{t_i} = \sum_{k=1}^{K}\left[P(\tau_{i,k})\, M_{t_i,k}\, \varepsilon_{t_i,k}\right]$ – Compute $[\delta\theta]_j = \frac{\sum_{i=0}^{N-1} (N-i)\, w_{j,t_i}\, [\delta\theta_{t_i}]_j}{\sum_{i=0}^{N-1} w_{j,t_i}\, (N-i)}$ – Update $\theta \leftarrow \theta + \delta\theta$ – Create one noiseless roll-out to check the trajectory cost $R = \phi_{t_N} + \sum_{i=0}^{N-1} r_{t_i}$. [sent-370, score-0.532]
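A condensed sketch of the update steps listed in this pseudocode fragment, assuming the K roll-outs and their probabilities P(tau_{i,k}), projected noises M_{t_i,k} eps_{t_i,k}, and per-parameter temporal weights w_{j,t_i} have already been computed; array names and shapes are illustrative.

```python
import numpy as np

def pi2_parameter_update(theta, P, M_eps, w):
    """Sketch of the PI2 parameter update.

    theta : (p,)      current parameter vector
    P     : (N, K)    probabilities P(tau_{i,k}) of roll-out k at time step i
    M_eps : (N, K, p) precomputed products M_{t_i,k} @ eps_{t_i,k}
    w     : (N, p)    per-parameter temporal weights w_{j,t_i} (e.g., basis activations)
    """
    N, K, p = M_eps.shape
    # delta_theta_{t_i} = sum_k P(tau_{i,k}) M_{t_i,k} eps_{t_i,k}
    dtheta_t = np.einsum('ik,ikp->ip', P, M_eps)       # (N, p)
    # temporal averaging with weights (N - i) * w_{j,t_i}, as in the fragment above
    time_w = (N - np.arange(N))[:, None] * w           # (N, p)
    dtheta = (time_w * dtheta_t).sum(axis=0) / time_w.sum(axis=0)
    return theta + dtheta
```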
52 Furthermore, it is shown that the class of discrete KL divergence control problems is equivalent to the continuous stochastic optimal control formalism with quadratic control cost and in the presence of Gaussian noise. [sent-374, score-0.435]
53 In all of this aforementioned work, both in the path integral formalism as well as in KL divergence control, the class of stochastic dynamical systems under consideration is rather restrictive, since the control transition matrix is state independent. [sent-377, score-0.473]
54 Using the notation of this paper, the parameter update of PoWER becomes: $\delta\theta = \left(E_{\tau_0}\left[\sum_{i=0}^{N-1} \frac{g_{t_i} g^T_{t_i}}{g^T_{t_i} g_{t_i}}\, R_{t_i}\right]\right)^{-1} E_{\tau_0}\left[\sum_{t_i=t_0}^{t_N} R_{t_i}\, \frac{g_{t_i} g^T_{t_i}}{g^T_{t_i} g_{t_i}}\, \varepsilon_{t_i}\right]$, where $R_{t_i} = \sum_{j=i}^{N-1} r_{t_j}$. [sent-429, score-2.119]
55 If we set $R^{-1} = c\,\mathbf{I}$ in the update (37) of PI2, and set $g^T_{t_i} g_{t_i} = \mathbf{I}$ in the matrix inversion term of (39), the two algorithms look essentially identical. [sent-430, score-1.195]
56 Conclusions: The path integral formalism for stochastic optimal control has a very interesting potential to discover new learning algorithms for reinforcement learning. [sent-657, score-0.479]
57 The term $\tilde{S}(\tau_i)$ is a path function defined as $\tilde{S}(\tau_i) = S(\tau_i) + \frac{\lambda}{2}\sum_{j=i}^{N-1} \log|H_{t_j}|$ that satisfies the following condition: $\lim_{dt \to 0}\int \exp\left(-\frac{1}{\lambda}\tilde{S}(\tau_i)\right) d\tau_i \in C^{(1)}$ for any sampled trajectory starting from state $x_{t_i}$. [sent-674, score-0.749]
58 Moreover, the term $H_{t_j}$ is given by $H_{t_j} = G^{(c)}_{t_j} R^{-1} G^{(c)T}_{t_j}$, while the term $S(\tau_i)$ is defined according to $S(\tau_i) = \phi_{t_N} + \sum_{j=i}^{N-1} q_{t_j}\, dt + \frac{1}{2}\sum_{j=i}^{N-1} \left\| \frac{x^{(c)}_{t_{j+1}} - x^{(c)}_{t_j}}{dt} - f^{(c)}_{t_j} \right\|^2_{H^{-1}_{t_j}} dt$. [sent-675, score-0.782]
59 Proof: The optimal controls at the state $x_{t_i}$ are expressed by the equation $u_{t_i} = -R^{-1} G^T_{t_i}\, \nabla_{x_{t_i}} V_{t_i}$. [sent-676, score-0.715]
60 Under the assumption that the term $\int \exp\left(-\frac{1}{\lambda}\tilde{S}(\tau_i)\right) d\tau_i$ is continuously differentiable in $x_{t_i}$ and $dt$, we can change the order of the integral and the differentiation operations. [sent-685, score-0.764]
61 The denominator is a function of the current state $x_{t_i}$ and thus it can be pushed inside the integral of the numerator: $\int \frac{\exp\left(-\frac{1}{\lambda}\tilde{S}(\tau_i)\right)}{\int \exp\left(-\frac{1}{\lambda}\tilde{S}(\tau_i)\right) d\tau_i}\, \nabla_{x_{t_i}}\left(-\frac{1}{\lambda}\tilde{S}(\tau_i)\right) d\tau_i$. [sent-689, score-0.61]
62 By using these equations we will have that: $u_{t_i} = \lim_{dt \to 0} \int \tilde{p}(\tau_i)\left(-R^{-1}\,[0^T \;\; G^{(c)T}_{t_i}]\right)\begin{bmatrix} \nabla_{x^{(m)}_{t_i}} \tilde{S}(\tau_i) \\ \nabla_{x^{(c)}_{t_i}} \tilde{S}(\tau_i) \end{bmatrix} d\tau_i$. [sent-694, score-1.304]
63 The equation above can be written in the form: $u_{t_i} = \lim_{dt \to 0}\left(-R^{-1}\,[0^T \;\; G^{(c)T}_{t_i}]\right)\begin{bmatrix} \int \tilde{p}(\tau_i)\, \nabla_{x^{(m)}_{t_i}} \tilde{S}(\tau_i)\, d\tau_i \\ \int \tilde{p}(\tau_i)\, \nabla_{x^{(c)}_{t_i}} \tilde{S}(\tau_i)\, d\tau_i \end{bmatrix}$. [sent-696, score-1.67]
64 Therefore we will have the result: $u_{t_i} = \lim_{dt \to 0}\left(-R^{-1} G^{(c)T}_{t_i} \int \tilde{p}(\tau_i)\, \nabla_{x^{(c)}_{t_i}} \tilde{S}(\tau_i)\, d\tau_i\right)$. [sent-697, score-0.656]
65 More precisely, we have shown that $\tilde{S}(\tau_i) = \phi_{t_N} + \sum_{j=i}^{N-1} q_{t_j}\, dt + \frac{1}{2}\sum_{j=i}^{N-1} \left\| \frac{x^{(c)}_{t_{j+1}} - x^{(c)}_{t_j}}{dt} - f^{(c)}_{t_j} \right\|^2_{H^{-1}_{t_j}} dt + \frac{\lambda}{2}\sum_{j=i}^{N-1} \log|H_{t_j}|$. [sent-703, score-1.002]
66 By calculating the term $\nabla_{x^{(c)}_{t_i}} \tilde{S}(\tau_i)$ we can find the local controls $u(\tau_i)$. [sent-708, score-0.777]
67 DERIVATIVE OF THE SECOND TERM $\nabla_{x^{(c)}_{t_i}} \frac{1}{2dt}\sum_{j=i}^{N-1}\gamma_{t_j}$ OF THE COST: The second term can be found as follows: $\nabla_{x^{(c)}_{t_i}} \frac{1}{2dt}\sum_{j=i}^{N-1}\gamma_{t_j}$. [sent-711, score-1.074]
68 Terms that do not depend on $x^{(c)}_{t_i}$ drop, and thus we will have: $\frac{1}{2dt}\nabla_{x^{(c)}_{t_i}} \gamma_{t_i}$. [sent-714, score-0.557]
69 Substitution of the parameter $\gamma_{t_i} = \alpha^T_{t_i} H^{-1}_{t_i} \alpha_{t_i}$ will result in: $\frac{1}{2dt}\nabla_{x^{(c)}_{t_i}}\left(\alpha^T_{t_i} H^{-1}_{t_i} \alpha_{t_i}\right)$. [sent-715, score-0.454]
70 By making the substitution $\beta_{t_i} = H^{-1}_{t_i}\alpha_{t_i}$ and applying the rule $\nabla\left(u(x)^T v(x)\right) = \nabla(u(x))\, v(x) + \nabla(v(x))\, u(x)$, we will have that: $\frac{1}{2dt}\left(\nabla_{x^{(c)}_{t_i}}(\alpha_{t_i})\,\beta_{t_i} + \nabla_{x^{(c)}_{t_i}}(\beta_{t_i})\,\alpha_{t_i}\right)$. (40) [sent-716, score-0.454]
71 Next we find the derivative of $\alpha_{t_i}$: $\nabla_{x^{(c)}_{t_i}}\alpha_{t_i} = \nabla_{x^{(c)}_{t_i}}\left(x^{(c)}_{t_{i+1}} - x^{(c)}_{t_i} - f^{(c)}(x_{t_i})\, dt\right)$, [sent-717, score-1.17]
72 and the result is $\nabla_{x^{(c)}_{t_i}}\alpha_{t_i} = -I_{l \times l} - \nabla_{x^{(c)}_{t_i}} f^{(c)}_{t_i}\, dt$. [sent-718, score-0.812]
73 We substitute back into (40) and we will have: $\frac{1}{2dt}\left(-\left(I_{l \times l} + \nabla_{x^{(c)}_{t_i}} f^{(c)}_{t_i}\, dt\right)\beta_{t_i} + \nabla_{x^{(c)}_{t_i}}(\beta_{t_i})\,\alpha_{t_i}\right)$. [sent-719, score-2.422]
74 After some algebra, the result of $\nabla_{x^{(c)}_{t_i}} \frac{1}{2dt}\sum_{j=i}^{N-1}\gamma_{t_j}$ is expressed as: $-\frac{1}{2dt}\beta_{t_i} - \frac{1}{2}\nabla_{x^{(c)}_{t_i}} f^{(c)}_{t_i}\,\beta_{t_i} + \frac{1}{2dt}\nabla_{x^{(c)}_{t_i}}(\beta_{t_i})\,\alpha_{t_i}$. [sent-721, score-0.473]
75 The next step now is to find the limit of the expression above as $dt \to 0$. [sent-722, score-1.39]
76 LIMIT OF THE FIRST SUBTERM $-\frac{1}{2dt}\beta_{t_i}$: We will continue our analysis by finding the limit for each one of the 3 terms above. [sent-724, score-1.982]
77 The limit of the first term is calculated as follows: $\lim_{dt \to 0}\left(-\frac{1}{2dt}\beta_{t_i}\right) = -\lim_{dt \to 0}\frac{1}{2dt} H^{-1}_{t_i}\alpha_{t_i} = -\frac{1}{2} H^{-1}_{t_i} \lim_{dt \to 0}\left(\frac{x^{(c)}_{t_{i+1}} - x^{(c)}_{t_i}}{dt} - f^{(c)}_{t_i}\right)$. LIMIT OF THE SECOND SUBTERM $-\frac{1}{2}\nabla_{x^{(c)}_{t_i}} f^{(c)}_{t_i}\,\beta_{t_i}$: [sent-725, score-2.052]
78 The limit of the second term is calculated as follows: $\lim_{dt \to 0}\left(-\frac{1}{2}\nabla_{x^{(c)}_{t_i}} f^{(c)}_{t_i}\,\beta_{t_i}\right) = -\frac{1}{2}\nabla_{x^{(c)}_{t_i}} f^{(c)}_{t_i}\, \lim_{dt \to 0}\beta_{t_i} = -\frac{1}{2}\nabla_{x^{(c)}_{t_i}} f^{(c)}_{t_i}\, H^{-1}_{t_i}\lim_{dt \to 0}\alpha_{t_i} = 0$. [sent-726, score-2.892]
79 The limit of the term $\lim_{dt \to 0}\alpha_{t_i}$ is derived as: $\lim_{dt \to 0}\left(x^{(c)}_{t_{i+1}} - x^{(c)}_{t_i} - f^{(c)}(x_{t_i})\, dt\right) = \lim_{dt \to 0}\left(x^{(c)}_{t_i + dt} - x^{(c)}_{t_i}\right) - \lim_{dt \to 0} f^{(c)}(x_{t_i})\, dt = 0 - 0 = 0$. LIMIT OF THE THIRD SUBTERM $\frac{1}{2dt}\nabla_{x^{(c)}_{t_i}}(\beta_{t_i})\,\alpha_{t_i}$: [sent-727, score-1.157]
80 Finally, the limit of the third term can be found as: $\lim_{dt \to 0}\frac{1}{2dt}\nabla_{x^{(c)}_{t_i}}(\beta_{t_i})\,\alpha_{t_i} = \lim_{dt \to 0}\nabla_{x^{(c)}_{t_i}}(\beta_{t_i})\, \lim_{dt \to 0}\frac{1}{2dt}\alpha_{t_i} = \lim_{dt \to 0}\nabla_{x^{(c)}_{t_i}}(\beta_{t_i})\, \frac{1}{2}\lim_{dt \to 0}\left(\frac{x^{(c)}_{t_{i+1}} - x^{(c)}_{t_i}}{dt} - f^{(c)}_{t_i}\right)$. [sent-728, score-2.851]
81 2 dt→0 dt We substitute βti = Ht−1 αti and write the matrix Ht−1 in row form: i i 3173 T HEODOROU , B UCHLI AND S CHAAL 1 (c) (c) 1 (c) = = lim ∇x(c) Ht−1 αti lim (xti − xti ) − fti i ti dt→0 2 dt→0 dt (1)−T Hti (2)−T H ti . [sent-729, score-1.872]
82 αt 1 lim (xt(c) − xt(c) ) 1 − ft(c) = lim ∇x(c) i i i+1 i 2 dt→0 ti dt→0 dt . [sent-730, score-0.744]
83 ti (1)−T ∇T(c) Hti αti xti −T T ∇ (c) Ht(2) αti i xti . [sent-737, score-1.266]
84 (l)−T T αti ∇ (c) Hti xti 1 (c) (c) 1 (c) lim 2 dt→0 (xti+1 − xti ) dt − fti . [sent-740, score-1.307]
85 We again use the rule ∇ u(x)T v(x) = ∇ (u(x)) v(x) + ∇ (v(x)) u(x) and thus we will have: = lim dt→0 (1)−T ∇x(c) Hti ti (2)−T ∇x(c) Hti ti αti + ∇x(c) αti ti (1)−T Hti (2)−T αti + ∇x(c) αti Hti T T ti . [sent-741, score-1.515]
86 (l)−T ∇x(c) Hti ti (l)−T αti + ∇x(c) αti Hti ti T (c) (c) (c) 1 lim (xti+1 − xti ) − fti . [sent-744, score-1.349]
87 (l)−T Hti ti −T ∇ (c) Ht(2) xti i . [sent-748, score-0.812]
88 2 dt→0 dt = −Il×l the final result is expressed as fol- lows lim dt→0 1 ∇ (c) β αt 2dt xti ti i 1 (c) (c) 1 (c) lim (xti+1 − xti ) − fto . [sent-752, score-1.671]
89 The derivative of the $\frac{\lambda}{2}\log|H_{t_i}|$ term with respect to the element $x^{(c_i)}_{t_i}$ of the directly actuated state is: $\partial_{x^{(c_i)}_{t_i}}\left(\frac{\lambda}{2}\log|H_{t_i}|\right) = \frac{\lambda}{2}\,\frac{1}{|H_{t_i}|}\,\partial_{x^{(c_i)}_{t_i}}|H_{t_i}| = \frac{\lambda}{2}\,\mathrm{trace}\left(H^{-1}_{t_i}\,\partial_{x^{(c_i)}_{t_i}} H_{t_i}\right)$. [sent-761, score-1.203]
91 The result is expressed as: $\nabla_{x^{(c)}_{t_i}}\frac{\lambda}{2}\log|H_{t_i}| = \frac{\lambda}{2}\begin{bmatrix} \mathrm{trace}\left(H^{-1}_{t_i}\,\partial_{x^{(c_1)}_{t_i}} H_{t_i}\right) \\ \mathrm{trace}\left(H^{-1}_{t_i}\,\partial_{x^{(c_2)}_{t_i}} H_{t_i}\right) \\ \vdots \\ \mathrm{trace}\left(H^{-1}_{t_i}\,\partial_{x^{(c_l)}_{t_i}} H_{t_i}\right) \end{bmatrix}$, [sent-764, score-1.255]
92 or, in a more compact form: $\nabla_{x^{(c)}_{t_i}}\frac{\lambda}{2}\log|H_{t_i}| = H^{-1}_{t_i} b_{t_i}$, [sent-768, score-0.807]
93 where $b(x_{t_i}) = \lambda H(x_{t_i})\Phi_{t_i}$ and the quantity $\Phi_{t_i} \in \Re^{l \times 1}$ is defined as: $\Phi_{t_i} = \frac{1}{2}\left[\mathrm{trace}\left(H^{-1}_{t_i}\,\partial_{x^{(c_1)}_{t_i}} H_{t_i}\right),\; \mathrm{trace}\left(H^{-1}_{t_i}\,\partial_{x^{(c_2)}_{t_i}} H_{t_i}\right),\; \ldots,\; \mathrm{trace}\left(H^{-1}_{t_i}\,\partial_{x^{(c_l)}_{t_i}} H_{t_i}\right)\right]^T$. (41) [sent-769, score-0.52]
95 Since we have computed all the terms of the derivative of the path cost $\tilde{S}(\tau_i)$, putting all the terms together we have the result expressed as follows: $\lim_{dt \to 0}\nabla_{x^{(c)}_{t_i}}\tilde{S}(\tau_i) = -H^{-1}_{t_i}\left(\lim_{dt \to 0}\left(\frac{x^{(c)}_{t_{i+1}} - x^{(c)}_{t_i}}{dt} - f^{(c)}_{t_i}\right) - b_{t_i}\right)$. [sent-773, score-1.362]
96 By taking into account the fact that $\lim_{dt \to 0}\left(\frac{x^{(c)}_{t_{i+1}} - x^{(c)}_{t_i}}{dt} - f^{(c)}_{t_i}\right) = G^{(c)}_{t_i}\varepsilon_{t_i}$, we get the following final expression: $\lim_{dt \to 0}\nabla_{x^{(c)}_{t_i}}\tilde{S}(\tau_i) = -H^{-1}_{t_i}\left(G^{(c)}_{t_i}\varepsilon_{t_i} - b_{t_i}\right)$. [sent-774, score-1.164]
97 The terms $\varepsilon_{t_i}$ and $b_{t_i}$ are defined via $G^{(c)}_{t_i}\varepsilon_{t_i} = \lim_{dt \to 0}\left(\frac{x^{(c)}_{t_{i+1}} - x^{(c)}_{t_i}}{dt} - f^{(c)}_{t_i}\right)$ and $b(x_{t_i}) = \lambda H(x_{t_i})\Phi_{t_i}$, with $H_{t_i} = G^{(c)}_{t_i} R^{-1} G^{(c)T}_{t_i}$ and $\Phi_{t_i}$ given in (41). [sent-776, score-1.013]
98 The terms $R^{-1}$ and $G^{(c)}_{t_i}$ can be pushed inside the integral since they are independent of $\tau_i = (x_1, x_2, \ldots)$. [sent-779, score-0.422]
99 Thus we have the expression: $u_{t_i} = \lim_{dt \to 0}\int \tilde{p}(\tau_i)\, u^{(dt)}_L(\tau_i)\, d\tau_i$, where the local controls $u^{(dt)}_L(\tau_i)$ are given as follows: $u^{(dt)}_L(\tau_i) = -R^{-1} G^{(c)T}_{t_i}\, \lim_{dt \to 0}\nabla_{x^{(c)}_{t_i}}\tilde{S}(\tau_i) = R^{-1} G^{(c)T}_{t_i} H^{-1}_{t_i}\left(G^{(c)}_{t_i}\varepsilon_{t_i} - b_{t_i}\right)$. [sent-783, score-0.823]
100 An introduction to stochastic control theory, path integrals and reinforcement learning. [sent-898, score-0.409]
wordName wordTfidf (topN-words)
[('gti', 0.457), ('xti', 0.454), ('ti', 0.358), ('dt', 0.22), ('hti', 0.2), ('ht', 0.16), ('path', 0.134), ('gtt', 0.13), ('uti', 0.119), ('gt', 0.113), ('policy', 0.104), ('xt', 0.103), ('control', 0.098), ('chaal', 0.096), ('fti', 0.096), ('heodorou', 0.096), ('ontrol', 0.096), ('uchli', 0.096), ('trajectory', 0.095), ('robot', 0.092), ('bti', 0.091), ('peters', 0.087), ('actuated', 0.087), ('tn', 0.087), ('reinforcement', 0.086), ('lim', 0.083), ('einforcement', 0.082), ('ntegral', 0.082), ('ul', 0.078), ('motor', 0.071), ('ft', 0.07), ('pproach', 0.068), ('qt', 0.066), ('integral', 0.064), ('kappen', 0.064), ('eneralized', 0.063), ('controls', 0.061), ('movement', 0.057), ('stochastic', 0.052), ('dynamics', 0.052), ('broek', 0.048), ('schaal', 0.047), ('cost', 0.044), ('dof', 0.041), ('uncontrolled', 0.041), ('differential', 0.04), ('state', 0.04), ('dmp', 0.039), ('hjb', 0.039), ('integrals', 0.039), ('rl', 0.039), ('policies', 0.038), ('transition', 0.037), ('ati', 0.037), ('reinforce', 0.037), ('parameterized', 0.035), ('primitives', 0.035), ('rti', 0.035), ('todorov', 0.035), ('reward', 0.034), ('trace', 0.033), ('earning', 0.033), ('tt', 0.033), ('command', 0.032), ('dog', 0.03), ('mti', 0.03), ('pde', 0.03), ('rt', 0.029), ('equations', 0.028), ('jump', 0.028), ('exp', 0.026), ('koeber', 0.026), ('limdt', 0.026), ('dynamical', 0.025), ('system', 0.024), ('formalism', 0.023), ('ut', 0.023), ('xtn', 0.022), ('update', 0.022), ('optimal', 0.022), ('buchli', 0.022), ('dmps', 0.022), ('enac', 0.022), ('gpomdp', 0.022), ('xvt', 0.022), ('qd', 0.022), ('noise', 0.021), ('trajectories', 0.021), ('horizon', 0.02), ('robots', 0.02), ('exploration', 0.02), ('mt', 0.02), ('yt', 0.019), ('expressed', 0.019), ('trials', 0.018), ('sutton', 0.018), ('zt', 0.018), ('controlled', 0.017), ('dofs', 0.017), ('nac', 0.017)]
simIndex simValue paperId paperTitle
same-paper 1 0.99999928 4 jmlr-2010-A Generalized Path Integral Control Approach to Reinforcement Learning
Author: Evangelos Theodorou, Jonas Buchli, Stefan Schaal
Abstract: With the goal to generate more scalable algorithms with higher efficiency and fewer open parameters, reinforcement learning (RL) has recently moved towards combining classical techniques from optimal control and dynamic programming with modern learning techniques from statistical estimation theory. In this vein, this paper suggests to use the framework of stochastic optimal control with path integrals to derive a novel approach to RL with parameterized policies. While solidly grounded in value function estimation and optimal control based on the stochastic Hamilton-JacobiBellman (HJB) equations, policy improvements can be transformed into an approximation problem of a path integral which has no open algorithmic parameters other than the exploration noise. The resulting algorithm can be conceived of as model-based, semi-model-based, or even model free, depending on how the learning problem is structured. The update equations have no danger of numerical instabilities as neither matrix inversions nor gradient learning rates are required. Our new algorithm demonstrates interesting similarities with previous RL research in the framework of probability matching and provides intuition why the slightly heuristically motivated probability matching approach can actually perform well. Empirical evaluations demonstrate significant performance improvements over gradient-based policy learning and scalability to high-dimensional control problems. Finally, a learning experiment on a simulated 12 degree-of-freedom robot dog illustrates the functionality of our algorithm in a complex robot learning scenario. We believe that Policy Improvement with Path Integrals (PI2 ) offers currently one of the most efficient, numerically robust, and easy to implement algorithms for RL based on trajectory roll-outs. Keywords: stochastic optimal control, reinforcement learning, parameterized policies
2 0.12627336 87 jmlr-2010-Online Learning for Matrix Factorization and Sparse Coding
Author: Julien Mairal, Francis Bach, Jean Ponce, Guillermo Sapiro
Abstract: Sparse coding—that is, modelling data vectors as sparse linear combinations of basis elements—is widely used in machine learning, neuroscience, signal processing, and statistics. This paper focuses on the large-scale matrix factorization problem that consists of learning the basis set in order to adapt it to specific data. Variations of this problem include dictionary learning in signal processing, non-negative matrix factorization and sparse principal component analysis. In this paper, we propose to address these tasks with a new online optimization algorithm, based on stochastic approximations, which scales up gracefully to large data sets with millions of training samples, and extends naturally to various matrix factorization formulations, making it suitable for a wide range of learning problems. A proof of convergence is presented, along with experiments with natural images and genomic data demonstrating that it leads to state-of-the-art performance in terms of speed and optimization for both small and large data sets. Keywords: basis pursuit, dictionary learning, matrix factorization, online learning, sparse coding, sparse principal component analysis, stochastic approximations, stochastic optimization, nonnegative matrix factorization
3 0.10406546 51 jmlr-2010-Importance Sampling for Continuous Time Bayesian Networks
Author: Yu Fan, Jing Xu, Christian R. Shelton
Abstract: A continuous time Bayesian network (CTBN) uses a structured representation to describe a dynamic system with a finite number of states which evolves in continuous time. Exact inference in a CTBN is often intractable as the state space of the dynamic system grows exponentially with the number of variables. In this paper, we first present an approximate inference algorithm based on importance sampling. We then extend it to continuous-time particle filtering and smoothing algorithms. These three algorithms can estimate the expectation of any function of a trajectory, conditioned on any evidence set constraining the values of subsets of the variables over subsets of the time line. We present experimental results on both synthetic networks and a network learned from a real data set on people’s life history events. We show the accuracy as well as the time efficiency of our algorithms, and compare them to other approximate algorithms: expectation propagation and Gibbs sampling. Keywords: continuous time Bayesian networks, importance sampling, approximate inference, filtering, smoothing
4 0.079895809 75 jmlr-2010-Mean Field Variational Approximation for Continuous-Time Bayesian Networks
Author: Ido Cohn, Tal El-Hay, Nir Friedman, Raz Kupferman
Abstract: Continuous-time Bayesian networks is a natural structured representation language for multicomponent stochastic processes that evolve continuously over time. Despite the compact representation provided by this language, inference in such models is intractable even in relatively simple structured networks. We introduce a mean field variational approximation in which we use a product of inhomogeneous Markov processes to approximate a joint distribution over trajectories. This variational approach leads to a globally consistent distribution, which can be efficiently queried. Additionally, it provides a lower bound on the probability of observations, thus making it attractive for learning tasks. Here we describe the theoretical foundations for the approximation, an efficient implementation that exploits the wide range of highly optimized ordinary differential equations (ODE) solvers, experimentally explore characterizations of processes for which this approximation is suitable, and show applications to a large-scale real-world inference problem. Keywords: continuous time Markov processes, continuous time Bayesian networks, variational approximations, mean field approximation
5 0.076992311 79 jmlr-2010-Near-optimal Regret Bounds for Reinforcement Learning
Author: Thomas Jaksch, Ronald Ortner, Peter Auer
Abstract: For undiscounted reinforcement learning in Markov decision processes (MDPs) we consider the total regret of a learning algorithm with respect to an optimal policy. In order to describe the transition structure of an MDP we propose a new parameter: An MDP has diameter D if for any pair of states s, s′ there is a policy which moves from s to s′ in at most D steps (on average). We present a reinforcement learning algorithm with total regret $\tilde{O}(DS\sqrt{AT})$ after T steps for any unknown MDP with S states, A actions per state, and diameter D. A corresponding lower bound of $\Omega(\sqrt{DSAT})$ on the total regret of any learning algorithm is given as well. These results are complemented by a sample complexity bound on the number of suboptimal steps taken by our algorithm. This bound can be used to achieve a (gap-dependent) regret bound that is logarithmic in T. Finally, we also consider a setting where the MDP is allowed to change a fixed number of ℓ times. We present a modification of our algorithm that is able to deal with this setting and show a regret bound of $\tilde{O}(\ell^{1/3} T^{2/3} DS\sqrt{A})$. Keywords: undiscounted reinforcement learning, Markov decision process, regret, online learning, sample complexity
6 0.073750786 103 jmlr-2010-Sparse Semi-supervised Learning Using Conjugate Functions
7 0.071168616 2 jmlr-2010-A Convergent Online Single Time Scale Actor Critic Algorithm
8 0.067876503 52 jmlr-2010-Incremental Sigmoid Belief Networks for Grammar Learning
9 0.05603952 31 jmlr-2010-Dual Averaging Methods for Regularized Stochastic Learning and Online Optimization
10 0.040251039 57 jmlr-2010-Iterative Scaling and Coordinate Descent Methods for Maximum Entropy Models
11 0.038892698 64 jmlr-2010-Learning Non-Stationary Dynamic Bayesian Networks
12 0.037990101 76 jmlr-2010-Message-passing for Graph-structured Linear Programs: Proximal Methods and Rounding Schemes
13 0.037566043 5 jmlr-2010-A Quasi-Newton Approach to Nonsmooth Convex Optimization Problems in Machine Learning
14 0.035966899 93 jmlr-2010-PyBrain
15 0.035392825 37 jmlr-2010-Evolving Static Representations for Task Transfer
16 0.035059836 80 jmlr-2010-On-Line Sequential Bin Packing
17 0.034438401 97 jmlr-2010-Regret Bounds and Minimax Policies under Partial Monitoring
18 0.033862036 82 jmlr-2010-On Learning with Integral Operators
19 0.032310683 7 jmlr-2010-A Streaming Parallel Decision Tree Algorithm
20 0.031892441 50 jmlr-2010-Image Denoising with Kernels Based on Natural Image Relations
topicId topicWeight
[(0, -0.148), (1, -0.043), (2, -0.034), (3, -0.194), (4, 0.087), (5, -0.026), (6, -0.032), (7, 0.007), (8, 0.144), (9, -0.058), (10, 0.126), (11, 0.073), (12, 0.081), (13, -0.04), (14, 0.082), (15, 0.135), (16, 0.089), (17, 0.101), (18, -0.167), (19, -0.089), (20, -0.092), (21, 0.141), (22, -0.176), (23, -0.144), (24, 0.06), (25, -0.015), (26, -0.062), (27, -0.148), (28, 0.054), (29, -0.037), (30, -0.224), (31, 0.222), (32, 0.136), (33, 0.06), (34, -0.083), (35, -0.065), (36, -0.036), (37, -0.093), (38, 0.153), (39, 0.113), (40, 0.03), (41, -0.112), (42, -0.02), (43, 0.007), (44, 0.1), (45, -0.019), (46, 0.027), (47, -0.063), (48, -0.041), (49, 0.024)]
simIndex simValue paperId paperTitle
same-paper 1 0.98005652 4 jmlr-2010-A Generalized Path Integral Control Approach to Reinforcement Learning
Author: Evangelos Theodorou, Jonas Buchli, Stefan Schaal
Abstract: With the goal to generate more scalable algorithms with higher efficiency and fewer open parameters, reinforcement learning (RL) has recently moved towards combining classical techniques from optimal control and dynamic programming with modern learning techniques from statistical estimation theory. In this vein, this paper suggests to use the framework of stochastic optimal control with path integrals to derive a novel approach to RL with parameterized policies. While solidly grounded in value function estimation and optimal control based on the stochastic Hamilton-JacobiBellman (HJB) equations, policy improvements can be transformed into an approximation problem of a path integral which has no open algorithmic parameters other than the exploration noise. The resulting algorithm can be conceived of as model-based, semi-model-based, or even model free, depending on how the learning problem is structured. The update equations have no danger of numerical instabilities as neither matrix inversions nor gradient learning rates are required. Our new algorithm demonstrates interesting similarities with previous RL research in the framework of probability matching and provides intuition why the slightly heuristically motivated probability matching approach can actually perform well. Empirical evaluations demonstrate significant performance improvements over gradient-based policy learning and scalability to high-dimensional control problems. Finally, a learning experiment on a simulated 12 degree-of-freedom robot dog illustrates the functionality of our algorithm in a complex robot learning scenario. We believe that Policy Improvement with Path Integrals (PI2 ) offers currently one of the most efficient, numerically robust, and easy to implement algorithms for RL based on trajectory roll-outs. Keywords: stochastic optimal control, reinforcement learning, parameterized policies
2 0.61100852 87 jmlr-2010-Online Learning for Matrix Factorization and Sparse Coding
Author: Julien Mairal, Francis Bach, Jean Ponce, Guillermo Sapiro
Abstract: Sparse coding—that is, modelling data vectors as sparse linear combinations of basis elements—is widely used in machine learning, neuroscience, signal processing, and statistics. This paper focuses on the large-scale matrix factorization problem that consists of learning the basis set in order to adapt it to specific data. Variations of this problem include dictionary learning in signal processing, non-negative matrix factorization and sparse principal component analysis. In this paper, we propose to address these tasks with a new online optimization algorithm, based on stochastic approximations, which scales up gracefully to large data sets with millions of training samples, and extends naturally to various matrix factorization formulations, making it suitable for a wide range of learning problems. A proof of convergence is presented, along with experiments with natural images and genomic data demonstrating that it leads to state-of-the-art performance in terms of speed and optimization for both small and large data sets. Keywords: basis pursuit, dictionary learning, matrix factorization, online learning, sparse coding, sparse principal component analysis, stochastic approximations, stochastic optimization, nonnegative matrix factorization
3 0.50693601 2 jmlr-2010-A Convergent Online Single Time Scale Actor Critic Algorithm
Author: Dotan Di Castro, Ron Meir
Abstract: Actor-Critic based approaches were among the first to address reinforcement learning in a general setting. Recently, these algorithms have gained renewed interest due to their generality, good convergence properties, and possible biological relevance. In this paper, we introduce an online temporal difference based actor-critic algorithm which is proved to converge to a neighborhood of a local maximum of the average reward. Linear function approximation is used by the critic in order estimate the value function, and the temporal difference signal, which is passed from the critic to the actor. The main distinguishing feature of the present convergence proof is that both the actor and the critic operate on a similar time scale, while in most current convergence proofs they are required to have very different time scales in order to converge. Moreover, the same temporal difference signal is used to update the parameters of both the actor and the critic. A limitation of the proposed approach, compared to results available for two time scale convergence, is that convergence is guaranteed only to a neighborhood of an optimal value, rather to an optimal value itself. The single time scale and identical temporal difference signal used by the actor and the critic, may provide a step towards constructing more biologically realistic models of reinforcement learning in the brain. Keywords: actor critic, single time scale convergence, temporal difference
4 0.32314876 52 jmlr-2010-Incremental Sigmoid Belief Networks for Grammar Learning
Author: James Henderson, Ivan Titov
Abstract: We propose a class of Bayesian networks appropriate for structured prediction problems where the Bayesian network’s model structure is a function of the predicted output structure. These incremental sigmoid belief networks (ISBNs) make decoding possible because inference with partial output structures does not require summing over the unboundedly many compatible model structures, due to their directed edges and incrementally specified model structure. ISBNs are specifically targeted at challenging structured prediction problems such as natural language parsing, where learning the domain’s complex statistical dependencies benefits from large numbers of latent variables. While exact inference in ISBNs with large numbers of latent variables is not tractable, we propose two efficient approximations. First, we demonstrate that a previous neural network parsing model can be viewed as a coarse mean-field approximation to inference with ISBNs. We then derive a more accurate but still tractable variational approximation, which proves effective in artificial experiments. We compare the effectiveness of these models on a benchmark natural language parsing task, where they achieve accuracy competitive with the state-of-the-art. The model which is a closer approximation to an ISBN has better parsing accuracy, suggesting that ISBNs are an appropriate abstract model of natural language grammar learning. Keywords: Bayesian networks, dynamic Bayesian networks, grammar learning, natural language parsing, neural networks
5 0.30299598 37 jmlr-2010-Evolving Static Representations for Task Transfer
Author: Phillip Verbancsics, Kenneth O. Stanley
Abstract: An important goal for machine learning is to transfer knowledge between tasks. For example, learning to play RoboCup Keepaway should contribute to learning the full game of RoboCup soccer. Previous approaches to transfer in Keepaway have focused on transforming the original representation to fit the new task. In contrast, this paper explores the idea that transfer is most effective if the representation is designed to be the same even across different tasks. To demonstrate this point, a bird’s eye view (BEV) representation is introduced that can represent different tasks on the same two-dimensional map. For example, both the 3 vs. 2 and 4 vs. 3 Keepaway tasks can be represented on the same BEV. Yet the problem is that a raw two-dimensional map is high-dimensional and unstructured. This paper shows how this problem is addressed naturally by an idea from evolutionary computation called indirect encoding, which compresses the representation by exploiting its geometry. The result is that the BEV learns a Keepaway policy that transfers without further learning or manipulation. It also facilitates transferring knowledge learned in a different domain, Knight Joust, into Keepaway. Finally, the indirect encoding of the BEV means that its geometry can be changed without altering the solution. Thus static representations facilitate several kinds of transfer.
6 0.30295309 75 jmlr-2010-Mean Field Variational Approximation for Continuous-Time Bayesian Networks
7 0.27663061 103 jmlr-2010-Sparse Semi-supervised Learning Using Conjugate Functions
8 0.24678503 51 jmlr-2010-Importance Sampling for Continuous Time Bayesian Networks
9 0.22436275 57 jmlr-2010-Iterative Scaling and Coordinate Descent Methods for Maximum Entropy Models
10 0.2089003 79 jmlr-2010-Near-optimal Regret Bounds for Reinforcement Learning
11 0.2057969 80 jmlr-2010-On-Line Sequential Bin Packing
12 0.20111512 50 jmlr-2010-Image Denoising with Kernels Based on Natural Image Relations
13 0.19079512 93 jmlr-2010-PyBrain
14 0.19012122 34 jmlr-2010-Erratum: SGDQN is Less Careful than Expected
15 0.18997729 63 jmlr-2010-Learning Instance-Specific Predictive Models
16 0.16139904 31 jmlr-2010-Dual Averaging Methods for Regularized Stochastic Learning and Online Optimization
17 0.16097689 64 jmlr-2010-Learning Non-Stationary Dynamic Bayesian Networks
18 0.16020508 30 jmlr-2010-Dimensionality Estimation, Manifold Learning and Function Approximation using Tensor Voting
19 0.15610561 77 jmlr-2010-Model-based Boosting 2.0
20 0.14957859 32 jmlr-2010-Efficient Algorithms for Conditional Independence Inference
topicId topicWeight
[(3, 0.018), (8, 0.011), (15, 0.027), (21, 0.011), (27, 0.452), (32, 0.044), (36, 0.069), (37, 0.043), (38, 0.01), (75, 0.119), (81, 0.017), (85, 0.037), (96, 0.011), (97, 0.014)]
simIndex simValue paperId paperTitle
same-paper 1 0.70308709 4 jmlr-2010-A Generalized Path Integral Control Approach to Reinforcement Learning
Author: Evangelos Theodorou, Jonas Buchli, Stefan Schaal
Abstract: With the goal to generate more scalable algorithms with higher efficiency and fewer open parameters, reinforcement learning (RL) has recently moved towards combining classical techniques from optimal control and dynamic programming with modern learning techniques from statistical estimation theory. In this vein, this paper suggests to use the framework of stochastic optimal control with path integrals to derive a novel approach to RL with parameterized policies. While solidly grounded in value function estimation and optimal control based on the stochastic Hamilton-JacobiBellman (HJB) equations, policy improvements can be transformed into an approximation problem of a path integral which has no open algorithmic parameters other than the exploration noise. The resulting algorithm can be conceived of as model-based, semi-model-based, or even model free, depending on how the learning problem is structured. The update equations have no danger of numerical instabilities as neither matrix inversions nor gradient learning rates are required. Our new algorithm demonstrates interesting similarities with previous RL research in the framework of probability matching and provides intuition why the slightly heuristically motivated probability matching approach can actually perform well. Empirical evaluations demonstrate significant performance improvements over gradient-based policy learning and scalability to high-dimensional control problems. Finally, a learning experiment on a simulated 12 degree-of-freedom robot dog illustrates the functionality of our algorithm in a complex robot learning scenario. We believe that Policy Improvement with Path Integrals (PI2 ) offers currently one of the most efficient, numerically robust, and easy to implement algorithms for RL based on trajectory roll-outs. Keywords: stochastic optimal control, reinforcement learning, parameterized policies
2 0.3106336 51 jmlr-2010-Importance Sampling for Continuous Time Bayesian Networks
Author: Yu Fan, Jing Xu, Christian R. Shelton
Abstract: A continuous time Bayesian network (CTBN) uses a structured representation to describe a dynamic system with a finite number of states which evolves in continuous time. Exact inference in a CTBN is often intractable as the state space of the dynamic system grows exponentially with the number of variables. In this paper, we first present an approximate inference algorithm based on importance sampling. We then extend it to continuous-time particle filtering and smoothing algorithms. These three algorithms can estimate the expectation of any function of a trajectory, conditioned on any evidence set constraining the values of subsets of the variables over subsets of the time line. We present experimental results on both synthetic networks and a network learned from a real data set on people’s life history events. We show the accuracy as well as the time efficiency of our algorithms, and compare them to other approximate algorithms: expectation propagation and Gibbs sampling. Keywords: continuous time Bayesian networks, importance sampling, approximate inference, filtering, smoothing
3 0.31019977 89 jmlr-2010-PAC-Bayesian Analysis of Co-clustering and Beyond
Author: Yevgeny Seldin, Naftali Tishby
Abstract: We derive PAC-Bayesian generalization bounds for supervised and unsupervised learning models based on clustering, such as co-clustering, matrix tri-factorization, graphical models, graph clustering, and pairwise clustering.1 We begin with the analysis of co-clustering, which is a widely used approach to the analysis of data matrices. We distinguish among two tasks in matrix data analysis: discriminative prediction of the missing entries in data matrices and estimation of the joint probability distribution of row and column variables in co-occurrence matrices. We derive PAC-Bayesian generalization bounds for the expected out-of-sample performance of co-clustering-based solutions for these two tasks. The analysis yields regularization terms that were absent in the previous formulations of co-clustering. The bounds suggest that the expected performance of co-clustering is governed by a trade-off between its empirical performance and the mutual information preserved by the cluster variables on row and column IDs. We derive an iterative projection algorithm for finding a local optimum of this trade-off for discriminative prediction tasks. This algorithm achieved stateof-the-art performance in the MovieLens collaborative filtering task. Our co-clustering model can also be seen as matrix tri-factorization and the results provide generalization bounds, regularization terms, and new algorithms for this form of matrix factorization. The analysis of co-clustering is extended to tree-shaped graphical models, which can be used to analyze high dimensional tensors. According to the bounds, the generalization abilities of treeshaped graphical models depend on a trade-off between their empirical data fit and the mutual information that is propagated up the tree levels. We also formulate weighted graph clustering as a prediction problem: given a subset of edge weights we analyze the ability of graph clustering to predict the remaining edge weights. The analysis of co-clustering easily
4 0.3087368 75 jmlr-2010-Mean Field Variational Approximation for Continuous-Time Bayesian Networks
Author: Ido Cohn, Tal El-Hay, Nir Friedman, Raz Kupferman
Abstract: Continuous-time Bayesian networks is a natural structured representation language for multicomponent stochastic processes that evolve continuously over time. Despite the compact representation provided by this language, inference in such models is intractable even in relatively simple structured networks. We introduce a mean field variational approximation in which we use a product of inhomogeneous Markov processes to approximate a joint distribution over trajectories. This variational approach leads to a globally consistent distribution, which can be efficiently queried. Additionally, it provides a lower bound on the probability of observations, thus making it attractive for learning tasks. Here we describe the theoretical foundations for the approximation, an efficient implementation that exploits the wide range of highly optimized ordinary differential equations (ODE) solvers, experimentally explore characterizations of processes for which this approximation is suitable, and show applications to a large-scale real-world inference problem. Keywords: continuous time Markov processes, continuous time Bayesian networks, variational approximations, mean field approximation
5 0.30747738 46 jmlr-2010-High Dimensional Inverse Covariance Matrix Estimation via Linear Programming
Author: Ming Yuan
Abstract: This paper considers the problem of estimating a high dimensional inverse covariance matrix that can be well approximated by “sparse” matrices. Taking advantage of the connection between multivariate linear regression and entries of the inverse covariance matrix, we propose an estimating procedure that can effectively exploit such “sparsity”. The proposed method can be computed using linear programming and therefore has the potential to be used in very high dimensional problems. Oracle inequalities are established for the estimation error in terms of several operator norms, showing that the method is adaptive to different types of sparsity of the problem. Keywords: covariance selection, Dantzig selector, Gaussian graphical model, inverse covariance matrix, Lasso, linear programming, oracle inequality, sparsity
6 0.30562922 63 jmlr-2010-Learning Instance-Specific Predictive Models
7 0.30493134 17 jmlr-2010-Bayesian Learning in Sparse Graphical Factor Models via Variational Mean-Field Annealing
8 0.30425495 103 jmlr-2010-Sparse Semi-supervised Learning Using Conjugate Functions
9 0.30185121 74 jmlr-2010-Maximum Relative Margin and Data-Dependent Regularization
10 0.30139178 111 jmlr-2010-Topology Selection in Graphical Models of Autoregressive Processes
11 0.30028465 87 jmlr-2010-Online Learning for Matrix Factorization and Sparse Coding
12 0.29993883 114 jmlr-2010-Unsupervised Supervised Learning I: Estimating Classification and Regression Errors without Labels
13 0.29903817 69 jmlr-2010-Lp-Nested Symmetric Distributions
14 0.29828122 66 jmlr-2010-Linear Algorithms for Online Multitask Classification
15 0.29777709 2 jmlr-2010-A Convergent Online Single Time Scale Actor Critic Algorithm
16 0.29729256 105 jmlr-2010-Spectral Regularization Algorithms for Learning Large Incomplete Matrices
17 0.29719177 31 jmlr-2010-Dual Averaging Methods for Regularized Stochastic Learning and Online Optimization
18 0.29672971 56 jmlr-2010-Introduction to Causal Inference
19 0.29665709 109 jmlr-2010-Stochastic Composite Likelihood
20 0.29661638 78 jmlr-2010-Model Selection: Beyond the Bayesian Frequentist Divide