nips nips2007 nips2007-163 knowledge-graph by maker-knowledge-mining

163 nips-2007-Receding Horizon Differential Dynamic Programming


Source: pdf

Author: Yuval Tassa, Tom Erez, William D. Smart

Abstract: The control of high-dimensional, continuous, non-linear dynamical systems is a key problem in reinforcement learning and control. Local, trajectory-based methods, using techniques such as Differential Dynamic Programming (DDP), are not directly subject to the curse of dimensionality, but generate only local controllers. In this paper, we introduce Receding Horizon DDP (RH-DDP), an extension to the classic DDP algorithm, which allows us to construct stable and robust controllers based on a library of local-control trajectories. We demonstrate the effectiveness of our approach on a series of high-dimensional problems using a simulated multi-link swimming robot. These experiments show that our approach effectively circumvents dimensionality issues, and is capable of dealing with problems of (at least) 24 state and 9 action dimensions. 1

Reference: text


Summary: the most important sentences generated by the tfidf model

sentIndex sentText sentNum sentScore

1 Receding Horizon Differential Dynamic Programming Yuval Tassa, Tom Erez & Bill Smart Abstract The control of high-dimensional, continuous, non-linear dynamical systems is a key problem in reinforcement learning and control. [sent-1, score-0.219]

2 Local, trajectory-based methods, using techniques such as Differential Dynamic Programming (DDP), are not directly subject to the curse of dimensionality, but generate only local controllers. [sent-2, score-0.074]

3 In this paper, we introduce Receding Horizon DDP (RH-DDP), an extension to the classic DDP algorithm, which allows us to construct stable and robust controllers based on a library of local-control trajectories. [sent-3, score-0.327]

4 We demonstrate the effectiveness of our approach on a series of high-dimensional problems using a simulated multi-link swimming robot. [sent-4, score-0.191]

5 These experiments show that our approach effectively circumvents dimensionality issues, and is capable of dealing with problems of (at least) 24 state and 9 action dimensions. [sent-5, score-0.076]

6 1 Introduction We are interested in learning controllers for high-dimensional, highly non-linear dynamical systems, continuous in state, action, and time. [sent-6, score-0.391]

7 Local methods do not model the value function or policy over the entire state space; instead, they focus computational effort along likely trajectories. [sent-8, score-0.107]

8 Featuring algorithmic complexity polynomial in the dimension, local methods are not directly affected by dimensionality issues in the way space-filling methods are. [sent-9, score-0.111]

9 In this paper, we introduce Receding Horizon DDP (RH-DDP), a set of modifications to the classic DDP algorithm, which allows us to construct stable and robust controllers based on local-control trajectories in highly non-linear, high-dimensional domains. [sent-10, score-0.385]

10 We aggregate several such trajectories into a library of locally-optimal linear controllers which we then select from, using a nearest-neighbor rule. [sent-12, score-0.429]

11 We define a reward function which represents a global measure of performance relative to a high-level objective, such as swimming towards a target. [sent-15, score-0.334]

12 Rather than a reward based on distance from a given desired configuration, a notion which has its roots in the control community’s definition of the problem, this global reward dispenses with a “path planning” component and requires the controller to solve the entire problem. [sent-16, score-0.442]

13 We demonstrate the utility of our approach by learning controllers for a high-dimensional simulation of a planar, multi-link swimming robot. [sent-17, score-0.474]

14 The swimmer is a model of an actuated chain of links in a viscous medium, with two location and velocity coordinate pairs, and an angle and angular velocity for each link. [sent-18, score-0.547]

15 The controller must determine the applied torque, one action dimension for each joint. [sent-19, score-0.085]

16 We reward controllers that cause the swimmer to swim to a target, brake on approach and come to a stop over it. [sent-27, score-0.622]

17 We synthesize controllers for several swimmers, with state dimensions ranging from 10 to 24. [sent-28, score-0.353]

18 The controllers are shown to exhibit complex locomotive behaviour in response to real-time simulated interaction with a user-controlled target. [sent-29, score-0.315]

19 1 Related work Optimal control of continuous non-linear dynamical systems is a central research goal of the RL community. [sent-31, score-0.216]

20 Methods designed to alleviate the curse of dimensionality include adaptive discretizations of the state space [1], and various domain-specific manipulations [2] which reduce the effective dimensionality. [sent-33, score-0.076]

21 Although DDP is used there, it is considered an aid to the global approximator, and the local controllers are constant rather than locally-linear. [sent-35, score-0.394]

22 In [4] the idea of using the second order local DDP models to make locally-linear controllers is introduced. [sent-37, score-0.357]

23 In [6] a minimax variant of DDP is used to learn a controller for bipedal walking, again by designing a reference trajectory and rewarding the walker for tracking it. [sent-39, score-0.303]

24 In [7], trajectory-based methods including DDP are examined as possible models for biological nervous systems. [sent-40, score-0.087]

25 The best known work regarding the swimming domain is that by Ijspeert and colleagues (e.g. [sent-42, score-0.191]

26 While the inherently stable domain of swimming allows for such open-loop control schemes, articulated complex behaviours such as turning and tracking necessitate full feedback control which CPGs do not provide. [sent-45, score-0.499]

27 1 Definition of the problem We consider the discrete-time dynamics x^{k+1} = F(x^k, u^k) with states x ∈ R^n and actions u ∈ R^m. [sent-47, score-0.196]

28 In this context we assume F(x^k, u^k) = x^k + ∫_0^{∆t} f(x(t), u^k) dt for a continuous f and a small ∆t, approximating the continuous problem and identifying with it in the ∆t → 0 limit. [sent-48, score-0.675]

29 Given some scalar reward function r(x, u) and a fixed initial state x^1 (superscripts indicating the time index), we wish to find the policy which maximizes the total reward acquired over a finite temporal horizon: π*(x^k, k) = argmax_{π(·,·)} [ Σ_{i=k}^{N} r(x^i, π(x^i, i)) ]. [sent-49, score-0.213]

30 This is a manifestation of the dynamic programming principle. [sent-52, score-0.118]
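
The value-function recursion this refers to is not among the extracted sentences; in the notation above, the standard finite-horizon form it alludes to is V(x^k, k) = max_u [ r(x^k, u) + V(F(x^k, u), k+1) ], solved backwards from the final time step N.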

31 2 DDP Differential Dynamic Programming [12, 13] is an iterative improvement scheme which finds a locally-optimal trajectory emanating from a fixed starting point x^1. [sent-55, score-0.21]

32 A quadratic approximation to the time-dependent value function is constructed along the current trajectory {x^k}_{k=1}^N, which is formed by iterative application of F using the current control sequence {u^k}_{k=1}^N. [sent-57, score-0.255]

33 In the backward sweep, we proceed backwards in time to generate local models of V in the following manner. [sent-59, score-0.106]

34 Thus, equations (4), (5) and (6) iterate in the backward sweep, computing a local model of the value function along with a modification to the policy in the form of an open-loop term −Q_uu^{-1} Q_u and a feedback term −Q_uu^{-1} Q_ux δx, essentially solving a local linear-quadratic problem in each step. [sent-63, score-0.369]
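
Equations (4)-(6) themselves were not extracted; as a reconstruction of the standard second-order DDP backward recursion they correspond to (primes denoting the value model at step k+1, subscripts denoting differentiation), the quadratic model of Q and the resulting value update are:

    Q_x  = r_x  + F_x^T V'_x
    Q_u  = r_u  + F_u^T V'_x
    Q_xx = r_xx + F_x^T V'_xx F_x + V'_x · F_xx
    Q_uu = r_uu + F_u^T V'_xx F_u + V'_x · F_uu
    Q_ux = r_ux + F_u^T V'_xx F_x + V'_x · F_ux
    V_x  = Q_x − Q_xu Q_uu^{-1} Q_u
    V_xx = Q_xx − Q_xu Q_uu^{-1} Q_ux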

35 In some senses, DDP can be viewed as dual to the Extended Kalman Filter (though employing a higher order expansion of F). [sent-64, score-0.089]

36 In the forward sweep of the DDP iteration, both the open-loop and feedback terms are combined to create a new control sequence (û^k)_{k=1}^N which results in a new nominal trajectory (x̂^k)_{k=1}^N. [sent-65, score-0.326]

37 x̂^1 = x^1 (7a); û^k = u^k − Q_uu^{-1} Q_u − Q_uu^{-1} Q_ux (x̂^k − x^k) (7b); x̂^{k+1} = F(x̂^k, û^k) (7c). We note that in practice the inversion in (5) must be conditioned. [sent-66, score-0.178]
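
A minimal sketch of one DDP iteration built from the recursions above, in Python with numpy; the dynamics F, its first derivatives, and the reward derivatives are assumed to be supplied by the caller, the terminal value model is taken as zero, the second-order dynamics terms (V'_x · F_xx etc.) are omitted for brevity, and a simple regularizer stands in for the conditioning of the Q_uu inversion mentioned above.

    import numpy as np

    def ddp_iteration(x1, U, F, f_derivs, r_derivs, reg=1e-6):
        """One DDP sweep for reward maximization: the backward pass builds local
        linear-quadratic models along the current trajectory, the forward pass
        rolls out the improved controls according to (7a)-(7c).

        F(x, u)        -> next state
        f_derivs(x, u) -> Fx (n x n), Fu (n x m)
        r_derivs(x, u) -> rx, ru, rxx, ruu, rux
        """
        N = len(U)
        X = [x1]
        for k in range(N):                       # roll out current controls
            X.append(F(X[k], U[k]))

        n, m = x1.shape[0], U[0].shape[0]
        Vx, Vxx = np.zeros(n), np.zeros((n, n))  # terminal value model (zero here)
        k_open, K_fb = [None] * N, [None] * N

        for k in reversed(range(N)):             # backward sweep
            Fx, Fu = f_derivs(X[k], U[k])
            rx, ru, rxx, ruu, rux = r_derivs(X[k], U[k])
            Qx = rx + Fx.T @ Vx
            Qu = ru + Fu.T @ Vx
            Qxx = rxx + Fx.T @ Vxx @ Fx
            Quu = ruu + Fu.T @ Vxx @ Fu - reg * np.eye(m)  # push toward negative
            Qux = rux + Fu.T @ Vxx @ Fx                    # definiteness (maximization)
            Quu_inv = np.linalg.inv(Quu)
            k_open[k] = -Quu_inv @ Qu            # open-loop term
            K_fb[k] = -Quu_inv @ Qux             # feedback gain
            Vx = Qx + Qux.T @ k_open[k]
            Vxx = Qxx + Qux.T @ K_fb[k]

        X_new, U_new = [x1], []                  # forward sweep, eqs. (7a)-(7c)
        for k in range(N):
            du = k_open[k] + K_fb[k] @ (X_new[k] - X[k])
            U_new.append(U[k] + du)
            X_new.append(F(X_new[k], U_new[k]))
        return X_new, U_new, k_open, K_fb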

38 In practice, the greater part of the computational effort is devoted to the measurement of the dynamical quantities in (4) or in the propagation of collocation vectors as described below. [sent-72, score-0.176]

39 DDP is a second order algorithm with convergence properties similar to, or better than, Newton's method performed on the full control sequence (u^1, …, u^N) with an exact Nm × Nm Hessian [16]. [sent-73, score-0.196]

40 Instead of using (4), we fit this quadratic model to samples of the value function at a cloud of collocation vectors {x_i^k, u_i^k}. [sent-79, score-0.427]

41 In addition, this method allows us to more easily apply general coordinate transformations in order to represent V in some internal space, perhaps of lower dimension. [sent-88, score-0.074]

42 Because we can choose {x_i^k, u_i^k} in a way which makes the linear system sparse, we can enjoy the γ2 < γ1 of sparse methods and, at least for the experiments performed here, increase the running time only by a small factor. [sent-90, score-0.196]

43 In the same manner that DDP is dually reminiscent of the Extended Kalman Filter, this method bears a resemblance to the test vectors propagated in the Unscented Kalman Filter [18], although we use a quadratic, rather than linear number of collocation vectors. [sent-91, score-0.143]
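
A sketch of the collocation idea as described here: sample a cloud of (x, u) perturbations around the nominal pair, evaluate r(x, u) + V'(F(x, u)) at each, and recover the quadratic Q model by least squares instead of differentiating F and r. The Gaussian sampling pattern, the sample count, and the next-step value model V_next are illustrative placeholders, not the paper's exact choices.

    import numpy as np

    def fit_q_model(x0, u0, F, r, V_next, scale=1e-2, rng=None):
        """Fit a local quadratic model of Q(x, u) = r(x, u) + V'(F(x, u)) from
        sampled collocation vectors around (x0, u0). Returns (q0, g, H) such
        that Q(x0 + dx, u0 + du) ~ q0 + g.z + 0.5 * z.H.z with z = [dx; du]."""
        rng = np.random.default_rng() if rng is None else rng
        n, m = x0.shape[0], u0.shape[0]
        d = n + m
        n_samples = 2 * (1 + d + d * (d + 1) // 2)   # twice the coefficient count

        rows, vals = [], []
        for _ in range(n_samples):
            dz = scale * rng.standard_normal(d)      # perturbation of [x; u]
            x, u = x0 + dz[:n], u0 + dz[n:]
            vals.append(r(x, u) + V_next(F(x, u)))   # sampled Q value
            quad = np.outer(dz, dz)[np.triu_indices(d)]
            rows.append(np.concatenate(([1.0], dz, quad)))
        coeffs, *_ = np.linalg.lstsq(np.asarray(rows), np.asarray(vals), rcond=None)

        q0, g = coeffs[0], coeffs[1:1 + d]
        H = np.zeros((d, d))
        H[np.triu_indices(d)] = coeffs[1 + d:]
        H = H + H.T                                  # 0.5*z.H.z reproduces the fit
        return q0, g, H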

44 3 Receding Horizon DDP When seeking to synthesize a global controller from many local controllers, it is essential that the different local components operate synergistically. [sent-93, score-0.301]

45 In our context this means that local models of the value function must all model the same function, which is not the case for the standard DDP solution. [sent-94, score-0.074]

46 The local quadratic models which DDP computes around the trajectory are approximations to V (x, k), the time-dependent value function. [sent-95, score-0.283]

47 Having computed a DDP solution to some problem starting from many different starting points x^1, we can discard all the models computed for points x^{k>1} and save only the ones around the x^1's. [sent-98, score-0.091]

48 Because this control sequence is very close to the optimal solution, the second-order convergence of DDP is in full effect and the algorithm converges in 1 or 2 sweeps. [sent-104, score-0.108]
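
A sketch of the receding-horizon procedure this describes, reusing the ddp_iteration sketch above: solve the N-step problem, keep only the local model around the first state, advance one step along the solution, shift the controls as a warm start, and re-solve with a small number of sweeps. The library entry layout and the repeat-last-control heuristic are assumptions.

    def receding_horizon_ddp(x1, U_init, F, f_derivs, r_derivs,
                             n_steps=100, n_sweeps=2):
        """Build a library of locally-linear controllers along a receding-horizon
        solution; each entry stores (first state, open-loop control, feedback gain)."""
        library = []
        x, U = x1, list(U_init)
        for _ in range(n_steps):
            for _ in range(n_sweeps):        # 1-2 sweeps suffice with a warm start
                X, U, k_open, K_fb = ddp_iteration(x, U, F, f_derivs, r_derivs)
            library.append((X[0], U[0], K_fb[0]))   # keep only the first model
            x = X[1]                         # advance one step along the solution
            U = U[1:] + [U[-1]]              # shifted warm start; repeat last control
        return library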

49 4 Nearest Neighbor control with Trajectory Library A run of DDP computes a locally quadratic model of V and a locally linear model of u, expressed by the gain term −Q_uu^{-1} Q_ux. [sent-108, score-0.17]

50 This term generalizes the open-loop policy to a tube around the trajectory, inside of which a basin-of-attraction is formed. [sent-109, score-0.157]

51 Having lost the dependency on the time k with the receding-horizon scheme, we need some space-based method of determining which local gain model we select at a given state. [sent-110, score-0.074]

52 A possible solution to this problem is to fill some volume of the state space with a library of local-control trajectories [20], and consider all of them when selecting the nearest linear gain model. [sent-114, score-0.185]
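
A sketch of the nearest-neighbor rule over such a library, using the entries stored by the receding-horizon sketch above; plain Euclidean distance in state space is assumed here, not necessarily the paper's metric.

    import numpy as np

    def nearest_neighbor_control(library, x):
        """Pick the nearest stored state and apply its locally-linear policy
        u = u_i + K_i (x - x_i)."""
        states = np.array([entry[0] for entry in library])
        i = int(np.argmin(np.sum((states - x) ** 2, axis=1)))
        x_i, u_i, K_i = library[i]
        return u_i + K_i @ (x - x_i)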

53 1 Experiments The swimmer dynamical system We describe a variation of the d-link swimmer dynamical system [21]. [sent-116, score-0.606]

54 The swimmer is modeled as a chain of d such links of lengths l_i and masses m_i, its configuration described by the generalized coordinates q = (x_cm, θ), of two center-of-mass coordinates and d angles. [sent-118, score-0.537]

55 Letting \bar{x}_i = x_i − x_cm be the positions of the link centers WRT the center of mass, the Lagrangian is L = \tfrac{1}{2} \dot{x}_{cm}^2 \sum_i m_i + \tfrac{1}{2} \sum_i m_i \dot{\bar{x}}_i^2 + \tfrac{1}{2} \sum_i I_i \dot{\theta}_i^2, with I_i = \tfrac{1}{12} m_i l_i^2 the moments-of-inertia. [sent-119, score-0.339]

56 (a) three snapshots of the receding horizon trajectory (dotted) with the current finite-horizon optimal trajectory (solid) appended, for two state dimensions. [sent-123, score-0.546]

57 (b) Projections of the same receding-horizon trajectories onto the largest three eigenvectors of the full state covariance matrix. [sent-124, score-0.141]

58 The linear regime of the reward, here applied to a 3-swimmer, compels the RH trajectories to a steady swimming gait – a limit cycle. [sent-126, score-0.327]

59 The function F = −\tfrac{1}{2} k_n \sum_i [ l_i (\dot{\bar{x}}_i \cdot \hat{n}_i)^2 + \tfrac{1}{12} l_i^3 \dot{\theta}_i^2 ] − \tfrac{1}{2} k_t \sum_i l_i (\dot{\bar{x}}_i \cdot \hat{t}_i)^2, known as the dissipation function, is that function whose derivatives WRT the \dot{q}_i's provide the postulated frictional forces. [sent-128, score-0.33]

60 With these in place, we can obtain \ddot{q} from the 2+d Euler-Lagrange equations: \frac{d}{dt}(\frac{\partial}{\partial \dot{q}_i} L) = \frac{\partial}{\partial \dot{q}_i} F + u, with u being the external forces and torques applied to the system. [sent-129, score-0.106]

61 By applying d − 1 torques τ_j in action-reaction pairs at the joints, u_i = τ_i − τ_{i−1}, the isolated nature of the dynamical system is preserved. [sent-130, score-0.144]
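
A small sketch of the action-reaction torque mapping just described, u_i = τ_i − τ_{i−1} with τ_0 = τ_d = 0; the sign convention is an assumption.

    import numpy as np

    def joint_torques_to_angle_forces(tau):
        """Map the d-1 joint torques to the d generalized forces on the link
        angles; the forces sum to zero, so no net external torque is applied
        and the system stays isolated."""
        tau = np.asarray(tau)                        # shape (d-1,)
        padded = np.concatenate(([0.0], tau, [0.0]))
        return padded[1:] - padded[:-1]              # shape (d,), sums to zero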

62 2 Internal coordinates The two coordinates specifying the position of the center-of-mass and the d angles are defined relative to an external coordinate system, which the controller should not have access to. [sent-133, score-0.264]

63 We make a coordinate transformation into internal coordinates, where only the d − 1 relative angles {\hat{θ}_j = θ_{j+1} − θ_j}_{j=1}^{d−1} are given, and the location of the target is given relative to a coordinate system fixed on one of the links. [sent-134, score-0.19]

64 The collocation method allows us to perform this transformation directly on the vector cloud without having to explicitly differentiate it, as we would have had to using classical DDP. [sent-136, score-0.169]

65 Note also that this transformation reduces the dimension of the state (one angle less), suggesting the possibility of further dimensionality reduction. [sent-137, score-0.076]
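
A sketch of such an internal-coordinate map: replace the d absolute angles by the d − 1 relative angles and express the target in a frame fixed on one link. Which link carries the body frame and the exact state layout are illustrative assumptions.

    import numpy as np

    def to_internal_coordinates(target_world, theta, body_link=0):
        """Return [target in body frame, relative joint angles]; one angle fewer
        than the external representation, and no absolute position or heading.
        target_world is the 2-vector from the designated body point to the
        target, expressed in world coordinates."""
        theta = np.asarray(theta)
        rel_angles = np.diff(theta)                  # theta_{j+1} - theta_j
        a = theta[body_link]
        R = np.array([[np.cos(a), np.sin(a)],        # world -> body-frame rotation
                      [-np.sin(a), np.cos(a)]])
        return np.concatenate((R @ np.asarray(target_world), rel_angles))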

66 3 The reward function The reward function we used was r(x, u) = −c_x ||x_nose||^2 / \sqrt{||x_nose||^2 + 1} − c_u ||u||^2 (8), where x_nose = [x_1 x_2]^T is the 2-vector from some designated point on the swimmer's body to the target (the origin in internal space), and c_x and c_u are positive constants. [sent-139, score-0.526]

67 This reward is maximized when the nose is brought to rest on the target under a quadratic action-cost penalty. [sent-140, score-0.274]

68 It should not be confused with the desired state reward of classical optimal control since values are specified only for 2 out of the 2d + 4 coordinates. [sent-141, score-0.253]

69 The functional form of the target-reward term is designed to be linear in ||xnose || when far from the target and quadratic when close to it (Figure 2(b)). [sent-142, score-0.105]

70 (b) The functional form of the planar reward component r(x_nose) = −||x_nose||^2 / \sqrt{||x_nose||^2 + 1}. [sent-144, score-0.106]
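
A direct transcription of the reward in (8); the values of c_x and c_u are placeholders.

    import numpy as np

    def reward(x_nose, u, c_x=1.0, c_u=0.01):
        """Eq. (8): roughly -c_x*||x_nose|| far from the target (steady swimming),
        roughly -c_x*||x_nose||^2 near it (smooth braking), minus an action cost."""
        d2 = float(np.dot(x_nose, x_nose))
        return -c_x * d2 / np.sqrt(d2 + 1.0) - c_u * float(np.dot(u, u))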

71 This form translates into a steady swimming gait at large distances with a smooth braking and stopping at the goal. [sent-145, score-0.289]

72 Therefore, in the linear regime of the reward function, the solution is independent of the distance from the target, and all the trajectories are quickly compelled to converge to a one-dimensional manifold in state-space which describes steady-state swimming (Figure 1(b)). [sent-148, score-0.399]

73 Upon nearing the target, the swimmer must initiate a braking maneuver, and bring the nose to a standstill over the target. [sent-149, score-0.36]

74 For targets that are near the swimmer, the behaviour must also include various turns and jerks, quite different from steady-state swimming, which maneuver the nose into contact with the target. [sent-150, score-0.137]

75 Our experience during interaction with the controller, as detailed below, leads us to believe that the behavioral variety that would be exhibited by a hypothetical exact optimal controller for this system would be extremely large. [sent-151, score-0.085]

76 4 Results In order to assess the controllers we constructed a real-time interaction package. [sent-152, score-0.283]

77 By dragging the target with a cursor, a user can interact with controlled swimmers of 3 to 10 links with a state dimension varying from 10 to 24, respectively. [sent-153, score-0.222]

78 Even with controllers composed of a single trajectory, the swimmers perform quite well, turning, tracking and braking on approach to the target. [sent-154, score-0.461]

79 All of the controllers in the package control swimmers with unit link lengths and unit masses. [sent-155, score-0.476]

80 The function F computes a single 4th-order Runge-Kutta integration step of the continuous dynamics F(x^k, u^k) = x^k + ∫_t^{t+∆t} f(x^k, u^k) dt, with ∆t = 0. [sent-157, score-0.43]
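
A sketch of the named integration step; f is the continuous swimmer dynamics and dt the integration step size, both assumed to be given.

    def rk4_step(f, x, u, dt):
        """Single 4th-order Runge-Kutta step of dx/dt = f(x, u), holding u fixed
        over the step, as used for the discrete dynamics F."""
        k1 = f(x, u)
        k2 = f(x + 0.5 * dt * k1, u)
        k3 = f(x + 0.5 * dt * k2, u)
        k4 = f(x + dt * k3, u)
        return x + (dt / 6.0) * (k1 + 2 * k2 + 2 * k3 + k4)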

81 The receding horizon window was 40 time-steps, or 2 seconds. [sent-159, score-0.213]

82 Because nonlinear viscosity effects are not modeled and the local controllers are also linear, exponentially diverging torques and angular velocities can be produced. [sent-162, score-0.531]

83 Because DDP is a local optimization method, it is bound to stop in a local minimum. [sent-165, score-0.148]

84 An extension of this claim is that even if the solutions are optimal, this has to do with the swimmer domain itself, which might be inherently convex in some sense and therefore an “easy” problem. [sent-166, score-0.233]

85 Similarly, local minima exist even in the motor behaviour of the most complex organisms, famously evidenced by Fosbury’s reinvention of the high jump. [sent-170, score-0.195]

86 Regarding the easiness or difficulty of the swimmer problem – we made the documented code available and hope that it might serve as a useful benchmark for other algorithms. [sent-171, score-0.233]

87 5 Conclusions The significance of this work lies in its outlining of a new kind of tradeoff in nonlinear motor control design. [sent-172, score-0.197]

88 If biological realism is an accepted design goal, and physical and biological constraints are taken into account, then the expectations we have from our controllers can be more relaxed than those of the control engineer. [sent-173, score-0.455]

89 The unavoidable eventual failure of any specific biological organism makes the design of truly robust controllers a futile endeavor, in effect putting more weight on the mode, rather than the tail of the behavioral distribution. [sent-174, score-0.315]

90 We actually constrain angular velocities since limiting torque would require a stiffer integrator, but theoretical non-divergence is fully guaranteed by the viscous dissipation, which enforces a Lyapunov function on the entire system once torques are limited. [sent-180, score-0.33]

91 Since we make use of biologically grounded arguments, we briefly outline the possible implications of this work for biological nervous systems. [sent-181, score-0.087]

92 It is commonly acknowledged, due both to theoretical arguments and empirical findings, that some form of dimensionality reduction must be at work in neural control mechanisms. [sent-182, score-0.145]

93 A common object in models which attempt to describe this reduction is the motor primitive, a hypothesized atomic motor program which is combined with other such programs in a small “alphabet”, to produce complex behaviors in a given context. [sent-183, score-0.178]

94 Our controllers imply a different reduction: a set of complex prototypical motor programs, each of which is near-optimal only in a small volume of the state-space, yet in that space describes the entire complexity of the solution. [sent-184, score-0.372]

95 Giving the simplest building blocks of the model such a high degree of task specificity or context would imply a very large number of these motor prototypes in a real nervous system, an order of magnitude analogous, in our linguistic metaphor, to that of words and concepts. [sent-185, score-0.144]

96 Using local trajectory optimizers to speed up global optimization in dynamic programming. [sent-201, score-0.336]

97 Non-parametric representation of a policies and value functions: A trajectory based approach. [sent-207, score-0.147]

98 Minimax differential dynamic programming: An application to robust biped walking. [sent-221, score-0.166]

99 A second order gradient method for determining optimal trajectories for non-linear discretetime systems. [sent-249, score-0.102]

100 Advantages of differential dynamic programming over Newton's method for discrete-time optimal control problems. [sent-274, score-0.314]


similar papers computed by tfidf model

tfidf for this paper:

wordName wordTfidf (topN-words)

[('ddp', 0.517), ('controllers', 0.283), ('swimmer', 0.233), ('xk', 0.207), ('uk', 0.196), ('swimming', 0.191), ('qux', 0.148), ('xnose', 0.148), ('trajectory', 0.147), ('receding', 0.129), ('qu', 0.126), ('quu', 0.111), ('control', 0.108), ('reward', 0.106), ('collocation', 0.106), ('trajectories', 0.102), ('vx', 0.101), ('uu', 0.089), ('motor', 0.089), ('differential', 0.088), ('swimmers', 0.085), ('controller', 0.085), ('horizon', 0.084), ('dynamic', 0.078), ('local', 0.074), ('torques', 0.074), ('vxx', 0.074), ('dynamical', 0.07), ('policy', 0.068), ('fx', 0.067), ('braking', 0.064), ('erez', 0.064), ('frictional', 0.064), ('qxu', 0.064), ('viscous', 0.064), ('cloud', 0.063), ('nose', 0.063), ('fu', 0.063), ('angular', 0.063), ('quadratic', 0.062), ('nervous', 0.055), ('torque', 0.055), ('links', 0.055), ('li', 0.054), ('coordinates', 0.053), ('kn', 0.048), ('kalman', 0.047), ('mi', 0.047), ('velocity', 0.045), ('library', 0.044), ('target', 0.043), ('coordinate', 0.042), ('bipedal', 0.042), ('fxx', 0.042), ('maneuver', 0.042), ('qxx', 0.042), ('tassa', 0.042), ('xcm', 0.042), ('reinforcement', 0.041), ('kt', 0.041), ('programming', 0.04), ('sweep', 0.039), ('state', 0.039), ('continuous', 0.038), ('walking', 0.038), ('dimensionality', 0.037), ('appended', 0.037), ('qx', 0.037), ('icra', 0.037), ('reminiscent', 0.037), ('atkeson', 0.037), ('dissipation', 0.037), ('velocities', 0.037), ('global', 0.037), ('filter', 0.034), ('xi', 0.034), ('gait', 0.034), ('robots', 0.034), ('scheme', 0.033), ('qi', 0.032), ('biological', 0.032), ('internal', 0.032), ('behaviour', 0.032), ('backward', 0.032), ('feedback', 0.032), ('synthesize', 0.031), ('cx', 0.031), ('liao', 0.031), ('articulated', 0.031), ('save', 0.031), ('smart', 0.031), ('angles', 0.031), ('guration', 0.03), ('starting', 0.03), ('organisms', 0.03), ('cu', 0.03), ('helicopter', 0.03), ('qk', 0.03), ('tracking', 0.029), ('wrt', 0.028)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.99999994 163 nips-2007-Receding Horizon Differential Dynamic Programming

Author: Yuval Tassa, Tom Erez, William D. Smart

Abstract: The control of high-dimensional, continuous, non-linear dynamical systems is a key problem in reinforcement learning and control. Local, trajectory-based methods, using techniques such as Differential Dynamic Programming (DDP), are not directly subject to the curse of dimensionality, but generate only local controllers. In this paper, we introduce Receding Horizon DDP (RH-DDP), an extension to the classic DDP algorithm, which allows us to construct stable and robust controllers based on a library of local-control trajectories. We demonstrate the effectiveness of our approach on a series of high-dimensional problems using a simulated multi-link swimming robot. These experiments show that our approach effectively circumvents dimensionality issues, and is capable of dealing with problems of (at least) 24 state and 9 action dimensions. 1

2 0.23800966 162 nips-2007-Random Sampling of States in Dynamic Programming

Author: Chris Atkeson, Benjamin Stephens

Abstract: We combine three threads of research on approximate dynamic programming: sparse random sampling of states, value function and policy approximation using local models, and using local trajectory optimizers to globally optimize a policy and associated value function. Our focus is on finding steady state policies for deterministic time invariant discrete time control problems with continuous states and actions often found in robotics. In this paper we show that we can now solve problems we couldn’t solve previously. 1

3 0.12929671 34 nips-2007-Bayesian Policy Learning with Trans-Dimensional MCMC

Author: Matthew Hoffman, Arnaud Doucet, Nando D. Freitas, Ajay Jasra

Abstract: A recently proposed formulation of the stochastic planning and control problem as one of parameter estimation for suitable artificial statistical models has led to the adoption of inference algorithms for this notoriously hard problem. At the algorithmic level, the focus has been on developing Expectation-Maximization (EM) algorithms. In this paper, we begin by making the crucial observation that the stochastic control problem can be reinterpreted as one of trans-dimensional inference. With this new interpretation, we are able to propose a novel reversible jump Markov chain Monte Carlo (MCMC) algorithm that is more efficient than its EM counterparts. Moreover, it enables us to implement full Bayesian policy search, without the need for gradients and with one single Markov chain. The new approach involves sampling directly from a distribution that is proportional to the reward and, consequently, performs better than classic simulation methods in situations where the reward is a rare event.

4 0.12547769 100 nips-2007-Hippocampal Contributions to Control: The Third Way

Author: Máté Lengyel, Peter Dayan

Abstract: Recent experimental studies have focused on the specialization of different neural structures for different types of instrumental behavior. Recent theoretical work has provided normative accounts for why there should be more than one control system, and how the output of different controllers can be integrated. Two particular controllers have been identified, one associated with a forward model and the prefrontal cortex and a second associated with computationally simpler, habitual, actor-critic methods and part of the striatum. We argue here for the normative appropriateness of an additional, but so far marginalized control system, associated with episodic memory, and involving the hippocampus and medial temporal cortices. We analyze in depth a class of simple environments to show that episodic control should be useful in a range of cases characterized by complexity and inferential noise, and most particularly at the very early stages of learning, long before habitization has set in. We interpret data on the transfer of control from the hippocampus to the striatum in the light of this hypothesis. 1

5 0.088333048 98 nips-2007-Hierarchical Apprenticeship Learning with Application to Quadruped Locomotion

Author: J. Z. Kolter, Pieter Abbeel, Andrew Y. Ng

Abstract: We consider apprenticeship learning—learning from expert demonstrations—in the setting of large, complex domains. Past work in apprenticeship learning requires that the expert demonstrate complete trajectories through the domain. However, in many problems even an expert has difficulty controlling the system, which makes this approach infeasible. For example, consider the task of teaching a quadruped robot to navigate over extreme terrain; demonstrating an optimal policy (i.e., an optimal set of foot locations over the entire terrain) is a highly non-trivial task, even for an expert. In this paper we propose a method for hierarchical apprenticeship learning, which allows the algorithm to accept isolated advice at different hierarchical levels of the control task. This type of advice is often feasible for experts to give, even if the expert is unable to demonstrate complete trajectories. This allows us to extend the apprenticeship learning paradigm to much larger, more challenging domains. In particular, in this paper we apply the hierarchical apprenticeship learning algorithm to the task of quadruped locomotion over extreme terrain, and achieve, to the best of our knowledge, results superior to any previously published work. 1

6 0.086652867 91 nips-2007-Fitted Q-iteration in continuous action-space MDPs

7 0.083120376 213 nips-2007-Variational Inference for Diffusion Processes

8 0.080510981 148 nips-2007-Online Linear Regression and Its Application to Model-Based Reinforcement Learning

9 0.079799861 5 nips-2007-A Game-Theoretic Approach to Apprenticeship Learning

10 0.075783573 168 nips-2007-Reinforcement Learning in Continuous Action Spaces through Sequential Monte Carlo Methods

11 0.069269039 102 nips-2007-Incremental Natural Actor-Critic Algorithms

12 0.068961419 17 nips-2007-A neural network implementing optimal state estimation based on dynamic spike train decoding

13 0.063713185 124 nips-2007-Managing Power Consumption and Performance of Computing Systems Using Reinforcement Learning

14 0.058737852 4 nips-2007-A Constraint Generation Approach to Learning Stable Linear Dynamical Systems

15 0.056691498 185 nips-2007-Stable Dual Dynamic Programming

16 0.055081684 86 nips-2007-Exponential Family Predictive Representations of State

17 0.052109174 30 nips-2007-Bayes-Adaptive POMDPs

18 0.049317561 214 nips-2007-Variational inference for Markov jump processes

19 0.04891599 178 nips-2007-Simulated Annealing: Rigorous finite-time guarantees for optimization on continuous domains

20 0.04688992 127 nips-2007-Measuring Neural Synchrony by Message Passing


similar papers computed by lsi model

lsi for this paper:

topicId topicWeight

[(0, -0.182), (1, -0.144), (2, 0.049), (3, -0.035), (4, -0.137), (5, 0.077), (6, 0.021), (7, 0.005), (8, -0.037), (9, 0.026), (10, -0.063), (11, -0.01), (12, 0.002), (13, 0.011), (14, 0.072), (15, 0.063), (16, -0.102), (17, 0.032), (18, 0.008), (19, -0.099), (20, 0.077), (21, 0.014), (22, -0.091), (23, 0.033), (24, -0.037), (25, -0.137), (26, 0.068), (27, 0.053), (28, -0.054), (29, -0.049), (30, 0.074), (31, -0.147), (32, 0.091), (33, 0.036), (34, -0.107), (35, 0.005), (36, 0.007), (37, 0.018), (38, 0.047), (39, 0.15), (40, -0.088), (41, -0.008), (42, -0.074), (43, 0.028), (44, -0.018), (45, 0.02), (46, 0.23), (47, -0.028), (48, -0.067), (49, -0.128)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.95477927 163 nips-2007-Receding Horizon Differential Dynamic Programming

Author: Yuval Tassa, Tom Erez, William D. Smart

Abstract: The control of high-dimensional, continuous, non-linear dynamical systems is a key problem in reinforcement learning and control. Local, trajectory-based methods, using techniques such as Differential Dynamic Programming (DDP), are not directly subject to the curse of dimensionality, but generate only local controllers. In this paper, we introduce Receding Horizon DDP (RH-DDP), an extension to the classic DDP algorithm, which allows us to construct stable and robust controllers based on a library of local-control trajectories. We demonstrate the effectiveness of our approach on a series of high-dimensional problems using a simulated multi-link swimming robot. These experiments show that our approach effectively circumvents dimensionality issues, and is capable of dealing with problems of (at least) 24 state and 9 action dimensions. 1

2 0.80010676 162 nips-2007-Random Sampling of States in Dynamic Programming

Author: Chris Atkeson, Benjamin Stephens

Abstract: We combine three threads of research on approximate dynamic programming: sparse random sampling of states, value function and policy approximation using local models, and using local trajectory optimizers to globally optimize a policy and associated value function. Our focus is on finding steady state policies for deterministic time invariant discrete time control problems with continuous states and actions often found in robotics. In this paper we show that we can now solve problems we couldn’t solve previously. 1

3 0.58859164 100 nips-2007-Hippocampal Contributions to Control: The Third Way

Author: Máté Lengyel, Peter Dayan

Abstract: Recent experimental studies have focused on the specialization of different neural structures for different types of instrumental behavior. Recent theoretical work has provided normative accounts for why there should be more than one control system, and how the output of different controllers can be integrated. Two particular controllers have been identified, one associated with a forward model and the prefrontal cortex and a second associated with computationally simpler, habitual, actor-critic methods and part of the striatum. We argue here for the normative appropriateness of an additional, but so far marginalized control system, associated with episodic memory, and involving the hippocampus and medial temporal cortices. We analyze in depth a class of simple environments to show that episodic control should be useful in a range of cases characterized by complexity and inferential noise, and most particularly at the very early stages of learning, long before habitization has set in. We interpret data on the transfer of control from the hippocampus to the striatum in the light of this hypothesis. 1

4 0.46503195 34 nips-2007-Bayesian Policy Learning with Trans-Dimensional MCMC

Author: Matthew Hoffman, Arnaud Doucet, Nando D. Freitas, Ajay Jasra

Abstract: A recently proposed formulation of the stochastic planning and control problem as one of parameter estimation for suitable artificial statistical models has led to the adoption of inference algorithms for this notoriously hard problem. At the algorithmic level, the focus has been on developing Expectation-Maximization (EM) algorithms. In this paper, we begin by making the crucial observation that the stochastic control problem can be reinterpreted as one of trans-dimensional inference. With this new interpretation, we are able to propose a novel reversible jump Markov chain Monte Carlo (MCMC) algorithm that is more efficient than its EM counterparts. Moreover, it enables us to implement full Bayesian policy search, without the need for gradients and with one single Markov chain. The new approach involves sampling directly from a distribution that is proportional to the reward and, consequently, performs better than classic simulation methods in situations where the reward is a rare event.

5 0.46304923 98 nips-2007-Hierarchical Apprenticeship Learning with Application to Quadruped Locomotion

Author: J. Z. Kolter, Pieter Abbeel, Andrew Y. Ng

Abstract: We consider apprenticeship learning—learning from expert demonstrations—in the setting of large, complex domains. Past work in apprenticeship learning requires that the expert demonstrate complete trajectories through the domain. However, in many problems even an expert has difficulty controlling the system, which makes this approach infeasible. For example, consider the task of teaching a quadruped robot to navigate over extreme terrain; demonstrating an optimal policy (i.e., an optimal set of foot locations over the entire terrain) is a highly non-trivial task, even for an expert. In this paper we propose a method for hierarchical apprenticeship learning, which allows the algorithm to accept isolated advice at different hierarchical levels of the control task. This type of advice is often feasible for experts to give, even if the expert is unable to demonstrate complete trajectories. This allows us to extend the apprenticeship learning paradigm to much larger, more challenging domains. In particular, in this paper we apply the hierarchical apprenticeship learning algorithm to the task of quadruped locomotion over extreme terrain, and achieve, to the best of our knowledge, results superior to any previously published work. 1

6 0.45759326 178 nips-2007-Simulated Annealing: Rigorous finite-time guarantees for optimization on continuous domains

7 0.45679796 28 nips-2007-Augmented Functional Time Series Representation and Forecasting with Gaussian Processes

8 0.43330988 52 nips-2007-Competition Adds Complexity

9 0.41956854 4 nips-2007-A Constraint Generation Approach to Learning Stable Linear Dynamical Systems

10 0.41294926 214 nips-2007-Variational inference for Markov jump processes

11 0.4099007 124 nips-2007-Managing Power Consumption and Performance of Computing Systems Using Reinforcement Learning

12 0.40608394 5 nips-2007-A Game-Theoretic Approach to Apprenticeship Learning

13 0.40407142 174 nips-2007-Selecting Observations against Adversarial Objectives

14 0.39020902 148 nips-2007-Online Linear Regression and Its Application to Model-Based Reinforcement Learning

15 0.37010786 91 nips-2007-Fitted Q-iteration in continuous action-space MDPs

16 0.3569904 169 nips-2007-Retrieved context and the discovery of semantic structure

17 0.35377929 185 nips-2007-Stable Dual Dynamic Programming

18 0.35077572 168 nips-2007-Reinforcement Learning in Continuous Action Spaces through Sequential Monte Carlo Methods

19 0.3335467 167 nips-2007-Regulator Discovery from Gene Expression Time Series of Malaria Parasites: a Hierachical Approach

20 0.33245793 213 nips-2007-Variational Inference for Diffusion Processes


similar papers computed by lda model

lda for this paper:

topicId topicWeight

[(5, 0.018), (13, 0.033), (16, 0.025), (18, 0.022), (19, 0.012), (21, 0.047), (31, 0.077), (34, 0.034), (35, 0.023), (47, 0.096), (49, 0.016), (53, 0.336), (83, 0.084), (85, 0.019), (87, 0.016), (90, 0.066)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.76085269 163 nips-2007-Receding Horizon Differential Dynamic Programming

Author: Yuval Tassa, Tom Erez, William D. Smart

Abstract: The control of high-dimensional, continuous, non-linear dynamical systems is a key problem in reinforcement learning and control. Local, trajectory-based methods, using techniques such as Differential Dynamic Programming (DDP), are not directly subject to the curse of dimensionality, but generate only local controllers. In this paper, we introduce Receding Horizon DDP (RH-DDP), an extension to the classic DDP algorithm, which allows us to construct stable and robust controllers based on a library of local-control trajectories. We demonstrate the effectiveness of our approach on a series of high-dimensional problems using a simulated multi-link swimming robot. These experiments show that our approach effectively circumvents dimensionality issues, and is capable of dealing with problems of (at least) 24 state and 9 action dimensions. 1

2 0.63884443 4 nips-2007-A Constraint Generation Approach to Learning Stable Linear Dynamical Systems

Author: Byron Boots, Geoffrey J. Gordon, Sajid M. Siddiqi

Abstract: Stability is a desirable characteristic for linear dynamical systems, but it is often ignored by algorithms that learn these systems from data. We propose a novel method for learning stable linear dynamical systems: we formulate an approximation of the problem as a convex program, start with a solution to a relaxed version of the program, and incrementally add constraints to improve stability. Rather than continuing to generate constraints until we reach a feasible solution, we test stability at each step; because the convex program is only an approximation of the desired problem, this early stopping rule can yield a higher-quality solution. We apply our algorithm to the task of learning dynamic textures from image sequences as well as to modeling biosurveillance drug-sales data. The constraint generation approach leads to noticeable improvement in the quality of simulated sequences. We compare our method to those of Lacy and Bernstein [1, 2], with positive results in terms of accuracy, quality of simulated sequences, and efficiency. 1

3 0.44453678 214 nips-2007-Variational inference for Markov jump processes

Author: Manfred Opper, Guido Sanguinetti

Abstract: Markov jump processes play an important role in a large number of application domains. However, realistic systems are analytically intractable and they have traditionally been analysed using simulation based techniques, which do not provide a framework for statistical inference. We propose a mean field approximation to perform posterior inference and parameter estimation. The approximation allows a practical solution to the inference problem, while still retaining a good degree of accuracy. We illustrate our approach on two biologically motivated systems.

4 0.43702739 162 nips-2007-Random Sampling of States in Dynamic Programming

Author: Chris Atkeson, Benjamin Stephens

Abstract: We combine three threads of research on approximate dynamic programming: sparse random sampling of states, value function and policy approximation using local models, and using local trajectory optimizers to globally optimize a policy and associated value function. Our focus is on finding steady state policies for deterministic time invariant discrete time control problems with continuous states and actions often found in robotics. In this paper we show that we can now solve problems we couldn’t solve previously. 1

5 0.43004662 144 nips-2007-On Ranking in Survival Analysis: Bounds on the Concordance Index

Author: Harald Steck, Balaji Krishnapuram, Cary Dehing-oberije, Philippe Lambin, Vikas C. Raykar

Abstract: In this paper, we show that classical survival analysis involving censored data can naturally be cast as a ranking problem. The concordance index (CI), which quantifies the quality of rankings, is the standard performance measure for model assessment in survival analysis. In contrast, the standard approach to learning the popular proportional hazard (PH) model is based on Cox’s partial likelihood. We devise two bounds on CI–one of which emerges directly from the properties of PH models–and optimize them directly. Our experimental results suggest that all three methods perform about equally well, with our new approach giving slightly better results. We also explain why a method designed to maximize the Cox’s partial likelihood also ends up (approximately) maximizing the CI. 1

6 0.42853814 80 nips-2007-Ensemble Clustering using Semidefinite Programming

7 0.42761281 168 nips-2007-Reinforcement Learning in Continuous Action Spaces through Sequential Monte Carlo Methods

8 0.42007065 34 nips-2007-Bayesian Policy Learning with Trans-Dimensional MCMC

9 0.41597259 122 nips-2007-Locality and low-dimensions in the prediction of natural experience from fMRI

10 0.41545379 174 nips-2007-Selecting Observations against Adversarial Objectives

11 0.41467059 63 nips-2007-Convex Relaxations of Latent Variable Training

12 0.41139796 100 nips-2007-Hippocampal Contributions to Control: The Third Way

13 0.41124645 93 nips-2007-GRIFT: A graphical model for inferring visual classification features from human data

14 0.41058803 86 nips-2007-Exponential Family Predictive Representations of State

15 0.40948686 153 nips-2007-People Tracking with the Laplacian Eigenmaps Latent Variable Model

16 0.4090789 18 nips-2007-A probabilistic model for generating realistic lip movements from speech

17 0.40886813 105 nips-2007-Infinite State Bayes-Nets for Structured Domains

18 0.40715864 138 nips-2007-Near-Maximum Entropy Models for Binary Neural Representations of Natural Images

19 0.40595824 43 nips-2007-Catching Change-points with Lasso

20 0.40583137 148 nips-2007-Online Linear Regression and Its Application to Model-Based Reinforcement Learning