nips nips2002 nips2002-155 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Christopher G. Atkeson, Jun Morimoto
Abstract: A longstanding goal of reinforcement learning is to develop nonparametric representations of policies and value functions that support rapid learning without suffering from interference or the curse of dimensionality. We have developed a trajectory-based approach, in which policies and value functions are represented nonparametrically along trajectories. These trajectories, policies, and value functions are updated as the value function becomes more accurate or as a model of the task is updated. We have applied this approach to periodic tasks such as hopping and walking, which required handling discount factors and discontinuities in the task dynamics, and using function approximation to represent value functions at discontinuities. We also describe extensions of the approach to make the policies more robust to modeling error and sensor noise.
Reference: text
sentIndex sentText sentNum sentScore
1 jp Abstract A longstanding goal of reinforcement learning is to develop nonparametric representations of policies and value functions that support rapid learning without suffering from interference or the curse of dimensionality. [sent-6, score-0.411]
2 We have developed a trajectory-based approach, in which policies and value functions are represented nonparametrically along trajectories. [sent-7, score-0.275]
3 We have applied this approach to periodic tasks such as hopping and walking, which required handling discount factors and discontinuities in the task dynamics, and using function approximation to represent value functions at discontinuities. [sent-9, score-0.833]
4 1 Introduction The widespread application of reinforcement learning is hindered by excessive cost in terms of one or more of representational resources, computation time, or amount of training data. [sent-11, score-0.243]
5 We reduce the computation time required by using more powerful updates that update first and second derivatives of value functions and first derivatives of policies, in addition to updating value function and policy values at particular points [3, 4, 5]. [sent-15, score-0.793]
6 We reduce the representational resources needed by representing value functions and policies along carefully chosen trajectories. [sent-16, score-0.413]
7 This paper explores how the approach can be extended to periodic tasks such as hopping and walking. [sent-18, score-0.487]
8 Previous work has explored how to apply an early version of this approach to tasks with an explicit goal state [3, 6] and how to simultaneously learn a model and [sent-19, score-0.262]
9 use this approach to compute a policy and value function [6]. [sent-20, score-0.452]
10 Handling periodic tasks required accommodating discount factors and discontinuities in the task dynamics, and using function approximation to represent value functions at discontinuities. [sent-21, score-0.647]
11 Our first key idea for creating a more global policy is to coordinate many trajectories, similar to using the method of characteristics to solve a partial differential equation. [sent-24, score-0.379]
12 As long as the value functions are consistent between trajectories, and cover the appropriate space, the global value function created will be correct. [sent-26, score-0.27]
13 This representation supports accurate updating, since updates occur along densely represented optimized trajectories, and it provides adaptive resolution by allocating resources to where optimal trajectories tend to go. [sent-27, score-0.682]
14 A second key idea is to segment the trajectories at discontinuities of the system dynamics, to reduce the amount of discontinuity in the value function within each segment, so our extrapolation operations are correct more often. [sent-29, score-0.743]
15 Unfortunately, in periodic tasks such as hopping or walking the dynamics changes discontinuously as feet touch and leave the ground. [sent-31, score-0.828]
16 For periodic tasks we apply our approach along trajectory segments which end whenever a dynamics (or criterion) discontinuity is reached. [sent-33, score-0.831]
17 We also search for value function discontinuities not collocated with dynamics or criterion discontinuities. [sent-34, score-0.349]
18 We can use all the trajectory segments that start at the discontinuity and continue through the next region to provide estimates of the value function at the other side of the discontinuity. [sent-35, score-0.463]
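As a concrete illustration of the segmentation step, here is a minimal sketch (not the authors' code; the `in_contact` interface is an assumption introduced for illustration) of splitting a rollout into trajectory segments wherever the dynamics mode changes, such as a foot touching or leaving the ground:

```python
import numpy as np

def segment_at_discontinuities(states, in_contact):
    """Split a rollout into trajectory segments that end whenever the dynamics
    mode changes (e.g., a foot touches or leaves the ground).

    states: (T, n) array of states; in_contact: length-T sequence of booleans
    giving the contact mode at each step (an assumed interface, for illustration).
    Returns a list of (start, end) index pairs, one per segment.
    """
    segments, start = [], 0
    for t in range(1, len(states)):
        if in_contact[t] != in_contact[t - 1]:   # dynamics discontinuity reached
            segments.append((start, t))
            start = t                            # next segment begins at the event
    segments.append((start, len(states)))
    return segments

# Toy example: a hop with flight (False) and stance (True) phases.
contact = [False] * 3 + [True] * 4 + [False] * 3
print(segment_at_discontinuities(np.zeros((10, 2)), contact))
```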
19 Update first and second derivatives of the value function as well as first derivatives of the policy (control gains for a linear controller) along the trajectory. [sent-38, score-0.641]
20 We can think of this as updating the first few terms of local Taylor series models of the global value and policy functions. [sent-39, score-0.552]
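To make the local Taylor-series representation concrete, the following sketch (names, dimensions, and numbers are illustrative assumptions, not the authors' implementation) stores a quadratic value model and a linear policy at one point of an optimized trajectory and evaluates them near that point:

```python
import numpy as np

class TrajectoryPoint:
    """Local Taylor-series models stored at one point of an optimized trajectory.

    V(x) ~ v0 + vx.(x - x0) + 0.5 (x - x0)^T Vxx (x - x0)   (quadratic value model)
    u(x) ~ u0 + K (x - x0)                                   (linear policy model)
    """
    def __init__(self, x0, u0, v0, vx, Vxx, K):
        self.x0, self.u0 = np.asarray(x0), np.asarray(u0)
        self.v0, self.vx, self.Vxx = v0, np.asarray(vx), np.asarray(Vxx)
        self.K = np.asarray(K)

    def value(self, x):
        dx = np.asarray(x) - self.x0
        return self.v0 + self.vx @ dx + 0.5 * dx @ self.Vxx @ dx

    def action(self, x):
        dx = np.asarray(x) - self.x0
        return self.u0 + self.K @ dx

# Example: one stored point of a 2-state, 1-action system (made-up numbers).
p = TrajectoryPoint(x0=[0.1, 0.0], u0=[0.0], v0=3.2,
                    vx=[0.5, -0.2], Vxx=[[2.0, 0.1], [0.1, 1.5]], K=[[-1.2, -0.4]])
print(p.value([0.15, -0.05]), p.action([0.15, -0.05]))
```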
21 Because we are interested in periodic tasks, we must introduce a discount factor into Bellman’s equation, so value functions remain finite. [sent-42, score-0.384]
22 Consider a system with dynamics x_{k+1} = f(x_k, u_k) and a one step cost function L(x, u), where x is the state of the system and u is a vector of actions or controls. [sent-43, score-0.382]
23 A goal of reinforcement learning and optimal control is to find a policy that minimizes the total cost, which is the sum of the costs for each time step. [sent-47, score-0.663]
24 One way to find such a policy is to construct an optimal value function: the value of this function at a state is the sum of all future costs, given that the system started in that state and followed the optimal policy (chose optimal actions at each time step as a function of the state). [sent-49, score-0.966]
25 A local planner or controller could choose globally optimal actions if it knew the future cost of each action. [sent-50, score-0.458]
26 This cost is simply the sum of the cost of taking the action right now and the discounted future cost of the state that the action leads to, which is given by the value function. [sent-51, score-0.5]
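In standard notation, and consistent with the description above (the symbols are a restatement rather than a quotation of the paper), the discounted Bellman equation being used is:

```latex
% Discounted Bellman equation for the optimal value function (standard form,
% with x the state, u the action, L the one-step cost, f the dynamics,
% and 0 < gamma < 1 the discount factor):
V^*(x) = \min_{u} \Big[ L(x, u) + \gamma \, V^*\big(f(x, u)\big) \Big]
```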
27 Figure 1: Example trajectories where the value function and policy are explicitly represented for a regulator task at goal state G (left), a task with a point goal state G (middle), and a periodic task (right). [sent-62, score-1.62]
28 Unfortunately, in periodic tasks such as hopping or walking the dynamics changes discontinuously as feet touch and leave the ground. [sent-67, score-0.828]
29 For periodic tasks we apply our approach along trajectory segments which end whenever a dynamics (or criterion) discontinuity is reached. [sent-69, score-0.831]
30 We can use all the trajectory segments that start at the discontinuity and continue through the next region to provide estimates of the value function at the other side of the discontinuity. [sent-70, score-0.463]
31 On the left we see that a task that requires steady state control about a goal point (a regulator task) can be solved with a single trivial trajectory that starts and ends at the goal and provides a value function and constant linear policy in the vicinity of the goal. [sent-72, score-1.079]
32 Figure 2: The optimal hopper controller with a range of penalties on control usage. [sent-92, score-0.254]
33 The middle figure of Figure 1 shows the trajectories used to compute the value function for a swing up problem [3]. [sent-96, score-0.522]
34 However, the nonlinearities of the problem limit the region of applicability of a linear policy, and non-trivial trajectories have to be created to cover a larger region. [sent-98, score-0.533]
35 The neighboring trajectories have consistent value functions, and thus the globally optimal value function and policy are found in the explored region [3]. [sent-100, score-1.094]
36 The right figure of Figure 1 shows the trajectories used to compute the value function for a periodic problem, control of vertical hopping in a hopping robot. [sent-101, score-1.155]
37 In this problem, there is no goal state, but a desired hopping height is specified. [sent-102, score-0.336]
38 This problem has been extensively studied in the robotics literature [8] from the point of view of how to manually design a nonlinear controller with a large stability region. [sent-103, score-0.284]
39 We note that optimal control provides a methodology to design nonlinear controllers with large stability regions and also good performance in terms of explicitly specified criteria. [sent-104, score-0.274]
40 In the top two quadrants the robot is in the air, and in the bottom two quadrants the robot is on the ground. [sent-108, score-0.406]
41 Thus, the horizontal axis is a discontinuity of the robot dynamics, and trajectory segments end and often begin at the discontinuity. [sent-109, score-0.517]
42 We see that while the robot is in the air it cannot change how much energy it has (how high it goes or how fast it is going when it hits the ground), as the trajectories end with the same pattern they began with. [sent-110, score-0.615]
43 When the robot is on the ground it thrusts with its leg to “focus” the trajectories so the set of touchdown positions is mapped to a smaller set of takeoff positions. [sent-111, score-0.818]
44 This funneling effect is characteristic of controllers for periodic tasks, and how fast the funnel becomes narrow is controlled by the size of the penalty on control usage (Figure 2). [sent-112, score-0.299]
45 In our approach trajectories are refined towards optimality given their fixed starting points. [sent-115, score-0.454]
46 For regulator tasks, the trajectory is trivial and simply starts and ends at the known goal point. [sent-117, score-0.353]
47 For tasks with a point goal, trajectories can be extended backwards away from the goal [3]. [sent-118, score-0.651]
48 For periodic tasks, crude trajectories must be created using some other approach before this approach can refine them. [sent-119, score-0.645]
49 In learning from demonstration a teacher provides initial trajectories [6]. [sent-122, score-0.412]
50 In policy optimization (aka “policy search”) a parameterized policy is optimized [9]. [sent-123, score-0.806]
51 Once a set of initial task trajectories is available, the following four methods are used to generate trajectories in new parts of state space. [sent-124, score-0.961]
52 We use all of these methods simultaneously, and locally optimize each of the trajectories produced. [sent-125, score-0.412]
53 The best trajectory of the set is then stored and the other trajectories are discarded. [sent-126, score-0.601]
54 1) Use the global policy generated by policy optimization, if available. [sent-127, score-0.758]
55 2) Use the local policy from the nearest point with the same type of dynamics. [sent-128, score-0.466]
56 and 4) Use the policy from the nearest trajectory, where the nearest trajectory is selected at the beginning of the forward sweep and kept the same throughout the sweep. [sent-130, score-0.75]
57 Note that methods 2 and 3 can change which stored trajectories they take points from on each time step, while method 4 uses a policy from a single neighboring trajectory. [sent-131, score-0.791]
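A minimal sketch of this generate-and-select loop follows; the rollout, local optimization, and candidate-policy interfaces are assumptions introduced for illustration, not the authors' code:

```python
import numpy as np

def generate_and_select(x_start, rollout, one_step_cost, candidate_policies, optimize):
    """Sketch of the trajectory-generation step described above.

    Each candidate policy (e.g., the global parametric policy, the local policy
    of the nearest stored point, or the policy of a single nearest trajectory)
    is rolled out from the new start state, each rollout is locally optimized,
    and only the cheapest optimized trajectory is kept.

    rollout(x0, policy) -> (states, actions) and optimize(states, actions) ->
    (states, actions) are assumed helper interfaces.
    """
    best, best_cost = None, np.inf
    for policy in candidate_policies:
        states, actions = rollout(x_start, policy)    # forward sweep with this policy
        states, actions = optimize(states, actions)   # local trajectory optimization
        cost = sum(one_step_cost(x, u) for x, u in zip(states, actions))
        if cost < best_cost:
            best, best_cost = (states, actions), cost
    return best, best_cost                            # store the best, discard the rest
```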
58 3 Control of a walking robot As another example we describe the search for a walking policy for a simple planar biped robot that walks along a bar. [sent-132, score-1.059]
59 The simulated robot has two legs and a torque motor between the legs. [sent-133, score-0.249]
60 Instead of revolute or telescoping knees, the robot can grab the bar with its foot as its leg swings past it. [sent-134, score-0.507]
61 This is a model of a robot that walks along the trusses of a large structure such as a bridge, much as a monkey brachiates with its arms. [sent-135, score-0.249]
62 This simple model has also been used in studies of robot passive dynamic walking [10]. [sent-136, score-0.251]
63 This arrangement means the robot has a five dimensional state space: left leg angle, right leg angle, left leg angular velocity, right leg angular velocity, and stance foot location. [sent-137, score-1.825]
64 A simple policy is used to determine when to grab the bar (at the end of a step when the swing foot passes the bar going downwards). [sent-138, score-0.517]
65 One term of the one step cost provides a measure of how far the left or right leg has gone in the forward or backward direction. [sent-147, score-0.29]
66 Another term is the product of the amounts by which the leg angles are past their limits if the legs are both forward or both rearward, and zero otherwise. [sent-148, score-0.342]
67 Initial trajectories were generated by optimizing the coefficients of a linear policy. [sent-154, score-0.412]
68 When the left leg was in stance, the torque was given by a linear feedback law in the state variables (Equation 7), with the angle between the legs as one of the feedback terms. [sent-155, score-0.293]
69 When the right leg was in stance the same policy was used with the appropriate signs negated. [sent-156, score-0.715]
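The following sketch shows the kind of mirrored linear stance policy this describes; the gains, the bias term, and the exact feature layout are made-up placeholders, and only the structure (a linear feedback law reused for right stance with signs negated) follows the text:

```python
import numpy as np

# Hypothetical feedback gains and bias for illustration only.
coeffs = np.array([-8.0, -2.0, 1.5, 0.3])
bias = 0.1

def hip_torque(leg_angle_l, leg_angle_r, leg_vel_l, leg_vel_r, left_in_stance):
    """Linear stance policy sketch: a feedback law on the angle between the legs,
    its rate, and the stance leg's angle and rate. For right stance, the mirrored
    (sign-negated) torque is returned as a rough stand-in for 'the same policy
    with the appropriate signs negated'.
    """
    phi = leg_angle_l - leg_angle_r        # angle between the legs
    phi_dot = leg_vel_l - leg_vel_r
    stance_angle = leg_angle_l if left_in_stance else leg_angle_r
    stance_vel = leg_vel_l if left_in_stance else leg_vel_r
    features = np.array([phi, phi_dot, stance_angle, stance_vel])
    u = bias + coeffs @ features
    return u if left_in_stance else -u

print(hip_torque(0.2, -0.1, 0.0, 0.5, left_in_stance=True))
```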
70 1 Results The trajectory-based approach was able to find a cheaper and more robust policy than the parametric policy-optimization approach. [sent-158, score-0.5]
71 Cost: For example, after training the parametric policy, we measured the undiscounted cost over 1 second (roughly one step of each leg) starting in a state along the lowest cost cyclic trajectory. [sent-160, score-0.473]
72 The cost for the optimized parametric policy was 4316. [sent-161, score-0.608]
73 Robustness: We did a simple assessment of robustness by adding offsets to the same starting state until the optimized linear policy failed. [sent-163, score-0.677]
74 The offsets were in terms of the stance leg angle and the angle between the legs, and the corresponding angular velocities. [sent-164, score-0.487]
75 The maximum offsets for the linearized optimized parametric policy are , , , and . [sent-165, score-0.547]
76 That the trajectory-based approach extends the range most in these cases is not surprising, since the trajectory-based controller uses the parametric policy as one of the ways to initially generate candidate trajectories for optimization. [sent-168, score-1.021]
77 In cases where the trajectory-based approach is not able to generate an appropriate trajectory, the system will generate a series of trajectories with start points moving from regions it knows how to handle towards the desired start point. [sent-169, score-0.489]
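A minimal sketch of the robustness measurement described above, assuming a hypothetical `simulate_step_ok` predicate that reports whether the closed-loop walker keeps walking from a given start state:

```python
import numpy as np

def max_stable_offset(simulate_step_ok, x_nominal, direction, step=0.01, max_offset=2.0):
    """Grow an offset along one state direction, starting from a state on the
    lowest-cost cyclic trajectory, until the controller fails; return the largest
    offset the policy survived. `simulate_step_ok(x0)` is an assumed interface.
    """
    offset = 0.0
    while offset + step <= max_offset:
        trial = offset + step
        if not simulate_step_ok(np.asarray(x_nominal) + trial * np.asarray(direction)):
            break                      # controller failed at this offset
        offset = trial
    return offset
```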
78 The trajectory approach has the same cost as before, 3502. [sent-174, score-0.302]
79 Another form of robustness is robustness to modeling error (changes in masses, friction, and other model parameters) and imperfect sensing, so that the controller does not know exactly what state the robot is in. [sent-176, score-0.586]
80 Since simulations are used to optimize policies, it is relatively easy to include simulations with different model parameters and sensor noise in the training and optimize for a robust parametric controller in policy shaping. [sent-177, score-0.662]
81 The probabilistic approach supports actions by the controller to actively minimize uncertainty as well as achieve goals, which is known as dual control. [sent-180, score-0.251]
82 In the probabilistic case, the state is augmented with any unknown parameters such as masses of parts or friction coefficients, and the covariance of all the original elements of the state as well as the added parameters. [sent-183, score-0.256]
83 These covariances interact with the curvature of the value function, causing additional cost in areas of the value function that have high curvature (large second derivatives). [sent-187, score-0.303]
84 The system is also rewarded when it learns, which reduces the covariances of the estimates, so the system may choose actions that move away from a goal but reduce uncertainty. [sent-189, score-0.245]
85 This probabilistic approach does dramatically increase the dimensionality of the state vector and thus the value function, but in the context of only a quadratic cost on dimensionality this is not as fatal as it would seem. [sent-190, score-0.274]
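A small sketch of the state augmentation being described, with an assumed layout (original state, then unknown parameters, then the upper triangle of the covariance):

```python
import numpy as np

def augment_state(x, theta, Sigma):
    """Build the augmented planning state: the original state x, the unknown model
    parameters theta (e.g., masses, friction coefficients), and the entries of the
    covariance Sigma of the full estimate. The value function is then defined over
    this larger vector, so high-curvature regions of V incur extra expected cost.
    The exact layout here is an illustrative assumption.
    """
    x, theta, Sigma = np.asarray(x), np.asarray(theta), np.asarray(Sigma)
    return np.concatenate([x, theta, Sigma[np.triu_indices(Sigma.shape[0])]])

# Toy example: 2 states, 1 unknown mass, 3x3 covariance -> augmented dimension 2+1+6 = 9.
z = augment_state([0.1, 0.0], [1.3], np.eye(3))
print(z.shape)
```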
86 This is closely related to robust nonlinear controller design techniques based on the idea of H-infinity control [11, 12] and risk sensitive control [13, 14]. [sent-193, score-0.386]
87 We augment the dynamics equation with a disturbance term: x_{k+1} = f(x_k, u_k) + w_k, where w is a vector of disturbance inputs. [sent-194, score-0.459]
88 To limit the size of the disturbances, we include the disturbance magnitude in a modified one step cost function with a negative sign. [sent-195, score-0.279]
89 The opponent who controls the disturbance wants to increase our cost, so this new term gives the opponent an incentive to choose the worst direction for the disturbance, and a disturbance magnitude that gives the highest ratio of increased cost to disturbance size. [sent-196, score-0.733]
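A standard way to write this game-theoretic formulation, consistent with the description above; the disturbance weight gamma_w is an assumed symbol, since the paper's own notation is not recoverable from this extraction:

```latex
% Modified one-step cost with the disturbance magnitude entering with a negative
% sign, and the resulting minimax Bellman equation (gamma_w weights disturbance size):
L_{\mathrm{mod}}(x, u, w) = L(x, u) - \gamma_w \, w^{\top} w
\qquad
V(x) = \min_{u} \max_{w} \Big[ L_{\mathrm{mod}}(x, u, w) + \gamma \, V\big(f(x, u) + w\big) \Big]
```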
90 5 How to cover a volume of state space In tasks with a goal or point attractor, [3] showed that certain key trajectories can be grown backwards from the goal in order to approximate the value function. [sent-205, score-0.933]
91 In the case of a sparse use of trajectories to cover a space, the cost of the approach is dominated by the costs of updating second derivative matrices, and thus the cost of the trajectory-based approach increases quadratically as the dimensionality increases. [sent-206, score-0.774]
92 © E However, for periodic tasks the approach of growing trajectories backwards from the goal cannot be used, as there is no goal point or set. [sent-207, score-0.92]
93 In this case the trajectories that form the optimal cycle can be used as key trajectories, with each point along them supplying a local linear policy and local quadratic value function. [sent-208, score-1.053]
94 These key trajectories can be computed using any optimization method, and then the corresponding policy and value function estimates along the trajectory computed using the update rules given here. [sent-209, score-1.102]
95 It is important to point out that optimal trajectories need only be placed densely enough to separate regions which have different local optima. [sent-210, score-0.574]
96 The trajectories used in the representation usually follow local valleys of the value function. [sent-211, score-0.532]
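A short sketch of how such stored trajectories can act as a nonparametric policy and value function: query the nearest stored point (Euclidean distance is a simplifying assumption) and use its local models, as in the TrajectoryPoint sketch earlier:

```python
import numpy as np

def nearest_point_policy(x, trajectory_points):
    """Find the stored trajectory point nearest to x and apply its local linear
    policy and quadratic value model. `trajectory_points` is assumed to be a list
    of objects with a field x0 and methods action(x) and value(x), as in the
    TrajectoryPoint sketch above.
    """
    dists = [np.linalg.norm(np.asarray(x) - p.x0) for p in trajectory_points]
    p = trajectory_points[int(np.argmin(dists))]
    return p.action(x), p.value(x)
```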
97 Using these trajectories, and creating new trajectories as task demands require, we expect to be able to handle a range of natural tasks. [sent-213, score-0.915]
98 The trajectory-based approach requires less design skill from humans, since it does not need a "good" policy parameterization, and it produces cheaper and more robust policies that do not suffer from interference. [sent-215, score-0.471]
99 Control of forward velocity for a simplified planar hopping robot. [sent-262, score-0.413]
100 Autonomous helicopter control using reinforcement learning policy search methods. [sent-266, score-0.513]
wordName wordTfidf (topN-words)
[('trajectories', 0.412), ('policy', 0.379), ('leg', 0.243), ('periodic', 0.198), ('trajectory', 0.189), ('hopping', 0.186), ('disturbance', 0.166), ('robot', 0.163), ('controller', 0.162), ('velocity', 0.128), ('dynamics', 0.127), ('policies', 0.114), ('cost', 0.113), ('discontinuity', 0.111), ('discontinuities', 0.111), ('tasks', 0.103), ('regulator', 0.093), ('stance', 0.093), ('walking', 0.088), ('state', 0.088), ('height', 0.079), ('discount', 0.074), ('value', 0.073), ('goal', 0.071), ('derivatives', 0.07), ('parametric', 0.068), ('reinforcement', 0.068), ('robustness', 0.068), ('control', 0.066), ('atkeson', 0.065), ('backwards', 0.065), ('representational', 0.062), ('opponent', 0.061), ('morimoto', 0.061), ('controllers', 0.055), ('foot', 0.055), ('sweep', 0.055), ('actions', 0.054), ('segments', 0.054), ('robust', 0.053), ('updating', 0.053), ('offsets', 0.052), ('legs', 0.052), ('planar', 0.052), ('cover', 0.05), ('robotics', 0.05), ('angle', 0.05), ('taylor', 0.05), ('task', 0.049), ('along', 0.049), ('angular', 0.049), ('optimized', 0.048), ('local', 0.047), ('forward', 0.047), ('discontinuously', 0.046), ('feet', 0.046), ('grab', 0.046), ('hopper', 0.046), ('lqr', 0.046), ('interference', 0.046), ('usage', 0.046), ('optimal', 0.046), ('covariances', 0.044), ('starting', 0.042), ('handle', 0.042), ('masses', 0.04), ('rewarded', 0.04), ('air', 0.04), ('biped', 0.04), ('friction', 0.04), ('hip', 0.04), ('quadrants', 0.04), ('nearest', 0.04), ('movements', 0.04), ('resources', 0.04), ('functions', 0.039), ('design', 0.039), ('christopher', 0.038), ('criteria', 0.038), ('criterion', 0.038), ('sensing', 0.037), ('walks', 0.037), ('imperfect', 0.037), ('atr', 0.037), ('swing', 0.037), ('isaacs', 0.037), ('minimax', 0.036), ('globally', 0.036), ('region', 0.036), ('reduce', 0.036), ('created', 0.035), ('uncertainty', 0.035), ('regions', 0.035), ('densely', 0.034), ('plant', 0.034), ('touch', 0.034), ('torque', 0.034), ('vertical', 0.034), ('costs', 0.033), ('stability', 0.033)]
simIndex simValue paperId paperTitle
same-paper 1 0.99999964 155 nips-2002-Nonparametric Representation of Policies and Value Functions: A Trajectory-Based Approach
Author: Christopher G. Atkeson, Jun Morimoto
Abstract: A longstanding goal of reinforcement learning is to develop nonparametric representations of policies and value functions that support rapid learning without suffering from interference or the curse of dimensionality. We have developed a trajectory-based approach, in which policies and value functions are represented nonparametrically along trajectories. These trajectories, policies, and value functions are updated as the value function becomes more accurate or as a model of the task is updated. We have applied this approach to periodic tasks such as hopping and walking, which required handling discount factors and discontinuities in the task dynamics, and using function approximation to represent value functions at discontinuities. We also describe extensions of the approach to make the policies more robust to modeling error and sensor noise.
2 0.3081634 3 nips-2002-A Convergent Form of Approximate Policy Iteration
Author: Theodore J. Perkins, Doina Precup
Abstract: We study a new, model-free form of approximate policy iteration which uses Sarsa updates with linear state-action value function approximation for policy evaluation, and a “policy improvement operator” to generate a new policy based on the learned state-action values. We prove that if the policy improvement operator produces -soft policies and is Lipschitz continuous in the action values, with a constant that is not too large, then the approximate policy iteration algorithm converges to a unique solution from any initial policy. To our knowledge, this is the first convergence result for any form of approximate policy iteration under similar computational-resource assumptions.
3 0.26188472 123 nips-2002-Learning Attractor Landscapes for Learning Motor Primitives
Author: Auke J. Ijspeert, Jun Nakanishi, Stefan Schaal
Abstract: Many control problems take place in continuous state-action spaces, e.g., as in manipulator robotics, where the control objective is often defined as finding a desired trajectory that reaches a particular goal state. While reinforcement learning offers a theoretical framework to learn such control policies from scratch, its applicability to higher dimensional continuous state-action spaces remains rather limited to date. Instead of learning from scratch, in this paper we suggest to learn a desired complex control policy by transforming an existing simple canonical control policy. For this purpose, we represent canonical policies in terms of differential equations with well-defined attractor properties. By nonlinearly transforming the canonical attractor dynamics using techniques from nonparametric regression, almost arbitrary new nonlinear policies can be generated without losing the stability properties of the canonical system. We demonstrate our techniques in the context of learning a set of movement skills for a humanoid robot from demonstrations of a human teacher. Policies are acquired rapidly, and, due to the properties of well formulated differential equations, can be re-used and modified on-line under dynamic changes of the environment. The linear parameterization of nonparametric regression moreover lends itself to recognize and classify previously learned movement skills. Evaluations in simulations and on an actual 30 degree-offreedom humanoid robot exemplify the feasibility and robustness of our approach. 1
4 0.23599078 144 nips-2002-Minimax Differential Dynamic Programming: An Application to Robust Biped Walking
Author: Jun Morimoto, Christopher G. Atkeson
Abstract: We developed a robust control policy design method in high-dimensional state space by using differential dynamic programming with a minimax criterion. As an example, we applied our method to a simulated five link biped robot. The results show lower joint torques from the optimal control policy compared to a hand-tuned PD servo controller. Results also show that the simulated biped robot can successfully walk with unknown disturbances that cause controllers generated by standard differential dynamic programming and the hand-tuned PD servo to fail. Learning to compensate for modeling error and previously unknown disturbances in conjunction with robust control design is also demonstrated.
5 0.20830357 82 nips-2002-Exponential Family PCA for Belief Compression in POMDPs
Author: Nicholas Roy, Geoffrey J. Gordon
Abstract: Standard value function approaches to finding policies for Partially Observable Markov Decision Processes (POMDPs) are intractable for large models. The intractability of these algorithms is due to a great extent to their generating an optimal policy over the entire belief space. However, in real POMDP problems most belief states are unlikely, and there is a structured, low-dimensional manifold of plausible beliefs embedded in the high-dimensional belief space. We introduce a new method for solving large-scale POMDPs by taking advantage of belief space sparsity. We reduce the dimensionality of the belief space by exponential family Principal Components Analysis [1], which allows us to turn the sparse, highdimensional belief space into a compact, low-dimensional representation in terms of learned features of the belief state. We then plan directly on the low-dimensional belief features. By planning in a low-dimensional space, we can find policies for POMDPs that are orders of magnitude larger than can be handled by conventional techniques. We demonstrate the use of this algorithm on a synthetic problem and also on a mobile robot navigation task.
6 0.20285092 33 nips-2002-Approximate Linear Programming for Average-Cost Dynamic Programming
7 0.19137166 5 nips-2002-A Digital Antennal Lobe for Pattern Equalization: Analysis and Design
8 0.17727794 20 nips-2002-Adaptive Caching by Refetching
9 0.16420564 13 nips-2002-A Note on the Representational Incompatibility of Function Approximation and Factored Dynamics
10 0.15843599 9 nips-2002-A Minimal Intervention Principle for Coordinated Movement
11 0.14684393 134 nips-2002-Learning to Take Concurrent Actions
12 0.14161661 128 nips-2002-Learning a Forward Model of a Reflex
13 0.13819054 130 nips-2002-Learning in Zero-Sum Team Markov Games Using Factored Value Functions
14 0.1372003 137 nips-2002-Location Estimation with a Differential Update Network
15 0.12590322 159 nips-2002-Optimality of Reinforcement Learning Algorithms with Linear Function Approximation
16 0.11485138 169 nips-2002-Real-Time Particle Filters
17 0.11092608 153 nips-2002-Neural Decoding of Cursor Motion Using a Kalman Filter
18 0.10176206 51 nips-2002-Classifying Patterns of Visual Motion - a Neuromorphic Approach
19 0.095901355 61 nips-2002-Convergent Combinations of Reinforcement Learning with Linear Function Approximation
20 0.091843784 205 nips-2002-Value-Directed Compression of POMDPs
topicId topicWeight
[(0, -0.219), (1, 0.026), (2, -0.45), (3, -0.096), (4, 0.034), (5, -0.058), (6, 0.109), (7, -0.011), (8, 0.225), (9, 0.252), (10, -0.171), (11, 0.075), (12, 0.004), (13, -0.057), (14, -0.112), (15, 0.009), (16, 0.014), (17, -0.131), (18, 0.102), (19, 0.061), (20, -0.066), (21, 0.016), (22, -0.067), (23, -0.07), (24, 0.065), (25, 0.003), (26, 0.045), (27, 0.085), (28, 0.003), (29, -0.01), (30, -0.055), (31, -0.04), (32, -0.009), (33, -0.034), (34, 0.006), (35, 0.043), (36, 0.013), (37, 0.05), (38, 0.074), (39, -0.021), (40, -0.007), (41, -0.028), (42, -0.076), (43, 0.044), (44, 0.081), (45, 0.04), (46, -0.006), (47, 0.014), (48, -0.031), (49, -0.009)]
simIndex simValue paperId paperTitle
same-paper 1 0.97250515 155 nips-2002-Nonparametric Representation of Policies and Value Functions: A Trajectory-Based Approach
Author: Christopher G. Atkeson, Jun Morimoto
Abstract: A longstanding goal of reinforcement learning is to develop nonparametric representations of policies and value functions that support rapid learning without suffering from interference or the curse of dimensionality. We have developed a trajectory-based approach, in which policies and value functions are represented nonparametrically along trajectories. These trajectories, policies, and value functions are updated as the value function becomes more accurate or as a model of the task is updated. We have applied this approach to periodic tasks such as hopping and walking, which required handling discount factors and discontinuities in the task dynamics, and using function approximation to represent value functions at discontinuities. We also describe extensions of the approach to make the policies more robust to modeling error and sensor noise.
2 0.80708444 123 nips-2002-Learning Attractor Landscapes for Learning Motor Primitives
Author: Auke J. Ijspeert, Jun Nakanishi, Stefan Schaal
Abstract: Many control problems take place in continuous state-action spaces, e.g., as in manipulator robotics, where the control objective is often defined as finding a desired trajectory that reaches a particular goal state. While reinforcement learning offers a theoretical framework to learn such control policies from scratch, its applicability to higher dimensional continuous state-action spaces remains rather limited to date. Instead of learning from scratch, in this paper we suggest to learn a desired complex control policy by transforming an existing simple canonical control policy. For this purpose, we represent canonical policies in terms of differential equations with well-defined attractor properties. By nonlinearly transforming the canonical attractor dynamics using techniques from nonparametric regression, almost arbitrary new nonlinear policies can be generated without losing the stability properties of the canonical system. We demonstrate our techniques in the context of learning a set of movement skills for a humanoid robot from demonstrations of a human teacher. Policies are acquired rapidly, and, due to the properties of well formulated differential equations, can be re-used and modified on-line under dynamic changes of the environment. The linear parameterization of nonparametric regression moreover lends itself to recognize and classify previously learned movement skills. Evaluations in simulations and on an actual 30 degree-offreedom humanoid robot exemplify the feasibility and robustness of our approach. 1
3 0.76091629 33 nips-2002-Approximate Linear Programming for Average-Cost Dynamic Programming
Author: Benjamin V. Roy, Daniela D. Farias
Abstract: This paper extends our earlier analysis on approximate linear programming as an approach to approximating the cost-to-go function in a discounted-cost dynamic program [6]. In this paper, we consider the average-cost criterion and a version of approximate linear programming that generates approximations to the optimal average cost and differential cost function. We demonstrate that a naive version of approximate linear programming prioritizes approximation of the optimal average cost and that this may not be well-aligned with the objective of deriving a policy with low average cost. For that, the algorithm should aim at producing a good approximation of the differential cost function. We propose a twophase variant of approximate linear programming that allows for external control of the relative accuracy of the approximation of the differential cost function over different portions of the state space via state-relevance weights. Performance bounds suggest that the new algorithm is compatible with the objective of optimizing performance and provide guidance on appropriate choices for state-relevance weights.
4 0.71808553 144 nips-2002-Minimax Differential Dynamic Programming: An Application to Robust Biped Walking
Author: Jun Morimoto, Christopher G. Atkeson
Abstract: We developed a robust control policy design method in high-dimensional state space by using differential dynamic programming with a minimax criterion. As an example, we applied our method to a simulated five link biped robot. The results show lower joint torques from the optimal control policy compared to a hand-tuned PD servo controller. Results also show that the simulated biped robot can successfully walk with unknown disturbances that cause controllers generated by standard differential dynamic programming and the hand-tuned PD servo to fail. Learning to compensate for modeling error and previously unknown disturbances in conjunction with robust control design is also demonstrated.
5 0.685265 3 nips-2002-A Convergent Form of Approximate Policy Iteration
Author: Theodore J. Perkins, Doina Precup
Abstract: We study a new, model-free form of approximate policy iteration which uses Sarsa updates with linear state-action value function approximation for policy evaluation, and a “policy improvement operator” to generate a new policy based on the learned state-action values. We prove that if the policy improvement operator produces -soft policies and is Lipschitz continuous in the action values, with a constant that is not too large, then the approximate policy iteration algorithm converges to a unique solution from any initial policy. To our knowledge, this is the first convergence result for any form of approximate policy iteration under similar computational-resource assumptions.
6 0.67279279 20 nips-2002-Adaptive Caching by Refetching
7 0.57970482 134 nips-2002-Learning to Take Concurrent Actions
8 0.53540868 9 nips-2002-A Minimal Intervention Principle for Coordinated Movement
9 0.51482373 13 nips-2002-A Note on the Representational Incompatibility of Function Approximation and Factored Dynamics
10 0.50484806 130 nips-2002-Learning in Zero-Sum Team Markov Games Using Factored Value Functions
11 0.49460441 185 nips-2002-Speeding up the Parti-Game Algorithm
12 0.49013194 5 nips-2002-A Digital Antennal Lobe for Pattern Equalization: Analysis and Design
13 0.4877266 128 nips-2002-Learning a Forward Model of a Reflex
14 0.44163471 82 nips-2002-Exponential Family PCA for Belief Compression in POMDPs
15 0.40322918 137 nips-2002-Location Estimation with a Differential Update Network
16 0.34737533 160 nips-2002-Optoelectronic Implementation of a FitzHugh-Nagumo Neural Model
17 0.34271619 205 nips-2002-Value-Directed Compression of POMDPs
18 0.33737102 153 nips-2002-Neural Decoding of Cursor Motion Using a Kalman Filter
19 0.32137671 169 nips-2002-Real-Time Particle Filters
20 0.31009284 47 nips-2002-Branching Law for Axons
topicId topicWeight
[(14, 0.012), (23, 0.07), (42, 0.076), (54, 0.137), (55, 0.056), (67, 0.015), (68, 0.034), (71, 0.248), (73, 0.016), (74, 0.092), (83, 0.011), (92, 0.03), (98, 0.094)]
simIndex simValue paperId paperTitle
1 0.93540645 30 nips-2002-Annealing and the Rate Distortion Problem
Author: Albert E. Parker, Tomáš Gedeon, Alexander G. Dimitrov
Abstract: In this paper we introduce methodology to determine the bifurcation structure of optima for a class of similar cost functions from Rate Distortion Theory, Deterministic Annealing, Information Distortion and the Information Bottleneck Method. We also introduce a numerical algorithm which uses the explicit form of the bifurcating branches to find optima at a bifurcation point. 1
same-paper 2 0.82968688 155 nips-2002-Nonparametric Representation of Policies and Value Functions: A Trajectory-Based Approach
Author: Christopher G. Atkeson, Jun Morimoto
Abstract: A longstanding goal of reinforcement learning is to develop nonparametric representations of policies and value functions that support rapid learning without suffering from interference or the curse of dimensionality. We have developed a trajectory-based approach, in which policies and value functions are represented nonparametrically along trajectories. These trajectories, policies, and value functions are updated as the value function becomes more accurate or as a model of the task is updated. We have applied this approach to periodic tasks such as hopping and walking, which required handling discount factors and discontinuities in the task dynamics, and using function approximation to represent value functions at discontinuities. We also describe extensions of the approach to make the policies more robust to modeling error and sensor noise.
3 0.74374199 106 nips-2002-Hyperkernels
Author: Cheng S. Ong, Robert C. Williamson, Alex J. Smola
Abstract: We consider the problem of choosing a kernel suitable for estimation using a Gaussian Process estimator or a Support Vector Machine. A novel solution is presented which involves defining a Reproducing Kernel Hilbert Space on the space of kernels itself. By utilizing an analog of the classical representer theorem, the problem of choosing a kernel from a parameterized family of kernels (e.g. of varying width) is reduced to a statistical estimation problem akin to the problem of minimizing a regularized risk functional. Various classical settings for model or kernel selection are special cases of our framework.
4 0.67294872 144 nips-2002-Minimax Differential Dynamic Programming: An Application to Robust Biped Walking
Author: Jun Morimoto, Christopher G. Atkeson
Abstract: We developed a robust control policy design method in high-dimensional state space by using differential dynamic programming with a minimax criterion. As an example, we applied our method to a simulated five link biped robot. The results show lower joint torques from the optimal control policy compared to a hand-tuned PD servo controller. Results also show that the simulated biped robot can successfully walk with unknown disturbances that cause controllers generated by standard differential dynamic programming and the hand-tuned PD servo to fail. Learning to compensate for modeling error and previously unknown disturbances in conjunction with robust control design is also demonstrated.
5 0.6495291 3 nips-2002-A Convergent Form of Approximate Policy Iteration
Author: Theodore J. Perkins, Doina Precup
Abstract: We study a new, model-free form of approximate policy iteration which uses Sarsa updates with linear state-action value function approximation for policy evaluation, and a “policy improvement operator” to generate a new policy based on the learned state-action values. We prove that if the policy improvement operator produces -soft policies and is Lipschitz continuous in the action values, with a constant that is not too large, then the approximate policy iteration algorithm converges to a unique solution from any initial policy. To our knowledge, this is the first convergence result for any form of approximate policy iteration under similar computational-resource assumptions.
6 0.64558858 82 nips-2002-Exponential Family PCA for Belief Compression in POMDPs
7 0.64546996 52 nips-2002-Cluster Kernels for Semi-Supervised Learning
8 0.64350772 10 nips-2002-A Model for Learning Variance Components of Natural Images
9 0.64112711 68 nips-2002-Discriminative Densities from Maximum Contrast Estimation
10 0.64087772 169 nips-2002-Real-Time Particle Filters
11 0.64007509 21 nips-2002-Adaptive Classification by Variational Kalman Filtering
12 0.63954234 2 nips-2002-A Bilinear Model for Sparse Coding
13 0.63895261 141 nips-2002-Maximally Informative Dimensions: Analyzing Neural Responses to Natural Signals
14 0.63756126 33 nips-2002-Approximate Linear Programming for Average-Cost Dynamic Programming
15 0.63525927 127 nips-2002-Learning Sparse Topographic Representations with Products of Student-t Distributions
16 0.63524926 24 nips-2002-Adaptive Scaling for Feature Selection in SVMs
17 0.6332823 74 nips-2002-Dynamic Structure Super-Resolution
18 0.6316517 137 nips-2002-Location Estimation with a Differential Update Network
19 0.63106579 132 nips-2002-Learning to Detect Natural Image Boundaries Using Brightness and Texture
20 0.63090259 88 nips-2002-Feature Selection and Classification on Matrix Data: From Large Margins to Small Covering Numbers