nips nips2002 nips2002-144 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Jun Morimoto, Christopher G. Atkeson
Abstract: We developed a robust control policy design method in high-dimensional state space by using differential dynamic programming with a minimax criterion. As an example, we applied our method to a simulated five link biped robot. The results show lower joint torques from the optimal control policy compared to a hand-tuned PD servo controller. Results also show that the simulated biped robot can successfully walk with unknown disturbances that cause controllers generated by standard differential dynamic programming and the hand-tuned PD servo to fail. Learning to compensate for modeling error and previously unknown disturbances in conjunction with robust control design is also demonstrated.
Reference: text
sentIndex sentText sentNum sentScore
1 Abstract We developed a robust control policy design method in high-dimensional state space by using differential dynamic programming with a minimax criterion. [sent-7, score-0.635]
2 As an example, we applied our method to a simulated five link biped robot. [sent-8, score-0.301]
3 The results show lower joint torques from the optimal control policy compared to a hand-tuned PD servo controller. [sent-9, score-0.344]
4 Results also show that the simulated biped robot can successfully walk with unknown disturbances that cause controllers generated by standard differential dynamic programming and the hand-tuned PD servo to fail. [sent-10, score-0.943]
5 Learning to compensate for modeling error and previously unknown disturbances in conjunction with robust control design is also demonstrated. [sent-11, score-0.315]
6 1 Introduction Reinforcement learning[8] is widely studied because of its promise to automatically generate controllers for difficult tasks from attempts to do the task. [sent-12, score-0.043]
7 However, reinforcement learning requires a great deal of training data and computational resources, and sometimes fails to learn high dimensional tasks. [sent-13, score-0.034]
8 To improve reinforcement learning, we propose using differential dynamic programming (DDP), which is a second order local trajectory optimization method, to generate locally optimal plans and local models of the value function [2, 4]. [sent-14, score-0.278]
9 However, when we apply dynamic programming to a real environment, handling inevitable modeling errors is crucial. [sent-16, score-0.11]
10 In this study, we develop minimax differential dynamic programming, which provides robust nonlinear controller designs based on the ideas of H∞ control [9, 5] and risk-sensitive control [6, 1]. [sent-17, score-0.65]
11 We apply the proposed method to a simulated five link biped robot (Fig. 1). [sent-18, score-0.434]
12 Our strategy is to use minimax DDP to find both a low-torque biped walk and a policy or control law to handle deviations from the optimized trajectory. [sent-20, score-0.799]
13 We show that both standard DDP and minimax DDP can find a local policy for a lower-torque biped walk than a hand-tuned PD servo controller. [sent-21, score-0.87]
14 We show that minimax DDP can cope with larger modeling error than standard DDP or the hand-tuned PD controller. [sent-22, score-0.298]
15 Thus, the robust controller allows us to collect useful training data. [sent-23, score-0.281]
16 In addition, we can use learning to correct modeling errors and model previously unknown disturbances, and design a new, more optimal robust controller using additional iterations of minimax DDP. [sent-24, score-0.579]
17 Differential dynamic programming maintains a second order local model of a Q function (Q(i), Q_x(i), Q_u(i), Q_xx(i), Q_xu(i), Q_uu(i)), where Q(i) = r(x_i, u_i, i) + V(x_{i+1}, i+1), and the subscripts indicate partial derivatives. [sent-27, score-0.354]
18 Then, we can derive the new control output u_i^new = u_i + δu_i from arg max_{δu_i} Q(x_i + δx_i, u_i + δu_i, i). [sent-28, score-0.608]
19 Finally, by using the new control output u_i^new, a second order local model of the value function (V(i), V_x(i), V_xx(i)) can be derived [2, 4]. [sent-29, score-0.138]
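To make the backward recursion concrete, here is a minimal numpy sketch of one DDP backward step: it builds the second order model of Q(i) from local derivatives of the reward and dynamics plus the value model at step i+1, then derives the control correction and feedback gain. The function name, argument names, and the quadratic stage model in the small example are illustrative assumptions, not the authors' implementation.

```python
# A minimal numpy sketch (not the authors' code) of one DDP backward step:
# build the second order model of Q(i) = r(x_i, u_i, i) + V(x_{i+1}, i+1)
# around a nominal (x_i, u_i) and derive the locally optimal control change.
import numpy as np

def ddp_backward_step(r_x, r_u, r_xx, r_uu, r_ux,   # local reward derivatives
                      f_x, f_u,                     # dynamics linearization
                      V_x, V_xx):                   # value model at step i+1
    # Second order local model of the Q function (subscripts = partial derivatives).
    Q_x = r_x + f_x.T @ V_x
    Q_u = r_u + f_u.T @ V_x
    Q_xx = r_xx + f_x.T @ V_xx @ f_x
    Q_uu = r_uu + f_u.T @ V_xx @ f_u
    Q_ux = r_ux + f_u.T @ V_xx @ f_x
    # Optimizing the quadratic model over du gives du = k + K dx.
    Quu_inv = np.linalg.inv(Q_uu)
    k = -Quu_inv @ Q_u    # open-loop correction, giving u_i^new = u_i + k
    K = -Quu_inv @ Q_ux   # feedback gain, reused as K_i in eq. (2) below
    # Second order model of the value function at step i after the update.
    V_x_new = Q_x + K.T @ Q_uu @ k + K.T @ Q_u + Q_ux.T @ k
    V_xx_new = Q_xx + K.T @ Q_uu @ K + K.T @ Q_ux + Q_ux.T @ K
    return k, K, V_x_new, V_xx_new

# Tiny example with a 2-state, 1-input quadratic stage model.
n, m = 2, 1
k, K, Vx, Vxx = ddp_backward_step(
    r_x=np.zeros(n), r_u=np.zeros(m),
    r_xx=-np.eye(n), r_uu=-np.eye(m), r_ux=np.zeros((m, n)),
    f_x=np.eye(n), f_u=np.ones((n, m)),
    V_x=np.zeros(n), V_xx=-np.eye(n))
```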
20 2 Finding a local policy DDP finds a locally optimal trajectory x_i^opt and the corresponding control trajectory u_i^opt. [sent-31, score-0.347]
21 When we apply our control algorithm to a real environment, we usually need a feedback controller to cope with unknown disturbances or modeling errors. [sent-32, score-0.495]
22 Fortunately, DDP provides us with a local policy along the optimized trajectory: u^opt(x_i, i) = u_i^opt + K_i (x_i − x_i^opt), (2) where K_i is a time-dependent gain matrix given by taking the derivative of the optimal policy with respect to the state [2, 4]. [sent-33, score-0.28]
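As a usage sketch of the local policy in equation (2), the snippet below tracks the optimized trajectory with the time-dependent gains K_i; the trajectory, gains, and dimensions are made-up placeholders, not values from the paper.

```python
# A minimal usage sketch of the local policy in eq. (2); the optimized
# trajectory, gains, and dimensions below are made-up placeholders.
import numpy as np

def local_policy(x, i, x_opt, u_opt, K):
    """u^opt(x, i) = u_opt[i] + K[i] (x - x_opt[i])."""
    return u_opt[i] + K[i] @ (x - x_opt[i])

# Example: 2-dimensional state, 1-dimensional control, 3-step trajectory.
x_opt = [np.zeros(2), np.array([0.1, 0.0]), np.array([0.2, 0.0])]
u_opt = [np.array([0.5]), np.array([0.4]), np.array([0.3])]
K = [np.array([[-1.0, -0.2]])] * 3        # time-dependent gain matrices K_i
x = np.array([0.12, 0.01])                # current, slightly perturbed state
u = local_policy(x, 1, x_opt, u_opt, K)   # feedback-corrected command
```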
23 The difference is that the proposed method has an additional disturbance variable w to explicitly represent the existence of disturbances. [sent-36, score-0.097]
24 This representation of the disturbance provides robustness for the optimized trajectories and policies [5]. [sent-37, score-0.204]
25 Here, δu_i and δw_i must be chosen to minimize and maximize the second order expansion of the Q function Q(x_i + δx_i, u_i + δu_i, w_i + δw_i, i) in (3), respectively, i.e., [sent-39, score-0.357]
26 δu_i = −Q_uu^{-1}(i)[Q_ux(i)δx_i + Q_uw(i)δw_i + Q_u(i)], δw_i = −Q_ww^{-1}(i)[Q_wx(i)δx_i + Q_wu(i)δu_i + Q_w(i)]. (13) [sent-41, score-0.038]
27 By solving (13), we can derive both δu_i and δw_i. [sent-42, score-0.04]
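Because δu_i and δw_i in (13) each appear in the other's update, one convenient way to obtain them is to solve the stacked stationarity conditions of the quadratic Q model jointly. The sketch below does this with numpy; the function name and the small example values are illustrative assumptions rather than the paper's code.

```python
# A minimal sketch (not the paper's code) of eq. (13): because du and dw each
# appear in the other's update, we solve the stacked stationarity conditions
# of the quadratic Q model for both simultaneously.
import numpy as np

def minimax_step(Q_u, Q_w, Q_uu, Q_ww, Q_uw, Q_ux, Q_wx, dx):
    Q_wu = Q_uw.T
    A = np.block([[Q_uu, Q_uw],
                  [Q_wu, Q_ww]])
    b = -np.concatenate([Q_ux @ dx + Q_u,
                         Q_wx @ dx + Q_w])
    sol = np.linalg.solve(A, b)          # joint solution of the coupled equations
    n_u = Q_u.shape[0]
    return sol[:n_u], sol[n_u:]          # du (control), dw (disturbance)

# Example with one control input, one disturbance input, two state dimensions.
du, dw = minimax_step(Q_u=np.array([0.2]), Q_w=np.array([-0.1]),
                      Q_uu=np.array([[2.0]]), Q_ww=np.array([[-1.5]]),
                      Q_uw=np.array([[0.3]]), Q_ux=np.array([[0.1, 0.0]]),
                      Q_wx=np.array([[0.0, 0.2]]), dx=np.zeros(2))
```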
28 1 Biped robot model In this paper, we use a simulated five link biped robot (Fig. 1). [sent-45, score-0.567]
29 Kinematic and dynamic parameters of the simulated robot are chosen to match those of a biped robot we are currently developing (Fig. 1). [sent-47, score-0.583]
30 Height and total weight of the robot are about 0. [sent-49, score-0.133]
31 Figure 1: Left: Five link robot model, Right: Real robot. Table 1: Physical parameters of the robot model (mass [kg], etc., for link1–link5). [sent-53, score-0.528]
32 We can represent the forward dynamics of the biped robot as x_{i+1} = f(x_i) + b(x_i)u_i, (15) [sent-68, score-0.437]
33 where x = {θ1, θ̇1, . . . , θ5, θ̇5} denotes the input state vector (joint angles and angular velocities) and u = {τ1, . . . , τ4} denotes the control command [sent-74, score-0.079]
34 (each torque τj is applied to joint j (Fig. 1)). [sent-77, score-0.249]
35 In the minimax optimization case, we explicitly represent the existence of the disturbance as x_{i+1} = f(x_i) + b(x_i)u_i + b_w(x_i)w_i, (16) where w = {w_0, w_1, w_2, w_3, w_4} denotes the disturbance (w_0 is applied to the ankle, and w_j (j = 1, . . . , 4) is applied to joint j). [sent-79, score-0.499]
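The following sketch shows the control-affine form of equations (15)-(16) as code: nominal drift plus input and disturbance channels. The placeholder functions f, b, and b_w and the toy dimensions (10 states, 4 torques, 5 disturbance inputs) stand in for the biped dynamics, which are not given explicitly in the text.

```python
# A minimal sketch of the dynamics in eqs. (15)-(16): control-affine nominal
# dynamics plus an explicit additive disturbance channel. The placeholder
# functions f, b, b_w and the toy dimensions stand in for the biped model.
import numpy as np

def step(x, u, w, f, b, b_w):
    """x_{i+1} = f(x_i) + b(x_i) u_i + b_w(x_i) w_i."""
    return f(x) + b(x) @ u + b_w(x) @ w

f = lambda x: x                          # placeholder drift term
b = lambda x: np.eye(len(x))[:, :4]      # 4 joint torques enter the state
b_w = lambda x: np.eye(len(x))[:, :5]    # 5 disturbance inputs (ankle + 4 joints)

x = np.zeros(10)                         # 5 joint angles and 5 angular velocities
x_next = step(x, u=np.zeros(4), w=np.zeros(5), f=f, b=b, b_w=b_w)
```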
36 The term (x_i − x_i^d)^T Q (x_i − x_i^d) encourages the robot to follow the nominal trajectory, the term u_i^T R u_i discourages using large control outputs, and the term (v(x_i) − v^d)^T S (v(x_i) − v^d) encourages the robot to achieve the desired velocity. [sent-87, score-0.735]
37 The term F(x_0) penalizes an initial state where the foot is not on the ground: F(x_0) = F_h^T(x_0) P_0 F_h(x_0), (20) where F_h(x_0) denotes the height of the swing foot at the initial state x_0. [sent-89, score-0.201]
38 The term Φ_N(x_0, x_N) is used to generate periodic trajectories: Φ_N(x_0, x_N) = (x_N − H(x_0))^T P_N (x_N − H(x_0)), (21) where x_N denotes the terminal state, x_0 denotes the initial state, and the term (x_N − H(x_0))^T P_N (x_N − H(x_0)) is a measure of terminal control accuracy. [sent-90, score-0.249]
39 A function H() represents the coordinate change caused by the exchange of a support leg and a swing leg, and the velocity change caused by a swing foot touching the ground (Appendix A). [sent-91, score-0.211]
40 We implement the minimax DDP by adding a minimax term to the criterion. [sent-92, score-0.504]
41 We use a modified objective function: J_minimax = J − Σ_{i=0}^{N−1} w_i^T G w_i, (22) where w_i denotes a disturbance vector at the i-th time step, and the term w_i^T G w_i rewards coping with large disturbances. [sent-93, score-0.504]
42 This explicit representation of the disturbance w provides robustness for the controller [5]. [sent-94, score-0.328]
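A minimal sketch of the modified objective in equation (22): the ordinary tracking and effort cost J minus a quadratic term in the disturbances, which rewards coping with large w_i. For brevity this omits the velocity term, F(x_0), and Φ_N; the function and variable names are assumptions, not the authors' code.

```python
# A minimal sketch of the modified objective in eq. (22): the ordinary
# tracking/effort cost J minus a quadratic disturbance term. The velocity
# term, F(x_0), and Phi_N are omitted here; names are illustrative.
import numpy as np

def running_cost(x, u, x_d, Q, R):
    # (x_i - x_i^d)^T Q (x_i - x_i^d) + u_i^T R u_i
    return (x - x_d) @ Q @ (x - x_d) + u @ R @ u

def minimax_objective(xs, us, ws, x_ds, Q, R, G):
    J = sum(running_cost(x, u, x_d, Q, R) for x, u, x_d in zip(xs, us, x_ds))
    return J - sum(w @ G @ w for w in ws)   # subtracting w_i^T G w_i, eq. (22)

# Example over a 3-step horizon with small placeholder matrices.
xs = [np.zeros(2)] * 3; us = [np.zeros(1)] * 3; ws = [np.zeros(1)] * 3
J_mm = minimax_objective(xs, us, ws, x_ds=[np.zeros(2)] * 3,
                         Q=np.eye(2), R=np.eye(1), G=5.0 * np.eye(1))
```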
43 4 Results We compare the optimized controller with a hand-tuned PD servo controller, which is also the source of the initial and nominal trajectories in the optimization process. [sent-95, score-0.528]
44 0} in equation (21), where I_N denotes the N-dimensional identity matrix. [sent-111, score-0.053]
45 For minimax DDP, we set the parameter for the disturbance reward in equation (22) as G = diag{5. [sent-112, score-0.349]
46 0} (G with smaller elements generates more conservative but robust trajectories). [sent-117, score-0.048]
47 Each parameter is set to achieve the best results in terms of both robustness and energy efficiency. [sent-118, score-0.023]
48 When we apply the controllers acquired by standard DDP and minimax DDP to the biped walk, we adopt the local policy which we introduced in section 2. [sent-119, score-0.616]
49 Results in table 2 show that the controllers generated by standard DDP and minimax DDP almost halved the cost of the trajectory, as compared to that of the original hand-tuned PD servo controller. [sent-121, score-0.697]
50 However, because the minimax DDP is more conservative in taking advantage of the plant dynamics, it has a slightly higher control cost than the standard DDP. [sent-122, score-0.37]
51 Note that we defined the control cost as (1/N) Σ_{i=0}^{N−1} ||u_i||^2, where u_i is the control output (torque) vector at the i-th time step, and N denotes the total number of time steps for one-step trajectories. [sent-123, score-0.544]
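The reported metric can be computed directly from the recorded torques; below is a small helper, with an assumed array layout of one torque vector per time step.

```python
# A small helper for the reported metric, assuming one torque vector per
# time step stored row-wise: (1/N) * sum_i ||u_i||^2.
import numpy as np

def control_cost(us):
    us = np.asarray(us)                       # shape (N, num_joints)
    return np.mean(np.sum(us ** 2, axis=1))   # averaged squared torque norm
```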
52 Table 2: One step control cost (average over 100 steps), in [(N · m)^2 × 10^−2], for the PD servo, standard DDP, and minimax DDP. [sent-124, score-0.683]
53 To test robustness, we assume that there is unknown viscous friction at each joint: τ_j^dist = −µ_j θ̇_j (j = 1, . . . , 4), (23) [sent-127, score-0.069]
54 where µ_j denotes the viscous friction coefficient at joint j. [sent-130, score-0.128]
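A minimal sketch of the test disturbance in equation (23); the friction coefficients in the example call are placeholders, since the values from Table 3 are not recoverable here.

```python
# A minimal sketch of the test disturbance in eq. (23): viscous friction
# proportional to each joint's angular velocity. The coefficients in the
# example call are placeholders, not the values from Table 3.
import numpy as np

def friction_torque(theta_dot, mu):
    """tau_j^dist = -mu_j * theta_dot_j for joints j = 1..4."""
    return -np.asarray(mu) * np.asarray(theta_dot)

tau_dist = friction_torque(theta_dot=[0.5, -0.2, 0.1, 0.0],
                           mu=[0.1, 0.1, 0.1, 0.1])   # hypothetical coefficients
```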
55 We used two levels of disturbances in the simulation, with the higher level being 3 times larger than the base level (Table 3). [sent-131, score-0.128]
56 Table 3: Parameters of the disturbance, µ2, µ3 (hip joints) and µ1, µ4 (knee joints), at the base and higher levels. [sent-132, score-0.118]
57 All methods could handle the base level disturbances. [sent-136, score-0.021]
58 Both the standard and the minimax DDP generated much lower control cost than the hand-tuned PD servo controller (Table 4). [sent-137, score-0.754]
59 However, only the minimax DDP control design could cope with the higher level of disturbances. [sent-138, score-0.389]
60 Figure 2 shows trajectories for the three different methods. [sent-139, score-0.065]
61 With both the standard DDP and the hand-tuned PD servo controller, the simulated robot fell down before achieving 100 steps. [sent-140, score-0.55]
62 The bottom of figure 2 shows part of a successful biped walking trajectory of the robot with the minimax DDP. [sent-141, score-0.742]
63 Figure 3 shows ankle joint trajectories for the three different methods. [sent-142, score-0.2]
64 Only the minimax DDP successfully kept the ankle joint θ1 around 90 degrees for more than 20 seconds. [sent-143, score-0.387]
65 Table 5 shows the number of steps before the robot fell down. [sent-144, score-0.152]
66 We terminated a trial when the robot achieved 1000 steps. [sent-145, score-0.133]
67 Table 4: One step control cost with the base setting (averaged over 100 steps), in [(N · m)^2 × 10^−2], for the PD servo, standard DDP, and minimax DDP. [sent-146, score-0.704]
68 Figure 2: Biped walk trajectories with the three different methods (hand-tuned PD servo, standard DDP, minimax DDP). 5 Learning the unmodeled dynamics In section 4, we verified that minimax DDP could generate robust biped trajectories and policies. [sent-149, score-0.951]
69 The minimax DDP coped with larger disturbances than the standard DDP and the hand-tuned PD servo controller. [sent-150, score-0.535]
70 However, if there are modeling errors, using a robust controller which does not learn is not particularly energy efficient. [sent-151, score-0.28]
71 Fortunately, with minimax DDP, we can collect sufficient data to improve our dynamics model. [sent-152, score-0.334]
72 Here, we propose using Receptive Field Weighted Regression (RFWR) [7] to learn the error dynamics of the biped robot. [sent-153, score-0.304]
73 In this section we present results on learning a simulated modeling error (the disturbances discussed in section 4). [sent-154, score-0.164]
74 We can represent the full dynamics as the sum of the known dynamics and the error dynamics ∆F(x_i, u_i, i): x_{i+1} = F(x_i, u_i) + ∆F(x_i, u_i, i). [sent-156, score-0.909]
75 We align 20 basis functions (N_b = 20) at even intervals along the biped trajectories. [sent-158, score-0.247]
76 The learning strategy uses the following sequence: 1) Design the initial controller using minimax DDP applied to the nominal model. [sent-159, score-0.52]
77 4) Redesign the biped controller using minimax DDP with the learned model. [sent-162, score-0.707]
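A schematic sketch of the learning sequence follows: design with minimax DDP on the nominal model, run the robust controller to collect data, fit the error dynamics ∆F, and redesign with the learned model. Plain least squares stands in for RFWR's 20 receptive fields, and design_controller and run_and_collect are hypothetical callables, not functions from the paper.

```python
# A schematic sketch of the learning sequence (not the authors' implementation):
# 1) design with minimax DDP on the nominal model, 2) run the robust controller
# and collect data, 3) fit the error dynamics Delta_F, 4) redesign with the
# learned model. Plain least squares stands in for RFWR's 20 receptive fields;
# design_controller and run_and_collect are hypothetical callables.
import numpy as np

def fit_error_dynamics(X, U, X_next, F_nominal):
    # Residuals between observed transitions and the nominal model prediction.
    residuals = np.array([x1 - F_nominal(x, u) for x, u, x1 in zip(X, U, X_next)])
    features = np.hstack([np.asarray(X), np.asarray(U)])
    W, *_ = np.linalg.lstsq(features, residuals, rcond=None)
    return lambda x, u: np.hstack([x, u]) @ W        # Delta_F(x, u)

def learning_cycle(design_controller, run_and_collect, F_nominal):
    controller = design_controller(F_nominal)                  # step 1
    X, U, X_next = run_and_collect(controller)                 # step 2
    delta_F = fit_error_dynamics(X, U, X_next, F_nominal)      # step 3
    F_learned = lambda x, u: F_nominal(x, u) + delta_F(x, u)   # full dynamics
    return design_controller(F_learned)                        # step 4
```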
78 Results in table 6 show that the controller after learning the error dynamics used lower torque to produce stable biped walking trajectories. [sent-164, score-0.676]
79 Table 6: One step control cost with the large disturbances (averaged over 100 steps), in [(N · m)^2 × 10^−2], without and with the learned model. [sent-165, score-0.362]
80 6 Discussion In this study, we developed an optimization method to generate biped walking trajectories by using differential dynamic programming (DDP). [sent-167, score-0.506]
81 We showed that 1) DDP and minimax DDP can be applied to high dimensional problems, 2) minimax DDP can design more robust controllers, and 3) learning can be used to reduce modeling error and unknown disturbances in the context of minimax DDP control design. [sent-168, score-1.071]
82 Both standard DDP and minimax DDP generated low torque biped trajectories. [sent-169, score-0.579]
83 We showed that the minimax DDP control design was more robust than the controllers designed by standard DDP and the hand-tuned PD servo. [sent-170, score-0.623]
84 Given a robust controller, we could collect sufficient data to learn the error dynamics using RFWR[7] without the robot falling down all the time. [sent-171, score-0.263]
85 We also showed that after learning the error dynamics, the biped robot could find a lower torque trajectory. [sent-172, score-0.46]
86 DDP provides a feedback controller, which is important in coping with unknown disturbances and modeling errors. [sent-173, score-0.298]
87 However, as shown in equation (2), the feedback controller is indexed by time, and development of a time-independent feedback controller is a future goal. [sent-174, score-0.464]
88 Appendix A Ground contact model The function H() in equation (21) includes the mapping (velocity change) caused by ground contact. [sent-175, score-0.115]
89 To derive the first derivative of the value function V_x(x_N) and the second derivative V_xx(x_N), where x_N denotes the terminal state, the function H() should be analytical. [sent-176, score-0.08]
90 Rigid body collisions of planar kinematic chains with multiple contact points. [sent-195, score-0.084]
wordName wordTfidf (topN-words)
[('ddp', 0.676), ('minimax', 0.252), ('biped', 0.247), ('ui', 0.246), ('controller', 0.208), ('servo', 0.176), ('xi', 0.144), ('pd', 0.135), ('robot', 0.133), ('vxx', 0.122), ('wi', 0.111), ('ankle', 0.108), ('disturbances', 0.107), ('disturbance', 0.097), ('control', 0.089), ('xn', 0.083), ('qu', 0.082), ('torque', 0.08), ('vx', 0.079), ('trajectories', 0.065), ('contact', 0.064), ('nominal', 0.06), ('trajectory', 0.058), ('dynamics', 0.057), ('qw', 0.057), ('differential', 0.056), ('qux', 0.054), ('qwx', 0.054), ('qxu', 0.054), ('qxx', 0.054), ('fu', 0.054), ('denotes', 0.053), ('policy', 0.052), ('walking', 0.052), ('ground', 0.051), ('programming', 0.049), ('robust', 0.048), ('velocity', 0.045), ('controllers', 0.043), ('qx', 0.043), ('fx', 0.043), ('walk', 0.041), ('deg', 0.041), ('quu', 0.041), ('quw', 0.041), ('qxw', 0.041), ('rfwr', 0.041), ('uopt', 0.041), ('fw', 0.04), ('ww', 0.04), ('uu', 0.038), ('ck', 0.038), ('xd', 0.037), ('dynamic', 0.037), ('reinforcement', 0.034), ('simulated', 0.033), ('foot', 0.032), ('swing', 0.032), ('table', 0.032), ('kg', 0.03), ('cost', 0.029), ('terminal', 0.027), ('gwi', 0.027), ('heel', 0.027), ('qwu', 0.027), ('qww', 0.027), ('rui', 0.027), ('unew', 0.027), ('xopt', 0.027), ('nb', 0.027), ('joint', 0.027), ('design', 0.026), ('pn', 0.026), ('state', 0.026), ('fh', 0.025), ('collect', 0.025), ('modeling', 0.024), ('feedback', 0.024), ('viscous', 0.024), ('inertia', 0.024), ('friction', 0.024), ('morimoto', 0.024), ('robustness', 0.023), ('cope', 0.022), ('local', 0.022), ('coping', 0.021), ('atr', 0.021), ('joints', 0.021), ('link', 0.021), ('base', 0.021), ('unknown', 0.021), ('kinematic', 0.02), ('step', 0.019), ('steps', 0.019), ('deviations', 0.019), ('sec', 0.019), ('angular', 0.019), ('ki', 0.019), ('leg', 0.019), ('optimized', 0.019)]
simIndex simValue paperId paperTitle
same-paper 1 1.0000001 144 nips-2002-Minimax Differential Dynamic Programming: An Application to Robust Biped Walking
Author: Jun Morimoto, Christopher G. Atkeson
Abstract: We developed a robust control policy design method in high-dimensional state space by using differential dynamic programming with a minimax criterion. As an example, we applied our method to a simulated five link biped robot. The results show lower joint torques from the optimal control policy compared to a hand-tuned PD servo controller. Results also show that the simulated biped robot can successfully walk with unknown disturbances that cause controllers generated by standard differential dynamic programming and the hand-tuned PD servo to fail. Learning to compensate for modeling error and previously unknown disturbances in conjunction with robust control design is also demonstrated.
2 0.23599078 155 nips-2002-Nonparametric Representation of Policies and Value Functions: A Trajectory-Based Approach
Author: Christopher G. Atkeson, Jun Morimoto
Abstract: A longstanding goal of reinforcement learning is to develop nonparametric representations of policies and value functions that support rapid learning without suffering from interference or the curse of dimensionality. We have developed a trajectory-based approach, in which policies and value functions are represented nonparametrically along trajectories. These trajectories, policies, and value functions are updated as the value function becomes more accurate or as a model of the task is updated. We have applied this approach to periodic tasks such as hopping and walking, which required handling discount factors and discontinuities in the task dynamics, and using function approximation to represent value functions at discontinuities. We also describe extensions of the approach to make the policies more robust to modeling error and sensor noise.
3 0.11000842 128 nips-2002-Learning a Forward Model of a Reflex
Author: Bernd Porr, Florentin Wörgötter
Abstract: We develop a systems theoretical treatment of a behavioural system that interacts with its environment in a closed loop situation such that its motor actions influence its sensor inputs. The simplest form of a feedback is a reflex. Reflexes occur always “too late”; i.e., only after a (unpleasant, painful, dangerous) reflex-eliciting sensor event has occurred. This defines an objective problem which can be solved if another sensor input exists which can predict the primary reflex and can generate an earlier reaction. In contrast to previous approaches, our linear learning algorithm allows for an analytical proof that this system learns to apply feedforward control with the result that slow feedback loops are replaced by their equivalent feed-forward controller creating a forward model. In other words, learning turns the reactive system into a pro-active system. By means of a robot implementation we demonstrate the applicability of the theoretical results which can be used in a variety of different areas in physics and engineering.
4 0.10500836 123 nips-2002-Learning Attractor Landscapes for Learning Motor Primitives
Author: Auke J. Ijspeert, Jun Nakanishi, Stefan Schaal
Abstract: Many control problems take place in continuous state-action spaces, e.g., as in manipulator robotics, where the control objective is often defined as finding a desired trajectory that reaches a particular goal state. While reinforcement learning offers a theoretical framework to learn such control policies from scratch, its applicability to higher dimensional continuous state-action spaces remains rather limited to date. Instead of learning from scratch, in this paper we suggest to learn a desired complex control policy by transforming an existing simple canonical control policy. For this purpose, we represent canonical policies in terms of differential equations with well-defined attractor properties. By nonlinearly transforming the canonical attractor dynamics using techniques from nonparametric regression, almost arbitrary new nonlinear policies can be generated without losing the stability properties of the canonical system. We demonstrate our techniques in the context of learning a set of movement skills for a humanoid robot from demonstrations of a human teacher. Policies are acquired rapidly, and, due to the properties of well formulated differential equations, can be re-used and modified on-line under dynamic changes of the environment. The linear parameterization of nonparametric regression moreover lends itself to recognize and classify previously learned movement skills. Evaluations in simulations and on an actual 30 degree-offreedom humanoid robot exemplify the feasibility and robustness of our approach. 1
5 0.093540087 9 nips-2002-A Minimal Intervention Principle for Coordinated Movement
Author: Emanuel Todorov, Michael I. Jordan
Abstract: Behavioral goals are achieved reliably and repeatedly with movements rarely reproducible in their detail. Here we offer an explanation: we show that not only are variability and goal achievement compatible, but indeed that allowing variability in redundant dimensions is the optimal control strategy in the face of uncertainty. The optimal feedback control laws for typical motor tasks obey a “minimal intervention” principle: deviations from the average trajectory are only corrected when they interfere with the task goals. The resulting behavior exhibits task-constrained variability, as well as synergetic coupling among actuators—which is another unexplained empirical phenomenon.
6 0.092559658 185 nips-2002-Speeding up the Parti-Game Algorithm
7 0.069725342 82 nips-2002-Exponential Family PCA for Belief Compression in POMDPs
9 0.065334707 33 nips-2002-Approximate Linear Programming for Average-Cost Dynamic Programming
10 0.059643142 10 nips-2002-A Model for Learning Variance Components of Natural Images
11 0.057627339 6 nips-2002-A Formulation for Minimax Probability Machine Regression
12 0.054497443 72 nips-2002-Dyadic Classification Trees via Structural Risk Minimization
13 0.054181539 137 nips-2002-Location Estimation with a Differential Update Network
14 0.053622507 64 nips-2002-Data-Dependent Bounds for Bayesian Mixture Methods
15 0.05225395 3 nips-2002-A Convergent Form of Approximate Policy Iteration
16 0.052064057 204 nips-2002-VIBES: A Variational Inference Engine for Bayesian Networks
17 0.045094777 69 nips-2002-Discriminative Learning for Label Sequences via Boosting
18 0.043110061 130 nips-2002-Learning in Zero-Sum Team Markov Games Using Factored Value Functions
19 0.042745948 169 nips-2002-Real-Time Particle Filters
20 0.042713296 5 nips-2002-A Digital Antennal Lobe for Pattern Equalization: Analysis and Design
topicId topicWeight
[(0, -0.112), (1, -0.005), (2, -0.167), (3, -0.033), (4, 0.027), (5, -0.008), (6, 0.037), (7, 0.036), (8, 0.144), (9, 0.14), (10, -0.104), (11, 0.034), (12, 0.047), (13, -0.072), (14, -0.121), (15, 0.016), (16, 0.04), (17, -0.081), (18, 0.048), (19, 0.075), (20, 0.003), (21, 0.08), (22, -0.035), (23, -0.078), (24, 0.198), (25, -0.017), (26, -0.021), (27, 0.128), (28, 0.101), (29, -0.116), (30, -0.092), (31, 0.147), (32, -0.005), (33, 0.006), (34, -0.098), (35, 0.041), (36, 0.103), (37, -0.005), (38, 0.231), (39, -0.006), (40, -0.115), (41, -0.063), (42, -0.054), (43, -0.016), (44, 0.099), (45, -0.032), (46, 0.043), (47, 0.03), (48, -0.039), (49, -0.044)]
simIndex simValue paperId paperTitle
same-paper 1 0.97043002 144 nips-2002-Minimax Differential Dynamic Programming: An Application to Robust Biped Walking
Author: Jun Morimoto, Christopher G. Atkeson
Abstract: We developed a robust control policy design method in high-dimensional state space by using differential dynamic programming with a minimax criterion. As an example, we applied our method to a simulated five link biped robot. The results show lower joint torques from the optimal control policy compared to a hand-tuned PD servo controller. Results also show that the simulated biped robot can successfully walk with unknown disturbances that cause controllers generated by standard differential dynamic programming and the hand-tuned PD servo to fail. Learning to compensate for modeling error and previously unknown disturbances in conjunction with robust control design is also demonstrated.
2 0.64866978 155 nips-2002-Nonparametric Representation of Policies and Value Functions: A Trajectory-Based Approach
Author: Christopher G. Atkeson, Jun Morimoto
Abstract: A longstanding goal of reinforcement learning is to develop nonparametric representations of policies and value functions that support rapid learning without suffering from interference or the curse of dimensionality. We have developed a trajectory-based approach, in which policies and value functions are represented nonparametrically along trajectories. These trajectories, policies, and value functions are updated as the value function becomes more accurate or as a model of the task is updated. We have applied this approach to periodic tasks such as hopping and walking, which required handling discount factors and discontinuities in the task dynamics, and using function approximation to represent value functions at discontinuities. We also describe extensions of the approach to make the policies more robust to modeling error and sensor noise.
3 0.64603972 123 nips-2002-Learning Attractor Landscapes for Learning Motor Primitives
Author: Auke J. Ijspeert, Jun Nakanishi, Stefan Schaal
Abstract: Many control problems take place in continuous state-action spaces, e.g., as in manipulator robotics, where the control objective is often defined as finding a desired trajectory that reaches a particular goal state. While reinforcement learning offers a theoretical framework to learn such control policies from scratch, its applicability to higher dimensional continuous state-action spaces remains rather limited to date. Instead of learning from scratch, in this paper we suggest to learn a desired complex control policy by transforming an existing simple canonical control policy. For this purpose, we represent canonical policies in terms of differential equations with well-defined attractor properties. By nonlinearly transforming the canonical attractor dynamics using techniques from nonparametric regression, almost arbitrary new nonlinear policies can be generated without losing the stability properties of the canonical system. We demonstrate our techniques in the context of learning a set of movement skills for a humanoid robot from demonstrations of a human teacher. Policies are acquired rapidly, and, due to the properties of well formulated differential equations, can be re-used and modified on-line under dynamic changes of the environment. The linear parameterization of nonparametric regression moreover lends itself to recognize and classify previously learned movement skills. Evaluations in simulations and on an actual 30 degree-offreedom humanoid robot exemplify the feasibility and robustness of our approach. 1
4 0.58214992 185 nips-2002-Speeding up the Parti-Game Algorithm
Author: Maxim Likhachev, Sven Koenig
Abstract: In this paper, we introduce an efficient replanning algorithm for nondeterministic domains, namely what we believe to be the first incremental heuristic minimax search algorithm. We apply it to the dynamic discretization of continuous domains, resulting in an efficient implementation of the parti-game reinforcement-learning algorithm for control in high-dimensional domains.
5 0.48554507 128 nips-2002-Learning a Forward Model of a Reflex
Author: Bernd Porr, Florentin Wörgötter
Abstract: We develop a systems theoretical treatment of a behavioural system that interacts with its environment in a closed loop situation such that its motor actions influence its sensor inputs. The simplest form of a feedback is a reflex. Reflexes occur always “too late”; i.e., only after a (unpleasant, painful, dangerous) reflex-eliciting sensor event has occurred. This defines an objective problem which can be solved if another sensor input exists which can predict the primary reflex and can generate an earlier reaction. In contrast to previous approaches, our linear learning algorithm allows for an analytical proof that this system learns to apply feedforward control with the result that slow feedback loops are replaced by their equivalent feed-forward controller creating a forward model. In other words, learning turns the reactive system into a pro-active system. By means of a robot implementation we demonstrate the applicability of the theoretical results which can be used in a variety of different areas in physics and engineering.
6 0.47394577 9 nips-2002-A Minimal Intervention Principle for Coordinated Movement
7 0.41102049 6 nips-2002-A Formulation for Minimax Probability Machine Regression
8 0.40664893 33 nips-2002-Approximate Linear Programming for Average-Cost Dynamic Programming
9 0.35212359 137 nips-2002-Location Estimation with a Differential Update Network
10 0.32798725 178 nips-2002-Robust Novelty Detection with Single-Class MPM
11 0.31371662 136 nips-2002-Linear Combinations of Optic Flow Vectors for Estimating Self-Motion - a Real-World Test of a Neural Model
12 0.27091244 42 nips-2002-Bias-Optimal Incremental Problem Solving
13 0.25578403 5 nips-2002-A Digital Antennal Lobe for Pattern Equalization: Analysis and Design
14 0.25446692 179 nips-2002-Scaling of Probability-Based Optimization Algorithms
15 0.24801956 151 nips-2002-Multiplicative Updates for Nonnegative Quadratic Programming in Support Vector Machines
16 0.24128734 3 nips-2002-A Convergent Form of Approximate Policy Iteration
17 0.2304889 47 nips-2002-Branching Law for Axons
18 0.20553741 204 nips-2002-VIBES: A Variational Inference Engine for Bayesian Networks
19 0.20342615 129 nips-2002-Learning in Spiking Neural Assemblies
20 0.19211106 134 nips-2002-Learning to Take Concurrent Actions
topicId topicWeight
[(23, 0.025), (42, 0.049), (54, 0.084), (55, 0.044), (71, 0.09), (74, 0.074), (86, 0.355), (92, 0.038), (98, 0.084)]
simIndex simValue paperId paperTitle
same-paper 1 0.79071331 144 nips-2002-Minimax Differential Dynamic Programming: An Application to Robust Biped Walking
Author: Jun Morimoto, Christopher G. Atkeson
Abstract: We developed a robust control policy design method in high-dimensional state space by using differential dynamic programming with a minimax criterion. As an example, we applied our method to a simulated five link biped robot. The results show lower joint torques from the optimal control policy compared to a hand-tuned PD servo controller. Results also show that the simulated biped robot can successfully walk with unknown disturbances that cause controllers generated by standard differential dynamic programming and the hand-tuned PD servo to fail. Learning to compensate for modeling error and previously unknown disturbances in conjunction with robust control design is also demonstrated.
2 0.7672174 185 nips-2002-Speeding up the Parti-Game Algorithm
Author: Maxim Likhachev, Sven Koenig
Abstract: In this paper, we introduce an efficient replanning algorithm for nondeterministic domains, namely what we believe to be the first incremental heuristic minimax search algorithm. We apply it to the dynamic discretization of continuous domains, resulting in an efficient implementation of the parti-game reinforcement-learning algorithm for control in high-dimensional domains.
Author: Sepp Hochreiter, Klaus Obermayer
Abstract: We investigate the problem of learning a classification task for datasets which are described by matrices. Rows and columns of these matrices correspond to objects, where row and column objects may belong to different sets, and the entries in the matrix express the relationships between them. We interpret the matrix elements as being produced by an unknown kernel which operates on object pairs and we show that - under mild assumptions - these kernels correspond to dot products in some (unknown) feature space. Minimizing a bound for the generalization error of a linear classifier which has been obtained using covering numbers we derive an objective function for model selection according to the principle of structural risk minimization. The new objective function has the advantage that it allows the analysis of matrices which are not positive definite, and not even symmetric or square. We then consider the case that row objects are interpreted as features. We suggest an additional constraint, which imposes sparseness on the row objects and show, that the method can then be used for feature selection. Finally, we apply this method to data obtained from DNA microarrays, where “column” objects correspond to samples, “row” objects correspond to genes and matrix elements correspond to expression levels. Benchmarks are conducted using standard one-gene classification and support vector machines and K-nearest neighbors after standard feature selection. Our new method extracts a sparse set of genes and provides superior classification results. 1
4 0.43114585 155 nips-2002-Nonparametric Representation of Policies and Value Functions: A Trajectory-Based Approach
Author: Christopher G. Atkeson, Jun Morimoto
Abstract: A longstanding goal of reinforcement learning is to develop nonparametric representations of policies and value functions that support rapid learning without suffering from interference or the curse of dimensionality. We have developed a trajectory-based approach, in which policies and value functions are represented nonparametrically along trajectories. These trajectories, policies, and value functions are updated as the value function becomes more accurate or as a model of the task is updated. We have applied this approach to periodic tasks such as hopping and walking, which required handling discount factors and discontinuities in the task dynamics, and using function approximation to represent value functions at discontinuities. We also describe extensions of the approach to make the policies more robust to modeling error and sensor noise.
5 0.43096671 30 nips-2002-Annealing and the Rate Distortion Problem
Author: Albert E. Parker, Tomá\v S. Gedeon, Alexander G. Dimitrov
Abstract: In this paper we introduce methodology to determine the bifurcation structure of optima for a class of similar cost functions from Rate Distortion Theory, Deterministic Annealing, Information Distortion and the Information Bottleneck Method. We also introduce a numerical algorithm which uses the explicit form of the bifurcating branches to find optima at a bifurcation point. 1
6 0.40583891 106 nips-2002-Hyperkernels
7 0.39209169 10 nips-2002-A Model for Learning Variance Components of Natural Images
8 0.39040306 2 nips-2002-A Bilinear Model for Sparse Coding
9 0.38759658 68 nips-2002-Discriminative Densities from Maximum Contrast Estimation
10 0.38621014 52 nips-2002-Cluster Kernels for Semi-Supervised Learning
11 0.38532361 3 nips-2002-A Convergent Form of Approximate Policy Iteration
12 0.38361824 27 nips-2002-An Impossibility Theorem for Clustering
13 0.38299754 53 nips-2002-Clustering with the Fisher Score
14 0.38195935 21 nips-2002-Adaptive Classification by Variational Kalman Filtering
15 0.38185972 188 nips-2002-Stability-Based Model Selection
16 0.38168281 89 nips-2002-Feature Selection by Maximum Marginal Diversity
17 0.38126269 132 nips-2002-Learning to Detect Natural Image Boundaries Using Brightness and Texture
18 0.38117558 204 nips-2002-VIBES: A Variational Inference Engine for Bayesian Networks
19 0.38100135 110 nips-2002-Incremental Gaussian Processes
20 0.38098198 203 nips-2002-Using Tarjan's Red Rule for Fast Dependency Tree Construction