nips nips2006 nips2006-25 knowledge-graph by maker-knowledge-mining

25 nips-2006-An Application of Reinforcement Learning to Aerobatic Helicopter Flight


Source: pdf

Author: Pieter Abbeel, Adam Coates, Morgan Quigley, Andrew Y. Ng

Abstract: Autonomous helicopter flight is widely regarded to be a highly challenging control problem. This paper presents the first successful autonomous completion on a real RC helicopter of the following four aerobatic maneuvers: forward flip and sideways roll at low speed, tail-in funnel, and nose-in funnel. Our experimental results significantly extend the state of the art in autonomous helicopter flight. We used the following approach: First we had a pilot fly the helicopter to help us find a helicopter dynamics model and a reward (cost) function. Then we used a reinforcement learning (optimal control) algorithm to find a controller that is optimized for the resulting model and reward function. More specifically, we used differential dynamic programming (DDP), an extension of the linear quadratic regulator (LQR). 1

Reference: text


Summary: the most important sentences generated by the tfidf model

sentIndex sentText sentNum sentScore

1 Stanford University, Stanford, CA 94305. Abstract: Autonomous helicopter flight is widely regarded to be a highly challenging control problem. [sent-3, score-0.738]

2 This paper presents the first successful autonomous completion on a real RC helicopter of the following four aerobatic maneuvers: forward flip and sideways roll at low speed, tail-in funnel, and nose-in funnel. [sent-4, score-1.162]

3 Our experimental results significantly extend the state of the art in autonomous helicopter flight. [sent-5, score-0.952]

4 We used the following approach: First we had a pilot fly the helicopter to help us find a helicopter dynamics model and a reward (cost) function. [sent-6, score-1.565]

5 Then we used a reinforcement learning (optimal control) algorithm to find a controller that is optimized for the resulting model and reward function. [sent-7, score-0.305]

6 1 Introduction: Autonomous helicopter flight represents a challenging control problem with high-dimensional, asymmetric, noisy, nonlinear, non-minimum phase dynamics. [sent-9, score-0.761]

7 The control of autonomous helicopters thus provides a challenging and important testbed for learning and control algorithms. [sent-15, score-0.403]

8 In the “upright flight regime” there has recently been considerable progress in autonomous helicopter flight. [sent-16, score-0.903]

9 For example, Bagnell and Schneider [6] achieved sustained autonomous hover. [sent-17, score-0.262]

10 [17] achieved sustained autonomous hover and accurate flight in regimes where the helicopter’s orientation is fairly close to upright. [sent-20, score-0.372]

11 In contrast, autonomous flight achievements in other flight regimes have been very limited. [sent-24, score-0.257]

12 In particular, we present the first successful autonomous completion of the following four maneuvers: forward flip and axial roll at low speed, tail-in funnel, and nose-in funnel. [sent-30, score-0.391]

13 Not only are we the first to autonomously complete a single flip and roll; our controllers are also able to continuously repeat the flips and rolls without any pauses in between. [sent-31, score-0.194]

14 The number of flips and rolls and the duration of the funnel trajectories were chosen to be sufficiently large to demonstrate that the helicopter could continue the maneuvers indefinitely (assuming unlimited fuel and battery endurance). [sent-33, score-1.142]

15 In the (forward) flip, the helicopter rotates 360 degrees forward around its lateral axis (the axis going from the right to the left of the helicopter). [sent-35, score-0.883]

16 To prevent altitude loss during the maneuver, the helicopter pushes itself back up by using the (inverted) main rotor thrust halfway through the flip. [sent-36, score-0.951]

17 In the (right) axial roll the helicopter rotates 360 degrees around its longitudinal axis (the axis going from the back to the front of the helicopter). [sent-37, score-0.982]

18 Similarly to the flip, the helicopter prevents altitude loss by pushing itself back up by using the (inverted) main rotor thrust halfway through the roll. [sent-38, score-0.951]

19 In the tail-in funnel, the helicopter repeatedly flies a circle sideways with the tail pointing to the center of the circle. [sent-39, score-0.804]

20 For the trajectory to be a funnel maneuver, the helicopter speed and the circle radius are chosen such that the helicopter must pitch up steeply to stay in the circle. [sent-40, score-1.798]

21 The nose-in funnel is similar to the tail-in funnel, the difference being that the nose points to the center of the circle throughout the maneuver. [sent-41, score-0.338]

22 We discuss our apprenticeship learning approach to choosing the reward function, as well as other design decisions and lessons learned. [sent-46, score-0.196]

23 Section 4 describes our helicopter platform and our experimental results. [sent-47, score-0.67]

24 Movies of our autonomous helicopter flights are available at the following webpage: http://www. [sent-49, score-0.903]

25 1 Data Collection: The E3-family of algorithms [12] and its extensions [11, 7, 10] are the state-of-the-art RL algorithms for autonomous data collection. [sent-54, score-0.282]

26 Unfortunately, such exploration policies do not even try to fly the helicopter well, and thus would invariably lead to crashes. [sent-56, score-0.721]

27 Collect data from a human pilot flying the desired maneuvers with the helicopter. [sent-58, score-0.232]

28 This procedure has similarities with model-based RL and with the common approach in control to first perform system identification and then find a controller using the resulting model. [sent-66, score-0.21]

29 2 Model Learning: The helicopter state s comprises its position (x, y, z), orientation (expressed as a unit quaternion), velocity (ẋ, ẏ, ż) and angular velocity (ωx, ωy, ωz). [sent-72, score-0.89]

30 The helicopter is controlled by a 4-dimensional action space (u1, u2, u3, u4). [sent-73, score-0.67]
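
To make the state and action conventions above concrete, here is a minimal sketch (the names are our own illustrative choices, not code from the paper) of one way to represent the state and the 4-dimensional control:

import numpy as np
from dataclasses import dataclass

@dataclass
class HelicopterState:
    pos: np.ndarray    # (3,) position x, y, z in a world frame (m)
    quat: np.ndarray   # (4,) orientation as a unit quaternion
    vel: np.ndarray    # (3,) velocity x_dot, y_dot, z_dot (m/s)
    omega: np.ndarray  # (3,) angular velocity wx, wy, wz (rad/s)

# 4-dimensional control: cyclic pitch (u1, u2), tail rotor (u3), collective (u4).
u = np.zeros(4)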

31 By using the cyclic pitch (u1 , u2 ) and tail rotor (u3 ) controls, the pilot can rotate the helicopter around each of its main axes and bring the helicopter to any orientation. [sent-74, score-1.706]

32 This allows the pilot to direct the thrust of the main rotor in any particular direction (and thus fly in any particular direction). [sent-75, score-0.344]

33 By adjusting the collective pitch angle (control input u4 ), the pilot can adjust the thrust generated by the main rotor. [sent-76, score-0.382]

34 For a positive collective pitch angle the main rotor will blow air downward relative to the helicopter. [sent-77, score-0.4]

35 For a negative collective pitch angle the main rotor will blow air upward relative to the helicopter. [sent-78, score-0.375]

36 Accelerations are then integrated to obtain the helicopter states over time. [sent-81, score-0.67]

37 The key idea from [1] is that, after subtracting out the effects of gravity, the forces and moments acting on the helicopter are independent of position and orientation of the helicopter, when expressed in a “body coordinate frame”, a coordinate frame attached to the body of the helicopter. [sent-82, score-0.769]

38 We estimate the coefficients A· , B· , C· , D· and E· from helicopter flight data. [sent-88, score-0.67]
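
A minimal sketch of how coefficients of such a body-frame acceleration model could be estimated by linear least squares from logged flight data (the regressor layout described in the comments is illustrative; the paper's exact parameterization is not reproduced):

import numpy as np

def fit_accel_coeffs(regressors, accel):
    # regressors: (T, k) matrix built from body-frame velocities, angular
    #             rates, control inputs, and a constant bias column.
    # accel:      (T,) measured body-frame acceleration component with the
    #             effect of gravity subtracted out.
    # Returns the least-squares coefficient vector for that component.
    coeffs, _, _, _ = np.linalg.lstsq(regressors, accel, rcond=None)
    return coeffs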

39 The coefficient D0 captures sideways acceleration of the helicopter due to thrust generated by the tail rotor. [sent-92, score-0.882]

40 The term E0 ‖(ẋb, ẏb, żb)‖2 models translational lift: the additional lift the helicopter gets when flying at higher speed. [sent-93, score-0.719]

41 Specifically, during hover, the helicopter’s rotor imparts a downward velocity on the air above and below it. [sent-94, score-0.264]

42 This downward velocity reduces the effective pitch (angle of attack) of the rotor blades, causing less lift to be produced [14, 20]. [sent-95, score-0.357]

43 As the helicopter transitions into faster flight, this region of altered airflow is left behind and the blades enter “clean” air. [sent-96, score-0.726]

44 Thus, the angle of attack is higher and more lift is produced for a given choice of the collective control (u4 ). [sent-97, score-0.212]

45 The translational lift term was important for modeling the helicopter dynamics during the funnels. [sent-98, score-0.75]

46 The coefficient C24 captures the pitch acceleration due to main rotor thrust. [sent-99, score-0.285]

47 This coefficient is nonzero since (after equipping our helicopter with our sensor packages) the center of gravity is further backward than the center of main rotor thrust. [sent-100, score-0.872]

48 (2) Our model’s state does not include the blade-flapping angles, which are the angles the rotor blades make with the helicopter body while sweeping through the air. [sent-104, score-0.945]

49 Both inertial coupling and blade flapping have previously been shown to improve the accuracy of helicopter models for other RC helicopters. [sent-105, score-0.722]

50 At the 0.1 s time scale used for control, the blade flapping angles' effects are sufficiently well captured by using a first order model from cyclic inputs to roll and pitch rates. [sent-109, score-0.246]

51 Such a first order model maps cyclic inputs to angular accelerations (rather than the steady state angular rate), effectively capturing the delay introduced by the blades reacting (moving) first before the helicopter body follows. [sent-110, score-0.97]
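
A minimal sketch of such a first-order lag from a cyclic input to a roll or pitch rate (the gain k and time constant tau are illustrative placeholders, not values from the paper):

def first_order_rate_step(omega, u_cyclic, k, tau, dt):
    # The rate relaxes toward the steady-state value k * u_cyclic with time
    # constant tau, so the cyclic input drives an angular acceleration and the
    # blade-reaction delay is captured.
    omega_dot = (k * u_cyclic - omega) / tau
    return omega + dt * omega_dot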

52 It is well-known that the optimal policy for the LQR control problem is a linear feedback controller which can be efficiently computed using dynamic programming. [sent-125, score-0.284]
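
For reference, the standard finite-horizon LQR backward recursion that produces such a linear feedback controller (a textbook sketch under the usual quadratic-cost assumptions, not the authors' implementation):

import numpy as np

def lqr_gains(A, B, Q, R):
    # A, B: lists of (time-varying) linearized dynamics matrices.
    # Q, R: state and control cost matrices (kept constant here for brevity).
    # Returns gains K_t such that the optimal control is u_t = -K_t @ e_t.
    P = Q                        # terminal cost-to-go
    gains = []
    for A_t, B_t in zip(reversed(A), reversed(B)):
        K = np.linalg.solve(R + B_t.T @ P @ B_t, B_t.T @ P @ A_t)
        P = Q + A_t.T @ P @ (A_t - B_t @ K)
        gains.append(K)
    return list(reversed(gains))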

53 The standard extension (which we use) expresses the dynamics and reward function as a function of the error state e(t) = s(t) − s∗(t) rather than the actual state s(t). [sent-130, score-0.215]

54 Compute a linear approximation to the dynamics and a quadratic approximation to the reward function around the trajectory obtained when using the current policy. [sent-135, score-0.187]

55 Compute the optimal policy for the LQR problem obtained in Step 1 and set the current policy equal to the optimal policy for the LQR problem. [sent-137, score-0.222]
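
A skeleton of the resulting iterate-simulate-linearize-solve loop (the callable arguments are placeholders one would supply; this is a sketch of the two steps above, not the authors' code):

def iterative_lqr(policy, simulate, linearize_dynamics, quadratize_reward,
                  solve_lqr, n_iters=50):
    for _ in range(n_iters):
        traj = simulate(policy)              # roll out the current policy
        lin = linearize_dynamics(traj)       # Step 1: local linear model
        quad = quadratize_reward(traj)       # Step 1: local quadratic reward
        policy = solve_lqr(lin, quad)        # Step 2: optimal LQR policy
    return policy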

56 This axis angle representation results in the linearizations being more accurate approximations of the non-linear model since the axis angle representation maps more directly to the angular rates than naively differencing the quaternions or Euler angles. [sent-144, score-0.243]
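
A minimal sketch of the quaternion-to-axis-angle conversion this implies, using SciPy's rotation utilities (our choice of library, not the paper's):

import numpy as np
from scipy.spatial.transform import Rotation as R

def orientation_error_rotvec(q_actual, q_target):
    # Quaternions in (x, y, z, w) order as expected by SciPy.
    # The returned 3-vector is the rotation axis scaled by the angle of the
    # rotation taking the target orientation to the actual one; it maps more
    # directly onto angular rates than differencing quaternions or Euler angles.
    err = R.from_quat(q_actual) * R.from_quat(q_target).inv()
    return err.as_rotvec()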

57 Using DDP as explained thus far resulted in unstable controllers on the real helicopter: the controllers tended to switch rapidly between low and high values, which led to poor flight performance. [sent-146, score-0.232]

58 Our funnel controllers performed significantly better with integral control. [sent-173, score-0.415]

59 3 Trade-offs in the reward function: Our reward function contained 24 features, consisting of the squared error state variables, the squared inputs, the squared change in inputs between consecutive timesteps, and the squared integral of the error state variables. [sent-180, score-0.343]
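
A minimal sketch of a quadratic cost assembled from these four feature groups (the weight vectors w_* stand for the trade-offs that must be specified; the exact 24-feature layout is not reproduced here, and all arguments are 1-D numpy arrays):

def quadratic_cost(err, u, u_prev, err_int, w_err, w_u, w_du, w_int):
    # err: error state, u: current inputs, u_prev: previous inputs,
    # err_int: running integral of the error state variables.
    return (w_err @ err**2
            + w_u @ u**2
            + w_du @ (u - u_prev)**2
            + w_int @ err_int**2)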

60 For the reinforcement learning algorithm to find a controller that flies “well,” it is critical that the correct trade-off between these features is specified. [sent-181, score-0.219]

61 1 Experimental Platform: The helicopter used is an XCell Tempest, a competition-class aerobatic helicopter (length 54”, height 19”, weight 13 lbs), powered by a 0. [sent-190, score-1.419]

62 We instrumented the helicopter with a Microstrain 3DM-GX1 orientation sensor, and a Novatel RT2 GPS receiver. [sent-193, score-0.711]

63 Later, we used three PointGrey DragonFly2 cameras that track the helicopter from the ground. [sent-199, score-0.67]

64 The model used to design the flip and roll controllers is estimated from 5 minutes of flight data during which the pilot performs frequency sweeps on each of the four control inputs (which covers as similar a flight regime as possible without having to invert the helicopter). [sent-207, score-0.483]

65 For the funnel controllers, we learn a model from the same frequency sweeps and from our pilot flying the funnels. [sent-208, score-0.403]

66 For the funnels, our initial controllers did not perform as well, and we performed two iterations of the apprenticeship learning algorithm described in Section 2. [sent-210, score-0.194]

67 1 Flip: In the ideal forward flip, the helicopter rotates 360 degrees forward around its lateral axis (the axis going from the right to the left of the helicopter) while staying in place. [sent-218, score-0.924]

68 The top row of Figure 1 (a) shows a series of snapshots of our helicopter during an autonomous flip. [sent-219, score-0.993]

69 In the first frame, the helicopter is hovering upright autonomously. [sent-220, score-0.763]

70 At this point, the helicopter does not have the ability to counter its descent since it can only produce thrust in the direction of the main rotor. [sent-222, score-0.76]

71 The flip continues until the helicopter is completely inverted. [sent-223, score-0.67]

72 At this moment, the controller must apply negative collective to regain altitude lost during the half-flip, while continuing the flip and returning to the upright position. [sent-224, score-0.323]

73 2 Roll: In the ideal axial roll, the helicopter rotates 360 degrees around its longitudinal axis (the axis going from the back to the front of the helicopter) while staying in place. [sent-230, score-0.892]

74 The bottom row of Figure 1 (b) shows a series of snapshots of our helicopter during an autonomous roll. [sent-231, score-0.993]

75 In the first frame, the helicopter is hovering upright autonomously. [sent-232, score-0.763]

76 When inverted, the helicopter applies negative collective to regain altitude lost during the first half of the roll, while continuing the roll and returning to the upright position. [sent-234, score-0.941]

77 3 Tail-In Funnel: The tail-in funnel maneuver is essentially a medium to high speed circle flown sideways, with the tail of the helicopter pointed towards the center of the circle. [sent-238, score-1.071]

78 Throughout, the helicopter is pitched backwards such that the main rotor thrust not only compensates for gravity, but also provides the centripetal acceleration to stay in the circle. [sent-239, score-0.985]

79 For a funnel of radius r at velocity v the centripetal acceleration is v²/r, so—assuming the main rotor thrust only provides the centripetal acceleration and compensation for gravity—we obtain a pitch angle θ = atan(v²/(rg)). [sent-240, score-0.842]

80 For the funnel reported in this paper, we had H = 80 s, r = 5 m, and v = 5. [sent-242, score-0.27]
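
As a worked check of the pitch-angle relation above (the numbers below are illustrative placeholders, not necessarily the exact values flown):

import math

g = 9.81               # gravitational acceleration (m/s^2)
r, v = 5.0, 5.0        # circle radius (m) and speed (m/s), illustrative
theta = math.atan(v**2 / (r * g))
print(round(math.degrees(theta), 1))   # about 27 degrees of pitch-back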

81 Figure 1 (c) shows an overlay of snapshots of the helicopter throughout a tail-in funnel. [sent-244, score-0.838]

82 The defining characteristic of the funnel is repeatability—the ability to pass consistently through the same points in space after multiple circuits. [sent-245, score-0.27]

83 Our autonomous funnels are significantly more accurate than funnels flown by expert human pilots. [sent-246, score-0.345]

84 In figure 2 (b) we superimposed the heading of the helicopter on a partial trajectory (showing the entire trajectory with heading superimposed gives a cluttered plot). [sent-248, score-0.946]

85 Our autonomous funnels have an RMS position error of 1. [sent-249, score-0.272]

86 4 Nose-In Funnel: The nose-in funnel maneuver is very similar to the tail-in funnel maneuver, except that the nose points to the center of the circle, rather than the tail. [sent-254, score-0.619]

87 Our autonomous nose-in funnel controller results in highly repeatable trajectories (similar to the tail-in funnel), and it achieves a level of performance that is difficult for a human pilot to match. [sent-255, score-0.753]

88 5 Conclusion: To summarize, we presented our successful DDP-based control design for four new aerobatic maneuvers: forward flip, sideways roll (at low speed), tail-in funnel, and nose-in funnel. [sent-257, score-0.359]

89 The key design decisions for the DDP-based controller to fly our helicopter successfully are the following: [Footnote 4] The maneuver is actually broken into three parts: an accelerating leg, the funnel leg, and a decelerating leg. [sent-258, score-1.216]

90 During the accelerating and decelerating legs, the helicopter accelerates at amax (= 0. [sent-259, score-0.693]

91 [Footnote 5] Without the integral of heading error in the cost function we observed significantly larger heading errors of 20-40 degrees, which resulted in the linearization being so inaccurate that controllers often failed entirely. [sent-261, score-0.32]

92 (c) Overlay of snapshots of the helicopter throughout a tail-in funnel. [sent-265, score-0.799]

93 (d) Overlay of snapshots of the helicopter throughout a nose-in funnel. [sent-266, score-0.799]

94 Figure 2: (a) Trajectory followed by the helicopter during tail-in funnel (axes: North (m) vs. East (m)). [sent-268, score-0.67]

95 We used apprenticeship learning algorithms, which take advantage of an expert demonstration, to determine the reward function and to learn the model. [sent-273, score-0.198]

96 To the best of our knowledge, these are the most challenging autonomous flight maneuvers achieved to date. [sent-276, score-0.357]

97 Acknowledgments: We thank Ben Tse for piloting our helicopter and for working on its electronics. [sent-277, score-0.67]

98 Autonomous helicopter control using reinforcement learning policy search methods. [sent-312, score-0.889]

99 Flight test and simulation results for an autonomous aerobatic helicopter. [sent-324, score-0.312]

100 Low-cost flight control system for a small autonomous helicopter. [sent-388, score-0.301]


similar papers computed by tfidf model

tfidf for this paper:

wordName wordTfidf (topN-words)

[('helicopter', 0.67), ('funnel', 0.27), ('ight', 0.259), ('autonomous', 0.233), ('ddp', 0.146), ('rotor', 0.146), ('controller', 0.142), ('maneuvers', 0.124), ('controllers', 0.116), ('pilot', 0.108), ('ip', 0.093), ('snapshots', 0.09), ('thrust', 0.09), ('roll', 0.09), ('pitch', 0.089), ('reward', 0.086), ('aerobatic', 0.079), ('maneuver', 0.079), ('apprenticeship', 0.078), ('ips', 0.078), ('rolls', 0.078), ('reinforcement', 0.077), ('policy', 0.074), ('trajectory', 0.07), ('lqr', 0.069), ('control', 0.068), ('heading', 0.068), ('accelerations', 0.059), ('upright', 0.059), ('inverted', 0.058), ('blades', 0.056), ('gravity', 0.056), ('collective', 0.054), ('xb', 0.054), ('axis', 0.052), ('acceleration', 0.05), ('lift', 0.049), ('sideways', 0.049), ('state', 0.049), ('velocity', 0.048), ('air', 0.045), ('altitude', 0.045), ('hover', 0.045), ('rotates', 0.045), ('ying', 0.045), ('inputs', 0.044), ('gps', 0.042), ('orientation', 0.041), ('forward', 0.041), ('angle', 0.041), ('funnels', 0.039), ('linearization', 0.039), ('overlay', 0.039), ('throughout', 0.039), ('abbeel', 0.037), ('rl', 0.035), ('frame', 0.034), ('angular', 0.034), ('expert', 0.034), ('apping', 0.034), ('flight', 0.034), ('gavrilets', 0.034), ('helicopters', 0.034), ('hovering', 0.034), ('linearized', 0.034), ('mettler', 0.034), ('novatel', 0.034), ('pointing', 0.033), ('design', 0.032), ('dynamics', 0.031), ('centripetal', 0.029), ('inertial', 0.029), ('sustained', 0.029), ('circle', 0.029), ('integral', 0.029), ('exploration', 0.028), ('axial', 0.027), ('robotics', 0.026), ('sweeps', 0.025), ('downward', 0.025), ('kalman', 0.025), ('regimes', 0.024), ('differential', 0.024), ('body', 0.024), ('degrees', 0.023), ('tail', 0.023), ('phase', 0.023), ('policies', 0.023), ('blade', 0.023), ('coates', 0.023), ('decelerating', 0.023), ('ies', 0.023), ('imu', 0.023), ('linearizations', 0.023), ('longitudinal', 0.023), ('maneuvering', 0.023), ('microstrain', 0.023), ('psu', 0.023), ('regain', 0.023), ('saripalli', 0.023)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.99999982 25 nips-2006-An Application of Reinforcement Learning to Aerobatic Helicopter Flight

Author: Pieter Abbeel, Adam Coates, Morgan Quigley, Andrew Y. Ng

Abstract: Autonomous helicopter flight is widely regarded to be a highly challenging control problem. This paper presents the first successful autonomous completion on a real RC helicopter of the following four aerobatic maneuvers: forward flip and sideways roll at low speed, tail-in funnel, and nose-in funnel. Our experimental results significantly extend the state of the art in autonomous helicopter flight. We used the following approach: First we had a pilot fly the helicopter to help us find a helicopter dynamics model and a reward (cost) function. Then we used a reinforcement learning (optimal control) algorithm to find a controller that is optimized for the resulting model and reward function. More specifically, we used differential dynamic programming (DDP), an extension of the linear quadratic regulator (LQR). 1

2 0.11553057 38 nips-2006-Automated Hierarchy Discovery for Planning in Partially Observable Environments

Author: Laurent Charlin, Pascal Poupart, Romy Shioda

Abstract: Planning in partially observable domains is a notoriously difficult problem. However, in many real-world scenarios, planning can be simplified by decomposing the task into a hierarchy of smaller planning problems. Several approaches have been proposed to optimize a policy that decomposes according to a hierarchy specified a priori. In this paper, we investigate the problem of automatically discovering the hierarchy. More precisely, we frame the optimization of a hierarchical policy as a non-convex optimization problem that can be solved with general non-linear solvers, a mixed-integer non-linear approximation or a form of bounded hierarchical policy iteration. By encoding the hierarchical structure as variables of the optimization problem, we can automatically discover a hierarchy. Our method is flexible enough to allow any parts of the hierarchy to be specified based on prior knowledge while letting the optimization discover the unknown parts. It can also discover hierarchical policies, including recursive policies, that are more compact (potentially infinitely fewer parameters) and often easier to understand given the decomposition induced by the hierarchy. 1

3 0.079295583 143 nips-2006-Natural Actor-Critic for Road Traffic Optimisation

Author: Silvia Richter, Douglas Aberdeen, Jin Yu

Abstract: Current road-traffic optimisation practice around the world is a combination of hand tuned policies with a small degree of automatic adaption. Even state-ofthe-art research controllers need good models of the road traffic, which cannot be obtained directly from existing sensors. We use a policy-gradient reinforcement learning approach to directly optimise the traffic signals, mapping currently deployed sensor observations to control signals. Our trained controllers are (theoretically) compatible with the traffic system used in Sydney and many other cities around the world. We apply two policy-gradient methods: (1) the recent natural actor-critic algorithm, and (2) a vanilla policy-gradient algorithm for comparison. Along the way we extend natural-actor critic approaches to work for distributed and online infinite-horizon problems. 1

4 0.075307116 125 nips-2006-Logarithmic Online Regret Bounds for Undiscounted Reinforcement Learning

Author: Peter Auer, Ronald Ortner

Abstract: We present a learning algorithm for undiscounted reinforcement learning. Our interest lies in bounds for the algorithm’s online performance after some finite number of steps. In the spirit of similar methods already successfully applied for the exploration-exploitation tradeoff in multi-armed bandit problems, we use upper confidence bounds to show that our UCRL algorithm achieves logarithmic online regret in the number of steps taken with respect to an optimal policy. 1 1.1

5 0.066113561 171 nips-2006-Sample Complexity of Policy Search with Known Dynamics

Author: Peter L. Bartlett, Ambuj Tewari

Abstract: We consider methods that try to find a good policy for a Markov decision process by choosing one from a given class. The policy is chosen based on its empirical performance in simulations. We are interested in conditions on the complexity of the policy class that ensure the success of such simulation based policy search methods. We show that under bounds on the amount of computation involved in computing policies, transition dynamics and rewards, uniform convergence of empirical estimates to true value functions occurs. Previously, such results were derived by assuming boundedness of pseudodimension and Lipschitz continuity. These assumptions and ours are both stronger than the usual combinatorial complexity measures. We show, via minimax inequalities, that this is essential: boundedness of pseudodimension or fat-shattering dimension alone is not sufficient.

6 0.059707239 191 nips-2006-The Robustness-Performance Tradeoff in Markov Decision Processes

7 0.05674291 44 nips-2006-Bayesian Policy Gradient Algorithms

8 0.056619752 154 nips-2006-Optimal Change-Detection and Spiking Neurons

9 0.055859424 112 nips-2006-Learning Nonparametric Models for Probabilistic Imitation

10 0.049546335 129 nips-2006-Map-Reduce for Machine Learning on Multicore

11 0.044948652 71 nips-2006-Effects of Stress and Genotype on Meta-parameter Dynamics in Reinforcement Learning

12 0.044587534 124 nips-2006-Linearly-solvable Markov decision problems

13 0.041003253 148 nips-2006-Nonlinear physically-based models for decoding motor-cortical population activity

14 0.037663665 200 nips-2006-Unsupervised Regression with Applications to Nonlinear System Identification

15 0.030038144 134 nips-2006-Modeling Human Motion Using Binary Latent Variables

16 0.029603759 137 nips-2006-Multi-Robot Negotiation: Approximating the Set of Subgame Perfect Equilibria in General-Sum Stochastic Games

17 0.028565655 47 nips-2006-Boosting Structured Prediction for Imitation Learning

18 0.024642445 89 nips-2006-Handling Advertisements of Unknown Quality in Search Advertising

19 0.024584051 176 nips-2006-Single Channel Speech Separation Using Factorial Dynamics

20 0.022730216 184 nips-2006-Stratification Learning: Detecting Mixed Density and Dimensionality in High Dimensional Point Clouds


similar papers computed by lsi model

lsi for this paper:

topicId topicWeight

[(0, -0.086), (1, -0.032), (2, -0.08), (3, -0.082), (4, 0.011), (5, -0.075), (6, 0.027), (7, -0.04), (8, 0.165), (9, 0.021), (10, -0.015), (11, -0.049), (12, -0.038), (13, -0.016), (14, -0.003), (15, -0.005), (16, -0.013), (17, -0.031), (18, 0.022), (19, 0.01), (20, 0.012), (21, -0.012), (22, 0.019), (23, 0.014), (24, -0.024), (25, 0.024), (26, -0.029), (27, -0.002), (28, -0.009), (29, -0.034), (30, 0.018), (31, -0.034), (32, 0.062), (33, -0.025), (34, -0.04), (35, -0.092), (36, -0.004), (37, -0.013), (38, 0.069), (39, 0.003), (40, 0.125), (41, 0.016), (42, 0.036), (43, 0.049), (44, -0.131), (45, -0.063), (46, 0.086), (47, -0.007), (48, -0.039), (49, 0.019)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.93475199 25 nips-2006-An Application of Reinforcement Learning to Aerobatic Helicopter Flight

Author: Pieter Abbeel, Adam Coates, Morgan Quigley, Andrew Y. Ng

Abstract: Autonomous helicopter flight is widely regarded to be a highly challenging control problem. This paper presents the first successful autonomous completion on a real RC helicopter of the following four aerobatic maneuvers: forward flip and sideways roll at low speed, tail-in funnel, and nose-in funnel. Our experimental results significantly extend the state of the art in autonomous helicopter flight. We used the following approach: First we had a pilot fly the helicopter to help us find a helicopter dynamics model and a reward (cost) function. Then we used a reinforcement learning (optimal control) algorithm to find a controller that is optimized for the resulting model and reward function. More specifically, we used differential dynamic programming (DDP), an extension of the linear quadratic regulator (LQR). 1

2 0.6602751 143 nips-2006-Natural Actor-Critic for Road Traffic Optimisation

Author: Silvia Richter, Douglas Aberdeen, Jin Yu

Abstract: Current road-traffic optimisation practice around the world is a combination of hand tuned policies with a small degree of automatic adaption. Even state-ofthe-art research controllers need good models of the road traffic, which cannot be obtained directly from existing sensors. We use a policy-gradient reinforcement learning approach to directly optimise the traffic signals, mapping currently deployed sensor observations to control signals. Our trained controllers are (theoretically) compatible with the traffic system used in Sydney and many other cities around the world. We apply two policy-gradient methods: (1) the recent natural actor-critic algorithm, and (2) a vanilla policy-gradient algorithm for comparison. Along the way we extend natural-actor critic approaches to work for distributed and online infinite-horizon problems. 1

3 0.59767157 38 nips-2006-Automated Hierarchy Discovery for Planning in Partially Observable Environments

Author: Laurent Charlin, Pascal Poupart, Romy Shioda

Abstract: Planning in partially observable domains is a notoriously difficult problem. However, in many real-world scenarios, planning can be simplified by decomposing the task into a hierarchy of smaller planning problems. Several approaches have been proposed to optimize a policy that decomposes according to a hierarchy specified a priori. In this paper, we investigate the problem of automatically discovering the hierarchy. More precisely, we frame the optimization of a hierarchical policy as a non-convex optimization problem that can be solved with general non-linear solvers, a mixed-integer non-linear approximation or a form of bounded hierarchical policy iteration. By encoding the hierarchical structure as variables of the optimization problem, we can automatically discover a hierarchy. Our method is flexible enough to allow any parts of the hierarchy to be specified based on prior knowledge while letting the optimization discover the unknown parts. It can also discover hierarchical policies, including recursive policies, that are more compact (potentially infinitely fewer parameters) and often easier to understand given the decomposition induced by the hierarchy. 1

4 0.49326551 191 nips-2006-The Robustness-Performance Tradeoff in Markov Decision Processes

Author: Huan Xu, Shie Mannor

Abstract: Computation of a satisfactory control policy for a Markov decision process when the parameters of the model are not exactly known is a problem encountered in many practical applications. The traditional robust approach is based on a worstcase analysis and may lead to an overly conservative policy. In this paper we consider the tradeoff between nominal performance and the worst case performance over all possible models. Based on parametric linear programming, we propose a method that computes the whole set of Pareto efficient policies in the performancerobustness plane when only the reward parameters are subject to uncertainty. In the more general case when the transition probabilities are also subject to error, we show that the strategy with the “optimal” tradeoff might be non-Markovian and hence is in general not tractable. 1

5 0.48209533 71 nips-2006-Effects of Stress and Genotype on Meta-parameter Dynamics in Reinforcement Learning

Author: Gediminas Lukšys, Jérémie Knüsel, Denis Sheynikhovich, Carmen Sandi, Wulfram Gerstner

Abstract: Stress and genetic background regulate different aspects of behavioral learning through the action of stress hormones and neuromodulators. In reinforcement learning (RL) models, meta-parameters such as learning rate, future reward discount factor, and exploitation-exploration factor, control learning dynamics and performance. They are hypothesized to be related to neuromodulatory levels in the brain. We found that many aspects of animal learning and performance can be described by simple RL models using dynamic control of the meta-parameters. To study the effects of stress and genotype, we carried out 5-hole-box light conditioning and Morris water maze experiments with C57BL/6 and DBA/2 mouse strains. The animals were exposed to different kinds of stress to evaluate its effects on immediate performance as well as on long-term memory. Then, we used RL models to simulate their behavior. For each experimental session, we estimated a set of model meta-parameters that produced the best fit between the model and the animal performance. The dynamics of several estimated meta-parameters were qualitatively similar for the two simulated experiments, and with statistically significant differences between different genetic strains and stress conditions. 1

6 0.45707417 124 nips-2006-Linearly-solvable Markov decision problems

7 0.44042337 112 nips-2006-Learning Nonparametric Models for Probabilistic Imitation

8 0.43547139 125 nips-2006-Logarithmic Online Regret Bounds for Undiscounted Reinforcement Learning

9 0.42046526 171 nips-2006-Sample Complexity of Policy Search with Known Dynamics

10 0.38394329 44 nips-2006-Bayesian Policy Gradient Algorithms

11 0.34400868 202 nips-2006-iLSTD: Eligibility Traces and Convergence Analysis

12 0.33085772 107 nips-2006-Large Margin Multi-channel Analog-to-Digital Conversion with Applications to Neural Prosthesis

13 0.32276368 47 nips-2006-Boosting Structured Prediction for Imitation Learning

14 0.31950364 89 nips-2006-Handling Advertisements of Unknown Quality in Search Advertising

15 0.31712389 200 nips-2006-Unsupervised Regression with Applications to Nonlinear System Identification

16 0.31581095 148 nips-2006-Nonlinear physically-based models for decoding motor-cortical population activity

17 0.31468099 153 nips-2006-Online Clustering of Moving Hyperplanes

18 0.29471233 176 nips-2006-Single Channel Speech Separation Using Factorial Dynamics

19 0.28729492 170 nips-2006-Robotic Grasping of Novel Objects

20 0.27732047 120 nips-2006-Learning to Traverse Image Manifolds


similar papers computed by lda model

lda for this paper:

topicId topicWeight

[(1, 0.048), (2, 0.019), (3, 0.052), (7, 0.048), (9, 0.023), (12, 0.014), (22, 0.033), (24, 0.379), (25, 0.014), (44, 0.048), (47, 0.012), (57, 0.055), (59, 0.026), (65, 0.033), (69, 0.034), (71, 0.031), (82, 0.013), (91, 0.01)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.80688107 25 nips-2006-An Application of Reinforcement Learning to Aerobatic Helicopter Flight

Author: Pieter Abbeel, Adam Coates, Morgan Quigley, Andrew Y. Ng

Abstract: Autonomous helicopter flight is widely regarded to be a highly challenging control problem. This paper presents the first successful autonomous completion on a real RC helicopter of the following four aerobatic maneuvers: forward flip and sideways roll at low speed, tail-in funnel, and nose-in funnel. Our experimental results significantly extend the state of the art in autonomous helicopter flight. We used the following approach: First we had a pilot fly the helicopter to help us find a helicopter dynamics model and a reward (cost) function. Then we used a reinforcement learning (optimal control) algorithm to find a controller that is optimized for the resulting model and reward function. More specifically, we used differential dynamic programming (DDP), an extension of the linear quadratic regulator (LQR). 1

2 0.63447422 107 nips-2006-Large Margin Multi-channel Analog-to-Digital Conversion with Applications to Neural Prosthesis

Author: Amit Gore, Shantanu Chakrabartty

Abstract: A key challenge in designing analog-to-digital converters for cortically implanted prosthesis is to sense and process high-dimensional neural signals recorded by the micro-electrode arrays. In this paper, we describe a novel architecture for analog-to-digital (A/D) conversion that combines Σ∆ conversion with spatial de-correlation within a single module. The architecture called multiple-input multiple-output (MIMO) Σ∆ is based on a min-max gradient descent optimization of a regularized linear cost function that naturally lends to an A/D formulation. Using an online formulation, the architecture can adapt to slow variations in cross-channel correlations, observed due to relative motion of the microelectrodes with respect to the signal sources. Experimental results with real recorded multi-channel neural data demonstrate the effectiveness of the proposed algorithm in alleviating cross-channel redundancy across electrodes and performing data-compression directly at the A/D converter. 1

3 0.46442229 179 nips-2006-Sparse Representation for Signal Classification

Author: Ke Huang, Selin Aviyente

Abstract: In this paper, application of sparse representation (factorization) of signals over an overcomplete basis (dictionary) for signal classification is discussed. Searching for the sparse representation of a signal over an overcomplete dictionary is achieved by optimizing an objective function that includes two terms: one that measures the signal reconstruction error and another that measures the sparsity. This objective function works well in applications where signals need to be reconstructed, like coding and denoising. On the other hand, discriminative methods, such as linear discriminative analysis (LDA), are better suited for classification tasks. However, discriminative methods are usually sensitive to corruption in signals due to lacking crucial properties for signal reconstruction. In this paper, we present a theoretical framework for signal classification with sparse representation. The approach combines the discrimination power of the discriminative methods with the reconstruction property and the sparsity of the sparse representation that enables one to deal with signal corruptions: noise, missing data and outliers. The proposed approach is therefore capable of robust classification with a sparse representation of signals. The theoretical results are demonstrated with signal classification tasks, showing that the proposed approach outperforms the standard discriminative methods and the standard sparse representation in the case of corrupted signals. 1

4 0.30226791 38 nips-2006-Automated Hierarchy Discovery for Planning in Partially Observable Environments

Author: Laurent Charlin, Pascal Poupart, Romy Shioda

Abstract: Planning in partially observable domains is a notoriously difficult problem. However, in many real-world scenarios, planning can be simplified by decomposing the task into a hierarchy of smaller planning problems. Several approaches have been proposed to optimize a policy that decomposes according to a hierarchy specified a priori. In this paper, we investigate the problem of automatically discovering the hierarchy. More precisely, we frame the optimization of a hierarchical policy as a non-convex optimization problem that can be solved with general non-linear solvers, a mixed-integer non-linear approximation or a form of bounded hierarchical policy iteration. By encoding the hierarchical structure as variables of the optimization problem, we can automatically discover a hierarchy. Our method is flexible enough to allow any parts of the hierarchy to be specified based on prior knowledge while letting the optimization discover the unknown parts. It can also discover hierarchical policies, including recursive policies, that are more compact (potentially infinitely fewer parameters) and often easier to understand given the decomposition induced by the hierarchy. 1

5 0.29741779 17 nips-2006-A recipe for optimizing a time-histogram

Author: Hideaki Shimazaki, Shigeru Shinomoto

Abstract: The time-histogram method is a handy tool for capturing the instantaneous rate of spike occurrence. In most of the neurophysiological literature, the bin size that critically determines the goodness of the fit of the time-histogram to the underlying rate has been selected by individual researchers in an unsystematic manner. We propose an objective method for selecting the bin size of a time-histogram from the spike data, so that the time-histogram best approximates the unknown underlying rate. The resolution of the histogram increases, or the optimal bin size decreases, with the number of spike sequences sampled. It is notable that the optimal bin size diverges if only a small number of experimental trials are available from a moderately fluctuating rate process. In this case, any attempt to characterize the underlying spike rate will lead to spurious results. Given a paucity of data, our method can also suggest how many more trials are needed until the set of data can be analyzed with the required resolution. 1

6 0.29649711 154 nips-2006-Optimal Change-Detection and Spiking Neurons

7 0.29597941 67 nips-2006-Differential Entropic Clustering of Multivariate Gaussians

8 0.29580259 112 nips-2006-Learning Nonparametric Models for Probabilistic Imitation

9 0.29530832 34 nips-2006-Approximate Correspondences in High Dimensions

10 0.29441947 87 nips-2006-Graph Laplacian Regularization for Large-Scale Semidefinite Programming

11 0.29437613 187 nips-2006-Temporal Coding using the Response Properties of Spiking Neurons

12 0.29221839 162 nips-2006-Predicting spike times from subthreshold dynamics of a neuron

13 0.29192549 184 nips-2006-Stratification Learning: Detecting Mixed Density and Dimensionality in High Dimensional Point Clouds

14 0.2914196 8 nips-2006-A Nonparametric Approach to Bottom-Up Visual Saliency

15 0.29137456 118 nips-2006-Learning to Model Spatial Dependency: Semi-Supervised Discriminative Random Fields

16 0.29096437 175 nips-2006-Simplifying Mixture Models through Function Approximation

17 0.29093501 167 nips-2006-Recursive ICA

18 0.2906912 171 nips-2006-Sample Complexity of Policy Search with Known Dynamics

19 0.28972208 3 nips-2006-A Complexity-Distortion Approach to Joint Pattern Alignment

20 0.28963697 119 nips-2006-Learning to Rank with Nonsmooth Cost Functions