nips nips2003 nips2003-38 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: H. J. Kim, Michael I. Jordan, Shankar Sastry, Andrew Y. Ng
Abstract: Autonomous helicopter flight represents a challenging control problem, with complex, noisy, dynamics. In this paper, we describe a successful application of reinforcement learning to autonomous helicopter flight. We first fit a stochastic, nonlinear model of the helicopter dynamics. We then use the model to learn to hover in place, and to fly a number of maneuvers taken from an RC helicopter competition.
Reference: text
sentIndex sentText sentNum sentScore
1 Autonomous helicopter flight via Reinforcement Learning Andrew Y. [sent-1, score-0.892]
2 Jordan, and Shankar Sastry University of California Berkeley, CA 94720 Abstract Autonomous helicopter flight represents a challenging control problem, with complex, noisy, dynamics. [sent-4, score-0.968]
3 In this paper, we describe a successful application of reinforcement learning to autonomous helicopter flight. [sent-5, score-0.984]
4 We first fit a stochastic, nonlinear model of the helicopter dynamics. [sent-6, score-0.892]
5 We then use the model to learn to hover in place, and to fly a number of maneuvers taken from an RC helicopter competition. [sent-7, score-1.046]
6 1 Introduction Helicopters represent a challenging control problem with high-dimensional, complex, asymmetric, noisy, non-linear dynamics, and are widely regarded as significantly more difficult to control than fixed-wing aircraft [7]. [sent-8, score-0.133]
7 Consider, for instance, the problem of designing a helicopter that hovers in place. [sent-9, score-0.892]
8 We begin with a single, horizontally-oriented main rotor attached to the helicopter via the rotor shaft. [sent-10, score-1.299]
9 Suppose the main rotor rotates clockwise (viewed from above), blowing air downwards and hence generating upward thrust. [sent-11, score-0.285]
10 By applying clockwise torque to the main rotor to make it rotate, our helicopter experiences an anti-torque that tends to cause the main chassis to spin anti-clockwise. [sent-12, score-1.141]
11 Thus, in the invention of the helicopter, it was necessary to add a tail rotor, which blows air sideways/rightwards to generate an appropriate moment to counteract the spin. [sent-13, score-0.083]
12 But, this sideways force now causes the helicopter to drift leftwards. [sent-14, score-0.955]
13 So, for a helicopter to hover in place, it must actually be tilted slightly to the right, so that the main rotor’s thrust is directed downwards and slightly to the left, to counteract this tendency to drift sideways. [sent-15, score-1.082]
14 The history of helicopters is rife with such tales of ingenious solutions to problems caused by solutions to other problems, and of complex, nonintuitive dynamics that make helicopters challenging to control. [sent-16, score-0.168]
15 In this paper, we describe the successful application of reinforcement learning to designing a controller for autonomous helicopter flight. [sent-17, score-1.042]
16 2 Autonomous Helicopter The helicopter used in this work was a Yamaha R-50 helicopter, which is approximately 3. [sent-20, score-0.892]
17 The helicopter carries an Inertial Navigation System (INS) consisting of 3 accelerometers and 3 rate gyroscopes installed in exactly orthogonal x,y,z directions, and a differential GPS system, which with the assistance of a ground station, gives position estimates with a resolution of 2cm. [sent-23, score-0.941]
18 Most helicopters are controlled via a 4-dimensional action space. The first two controls are the longitudinal (front-back) and latitudinal (left-right) cyclic pitch controls. [sent-27, score-0.112]
19 The rotor plane is the plane in which the helicopter’s rotors rotate. [sent-28, score-0.229]
20 By tilting this plane either forwards/backwards or sideways, these controls cause the helicopter to accelerate forward/backwards or sideways. [sent-29, score-0.945]
21 As the helicopter main-rotor’s blades sweep through the air, they generate an amount of upward thrust that (generally) increases with the angle at which the rotor blades are tilted. [sent-31, score-1.195]
22 By varying the tilt angle of the rotor blades, the collective pitch control affects the main rotor’s thrust. [sent-32, score-0.425]
23 Using a mechanism similar to the main rotor collective pitch control, the tail rotor collective pitch control sets the tail rotor’s thrust. [sent-34, score-0.41]
24 Using the position estimates given by the Kalman filter, our task is to pick good control actions every 50th of a second. [sent-35, score-0.106]
25 3 Model identification To fit a model of the helicopter’s dynamics, we began by asking a human pilot to fly the helicopter for several minutes, and recorded the 12-dimensional helicopter state and 4-dimensional helicopter control inputs as it was flown. [sent-36, score-2.831]
26 For instance, a helicopter at (0,0,0) facing east behaves in a way related only by a translation and rotation to one at (10,10,50) facing north, if we command each to accelerate forwards. [sent-39, score-0.951]
27 Thus, model identification is typically done not in the spatial (world) coordinates, but instead in the helicopter body coordinates, in which the x, y, and z axes are forwards, sideways, and down relative to the current position of the helicopter. [sent-41, score-0.974]
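The change from world to body coordinates can be illustrated with a minimal sketch for the horizontal plane only. The function name `world_to_body` and the yaw convention (heading in radians, measured from the world x-axis toward the world y-axis) are assumptions made for illustration, not details taken from the paper.

```python
import math


def world_to_body(vx_world, vy_world, yaw):
    """Rotate a horizontal world-frame vector into the body frame.

    Assumption: yaw is the heading in radians, measured from the world
    x-axis toward the world y-axis, and the body x-axis points forwards.
    """
    c, s = math.cos(yaw), math.sin(yaw)
    forward = c * vx_world + s * vy_world
    sideways = -s * vx_world + c * vy_world
    return forward, sideways


# A helicopter heading along world +y and moving along world +y is,
# in body coordinates, simply moving forwards.
forward, sideways = world_to_body(0.0, 1.0, math.pi / 2)
```

Expressed this way, two helicopters at different positions and headings that are both commanded to accelerate forwards produce the same body-frame data, which is the symmetry the text exploits.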
28 Each point plotted shows the mean-squared error between the predicted value of a state variable (when a model is used to simulate the helicopter’s dynamics for a certain duration indicated on the x-axis) and the true value of that state variable (as measured on test data) after the same duration. [sent-79, score-0.117]
29 (b) The solid line is the true helicopter state on 10s of test data. [sent-84, score-0.944]
30 The dash-dot line is the helicopter state predicted by our model, given the initial state at time 0 and all the intermediate control inputs. [sent-85, score-1.039]
31 The solid lines show the hovering policy class (Section 5). [sent-92, score-0.243]
32 The dashed lines show the extra weights added for trajectory following (Section 6). [sent-93, score-0.1]
33 Similarly, we know that the roll angle of the helicopter should have no direct effect on forward velocity. [sent-100, score-0.964]
34 Because of the danger and expense (about $70,000) of autonomous helicopters, we wanted to verify the fitted model carefully, so as to be reasonably confident that a controller tested successfully in simulation will also be safe in real life. [sent-124, score-0.105]
35 To check against this, we examined many plots such as those shown in Figure 2, to check that the helicopter state “rarely” goes outside the errorbars predicted by our model at various time scales (see caption). [sent-127, score-0.945]
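The visual check described here can also be summarized numerically. The sketch below counts how often held-out measurements fall outside the model's predicted error bars; the function name, the two-standard-deviation band, and the toy numbers are all assumptions for illustration, since the source only describes inspecting plots.

```python
def fraction_outside(true_values, predicted_means, predicted_stds, num_stds=2.0):
    """Fraction of test points lying outside mean +/- num_stds * std."""
    outside = 0
    for y, mu, sd in zip(true_values, predicted_means, predicted_stds):
        if abs(y - mu) > num_stds * sd:
            outside += 1
    return outside / len(true_values)


# Toy data (hypothetical): one of four measurements leaves the band.
truth = [0.0, 0.1, 0.5, 0.15]
means = [0.0, 0.0, 0.0, 0.0]
stds = [0.1, 0.1, 0.1, 0.1]
rate = fraction_outside(truth, means, stds)
```

A small value of `rate` at every time scale would correspond to the state "rarely" leaving the error bars.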
36 Consider an MDP with state space $S$, initial state $s_0 \in S$, action space $A$, state transition probabilities $P_{sa}(\cdot)$, reward function $R$, and discount $\gamma$. [sent-129, score-0.144]
37 Also let some family of policies $\Pi$ be given, and suppose our goal is to find a policy in $\Pi$ with high utility, where the utility of a policy $\pi$ is defined to be $U(\pi) = \mathrm{E}[R(s_0) + \gamma R(s_1) + \gamma^2 R(s_2) + \cdots \mid \pi]$. [sent-130, score-0.236]
38 Then a standard way to define an estimate of $U(\pi)$ is via Monte Carlo: we can use the simulator to sample a trajectory $s_0, s_1, \ldots$, and by taking the empirical sum of discounted rewards $R(s_0) + \gamma R(s_1) + \cdots$ on this sequence, we obtain one “sample” with which to estimate $U(\pi)$. [sent-136, score-0.084]
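The Monte Carlo estimate described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: `toy_simulate` is a hypothetical stand-in for the fitted helicopter model (a noisy 1-D point with reward equal to minus the squared position), and the rollout count, horizon, and discount are arbitrary.

```python
import random


def discounted_return(rewards, gamma):
    """Empirical sum of discounted rewards along one sampled trajectory."""
    total, weight = 0.0, 1.0
    for r in rewards:
        total += weight * r
        weight *= gamma
    return total


def monte_carlo_utility(simulate, policy, gamma, horizon, num_rollouts, seed=0):
    """Average the discounted return over several simulated rollouts."""
    rng = random.Random(seed)
    returns = []
    for _ in range(num_rollouts):
        rewards = simulate(policy, horizon, rng)
        returns.append(discounted_return(rewards, gamma))
    return sum(returns) / len(returns)


def toy_simulate(policy, horizon, rng):
    """Hypothetical simulator: noisy 1-D point, reward is -x**2."""
    x = 1.0
    rewards = []
    for _ in range(horizon):
        rewards.append(-x * x)
        x = x + policy(x) + rng.gauss(0.0, 0.01)
    return rewards


estimate = monte_carlo_utility(toy_simulate, lambda x: -0.5 * x, 0.95, 50, 100)
```

Fixing the seed makes repeated evaluations of the same policy return the same number, which is convenient when comparing policies during a search.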
39 This succeeded in flying the helicopter in simulation, but not on the actual helicopter (Shim, pers. [sent-153, score-1.806]
40 Similarly, preliminary experiments using and controllers to fly a similar helicopter were also unsuccessful. [sent-156, score-0.892]
41 These comments should not be taken as conclusive of the viability of any of these methods; rather, we take them to be indicative of the difficulty and subtlety involved in learning a helicopter controller. [sent-157, score-0.892]
42 Figure 3: Comparison of hovering performance of learned controller (solid line) vs. [sent-177, score-0.16]
43 We began by learning a policy for hovering in place. [sent-181, score-0.208]
44 We want a controller that, given the current helicopter state and a desired hovering position and orientation, computes controls to make it hover stably there. [sent-182, score-1.213]
45 For our policy class, we chose the simple neural network depicted in Figure 2c (solid edges only). [sent-183, score-0.128]
46 Each of the edges in the figure represents a weight, and the connections were chosen via simple reasoning about which control channel should be used to control which state variables. [sent-184, score-0.151]
47 For instance, consider the longitudinal (forward/backward) cyclic pitch control, which causes the rotor plane to tilt forward/backward, thus causing the helicopter to pitch (and/or accelerate) forward or backward. [sent-185, score-1.399]
48 From Figure 2c, we can read off the longitudinal cyclic pitch control as a function of the network weights and the position error (equation garbled in the source extraction). [sent-186, score-0.114]
49 Here, the $w_i$’s are the tunable parameters (weights) of the network, and $\mathrm{err}_x$ is defined to be the error in the x-position (forward direction, in body coordinates) between where the helicopter currently is and where we wish it to hover. [sent-191, score-0.931]
50 This reward encourages the helicopter to hover near the desired position, while also keeping the velocity small and not making abrupt movements. [sent-193, score-1.02]
51 The reward weights (distinct from the weights parameterizing our policy class) were chosen to scale each of the terms to be roughly the same order of magnitude. [sent-195, score-0.129]
52 To encourage small actions and smooth control of the helicopter, we also used a quadratic penalty for actions, and the overall reward combined the state and action terms. [sent-196, score-0.104]
53 One last component of the reward that we did not mention earlier was that, if in performing the locally weighted regression, the matrix is singular to numerical precision, then we declare the helicopter to have “crashed,” terminate the simulation, and give it a huge negative (-50000) reward. [sent-203, score-0.942]
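The pieces of the reward described above (quadratic penalties on position error, velocity, and actions, plus a large negative reward on a crash) can be put together in a small sketch. The crash value of -50000 comes from the text; everything else here is an assumption for illustration, including the function name, the single scalar weight per group of terms, and the choice to simply add the state and action penalties.

```python
CRASH_REWARD = -50000.0  # value quoted in the text


def hover_reward(pos_err, velocity, action,
                 w_pos=1.0, w_vel=1.0, w_act=1.0, crashed=False):
    """Quadratic hovering reward with a crash penalty.

    Assumptions: one weight per group of terms (the paper scales each
    term separately) and an overall reward formed by summing the state
    and action penalties.
    """
    if crashed:
        return CRASH_REWARD
    state_cost = (w_pos * sum(e * e for e in pos_err)
                  + w_vel * sum(v * v for v in velocity))
    action_cost = w_act * sum(a * a for a in action)
    return -(state_cost + action_cost)


# One metre of position error, no velocity, no control effort.
example = hover_reward((1.0, 0.0, 0.0), (0.0, 0.0, 0.0), (0.0, 0.0, 0.0, 0.0))
```

In a simulator loop, `crashed=True` would be passed when the model's regression step fails numerically, at which point the rollout is terminated.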
54 Figure 4: Top row: Maneuver diagrams from RC helicopter competition. [sent-220, score-0.906]
55 The most expensive step in policy search was the repeated Monte Carlo evaluation to obtain the utility estimates. [sent-228, score-0.127]
56 Figure 1b shows the result of implementing and of flying time each, and running the resulting policy on the helicopter. [sent-231, score-0.111]
57 On its maiden flight, our learned policy was successful in keeping the helicopter stabilized in the air. [sent-232, score-1.054]
58 (We note that [1] was also successful at using our PEGASUS algorithm to control a subset, the cyclic pitch controls, of a helicopter’s dynamics.) [sent-233, score-0.183]
59 We also compare the performance of our learned policy against that of our human pilot, trained and licensed by Yamaha to fly the R-50 helicopter. [sent-234, score-0.177]
60 Figure 5 shows the velocities and positions of the helicopter under our learned policy and under the human pilot’s control. [sent-235, score-1.051]
61 As we see, our controller was able to keep the helicopter flying more stably than was a human pilot. [sent-236, score-0.984]
62 Videos of the helicopter flying are available at http://www. [sent-237, score-0.892]
63 6 Flying competition maneuvers We were next interested in making the helicopter learn to fly several challenging maneuvers. [sent-241, score-1.021]
64 The Academy of Model Aeronautics (AMA) (to our knowledge the largest RC helicopter organization) holds an annual RC helicopter competition, in which helicopters have to be accurately flown through a number of maneuvers. [sent-242, score-1.845]
65 We took the first three maneuvers from the most challenging, Class III, segment of their competition. [sent-244, score-0.082]
66 Figure 4 shows maneuver diagrams from the AMA web site. [sent-245, score-0.096]
67 (Footnote 4: A problem exacerbated by the discontinuities described in the previous footnote.) [sent-246, score-0.082]
68 In the first of these maneuvers (III.1), the helicopter starts from the middle of the base of a triangle, flies backwards to the lower-right corner, performs a pirouette (turning in place), flies backwards up an edge of the triangle, backwards down the other edge, performs another pirouette, and flies backwards to its starting position. [sent-248, score-1.066]
69 Flying backwards is a significantly less stable maneuver than flying forwards, which makes this maneuver interesting and challenging. [sent-249, score-0.203]
70 In the second maneuver (III.2), the helicopter has to perform a nose-in turn, in which it flies backwards out to the edge of a circle, pauses, and then flies in a circle but always keeping the nose of the helicopter pointed at the center of rotation. [sent-251, score-1.84]
71 Many human pilots seem to find this second maneuver particularly challenging. [sent-253, score-0.116]
72 The third maneuver (III.3) involves flying the helicopter in a vertical rectangle, with two pirouettes in opposite directions halfway along the rectangle’s vertical segments. [sent-255, score-0.922]
73 Given a controller for keeping a system’s state at a point, one standard way to make the system move through a particular trajectory is to slowly vary that point along a sequence of set points on the trajectory. [sent-257, score-0.191]
74 For instance, if we ask our helicopter to hover at a point, then a fraction of a second later ask it to hover at a point slightly further along the x-axis, and so on, our helicopter will slowly fly in the x-direction. [sent-261, score-1.943]
75 By taking this procedure and “wrapping” it around our old policy class from Figure 2c, we thus obtain a computer program—that is, a new policy class—not just for hovering, but also for flying arbitrary trajectories. [sent-262, score-0.239]
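The "wrapping" step can be sketched as a generator of slowly moving hover targets plus a thin wrapper that feeds the current target to an existing hover policy. All names here (`set_points`, `make_trajectory_policy`, the `hover_policy(state, target)` signature) are hypothetical, and the straight-line target schedule is only the simplest case of the idea.

```python
def set_points(start, step, count):
    """Yield hover targets that creep along a straight line."""
    for i in range(count):
        yield tuple(s + i * d for s, d in zip(start, step))


def make_trajectory_policy(hover_policy, targets):
    """Wrap a hover policy so that it chases a moving target.

    hover_policy(state, target) is assumed to return controls that
    hold the helicopter at `target`.
    """
    def policy(state, step_index):
        # Hold the final target once the schedule has been exhausted.
        target = targets[min(step_index, len(targets) - 1)]
        return hover_policy(state, target)
    return policy


targets = list(set_points((0.0, 0.0, 0.0), (0.5, 0.0, 0.0), 4))


def toy_hover(state, target):
    """Toy 1-D stand-in for the hover policy: push toward the target."""
    return target[0] - state[0]


policy = make_trajectory_policy(toy_hover, targets)
command = policy((0.0, 0.0, 0.0), 2)
```

The wrapped object is still just a policy with the same tunable weights inside `hover_policy`, so it can be retrained with the same policy search procedure.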
76 That is, we now have a family of policies that take as input a trajectory, and that attempt to make the helicopter fly that trajectory. [sent-265, score-0.906]
77 Since we are now flying trajectories and not only hovering, we also augmented the policy class to take into account more of the coupling between the helicopter’s different subdynamics. [sent-267, score-0.153]
78 For instance, the simplest way to turn is to change the tail rotor collective pitch/thrust, so that it yaws either left or right. [sent-268, score-0.285]
79 This works well for small turns, but for large turns, the thrust from the tail rotor also tends to cause the helicopter to drift sideways. [sent-269, score-1.193]
80 Thus, we enriched the policy class to allow it to correct for this drift by applying the appropriate cyclic pitch controls. [sent-270, score-0.272]
81 Also, having a helicopter climb or descend changes the amount of work done by the main rotor, and hence the amount of torque/anti-torque generated, which can cause the helicopter to turn. [sent-271, score-1.842]
82 So, we also added a link between the collective pitch control and the tail rotor control. [sent-272, score-0.433]
83 We also needed to specify a reward function for trajectory following. [sent-274, score-0.097]
84 Specifically, consider making the helicopter fly in the increasing x-direction, so that the target position starts off at some initial point and has its first coordinate slowly increased over time. [sent-277, score-0.907]
85 Then, while the actual helicopter position will indeed increase, it will also almost certainly lag consistently behind the target. [sent-278, score-0.949]
86 This is because the hovering controller is always trying to “catch up” to the moving target. [sent-279, score-0.14]
87 Thus, the position error may remain large, and the helicopter will continuously incur a cost, even if it is in fact flying a very accurate trajectory in the increasing x-direction exactly as desired. [sent-280, score-0.956]
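The persistent lag can be reproduced in a toy 1-D setting. The sketch below assumes a simple proportional tracker (position moves a fixed fraction `gain` of the way toward the target each step), which is a stand-in chosen for illustration and not the helicopter controller. Under that assumption the tracking error obeys e' = (1 - gain) * e + target_step, so it settles at target_step / gain rather than at zero.

```python
def simulate_lag(gain, target_step, num_steps):
    """Track a steadily moving target with a proportional controller.

    Returns the final gap between the target and the tracked position.
    """
    x, target = 0.0, 0.0
    for _ in range(num_steps):
        x += gain * (target - x)   # controller closes part of the gap
        target += target_step      # target keeps moving ahead
    return target - x


lag = simulate_lag(gain=0.1, target_step=0.02, num_steps=500)
```

Even though the tracked point follows the intended line perfectly, a quadratic penalty on this gap would never vanish, which is the "fake error" the text warns about.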
88 It would be undesirable to have the helicopter risk trying to fly more aggressively to reduce this fake “error,” particularly if it is at the cost of increased error in the other coordinates. [sent-281, score-0.892]
89 (In our example of flying in a straight line, for a helicopter at , we easily see . [sent-283, score-0.892]
90 We also needed to make sure the helicopter is rewarded for making progress along the trajectory. [sent-285, score-0.892]
91 Since we are already tracking where along the desired trajectory the helicopter is, we chose a potential function that increases along the trajectory. [sent-289, score-0.956]
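One way to turn a potential that increases along the trajectory into a reward term is the standard potential-based shaping form, gamma * Phi(s') - Phi(s). The sketch below assumes that form and uses the index of the nearest waypoint as the potential; both choices, and the function names, are illustrative assumptions, since the paper's actual potential function is not recoverable from this extract.

```python
def progress_potential(position, waypoints):
    """Potential that grows along the trajectory.

    Assumption: progress is measured as the index of the nearest waypoint.
    """
    best_index, best_dist = 0, float("inf")
    for i, w in enumerate(waypoints):
        dist = sum((p - q) ** 2 for p, q in zip(position, w))
        if dist < best_dist:
            best_index, best_dist = i, dist
    return float(best_index)


def shaping_bonus(position, next_position, waypoints, gamma=1.0):
    """Potential-based shaping term: gamma * Phi(s') - Phi(s)."""
    return (gamma * progress_potential(next_position, waypoints)
            - progress_potential(position, waypoints))


waypoints = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0)]
bonus_forward = shaping_bonus((0.1, 0.0), (1.1, 0.0), waypoints)
bonus_backward = shaping_bonus((1.1, 0.0), (0.1, 0.0), waypoints)
```

Moving forward along the waypoints earns a positive bonus and moving backward earns a negative one, so the term rewards progress without directly penalizing the lag behind the moving target.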
92 So, we are now also free to consider allowing the target position to evolve in a way that is different from the path of the desired trajectory, but nonetheless in a way that allows the helicopter to follow the actual, desired trajectory more accurately. [sent-293, score-0.956]
93 (In control theory, there is a related practice of using the inverse dynamics to obtain better tracking behavior. [sent-294, score-0.084]
94 Specifically, it turns out that the vertical response of the helicopter is very fast: to climb, we need only increase the collective pitch control, which almost immediately causes the helicopter to start accelerating upwards. [sent-297, score-1.922]
95 In maneuver III.1, the helicopter will tend to track the vertical component of the trajectory much more quickly, so that it accelerates into a climb steeper than intended, resulting in a “bowed-out” trajectory. [sent-300, score-0.997]
96 Using this setup and retraining our policy class’ parameters for accurate trajectory following, we were able to learn a policy that flies all three of the competition maneuvers fairly accurately. [sent-303, score-0.396]
97 Figure 4 (bottom) shows actual trajectories taken by the helicopter while flying these maneuvers. [sent-304, score-0.939]
98 Videos of the helicopter flying these maneuvers are also available at the URL given at the end of Section 5. [sent-305, score-0.974]
99 Autonomous helicopter control using reinforcement learning policy search methods. [sent-309, score-1.107]
100 PEGASUS: A policy search method for large MDPs and POMDPs. [sent-362, score-0.127]
wordName wordTfidf (topN-words)
[('helicopter', 0.892), ('rotor', 0.195), ('ying', 0.143), ('policy', 0.111), ('pitch', 0.091), ('hovering', 0.082), ('maneuver', 0.082), ('maneuvers', 0.082), ('hover', 0.072), ('egasus', 0.071), ('trajectory', 0.064), ('helicopters', 0.061), ('controller', 0.058), ('control', 0.057), ('ight', 0.053), ('ies', 0.049), ('autonomous', 0.047), ('collective', 0.047), ('tail', 0.043), ('climb', 0.041), ('seconds', 0.039), ('velocity', 0.039), ('backwards', 0.039), ('state', 0.037), ('utilities', 0.036), ('position', 0.035), ('reward', 0.033), ('drift', 0.032), ('pilot', 0.032), ('reinforcement', 0.031), ('blades', 0.031), ('sideways', 0.031), ('thrust', 0.031), ('xdot', 0.031), ('yamaha', 0.031), ('competition', 0.028), ('dynamics', 0.027), ('symmetries', 0.027), ('rad', 0.027), ('rc', 0.026), ('regression', 0.025), ('trajectories', 0.025), ('body', 0.025), ('idealized', 0.024), ('coordinates', 0.022), ('actual', 0.022), ('cyclic', 0.021), ('ama', 0.02), ('clockwise', 0.02), ('counteract', 0.02), ('flying', 0.02), ('ins', 0.02), ('pilots', 0.02), ('stably', 0.02), ('air', 0.02), ('facing', 0.02), ('simulator', 0.02), ('evaluations', 0.02), ('learned', 0.02), ('monte', 0.02), ('carlo', 0.02), ('accelerate', 0.019), ('challenging', 0.019), ('weights', 0.018), ('lines', 0.018), ('tilt', 0.018), ('downwards', 0.018), ('pirouette', 0.018), ('wrapping', 0.018), ('plane', 0.017), ('keeping', 0.017), ('class', 0.017), ('main', 0.017), ('locally', 0.017), ('forward', 0.017), ('controls', 0.017), ('mdp', 0.016), ('roll', 0.016), ('angled', 0.016), ('videos', 0.016), ('predicted', 0.016), ('andrew', 0.016), ('search', 0.016), ('triangle', 0.015), ('upward', 0.015), ('forwards', 0.015), ('began', 0.015), ('intercept', 0.015), ('solid', 0.015), ('slowly', 0.015), ('vertical', 0.015), ('human', 0.014), ('actions', 0.014), ('carries', 0.014), ('shaping', 0.014), ('velocities', 0.014), ('diagrams', 0.014), ('tunable', 0.014), ('policies', 0.014), ('successful', 0.014)]
simIndex simValue paperId paperTitle
same-paper 1 1.0000001 38 nips-2003-Autonomous Helicopter Flight via Reinforcement Learning
Author: H. J. Kim, Michael I. Jordan, Shankar Sastry, Andrew Y. Ng
Abstract: Autonomous helicopter flight represents a challenging control problem, with complex, noisy, dynamics. In this paper, we describe a successful application of reinforcement learning to autonomous helicopter flight. We first fit a stochastic, nonlinear model of the helicopter dynamics. We then use the model to learn to hover in place, and to fly a number of maneuvers taken from an RC helicopter competition.
2 0.094253302 78 nips-2003-Gaussian Processes in Reinforcement Learning
Author: Malte Kuss, Carl E. Rasmussen
Abstract: We exploit some useful properties of Gaussian process (GP) regression models for reinforcement learning in continuous state spaces and discrete time. We demonstrate how the GP model allows evaluation of the value function in closed form. The resulting policy iteration algorithm is demonstrated on a simple problem with a two dimensional state space. Further, we speculate that the intrinsic ability of GP models to characterise distributions of functions would allow the method to capture entire distributions over future values instead of merely their expectation, which has traditionally been the focus of much of reinforcement learning.
3 0.087515332 158 nips-2003-Policy Search by Dynamic Programming
Author: J. A. Bagnell, Sham M. Kakade, Jeff G. Schneider, Andrew Y. Ng
Abstract: We consider the policy search approach to reinforcement learning. We show that if a “baseline distribution” is given (indicating roughly how often we expect a good policy to visit each state), then we can derive a policy search algorithm that terminates in a finite number of steps, and for which we can provide non-trivial performance guarantees. We also demonstrate this algorithm on several grid-world POMDPs, a planar biped walking robot, and a double-pole balancing problem.
4 0.063874774 34 nips-2003-Approximate Policy Iteration with a Policy Language Bias
Author: Alan Fern, Sungwook Yoon, Robert Givan
Abstract: We explore approximate policy iteration, replacing the usual cost-function learning step with a learning step in policy space. We give policy-language biases that enable solution of very large relational Markov decision processes (MDPs) that no previous technique can solve. In particular, we induce high-quality domain-specific planners for classical planning domains (both deterministic and stochastic variants) by solving such domains as extremely large MDPs.
5 0.061576273 20 nips-2003-All learning is Local: Multi-agent Learning in Global Reward Games
Author: Yu-han Chang, Tracey Ho, Leslie P. Kaelbling
Abstract: In large multiagent games, partial observability, coordination, and credit assignment persistently plague attempts to design good learning algorithms. We provide a simple and efficient algorithm that in part uses a linear system to model the world from a single agent’s limited perspective, and takes advantage of Kalman filtering to allow an agent to construct a good training signal and learn an effective policy.
6 0.060139794 42 nips-2003-Bounded Finite State Controllers
7 0.053475551 167 nips-2003-Robustness in Markov Decision Problems with Uncertain Transition Matrices
8 0.050555062 55 nips-2003-Distributed Optimization in Adaptive Networks
9 0.045062061 64 nips-2003-Estimating Internal Variables and Paramters of a Learning Agent by a Particle Filter
10 0.043611933 33 nips-2003-Approximate Planning in POMDPs with Macro-Actions
11 0.040693235 62 nips-2003-Envelope-based Planning in Relational MDPs
12 0.034752283 65 nips-2003-Extending Q-Learning to General Adaptive Multi-Agent Systems
13 0.033500414 116 nips-2003-Linear Program Approximations for Factored Continuous-State Markov Decision Processes
14 0.032227792 68 nips-2003-Eye Movements for Reward Maximization
15 0.031894997 52 nips-2003-Different Cortico-Basal Ganglia Loops Specialize in Reward Prediction at Different Time Scales
16 0.030881623 104 nips-2003-Learning Curves for Stochastic Gradient Descent in Linear Feedforward Networks
17 0.02855441 35 nips-2003-Attractive People: Assembling Loose-Limbed Models using Non-parametric Belief Propagation
18 0.02587671 181 nips-2003-Statistical Debugging of Sampled Programs
19 0.024523553 70 nips-2003-Fast Algorithms for Large-State-Space HMMs with Applications to Web Usage Analysis
20 0.024344014 36 nips-2003-Auction Mechanism Design for Multi-Robot Coordination
topicId topicWeight
[(0, -0.091), (1, 0.116), (2, -0.037), (3, 0.004), (4, -0.035), (5, 0.022), (6, -0.03), (7, -0.056), (8, 0.056), (9, 0.024), (10, -0.036), (11, -0.007), (12, 0.018), (13, 0.03), (14, -0.023), (15, 0.011), (16, -0.009), (17, -0.032), (18, -0.023), (19, -0.023), (20, 0.01), (21, -0.012), (22, 0.053), (23, 0.013), (24, -0.109), (25, -0.038), (26, 0.039), (27, -0.031), (28, 0.027), (29, -0.028), (30, -0.057), (31, 0.013), (32, 0.016), (33, 0.028), (34, -0.031), (35, 0.011), (36, -0.005), (37, 0.024), (38, -0.026), (39, -0.001), (40, -0.005), (41, 0.032), (42, -0.019), (43, -0.001), (44, 0.001), (45, -0.155), (46, -0.004), (47, 0.055), (48, -0.003), (49, 0.035)]
simIndex simValue paperId paperTitle
same-paper 1 0.9083907 38 nips-2003-Autonomous Helicopter Flight via Reinforcement Learning
Author: H. J. Kim, Michael I. Jordan, Shankar Sastry, Andrew Y. Ng
Abstract: Autonomous helicopter flight represents a challenging control problem, with complex, noisy, dynamics. In this paper, we describe a successful application of reinforcement learning to autonomous helicopter flight. We first fit a stochastic, nonlinear model of the helicopter dynamics. We then use the model to learn to hover in place, and to fly a number of maneuvers taken from an RC helicopter competition.
2 0.76836663 158 nips-2003-Policy Search by Dynamic Programming
Author: J. A. Bagnell, Sham M. Kakade, Jeff G. Schneider, Andrew Y. Ng
Abstract: We consider the policy search approach to reinforcement learning. We show that if a “baseline distribution” is given (indicating roughly how often we expect a good policy to visit each state), then we can derive a policy search algorithm that terminates in a finite number of steps, and for which we can provide non-trivial performance guarantees. We also demonstrate this algorithm on several grid-world POMDPs, a planar biped walking robot, and a double-pole balancing problem.
3 0.62640595 55 nips-2003-Distributed Optimization in Adaptive Networks
Author: Ciamac C. Moallemi, Benjamin V. Roy
Abstract: We develop a protocol for optimizing dynamic behavior of a network of simple electronic components, such as a sensor network, an ad hoc network of mobile devices, or a network of communication switches. This protocol requires only local communication and simple computations which are distributed among devices. The protocol is scalable to large networks. As a motivating example, we discuss a problem involving optimization of power consumption, delay, and buffer overflow in a sensor network. Our approach builds on policy gradient methods for optimization of Markov decision processes. The protocol can be viewed as an extension of policy gradient methods to a context involving a team of agents optimizing aggregate performance through asynchronous distributed communication and computation. We establish that the dynamics of the protocol approximate the solution to an ordinary differential equation that follows the gradient of the performance objective.
4 0.61024195 78 nips-2003-Gaussian Processes in Reinforcement Learning
Author: Malte Kuss, Carl E. Rasmussen
Abstract: We exploit some useful properties of Gaussian process (GP) regression models for reinforcement learning in continuous state spaces and discrete time. We demonstrate how the GP model allows evaluation of the value function in closed form. The resulting policy iteration algorithm is demonstrated on a simple problem with a two dimensional state space. Further, we speculate that the intrinsic ability of GP models to characterise distributions of functions would allow the method to capture entire distributions over future values instead of merely their expectation, which has traditionally been the focus of much of reinforcement learning.
5 0.59716505 167 nips-2003-Robustness in Markov Decision Problems with Uncertain Transition Matrices
Author: Arnab Nilim, Laurent El Ghaoui
Abstract: Optimal solutions to Markov Decision Problems (MDPs) are very sensitive with respect to the state transition probabilities. In many practical problems, the estimation of those probabilities is far from accurate. Hence, estimation errors are limiting factors in applying MDPs to real-world problems. We propose an algorithm for solving finite-state and finite-action MDPs, where the solution is guaranteed to be robust with respect to estimation errors on the state transition probabilities. Our algorithm involves a statistically accurate yet numerically efficient representation of uncertainty, via Kullback-Leibler divergence bounds. The worst-case complexity of the robust algorithm is the same as the original Bellman recursion. Hence, robustness can be added at practically no extra computing cost.
6 0.59188849 34 nips-2003-Approximate Policy Iteration with a Policy Language Bias
7 0.53206396 116 nips-2003-Linear Program Approximations for Factored Continuous-State Markov Decision Processes
8 0.52316666 42 nips-2003-Bounded Finite State Controllers
9 0.40855405 62 nips-2003-Envelope-based Planning in Relational MDPs
10 0.38915506 68 nips-2003-Eye Movements for Reward Maximization
11 0.38175824 52 nips-2003-Different Cortico-Basal Ganglia Loops Specialize in Reward Prediction at Different Time Scales
12 0.3578555 64 nips-2003-Estimating Internal Variables and Paramters of a Learning Agent by a Particle Filter
13 0.32717365 168 nips-2003-Salient Boundary Detection using Ratio Contour
14 0.32278609 110 nips-2003-Learning a World Model and Planning with a Self-Organizing, Dynamic Neural System
15 0.31597748 20 nips-2003-All learning is Local: Multi-agent Learning in Global Reward Games
16 0.29059941 75 nips-2003-From Algorithmic to Subjective Randomness
17 0.28480023 123 nips-2003-Markov Models for Automated ECG Interval Analysis
18 0.26885936 146 nips-2003-Online Learning of Non-stationary Sequences
19 0.24330266 161 nips-2003-Probabilistic Inference in Human Sensorimotor Processing
20 0.23811515 153 nips-2003-Parameterized Novelty Detectors for Environmental Sensor Monitoring
topicId topicWeight
[(0, 0.043), (11, 0.023), (29, 0.026), (30, 0.016), (35, 0.044), (53, 0.079), (69, 0.011), (71, 0.052), (76, 0.036), (82, 0.024), (85, 0.068), (91, 0.092), (92, 0.312), (99, 0.035)]
simIndex simValue paperId paperTitle
same-paper 1 0.75418377 38 nips-2003-Autonomous Helicopter Flight via Reinforcement Learning
Author: H. J. Kim, Michael I. Jordan, Shankar Sastry, Andrew Y. Ng
Abstract: Autonomous helicopter flight represents a challenging control problem, with complex, noisy, dynamics. In this paper, we describe a successful application of reinforcement learning to autonomous helicopter flight. We first fit a stochastic, nonlinear model of the helicopter dynamics. We then use the model to learn to hover in place, and to fly a number of maneuvers taken from an RC helicopter competition.
2 0.46917766 78 nips-2003-Gaussian Processes in Reinforcement Learning
Author: Malte Kuss, Carl E. Rasmussen
Abstract: We exploit some useful properties of Gaussian process (GP) regression models for reinforcement learning in continuous state spaces and discrete time. We demonstrate how the GP model allows evaluation of the value function in closed form. The resulting policy iteration algorithm is demonstrated on a simple problem with a two dimensional state space. Further, we speculate that the intrinsic ability of GP models to characterise distributions of functions would allow the method to capture entire distributions over future values instead of merely their expectation, which has traditionally been the focus of much of reinforcement learning.
3 0.46430752 20 nips-2003-All learning is Local: Multi-agent Learning in Global Reward Games
Author: Yu-han Chang, Tracey Ho, Leslie P. Kaelbling
Abstract: In large multiagent games, partial observability, coordination, and credit assignment persistently plague attempts to design good learning algorithms. We provide a simple and efficient algorithm that in part uses a linear system to model the world from a single agent’s limited perspective, and takes advantage of Kalman filtering to allow an agent to construct a good training signal and learn an effective policy.
4 0.46134979 158 nips-2003-Policy Search by Dynamic Programming
Author: J. A. Bagnell, Sham M. Kakade, Jeff G. Schneider, Andrew Y. Ng
Abstract: We consider the policy search approach to reinforcement learning. We show that if a “baseline distribution” is given (indicating roughly how often we expect a good policy to visit each state), then we can derive a policy search algorithm that terminates in a finite number of steps, and for which we can provide non-trivial performance guarantees. We also demonstrate this algorithm on several grid-world POMDPs, a planar biped walking robot, and a double-pole balancing problem.
5 0.45818099 116 nips-2003-Linear Program Approximations for Factored Continuous-State Markov Decision Processes
Author: Milos Hauskrecht, Branislav Kveton
Abstract: Approximate linear programming (ALP) has emerged recently as one of the most promising methods for solving complex factored MDPs with finite state spaces. In this work we show that ALP solutions are not limited only to MDPs with finite state spaces, but that they can also be applied successfully to factored continuous-state MDPs (CMDPs). We show how one can build an ALP-based approximation for such a model and contrast it to existing solution methods. We argue that this approach offers a robust alternative for solving high dimensional continuous-state space problems. The point is supported by experiments on three CMDP problems with 24-25 continuous state factors.
6 0.45720908 4 nips-2003-A Biologically Plausible Algorithm for Reinforcement-shaped Representational Learning
7 0.45662409 113 nips-2003-Learning with Local and Global Consistency
8 0.45627189 139 nips-2003-Nonlinear Filtering of Electron Micrographs by Means of Support Vector Regression
9 0.45562008 107 nips-2003-Learning Spectral Clustering
10 0.45554453 54 nips-2003-Discriminative Fields for Modeling Spatial Dependencies in Natural Images
11 0.45551354 30 nips-2003-Approximability of Probability Distributions
12 0.45533982 101 nips-2003-Large Margin Classifiers: Convex Loss, Low Noise, and Convergence Rates
13 0.45281109 103 nips-2003-Learning Bounds for a Generalized Family of Bayesian Posterior Distributions
14 0.45251793 57 nips-2003-Dynamical Modeling with Kernels for Nonlinear Time Series Prediction
15 0.4524323 126 nips-2003-Measure Based Regularization
16 0.45230559 189 nips-2003-Tree-structured Approximations by Expectation Propagation
17 0.45185807 143 nips-2003-On the Dynamics of Boosting
18 0.45065102 93 nips-2003-Information Dynamics and Emergent Computation in Recurrent Circuits of Spiking Neurons
19 0.45024547 125 nips-2003-Maximum Likelihood Estimation of a Stochastic Integrate-and-Fire Neural Model
20 0.44960836 181 nips-2003-Statistical Debugging of Sampled Programs