nips nips2003 nips2003-78 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Malte Kuss, Carl E. Rasmussen
Abstract: We exploit some useful properties of Gaussian process (GP) regression models for reinforcement learning in continuous state spaces and discrete time. We demonstrate how the GP model allows evaluation of the value function in closed form. The resulting policy iteration algorithm is demonstrated on a simple problem with a two dimensional state space. Further, we speculate that the intrinsic ability of GP models to characterise distributions of functions would allow the method to capture entire distributions over future values instead of merely their expectation, which has traditionally been the focus of much of reinforcement learning.
Reference: text
sentIndex sentText sentNum sentScore
1 Abstract We exploit some useful properties of Gaussian process (GP) regression models for reinforcement learning in continuous state spaces and discrete time. [sent-4, score-0.356]
2 We demonstrate how the GP model allows evaluation of the value function in closed form. [sent-5, score-0.171]
3 The resulting policy iteration algorithm is demonstrated on a simple problem with a two dimensional state space. [sent-6, score-0.66]
4 Further, we speculate that the intrinsic ability of GP models to characterise distributions of functions would allow the method to capture entire distributions over future values instead of merely their expectation, which has traditionally been the focus of much of reinforcement learning. [sent-7, score-0.242]
5 For most non-trivial (non-minimum-phase) control tasks a policy relying solely on short-term rewards will fail. [sent-9, score-0.614]
6 In reinforcement learning this problem is explicitly recognized by the distinction between short-term (reward) and long-term (value) desiderata. [sent-10, score-0.16]
7 The Bellman equations are either solved iteratively by policy evaluation, or alternatively solved directly (the equations are linear) and commonly interleaved with policy improvement steps (policy iteration). [sent-13, score-1.083]
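Because the Bellman equations for a fixed policy are linear, policy evaluation can be done either iteratively or in one direct linear solve. A minimal sketch for a finite-state MDP (illustrative only; the paper itself works in continuous state spaces):

```python
import numpy as np

rng = np.random.default_rng(0)
n_states, gamma = 5, 0.9

# Hypothetical policy-induced transition matrix (rows sum to 1) and reward vector.
P = rng.random((n_states, n_states))
P /= P.sum(axis=1, keepdims=True)
r = rng.random(n_states)

# Iterative policy evaluation: repeated application of the Bellman backup.
V_iter = np.zeros(n_states)
for _ in range(1000):
    V_iter = r + gamma * P @ V_iter

# Direct solution of the linear system (I - gamma * P) V = r.
V_direct = np.linalg.solve(np.eye(n_states) - gamma * P, r)

print(np.max(np.abs(V_iter - V_direct)))  # both agree up to numerical precision
```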
8 While the concept of a value function is ubiquitous in reinforcement learning, this is not the case in the control community. [sent-15, score-0.278]
9 Alternatively, longer-term predictions can be achieved by concatenating short-term predictions, an approach which is made difficult by the fact that uncertainty in predictions typically grows (precluding approaches based on the certainty equivalence principle) as the time horizon lengthens. [sent-17, score-0.17]
10 (2003) for a full probabilistic approach based on Gaussian processes; however, implementing a controller based on this approach requires numerically solving multivariate optimisation problems for every control action. [sent-19, score-0.197]
11 In contrast, having access to a value function makes computation of control actions much easier. [sent-20, score-0.174]
12 This approach can be naturally applied to discrete-time, continuous-state-space systems. [sent-23, score-0.196]
13 This avoids the tedious discretisation of state spaces often required by other methods, eg. [sent-24, score-0.123]
14 In Dietterich and Wang (2002) kernel based methods (support vector regression) were also applied to learning of the value function, but in discrete state spaces. [sent-26, score-0.196]
15 In the current paper we use Gaussian process (GP) models for two distinct purposes: first, to model the dynamics of the system (we use one GP per dimension of the state space), which we will refer to as the dynamics GP; and secondly, the value GP for representing the value function. [sent-27, score-0.711]
16 When computing the values, we explicitly take the uncertainties from the dynamics GP into account, and using the linearity of the GP, we are able to solve directly for the value function, avoiding slow policy evaluation iterations. [sent-28, score-0.87]
17 For these experiments we use a greedy policy with respect to the value function. [sent-30, score-0.524]
18 However, since our representation of the value function is stochastic, we could represent uncertainty about values, enabling a principled attack on the exploration vs. exploitation tradeoff. [sent-32, score-0.178]
19 The transition probabilities may include two sources of stochasticity: uncertainty in the model of the dynamics and stochasticity in the dynamics itself. [sent-37, score-0.47]
20 1 Gaussian Process Regression Models In GP models we put a prior directly on functions and condition on observations to make predictions (see Williams and Rasmussen (1996) for details). [sent-39, score-0.13]
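A minimal sketch of GP regression with a squared-exponential covariance, conditioning on noisy observations to obtain a predictive mean and variance; the hyperparameter values and toy data are illustrative, not the paper's:

```python
import numpy as np

def se_kernel(A, B, ell=1.0, sf2=1.0):
    # k(x, x') = sf2 * exp(-0.5 * ||x - x'||^2 / ell^2)
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return sf2 * np.exp(-0.5 * d2 / ell ** 2)

def gp_predict(X, y, Xs, noise=0.01):
    """Posterior mean and variance of a zero-mean GP at test inputs Xs."""
    K = se_kernel(X, X) + noise * np.eye(len(X))
    Ks = se_kernel(Xs, X)
    alpha = np.linalg.solve(K, y)
    mean = Ks @ alpha
    var = np.diag(se_kernel(Xs, Xs) - Ks @ np.linalg.solve(K, Ks.T))
    return mean, var

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(20, 1))
y = np.sin(X[:, 0]) + 0.1 * rng.standard_normal(20)
Xs = np.linspace(-3, 3, 50)[:, None]
mean, var = gp_predict(X, y, Xs)
```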
21 Given a set of -dimensional observations of the form , we use a separate Gaussian process model for predicting each coordinate of the system dynamics. [sent-47, score-0.143]
22 The inputs to each model are the state and action pair; the output is a (Gaussian) distribution over the consecutive state variable, using eq. [sent-48, score-0.373]
23 Combining the predictive models we obtain a multivariate Gaussian distribution over the consecutive state: the transition probabilities. [sent-50, score-0.206]
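A sketch of how per-coordinate dynamics GPs can be combined into a Gaussian over the consecutive state (independent outputs give a diagonal covariance); the toy data, kernel and library choice (scikit-learn) are assumptions for illustration, not the paper's implementation:

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF, WhiteKernel

rng = np.random.default_rng(0)
n, d = 200, 2                                    # number of transitions, state dimension
SA = rng.uniform(-1, 1, (n, d + 1))              # inputs: [state, action]
X_next = 0.9 * SA[:, :d] + 0.1 * SA[:, [d]] + 0.01 * rng.standard_normal((n, d))

# One dynamics GP per state coordinate, each mapping (state, action) -> next coordinate.
dyn_gps = []
for k in range(d):
    gp = GaussianProcessRegressor(RBF(1.0) + WhiteKernel(1e-4), normalize_y=True)
    gp.fit(SA, X_next[:, k])
    dyn_gps.append(gp)

def predict_next_state(state, action):
    """Combine the coordinate-wise predictions into a multivariate Gaussian."""
    sa = np.r_[state, action][None, :]
    stats = [gp.predict(sa, return_std=True) for gp in dyn_gps]
    mean = np.array([m[0] for m, _ in stats])
    cov = np.diag([s[0] ** 2 for _, s in stats])  # diagonal covariance over next state
    return mean, cov

mu, Sigma = predict_next_state(np.zeros(d), 0.5)
```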
24 3 Policy Evaluation We now turn towards the problem of evaluating the value function for a given policy over the continuous state space. [sent-53, score-0.644]
25 In policy evaluation the Bellman equations are used as update rules. [sent-54, score-0.589]
26 In order to apply this approach in the continuous case, we have to solve the two integrals in eq. [sent-55, score-0.16]
27 For suitable (eg. polynomial or Gaussian) reward functions we can directly compute the first Gaussian integral of eq. [sent-58, score-0.186]
28 Thus, the expected immediate reward, from state , following the policy, is given by equation (7). [sent-60, score-0.163]
29 Here the mean and covariance for the consecutive state are coordinate-wise given by eq. [sent-61, score-0.263]
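For a Gaussian-shaped reward and a Gaussian distribution over the consecutive state, the expected immediate reward has a closed form. A sketch of this kind of computation (the numerical values are illustrative, and this is not claimed to be the paper's exact eq. (7)):

```python
import numpy as np

def expected_gaussian_reward(m, S, c, W):
    """E[r(x)] for x ~ N(m, S) and r(x) = exp(-0.5 (x-c)' W^-1 (x-c)):
       |I + W^-1 S|^{-1/2} exp(-0.5 (m-c)' (W+S)^-1 (m-c))."""
    d = len(m)
    scale = np.linalg.det(np.eye(d) + np.linalg.solve(W, S)) ** -0.5
    diff = m - c
    return scale * np.exp(-0.5 * diff @ np.linalg.solve(W + S, diff))

m, S = np.array([0.2, 0.0]), np.diag([0.05, 0.02])   # next-state mean and covariance
c, W = np.array([0.0, 0.0]), np.diag([0.1, 0.1])     # reward centre and width

analytic = expected_gaussian_reward(m, S, c, W)

# Monte Carlo check of the closed form.
rng = np.random.default_rng(0)
x = rng.multivariate_normal(m, S, size=200_000)
diff = x - c
mc = np.mean(np.exp(-0.5 * np.einsum('ij,jk,ik->i', diff, np.linalg.inv(W), diff)))
print(analytic, mc)   # the two estimates should agree closely
```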
30 The second integral of eq. (3) involves an expectation over the value function, which is modeled by the value GP as a function of the states. [sent-64, score-0.146]
31 We need access to the value function at every point in the continuous state space, but we only explicitly represent values at a finite number of support points, and let the GP generalise to the entire space. [sent-65, score-0.354]
32 For a Gaussian covariance function this can be done in closed form as shown by Girard et al. [sent-68, score-0.116]
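A sketch of the closed-form expectation of a GP posterior mean under a Gaussian input, of the kind derived by Girard et al. for the squared-exponential covariance; the notation (support points S, values v, length-scales ell) is illustrative rather than the paper's:

```python
import numpy as np

def se_kernel(A, B, ell, sf2=1.0):
    d2 = (((A[:, None, :] - B[None, :, :]) / ell) ** 2).sum(-1)
    return sf2 * np.exp(-0.5 * d2)

def expected_gp_mean(mu, Sigma, S, v, ell, sf2=1.0, noise=1e-2):
    """E_{x ~ N(mu, Sigma)}[V(x)] for the GP posterior mean V(x) = sum_i beta_i k(x, s_i)."""
    K = se_kernel(S, S, ell, sf2) + noise * np.eye(len(S))
    beta = np.linalg.solve(K, v)
    Lam = np.diag(ell ** 2)
    det = np.linalg.det(np.eye(len(mu)) + np.linalg.solve(Lam, Sigma)) ** -0.5
    diff = S - mu
    quad = np.einsum('ij,jk,ik->i', diff, np.linalg.inv(Lam + Sigma), diff)
    l = sf2 * det * np.exp(-0.5 * quad)      # E[k(x, s_i)] for each support point
    return beta @ l

rng = np.random.default_rng(0)
S = rng.uniform(-1, 1, (30, 2))              # support points
v = np.sin(S[:, 0]) * np.cos(S[:, 1])        # values at the support points
ell = np.array([0.5, 0.5])
print(expected_gp_mean(np.zeros(2), 0.01 * np.eye(2), S, v, ell))
```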
33 In detail, the Bellman equation for the value at support point is given by equation (8). [sent-70, score-0.197]
34 Note that this equation implies a consistency between the values at the support points and the values at all other points. [sent-72, score-0.27]
35 Equation (8) could be used for iterative policy evaluation. [sent-73, score-0.476]
36 Equation (8) is a set of linear simultaneous equations in the values at the support points, which we can solve explicitly, equation (10). For more complex reward functions we may approximate it using eg. [sent-75, score-0.232]
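Because the expected value of the consecutive state is linear in the values at the support points, the Bellman equations reduce to a single linear solve. A sketch with a placeholder weight matrix (a dynamics model and reward expectation would supply W and rbar in practice):

```python
import numpy as np

rng = np.random.default_rng(0)
n, gamma = 50, 0.8

# Placeholder: row i gives the weights expressing E[V(next state of support point i)]
# as a linear combination of the values at all support points.
W = rng.random((n, n))
W /= W.sum(axis=1, keepdims=True)
rbar = rng.random(n)                 # expected immediate rewards at the support points

# v_i = rbar_i + gamma * W[i] @ v  for all i  <=>  (I - gamma W) v = rbar
v = np.linalg.solve(np.eye(n) - gamma * W, rbar)
```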
37 The computational cost of solving this system is cubic in the number of support points, which is no more expensive than doing iterative policy evaluation, and equal to the cost of value GP prediction. [sent-79, score-0.591]
38 4 Policy Improvement Above we demonstrated how to compute the value function for a given policy. [sent-81, score-0.549]
39 Now given a value function we can act greedily, thereby defining an implicit policy via equation (11), giving rise to one-dimensional optimisation problems (when the possible actions are scalar). [sent-82, score-0.232]
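A sketch of the one-dimensional optimisation behind such a greedy action choice for a scalar action; the function q passed in is a hypothetical helper that would combine the dynamics GP prediction, the expected reward and the expected future value for the current state:

```python
from scipy.optimize import minimize_scalar

def greedy_action(q, a_min=-1.0, a_max=1.0):
    """Return argmax_a q(a), where q(a) = E[reward + gamma * V(next state) | a]."""
    res = minimize_scalar(lambda a: -q(a), bounds=(a_min, a_max), method='bounded')
    return res.x

# Toy usage with a dummy q; in the algorithm q would query the dynamics and value GPs.
print(greedy_action(lambda a: -(a - 0.3) ** 2))   # ~0.3
```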
40 As above we can solve the relevant integrals and in addition compute derivatives with respect to the action. [sent-83, score-0.115]
41 5 The Policy Iteration Algorithm We now combine policy evaluation and policy improvement into policy iteration in which both steps alternate until a stable configuration is reached. [sent-87, score-1.595]
42 Thus given observations of system dynamics and a reward function we can compute a continuous value function and thereby an implicitly defined policy. [sent-88, score-0.518]
43 Given: observations of system dynamics of the form for a fixed time interval, discount factor and reward function. [sent-90, score-0.468]
44 2. Model Identification: Model the system dynamics by Gaussian processes for each state coordinate and combine them to obtain a model. [sent-91, score-0.377]
45 3. Initialise Value Function: Choose a set of support points and initial values. [sent-92, score-0.129]
46 Fit Gaussian process hyperparameters for representing the value function using conjugate gradient optimisation of the marginal likelihood, and set the noise level to a small positive constant. [sent-93, score-0.197]
47 For each support point: compute the predicted consecutive state using the dynamics Gaussian processes; solve equation (7) in order to obtain the expected immediate reward; compute the ’th row of the matrix as in equation (9); end for. [sent-96, score-0.286]
48 Solve for the values at the support points as in equation (10), and update the Gaussian process hyperparameters for representing the value function, until stabilisation of the value function (a high-level sketch of this loop is given below). The selection of the support points remains to be determined. [sent-98, score-0.185]
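A high-level sketch of the policy-iteration loop outlined above, with the evaluation and improvement steps passed in as callables (hypothetical stand-ins for the components sketched earlier; an outline under assumptions, not the paper's implementation):

```python
import numpy as np

def gp_policy_iteration(support, rewards, gamma, improve, evaluate_row,
                        n_iters=20, tol=1e-4):
    """improve(s, v) -> greedy action at support point s given values v (eq. 11);
       evaluate_row(s, a) -> (expected immediate reward, weight row over support)."""
    v = np.asarray(rewards, dtype=float).copy()   # initialise with immediate rewards
    for _ in range(n_iters):
        rows = [evaluate_row(s, improve(s, v)) for s in support]
        rbar = np.array([r for r, _ in rows])
        W = np.array([w for _, w in rows])
        v_new = np.linalg.solve(np.eye(len(v)) - gamma * W, rbar)   # eq. (10)
        converged = np.max(np.abs(v_new - v)) < tol
        v = v_new
        if converged:
            break
    return v

# Toy usage on a 5-state chain whose "dynamics" move one support point to the right.
support = np.arange(5)
rewards = np.array([0., 0., 0., 0., 1.])
improve = lambda s, v: 0                                     # single dummy action
evaluate_row = lambda s, a: (rewards[s], np.eye(5)[min(s + 1, 4)])
print(gp_policy_iteration(support, rewards, 0.9, improve, evaluate_row))
```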
49 When using the algorithm in an online setting, support points could naturally be chosen as the states visited, possibly selecting the ones which conveyed most new information about the system. [sent-99, score-0.229]
50 In the experimental section, for simplicity of exposition we consider only the batch case, and use simply a regular grid of support points. [sent-100, score-0.223]
51 We have assumed for simplicity that the reward function is deterministic and known, but it would not be too difficult to also use a (GP) model for the rewards; any model that allows the required expectations to be evaluated would do. (Footnote 5: Assuming convergence, which we have not proven.) [sent-101, score-0.188]
52 Figure 1: Figure (a) illustrates the mountain car problem. [sent-106, score-0.409]
53 The car is initially standing motionless at and the goal is to bring it up and hold it in the region such that . [sent-107, score-0.434]
54 The hatched area marks the target region and below the approximation by a Gaussian is shown (both projected onto the axis). [sent-108, score-0.147]
55 Figure (b) shows the position of the car when controlled according to (11) using the approximated value function after 6 policy improvements shown in Figure 3. [sent-109, score-0.884]
56 The car reaches the target region in about five time steps but does not end up exactly at the target due to uncertainty in the dynamics model. [sent-110, score-0.733]
57 Similarly the greedy policy has been assumed, but generalisation to stochastic policies would not be difficult. [sent-114, score-0.56]
58 3 Illustrative Example For reasons of presentability of the value function, we consider below the well-known mountain car problem “park on the hill”, as described by Moore and Atkeson (1995), where the state-space is only two-dimensional. [sent-115, score-0.482]
59 The setting depicted in Figure 1(a) consists of a frictionless, point-like, unit mass car on a hilly landscape described by equation (12). [sent-116, score-0.374]
60 " 4 for for D V E E © § ¢ I © P§ is described by the position of the car and its speed The state of the system which are constrained to and respectively. [sent-117, score-0.5]
61 As action, a horizontal force in a bounded range can be applied in order to bring the car up into the target region and hold it there. [sent-118, score-0.452]
62 For system identification we draw samples uniformly from their respective admissible regions and simulate time steps of a fixed duration forward in time using an ODE solver in order to get the consecutive states. [sent-121, score-0.3]
63 We then use two Gaussian processes, one per state variable independently, to build a model that predicts the system behavior from these examples, using covariance functions of the type in eq. [sent-122, score-0.299]
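A sketch of this system-identification step: sample (state, action) pairs, integrate toy continuous-time dynamics forward for one time step with an ODE solver, and fit one GP per state coordinate on the resulting transitions. The hill shape, sampling ranges and time step below are illustrative stand-ins, not the quantities of eq. (12) or the paper's constants; scikit-learn's marginal-likelihood optimisation plays the role of the conjugate gradient hyperparameter fit:

```python
import numpy as np
from scipy.integrate import solve_ivp
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF, WhiteKernel

def car_ode(t, y, force):
    """Toy point-mass car on a hilly landscape (illustrative dynamics only)."""
    x, v = y
    return [v, force - 9.81 * 0.2 * np.cos(3 * x)]

rng = np.random.default_rng(0)
n, dt = 200, 0.3
states = np.c_[rng.uniform(-1, 1, n), rng.uniform(-2, 2, n)]   # position, speed
actions = rng.uniform(-4, 4, n)                                # horizontal force

# Simulate one fixed-length time step per sample to obtain consecutive states.
next_states = np.array([
    solve_ivp(car_ode, (0.0, dt), s, args=(a,)).y[:, -1]
    for s, a in zip(states, actions)
])

# One dynamics GP per state coordinate; inputs are (position, speed, force).
SA = np.c_[states, actions]
dyn_gps = [
    GaussianProcessRegressor(RBF([1.0, 1.0, 1.0]) + WhiteKernel(1e-4),
                             normalize_y=True).fit(SA, next_states[:, k])
    for k in range(2)
]
```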
64 Our algorithm works equally well for shorter time steps (the discount factor should be increased); for even longer time steps, modeling of the dynamics gets more complicated, and eventually, for large enough time steps, control is no longer possible. [sent-125, score-0.325]
65 Figure 2: Figures (a-c) show the estimated value function for the mountain car example after initialisation (a), after the first iteration (b) and a nearly stabilised value function after 3 iterations (c). [sent-135, score-0.616]
66 Having a model of the system dynamics, the other necessary element to provide to the proposed algorithm is a reward function. [sent-137, score-0.202]
67 In the formulation by Moore and Atkeson (1995) the reward is equal to if the car is in the target region and elsewhere. [sent-138, score-0.603]
68 For convenience we approximate this cube by a Gaussian proportional to with maximum reward as indicated in Figure 1(a). [sent-139, score-0.16]
69 States outside the feasible region are assigned zero value and reward. [sent-143, score-0.139]
70 As support points for the value function we simply put a regular grid onto the state-space and initialise the value function with the immediate rewards for these states, Figure 2(a). [sent-145, score-0.403]
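A sketch of this initialisation: a regular grid of support points over the 2-D state space, with the value function initialised to the immediate (Gaussian-approximated) rewards; grid extents, resolution and the reward centre/width are illustrative assumptions:

```python
import numpy as np

xs = np.linspace(-1.0, 1.0, 15)            # position grid (illustrative range)
vs = np.linspace(-2.0, 2.0, 15)            # speed grid (illustrative range)
support = np.array([[x, v] for x in xs for v in vs])

c = np.array([0.0, 0.0])                   # hypothetical reward centre
W = np.diag([0.05, 0.05])                  # hypothetical reward width
diff = support - c
reward = np.exp(-0.5 * np.einsum('ij,jk,ik->i', diff, np.linalg.inv(W), diff))

v0 = reward.copy()                          # initial values = immediate rewards
```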
71 The standard deviation of the noise of the value GP representing is set to , and the discount factor to . [sent-146, score-0.141]
72 Following the policy iteration algorithm we estimate the value of all support points following the implicit policy (11) with respect to the current value function. [sent-147, score-0.476]
73 We then evaluate this policy and obtain an updated value function shown in Figure 2(b) where all points which can expect to reach the reward region in one time step gain value. [sent-149, score-0.874]
74 If we iterate this procedure two times we obtain a value function as shown in Figure 2(c) in which the state space is already well organised. [sent-150, score-0.196]
75 After five policy iterations the value function and therefore the implicit policy is stable, Figure 3(a). [sent-151, score-1.09]
76 In Figure 3(b) a dynamics-GP-based state-transition diagram is shown, in which each support point is connected to its predicted (mean) consecutive state when following the implicit policy. [sent-152, score-0.427]
77 For some of the support points the model correctly predicts that the car will leave the feasible region, no matter what action is applied, which corresponds to areas with zero value in Figure 3(a). [sent-153, score-0.537]
78 If we control the car from its start state according to the found policy, the car gathers momentum by first accelerating left before driving up into the target region, where it is balanced as illustrated in Figure 1(b). [sent-155, score-1.325]
79 It shows that the random examples of the system dynamics are sufficient for this task. [sent-156, score-0.201]
80 The control policy found is probably very close to the optimally achievable one. [sent-157, score-0.547]
81 Figure 3: Figure (a) shows the estimated value function after 6 policy improvements (subsequent to Figures 2(a-c)), where it has stabilised. [sent-169, score-0.549]
82 Figure (b) is the corresponding state transition diagram illustrating the implicit policy on the support points. [sent-170, score-0.835]
83 The black lines connect the support points and the respective consecutive states estimated by the dynamics GP when following the implicit greedy policy with respect to (a). [sent-171, score-0.748]
84 The thick line marks the trajectory of the car for the movement described in Figure 1(b) based on the physics of the system. [sent-172, score-0.374]
85 The distribution over future reward could have small or large variance and identical means, two fairly different situations that are treated identically when only the value expectation is considered. [sent-176, score-0.233]
86 The GP representation of value functions proposed here lends itself naturally to this more elaborate concept of value. [sent-181, score-0.127]
87 Implementation of this would require a second set of Bellman-like equations for the second moment of the values at the support points. [sent-183, score-0.133]
88 These equations would simply express consistency of uncertainty: the uncertainty of a value should be consistent with the uncertainty when following the policy. [sent-184, score-0.278]
89 The values at the support points would be (Gaussian) distributions with individual variances, which is readily handled by using a full diagonal noise term in the place of in eq. [sent-185, score-0.129]
90 However, iteration would be necessary to solve the combined system, as there would be no linearity corresponding to eq. [sent-189, score-0.13]
91 Instead, a stochastic policy should be used, which should not cause further computational problems as long as it is represented by a Gaussian (or perhaps more appropriately a mixture of Gaussians). [sent-194, score-0.476]
92 A good policy should actively explore regions where we may gain a lot of information, requiring the notion of the value of information (Dearden et al. [sent-195, score-0.608]
93 Since the information gain would come from a better dynamics GP model, it may not be an easy task in practice to optimise jointly information and value. [sent-197, score-0.214]
94 We have introduced Gaussian process models into continuous-state reinforcement learning tasks, to model the state dynamics and the value function. [sent-198, score-0.543]
95 We believe that the good generalisation properties, and the simplicity of manipulation of GP models make them ideal candidates for these tasks. [sent-199, score-0.12]
96 We are convinced that recent developments in powerful kernel-based probabilistic models for supervised learning such as GPs, will integrate well into reinforcement learning and control. [sent-206, score-0.16]
97 Both the modeling and analytic properties make them excellent candidates for reinforcement learning tasks. [sent-207, score-0.164]
98 We speculate that their fully probabilistic nature offers promising prospects for some fundamental problems of reinforcement learning. [sent-208, score-0.19]
99 The parti-game algorithm for variable resolution reinforcement learning in multidimensional state-spaces. [sent-242, score-0.134]
100 Propagation of uncertainty in Bayesian kernel models - application to multiple-step ahead forecasting. [sent-250, score-0.132]
wordName wordTfidf (topN-words)
[('gp', 0.481), ('policy', 0.476), ('car', 0.335), ('reward', 0.16), ('dynamics', 0.159), ('reinforcement', 0.134), ('state', 0.123), ('qui', 0.113), ('rasmussen', 0.103), ('dearden', 0.098), ('atkeson', 0.089), ('support', 0.087), ('gaussian', 0.087), ('bellman', 0.086), ('consecutive', 0.085), ('girard', 0.085), ('integrals', 0.078), ('batch', 0.075), ('moore', 0.075), ('mountain', 0.074), ('value', 0.073), ('df', 0.071), ('control', 0.071), ('evaluation', 0.067), ('region', 0.066), ('implicit', 0.065), ('uncertainty', 0.064), ('optimisation', 0.064), ('iteration', 0.061), ('dx', 0.061), ('speculate', 0.056), ('covariance', 0.055), ('processes', 0.053), ('ft', 0.051), ('hyperparameters', 0.051), ('initialise', 0.049), ('attias', 0.049), ('greedy', 0.048), ('equations', 0.046), ('continuous', 0.045), ('stochasticity', 0.045), ('states', 0.044), ('transition', 0.043), ('action', 0.042), ('figures', 0.042), ('williams', 0.042), ('system', 0.042), ('points', 0.042), ('target', 0.042), ('barto', 0.042), ('ahead', 0.042), ('diagram', 0.041), ('exploration', 0.041), ('immediate', 0.04), ('discount', 0.04), ('predictions', 0.039), ('landscape', 0.039), ('marks', 0.039), ('rewards', 0.039), ('observations', 0.039), ('steps', 0.039), ('solve', 0.037), ('equation', 0.037), ('controller', 0.036), ('generalisation', 0.036), ('predicting', 0.034), ('admissible', 0.034), ('exploitation', 0.033), ('exposition', 0.033), ('bring', 0.033), ('linearity', 0.032), ('consistency', 0.031), ('closed', 0.031), ('db', 0.031), ('greedily', 0.031), ('et', 0.03), ('visited', 0.03), ('wang', 0.03), ('candidates', 0.03), ('actions', 0.03), ('gain', 0.029), ('sutton', 0.029), ('simplicity', 0.028), ('process', 0.028), ('tradeoff', 0.028), ('online', 0.028), ('time', 0.028), ('naturally', 0.028), ('representing', 0.028), ('dietterich', 0.027), ('explicitly', 0.026), ('models', 0.026), ('predicted', 0.026), ('functions', 0.026), ('jointly', 0.026), ('multivariate', 0.026), ('bayesian', 0.026), ('predictive', 0.026), ('conjugate', 0.026), ('temporary', 0.025)]
simIndex simValue paperId paperTitle
same-paper 1 0.99999988 78 nips-2003-Gaussian Processes in Reinforcement Learning
Author: Malte Kuss, Carl E. Rasmussen
Abstract: We exploit some useful properties of Gaussian process (GP) regression models for reinforcement learning in continuous state spaces and discrete time. We demonstrate how the GP model allows evaluation of the value function in closed form. The resulting policy iteration algorithm is demonstrated on a simple problem with a two dimensional state space. Further, we speculate that the intrinsic ability of GP models to characterise distributions of functions would allow the method to capture entire distributions over future values instead of merely their expectation, which has traditionally been the focus of much of reinforcement learning.
2 0.3603428 194 nips-2003-Warped Gaussian Processes
Author: Edward Snelson, Zoubin Ghahramani, Carl E. Rasmussen
Abstract: We generalise the Gaussian process (GP) framework for regression by learning a nonlinear transformation of the GP outputs. This allows for non-Gaussian processes and non-Gaussian noise. The learning algorithm chooses a nonlinear transformation such that transformed data is well-modelled by a GP. This can be seen as including a preprocessing transformation as an integral part of the probabilistic modelling problem, rather than as an ad-hoc step. We demonstrate on several real regression problems that learning the transformation can lead to significantly better performance than using a regular GP, or a GP with a fixed transformation. 1
3 0.3052043 158 nips-2003-Policy Search by Dynamic Programming
Author: J. A. Bagnell, Sham M. Kakade, Jeff G. Schneider, Andrew Y. Ng
Abstract: We consider the policy search approach to reinforcement learning. We show that if a “baseline distribution” is given (indicating roughly how often we expect a good policy to visit each state), then we can derive a policy search algorithm that terminates in a finite number of steps, and for which we can provide non-trivial performance guarantees. We also demonstrate this algorithm on several grid-world POMDPs, a planar biped walking robot, and a double-pole balancing problem. 1
4 0.29148892 141 nips-2003-Nonstationary Covariance Functions for Gaussian Process Regression
Author: Christopher J. Paciorek, Mark J. Schervish
Abstract: We introduce a class of nonstationary covariance functions for Gaussian process (GP) regression. Nonstationary covariance functions allow the model to adapt to functions whose smoothness varies with the inputs. The class includes a nonstationary version of the Matérn stationary covariance, in which the differentiability of the regression function is controlled by a parameter, freeing one from fixing the differentiability in advance. In experiments, the nonstationary GP regression model performs well when the input space is two or three dimensions, outperforming a neural network model and Bayesian free-knot spline models, and competitive with a Bayesian neural network, but is outperformed in one dimension by a state-of-the-art Bayesian free-knot spline model. The model readily generalizes to non-Gaussian data. Use of computational methods for speeding GP fitting may allow for implementation of the method on larger datasets. 1
5 0.23243383 34 nips-2003-Approximate Policy Iteration with a Policy Language Bias
Author: Alan Fern, Sungwook Yoon, Robert Givan
Abstract: We explore approximate policy iteration, replacing the usual costfunction learning step with a learning step in policy space. We give policy-language biases that enable solution of very large relational Markov decision processes (MDPs) that no previous technique can solve. In particular, we induce high-quality domain-specific planners for classical planning domains (both deterministic and stochastic variants) by solving such domains as extremely large MDPs. 1
6 0.22094963 20 nips-2003-All learning is Local: Multi-agent Learning in Global Reward Games
7 0.17262597 42 nips-2003-Bounded Finite State Controllers
8 0.17086044 167 nips-2003-Robustness in Markov Decision Problems with Uncertain Transition Matrices
9 0.17008182 55 nips-2003-Distributed Optimization in Adaptive Networks
10 0.16735813 64 nips-2003-Estimating Internal Variables and Paramters of a Learning Agent by a Particle Filter
11 0.14801075 33 nips-2003-Approximate Planning in POMDPs with Macro-Actions
12 0.13961048 65 nips-2003-Extending Q-Learning to General Adaptive Multi-Agent Systems
13 0.13565448 52 nips-2003-Different Cortico-Basal Ganglia Loops Specialize in Reward Prediction at Different Time Scales
14 0.1334915 62 nips-2003-Envelope-based Planning in Relational MDPs
15 0.13142265 116 nips-2003-Linear Program Approximations for Factored Continuous-State Markov Decision Processes
16 0.098320507 68 nips-2003-Eye Movements for Reward Maximization
17 0.098176688 76 nips-2003-GPPS: A Gaussian Process Positioning System for Cellular Networks
18 0.094253302 38 nips-2003-Autonomous Helicopter Flight via Reinforcement Learning
19 0.082194149 160 nips-2003-Prediction on Spike Data Using Kernel Algorithms
20 0.076361716 26 nips-2003-An MDP-Based Approach to Online Mechanism Design
topicId topicWeight
[(0, -0.271), (1, 0.366), (2, -0.159), (3, 0.003), (4, -0.045), (5, 0.21), (6, -0.015), (7, -0.265), (8, 0.22), (9, 0.187), (10, -0.25), (11, 0.165), (12, 0.028), (13, -0.144), (14, -0.025), (15, -0.092), (16, -0.122), (17, 0.016), (18, -0.132), (19, -0.092), (20, -0.018), (21, -0.053), (22, 0.049), (23, -0.032), (24, -0.051), (25, 0.025), (26, 0.057), (27, -0.064), (28, 0.021), (29, 0.011), (30, 0.001), (31, 0.016), (32, 0.013), (33, 0.003), (34, -0.018), (35, 0.003), (36, 0.021), (37, 0.027), (38, 0.006), (39, -0.02), (40, 0.014), (41, 0.032), (42, -0.03), (43, -0.013), (44, -0.017), (45, -0.033), (46, -0.065), (47, -0.002), (48, 0.023), (49, -0.031)]
simIndex simValue paperId paperTitle
same-paper 1 0.95610869 78 nips-2003-Gaussian Processes in Reinforcement Learning
Author: Malte Kuss, Carl E. Rasmussen
Abstract: We exploit some useful properties of Gaussian process (GP) regression models for reinforcement learning in continuous state spaces and discrete time. We demonstrate how the GP model allows evaluation of the value function in closed form. The resulting policy iteration algorithm is demonstrated on a simple problem with a two dimensional state space. Further, we speculate that the intrinsic ability of GP models to characterise distributions of functions would allow the method to capture entire distributions over future values instead of merely their expectation, which has traditionally been the focus of much of reinforcement learning.
2 0.74273062 194 nips-2003-Warped Gaussian Processes
Author: Edward Snelson, Zoubin Ghahramani, Carl E. Rasmussen
Abstract: We generalise the Gaussian process (GP) framework for regression by learning a nonlinear transformation of the GP outputs. This allows for non-Gaussian processes and non-Gaussian noise. The learning algorithm chooses a nonlinear transformation such that transformed data is well-modelled by a GP. This can be seen as including a preprocessing transformation as an integral part of the probabilistic modelling problem, rather than as an ad-hoc step. We demonstrate on several real regression problems that learning the transformation can lead to significantly better performance than using a regular GP, or a GP with a fixed transformation. 1
3 0.7219494 158 nips-2003-Policy Search by Dynamic Programming
Author: J. A. Bagnell, Sham M. Kakade, Jeff G. Schneider, Andrew Y. Ng
Abstract: We consider the policy search approach to reinforcement learning. We show that if a “baseline distribution” is given (indicating roughly how often we expect a good policy to visit each state), then we can derive a policy search algorithm that terminates in a finite number of steps, and for which we can provide non-trivial performance guarantees. We also demonstrate this algorithm on several grid-world POMDPs, a planar biped walking robot, and a double-pole balancing problem. 1
4 0.68210858 141 nips-2003-Nonstationary Covariance Functions for Gaussian Process Regression
Author: Christopher J. Paciorek, Mark J. Schervish
Abstract: We introduce a class of nonstationary covariance functions for Gaussian process (GP) regression. Nonstationary covariance functions allow the model to adapt to functions whose smoothness varies with the inputs. The class includes a nonstationary version of the Matérn stationary covariance, in which the differentiability of the regression function is controlled by a parameter, freeing one from fixing the differentiability in advance. In experiments, the nonstationary GP regression model performs well when the input space is two or three dimensions, outperforming a neural network model and Bayesian free-knot spline models, and competitive with a Bayesian neural network, but is outperformed in one dimension by a state-of-the-art Bayesian free-knot spline model. The model readily generalizes to non-Gaussian data. Use of computational methods for speeding GP fitting may allow for implementation of the method on larger datasets. 1
5 0.6737861 38 nips-2003-Autonomous Helicopter Flight via Reinforcement Learning
Author: H. J. Kim, Michael I. Jordan, Shankar Sastry, Andrew Y. Ng
Abstract: Autonomous helicopter flight represents a challenging control problem, with complex, noisy, dynamics. In this paper, we describe a successful application of reinforcement learning to autonomous helicopter flight. We first fit a stochastic, nonlinear model of the helicopter dynamics. We then use the model to learn to hover in place, and to fly a number of maneuvers taken from an RC helicopter competition.
6 0.57628918 34 nips-2003-Approximate Policy Iteration with a Policy Language Bias
7 0.52862483 116 nips-2003-Linear Program Approximations for Factored Continuous-State Markov Decision Processes
8 0.49847966 167 nips-2003-Robustness in Markov Decision Problems with Uncertain Transition Matrices
9 0.4782359 76 nips-2003-GPPS: A Gaussian Process Positioning System for Cellular Networks
10 0.46427569 55 nips-2003-Distributed Optimization in Adaptive Networks
11 0.44447395 20 nips-2003-All learning is Local: Multi-agent Learning in Global Reward Games
12 0.43548337 42 nips-2003-Bounded Finite State Controllers
13 0.4252176 62 nips-2003-Envelope-based Planning in Relational MDPs
14 0.40217406 64 nips-2003-Estimating Internal Variables and Paramters of a Learning Agent by a Particle Filter
15 0.38593876 68 nips-2003-Eye Movements for Reward Maximization
16 0.38105875 65 nips-2003-Extending Q-Learning to General Adaptive Multi-Agent Systems
17 0.35539883 52 nips-2003-Different Cortico-Basal Ganglia Loops Specialize in Reward Prediction at Different Time Scales
18 0.33543009 33 nips-2003-Approximate Planning in POMDPs with Macro-Actions
19 0.33129689 110 nips-2003-Learning a World Model and Planning with a Self-Organizing, Dynamic Neural System
20 0.28945974 70 nips-2003-Fast Algorithms for Large-State-Space HMMs with Applications to Web Usage Analysis
topicId topicWeight
[(0, 0.061), (11, 0.023), (29, 0.021), (30, 0.021), (35, 0.072), (48, 0.011), (53, 0.107), (69, 0.02), (71, 0.063), (76, 0.07), (85, 0.089), (87, 0.174), (91, 0.132), (99, 0.069)]
simIndex simValue paperId paperTitle
1 0.92548162 69 nips-2003-Factorization with Uncertainty and Missing Data: Exploiting Temporal Coherence
Author: Amit Gruber, Yair Weiss
Abstract: The problem of “Structure From Motion” is a central problem in vision: given the 2D locations of certain points we wish to recover the camera motion and the 3D coordinates of the points. Under simplified camera models, the problem reduces to factorizing a measurement matrix into the product of two low rank matrices. Each element of the measurement matrix contains the position of a point in a particular image. When all elements are observed, the problem can be solved trivially using SVD, but in any realistic situation many elements of the matrix are missing and the ones that are observed have a different directional uncertainty. Under these conditions, most existing factorization algorithms fail while human perception is relatively unchanged. In this paper we use the well known EM algorithm for factor analysis to perform factorization. This allows us to easily handle missing data and measurement uncertainty and more importantly allows us to place a prior on the temporal trajectory of the latent variables (the camera position). We show that incorporating this prior gives a significant improvement in performance in challenging image sequences. 1
same-paper 2 0.8831588 78 nips-2003-Gaussian Processes in Reinforcement Learning
Author: Malte Kuss, Carl E. Rasmussen
Abstract: We exploit some useful properties of Gaussian process (GP) regression models for reinforcement learning in continuous state spaces and discrete time. We demonstrate how the GP model allows evaluation of the value function in closed form. The resulting policy iteration algorithm is demonstrated on a simple problem with a two dimensional state space. Further, we speculate that the intrinsic ability of GP models to characterise distributions of functions would allow the method to capture entire distributions over future values instead of merely their expectation, which has traditionally been the focus of much of reinforcement learning.
3 0.78402674 158 nips-2003-Policy Search by Dynamic Programming
Author: J. A. Bagnell, Sham M. Kakade, Jeff G. Schneider, Andrew Y. Ng
Abstract: We consider the policy search approach to reinforcement learning. We show that if a “baseline distribution” is given (indicating roughly how often we expect a good policy to visit each state), then we can derive a policy search algorithm that terminates in a finite number of steps, and for which we can provide non-trivial performance guarantees. We also demonstrate this algorithm on several grid-world POMDPs, a planar biped walking robot, and a double-pole balancing problem. 1
4 0.77072299 116 nips-2003-Linear Program Approximations for Factored Continuous-State Markov Decision Processes
Author: Milos Hauskrecht, Branislav Kveton
Abstract: Approximate linear programming (ALP) has emerged recently as one of the most promising methods for solving complex factored MDPs with finite state spaces. In this work we show that ALP solutions are not limited only to MDPs with finite state spaces, but that they can also be applied successfully to factored continuous-state MDPs (CMDPs). We show how one can build an ALP-based approximation for such a model and contrast it to existing solution methods. We argue that this approach offers a robust alternative for solving high dimensional continuous-state space problems. The point is supported by experiments on three CMDP problems with 24-25 continuous state factors. 1
5 0.76952791 20 nips-2003-All learning is Local: Multi-agent Learning in Global Reward Games
Author: Yu-han Chang, Tracey Ho, Leslie P. Kaelbling
Abstract: In large multiagent games, partial observability, coordination, and credit assignment persistently plague attempts to design good learning algorithms. We provide a simple and efficient algorithm that in part uses a linear system to model the world from a single agent’s limited perspective, and takes advantage of Kalman filtering to allow an agent to construct a good training signal and learn an effective policy. 1
6 0.76846564 101 nips-2003-Large Margin Classifiers: Convex Loss, Low Noise, and Convergence Rates
7 0.76737076 107 nips-2003-Learning Spectral Clustering
8 0.76330334 189 nips-2003-Tree-structured Approximations by Expectation Propagation
9 0.76273376 191 nips-2003-Unsupervised Context Sensitive Language Acquisition from a Large Corpus
10 0.76081216 103 nips-2003-Learning Bounds for a Generalized Family of Bayesian Posterior Distributions
11 0.76018125 113 nips-2003-Learning with Local and Global Consistency
12 0.75851715 4 nips-2003-A Biologically Plausible Algorithm for Reinforcement-shaped Representational Learning
13 0.75806499 30 nips-2003-Approximability of Probability Distributions
14 0.75602436 34 nips-2003-Approximate Policy Iteration with a Policy Language Bias
15 0.75559461 55 nips-2003-Distributed Optimization in Adaptive Networks
16 0.75191331 174 nips-2003-Semidefinite Relaxations for Approximate Inference on Graphs with Cycles
17 0.75113094 65 nips-2003-Extending Q-Learning to General Adaptive Multi-Agent Systems
18 0.75097412 126 nips-2003-Measure Based Regularization
19 0.75047988 54 nips-2003-Discriminative Fields for Modeling Spatial Dependencies in Natural Images
20 0.74932146 143 nips-2003-On the Dynamics of Boosting