nips nips2000 nips2000-113 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Jun Morimoto, Kenji Doya
Abstract: This paper proposes a new reinforcement learning (RL) paradigm that explicitly takes into account input disturbance as well as modeling errors. The use of environmental models in RL is quite popular for both off-line learning by simulations and for on-line action planning. However, the difference between the model and the real environment can lead to unpredictable, often unwanted results. Based on the theory of H∞ control, we consider a differential game in which a 'disturbing' agent (disturber) tries to make the worst possible disturbance while a 'control' agent (actor) tries to make the best control input. The problem is formulated as finding a min-max solution of a value function that takes into account the norm of the output deviation and the norm of the disturbance. We derive on-line learning algorithms for estimating the value function and for calculating the worst disturbance and the best control in reference to the value function. We tested the paradigm, which we call "Robust Reinforcement Learning (RRL)," in the task of an inverted pendulum. In the linear domain, the policy and the value function learned by the on-line algorithms coincided with those derived analytically by the linear H∞ theory. For a fully nonlinear swing-up task, the control by RRL achieved robust performance against changes in the pendulum weight and friction, while a standard RL control could not deal with such environmental changes.
Reference: text
sentIndex sentText sentNum sentScore
1 Abstract This paper proposes a new reinforcement learning (RL) paradigm that explicitly takes into account input disturbance as well as modeling errors. [sent-7, score-0.809]
2 The use of environmental models in RL is quite popular for both off-line learning by simulations and for on-line action planning. [sent-8, score-0.148]
3 However, the difference between the model and the real environment can lead to unpredictable, often unwanted results. [sent-9, score-0.032]
4 Based on the theory of H∞ control, we consider a differential game in which a 'disturbing' agent (disturber) tries to make the worst possible disturbance while a 'control' agent (actor) tries to make the best control input. [sent-10, score-0.922]
5 The problem is formulated as finding a minmax solution of a value function that takes into account the norm of the output deviation and the norm of the disturbance. [sent-11, score-0.297]
6 We derive on-line learning algorithms for estimating the value function and for calculating the worst disturbance and the best control in reference to the value function. [sent-12, score-0.976]
7 We tested the paradigm, which we call "Robust Reinforcement Learning (RRL)," in the task of inverted pendulum. [sent-13, score-0.101]
8 In the linear domain, the policy and the value function learned by the on-line algorithms coincided with those derived analytically by the linear H∞ theory. [sent-14, score-0.227]
9 For a fully nonlinear swingup task, the control by RRL achieved robust performance against changes in the pendulum weight and friction while a standard RL control could not deal with such environmental changes. [sent-15, score-0.961]
10 1 Introduction In this study, we propose a new reinforcement learning paradigm that we call "Robust Reinforcement Learning (RRL). [sent-16, score-0.269]
11 " Plain, model-free reinforcement learning (RL) is desperately slow to be applied to on-line learning of real-world problems. [sent-17, score-0.229]
12 Thus the use of environmental models has been quite common both for on-line action planning [3] and for off-line learning by simulation [4]. [sent-18, score-0.178]
13 However, no model can be perfect and modeling errors can cause unpredictable results, sometimes worse than with no model at all. [sent-19, score-0.09]
14 In fact, robustness against model uncertainty has been the main subject of research in the control community for the last twenty years, and the result is formalized as the H∞ control theory [6]. [sent-20, score-0.277]
15 In general, a modeling error causes a deviation of the real system state from the state predicted by the model. [sent-21, score-0.097]
16 This can be re-interpreted as a disturbance to the model. [sent-22, score-0.497]
17 However, the problem is that the disturbance due to a modeling error can have strong correlations, and thus the standard Gaussian assumption may not be valid. [sent-23, score-0.607]
18 The basic strategy to achieve robustness is to keep the sensitivity γ of the feedback control loop against a disturbance input small enough so that any disturbance due to the modeling error can be suppressed if the gain of the mapping from the state error to the disturbance is bounded by 1/γ. [sent-24, score-1.871]
19 In the H∞ paradigm, those 'disturbance-to-error' and 'error-to-disturbance' gains are measured by the max norms of the functional mappings in order to assure stability for any mode of disturbance. [sent-25, score-0.041]
20 In the following, we briefly introduce the H∞ paradigm and show that the design of a robust controller can be achieved by finding a min-max solution of a value function, which is formulated as the Hamilton-Jacobi-Isaacs (HJI) equation. [sent-26, score-0.672]
21 We then derive on-line algorithms for estimating the value functions and for simultaneously deriving the worst disturbance and the best control that, respectively, maximize and minimize the value function. [sent-27, score-0.968]
22 We test the validity of the algorithms first in a linear inverted pendulum task. [sent-28, score-0.382]
23 It is verified that the value function as well as the disturbance and control policies derived by the on-line algorithm coincide with the solution of the Riccati equations given by H∞ theory. [sent-29, score-0.77]
24 We then compare the performance of the robust RL algorithm with a standard model-based RL in a nonlinear task of pendulum swing-up [3]. [sent-30, score-0.531]
25 It is shown that the robust RL controller can accommodate changes in the weight and the friction of the pendulum, which a standard RL controller cannot cope with. [sent-31, score-0.968]
26 Figure 1: (a) Generalized Plant and Controller, (b) Small Gain Theorem. The standard H∞ control [6] deals with the system shown in Fig. [sent-34, score-0.063]
27 1(a), where G is the plant, K is the controller, u is the control input, y is the measurement available to the controller (in the following, we assume all the states are observable, i.e. [sent-35, score-0.4]
28 y = x), w is the unknown disturbance, and z is the error output that is desired to be kept small. [sent-37, score-0.065]
29 In general, the controller K is designed to stabilize the closed loop system based on a model of the plant G. [sent-38, score-0.506]
30 However, when there is a discrepancy between the model and the actual plant dynamics, the feedback loop could be unstable. [sent-39, score-0.156]
31 The effect of modeling error can be equivalently represented as a disturbance w generated by an unknown mapping Δ of the plant output z, as shown in Fig. 1(b). [sent-40, score-0.697]
32 The goal of the H∞ control problem is to design a controller K that brings the error z to zero while minimizing the H∞ norm of the closed-loop transfer function from the disturbance w to the output z, $\|T_{zw}\|_\infty = \sup_{w \neq 0} \frac{\|z\|_2}{\|w\|_2} = \sup_\omega \bar{\sigma}(T_{zw}(j\omega)). \quad (1)$ Here $\|\cdot\|_2$ denotes the $L_2$ norm and $\bar{\sigma}$ denotes the maximum singular value. [sent-42, score-0.986]
33 The small gain theorem assures that if $\|T_{zw}\|_\infty \le \gamma$, then the system shown in Fig. [sent-43, score-0.067]
34 1(b) will be stable for any stable mapping $\Delta: z \mapsto w$ with $\|\Delta\|_\infty < 1/\gamma$. [sent-44, score-0.056]
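To make the quantities in (1) and the small gain condition concrete, the following is a minimal numerical sketch (not from the paper): it estimates the H∞ norm of a stable linear system (A, B, C, D) by sweeping the largest singular value of the transfer matrix over a frequency grid. The system matrices and the frequency range are placeholders supplied by the user.

import numpy as np

def hinf_norm_estimate(A, B, C, D, omegas=np.logspace(-3, 3, 2000)):
    # Estimate ||T_zw||_inf by sweeping sigma_max(C (jwI - A)^{-1} B + D) over
    # frequency, as in the definition (1). A finite sweep only lower-bounds
    # the true norm, so treat the result as an estimate, not a certificate.
    n = A.shape[0]
    worst = 0.0
    for om in omegas:
        T = C @ np.linalg.solve(1j * om * np.eye(n) - A, B) + D
        worst = max(worst, np.linalg.svd(T, compute_uv=False)[0])
    # Small gain theorem: the loop stays stable for any stable Delta
    # with ||Delta||_inf < 1 / ||T_zw||_inf.
    return worst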
35 2.1 Min-max Solution to the H∞ Problem We consider a dynamical system $\dot{x} = f(x, u, w)$. [sent-46, score-0.054]
36 Thus the optimal value function V* is defined as $V^* = \min_u \max_w \int_0^\infty \left(z^T(t)z(t) - \gamma^2 w^T(t)w(t)\right) dt. \quad (4)$ [sent-48, score-0.131]
37 The condition for the optimal value function is given by $0 = \min_u \max_w \left[ z^T z - \gamma^2 w^T w + \frac{\partial V^*}{\partial x} f(x, u, w) \right], \quad (5)$ which is known as the Hamilton-Jacobi-Isaacs (HJI) equation. [sent-49, score-0.06]
38 From (5), we can derive the optimal control output $u_{op}$ and the worst disturbance $w_{op}$ by solving $\frac{\partial z^T z}{\partial u} + \frac{\partial V}{\partial x}\frac{\partial f(x, u, w)}{\partial u} = 0$ and $\frac{\partial z^T z}{\partial w} - 2\gamma^2 w^T + \frac{\partial V}{\partial x}\frac{\partial f(x, u, w)}{\partial w} = 0. \quad (6)$ [sent-50, score-0.966]
39 3 Robust Reinforcement Learning Here we consider a continuous-time formulation of reinforcement learning [3] with the system dynamics $\dot{x} = f(x, u)$ and the reward $r(x, u)$. [sent-51, score-0.361]
40 The basic goal is to find a policy $u = g(x)$ that maximizes the cumulative future reward $\int_t^\infty e^{-\frac{s-t}{\tau}} r(x(s), u(s))\, ds$. [sent-52, score-0.207]
41 However, a particular policy that was optimized for a certain environment may perform badly when the environmental setting changes. [sent-54, score-0.192]
42 In order to assure robust performance under a changing environment or unknown disturbance, we introduce the notion of the worst disturbance in H∞ control into the reinforcement learning paradigm. [sent-55, score-1.196]
43 In this framework, we consider an augmented reward $q(t) = r(x(t), u(t)) + s(w(t)), \quad (7)$ where $s(w(t))$ is an additional reward for withstanding a disturbing input, for example, $s(w) = \gamma^2 w^T w$. [sent-56, score-0.208]
44 The augmented value function is then defined as $V(x(t)) = \int_t^\infty e^{-\frac{s-t}{\tau}} q(x(s), u(s), w(s))\, ds. \quad (8)$ [sent-57, score-0.09]
45 The optimal value function is given by the solution of a variant of the HJI equation $\frac{1}{\tau} V^*(x) = \max_u \min_w \left[ r(x, u) + s(w) + \frac{\partial V^*}{\partial x} f(x, u, w) \right]. \quad (9)$ [sent-58, score-0.125]
46 Note that we cannot find appropriate policies (i.e. [sent-59, score-0.077]
47 We use a function approximator to implement the value function $V(x(t); v)$, where v is a parameter vector. [sent-62, score-0.092]
48 As in the standard continuous-time RL, we define the eligibility trace for a parameter $v_i$ as $e_i(s) = \int^s e^{-\frac{s-t}{\kappa}} \frac{\partial V(x(t); v)}{\partial v_i}\, dt$ and the update rule as $\dot{e}_i(t) = -\frac{1}{\kappa} e_i(t) + \frac{\partial V(x(t); v)}{\partial v_i}$, where $\kappa$ is the time constant of the eligibility trace [3]. [sent-63, score-0.15]
49 We can then derive the learning rule for the value function approximator [3] as $\dot{v}_i = \eta\, \delta(t)\, e_i(t)$, where $\eta$ denotes the learning rate and $\delta(t)$ is the continuous-time TD error. [sent-64, score-0.202]
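As a concrete reading of this learning rule, here is a minimal sketch (not the authors' code) of one Euler step of the critic update, assuming a value function that is linear in some feature vector phi(x), so that dV/dv_i = phi_i. The step sizes, time constants, and function names are illustrative placeholders.

import numpy as np

def critic_step(v, e, phi, phi_next, q, dt, tau=1.0, kappa=0.1, eta=0.5):
    # v: critic parameters, e: eligibility traces (same shape as v)
    # phi, phi_next: features of x(t) and x(t+dt); q: augmented reward q(t) from (7)
    V = v @ phi
    V_next = v @ phi_next
    dVdt = (V_next - V) / dt
    # Continuous-time TD error consistent with the discounted value function (8)
    delta = q - V / tau + dVdt
    # Eligibility trace dynamics: de/dt = -e/kappa + dV/dv
    e = e + dt * (-e / kappa + phi)
    # Learning rule: dv/dt = eta * delta * e
    v = v + dt * eta * delta * e
    return v, e, delta

The returned delta is the TD error that also drives the actor and disturber updates described below.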
50 Note that we do not assume f(x = 0) = 0, because the error output z is generalized as the reward r(x, u) in the RRL framework. [sent-65, score-0.154]
51 3.1 Actor-disturber-critic We propose the actor-disturber-critic architecture, by which we can implement robust RL in a model-free fashion, analogous to the actor-critic architecture [1]. [sent-67, score-0.182]
52 We define the policies of the actor and the disturber as $u(t) = A_u(x(t); v^u) + n_u(t)$ and $w(t) = A_w(x(t); v^w) + n_w(t)$, respectively, where $A_u(x(t); v^u)$ and $A_w(x(t); v^w)$ are function approximators with parameter vectors $v^u$ and $v^w$, and $n_u(t)$ and $n_w(t)$ are noise terms for exploration. [sent-68, score-0.289]
53 The parameters of the actor and the disturber are updated by $\dot{v}^u_i = \eta_u\, \delta(t)\, n_u(t)\, \frac{\partial A_u(x(t); v^u)}{\partial v^u_i}$ and $\dot{v}^w_i = -\eta_w\, \delta(t)\, n_w(t)\, \frac{\partial A_w(x(t); v^w)}{\partial v^w_i}, \quad (10)$ where $\eta_u$ and $\eta_w$ denote the learning rates (the opposite sign of the disturber update follows from the max-min formulation (9)). [sent-69, score-0.287]
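The following sketch (illustrative only, not the authors' code) spells out the updates (10) for the simplest case of a linear actor and disturber, A_u(x) = v^u · x and A_w(x) = v^w · x, so that the policy gradients reduce to x. The learning rates are placeholders, and the minus sign on the disturber update is an assumption consistent with the max-min formulation (9).

def actor_disturber_step(vu, vw, x, delta, nu, nw, dt, eta_u=0.1, eta_w=0.1):
    # vu, vw: parameter vectors of the linear actor and disturber (numpy arrays)
    # x: state vector, delta: TD error from the critic
    # nu, nw: the exploration noises that were actually added to u(t) and w(t)
    vu = vu + dt * eta_u * delta * nu * x   # actor climbs the value function
    vw = vw - dt * eta_w * delta * nw * x   # disturber descends it (assumed sign)
    return vu, vw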
54 3.2 Robust Policy by Value Gradient Now we assume that an input-affine model of the system dynamics and quadratic models of the costs for the inputs are available: $\dot{x} = f(x) + g_1(x) w + g_2(x) u$, $r(x, u) = Q(x) - u^T R(x) u$, $s(w) = \gamma^2 w^T w$. [sent-71, score-0.121]
55 In this case, we can derive the best action and the worst disturbance in reference to the value function V as $u_{op} = \frac{1}{2} R(x)^{-1} g_2^T(x) \left(\frac{\partial V}{\partial x}\right)^T$ and $w_{op} = -\frac{1}{2\gamma^2} g_1^T(x) \left(\frac{\partial V}{\partial x}\right)^T. \quad (11)$ We can compute the policy (11) using the value gradient $\frac{\partial V}{\partial x}$ derived from the value function approximator. [sent-72, score-1.068]
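A minimal sketch (not the paper's code) of computing the best action and worst disturbance in (11) from the value gradient, assuming the input-affine model above; value_grad, g1, g2, and R are user-supplied placeholder functions, and the sign of w_op follows from minimizing over w in (9).

import numpy as np

def robust_policy(x, value_grad, g1, g2, R, gamma):
    dVdx = value_grad(x)                               # dV/dx as a vector of shape (n,)
    u_op = 0.5 * np.linalg.solve(R(x), g2(x).T @ dVdx)  # best action
    w_op = -(g1(x).T @ dVdx) / (2.0 * gamma**2)          # worst-case disturbance
    return u_op, w_op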
56 The dynamics of the pendulum is given by $m l^2 \ddot{\theta} = -\mu \dot{\theta} + m g l \sin\theta + T$, where $\theta$ is the angle from the upright position, T is the input torque, and μ = 0. [sent-75, score-0.383]
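For concreteness, here is a minimal Euler-integration sketch of this pendulum model; the parameter values (mass, length, friction, gravity, time step) are placeholders and not necessarily those used in the paper.

import numpy as np

def pendulum_step(theta, theta_dot, torque, dt=0.02, m=1.0, l=1.0, mu=0.01, g=9.81):
    # m*l^2 * theta_dd = -mu*theta_dot + m*g*l*sin(theta) + torque
    # (theta is measured from the upright position)
    theta_dd = (-mu * theta_dot + m * g * l * np.sin(theta) + torque) / (m * l**2)
    theta_dot = theta_dot + dt * theta_dd
    theta = theta + dt * theta_dot
    return theta, theta_dot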
57 4.1 Linear Case We first considered a linear problem in order to test whether the value function and the policy learned by robust RL coincide with the analytic solution of the H∞ control problem. [sent-82, score-0.458]
58 Thus we use locally linearized dynamics near the unstable equilibrium point $x = (0, 0)^T$. [sent-83, score-0.089]
59 The matrices for the linear model, obtained by linearizing the pendulum dynamics, are $A = \begin{pmatrix} 0 & 1 \\ g/l & -\mu/(ml^2) \end{pmatrix}$, $B_1 = B_2 = \begin{pmatrix} 0 \\ 1/(ml^2) \end{pmatrix}$, together with a 2×2 state-cost matrix Q and R = 1. $\quad (14)$ [sent-84, score-0.024]
60 The reward function is given by $q(t) = -x^T Q x - u^2 + \gamma^2 w^2$, where the robustness criterion γ = 2. [sent-85, score-0.089]
61 The value function $V = -x^T P x$ is parameterized by a symmetric matrix P. [sent-87, score-0.129]
62 Note that we used a pre-designed stabilizing controller as the initial setting of the RRL controller for stable learning [2]. [sent-90, score-0.62]
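The Riccati equation (12) and the policies (13) are not reproduced in this extraction; as a point of comparison, the following is a minimal sketch (an assumption-laden stand-in, not the paper's derivation) of the standard game-type algebraic Riccati equation for the linear H∞ problem, solved by stacking disturbance and control into one input with an indefinite weight. The matrices passed in are placeholders.

import numpy as np
from scipy.linalg import solve_continuous_are

def hinf_riccati(A, B1, B2, Q, R, gamma):
    # Solve A'P + PA + Q - P (B2 R^{-1} B2' - gamma^{-2} B1 B1') P = 0 via the
    # equivalent CARE with B = [B1 B2] and block weight diag(-gamma^2 I, R).
    # This assumes gamma is above the optimal attenuation level, so that a
    # stabilizing solution exists and the solver can handle the indefinite weight.
    B = np.hstack([B1, B2])
    m1, m2 = B1.shape[1], B2.shape[1]
    R_game = np.block([
        [-gamma**2 * np.eye(m1), np.zeros((m1, m2))],
        [np.zeros((m2, m1)), R],
    ])
    P = solve_continuous_are(A, B, Q, R_game)
    # Standard H-infinity state-feedback gains (cf. the policies derived from (12)):
    u_gain = -np.linalg.solve(R, B2.T @ P)   # u = u_gain @ x
    w_gain = (B1.T @ P) / gamma**2           # w = w_gain @ x
    return P, u_gain, w_gain

In the linear pendulum case, P corresponds (up to the sign convention of V = -x^T P x above) to the matrix that the learned value-function parameters are compared against.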
63 4.1.1 Learning of the value function Here we used the policy by value gradient shown in section 3.2. [sent-93, score-0.119]
64 Figure 2(a) shows that each element of the vector p converged to the solution of the Riccati equation (12). [sent-95, score-0.094]
65 4.1.2 Actor-disturber-critic Here we used robust RL implemented by the actor-disturber-critic shown in section 3.1. [sent-98, score-0.182]
66 In the linear case, the actor and the disturber are represented as the linear controllers $A_u(x; v^u) = v^u x$ and $A_w(x; v^w) = v^w x$, respectively. [sent-100, score-0.301]
67 The actor and the disturber almost converged to the policies in (13), which are derived from the Riccati equation (12) (Fig. 2(b)). [sent-101, score-0.403]
68 The dash-dotted lines show the solution of the Riccati equation. [sent-106, score-0.036]
69 4.2 Applying Robust RL to Non-linear Dynamics We consider the non-linear dynamical system (11), where $f(x) = \begin{pmatrix} \dot{\theta} \\ \frac{g}{l}\sin\theta - \frac{\mu}{ml^2}\dot{\theta} \end{pmatrix}$, $g_1(x) = g_2(x) = \begin{pmatrix} 0 \\ \frac{1}{ml^2} \end{pmatrix}$, $Q(x) = \cos\theta - 1$, $R(x) = 0. \quad (15)$ [sent-108, score-0.054]
70 From (7) and (15), the reward function is given by $q(t) = \cos\theta - 1 - 0.$ [sent-110, score-0.089]
71 For approximating the value function, we used a Normalized Gaussian Network (NGnet) [3]. [sent-114, score-0.06]
72 Note that the input gain g(x) was also learned[3]. [sent-115, score-0.042]
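A minimal sketch of a Normalized Gaussian Network value-function approximator with fixed, hand-chosen centers and widths (the paper's NGnet [3] also learns other quantities, e.g. the input gain, so this is only a simplified stand-in). Because the output is linear in the weights, it plugs directly into the critic update sketched earlier with phi = features(x).

import numpy as np

class NGnet:
    def __init__(self, centers, widths):
        self.centers = np.asarray(centers)    # (K, n) Gaussian centers
        self.widths = np.asarray(widths)      # (K, n) per-dimension widths
        self.w = np.zeros(len(self.centers))  # linear output weights

    def features(self, x):
        d = (x - self.centers) / self.widths
        b = np.exp(-0.5 * np.sum(d**2, axis=1))
        return b / np.sum(b)                  # normalized Gaussian activations

    def value(self, x):
        return self.w @ self.features(x)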
73 Fig. 3 shows the value functions acquired by robust RL and standard model-based RL [3]. [sent-117, score-0.309]
74 The value function acquired by robust RL has a sharper ridge (Fig. [sent-118, score-0.271]
75 3(a)), which attracts swing-up trajectories more strongly than the one learned with standard RL. [sent-119, score-0.193]
76 In Fig. 4, we compared the robustness of the robust RL and the standard RL controllers. [sent-120, score-0.289]
77 Both the robust RL controller and the standard RL controller learned to swing up and hold a pendulum with the weight m = 1. [sent-121, score-1.259]
78 The robust RL controller could successfully swing up pendulums with a different weight m = 3. [sent-124, score-0.614]
79 This result showed the robustness of the robust RL controller. [sent-127, score-0.279]
80 The standard RL controller could achieve the task in fewer swings for m = 1. [sent-128, score-0.361]
81 However, the standard RL controller could not swing up the pendulum with a different weight and friction (Fig. 4(b)). [sent-131, score-0.876]
82 Figure 3: Shape of the value function after 1000 learning trials with m = 1. (a) Robust RL; (b) Standard RL. [sent-134, score-0.119]
83 Figure 4: Swing-up trajectories of the pendulum with different weight and friction. [sent-152, score-0.344]
84 5 Conclusions In this study, we proposed a new RL paradigm called "Robust Reinforcement Learning (RRL). [sent-154, score-0.074]
85 " We showed that RRL can learn analytic solution of the 1-l oo controller in the linearized inverted pendulum dynamics and also showed that RRL can deal with modeling error which standard RL can not deal with in the non-linear inverted pendulum swing-up simulation example. [sent-155, score-1.383]
86 We will apply RRL to more complex tasks such as learning stand-up behavior [4]. [sent-156, score-0.061]
87 [1] A. G. Barto, R. S. Sutton, and C. W. Anderson. Neuronlike adaptive elements that can solve difficult learning control problems. IEEE Transactions on Systems, Man, and Cybernetics, 13, 1983. [sent-164, score-0.172]
88 [4] J. Morimoto and K. Doya. Acquisition of stand-up behavior by a real robot using hierarchical reinforcement learning. [sent-185, score-0.161]
wordName wordTfidf (topN-words)
[('disturbance', 0.497), ('rl', 0.358), ('controller', 0.296), ('pendulum', 0.284), ('rrl', 0.237), ('robust', 0.182), ('reinforcement', 0.161), ('worst', 0.145), ('disturber', 0.142), ('friction', 0.122), ('figa', 0.118), ('hji', 0.118), ('actor', 0.111), ('control', 0.104), ('swing', 0.102), ('plant', 0.092), ('policy', 0.092), ('reward', 0.089), ('riccati', 0.082), ('paradigm', 0.074), ('kg', 0.074), ('inverted', 0.074), ('minmax', 0.071), ('nu', 0.071), ('ricatti', 0.071), ('robustness', 0.069), ('environmental', 0.068), ('loop', 0.064), ('ov', 0.061), ('value', 0.06), ('yu', 0.055), ('dynamics', 0.052), ('doya', 0.047), ('hikaridai', 0.047), ('jst', 0.047), ('kyoto', 0.047), ('morimoto', 0.047), ('nw', 0.047), ('ou', 0.047), ('ozt', 0.047), ('unpredictable', 0.047), ('upright', 0.047), ('action', 0.046), ('quadratic', 0.044), ('modeling', 0.043), ('ei', 0.043), ('aw', 0.043), ('gain', 0.042), ('derive', 0.042), ('assure', 0.041), ('eligibility', 0.041), ('ux', 0.041), ('vw', 0.041), ('ox', 0.038), ('standard', 0.038), ('coincides', 0.037), ('au', 0.037), ('linearized', 0.037), ('solution', 0.036), ('policies', 0.036), ('output', 0.036), ('norm', 0.035), ('learning', 0.034), ('yw', 0.034), ('ow', 0.034), ('best', 0.034), ('weight', 0.034), ('coefficient', 0.034), ('elements', 0.034), ('op', 0.032), ('game', 0.032), ('approximator', 0.032), ('zt', 0.032), ('environment', 0.032), ('japan', 0.03), ('trace', 0.03), ('px', 0.03), ('augmented', 0.03), ('tries', 0.03), ('simulation', 0.03), ('equation', 0.029), ('closed', 0.029), ('converged', 0.029), ('acquired', 0.029), ('dynamical', 0.029), ('error', 0.029), ('stable', 0.028), ('showed', 0.028), ('learned', 0.027), ('task', 0.027), ('criteria', 0.027), ('maximizes', 0.026), ('cos', 0.026), ('trajectories', 0.026), ('system', 0.025), ('deal', 0.025), ('trials', 0.025), ('agent', 0.025), ('linear', 0.024), ('formulated', 0.024)]
simIndex simValue paperId paperTitle
same-paper 1 1.0000002 113 nips-2000-Robust Reinforcement Learning
Author: Jun Morimoto, Kenji Doya
2 0.12127224 39 nips-2000-Decomposition of Reinforcement Learning for Admission Control of Self-Similar Call Arrival Processes
Author: Jakob Carlström
Abstract: This paper presents predictive gain scheduling, a technique for simplifying reinforcement learning problems by decomposition. Link admission control of self-similar call traffic is used to demonstrate the technique. The control problem is decomposed into on-line prediction of near-future call arrival rates, and precomputation of policies for Poisson call arrival processes. At decision time, the predictions are used to select among the policies. Simulations show that this technique results in significantly faster learning without any performance loss, compared to a reinforcement learning controller that does not decompose the problem. 1
3 0.11751398 73 nips-2000-Kernel-Based Reinforcement Learning in Average-Cost Problems: An Application to Optimal Portfolio Choice
Author: Dirk Ormoneit, Peter W. Glynn
Abstract: Many approaches to reinforcement learning combine neural networks or other parametric function approximators with a form of temporal-difference learning to estimate the value function of a Markov Decision Process. A significant disadvantage of those procedures is that the resulting learning algorithms are frequently unstable. In this work, we present a new, kernel-based approach to reinforcement learning which overcomes this difficulty and provably converges to a unique solution. By contrast to existing algorithms, our method can also be shown to be consistent in the sense that its costs converge to the optimal costs asymptotically. Our focus is on learning in an average-cost framework and on a practical application to the optimal portfolio choice problem. 1
4 0.1021762 28 nips-2000-Balancing Multiple Sources of Reward in Reinforcement Learning
Author: Christian R. Shelton
Abstract: For many problems which would be natural for reinforcement learning, the reward signal is not a single scalar value but has multiple scalar components. Examples of such problems include agents with multiple goals and agents with multiple users. Creating a single reward value by combining the multiple components can throw away vital information and can lead to incorrect solutions. We describe the multiple reward source problem and discuss the problems with applying traditional reinforcement learning. We then present a new algorithm for finding a solution and results on simulated environments.
5 0.099452682 142 nips-2000-Using Free Energies to Represent Q-values in a Multiagent Reinforcement Learning Task
Author: Brian Sallans, Geoffrey E. Hinton
Abstract: The problem of reinforcement learning in large factored Markov decision processes is explored. The Q-value of a state-action pair is approximated by the free energy of a product of experts network. Network parameters are learned on-line using a modified SARSA algorithm which minimizes the inconsistency of the Q-values of consecutive state-action pairs. Actions are chosen based on the current value estimates by fixing the current state and sampling actions from the network using Gibbs sampling. The algorithm is tested on a co-operative multi-agent task. The product of experts model is found to perform comparably to table-based Q-learning for small instances of the task, and continues to perform well when the problem becomes too large for a table-based representation.
6 0.080194011 112 nips-2000-Reinforcement Learning with Function Approximation Converges to a Region
7 0.061035071 26 nips-2000-Automated State Abstraction for Options using the U-Tree Algorithm
8 0.055963185 43 nips-2000-Dopamine Bonuses
9 0.055928502 105 nips-2000-Programmable Reinforcement Learning Agents
10 0.053994577 1 nips-2000-APRICODD: Approximate Policy Construction Using Decision Diagrams
11 0.053350091 63 nips-2000-Hierarchical Memory-Based Reinforcement Learning
12 0.050218686 48 nips-2000-Exact Solutions to Time-Dependent MDPs
13 0.049464393 57 nips-2000-Four-legged Walking Gait Control Using a Neuromorphic Chip Interfaced to a Support Vector Learning Algorithm
14 0.047936179 101 nips-2000-Place Cells and Spatial Navigation Based on 2D Visual Feature Extraction, Path Integration, and Reinforcement Learning
15 0.043869819 104 nips-2000-Processing of Time Series by Neural Circuits with Biologically Realistic Synaptic Dynamics
16 0.035968885 49 nips-2000-Explaining Away in Weight Space
17 0.034601923 90 nips-2000-New Approaches Towards Robust and Adaptive Speech Recognition
18 0.031822979 20 nips-2000-Algebraic Information Geometry for Learning Machines with Singularities
19 0.031707432 129 nips-2000-Temporally Dependent Plasticity: An Information Theoretic Account
20 0.031630743 37 nips-2000-Convergence of Large Margin Separable Linear Classification
topicId topicWeight
[(0, 0.129), (1, -0.044), (2, 0.042), (3, -0.189), (4, -0.154), (5, -0.01), (6, -0.035), (7, -0.007), (8, -0.026), (9, -0.053), (10, -0.046), (11, 0.009), (12, -0.017), (13, -0.049), (14, 0.008), (15, -0.045), (16, -0.007), (17, 0.012), (18, 0.06), (19, -0.011), (20, 0.078), (21, -0.019), (22, -0.054), (23, 0.018), (24, -0.051), (25, -0.044), (26, -0.013), (27, -0.069), (28, 0.101), (29, 0.087), (30, 0.277), (31, -0.038), (32, -0.096), (33, -0.127), (34, -0.062), (35, -0.136), (36, 0.108), (37, 0.074), (38, 0.051), (39, -0.061), (40, 0.006), (41, 0.018), (42, 0.232), (43, -0.081), (44, 0.047), (45, 0.044), (46, -0.105), (47, -0.078), (48, -0.008), (49, -0.011)]
simIndex simValue paperId paperTitle
same-paper 1 0.94856018 113 nips-2000-Robust Reinforcement Learning
Author: Jun Morimoto, Kenji Doya
Author: Jakob Carlström
Abstract: This paper presents predictive gain scheduling, a technique for simplifying reinforcement learning problems by decomposition. Link admission control of self-similar call traffic is used to demonstrate the technique. The control problem is decomposed into on-line prediction of near-future call arrival rates, and precomputation of policies for Poisson call arrival processes. At decision time, the predictions are used to select among the policies. Simulations show that this technique results in significantly faster learning without any performance loss, compared to a reinforcement learning controller that does not decompose the problem. 1
3 0.62200171 73 nips-2000-Kernel-Based Reinforcement Learning in Average-Cost Problems: An Application to Optimal Portfolio Choice
Author: Dirk Ormoneit, Peter W. Glynn
Abstract: Many approaches to reinforcement learning combine neural networks or other parametric function approximators with a form of temporal-difference learning to estimate the value function of a Markov Decision Process. A significant disadvantage of those procedures is that the resulting learning algorithms are frequently unstable. In this work, we present a new, kernel-based approach to reinforcement learning which overcomes this difficulty and provably converges to a unique solution. By contrast to existing algorithms, our method can also be shown to be consistent in the sense that its costs converge to the optimal costs asymptotically. Our focus is on learning in an average-cost framework and on a practical application to the optimal portfolio choice problem. 1
4 0.4250603 112 nips-2000-Reinforcement Learning with Function Approximation Converges to a Region
Author: Geoffrey J. Gordon
Abstract: Many algorithms for approximate reinforcement learning are not known to converge. In fact, there are counterexamples showing that the adjustable weights in some algorithms may oscillate within a region rather than converging to a point. This paper shows that, for two popular algorithms, such oscillation is the worst that can happen: the weights cannot diverge, but instead must converge to a bounded region. The algorithms are SARSA(0) and V(0); the latter algorithm was used in the well-known TD-Gammon program. 1
5 0.39793745 48 nips-2000-Exact Solutions to Time-Dependent MDPs
Author: Justin A. Boyan, Michael L. Littman
Abstract: We describe an extension of the Markov decision process model in which a continuous time dimension is included in the state space. This allows for the representation and exact solution of a wide range of problems in which transitions or rewards vary over time. We examine problems based on route planning with public transportation and telescope observation scheduling. 1
7 0.37792197 43 nips-2000-Dopamine Bonuses
8 0.32641557 142 nips-2000-Using Free Energies to Represent Q-values in a Multiagent Reinforcement Learning Task
9 0.32576981 28 nips-2000-Balancing Multiple Sources of Reward in Reinforcement Learning
10 0.31513754 1 nips-2000-APRICODD: Approximate Policy Construction Using Decision Diagrams
12 0.23229127 105 nips-2000-Programmable Reinforcement Learning Agents
13 0.21569529 25 nips-2000-Analysis of Bit Error Probability of Direct-Sequence CDMA Multiuser Demodulators
14 0.21501875 132 nips-2000-The Interplay of Symbolic and Subsymbolic Processes in Anagram Problem Solving
15 0.21472582 99 nips-2000-Periodic Component Analysis: An Eigenvalue Method for Representing Periodic Structure in Speech
16 0.21115242 104 nips-2000-Processing of Time Series by Neural Circuits with Biologically Realistic Synaptic Dynamics
17 0.20529953 49 nips-2000-Explaining Away in Weight Space
18 0.19521679 147 nips-2000-Who Does What? A Novel Algorithm to Determine Function Localization
19 0.1888366 34 nips-2000-Competition and Arbors in Ocular Dominance
20 0.18861458 7 nips-2000-A New Approximate Maximal Margin Classification Algorithm
topicId topicWeight
[(10, 0.037), (17, 0.063), (32, 0.013), (33, 0.024), (49, 0.446), (55, 0.016), (62, 0.078), (65, 0.021), (67, 0.044), (76, 0.027), (79, 0.018), (81, 0.017), (90, 0.046), (97, 0.026)]
simIndex simValue paperId paperTitle
same-paper 1 0.82844949 113 nips-2000-Robust Reinforcement Learning
Author: Jun Morimoto, Kenji Doya
2 0.65637505 84 nips-2000-Minimum Bayes Error Feature Selection for Continuous Speech Recognition
Author: George Saon, Mukund Padmanabhan
Abstract: We consider the problem of designing a linear transformation θ ∈ ℝ^{p×n}, of rank p ≤ n, which projects the features of a classifier x ∈ ℝ^n onto y = θx ∈ ℝ^p such as to achieve minimum Bayes error (or probability of misclassification). Two avenues will be explored: the first is to maximize the θ-average divergence between the class densities and the second is to minimize the union Bhattacharyya bound in the range of θ. While both approaches yield similar performance in practice, they outperform standard LDA features and show a 10% relative improvement in the word error rate over state-of-the-art cepstral features on a large vocabulary telephony speech recognition task.
3 0.2661038 26 nips-2000-Automated State Abstraction for Options using the U-Tree Algorithm
Author: Anders Jonsson, Andrew G. Barto
Abstract: Learning a complex task can be significantly facilitated by defining a hierarchy of subtasks. An agent can learn to choose between various temporally abstract actions, each solving an assigned subtask, to accomplish the overall task. In this paper, we study hierarchical learning using the framework of options. We argue that to take full advantage of hierarchical structure, one should perform option-specific state abstraction, and that if this is to scale to larger tasks, state abstraction should be automated. We adapt McCallum's U-Tree algorithm to automatically build option-specific representations of the state feature space, and we illustrate the resulting algorithm using a simple hierarchical task. Results suggest that automated option-specific state abstraction is an attractive approach to making hierarchical learning systems more effective.
4 0.26072055 142 nips-2000-Using Free Energies to Represent Q-values in a Multiagent Reinforcement Learning Task
Author: Brian Sallans, Geoffrey E. Hinton
Abstract: The problem of reinforcement learning in large factored Markov decision processes is explored. The Q-value of a state-action pair is approximated by the free energy of a product of experts network. Network parameters are learned on-line using a modified SARSA algorithm which minimizes the inconsistency of the Q-values of consecutive state-action pairs. Actions are chosen based on the current value estimates by fixing the current state and sampling actions from the network using Gibbs sampling. The algorithm is tested on a co-operative multi-agent task. The product of experts model is found to perform comparably to table-based Q-learning for small instances of the task, and continues to perform well when the problem becomes too large for a table-based representation.
5 0.26012757 28 nips-2000-Balancing Multiple Sources of Reward in Reinforcement Learning
Author: Christian R. Shelton
Abstract: For many problems which would be natural for reinforcement learning, the reward signal is not a single scalar value but has multiple scalar components. Examples of such problems include agents with multiple goals and agents with multiple users. Creating a single reward value by combining the multiple components can throw away vital information and can lead to incorrect solutions. We describe the multiple reward source problem and discuss the problems with applying traditional reinforcement learning. We then present a new algorithm for finding a solution and results on simulated environments.
6 0.25594109 106 nips-2000-Propagation Algorithms for Variational Bayesian Learning
7 0.25579175 69 nips-2000-Incorporating Second-Order Functional Knowledge for Better Option Pricing
8 0.25271949 98 nips-2000-Partially Observable SDE Models for Image Sequence Recognition Tasks
9 0.25154546 7 nips-2000-A New Approximate Maximal Margin Classification Algorithm
10 0.25115329 1 nips-2000-APRICODD: Approximate Policy Construction Using Decision Diagrams
11 0.25096357 86 nips-2000-Model Complexity, Goodness of Fit and Diminishing Returns
12 0.25027144 92 nips-2000-Occam's Razor
13 0.24949235 80 nips-2000-Learning Switching Linear Models of Human Motion
14 0.24846822 74 nips-2000-Kernel Expansions with Unlabeled Examples
15 0.24824066 112 nips-2000-Reinforcement Learning with Function Approximation Converges to a Region
16 0.24653609 22 nips-2000-Algorithms for Non-negative Matrix Factorization
17 0.2460603 104 nips-2000-Processing of Time Series by Neural Circuits with Biologically Realistic Synaptic Dynamics
18 0.24466342 138 nips-2000-The Use of Classifiers in Sequential Inference
19 0.24458958 105 nips-2000-Programmable Reinforcement Learning Agents
20 0.24389032 52 nips-2000-Fast Training of Support Vector Classifiers