nips nips2001 nips2001-161 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Bram Bakker
Abstract: This paper presents reinforcement learning with a Long Short-Term Memory recurrent neural network: RL-LSTM. Model-free RL-LSTM using Advantage(λ) learning and directed exploration can solve non-Markovian tasks with long-term dependencies between relevant events. This is demonstrated in a T-maze task, as well as in a difficult variation of the pole balancing task.
Reference: text
sentIndex sentText sentNum sentScore
1 Abstract This paper presents reinforcement learning with a Long Short-Term Memory recurrent neural network: RL-LSTM. [sent-6, score-0.204]
2 Model-free RL-LSTM using Advantage(λ) learning and directed exploration can solve non-Markovian tasks with long-term dependencies between relevant events. [sent-7, score-0.476]
3 This is demonstrated in a T-maze task, as well as in a difficult variation of the pole balancing task. [sent-8, score-0.365]
4 Among the more important challenges for RL are tasks where part of the state of the environment is hidden from the agent. [sent-10, score-0.286]
5 Such tasks are called non-Markovian tasks or Partially Observable Markov Decision Processes. [sent-11, score-0.2]
6 For instance, in a navigation task different positions in the environment may look the same, but one and the same action may lead to different next states or rewards. [sent-13, score-0.216]
7 However, it also makes the problem more difficult, because now the agent not only needs to learn the mapping from environmental states to actions; for optimal performance it usually needs to determine which environmental state it is in as well. [sent-15, score-0.596]
8 Most approaches to solving non-Markovian RL tasks have problems if there are long-term dependencies between relevant events. [sent-17, score-0.235]
9 An example of a long-term dependency problem is a maze navigation task where the only way to distinguish between two T-junctions that look identical is to remember an observation or action from long before either T-junction is reached. [sent-18, score-0.436]
10 Such a case presents obvious problems for fixed-size history window approaches [6], which attempt to resolve the hidden state by making the chosen action depend not only on the current observation, but also on a fixed number of the most recent observations and actions. [sent-19, score-0.48]
11 If the relevant piece of information to be remembered falls outside the history window, the agent cannot use it. [sent-20, score-0.455]
12 However, the system starts with zero history and increases the depth of the history window step by step. [sent-22, score-0.275]
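The fixed-size history window idea described above can be sketched in a few lines of Python; the class name and the padding scheme below are illustrative choices of mine, not taken from [6].

```python
from collections import deque

class HistoryWindowState:
    """State = the k most recent (observation, action) pairs.
    Anything older than k steps is invisible to the agent."""

    def __init__(self, k):
        self.k = k
        self.buffer = deque(maxlen=k)  # oldest entry is dropped automatically

    def update(self, observation, action):
        self.buffer.append((observation, action))

    def state(self):
        # Pad with (None, None) until k steps of history have been collected.
        padded = [(None, None)] * (self.k - len(self.buffer)) + list(self.buffer)
        return tuple(padded)
```

If the relevant observation falls outside the last k steps, the returned state no longer contains it, which is exactly the limitation discussed above.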
13 Other approaches to non-Markovian tasks are based on learning Finite State Automata [2], recurrent neural networks (RNNs) [10, 11, 6], or on learning to set memory bits [9]. [sent-24, score-0.483]
14 The difficulty lies in discovering the correlation between a piece of information and the moment at which this information becomes relevant at a later time, given the distracting observations and actions between them. [sent-27, score-0.202]
15 This difficulty can be viewed as an instance of the general problem of learning long-term dependencies in timeseries data. [sent-28, score-0.251]
16 2 LSTM LSTM is a recently proposed recurrent neural network architecture, originally designed for supervised timeseries learning [5, 3]. [sent-35, score-0.336]
17 It is based on an analysis of the problems that conventional recurrent neural network learning algorithms, e. [sent-36, score-0.193]
18 backpropagation through time (BPTT) and real-time recurrent learning (RTRL), have when learning timeseries with long-term dependencies. [sent-38, score-0.328]
19 Input gates receive input from the timeseries and the other units in the network, and they learn to open and close access to the CECs at appropriate moments. [sent-45, score-0.452]
20 Access from the activations of the CECs to the output units (and possibly other units) of the network is regulated using multiplicative output gates. [sent-46, score-0.422]
21 Similar to the input gates, the output gates learn when the time is right to send the information stored in the CECs to the output side of the network. [sent-47, score-0.347]
22 A recent addition is forget gates [3], which learn to reset the activation of the CECs when the information stored in the CECs is no longer useful. [sent-48, score-0.308]
23 The combination of a CEC with its associated input, output, and forget gate is called a memory cell. [sent-49, score-0.295]
24 It is also possible for multiple CECs to be combined with only one input, output, and forget gate, in a so-called memory block. [sent-51, score-0.233]
25 More formally, the network's activations at each timestep t are computed as follows: the standard hidden unit activation y^h, output unit activation y^k, input gate activation y^in, output gate activation y^out, and forget gate activation y^φ are all computed in the following standard way. [sent-53, score-0.891]
26 A standard hidden unit's activation is y_i(t) = f_i( Σ_m w_im y_m(t-1) )  (1), where w_im is the weight of the connection from unit m to unit i. [sent-54, score-0.221]
27 In this paper, f_i is the standard logistic sigmoid function for all units except output units, for which it is the identity function. [sent-55, score-0.241]
28 , or the "state" of memory cell v :J cell output ,/' ,/' b. [sent-57, score-0.305]
29 The network's output units directly code for the Advantages of different actions. [sent-62, score-0.204]
30 The memory cell's output y^cj is calculated by y^cj(t) = y^outj(t) h(s_cj(t))  (3), where h is a logistic sigmoid function scaled to the range [-1, 1]. [sent-66, score-0.334]
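To make the gating computations concrete, the following sketch performs one forward step of a single memory cell. The gate activations and the cell output follow the description above and Eq. (3); the CEC state update itself (Eq. (2)) is not preserved in this extract, so the forget-gate-times-old-state plus input-gate-times-squashed-input form used here is the standard LSTM formulation and should be read as an assumption.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def h(x):
    # logistic sigmoid scaled to the range [-1, 1], as in Eq. (3)
    return 2.0 * sigmoid(x) - 1.0

def memory_cell_step(x, s_prev, w_in, w_out, w_forget, w_cell):
    """One timestep of a single LSTM memory cell.

    x       : input vector (current observation plus recurrent activations)
    s_prev  : previous CEC state s_c(t-1)
    w_*     : weight vectors of the input/output/forget gates and the cell input
    Returns (cell output y_c(t), new CEC state s_c(t)).
    """
    y_in = sigmoid(np.dot(w_in, x))          # input gate
    y_out = sigmoid(np.dot(w_out, x))        # output gate
    y_forget = sigmoid(np.dot(w_forget, x))  # forget gate
    g = h(np.dot(w_cell, x))                 # squashed cell input (assumed form)
    s = y_forget * s_prev + y_in * g         # CEC state update (assumed standard form)
    y_c = y_out * h(s)                       # Eq. (3): cell output
    return y_c, s
```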
31 At some or all timesteps of the timeseries, the output units of the network may make prediction errors. [sent-68, score-0.473]
32 One way is to let the RNN learn a model of the environment, which learns to predict observations and rewards, and in this way learns to infer the environmental state at each point [6, 11]. [sent-74, score-0.368]
33 The model-based system could then learn the mapping from (inferred) environmental states to actions as in the Markovian case, using standard techniques such as Q-learning [6, 2], or by backpropagating through the frozen model to the controller [11]. [sent-76, score-0.307]
34 The state of the environment is approximated by the current observation, which is the input to the network, together with the recurrent activations in the network, which represent the agent's history. [sent-78, score-0.261]
35 The LSTM network's output units directly code for the Advantage values of different actions. [sent-88, score-0.204]
36 Output units associated with actions other than the executed one do not receive error signals. [sent-92, score-0.218]
37 In the weight update of Eq. (5), each weight w_im is adjusted in proportion to the temporal-difference error and its eligibility trace e_im(t); K indicates the output unit associated with the executed action, α is a learning rate parameter, and λ is a parameter determining how fast the eligibility trace decays. [sent-98, score-0.225]
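Since Eq. (5) itself is garbled in this extract, the sketch below shows a generic eligibility-trace update of the kind described: only the output unit K of the executed action contributes gradient, traces decay with λ, and weights move along trace times error. The exact trace definition and the discount factor γ are assumptions, not the paper's precise formula.

```python
def advantage_lambda_update(weights, traces, grad_yK, td_error, alpha, lam, gamma):
    """Generic eligibility-trace weight update in the spirit of Eq. (5).

    weights  : dict (i, m) -> w_im
    traces   : dict (i, m) -> eligibility trace e_im
    grad_yK  : dict (i, m) -> d y^K / d w_im for the executed action's output unit K
               (output units of non-executed actions contribute no gradient)
    td_error : temporal-difference error driving the update
    alpha    : learning rate; lam controls how fast the traces decay
    """
    for key in weights:
        traces[key] = lam * gamma * traces.get(key, 0.0) + grad_yK.get(key, 0.0)
        weights[key] += alpha * td_error * traces[key]
    return weights, traces
```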
38 Undirected exploration attempts to try out actions in the same way in each environmental state. [sent-102, score-0.353]
39 However, in non-Markovian tasks, the agent initially does not know which environmental state it is in. [sent-103, score-0.413]
40 Part of the exploration must be aimed at discovering the environmental state structure. [sent-104, score-0.331]
41 This paper employs a directed exploration technique based on these ideas. [sent-107, score-0.232]
42 A separate multilayer feedforward neural network, with the same input as the LSTM network (representing the current observation) and one output unit y^V, is trained concurrently with the LSTM network. [sent-108, score-0.214]
43 This amounts to attempting to identify which observations are … Figure 2: Long-term dependency T-maze with length of corridor N = 10. [sent-111, score-0.275]
44 At the starting position S the agent's observation indicates where the goal position G is in this episode. [sent-112, score-0.197]
45 This exploration technique has obvious similarities with the statistically more rigorous technique of Interval Estimation (see [12]), as well as with certain model-based approaches where exploration is greater when there is more uncertainty in the predictions of a model [11]. [sent-116, score-0.355]
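One possible way to use the exploration network's output y^V is sketched below: it modulates the temperature of Boltzmann action selection over the Advantage outputs, so that exploration can be set low where predictions are reliable (e.g., the corridor) and high where they are not (e.g., the T-junction). The particular mapping from y^V to a temperature is an illustrative assumption, not the mechanism specified in the paper.

```python
import numpy as np

def select_action(advantages, exploration_signal, base_temp=0.1, scale=1.0):
    """Boltzmann action selection whose temperature grows with the
    exploration network's scalar output y^V (assumed mapping)."""
    temperature = base_temp + scale * max(0.0, exploration_signal)
    prefs = np.asarray(advantages, dtype=float) / temperature
    prefs -= prefs.max()                 # numerical stability
    expv = np.exp(prefs)
    probs = expv / expv.sum()
    return np.random.choice(len(probs), p=probs)
```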
46 The agent has four possible actions: move North, East, South, or West. [sent-120, score-0.275]
47 The agent must learn to move from the starting position at the beginning of the corridor to the T-junction. [sent-121, score-0.633]
48 However, the location of the goal depends on a "road sign" the agent has seen at the starting position. [sent-123, score-0.279]
49 If the agent takes the correct action at the T-junction, it receives a reward of 4. [sent-124, score-0.429]
50 During the episode, the agent receives a reward of -. [sent-128, score-0.3]
51 At the starting position, the observation is either 011 or 110, in the corridor the observation is 101, and at the T-junction the observation is 010. [sent-130, score-0.412]
52 The length of the corridor N was systematically varied from 5 to 70. [sent-131, score-0.195]
53 If the agent takes only optimal actions to the T-junction, it must remember the observation from the starting position for N timesteps to determine the optimal action at the T-junction. [sent-133, score-0.786]
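The T-maze described above can be captured as a small environment, sketched below. The observation codes (011/110 at the start, 101 in the corridor, 010 at the T-junction) and the reward of 4 for the correct junction action come from the text; the per-step and wrong-choice rewards are truncated in this extract, so `step_reward` is a placeholder assumption, and the geometry is simplified to a single corridor plus a North/South choice.

```python
import random

class TMaze:
    """Minimal sketch of the T-maze of Figure 2 (simplified geometry)."""

    def __init__(self, N=10, step_reward=-0.1):   # -0.1 is an assumed placeholder
        self.N = N
        self.step_reward = step_reward

    def reset(self):
        self.pos = 0                                # 0 = start S, N = T-junction
        self.goal_north = random.random() < 0.5     # where the goal G is this episode
        return self._obs()

    def _obs(self):
        if self.pos == 0:
            return "011" if self.goal_north else "110"   # road sign at S
        return "010" if self.pos == self.N else "101"    # junction / corridor

    def step(self, action):    # action in {"North", "East", "South", "West"}
        if self.pos == self.N and action in ("North", "South"):
            correct = (action == "North") == self.goal_north
            reward = 4.0 if correct else self.step_reward  # wrong-choice reward assumed
            return self._obs(), reward, True
        if action == "East":
            self.pos = min(self.N, self.pos + 1)
        elif action == "West":
            self.pos = max(0, self.pos - 1)
        return self._obs(), self.step_reward, False
```

Typical use: `obs = env.reset()`, then repeatedly `obs, reward, done = env.step(action)` until `done`.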
54 Note that the agent is not aided by experiences in which there are shorter time lag dependencies. [sent-134, score-0.317]
55 In fact, the opposite is true: initially, it takes many more actions until even the T-junction is reached, and the experienced history is very variable from episode to episode. [sent-135, score-0.235]
56 The agent must first learn to reliably move to the T-junction. [sent-136, score-0.353]
57 Once this is accomplished, the agent will begin to experience more or less consistent and shortest possible histories of observations and actions, from which it can learn to extract the relevant piece of information. [sent-137, score-0.472]
58 The directed exploration mechanism is crucial in this regard: it learns to set exploration low in the corridor and high at the T-junction. [sent-138, score-0.632]
59 The LSTM network had 3 input units, 12 standard hidden units, 3 memory cells, and a == . [sent-139, score-0.327]
[Figure 3, left panel: curves for LSTM, Elman-BPTT, and Memory bits as a function of N, the length of the corridor.] [sent-147, score-0.195]
Figure 3: Results in noise-free T-maze task. [sent-148, score-0.249]
62 Right: Average number of timesteps until success as a function of N. [sent-150, score-0.224]
[Figure 4, left panel: curves for LSTM, Elman-BPTT, and Memory bits as a function of N, the length of the corridor.] Figure 4: Results in noisy T-maze task. [sent-154, score-0.444]
64 Right: Average number of timesteps until success as a function of N. [sent-156, score-0.224]
65 The Elman network had 16 hidden units and 16 context units, and a == . [sent-161, score-0.283]
66 The second alternative is a table-based system extended with memory bits that are part of the observation, and that the controller can switch on and off [9]. [sent-163, score-0.237]
67 Because the task requires the agent to remember just one bit of information, this system had one memory bit, and a == . [sent-164, score-0.587]
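The memory-bit baseline can be sketched as an augmentation of the observation and action spaces, as below: the bits are appended to the observation, and each primitive action is paired with a bit setting so the controller can switch the bits on and off. The exact construction in [9] is not given in this extract, so this is an illustrative reconstruction.

```python
from itertools import product

def augmented_spaces(observations, base_actions, n_bits=1):
    """Table-based memory-bit scheme, sketched: states and actions are
    augmented with settings of n_bits controllable memory bits."""
    bit_patterns = list(product((0, 1), repeat=n_bits))
    states = [(obs, bits) for obs in observations for bits in bit_patterns]
    actions = [(a, bits) for a in base_actions for bits in bit_patterns]
    return states, actions

# Noise-free T-maze: 4 observations, 4 move actions, 1 memory bit
states, actions = augmented_spaces(["011", "110", "101", "010"],
                                   ["North", "East", "South", "West"], n_bits=1)
```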
68 A run was considered a success if the agent learned to take the correct action at the T-junction in over 80% of cases, using its stochastic action selection mechanism. [sent-169, score-0.533]
69 In practice, this corresponds to 100% correct action choices at the T-junction using greedy action selection, as well as optimal or near-optimal action choices leading to the T-junction. [sent-170, score-0.387]
70 Figure 3 shows the number of successful runs (out of 10) as a function of the length of the corridor N, for each of the three methods. [sent-171, score-0.238]
71 It also shows the average number of timesteps needed to reach success. [sent-172, score-0.194]
72 The reason why the memory bits system performs worst is probably that, in contrast with the other two, it does not explicitly compute the gradient of performance with respect to past events. [sent-176, score-0.237]
73 It is one thing to learn long-term dependencies in a noise-free task; it is quite another to do so in the presence of severe noise. [sent-180, score-0.208]
74 Now the observation in the corridor is a0b, where a and b are independent, uniformly distributed random values in the range [0, 1], generated online. [sent-182, score-0.256]
75 To allow for a fair comparison, the table-based memory bit system's observation was computed using Michie and Chambers's BOXES state aggregation mechanism (see [12]), partitioning each input dimension into three equal regions. [sent-185, score-0.39]
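The BOXES-style aggregation used for the memory-bit system's input can be sketched as follows: each input dimension is partitioned into three equal regions, and the observation becomes the tuple of region indices. The function name and interface are mine.

```python
def boxes_discretize(values, lows, highs, n_regions=3):
    """BOXES-style state aggregation (see [12]): map each continuous input
    dimension to one of n_regions equal-width regions."""
    cell = []
    for v, lo, hi in zip(values, lows, highs):
        width = (hi - lo) / n_regions
        idx = min(n_regions - 1, max(0, int((v - lo) / width)))
        cell.append(idx)
    return tuple(cell)

# e.g. a noisy corridor observation a0b with a, b drawn from [0, 1]
region = boxes_discretize([0.37, 0.0, 0.82], lows=[0, 0, 0], highs=[1, 1, 1])
```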
76 The memory bit system suffers most from the noise. [sent-187, score-0.231]
77 Most importantly, RL-LSTM again significantly outperforms the others, both in terms of the maximum time lag it can deal with, and in terms of the number of timesteps needed to learn the task. [sent-190, score-0.344]
78 It consists of a difficult variation of the classical pole balancing task. [sent-193, score-0.365]
79 In the pole balancing task, an agent must balance an inherently unstable pole, hinged to the top of a wheeled cart that travels along a track, by applying left and right forces to the cart. [sent-194, score-0.686]
80 First, as in [6], the agent cannot observe the state information corresponding to the cart velocity and pole angular velocity. [sent-197, score-0.638]
81 Second, the agent must learn to operate in two different modes. [sent-199, score-0.323]
82 In mode A, action 1 is left push and action 2 is right push. [sent-200, score-0.376]
83 In mode B, this is reversed: action 1 is right push and action 2 is left push. [sent-201, score-0.376]
84 The information about which mode the agent is operating in is provided to the agent only for the first second of the episode. [sent-203, score-0.58]
85 After that, the corresponding input unit is set to zero and the agent must remember which mode it is in. [sent-204, score-0.475]
86 Obviously, failing to remember the mode leads to very poor performance. [sent-205, score-0.197]
87 The only reward signal is -1 if the pole falls past ±12° or if the cart hits either end of the track. [sent-206, score-0.385]
88 Note that the agent must learn to remember the (discrete) mode information for an infinite amount of time if it is to learn to balance the pole indefinitely. [sent-207, score-0.911]
89 The LSTM network had 2 output units, 14 standard hidden units, and 6 memory cells. [sent-210, score-0.365]
90 It has 3 input units: one each for cart position and pole angle; and one for the mode of operation, set to zero after one second of simulated time (50 timesteps). [sent-211, score-0.533]
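The non-Markovian observation and the mode-dependent action semantics can be sketched as below. Velocities are never observed, and the mode input is zeroed after 50 timesteps, as stated above; the numeric encoding of the mode input and the force magnitude are assumptions.

```python
def observe(cart_position, pole_angle, mode_is_A, t, mode_visible_steps=50):
    """Observation: cart position, pole angle, and a mode input that is
    zeroed after the first second of simulated time (50 timesteps)."""
    mode_input = (1.0 if mode_is_A else -1.0) if t < mode_visible_steps else 0.0
    return [cart_position, pole_angle, mode_input]

def applied_force(action, mode_is_A, magnitude=10.0):
    """Mode A: action 1 pushes left, action 2 pushes right; mode B reverses this.
    The force magnitude is an assumed placeholder."""
    left = (action == 1) if mode_is_A else (action == 2)
    return -magnitude if left else magnitude
```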
91 In this problem, directed exploration was not necessary, because in contrast to the T-mazes, imperfect policies lead to many different experiences with reward signals, and there is hidden state everywhere in the environment. [sent-217, score-0.425]
92 For a continuous problem like this, a table-based memory bit system is not suited very well, so a comparison was only made with the ElmanBPTT system, which had 16 hidden and context units and a == . [sent-218, score-0.439]
93 It only learned to balance the pole for the first 50 timesteps, when the mode information is available, thus failing to learn the long-term dependency. [sent-221, score-0.522]
94 However, RL-LSTM learned optimal performance in 2 out of 10 runs (after an average of 6,250,000 timesteps of learning). [sent-222, score-0.237]
95 After learning, these two agents were able to balance the pole indefinitely in both modes of operation. [sent-223, score-0.351]
96 In the other 8 runs, the agents still learned to balance the pole in both modes for hundreds or even thousands of timesteps (after an average of 8,095,000 timesteps of learning), thus showing that the mode information was remembered for long time lags. [sent-224, score-0.963]
97 In most cases, such an agent learns optimal performance for one mode, while achieving good but suboptimal performance in the other. [sent-225, score-0.324]
98 5 Conclusions The results presented in this paper suggest that reinforcement learning with Long Short-Term Memory (RL-LSTM) is a promising approach to solving non-Markovian RL tasks with long-term dependencies. [sent-226, score-0.224]
99 This was demonstrated in a T-maze task with minimal time lag dependencies of up to 70 timesteps, as well as in a non-Markovian version of pole balancing where optimal performance requires remembering information indefinitely. [sent-227, score-0.503]
100 RL-LSTM's main power is derived from LSTM's property of constant error flow, but for good performance in RL tasks, the combination with Advantage(λ) learning and directed exploration was crucial. [sent-228, score-0.27]
wordName wordTfidf (topN-words)
[('lstm', 0.49), ('pole', 0.265), ('agent', 0.245), ('cecs', 0.224), ('corridor', 0.195), ('timesteps', 0.194), ('exploration', 0.163), ('rl', 0.153), ('memory', 0.144), ('timeseries', 0.143), ('units', 0.133), ('action', 0.129), ('environmental', 0.105), ('eim', 0.102), ('tasks', 0.1), ('advantage', 0.095), ('mode', 0.09), ('history', 0.09), ('forget', 0.089), ('rnn', 0.089), ('reinforcement', 0.086), ('actions', 0.085), ('lenqth', 0.082), ('eligibility', 0.081), ('recurrent', 0.08), ('learn', 0.078), ('activation', 0.076), ('hidden', 0.075), ('network', 0.075), ('remember', 0.072), ('wim', 0.071), ('output', 0.071), ('dependencies', 0.07), ('directed', 0.069), ('gates', 0.065), ('lags', 0.065), ('timestep', 0.065), ('cart', 0.065), ('long', 0.064), ('state', 0.063), ('gate', 0.062), ('observation', 0.061), ('bptt', 0.061), ('leiden', 0.061), ('episode', 0.06), ('balancing', 0.057), ('window', 0.056), ('reward', 0.055), ('balance', 0.054), ('bits', 0.054), ('north', 0.053), ('position', 0.051), ('yv', 0.048), ('bit', 0.048), ('environment', 0.048), ('cell', 0.045), ('runs', 0.043), ('difficult', 0.043), ('piece', 0.043), ('lag', 0.043), ('learns', 0.042), ('dependency', 0.042), ('aggregation', 0.041), ('cec', 0.041), ('etd', 0.041), ('gers', 0.041), ('jong', 0.041), ('remembered', 0.041), ('rllstm', 0.041), ('rtrl', 0.041), ('schmidhuber', 0.041), ('shortterm', 0.041), ('ycj', 0.041), ('system', 0.039), ('task', 0.039), ('observations', 0.038), ('learning', 0.038), ('sigmoid', 0.037), ('suboptimal', 0.037), ('activations', 0.037), ('relevant', 0.036), ('failing', 0.035), ('regulated', 0.035), ('rnns', 0.035), ('unit', 0.035), ('starting', 0.034), ('input', 0.033), ('boxes', 0.032), ('vanish', 0.032), ('histories', 0.032), ('modes', 0.032), ('thing', 0.03), ('flow', 0.03), ('south', 0.03), ('rewards', 0.03), ('move', 0.03), ('success', 0.03), ('time', 0.029), ('approaches', 0.029), ('push', 0.028)]
simIndex simValue paperId paperTitle
same-paper 1 0.99999934 161 nips-2001-Reinforcement Learning with Long Short-Term Memory
Author: Bram Bakker
Abstract: This paper presents reinforcement learning with a Long Short-Term Memory recurrent neural network: RL-LSTM. Model-free RL-LSTM using Advantage(λ) learning and directed exploration can solve non-Markovian tasks with long-term dependencies between relevant events. This is demonstrated in a T-maze task, as well as in a difficult variation of the pole balancing task. 1
2 0.17954656 128 nips-2001-Multiagent Planning with Factored MDPs
Author: Carlos Guestrin, Daphne Koller, Ronald Parr
Abstract: We present a principled and efficient planning algorithm for cooperative multiagent dynamic systems. A striking feature of our method is that the coordination and communication between the agents is not imposed, but derived directly from the system dynamics and function approximation architecture. We view the entire multiagent system as a single, large Markov decision process (MDP), which we assume can be represented in a factored way using a dynamic Bayesian network (DBN). The action space of the resulting MDP is the joint action space of the entire set of agents. Our approach is based on the use of factored linear value functions as an approximation to the joint value function. This factorization of the value function allows the agents to coordinate their actions at runtime using a natural message passing scheme. We provide a simple and efficient method for computing such an approximate value function by solving a single linear program, whose size is determined by the interaction between the value function structure and the DBN. We thereby avoid the exponential blowup in the state and action space. We show that our approach compares favorably with approaches based on reward sharing. We also show that our algorithm is an efficient alternative to more complicated algorithms even in the single agent case.
Author: Gregory Z. Grudic, Lyle H. Ungar
Abstract: We address two open theoretical questions in Policy Gradient Reinforcement Learning. The first concerns the efficacy of using function approximation to represent the state action value function, . Theory is presented showing that linear function approximation representations of can degrade the rate of convergence of performance gradient estimates by a factor of relative to when no function approximation of is used, where is the number of possible actions and is the number of basis functions in the function approximation representation. The second concerns the use of a bias term in estimating the state action value function. Theory is presented showing that a non-zero bias term can improve the rate of convergence of performance gradient estimates by , where is the number of possible actions. Experimental evidence is presented showing that these theoretical results lead to significant improvement in the convergence properties of Policy Gradient Reinforcement Learning algorithms.
4 0.12414894 40 nips-2001-Batch Value Function Approximation via Support Vectors
Author: Thomas G. Dietterich, Xin Wang
Abstract: We present three ways of combining linear programming with the kernel trick to find value function approximations for reinforcement learning. One formulation is based on SVM regression; the second is based on the Bellman equation; and the third seeks only to ensure that good moves have an advantage over bad moves. All formulations attempt to minimize the number of support vectors while fitting the data. Experiments in a difficult, synthetic maze problem show that all three formulations give excellent performance, but the advantage formulation is much easier to train. Unlike policy gradient methods, the kernel methods described here can easily 'adjust the complexity of the function approximator to fit the complexity of the value function. 1
5 0.11540829 80 nips-2001-Generalizable Relational Binding from Coarse-coded Distributed Representations
Author: Randall C. O'Reilly, R. S. Busby
Abstract: We present a model of binding of relationship information in a spatial domain (e.g., square above triangle) that uses low-order coarse-coded conjunctive representations instead of more popular temporal synchrony mechanisms. Supporters of temporal synchrony argue that conjunctive representations lack both efficiency (i.e., combinatorial numbers of units are required) and systematicity (i.e., the resulting representations are overly specific and thus do not support generalization to novel exemplars). To counter these claims, we show that our model: a) uses far fewer hidden units than the number of conjunctions represented, by using coarse-coded, distributed representations where each unit has a broad tuning curve through high-dimensional conjunction space, and b) is capable of considerable generalization to novel inputs.
6 0.10490558 52 nips-2001-Computing Time Lower Bounds for Recurrent Sigmoidal Neural Networks
7 0.10343158 126 nips-2001-Motivated Reinforcement Learning
8 0.10265043 175 nips-2001-Stabilizing Value Function Approximation with the BFBP Algorithm
9 0.10221072 55 nips-2001-Convergence of Optimistic and Incremental Q-Learning
10 0.10124508 187 nips-2001-The Steering Approach for Multi-Criteria Reinforcement Learning
11 0.091719881 148 nips-2001-Predictive Representations of State
12 0.086894318 51 nips-2001-Cobot: A Social Reinforcement Learning Agent
13 0.08615867 13 nips-2001-A Natural Policy Gradient
14 0.08218877 123 nips-2001-Modeling Temporal Structure in Classical Conditioning
15 0.081560463 173 nips-2001-Speech Recognition with Missing Data using Recurrent Neural Nets
16 0.077319182 181 nips-2001-The Emergence of Multiple Movement Units in the Presence of Noise and Feedback Delay
17 0.071700715 146 nips-2001-Playing is believing: The role of beliefs in multi-agent learning
18 0.066504858 121 nips-2001-Model-Free Least-Squares Policy Iteration
19 0.064798631 59 nips-2001-Direct value-approximation for factored MDPs
20 0.060874995 12 nips-2001-A Model of the Phonological Loop: Generalization and Binding
topicId topicWeight
[(0, -0.186), (1, -0.153), (2, 0.194), (3, 0.014), (4, -0.064), (5, 0.083), (6, -0.046), (7, -0.024), (8, -0.074), (9, -0.002), (10, -0.061), (11, 0.022), (12, -0.02), (13, 0.024), (14, 0.016), (15, 0.145), (16, 0.079), (17, -0.058), (18, 0.108), (19, -0.009), (20, 0.03), (21, -0.059), (22, -0.017), (23, 0.137), (24, -0.065), (25, 0.078), (26, 0.157), (27, 0.056), (28, 0.029), (29, -0.089), (30, 0.101), (31, 0.065), (32, -0.16), (33, -0.135), (34, -0.092), (35, -0.024), (36, 0.061), (37, -0.041), (38, -0.009), (39, 0.043), (40, 0.023), (41, -0.001), (42, 0.092), (43, 0.006), (44, 0.053), (45, -0.054), (46, -0.136), (47, -0.093), (48, 0.076), (49, 0.035)]
simIndex simValue paperId paperTitle
same-paper 1 0.95606548 161 nips-2001-Reinforcement Learning with Long Short-Term Memory
Author: Bram Bakker
Abstract: This paper presents reinforcement learning with a Long Short-Term Memory recurrent neural network: RL-LSTM. Model-free RL-LSTM using Advantage(λ) learning and directed exploration can solve non-Markovian tasks with long-term dependencies between relevant events. This is demonstrated in a T-maze task, as well as in a difficult variation of the pole balancing task. 1
2 0.71003991 126 nips-2001-Motivated Reinforcement Learning
Author: Peter Dayan
Abstract: The standard reinforcement learning view of the involvement of neuromodulatory systems in instrumental conditioning includes a rather straightforward conception of motivation as prediction of sum future reward. Competition between actions is based on the motivating characteristics of their consequent states in this sense. Substantial, careful, experiments reviewed in Dickinson & Balleine, 12,13 into the neurobiology and psychology of motivation shows that this view is incomplete. In many cases, animals are faced with the choice not between many different actions at a given state, but rather whether a single response is worth executing at all. Evidence suggests that the motivational process underlying this choice has different psychological and neural properties from that underlying action choice. We describe and model these motivational systems, and consider the way they interact.
3 0.63968408 148 nips-2001-Predictive Representations of State
Author: Michael L. Littman, Richard S. Sutton
Abstract: We show that states of a dynamical system can be usefully represented by multi-step, action-conditional predictions of future observations. State representations that are grounded in data in this way may be easier to learn, generalize better, and be less dependent on accurate prior models than, for example, POMDP state representations. Building on prior work by Jaeger and by Rivest and Schapire, in this paper we compare and contrast a linear specialization of the predictive approach with the state representations used in POMDPs and in k-order Markov models. Ours is the first specific formulation of the predictive idea that includes both stochasticity and actions (controls). We show that any system has a linear predictive state representation with number of predictions no greater than the number of states in its minimal POMDP model. In predicting or controlling a sequence of observations, the concepts of state and state estimation inevitably arise. There have been two dominant approaches. The generative-model approach, typified by research on partially observable Markov decision processes (POMDPs), hypothesizes a structure for generating observations and estimates its state and state dynamics. The history-based approach, typified by k-order Markov methods, uses simple functions of past observations as state, that is, as the immediate basis for prediction and control. (The data flow in these two approaches are diagrammed in Figure 1.) Of the two, the generative-model approach is more general. The model's internal state gives it temporally unlimited memorythe ability to remember an event that happened arbitrarily long ago--whereas a history-based approach can only remember as far back as its history extends. The bane of generative-model approaches is that they are often strongly dependent on a good model of the system's dynamics. Most uses of POMDPs, for example, assume a perfect dynamics model and attempt only to estimate state. There are algorithms for simultaneously estimating state and dynamics (e.g., Chrisman, 1992), analogous to the Baum-Welch algorithm for the uncontrolled case (Baum et al., 1970), but these are only effective at tuning parameters that are already approximately correct (e.g., Shatkay & Kaelbling, 1997). observations (and actions) 1-----1-----1..- (a) state rep'n observations (and actions) ¢E / t/' --+ 1-step delays . state rep'n (b) Figure 1: Data flow in a) POMDP and other recursive updating of state representation, and b) history-based state representation. In practice, history-based approaches are often much more effective. Here, the state representation is a relatively simple record of the stream of past actions and observations. It might record the occurrence of a specific subsequence or that one event has occurred more recently than another. Such representations are far more closely linked to the data than are POMDP representations. One way of saying this is that POMDP learning algorithms encounter many local minima and saddle points because all their states are equipotential. History-based systems immediately break symmetry, and their direct learning procedure makes them comparably simple. McCallum (1995) has shown in a number of examples that sophisticated history-based methods can be effective in large problems, and are often more practical than POMDP methods even in small ones. 
The predictive state representation (PSR) approach, which we develop in this paper, is like the generative-model approach in that it updates the state representation recursively, as in Figure l(a), rather than directly computing it from data. We show that this enables it to attain generality and compactness at least equal to that of the generative-model approach. However, the PSR approach is also like the history-based approach in that its representations are grounded in data. Whereas a history-based representation looks to the past and records what did happen, a PSR looks to the future and represents what will happen. In particular, a PSR is a vector of predictions for a specially selected set of action-observation sequences, called tests (after Rivest & Schapire, 1994). For example, consider the test U101U202, where U1 and U2 are specific actions and 01 and 02 are specific observations. The correct prediction for this test given the data stream up to time k is the probability of its observations occurring (in order) given that its actions are taken (in order) (i.e., Pr {Ok = 01, Ok+1 = 02 I A k = u1,A k + 1 = U2}). Each test is a kind of experiment that could be performed to tell us something about the system. If we knew the outcome of all possible tests, then we would know everything there is to know about the system. A PSR is a set of tests that is sufficient information to determine the prediction for all possible tests (a sufficient statistic). As an example of these points, consider the float/reset problem (Figure 2) consisting of a linear string of 5 states with a distinguished reset state on the far right. One action, f (float), causes the system to move uniformly at random to the right or left by one state, bounded at the two ends. The other action, r (reset), causes a jump to the reset state irrespective of the current state. The observation is always o unless the r action is taken when the system is already in the reset state, in which case the observation is 1. Thus, on an f action, the correct prediction is always 0, whereas on an r action, the correct prediction depends on how many fs there have been since the last r: for zero fS, it is 1; for one or two fS, it is 0.5; for three or four fS, it is 0.375; for five or six fs, it is 0.3125, and so on decreasing after every second f, asymptotically bottoming out at 0.2. No k-order Markov method can model this system exactly, because no limited-. .5 .5 a) float action 1,0=1 b) reset action Figure 2: Underlying dynamics of the float/reset problem for a) the float action and b) the reset action. The numbers on the arcs indicate transition probabilities. The observation is always 0 except on the reset action from the rightmost state, which produces an observation of 1. length history is a sufficient statistic. A POMDP approach can model it exactly by maintaining a belief-state representation over five or so states. A PSR, on the other hand, can exactly model the float/reset system using just two tests: rl and fOrI. Starting from the rightmost state, the correct predictions for these two tests are always two successive probabilities in the sequence given above (1, 0.5, 0.5, 0.375,...), which is always a sufficient statistic to predict the next pair in the sequence. Although this informational analysis indicates a solution is possible in principle, it would require a nonlinear updating process for the PSR. 
In this paper we restrict consideration to a linear special case of PSRs, for which we can guarantee that the number of tests needed does not exceed the number of states in the minimal POMDP representation (although we have not ruled out the possibility it can be considerably smaller). Of greater ultimate interest are the prospects for learning PSRs and their update functions, about which we can only speculate at this time. The difficulty of learning POMDP structures without good prior models are well known. To the extent that this difficulty is due to the indirect link between the POMDP states and the data, predictive representations may be able to do better. Jaeger (2000) introduced the idea of predictive representations as an alternative to belief states in hidden Markov models and provided a learning procedure for these models. We build on his work by treating the control case (with actions), which he did not significantly analyze. We have also been strongly influenced by the work of Rivest and Schapire (1994), who did consider tests including actions, but treated only the deterministic case, which is significantly different. They also explored construction and learning algorithms for discovering system structure. 1 Predictive State Representations We consider dynamical systems that accept actions from a discrete set A and generate observations from a discrete set O. We consider only predicting the system, not controlling it, so we do not designate an explicit reward observation. We refer to such a system as an environment. We use the term history to denote a test forming an initial stream of experience and characterize an environment by a probability distribution over all possible histories, P : {OIA}* H- [0,1], where P(Ol··· Otl a1··· at) is the probability of observations 01, ... , O£ being generated, in that order, given that actions aI, ... ,at are taken, in that order. The probability of a test t conditional on a history h is defined as P(tlh) = P(ht)/P(h). Given a set of q tests Q = {til, we define their (1 x q) prediction vector, p(h) = [P(t1Ih),P(t2Ih), ... ,P(tqlh)], as a predictive state representation (PSR) if and only if it forms a sufficient statistic for the environment, Le., if and only if P(tlh) = ft(P(h)), (1) for any test t and history h, and for some projection junction ft : [0, l]q ~ [0,1]. In this paper we focus on linear PSRs, for which the projection functions are linear, that is, for which there exist a (1 x q) projection vector mt, for every test t, such that (2) P(tlh) == ft(P(h)) =7 p(h)mf, for all histories h. Let Pi(h) denote the ith component of the prediction vector for some PSR. This can be updated recursively, given a new action-observation pair a,o, by .(h ) == P(t.lh ) == P(otil ha ) == faati(P(h)) == p(h)m'{;ati P2 ao 2 ao P(olha) faa (P(h)) p(h)mro ' (3) where the last step is specific to linear PSRs. We can now state our main result: Theorem 1 For any environment that can be represented by a finite POMDP model, there exists a linear PSR with number of tests no larger than the number of states in the minimal POMDP model. 2 Proof of Theorem 1: Constructing a PSR from a POMDP We prove Theorem 1 by showing that for any POMDP model of the environment, we can construct in polynomial time a linear PSR for that POMDP of lesser or equal complexity that produces the same probability distribution over histories as the POMDP model. We proceed in three steps. First, we review POMDP models and how they assign probabilities to tests. 
Next, we define an algorithm that takes an n-state POMDP model and produces a set of n or fewer tests, each of length less than or equal to n. Finally, we show that the set of tests constitute a PSR for the POMDP, that is, that there are projection vectors that, together with the tests' predictions, produce the same probability distribution over histories as the POMDP. A POMDP (Lovejoy, 1991; Kaelbling et al., 1998) is defined by a sextuple (8, A, 0, bo, T, 0). Here, 8 is a set of n underlying (hidden) states, A is a discrete set of actions, and 0 is a discrete set of observations. The (1 x n) vector bo is an initial state distribution. The set T consists of (n x n) transition matrices Ta, one for each action a, where Tlj is the probability of a transition from state i to j when action a is chosen. The set 0 consists of diagonal (n x n) observation matrices oa,o, one for each pair of observation 0 and action a, where o~'o is the probability of observation 0 when action a is selected and state i is reached. l The state representation in a POMDP (Figure l(a)) is the belief state-the (1 x n) vector of the state-occupation probabilities given the history h. It can be computed recursively given a new action a and observation 0 by b(h)Taoa,o b(hao) = b(h)Taoa,oe;' where en is the (1 x n)-vector of all Is. Finally, a POMDP defines a probability distribution over tests (and thus histories) by P(Ol ... otlhal ... at) == b(h)Ta1oal,Ol ... Taloa£,Ole~. (4) IThere are many equivalent formulations and the conversion procedure described here can be easily modified to accommodate other POMDP definitions. We now present our algorithm for constructing a PSR for a given POMDP. It uses a function u mapping tests to (1 x n) vectors defined recursively by u(c) == en and u(aot) == (Taoa,ou(t)T)T, where c represents the null test. Conceptually, the components of u(t) are the probabilities of the test t when applied from each underlying state of the POMDP; we call u(t) the outcome vector for test t. We say a test t is linearly independent of a set of tests S if its outcome vector is linearly independent of the set of outcome vectors of the tests in S. Our algorithm search is used and defined as Q -<- search(c, {}) search(t, S): for each a E A, 0 E 0 if aot is linearly independent of S then S -<- search(aot, S U {aot}) return S The algorithm maintains a set of tests and searches for new tests that are linearly independent of those already found. It is a form of depth-first search. The algorithm halts when it checks all the one-step extensions of its tests and finds none that are linearly independent. Because the set of tests Q returned by search have linearly independent outcome vectors, the cardinality of Q is bounded by n, ensuring that the algorithm halts after a polynomial number of iterations. Because each test in Q is formed by a one-step extension to some other test in Q, no test is longer than n action-observation pairs. The check for linear independence can be performed in many ways, including Gaussian elimination, implying that search terminates in polynomial time. By construction, all one-step extensions to the set of tests Q returned by search are linearly dependent on those in Q. We now show that this is true for any test. Lemma 1 The outcome vectors of the tests in Q can be linearly combined to produce the outcome vector for any test. Proof: Let U be the (n x q) matrix formed by concatenating the outcome vectors for all tests in Q. 
Since, for all combinations of a and 0, the columns of Taoa,ou are linearly dependent on the columns of U, we can write Taoa,ou == UW T for some q x q matrix of weights W. If t is a test that is linearly dependent on Q, then anyone-step extension of t, aot, is linearly dependent on Q. This is because we can write the outcome vector for t as u(t) == (UwT)T for some (1 x q) weight vector w and the outcome vector for aot as u(aot) == (Taoa,ou(t)T)T == (Taoa,oUwT)T == (UWTwT)T. Thus, aot is linearly dependent on Q. Now, note that all one-step tests are linearly dependent on Q by the structure of the search algorithm. Using the previous paragraph as an inductive argument, this implies that all tests are linearly dependent on Q. 0 Returning to the float/reset example POMDP, search begins with by enumerating the 4 extensions to the null test (fO, fl, rO, and rl). Of these, only fa and rO are are linearly independent. Of the extensions of these, fOrO is the only one that is linearly independent of the other two. The remaining two tests added to Q by search are fOfOrO and fOfOfOrO. No extensions of the 5 tests in Q are linearly independent of the 5 tests in Q, so the procedure halts. We now show that the set of tests Q constitute a PSR for the POMDP by constructing projection vectors that, together with the tests' predictions, produce the same probability distribution over histories as the POMDP. For each combination of a and 0, define a q x q matrix Mao == (U+Taoa,ou)T and a 1 x q vector mao == (U+Taoa,oe;;J T , where U is the matrix of outcome vectors defined in the previous section and U+ is its pseudoinverse2 • The ith row of Mao is maoti. The probability distribution on histories implied by these projection vectors is p(h )m~101 alOl p(h)M~ol M~_10l_1 m~Ol b(h)UU+r a1 oa 1,01 U ... U+T al-10 al-1,Ol-1 UU+Taloal,ol b(h)T a1 0 a1,01 ... ral-l0al-t,ol-lTaloal,Ole~, Le., it is the same as that of the POMDP, as in Equation 4. Here, the last step uses the fact that UU+v T == v T for v T linearly dependent on the columns of U. This holds by construction of U in the previous section. This completes the proof of Theorem 1. Completing the float/reset example, consider the Mf,o matrix found by the process defined in this section. It derives predictions for each test in Q after taking action f. Most of these are quite simple because the tests are so similar: the new prediction for rO is exactly the old prediction for fOrO, for example. The only non trivial test is fOfOfOrO. Its outcome can be computed from 0.250 p(rOlh) - 0.0625 p(fOrOlh) + 0.750 p(fOfOrOlh). This example illustrates that the projection vectors need not contain only positive entries. 3 Conclusion We have introduced a predictive state representation for dynamical systems that is grounded in actions and observations and shown that, even in its linear form, it is at least as general and compact as POMDPs. In essence, we have established PSRs as a non-inferior alternative to POMDPs, and suggested that they might have important advantages, while leaving demonstration of those advantages to future work. We conclude by summarizing the potential advantages (to be explored in future work): Learnability. The k-order Markov model is similar to PSRs in that it is entirely based on actions and observations. Such models can be learned trivially from data by counting-it is an open question whether something similar can be done with a PSR. 
Jaeger (2000) showed how to learn such a model in the uncontrolled setting, but the situation is more complex in the multiple action case since outcomes are conditioned on behavior, violating some required independence assumptions. Compactness. We have shown that there exist linear PSRs no more complex that the minimal POMDP for an environment, but in some cases the minimal linear PSR seems to be much smaller. For example, a POMDP extension of factored MDPs explored by Singh and Cohn (1998) would be cross-products of separate POMDPs and have linear PSRs that increase linearly with the number and size of the component POMDPs, whereas their minimal POMDP representation would grow as the size 2If U = A~BT is the singular value decomposition of U, then B:E+ AT is the pseudoinverse. The pseudoinverse of the diagonal matrix }J replaces each non-zero element with its reciprocal. e; of the state space, Le., exponential in the number of component POMDPs. This (apparent) advantage stems from the PSR's combinatorial or factored structure. As a vector of state variables, capable of taking on diverse values, a PSR may be inherently more powerful than the distribution over discrete states (the belief state) of a POMDP. We have already seen that general PSRs can be more compact than POMDPs; they are also capable of efficiently capturing environments in the diversity representation used by Rivest and Schapire (1994), which is known to provide an extremely compact representation for some environments. Generalization. There are reasons to think that state variables that are themselves predictions may be particularly useful in learning to make other predictions. With so many things to predict, we have in effect a set or sequence of learning problems, all due to the same environment. In many such cases the solutions to earlier problems have been shown to provide features that generalize particularly well to subsequent problems (e.g., Baxter, 2000; Thrun & Pratt, 1998). Powerful, extensible representations. PSRs that predict tests could be generalized to predict the outcomes of multi-step options (e.g., Sutton et al., 1999). In this case, particularly, they would constitute a powerful language for representing the state of complex environments. AcknowledgIllents: We thank Peter Dayan, Lawrence Saul, Fernando Pereira and Rob Schapire for many helpful discussions of these and related ideas. References Baum, L. E., Petrie, T., Soules, G., & Weiss, N. (1970). A maximization technique occurring in the statistical analysis of probabilistic functions of Markov chains. Annals of Mathematical Statistics, 41, 164-171. Baxter, J. (2000). A model of inductive bias learning. Journal of Artificial Intelligence Research, 12, 149-198. Chrisman, L. (1992). Reinforcement learning with perceptual aliasing: The perceptual distinctions approach. Proceedings of the Tenth National Conference on Artificial Intelligence (pp. 183-188). San Jose, California: AAAI Press. Jaeger, H. (2000). Observable operator models for discrete stochastic time series. Neural Computation, 12, 1371-1398. Kaelbling, L. P., Littman, M. L., & Cassandra, A. R. (1998). Planning and acting in ' partially observable stochastic domains. Artificial Intelligence, 101, 99-134. Lovejoy, W. S. (1991). A survey of algorithmic methods for partially observable Markov decision processes. Annals of Operations Research, 28, 47-65. McCallum, A. K. (1995). Reinforcement learning with selective perception and hidden state. 
Doctoral diss.ertation, Department of Computer Science, University of Rochester. Rivest, R. L., & Schapire, R. E. (1994). Diversity-based inference of finite automata. Journal of the ACM, 41, 555-589. Shatkay, H., & Kaelbling, L. P. (1997). Learning topological maps with weak local odometric information~ Proceedings of Fifteenth International Joint Conference on Artificial Intelligence (IJCAI-91) (pp. 920-929). Singh, S., & Cohn, D. (1998). How to dynamically merge Markov decision processes. Advances in Neural and Information Processing Systems 10 (pp. 1057-1063). Sutton, R. S., Precup, D., & Singh, S. (1999). Between MDPs and semi-MDPs: A framework for temporal abstraction in reinforcement learning. Artificial Intelligence, 181-211. Thrun, S., & Pratt, L. (Eds.). (1998). Learning to learn. Kluwer Academic Publishers.
4 0.60440451 177 nips-2001-Switch Packet Arbitration via Queue-Learning
Author: Timothy X. Brown
Abstract: In packet switches, packets queue at switch inputs and contend for outputs. The contention arbitration policy directly affects switch performance. The best policy depends on the current state of the switch and current traffic patterns. This problem is hard because the state space, possible transitions, and set of actions all grow exponentially with the size of the switch. We present a reinforcement learning formulation of the problem that decomposes the value function into many small independent value functions and enables an efficient action selection.
Author: Gregory Z. Grudic, Lyle H. Ungar
Abstract: We address two open theoretical questions in Policy Gradient Reinforcement Learning. The first concerns the efficacy of using function approximation to represent the state action value function, . Theory is presented showing that linear function approximation representations of can degrade the rate of convergence of performance gradient estimates by a factor of relative to when no function approximation of is used, where is the number of possible actions and is the number of basis functions in the function approximation representation. The second concerns the use of a bias term in estimating the state action value function. Theory is presented showing that a non-zero bias term can improve the rate of convergence of performance gradient estimates by , where is the number of possible actions. Experimental evidence is presented showing that these theoretical results lead to significant improvement in the convergence properties of Policy Gradient Reinforcement Learning algorithms.
6 0.52479315 91 nips-2001-Improvisation and Learning
7 0.51821965 128 nips-2001-Multiagent Planning with Factored MDPs
8 0.50805074 52 nips-2001-Computing Time Lower Bounds for Recurrent Sigmoidal Neural Networks
9 0.50457287 55 nips-2001-Convergence of Optimistic and Incremental Q-Learning
10 0.49360767 12 nips-2001-A Model of the Phonological Loop: Generalization and Binding
11 0.49139562 175 nips-2001-Stabilizing Value Function Approximation with the BFBP Algorithm
12 0.48569831 51 nips-2001-Cobot: A Social Reinforcement Learning Agent
13 0.4576335 40 nips-2001-Batch Value Function Approximation via Support Vectors
14 0.45674565 85 nips-2001-Grammar Transfer in a Second Order Recurrent Neural Network
15 0.45333463 80 nips-2001-Generalizable Relational Binding from Coarse-coded Distributed Representations
16 0.44869548 160 nips-2001-Reinforcement Learning and Time Perception -- a Model of Animal Experiments
17 0.41582429 123 nips-2001-Modeling Temporal Structure in Classical Conditioning
18 0.39067692 173 nips-2001-Speech Recognition with Missing Data using Recurrent Neural Nets
19 0.36848947 187 nips-2001-The Steering Approach for Multi-Criteria Reinforcement Learning
20 0.35165706 3 nips-2001-ACh, Uncertainty, and Cortical Inference
topicId topicWeight
[(14, 0.026), (17, 0.032), (19, 0.041), (22, 0.232), (27, 0.083), (30, 0.097), (38, 0.045), (58, 0.01), (59, 0.026), (72, 0.061), (79, 0.049), (83, 0.052), (91, 0.14)]
simIndex simValue paperId paperTitle
same-paper 1 0.83854687 161 nips-2001-Reinforcement Learning with Long Short-Term Memory
Author: Bram Bakker
Abstract: This paper presents reinforcement learning with a Long Short-Term Memory recurrent neural network: RL-LSTM. Model-free RL-LSTM using Advantage(λ) learning and directed exploration can solve non-Markovian tasks with long-term dependencies between relevant events. This is demonstrated in a T-maze task, as well as in a difficult variation of the pole balancing task.
2 0.65560031 102 nips-2001-KLD-Sampling: Adaptive Particle Filters
Author: Dieter Fox
Abstract: Over the last years, particle filters have been applied with great success to a variety of state estimation problems. We present a statistical approach to increasing the efficiency of particle filters by adapting the size of sample sets on-the-fly. The key idea of the KLD-sampling method is to bound the approximation error introduced by the sample-based representation of the particle filter. The name KLD-sampling is due to the fact that we measure the approximation error by the Kullback-Leibler distance. Our adaptation approach chooses a small number of samples if the density is focused on a small part of the state space, and it chooses a large number of samples if the state uncertainty is high. Both the implementation and computation overhead of this approach are small. Extensive experiments using mobile robot localization as a test application show that our approach yields drastic improvements over particle filters with fixed sample set sizes and over a previously introduced adaptation technique.
3 0.6544379 160 nips-2001-Reinforcement Learning and Time Perception -- a Model of Animal Experiments
Author: Jonathan L. Shapiro, J. Wearden
Abstract: Animal data on delayed-reward conditioning experiments shows a striking property - the data for different time intervals collapses into a single curve when the data is scaled by the time interval. This is called the scalar property of interval timing. Here a simple model of a neural clock is presented and shown to give rise to the scalar property. The model is an accumulator consisting of noisy, linear spiking neurons. It is analytically tractable and contains only three parameters. When coupled with reinforcement learning it simulates peak procedure experiments, producing both the scalar property and the pattern of single trial covariances. 1
4 0.65327662 162 nips-2001-Relative Density Nets: A New Way to Combine Backpropagation with HMM's
Author: Andrew D. Brown, Geoffrey E. Hinton
Abstract: Logistic units in the first hidden layer of a feedforward neural network compute the relative probability of a data point under two Gaussians. This leads us to consider substituting other density models. We present an architecture for performing discriminative learning of Hidden Markov Models using a network of many small HMM's. Experiments on speech data show it to be superior to the standard method of discriminatively training HMM's. 1
5 0.65182507 52 nips-2001-Computing Time Lower Bounds for Recurrent Sigmoidal Neural Networks
Author: M. Schmitt
Abstract: Recurrent neural networks of analog units are computers for realvalued functions. We study the time complexity of real computation in general recurrent neural networks. These have sigmoidal, linear, and product units of unlimited order as nodes and no restrictions on the weights. For networks operating in discrete time, we exhibit a family of functions with arbitrarily high complexity, and we derive almost tight bounds on the time required to compute these functions. Thus, evidence is given of the computational limitations that time-bounded analog recurrent neural networks are subject to. 1
6 0.65171003 182 nips-2001-The Fidelity of Local Ordinal Encoding
7 0.64955962 100 nips-2001-Iterative Double Clustering for Unsupervised and Semi-Supervised Learning
8 0.64910018 13 nips-2001-A Natural Policy Gradient
10 0.6471352 56 nips-2001-Convolution Kernels for Natural Language
11 0.6456362 169 nips-2001-Small-World Phenomena and the Dynamics of Information
12 0.64515448 46 nips-2001-Categorization by Learning and Combining Object Parts
13 0.64380866 149 nips-2001-Probabilistic Abstraction Hierarchies
14 0.64310914 195 nips-2001-Variance Reduction Techniques for Gradient Estimates in Reinforcement Learning
15 0.64173841 123 nips-2001-Modeling Temporal Structure in Classical Conditioning
16 0.64073062 131 nips-2001-Neural Implementation of Bayesian Inference in Population Codes
17 0.6404742 65 nips-2001-Effective Size of Receptive Fields of Inferior Temporal Visual Cortex Neurons in Natural Scenes
18 0.64031518 150 nips-2001-Probabilistic Inference of Hand Motion from Neural Activity in Motor Cortex
19 0.64014852 121 nips-2001-Model-Free Least-Squares Policy Iteration
20 0.64009553 183 nips-2001-The Infinite Hidden Markov Model