nips nips2004 nips2004-33 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: François Rivest, Yoshua Bengio, John Kalaska
Abstract: Successful application of reinforcement learning algorithms often involves considerable hand-crafting of the necessary non-linear features to reduce the complexity of the value functions and hence to promote convergence of the algorithm. In contrast, the human brain readily and autonomously finds the complex features when provided with sufficient training. Recent work in machine learning and neurophysiology has demonstrated the role of the basal ganglia and the frontal cortex in mammalian reinforcement learning. This paper develops and explores new reinforcement learning algorithms inspired by neurological evidence that provides potential new approaches to the feature construction problem. The algorithms are compared and evaluated on the Acrobot task. 1
Reference: text
sentIndex sentText sentNum sentScore
1 Abstract: Successful application of reinforcement learning algorithms often involves considerable hand-crafting of the necessary non-linear features to reduce the complexity of the value functions and hence to promote convergence of the algorithm. [sent-9, score-0.243]
2 Recent work in machine learning and neurophysiology has demonstrated the role of the basal ganglia and the frontal cortex in mammalian reinforcement learning. [sent-11, score-1.437]
3 This paper develops and explores new reinforcement learning algorithms inspired by neurological evidence that provides potential new approaches to the feature construction problem. [sent-12, score-0.346]
4 Reinforcement learning with non-linear function approximators like backpropagation networks attempts to address this problem, but in many cases has been demonstrated to be non-convergent [2]. [sent-15, score-0.121]
5 In parallel, recent work in neurophysiology shows that the basal ganglia can be modeled by an actor-critic version of temporal difference (TD) learning [4][5][6], a well-known reinforcement learning algorithm. [sent-17, score-1.061]
6 However, the basal ganglia do not, by themselves, solve the problem of finding complex features. [sent-18, score-0.738]
7 But the frontal cortex, which is known to play an important role in planning and decision-making, is tightly linked with the basal ganglia. [sent-19, score-0.564]
8 This paper presents new algorithms based on current neurophysiological evidence about brain functional organization. [sent-24, score-0.099]
9 It tries to devise biologically plausible algorithms that may help overcome existing difficulties in machine reinforcement learning. [sent-25, score-0.284]
10 2 Biological Background The mammalian brain has multiple learning subsystems. [sent-28, score-0.107]
11 Major learning components include the neocortex, the hippocampal formation (explicit memory storage system), the cerebellum (adaptive control system) and the basal ganglia (reinforcement learning, also known as instrumental conditioning). [sent-29, score-0.784]
12 The cortex can be argued to be equipotent, meaning that, given the same input, any region can learn to perform the same computation. [sent-30, score-0.227]
13 Nevertheless, the frontal lobe differs by receiving a particularly prominent innervation of a specific type of neurotransmitter, namely dopamine. [sent-31, score-0.251]
14 The large frontal lobe in primates, and especially in humans, distinguishes them from lower mammals. [sent-32, score-0.251]
15 Other regions of the cortex have been modeled using unsupervised learning methods such as ICA [7], but models of learning in the frontal cortex are only beginning to emerge. [sent-33, score-0.794]
16 The frontal dopaminergic input arises in parts of the basal ganglia called the ventral tegmental area (VTA) and the substantia nigra (SN). [sent-34, score-1.038]
17 The signal generated by dopaminergic (DA) neurons resembles the effective reinforcement signal of temporal difference (TD) learning algorithms [5][8]. [sent-35, score-0.544]
18 Another important part of the basal ganglia is the striatum. [sent-36, score-0.705]
19 Its two compartments, the striosome and the matriosome, both receive input from the cortex (mostly frontal) and from the DA neurons, but the striosome projects principally to the DA neurons in the VTA and SN. [sent-38, score-0.436]
20 The striosome is hypothesized to act as a reward predictor, allowing the DA signal to compute the difference between the expected and received reward. [sent-39, score-0.255]
21 The matriosome projects back to the frontal lobe (for example, to the motor cortex). [sent-40, score-0.368]
22 Its hypothesized role is therefore in action selection [4][5][6]. [sent-41, score-0.095]
23 Although there have been several attempts to model the interactions between the frontal cortex and basal ganglia, little work has been done on learning in the frontal cortex. [sent-42, score-1.012]
24 In [9], an adaptive learning system based on the cerebellum and the basal ganglia is proposed. [sent-43, score-0.809]
25 In [10], a reinforcement learning model of the hippocampus is presented. [sent-44, score-0.279]
26 In this paper, we do not attempt to model neurophysiological data per se, but rather to develop, from current neurophysiological knowledge, new and efficient biologically plausible reinforcement learning algorithms. [sent-45, score-0.485]
27 The first layer (I) is the input layer, where activation represents the current state. [sent-47, score-0.353]
28 The second layer, the hidden layer (H), is responsible for finding the non-linear features necessary to solve the task. [sent-48, score-0.443]
29 Learning in this layer will vary from model to model. [sent-49, score-0.281]
30 Both the input and the hidden layer feed the parallel actor-critic layers (A and V) which are the computational analogs of the striatal matriosome and striosome, respectively. [sent-50, score-0.573]
31 The neurological literature reports an uplink from V and the reward to the DA neurons, which send back the effective reinforcement signal e (dashed lines) to A, V and H. [sent-52, score-0.48]
32 The A action units usually feed into the motor cortex, which controls muscle activation. [sent-53, score-0.238]
33 The basal ganglia receive input mainly from the frontal cortex and the dopaminergic signal (e). [sent-55, score-1.37]
34 They also receive some input from parietal cortex (which, as opposed to the frontal cortex, does not receive DA input, and hence, may be unsupervised). [sent-56, score-0.538]
35 H will represent frontal cortex when given e and non-frontal cortex when not. [sent-57, score-0.646]
36 The weights W, v and U correspond to weights into the layers A, V and H respectively (e is not weighted). [sent-58, score-0.151]
37 Figure 1: Architecture of the models (sensory input I; cortex H with weights U; striatal layers A and V with weights W and v; DA neurons driven by V and the reward, broadcasting the effective reinforcement signal e). [sent-59, score-0.126]
38 Let xt be the vector of the input layer activations based on the state of the environment at time t. [sent-60, score-0.451]
39 Let f be the sigmoidal activation function of hidden units in H. [sent-61, score-0.322]
40 Then yt = [f(u1xt), …, f(unxt)]T is the vector of activations of the hidden layer at time t, where ui is a row of the weight matrix U. [sent-62, score-0.539]
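A minimal sketch of this forward pass in Python/NumPy may help fix the notation. It assumes, as described above, that both I and H feed A and V, so zt is taken to be the concatenation of xt and yt; the function and variable names are illustrative, not the authors' code.

import numpy as np

def forward(x_t, U, W, v):
    # One forward pass through the architecture of Figure 1.
    # x_t: input activations (state encoding), shape (n_in,)
    # U:   hidden-layer weights, one row u_i per hidden unit, shape (n_hid, n_in)
    # W:   actor weights into A, shape (n_actions, n_in + n_hid)
    # v:   critic weights into V, shape (n_in + n_hid,)
    f = lambda s: 1.0 / (1.0 + np.exp(-s))    # sigmoidal activation of the hidden units
    y_t = f(U @ x_t)                          # y_t = [f(u_1 x_t), ..., f(u_n x_t)]^T
    z_t = np.concatenate([x_t, y_t])          # both I and H feed the actor-critic layers
    b_t = W @ z_t                             # actor activations before winner-take-all
    a_t = int(np.argmax(b_t))                 # winning action index a_t
    V_t = float(v @ z_t)                      # critic's (linear) reward prediction
    return y_t, z_t, b_t, a_t, V_t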
41 1 Actor-critic The actor-critic model of the basal ganglia developed here is derived from [4]. [sent-65, score-0.705]
42 It is very similar to the basal ganglia model in [5] which has been used to simulate neurophysiological data recorded while monkeys were learning a task [6]. [sent-66, score-0.824]
43 All units are linear weighted sums of activity from the previous layers. [sent-67, score-0.167]
44 Beginning with an overestimate of the expected reward leads every action to be negatively corrected, one after the other, until only the best one remains. [sent-71, score-0.161]
45 Let bt = Wzt be the vector of activations of the actor layer before the winner-take-all processing. [sent-74, score-0.515]
46 Let at = argmax(bt,i) be the winning action index at time t, and let the vector ct be the activation of layer A after the winner-take-all processing, such that ct,a = 1 if a = at and 0 otherwise. [sent-75, score-0.483]
47 In order to do so, it updates V such that V(zt−1) → E[rt + γV(zt)], where rt is the reward at time t and γ the discount factor. [sent-79, score-0.255]
48 A learning rule for the weights v of V can then be devised by finding the gradient of E with respect to the weights v. [sent-81, score-0.3]
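The cost E itself is not written out in the excerpt; a plausible reconstruction, consistent with the gradient quoted next, is the squared TD error (in LaTeX notation, with e_t the effective reinforcement):

e_t = r_t + \gamma\, v^\top z_t - v^\top z_{t-1}, \qquad E = e_t^2, \qquad \frac{\partial E}{\partial v} = 2\, e_t \left(\gamma z_t - z_{t-1}\right).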
49 Thus, the gradient is given by ∂E/∂v = 2et[γzt − zt−1]. Adding a learning rate and negating the gradient for minimization gives the update ∆v = αet[zt−1 − γzt]. Developing a learning rule for the actor units and their weights W using a cost function is a bit more complex. [sent-83, score-0.595]
50 One approach is to use the tri-Hebbian rule ∆W = αet ct−1 zt−1T. Note that only the row vector of weights of the winning action is modified. [sent-84, score-0.232]
51 If the reward is higher than expected (e > 0), then the action units activated by the previous state should be reinforced. [sent-87, score-0.281]
52 Conversely, if it is less than expected (e < 0), then the winning actor unit's activity should be reduced for that state. [sent-88, score-0.225]
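Putting the critic and actor rules together, one learning step can be sketched as below (Python/NumPy, under the assumptions already noted, with et computed as the TD error; not the authors' code).

import numpy as np

def actor_critic_step(z_prev, a_prev, r_t, z_t, W, v, alpha_w, alpha_v, gamma):
    # One update after taking action a_prev in the state with features z_prev,
    # then receiving reward r_t and observing the new features z_t.
    e_t = r_t + gamma * float(v @ z_t) - float(v @ z_prev)   # effective (dopamine-like) reinforcement
    v += alpha_v * e_t * (z_prev - gamma * z_t)              # critic: delta_v = alpha e_t [z_{t-1} - gamma z_t]
    c_prev = np.zeros(W.shape[0])                            # winner-take-all activation of A at t-1
    c_prev[a_prev] = 1.0
    W += alpha_w * e_t * np.outer(c_prev, z_prev)            # tri-Hebbian rule: only the winning row changes
    return e_t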
53 2 Biological justification [4] presented the first description of an actor-critic architecture based on data from the basal ganglia that resembles the one developed here. [sent-92, score-0.778]
54 The major difference is that the V update rule did not use the complete gradient information. [sent-93, score-0.167]
55 The model presented here is simpler and the critic update rule is basically the same, but justified neurologically. [sent-95, score-0.113]
56 Our model also has a more realistic actor update rule consistent with neurological knowledge of plasticity in the corticostriatal synapses [11] (H to V weights). [sent-96, score-0.317]
57 The main purpose of the model presented in [5] was to simulate dopaminergic activity for which V is the most important factor, and in this respect, it was very successful [6]. [sent-97, score-0.189]
58 2 Hidden Layer Because the reinforcement learning layer is linear, the hidden layer must learn the necessary non-linearity to solve the task. [sent-99, score-0.934]
59 The rules below are attempts at neurologically plausible learning rules for the cortex, assuming it has no clear supervision signal other than the DA signal for the frontal cortex. [sent-100, score-0.378]
60 All hidden-unit weight vectors are initialized randomly and scaled to norm 1 after each update. [sent-101, score-0.249]
61 The hidden layer is composed of randomly generated hidden units that are not trained. [sent-103, score-0.659]
62 • ICA In [7], the visual cortex was modeled by an ICA learning rule. [sent-104, score-0.282]
63 If the non-frontal cortex is equipotent, then any region of the cortex could be successfully modeled using such a generic rule. [sent-105, score-0.48]
64 The idea of combining unsupervised learning with reinforcement learning has already proven useful [1], but the unsupervised features were trained prior to the reinforcement training. [sent-106, score-0.558]
65 Here, the ICA rule from [13] will be used to train the hidden layer. [sent-108, score-0.199]
66 This means that the hidden units are learning to reproduce the independent source signals at the origin of the observed mixed signal. [sent-109, score-0.278]
67 • Adaptive ICA (e-ICA) If H represents the frontal cortex, then an interesting variation of ICA is to multiply its update term by the DA signal e. [sent-110, score-0.295]
68 The size of e may act as an adaptive learning rate whose source is the reinforcement learning system critic. [sent-111, score-0.297]
69 Also, if the reward is less than expected (e < 0), the features learned by the ICA unit may be more counterproductive than helpful, and e pushes the learning away from those features. [sent-112, score-0.126]
70 • e-gradient method Another possible approach is to base the update rule on the derivative of the objective function E applied to the hidden layer weights U, but constraining the update rule to only use information available locally. [sent-113, score-0.693]
71 The cortex is unlikely to have access to the critic weights v, and might instead treat them all as equal to some constant. [sent-115, score-0.284]
72 To avoid neurons all moving in the same direction uniformly, we encourage the units on the hidden layer to minimize their covariance. [sent-116, score-0.599]
73 Let qt be the average activity of the hidden units at time t, i. [sent-118, score-0.402]
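Neither the exact ICA rule of [13] nor the full e-gradient rule is given in the excerpt, so the sketch below only illustrates the two ideas: a standard natural-gradient ICA step stands in for the rule of [13], and the e-gradient/covariance part is a reconstruction from the description above. The precise forms, the stand-in nonlinearity, and the penalty weight beta are assumptions.

import numpy as np

def e_ica_update(U, x_t, e_t, lr):
    # e-ICA: a generic natural-gradient ICA step whose size and sign are modulated by e_t.
    # (Stand-in for the rule of [13]; setting e_t = 1 recovers plain unsupervised ICA.)
    y = U @ x_t
    dU = (np.eye(U.shape[0]) - np.outer(np.tanh(y), y)) @ U
    U += lr * e_t * dU
    U /= np.linalg.norm(U, axis=1, keepdims=True)    # rescale each unit's weight vector to norm 1
    return U

def e_gradient_update(U, x_prev, e_t, lr, beta=0.0):
    # e-gradient (reconstruction): treat every critic weight as the same constant, so the
    # backpropagated error at each hidden unit reduces to e_t times the local sigmoid
    # derivative; a covariance-style term pushes units away from their mean activity q_t.
    f = lambda s: 1.0 / (1.0 + np.exp(-s))
    y = f(U @ x_prev)
    q_t = y.mean()                                   # average hidden activity q_t
    local_err = e_t * y * (1.0 - y)                  # e_t times f'(u_i x)
    decorrelate = -beta * (y - q_t)                  # discourage units from all moving together
    U += lr * np.outer(local_err + decorrelate, x_prev)
    U /= np.linalg.norm(U, axis=1, keepdims=True)
    return U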
74 1 The task: Acrobot The input was coded using 12 equidistant radial basis functions for each angle and 13 equidistant radial basis functions for each angular velocity, for a total of 50 nonnegative inputs. [sent-127, score-0.169]
75 A reward of 1 was given only when the final state was reached (in all other cases, the reward of an action was 0). [sent-129, score-0.347]
76 Only 3 actions were available (3 actor units), either -1, 0 or 1 unit of torque. [sent-130, score-0.137]
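A sketch of this state encoding and reward is shown below. The excerpt specifies only the counts (12 RBFs per angle, 13 per angular velocity, reward 1 at the goal, three torques), so the RBF centres, widths, and velocity ranges used here are illustrative assumptions.

import numpy as np

def encode_acrobot_state(theta1, theta2, dtheta1, dtheta2,
                         max_vel1=4 * np.pi, max_vel2=9 * np.pi):
    # 12 equidistant RBFs per angle and 13 per angular velocity: 12 + 12 + 13 + 13 = 50 inputs.
    def rbf(value, centres, sigma):
        return np.exp(-((value - centres) ** 2) / (2.0 * sigma ** 2))

    angle_centres = np.linspace(-np.pi, np.pi, 12)
    vel1_centres = np.linspace(-max_vel1, max_vel1, 13)
    vel2_centres = np.linspace(-max_vel2, max_vel2, 13)
    return np.concatenate([
        rbf(theta1, angle_centres, np.pi / 6),
        rbf(theta2, angle_centres, np.pi / 6),
        rbf(dtheta1, vel1_centres, max_vel1 / 6),
        rbf(dtheta2, vel2_centres, max_vel2 / 6),
    ])                                               # 50 nonnegative inputs

TORQUES = (-1.0, 0.0, 1.0)                           # the three available actions (actor units)

def reward(goal_reached):
    return 1.0 if goal_reached else 0.0              # reward 1 only on reaching the final state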
77 50 networks with different random initializations were run for all models for 100 episodes (an episode is the sequence of steps the network performs to achieve the goal from the start position). [sent-132, score-0.563]
78 A number of learning rate values were tried for each model (actor-critic layer learning rate, and hidden layer learning rate). [sent-134, score-0.778]
79 The selected parameters were the ones for which the average number of steps per episode plus its standard deviation was the lowest. [sent-135, score-0.498]
80 All hidden layer models got a learning rate of 0. [sent-136, score-0.439]
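The selection criterion just described can be expressed in a few lines (hypothetical data layout; illustrative only).

import numpy as np

def select_learning_rate(results):
    # results: dict mapping a candidate learning rate to an array of steps-per-episode
    # values pooled over the 50 networks and 100 episodes.
    scores = {lr: float(np.mean(steps) + np.std(steps)) for lr, steps in results.items()}
    return min(scores, key=scores.get)               # lowest mean-plus-standard-deviation wins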
81 Three variables were compared: overall learning performance (in number of steps to success per episode), final performance (number of steps on the last episode), and early learning performance (number of steps for the first episode). [sent-140, score-0.506]
82 1 Space under the learning curve Figure 3 shows the average steps per episode for each model in decreasing order. [sent-145, score-0.527]
83 All models needed fewer steps on average than baseline (which has no training at the hidden layer). [sent-146, score-0.299]
84 In order to assess the performance of the models, an ANOVA analysis of the average number of steps per episode over the 100 episodes was performed. [sent-147, score-0.615]
85 Scheffé post-hoc analysis revealed that the performance of every model was significantly different from every other, except for e-gradient and e-ICA (which are not significantly different from each other). [sent-148, score-0.186]
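As an illustration of this comparison (not the authors' analysis code), the one-way ANOVA over each network's average steps per episode can be run with SciPy; the Scheffé post-hoc contrasts would then be computed on the same group means.

import numpy as np
from scipy.stats import f_oneway

def compare_models(avg_steps):
    # avg_steps: dict mapping model name -> array of 50 values, one per network,
    # each the average number of steps per episode over the 100 episodes.
    f_stat, p_value = f_oneway(*avg_steps.values())
    return f_stat, p_value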
86 2 Final performance ANOVA analysis was also used to determine the final performance of the models, by comparing the number of steps on the last episode. [sent-151, score-0.208]
87 Figure 4 shows the results on the last episode in increasing order. [sent-153, score-0.377]
88 Under this measure, e-gradient(0) and e-ICA were significantly faster than the baseline and ICA was significantly slower (Figure 5). [sent-161, score-0.296]
89 This implies that the e signal could control the ICA learning to move synergistically with the reinforcement learning system. [sent-165, score-0.332]
90 In this setup, a neural network of 50 inputs, 50 hidden sigmoidal units, and 1 linear output was used as function approximator for V. [sent-168, score-0.159]
91 Although not different from the baseline on the first episode, it was significantly worse on overall and final performance, unable to improve consistently. [sent-172, score-0.258]
92 From available reports, their first trial could be slower than e-gradient(0) but they could reach better final performance after more than 100 episodes with a final average of 75 steps (after 500 episodes). [sent-178, score-0.423]
93 5 Discussion In this paper we explored a new family of biologically plausible reinforcement learning algorithms inspired by models of the basal ganglia and the cortex. [sent-180, score-1.054]
94 They use a linear actor-critic model of the basal ganglia, extended with a variety of unsupervised and partially supervised learning algorithms inspired by brain structures. [sent-181, score-0.845]
95 The results showed that pure unsupervised learning slowed down learning and that a simple quasi-local rule at the hidden layer greatly improved performance. [sent-182, score-0.574]
96 It remains to test them on further tasks, and to compare them to more reinforcement learning algorithms. [sent-185, score-0.243]
97 Possible loops from the actor units to the hidden layer are also to be considered. [sent-186, score-0.667]
98 (2000) Policy gradient methods for reinforcement learning with function approximation. [sent-203, score-0.297]
99 (1999) A neural network model with dopamine-like reinforcement signal that learns a spatial delayed response task. [sent-216, score-0.274]
100 (1999) What are the computations of the cerebellum, the basal ganglia and the cerebral cortex? [sent-239, score-0.705]
simIndex simValue paperId paperTitle
same-paper 1 0.96323419 33 nips-2004-Brain Inspired Reinforcement Learning
Author: Françcois Rivest, Yoshua Bengio, John Kalaska
Abstract: Successful application of reinforcement learning algorithms often involves considerable hand-crafting of the necessary non-linear features to reduce the complexity of the value functions and hence to promote convergence of the algorithm. In contrast, the human brain readily and autonomously finds the complex features when provided with sufficient training. Recent work in machine learning and neurophysiology has demonstrated the role of the basal ganglia and the frontal cortex in mammalian reinforcement learning. This paper develops and explores new reinforcement learning algorithms inspired by neurological evidence that provides potential new approaches to the feature construction problem. The algorithms are compared and evaluated on the Acrobot task. 1
2 0.65392965 183 nips-2004-Temporal-Difference Networks
Author: Richard S. Sutton, Brian Tanner
Abstract: We introduce a generalization of temporal-difference (TD) learning to networks of interrelated predictions. Rather than relating a single prediction to itself at a later time, as in conventional TD methods, a TD network relates each prediction in a set of predictions to other predictions in the set at a later time. TD networks can represent and apply TD learning to a much wider class of predictions than has previously been possible. Using a random-walk example, we show that these networks can be used to learn to predict by a fixed interval, which is not possible with conventional TD methods. Secondly, we show that if the interpredictive relationships are made conditional on action, then the usual learning-efficiency advantage of TD methods over Monte Carlo (supervised learning) methods becomes particularly pronounced. Thirdly, we demonstrate that TD networks can learn predictive state representations that enable exact solution of a non-Markov problem. A very broad range of inter-predictive temporal relationships can be expressed in these networks. Overall we argue that TD networks represent a substantial extension of the abilities of TD methods and bring us closer to the goal of representing world knowledge in entirely predictive, grounded terms. Temporal-difference (TD) learning is widely used in reinforcement learning methods to learn moment-to-moment predictions of total future reward (value functions). In this setting, TD learning is often simpler and more data-efficient than other methods. But the idea of TD learning can be used more generally than it is in reinforcement learning. TD learning is a general method for learning predictions whenever multiple predictions are made of the same event over time, value functions being just one example. The most pertinent of the more general uses of TD learning have been in learning models of an environment or task domain (Dayan, 1993; Kaelbling, 1993; Sutton, 1995; Sutton, Precup & Singh, 1999). In these works, TD learning is used to predict future values of many observations or state variables of a dynamical system. The essential idea of TD learning can be described as “learning a guess from a guess”. In all previous work, the two guesses involved were predictions of the same quantity at two points in time, for example, of the discounted future reward at successive time steps. In this paper we explore a few of the possibilities that open up when the second guess is allowed to be different from the first. To be more precise, we must make a distinction between the extensive definition of a prediction, expressing its desired relationship to measurable data, and its TD definition, expressing its desired relationship to other predictions. In reinforcement learning, for example, state values are extensively defined as an expectation of the discounted sum of future rewards, while they are TD defined as the solution to the Bellman equation (a relationship to the expectation of the value of successor states, plus the immediate reward). It’s the same prediction, just defined or expressed in different ways. In past work with TD methods, the TD relationship was always between predictions with identical or very similar extensive semantics. In this paper we retain the TD idea of learning predictions based on others, but allow the predictions to have different extensive semantics. 
1 The Learning-to-predict Problem The problem we consider in this paper is a general one of learning to predict aspects of the interaction between a decision making agent and its environment. At each of a series of discrete time steps t, the environment generates an observation o t ∈ O, and the agent takes an action at ∈ A. Whereas A is an arbitrary discrete set, we assume without loss of generality that ot can be represented as a vector of bits. The action and observation events occur in sequence, o1 , a1 , o2 , a2 , o3 · · ·, with each event of course dependent only on those preceding it. This sequence will be called experience. We are interested in predicting not just each next observation but more general, action-conditional functions of future experience, as discussed in the next section. In this paper we use a random-walk problem with seven states, with left and right actions available in every state: 1 1 0 2 0 3 0 4 0 5 0 6 1 7 The observation upon arriving in a state consists of a special bit that is 1 only at the two ends of the walk and, in the first two of our three experiments, seven additional bits explicitly indicating the state number (only one of them is 1). This is a continuing task: reaching an end state does not end or interrupt experience. Although the sequence depends deterministically on action, we assume that the actions are selected randomly with equal probability so that the overall system can be viewed as a Markov chain. The TD networks introduced in this paper can represent a wide variety of predictions, far more than can be represented by a conventional TD predictor. In this paper we take just a few steps toward more general predictions. In particular, we consider variations of the problem of prediction by a fixed interval. This is one of the simplest cases that cannot otherwise be handled by TD methods. For the seven-state random walk, we will predict the special observation bit some numbers of discrete steps in advance, first unconditionally and then conditioned on action sequences. 2 TD Networks A TD network is a network of nodes, each representing a single scalar prediction. The nodes are interconnected by links representing the TD relationships among the predictions and to the observations and actions. These links determine the extensive semantics of each prediction—its desired or target relationship to the data. They represent what we seek to predict about the data as opposed to how we try to predict it. We think of these links as determining a set of questions being asked about the data, and accordingly we call them the question network. A separate set of interconnections determines the actual computational process—the updating of the predictions at each node from their previous values and the current action and observation. We think of this process as providing the answers to the questions, and accordingly we call them the answer network. The question network provides targets for a learning process shaping the answer network and does not otherwise affect the behavior of the TD network. It is natural to consider changing the question network, but in this paper we take it as fixed and given. Figure 1a shows a suggestive example of a question network. The three squares across the top represent three observation bits. The node labeled 1 is directly connected to the first observation bit and represents a prediction that that bit will be 1 on the next time step. The node labeled 2 is similarly a prediction of the expected value of node 1 on the next step. 
Thus the extensive definition of Node 2’s prediction is the probability that the first observation bit will be 1 two time steps from now. Node 3 similarly predicts the first observation bit three time steps in the future. Node 4 is a conventional TD prediction, in this case of the future discounted sum of the second observation bit, with discount parameter γ. Its target is the familiar TD target, the data bit plus the node’s own prediction on the next time step (with weightings 1 − γ and γ respectively). Nodes 5 and 6 predict the probability of the third observation bit being 1 if particular actions a or b are taken respectively. Node 7 is a prediction of the average of the first observation bit and Node 4’s prediction, both on the next step. This is the first case where it is not easy to see or state the extensive semantics of the prediction in terms of the data. Node 8 predicts another average, this time of nodes 4 and 5, and the question it asks is even harder to express extensively. One could continue in this way, adding more and more nodes whose extensive definitions are difficult to express but which would nevertheless be completely defined as long as these local TD relationships are clear. The thinner links shown entering some nodes are meant to be a suggestion of the entirely separate answer network determining the actual computation (as opposed to the goals) of the network. In this paper we consider only simple question networks such as the left column of Figure 1a and of the action-conditional tree form shown in Figure 1b.

[Figure 1 graphics omitted: panel (a) shows nodes 1-8 with links labeled γ, 1 − γ, a, and b; panel (b) shows a two-level tree with links labeled L and R.]

Figure 1: The question networks of two TD networks. (a) a question network discussed in the text, and (b) a depth-2 fully-action-conditional question network used in Experiments 2 and 3. Observation bits are represented as squares across the top while actual nodes of the TD network, corresponding each to a separate prediction, are below. The thick lines represent the question network and the thin lines in (a) suggest the answer network (the bulk of which is not shown). Note that all of these nodes, arrows, and numbers are completely different and separate from those representing the random-walk problem on the preceding page.

More formally and generally, let $y_t^i \in [0, 1]$, $i = 1, \ldots, n$, denote the prediction of the $i$th node at time step $t$. The column vector of predictions $y_t = (y_t^1, \ldots, y_t^n)^T$ is updated according to a vector-valued function $u$ with modifiable parameter $W$:

$$y_t = u(y_{t-1}, a_{t-1}, o_t, W_t) \in \Re^n. \qquad (1)$$

The update function $u$ corresponds to the answer network, with $W$ being the weights on its links. Before detailing that process, we turn to the question network, the defining TD relationships between nodes. The TD target $z_t^i$ for $y_t^i$ is an arbitrary function $z^i$ of the successive predictions and observations. In vector form we have

$$z_t = z(o_{t+1}, \tilde{y}_{t+1}) \in \Re^n, \qquad (2)$$

where $\tilde{y}_{t+1}$ is just like $y_{t+1}$, as in (1), except calculated with the old weights before they are updated on the basis of $z_t$:

$$\tilde{y}_t = u(y_{t-1}, a_{t-1}, o_t, W_{t-1}) \in \Re^n. \qquad (3)$$

(This temporal subtlety also arises in conventional TD learning.) For example, for the nodes in Figure 1a we have $z_t^1 = o_{t+1}^1$, $z_t^2 = y_{t+1}^1$, $z_t^3 = y_{t+1}^2$, $z_t^4 = (1-\gamma)o_{t+1}^2 + \gamma y_{t+1}^4$, $z_t^5 = z_t^6 = o_{t+1}^3$, $z_t^7 = \frac{1}{2} o_{t+1}^1 + \frac{1}{2} y_{t+1}^4$, and $z_t^8 = \frac{1}{2} y_{t+1}^4 + \frac{1}{2} y_{t+1}^5$. The target functions $z^i$ are only part of specifying the question network. The other part has to do with making them potentially conditional on action and observation.
For example, Node 5 in Figure 1a predicts what the third observation bit will be if action a is taken. To arrange for such semantics we introduce a new vector $c_t$ of conditions, $c_t^i$, indicating the extent to which $y_t^i$ is held responsible for matching $z_t^i$, thus making the $i$th prediction conditional on $c_t^i$. Each $c_t^i$ is determined as an arbitrary function $c^i$ of $a_t$ and $y_t$. In vector form we have:

$$c_t = c(a_t, y_t) \in [0, 1]^n. \qquad (4)$$

For example, for Node 5 in Figure 1a, $c_t^5 = 1$ if $a_t = a$, otherwise $c_t^5 = 0$. Equations (2–4) correspond to the question network. Let us now turn to defining $u$, the update function for $y_t$ mentioned earlier and which corresponds to the answer network. In general $u$ is an arbitrary function approximator, but for concreteness we define it to be of a linear form

$$y_t = \sigma(W_t x_t), \qquad (5)$$

where $x_t \in \Re^m$ is a feature vector, $W_t$ is an $n \times m$ matrix, and $\sigma$ is the $n$-vector form of the identity function (Experiments 1 and 2) or the S-shaped logistic function $\sigma(s) = \frac{1}{1+e^{-s}}$ (Experiment 3). The feature vector is an arbitrary function of the preceding action, observation, and node values:

$$x_t = x(a_{t-1}, o_t, y_{t-1}) \in \Re^m. \qquad (6)$$

For example, $x_t$ might have one component for each observation bit, one for each possible action (one of which is 1, the rest 0), and $n$ more for the previous node values $y_{t-1}$. The learning algorithm for each component $w_t^{ij}$ of $W_t$ is

$$w_{t+1}^{ij} - w_t^{ij} = \alpha \left(z_t^i - y_t^i\right) c_t^i \, \frac{\partial y_t^i}{\partial w_t^{ij}}, \qquad (7)$$

where $\alpha$ is a step-size parameter. The timing details may be clarified by writing the sequence of quantities in the order in which they are computed:

$$y_t \;\; a_t \;\; c_t \;\; o_{t+1} \;\; x_{t+1} \;\; \tilde{y}_{t+1} \;\; z_t \;\; W_{t+1} \;\; y_{t+1}. \qquad (8)$$

Finally, the target in the extensive sense for $y_t$ is

$$y_t^* = E_{t,\pi}\!\left\{ (1 - c_t) \cdot y_{t+1}^* + c_t \cdot z(o_{t+1}, y_{t+1}^*) \right\}, \qquad (9)$$

where $\cdot$ represents component-wise multiplication and $\pi$ is the policy being followed, which is assumed fixed. (Footnote 1: In general, $z$ is a function of all the future predictions and observations, but in this paper we treat only the one-step case.)

3 Experiment 1: n-step Unconditional Prediction

In this experiment we sought to predict the observation bit precisely n steps in advance, for n = 1, 2, 5, 10, and 25. In order to predict n steps in advance, of course, we also have to predict n − 1 steps in advance, n − 2 steps in advance, etc., all the way down to predicting one step ahead. This is specified by a TD network consisting of a single chain of predictions like the left column of Figure 1a, but of length 25 rather than 3. Random-walk sequences were constructed by starting at the center state and then taking random actions for 50, 100, 150, and 200 steps (100 sequences each). We applied a TD network and a corresponding Monte Carlo method to this data. The Monte Carlo method learned the same predictions, but learned them by comparing them to the actual outcomes in the sequence (instead of $z_t^i$ in (7)). This involved significant additional complexity to store the predictions until their corresponding targets were available. Both algorithms used feature vectors of 7 binary components, one for each of the seven states, all of which were zero except for the one corresponding to the current state. Both algorithms formed their predictions linearly ($\sigma(\cdot)$ was the identity) and unconditionally ($c_t^i = 1 \;\forall i, t$). In an initial set of experiments, both algorithms were applied online with a variety of values for their step-size parameter $\alpha$. Under these conditions we did not find that either algorithm was clearly better in terms of the mean square error in their predictions over the data sets.
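A minimal sketch of the update just described, specialized to the Experiment 1 setting (a chain question network, tabular features, identity σ, and $c_t^i = 1$). The step size, run length, and end-state dynamics are illustrative assumptions rather than details taken from the paper:

```python
# Sketch of a TD network for the seven-state random walk (Experiment 1 style):
# a chain question network in which node 0 predicts the special bit one step
# ahead and node i > 0 predicts node i-1's prediction at the next step,
# with linear answers y_t = W x_t and unconditional predictions, per Eqs. (5)-(7).
import numpy as np

N_NODES = 25      # chain length: node i predicts the bit i+1 steps ahead (0-indexed)
N_STATES = 7
ALPHA = 0.1       # step-size parameter (assumed value for illustration)

rng = np.random.default_rng(0)
W = np.zeros((N_NODES, N_STATES))   # answer-network weights

def features(state):                # x_t: one binary component per state
    x = np.zeros(N_STATES)
    x[state] = 1.0
    return x

def obs_bit(state):                 # special bit: 1 only at the two ends of the walk
    return 1.0 if state in (0, N_STATES - 1) else 0.0

state = 3                           # start at the center state
x = features(state)
y = W @ x                           # y_t (identity sigma)

for t in range(10_000):
    action = rng.choice([-1, +1])   # equiprobable left/right policy
    # End-state dynamics are an assumption here: moves off the ends are clipped.
    state = min(max(state + action, 0), N_STATES - 1)
    x_next = features(state)
    y_tilde = W @ x_next            # \tilde{y}_{t+1}, computed with the old weights (Eq. 3)

    # Chain question network: node 0 targets the next observation bit,
    # node i > 0 targets node i-1's prediction at the next step.
    z = np.empty(N_NODES)
    z[0] = obs_bit(state)
    z[1:] = y_tilde[:-1]

    W += ALPHA * np.outer(z - y, x)  # Eq. (7) with c_t^i = 1 and linear y
    x, y = x_next, W @ x_next        # y_{t+1} with the updated weights (Eq. 8 ordering)
```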
We found a clearer result when both algorithms were trained using batch updating, in which weight changes are collected “on the side” over an experience sequence and then made all at once at the end, and the whole process is repeated until convergence. Under batch updating, convergence is to the same predictions regardless of initial conditions or α value (as long as α is sufficiently small), which greatly simplifies comparison of algorithms. The predictions learned under batch updating are also the same as would be computed by least squares algorithms such as LSTD(λ) (Bradtke & Barto, 1996; Boyan, 2000; Lagoudakis & Parr, 2003). The errors in the final predictions are shown in Table 1. For 1-step predictions, the Monte-Carlo and TD methods performed identically of course, but for longer predictions a significant difference was observed. The RMSE of the Monte Carlo method increased with prediction length whereas for the TD network it decreased. The largest standard error in any of the numbers shown in the table is 0.008, so almost all of the differences are statistically significant. TD methods appear to have a significant data-efficiency advantage over non-TD methods in this prediction-by-n context (and this task) just as they do in conventional multi-step prediction (Sutton, 1988).

Time Steps   1-step    2-step          5-step          10-step         25-step
             MC/TD     MC      TD      MC      TD      MC      TD      MC      TD
50           0.205     0.219   0.172   0.234   0.159   0.249   0.139   0.297   0.129
100          0.124     0.133   0.100   0.160   0.098   0.168   0.079   0.187   0.068
150          0.089     0.103   0.073   0.121   0.076   0.130   0.063   0.153   0.054
200          0.076     0.084   0.060   0.109   0.065   0.112   0.056   0.118   0.049

Table 1: RMSE of Monte-Carlo and TD-network predictions of various lengths and for increasing amounts of training data on the random-walk example with batch updating.

4 Experiment 2: Action-conditional Prediction

The advantage of TD methods should be greater for predictions that apply only when the experience sequence unfolds in a particular way, such as when a particular sequence of actions is made. In a second experiment we sought to learn n-step-ahead predictions conditional on action selections. The question network for learning all 2-step-ahead predictions is shown in Figure 1b. The upper two nodes predict the observation bit conditional on taking a left action (L) or a right action (R). The lower four nodes correspond to the two-step predictions, e.g., the second lower node is the prediction of what the observation bit will be if an L action is taken followed by an R action. These predictions are the same as the e-tests used in some of the work on predictive state representations (Littman, Sutton & Singh, 2002; Rudary & Singh, 2003). In this experiment we used a question network like that in Figure 1b except of depth four, consisting of 30 (2+4+8+16) nodes. The conditions for each node were set to 0 or 1 depending on whether the action taken on the step matched that indicated in the figure. The feature vectors were as in the previous experiment. Now that we are conditioning on action, the problem is deterministic and α can be set uniformly to 1. A Monte Carlo prediction can be learned only when its corresponding action sequence occurs in its entirety, but then it is complete and accurate in one step. The TD network, on the other hand, can learn from incomplete sequences but must propagate them back one level at a time. First the one-step predictions must be learned, then the two-step predictions from them, and so on. The results for online and batch training are shown in Tables 2 and 3.
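For concreteness, a sketch of how the depth-4 action-conditional question network just described can be enumerated, with each node identified by an action sequence. The indexing scheme and function names are assumptions made for illustration, not the authors' code:

```python
# Depth-d fully-action-conditional question network (Experiment 2 style).
# A node for action sequence (a1, ..., ad) predicts the special observation bit
# after that sequence.  Its condition is 1 only when the action taken matches a1,
# and its target is the next observation bit (depth 1) or the prediction at t+1
# of the node for the remainder of the sequence (a2, ..., ad).
from itertools import product

ACTIONS = ['L', 'R']
DEPTH = 4

# Enumerate nodes: all action sequences of length 1..DEPTH (2+4+8+16 = 30 nodes).
nodes = [seq for d in range(1, DEPTH + 1) for seq in product(ACTIONS, repeat=d)]
index = {seq: i for i, seq in enumerate(nodes)}

def condition(node_seq, action_taken):
    """c_t^k: the node is responsible for its target only if the action matches."""
    return 1.0 if node_seq[0] == action_taken else 0.0

def target(node_seq, next_obs_bit, y_tilde_next):
    """z_t^k: the data bit for depth-1 nodes, otherwise the child node's prediction."""
    if len(node_seq) == 1:
        return next_obs_bit
    return y_tilde_next[index[node_seq[1:]]]

print(len(nodes))  # 30 nodes for DEPTH = 4
```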
As anticipated, the TD network learns much faster than Monte Carlo with both online and batch updating. Because the TD network learns its n-step predictions based on its (n − 1)-step predictions, it has a clear advantage for this task. Once the TD network has seen each action in each state, it can quickly learn any prediction 2, 10, or 1000 steps in the future. Monte Carlo, on the other hand, must sample actual sequences, so each exact action sequence must be observed.

Time Steps   1-Step    2-Step          3-Step          4-Step
             MC/TD     MC      TD      MC      TD      MC      TD
100          0.153     0.222   0.182   0.253   0.195   0.285   0.185
200          0.019     0.092   0.044   0.142   0.054   0.196   0.062
300          0.000     0.040   0.000   0.089   0.013   0.139   0.017
400          0.000     0.019   0.000   0.055   0.000   0.093   0.000
500          0.000     0.019   0.000   0.038   0.000   0.062   0.000

Table 2: RMSE of the action-conditional predictions of various lengths for Monte-Carlo and TD-network methods on the random-walk problem with online updating.

Time Steps   50        100       150       200
MC           53.48%    30.81%    19.26%    11.69%
TD           17.21%    4.50%     1.57%     0.14%

Table 3: Average proportion of incorrect action-conditional predictions for batch-updating versions of Monte-Carlo and TD-network methods, for various amounts of data, on the random-walk task. All differences are statistically significant.

5 Experiment 3: Learning a Predictive State Representation

Experiments 1 and 2 showed advantages for TD learning methods in Markov problems. The feature vectors in both experiments provided complete information about the nominal state of the random walk. In Experiment 3, on the other hand, we applied TD networks to a non-Markov version of the random-walk example, in particular, one in which only the special observation bit was visible and not the state number. In this case it is not possible to make accurate predictions based solely on the current action and observation; the previous time step’s predictions must be used as well. As in the previous experiment, we sought to learn n-step predictions using action-conditional question networks of depths 2, 3, and 4. The feature vector $x_t$ consisted of three parts: a constant 1, four binary features to represent the pair of action $a_{t-1}$ and observation bit $o_t$, and $n$ more features corresponding to the components of $y_{t-1}$. The feature vectors were thus of length m = 11, 19, and 35 for the three depths. In this experiment, $\sigma(\cdot)$ was the S-shaped logistic function. The initial weights $W_0$ and predictions $y_0$ were both 0. Fifty random-walk sequences were constructed, each of 250,000 time steps, and presented to TD networks of the three depths, with a range of step-size parameters α. We measured the RMSE of all predictions made by the networks (computed from knowledge of the task) and also the “empirical RMSE,” the error in the one-step prediction for the action actually taken on each step. We found that in all cases the errors approached zero over time, showing that the problem was completely solved. Figure 2 shows some representative learning curves for the depth-2 and depth-4 TD networks.

[Figure 2 graph omitted: curves of empirical RMS error (0–0.3) versus time steps (0–250K) for step sizes α = .1, .25, .5, and .75, plus one depth-2 network.]

Figure 2: Prediction performance on the non-Markov random walk with depth-4 TD networks (and one depth-2 network) with various step-size parameters, averaged over 50 runs and 1000 time-step bins. The “bump” most clearly seen with small step sizes is reliably present and may be due to predictions of different lengths being learned at different times.
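A minimal sketch of the Experiment 3 answer network: the feature construction and the logistic form of update (7). The encoding of the action-observation pair and the function signatures are assumptions made for illustration:

```python
# Feature vector x_t = (1, four bits for the (a_{t-1}, o_t) pair, y_{t-1}) and the
# logistic answer network y_t = sigma(W_t x_t) of Eq. (5), used when only the
# special observation bit is visible (the non-Markov case).
import numpy as np

def sigmoid(s):
    return 1.0 / (1.0 + np.exp(-s))

def make_features(prev_action, obs_bit, y_prev):
    """x_t: a constant 1, four binary features for the (a_{t-1}, o_t) pair, and the
    previous predictions y_{t-1}; length m = 5 + n (assumes prev_action, obs_bit in {0, 1})."""
    pair = np.zeros(4)
    pair[2 * prev_action + obs_bit] = 1.0
    return np.concatenate(([1.0], pair, y_prev))

def td_network_step(W, x, y, action, obs_next, conditions_fn, targets_fn, alpha):
    """One step of Eqs. (1)-(8) with the logistic answer network."""
    x_next = make_features(action, obs_next, y)
    y_tilde = sigmoid(W @ x_next)          # \tilde{y}_{t+1}, old weights (Eq. 3)
    z = targets_fn(obs_next, y_tilde)      # question-network targets (Eq. 2)
    c = conditions_fn(action, y)           # conditions (Eq. 4)
    # Eq. (7): for the logistic sigma, dy_i/dw_ij = y_i (1 - y_i) x_j
    W += alpha * np.outer(c * (z - y) * y * (1.0 - y), x)
    y_next = sigmoid(W @ x_next)           # y_{t+1} with the updated weights
    return W, x_next, y_next
```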
In ongoing experiments on other non-Markov problems we have found that TD networks do not always find such complete solutions. Other problems seem to require more than one step of history information (the one-step-preceding action and observation), though less than would be required using history information alone. Our results as a whole suggest that TD networks may provide an effective alternative learning algorithm for predictive state representations (Littman et al., 2002). Previous algorithms have been found to be effective on some tasks but not on others (e.g., Singh et al., 2003; Rudary & Singh, 2004; James & Singh, 2004). More work is needed to assess the range of effectiveness and learning rate of TD methods vis-a-vis previous methods, and to explore their combination with history information.

6 Conclusion

TD networks suggest a large set of possibilities for learning to predict, and in this paper we have begun exploring the first few. Our results show that even in a fully observable setting there may be significant advantages to TD methods when learning TD-defined predictions. Our action-conditional results show that TD methods can learn dramatically faster than other methods. TD networks allow the expression of many new kinds of predictions whose extensive semantics is not immediately clear, but which are ultimately fully grounded in data. It may be fruitful to further explore the expressive potential of TD-defined predictions. Although most of our experiments have concerned the representational expressiveness and efficiency of TD-defined predictions, it is also natural to consider using them as state, as in predictive state representations. Our experiments suggest that this is a promising direction and that TD learning algorithms may have advantages over previous learning methods. Finally, we note that adding nodes to a question network produces new predictions and thus may be a way to address the discovery problem for predictive representations.

Acknowledgments

The authors gratefully acknowledge the ideas and encouragement they have received in this work from Satinder Singh, Doina Precup, Michael Littman, Mark Ring, Vadim Bulitko, Eddie Rafols, Anna Koop, Tao Wang, and all the members of the rlai.net group.

References

Boyan, J. A. (2000). Technical update: Least-squares temporal difference learning. Machine Learning 49:233–246.

Bradtke, S. J. and Barto, A. G. (1996). Linear least-squares algorithms for temporal difference learning. Machine Learning 22(1/2/3):33–57.

Dayan, P. (1993). Improving generalization for temporal difference learning: The successor representation. Neural Computation 5(4):613–624.

James, M. and Singh, S. (2004). Learning and discovery of predictive state representations in dynamical systems with reset. In Proceedings of the Twenty-First International Conference on Machine Learning, pp. 417–424.

Kaelbling, L. P. (1993). Hierarchical learning in stochastic domains: Preliminary results. In Proceedings of the Tenth International Conference on Machine Learning, pp. 167–173.

Lagoudakis, M. G. and Parr, R. (2003). Least-squares policy iteration. Journal of Machine Learning Research 4(Dec):1107–1149.

Littman, M. L., Sutton, R. S. and Singh, S. (2002). Predictive representations of state. In Advances in Neural Information Processing Systems 14:1555–1561.

Rudary, M. R. and Singh, S. (2004). A nonlinear predictive state representation. In Advances in Neural Information Processing Systems 16:855–862.

Singh, S., Littman, M. L., Jong, N. K., Pardoe, D. and Stone, P.
(2003). Learning predictive state representations. In Proceedings of the Twentieth International Conference on Machine Learning, pp. 712–719.

Sutton, R. S. (1988). Learning to predict by the methods of temporal differences. Machine Learning 3:9–44.

Sutton, R. S. (1995). TD models: Modeling the world at a mixture of time scales. In A. Prieditis and S. Russell (eds.), Proceedings of the Twelfth International Conference on Machine Learning, pp. 531–539. Morgan Kaufmann, San Francisco.

Sutton, R. S., Precup, D. and Singh, S. (1999). Between MDPs and semi-MDPs: A framework for temporal abstraction in reinforcement learning. Artificial Intelligence 112:181–211.
3 0.60870111 155 nips-2004-Responding to Modalities with Different Latencies
Author: Fredrik Bissmarck, Hiroyuki Nakahara, Kenji Doya, Okihide Hikosaka
Abstract: Motor control depends on sensory feedback in multiple modalities with different latencies. In this paper we consider, within the framework of reinforcement learning, how different sensory modalities can be combined and selected for real-time, optimal movement control. We propose an actor-critic architecture with multiple modules, whose outputs are combined using a softmax function. We tested our architecture in a simulation of a sequential reaching task. Reaching was initially guided by visual feedback with a long latency. Our learning scheme allowed the agent to utilize the somatosensory feedback with shorter latency when the hand is near the experienced trajectory. In simulations with different latencies for visual and somatosensory feedback, we found that the agent depended more on feedback with shorter latency. 1
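A minimal sketch of the softmax combination of module outputs mentioned in the abstract above. How the modules are weighted, the module values, and the temperature parameter are illustrative assumptions, not details from that paper:

```python
# Combine the motor outputs of several controller modules with softmax weights.
import numpy as np

def softmax_combine(module_outputs, module_scores, temperature=1.0):
    """Weight each module's output by a softmax over per-module scores."""
    s = np.asarray(module_scores, dtype=float) / temperature
    weights = np.exp(s - s.max())          # subtract max for numerical stability
    weights /= weights.sum()
    combined = weights @ np.asarray(module_outputs, dtype=float)
    return combined, weights

# Hypothetical example: a fast somatosensory module and a slow visual module
# each proposing a 2-D motor command; the first currently has a higher score.
command, w = softmax_combine(module_outputs=[[0.2, -0.1], [0.5, 0.0]],
                             module_scores=[1.3, 0.4])
```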
4 0.53524297 104 nips-2004-Linear Multilayer Independent Component Analysis for Large Natural Scenes
Author: Yoshitatsu Matsuda, Kazunori Yamaguchi
Abstract: In this paper, linear multilayer ICA (LMICA) is proposed for extracting independent components from quite high-dimensional observed signals such as large-size natural scenes. There are two phases in each layer of LMICA. One is the mapping phase, where a one-dimensional mapping is formed by a stochastic gradient algorithm which incrementally brings more highly-correlated (non-independent) signals nearer to each other. The other is the local-ICA phase, where each neighboring (namely, highly-correlated) pair of signals in the mapping is separated by the MaxKurt algorithm. Because LMICA separates only the highly-correlated pairs instead of all pairs, it can extract independent components quite efficiently from appropriate observed signals. In addition, it is proved that LMICA always converges. Some numerical experiments verify that LMICA is quite efficient and effective in large-size natural image processing.
5 0.52090585 84 nips-2004-Inference, Attention, and Decision in a Bayesian Neural Architecture
Author: Angela J. Yu, Peter Dayan
Abstract: We study the synthesis of neural coding, selective attention and perceptual decision making. A hierarchical neural architecture is proposed, which implements Bayesian integration of noisy sensory input and topdown attentional priors, leading to sound perceptual discrimination. The model offers an explicit explanation for the experimentally observed modulation that prior information in one stimulus feature (location) can have on an independent feature (orientation). The network’s intermediate levels of representation instantiate known physiological properties of visual cortical neurons. The model also illustrates a possible reconciliation of cortical and neuromodulatory representations of uncertainty. 1
6 0.42325985 144 nips-2004-Parallel Support Vector Machines: The Cascade SVM
7 0.38848779 159 nips-2004-Schema Learning: Experience-Based Construction of Predictive Action Models
8 0.37392098 193 nips-2004-Theories of Access Consciousness
10 0.35688978 180 nips-2004-Synchronization of neural networks by mutual learning and its application to cryptography
11 0.34852585 48 nips-2004-Convergence and No-Regret in Multiagent Learning
12 0.33987609 121 nips-2004-Modeling Nonlinear Dependencies in Natural Images using Mixture of Laplacian Distribution
13 0.32638523 154 nips-2004-Resolving Perceptual Aliasing In The Presence Of Noisy Sensors
14 0.30205238 64 nips-2004-Experts in a Markov Decision Process
15 0.3013539 206 nips-2004-Worst-Case Analysis of Selective Sampling for Linear-Threshold Algorithms
16 0.29572123 172 nips-2004-Sparse Coding of Natural Images Using an Overcomplete Set of Limited Capacity Units
17 0.2839154 86 nips-2004-Instance-Specific Bayesian Model Averaging for Classification
18 0.27592561 88 nips-2004-Intrinsically Motivated Reinforcement Learning
19 0.26374611 12 nips-2004-A Temporal Kernel-Based Model for Tracking Hand Movements from Neural Activities
20 0.26294947 75 nips-2004-Heuristics for Ordering Cue Search in Decision Making
topicId topicWeight
[(13, 0.076), (15, 0.119), (19, 0.36), (26, 0.044), (31, 0.03), (33, 0.14), (35, 0.045), (39, 0.036), (50, 0.025), (51, 0.011), (54, 0.012), (81, 0.011)]
simIndex simValue paperId paperTitle
same-paper 1 0.77678394 33 nips-2004-Brain Inspired Reinforcement Learning
Author: François Rivest, Yoshua Bengio, John Kalaska
Abstract: Successful application of reinforcement learning algorithms often involves considerable hand-crafting of the necessary non-linear features to reduce the complexity of the value functions and hence to promote convergence of the algorithm. In contrast, the human brain readily and autonomously finds the complex features when provided with sufficient training. Recent work in machine learning and neurophysiology has demonstrated the role of the basal ganglia and the frontal cortex in mammalian reinforcement learning. This paper develops and explores new reinforcement learning algorithms inspired by neurological evidence that provides potential new approaches to the feature construction problem. The algorithms are compared and evaluated on the Acrobot task. 1
2 0.70002902 172 nips-2004-Sparse Coding of Natural Images Using an Overcomplete Set of Limited Capacity Units
Author: Eizaburo Doi, Michael S. Lewicki
Abstract: It has been suggested that the primary goal of the sensory system is to represent input in such a way as to reduce the high degree of redundancy. Given a noisy neural representation, however, solely reducing redundancy is not desirable, since redundancy is the only clue to reduce the effects of noise. Here we propose a model that best balances redundancy reduction and redundant representation. Like previous models, our model accounts for the localized and oriented structure of simple cells, but it also predicts a different organization for the population. With noisy, limited-capacity units, the optimal representation becomes an overcomplete, multi-scale representation, which, compared to previous models, is in closer agreement with physiological data. These results offer a new perspective on the expansion of the number of neurons from retina to V1 and provide a theoretical model of incorporating useful redundancy into efficient neural representations. 1
3 0.57242078 178 nips-2004-Support Vector Classification with Input Data Uncertainty
Author: Jinbo Bi, Tong Zhang
Abstract: This paper investigates a new learning model in which the input data is corrupted with noise. We present a general statistical framework to tackle this problem. Based on the statistical reasoning, we propose a novel formulation of support vector classification, which allows uncertainty in input data. We derive an intuitive geometric interpretation of the proposed formulation, and develop algorithms to efficiently solve it. Empirical results are included to show that the newly formed method is superior to the standard SVM for problems with noisy input. 1
4 0.50110656 131 nips-2004-Non-Local Manifold Tangent Learning
Author: Yoshua Bengio, Martin Monperrus
Abstract: We claim and present arguments to the effect that a large class of manifold learning algorithms that are essentially local and can be framed as kernel learning algorithms will suffer from the curse of dimensionality, at the dimension of the true underlying manifold. This observation suggests exploring non-local manifold learning algorithms which attempt to discover shared structure in the tangent planes at different positions. A criterion for such an algorithm is proposed and experiments estimating a tangent plane prediction function are presented, showing its advantages with respect to local manifold learning algorithms: it is able to generalize very far from training data (on learning handwritten character image rotations), where a local non-parametric method fails. 1
5 0.49720988 70 nips-2004-Following Curved Regularized Optimization Solution Paths
Author: Saharon Rosset
Abstract: Regularization plays a central role in the analysis of modern data, where non-regularized fitting is likely to lead to over-fitted models, useless for both prediction and interpretation. We consider the design of incremental algorithms which follow paths of regularized solutions, as the regularization varies. These approaches often result in methods which are both efficient and highly flexible. We suggest a general path-following algorithm based on second-order approximations, prove that under mild conditions it remains “very close” to the path of optimal solutions and illustrate it with examples.
6 0.49609941 189 nips-2004-The Power of Selective Memory: Self-Bounded Learning of Prediction Suffix Trees
7 0.49460927 125 nips-2004-Multiple Relational Embedding
8 0.49458474 163 nips-2004-Semi-parametric Exponential Family PCA
9 0.49435043 187 nips-2004-The Entire Regularization Path for the Support Vector Machine
10 0.49395937 16 nips-2004-Adaptive Discriminative Generative Model and Its Applications
11 0.4934535 110 nips-2004-Matrix Exponential Gradient Updates for On-line Learning and Bregman Projection
12 0.49307626 127 nips-2004-Neighbourhood Components Analysis
13 0.49268836 4 nips-2004-A Generalized Bradley-Terry Model: From Group Competition to Individual Skill
14 0.49264497 174 nips-2004-Spike Sorting: Bayesian Clustering of Non-Stationary Data
15 0.49219638 133 nips-2004-Nonparametric Transforms of Graph Kernels for Semi-Supervised Learning
16 0.49167252 31 nips-2004-Blind One-microphone Speech Separation: A Spectral Learning Approach
17 0.49089515 28 nips-2004-Bayesian inference in spiking neurons
18 0.49036062 181 nips-2004-Synergies between Intrinsic and Synaptic Plasticity in Individual Model Neurons
19 0.49020696 151 nips-2004-Rate- and Phase-coded Autoassociative Memory
20 0.48978394 25 nips-2004-Assignment of Multiplicative Mixtures in Natural Images