nips nips2000 nips2000-26 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Anders Jonsson, Andrew G. Barto
Abstract: Learning a complex task can be significantly facilitated by defining a hierarchy of subtasks. An agent can learn to choose between various temporally abstract actions, each solving an assigned subtask, to accomplish the overall task. In this paper, we study hierarchical learning using the framework of options. We argue that to take full advantage of hierarchical structure, one should perform option-specific state abstraction, and that if this is to scale to larger tasks, state abstraction should be automated. We adapt McCallum's U-Tree algorithm to automatically build option-specific representations of the state feature space, and we illustrate the resulting algorithm using a simple hierarchical task. Results suggest that automated option-specific state abstraction is an attractive approach to making hierarchical learning systems more effective.
Reference: text
sentIndex sentText sentNum sentScore
1 An agent can learn to choose between various temporally abstract actions, each solving an assigned subtask, to accomplish the overall task. [sent-5, score-0.29]
2 In this paper, we study hierarchical learning using the framework of options. [sent-6, score-0.196]
3 We argue that to take full advantage of hierarchical structure, one should perform option-specific state abstraction, and that if this is to scale to larger tasks, state abstraction should be automated. [sent-7, score-0.615]
4 We adapt McCallum's U-Tree algorithm to automatically build option-specific representations of the state feature space, and we illustrate the resulting algorithm using a simple hierarchical task. [sent-8, score-0.375]
5 Results suggest that automated option-specific state abstraction is an attractive approach to making hierarchical learning systems more effective. [sent-9, score-0.669]
6 1 Introduction Researchers in the field of reinforcement learning have recently focused considerable attention on temporally abstract actions (e. [sent-10, score-0.385]
7 The term temporally abstract describes actions that can take variable amounts of time. [sent-13, score-0.23]
8 One motivation for using temporally abstract actions is that they can be used to exploit the hierarchical structure of a problem. [sent-14, score-0.406]
9 Among other things, a hierarchical structure is a natural way to incorporate prior knowledge into a learning system by allowing reuse of temporally abstract actions whose policies were learned in other tasks. [sent-15, score-0.467]
10 Learning in a hierarchy can also significantly reduce the number of situations between which a learning agent needs to discriminate. [sent-16, score-0.216]
11 We use the framework of options [6, 9], which extends the theory of reinforcement learning to include temporally abstract actions. [sent-17, score-0.558]
12 In many cases, accurately executing an option's policy does not depend on all state features available to the learning agent. [sent-18, score-0.279]
13 Further, the features that are relevant often differ from option to option. [sent-19, score-0.402]
14 Within a hierarchical learning system, it is possible to perform option-specific state abstraction by which irrelevant features specific to each option are ignored. [sent-20, score-0.964]
15 Using option-specific state abstraction in a hierarchical learning system can save memory through the development of compact state representations, and it can accelerate learning because of the generalization induced by the abstraction. [sent-21, score-0.742]
16 Dietterich [2] introduced action-specific state abstraction in a hierarchy of temporally abstract actions. [sent-22, score-0.543]
17 However, his approach requires the system developer to define a set of relevant state features for each action prior to learning. [sent-23, score-0.203]
18 One way to remedy this problem is to use an automated process for constructing state representations. [sent-25, score-0.182]
19 We apply McCallum's U-Tree algorithm [4] to individual options to achieve automated, option-specific state abstraction. [sent-26, score-0.451]
20 The U-Tree algorithm automatically builds a state-feature representation starting from one that makes no distinctions between different observation vectors. [sent-27, score-0.355]
21 Section 3 introduces modifications necessary to make the U-Tree algorithm suitable for learning in a hierarchical system. [sent-30, score-0.284]
22 A decision tree, the U-Tree, sorts a new instance Tt based on its components and assigns it to a unique leaf of the tree. [sent-34, score-0.268]
23 The distinctions associated with a leaf are determined by the root-to-leaf path. [sent-35, score-0.511]
24 For each leaf-action pair (Lj, a), the algorithm keeps an action value Q(Lj, a) estimating the future discounted reward associated with being in Lj and executing a. [sent-36, score-0.53]
25 The utility of a leaf is denoted $U(L_j) = \max_a Q(L_j, a)$. [sent-37, score-0.209]
26 The algorithm also keeps a model consisting of estimated transition probabilities $\Pr(L_k \mid L_j, a)$ and expected immediate rewards $R(L_j, a)$ computed from the transition instances. [sent-38, score-0.293]
27 One can use other reinforcement learning algorithms to update the action values, such as Q-learning or prioritized sweeping. [sent-40, score-0.26]
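The per-leaf bookkeeping just described can be pictured with a short sketch. The following Python class is purely illustrative (the names and structure are ours, not McCallum's or the authors'): it stores the action values Q(Lj, a), derives the leaf utility U(Lj) = max_a Q(Lj, a), and maintains empirical estimates of Pr(Lk | Lj, a) and R(Lj, a) from the transition instances sorted into the leaf.

```python
from collections import defaultdict

class UTreeLeaf:
    """Illustrative per-leaf bookkeeping: action values, utility, and an
    empirical transition/reward model estimated from stored instances."""

    def __init__(self, actions):
        self.actions = list(actions)
        self.q = {a: 0.0 for a in self.actions}                    # Q(L_j, a)
        self.instances = defaultdict(list)                         # T(L_j, a): instances per action
        self.next_counts = defaultdict(lambda: defaultdict(int))   # counts of (a, L_k) outcomes
        self.reward_sums = defaultdict(float)                      # running reward sums per action

    def utility(self):
        """U(L_j) = max_a Q(L_j, a)."""
        return max(self.q.values())

    def record(self, action, reward, next_leaf, instance):
        """Store a transition instance and update the empirical model."""
        self.instances[action].append(instance)
        self.next_counts[action][next_leaf] += 1
        self.reward_sums[action] += reward

    def model(self, action):
        """Estimated Pr(L_k | L_j, a) and R(L_j, a) for one action."""
        n = sum(self.next_counts[action].values())
        if n == 0:
            return {}, 0.0
        probs = {lk: c / n for lk, c in self.next_counts[action].items()}
        return probs, self.reward_sums[action] / n
```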
28 The U-Tree algorithm periodically adds new distinctions to the tree in the form of temporary nodes, called fringe nodes, and performs statistical tests to see whether the added distinctions increase the predictive power of the U-Tree. [sent-41, score-0.732]
29 Each distinction is based on (1) a perceptual dimension, which is either an observation or a previous action, and (2) a history index, indicating how far back in the current history the dimension will be examined. [sent-42, score-0.323]
30 Each leaf of the tree is extended with a subtree of a fixed depth, z, constructed from permutations of all distinctions not already on the path to the leaf. [sent-43, score-0.621]
31 The instances associated with the leaf are distributed to the leaves of the added subtree (the fringe nodes) according to the corresponding distinctions. [sent-44, score-0.473]
32 A statistical test, the Kolmogorov-Smirnov (KS) test, compares the distribution of future discounted reward of the leaf node's policy action with that of a fringe node's policy action. [sent-45, score-0.847]
33 The distribution of future discounted reward associated with a node $L_j$ and its policy action $a = \arg\max_{a'} Q(L_j, a')$ is composed of the estimated future discounted reward of individual instances $T_t \in T(L_j, a)$, given by $V(T_t) = r_{t+1} + \gamma \sum_{L_k} \Pr(L_k \mid L_j, a)\, U(L_k)$. [sent-46, score-0.915]
34 The U-Tree algorithm retains the subtree of distinctions i at a leaf $L_j$ if the sum of the KS statistical differences over the fringe nodes $F(L_j, i)$ of the subtree is (1) larger than the sum of the KS differences of all other subtrees, and (2) exceeds some threshold $\delta$. [sent-48, score-0.878]
35 That is, the tree is extended from leaf $L_j$ with a subtree i of new distinctions if, for all subtrees $m \neq i$, $\sum_{L_k \in F(L_j, i)} KS(L_j, L_k) \geq \sum_{L_k \in F(L_j, m)} KS(L_j, L_k)$ and $\sum_{L_k \in F(L_j, i)} KS(L_j, L_k) > \delta$, where $KS(L_j, L_k)$ denotes the KS statistic comparing the reward distributions of $L_j$ and fringe node $L_k$. [sent-49, score-0.662]
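As a concrete, hedged illustration of this expansion rule, the sketch below compares the distribution of V(T_t) values at the leaf with the distributions at each candidate subtree's fringe nodes using SciPy's two-sample KS test, sums the resulting statistics, and accepts the best subtree only if the sum exceeds the threshold δ. The function signature and data layout are our own; the paper does not give code.

```python
from scipy.stats import ks_2samp

def choose_fringe_subtree(leaf_values, fringe_values_by_subtree, threshold, min_instances=15):
    """Return the id of the candidate subtree whose summed KS statistics
    (leaf distribution vs. each fringe-node distribution) are largest and
    exceed `threshold`, or None if no candidate qualifies.

    leaf_values: V(T_t) estimates for the leaf's policy action.
    fringe_values_by_subtree: {subtree_id: [list of V(T_t) values per fringe node]}.
    Distributions with too few instances are skipped, as in the experiments
    described later in the text (more than 15 instances each)."""
    best_id, best_score = None, 0.0
    for subtree_id, fringe_nodes in fringe_values_by_subtree.items():
        score = 0.0
        for fringe_values in fringe_nodes:
            if len(leaf_values) > min_instances and len(fringe_values) > min_instances:
                statistic, _p_value = ks_2samp(leaf_values, fringe_values)
                score += statistic
        if score > best_score:
            best_id, best_score = subtree_id, score
    return best_id if best_score > threshold else None
```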
36 Whenever the tree is extended, the action values of the previous leaf node are passed on to the new leaf nodes. [sent-56, score-0.638]
37 One can restrict the number of distinctions an agent can make at any one time by imposing a limit on the depth of the U-Tree. [sent-57, score-0.402]
38 The length of the history the algorithm needs to retain depends only on the tree size and not on the size of the overall state set. [sent-58, score-0.381]
39 In previous experiments, the U-Tree algorithm was able to learn a compact state representation together with a satisfactory policy in a complex driving task [4]. [sent-60, score-0.304]
40 A version of the U-Tree algorithm suitable for continuous state spaces has also been developed and successfully used in robot soccer [10]. [sent-61, score-0.161]
41 3 Adapting the U-Tree algorithm for options We now turn to the issue of adapting the U-Tree algorithm for use with options and hierarchical learning architectures. [sent-62, score-0.732]
42 Primitive actions can be viewed as options that always terminate after one time step. [sent-64, score-0.439]
43 It is easy to define hierarchies of options in which the policy of an option can select other options. [sent-65, score-0.824]
44 A local reward function can be associated with an option to facilitate learning the option's policy. [sent-66, score-0.609]
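To make the framework concrete, here is a minimal sketch of an option and of a primitive action wrapped as an option; the field and function names are our own illustrative choices, not the notation of [6, 9].

```python
from dataclasses import dataclass
from typing import Any, Callable, Optional

@dataclass
class Option:
    """An option: where it may start, what it does, when it stops, and an
    optional local reward function used to learn its policy."""
    name: str
    can_start: Callable[[Any], bool]          # initiation set I
    policy: Callable[[Any], Any]              # pi: state -> primitive action or another option
    termination: Callable[[Any], float]       # beta: probability of terminating in a state
    local_reward: Optional[Callable[[Any, Any, Any], float]] = None  # r(s, a, s')

def primitive(action):
    """A primitive action viewed as an option that always terminates after one step."""
    return Option(name=str(action),
                  can_start=lambda s: True,
                  policy=lambda s: action,
                  termination=lambda s: 1.0)
```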
45 What makes the U-Tree algorithm so suitable for performing option-specific state abstraction is that a U-Tree simultaneously defines a state representation and a policy over this representation. [sent-67, score-0.627]
46 With a separate U-Tree assigned to each option, the algorithm is able to perform state abstraction separately for each option while modifying its policy. [sent-68, score-0.894]
47 Because options at different levels of a hierarchy operate on different time scales, their transition instances must take different forms. [sent-69, score-0.543]
48 Here $R_t = \sum_{i=1}^{k} \gamma^{i-1} r_{t-k+i}$ is the discounted sum of rewards received during the execution of $o_{t-k}$, and $T_{t-k}$ is the previous instance. [sent-71, score-0.268]
49 Since options at one level in a hierarchy are executed one at a time, they will each experience a different sequence of transition instances. [sent-72, score-0.545]
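A hedged sketch of what such an SMDP-level transition instance might look like in code follows; the exact tuple used in the paper is not reproduced here, and the field names are ours.

```python
from dataclasses import dataclass
from typing import Any, Optional

@dataclass
class OptionInstance:
    """One transition instance in an option's history: the option (or primitive
    action) that was executed, the discounted reward accumulated while it ran,
    the observation seen when it terminated, and a link to the previous
    instance in the same execution segment (None at the start of a segment)."""
    executed: Any                              # o_{t-k}
    reward: float                              # R_t = sum_i gamma^(i-1) r_{t-k+i}
    observation: Any                           # s_t
    previous: Optional["OptionInstance"] = None

def discounted_sum(rewards, gamma):
    """Discounted sum of the one-step rewards received while an option executed."""
    return sum(gamma ** i * r for i, r in enumerate(rewards))
```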
50 For the U-Tree algorithm to work under these conditions, the U-Tree of each option has to keep its own history of instances and base distinctions on these instances alone. [sent-73, score-1.022]
51 Because an option terminates and may not be executed again for some time, its associated history will be made up of finite segments corresponding to separate executions of the option. [sent-75, score-0.68]
52 The first transition instance recorded during an execution is independent of the last instance recorded during a previous execution. [Figure 1: The Taxi task.] [sent-76, score-0.426]
53 With these modifications, the U-Tree algorithm can be applied to hierarchical learning with options. [sent-78, score-0.259]
54 1 Intra-option learning When several options operate in the same parts of the state space and choose from among the same actions, it is possible to learn something about one option from the behavior generated by the execution of other options. [sent-80, score-1.006]
55 In a process called intra-option learning [8], the action values of one option are updated based on actions executed in another, associated option. [sent-81, score-0.833]
56 The update only occurs if the action executed in the latter has a non-zero probability of being executed in the former. [sent-82, score-0.343]
57 Similarly, we can base distinctions in the U-Tree associated with one option on transition instances recorded during the execution of another option. [sent-83, score-1.025]
58 We do this by adding instances recorded during the execution of one option to the history of each associated option. [sent-84, score-0.81]
59 For the scheme to work, we introduce a vector of rewards $R = \{R^{o'}\}$ in an instance $T_t^o$, where $R^{o'}$ is the discounted sum of local rewards for each option $o'$ associated with $o_{t-k}$. [sent-86, score-0.711]
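One way to realize this sharing in code is sketched below. The predicate `consistent` stands in for the "non-zero probability of being executed" test mentioned above, and the bounded history mirrors the 6,000-instance limit described later in the text; everything else (names, deque-based storage) is our own illustration, not the authors' implementation.

```python
from collections import deque

def share_instance(instance, executing_option, options, histories,
                   consistent, max_history=6000):
    """Append an instance recorded while `executing_option` ran to the history of
    every option for which the observed behavior was possible, so each option's
    U-Tree can base distinctions on it. `instance` is assumed to carry a vector
    of discounted local rewards, one entry per associated option."""
    for option in options:
        if option is executing_option or consistent(option, instance):
            history = histories.setdefault(option.name, deque(maxlen=max_history))
            history.append(instance)   # oldest instance is discarded automatically
```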
60 The taxi is assigned the task of delivering passengers from their locations to their destinations, both chosen at random from the set of pick-up/drop-off sites P = {1, 2, 3, 4}. [sent-88, score-0.458]
61 The taxi agent's observation vector s = (x, y, i, d) is composed of the (x, y)-position of the taxi, the location i ∈ P ∪ {taxi} of the current passenger, and this passenger's destination d ∈ P. [sent-89, score-0.429]
62 The actions available to the taxi are Pick-up, Drop-off, and Move(m), m ∈ {N, E, S, W}, the four cardinal directions. [sent-90, score-0.447]
63 When a passenger is delivered, a new passenger appears at a random pickup site. [sent-91, score-0.329]
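For reference, a minimal encoding of the observation and action spaces just described might look as follows; the site coordinates and grid dimensions are placeholders (the paper's actual map, with its walls, is not reproduced here).

```python
import random

# Placeholder coordinates for the pick-up/drop-off sites P = {1, 2, 3, 4}.
SITES = {1: (0, 0), 2: (0, 4), 3: (3, 0), 4: (4, 4)}
ACTIONS = ["Pick-up", "Drop-off", "Move(N)", "Move(E)", "Move(S)", "Move(W)"]

def initial_observation(grid_width=5, grid_height=5):
    """Sample s = (x, y, i, d): taxi position, passenger location i in P (or
    'taxi' once picked up), and destination d in P, both chosen at random."""
    x, y = random.randrange(grid_width), random.randrange(grid_height)
    i = random.choice(list(SITES))
    d = random.choice(list(SITES))
    return (x, y, i, d)
```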
64 We further introduced a local reward $R^p$ for Navigate(p), identical to the global reward provided to the agent with the exception that $R^p = 9$ for reaching $G_p$. [sent-93, score-0.341]
65 In our application of the U-Tree algorithm to the taxi problem, the history of each option had a maximum length of 6,000 instances. [sent-94, score-0.909]
66 If this length was exceeded, the oldest instance in the history was discarded. [sent-95, score-0.173]
67 Expanding the tree was only considered if there were more than 3,000 instances in the history. [sent-96, score-0.166]
68 0, except when no distinctions were present in the tree, in which case δ = 0. [sent-98, score-0.257]
69 The algorithm used this lower threshold when the agent was not able to make any distinctions because it is difficult in this case to accumulate enough evidence of statistical difference to accept a distinction. [sent-100, score-0.427]
70 Since the U-Tree algorithm does not go back and reconsider distinctions in the tree, it is important to reduce the number of incorrect distinctions due to sparse statistical evidence. [sent-101, score-0.602]
71 Therefore, our implementation only compared two distributions of future discounted reward between leaves if each contained more than 15 instances. [sent-102, score-0.295]
72 Because the taxi task is fully observable, we set the history index of the U-Tree algorithm to zero. [sent-103, score-0.55]
73 To avoid re-tuning τ, the ε-random part ensured that all actions were executed regularly. [sent-106, score-0.236]
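The action-selection rule alluded to here, a softmax with temperature τ combined with an ε-random component, can be sketched as follows; the parameter values are illustrative, since the paper's settings are not reproduced in this excerpt.

```python
import math
import random

def epsilon_softmax(q_values, tau=1.0, epsilon=0.1):
    """Pick an action: with probability epsilon choose uniformly at random
    (so every action keeps being executed), otherwise sample from a softmax
    over the Q-values with temperature tau."""
    actions = list(q_values)
    if random.random() < epsilon:
        return random.choice(actions)
    scaled = [q_values[a] / tau for a in actions]
    m = max(scaled)                                   # for numerical stability
    weights = [math.exp(v - m) for v in scaled]
    r = random.random() * sum(weights)
    cumulative = 0.0
    for action, weight in zip(actions, weights):
        cumulative += weight
        if r <= cumulative:
            return action
    return actions[-1]
```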
74 We randomly selected one of the options Navigate(p) to execute, and randomly selected a new position for the taxi whenever it reached p, ignoring the issue of delivering a passenger. [sent-108, score-0.819]
75 In one set of runs, the algorithm used intra-option learning, and in another set, it used regular learning in which the U-Trees of different options did not share any instances. [sent-110, score-0.438]
76 In a second set of experiments, the policies of the options and the overall Taxi task were learned in parallel. [sent-111, score-0.407]
77 We allowed the policy of the overall task to choose among the options Navigate(p) and the actions Pick-up and Drop-off. [sent-112, score-0.583]
78 The reward provided for the overall task was the sum of global reward and local reward of the option currently being executed (cf. [sent-113, score-0.948]
79 When a passenger was delivered, a new taxi position was selected randomly and a new passenger appeared at a randomly selected pickup site. [sent-115, score-0.767]
80 At intervals of 500 time steps, the U-Trees of the options were saved and evaluated separately. [sent-119, score-0.29]
81 The evaluation consisted of fixing a target, repeatedly navigating to that target for 25,000 time steps, randomly repositioning the taxi every time the target was reached, repeating for all targets, and adding the rewards. [sent-120, score-0.404]
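A hedged sketch of this evaluation loop is given below; `navigate_policies`, `env_step`, and `reposition` are stand-ins for the saved U-Trees and the Taxi environment, not the authors' code.

```python
def evaluate_navigation(navigate_policies, env_step, reposition, targets,
                        steps_per_target=25_000):
    """For each target p: fix the target, repeatedly run the saved Navigate(p)
    policy for a fixed number of time steps, reposition the taxi at random each
    time the target is reached, and accumulate reward over all targets."""
    total_reward = 0.0
    for p in targets:
        state = reposition()
        for _ in range(steps_per_target):
            action = navigate_policies[p](state)
            state, reward, reached_target = env_step(state, action, p)
            total_reward += reward
            if reached_target:
                state = reposition()
    return total_reward
```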
82 Faster convergence is due to the fact that the histories associated with the options fill up more quickly during intra-option learning. [sent-122, score-0.371]
83 The target of an option is only reached once during each execution of the option, whereas it might be reached several times during the execution of another option. [sent-124, score-0.75]
84 Nodes that represent distinctions are drawn as circles, and leaf nodes are shown as squares or, in most cases, omitted. [sent-127, score-0.539]
85 In the figure, a denotes a distinction over the previously executed option (in the order Navigate(p), Pick-up, and Drop-off), and other letters denote distinctions over the corresponding observations. [sent-128, score-0.641]
86 The U-Trees in the figure contain a total of 188 leaf nodes. [sent-132, score-0.209]
87 Across 10 runs, the number of leaf nodes varied from 154 to 259, with an average of 189. [sent-133, score-0.282]
88 Some leaf nodes were never visited, making the actual number of states even smaller. [sent-134, score-0.282]
89 However, we believe that the memory savings due to option-specific state abstraction in larger tasks will significantly outweigh the memory requirement for U-Trees. [sent-138, score-0.44]
90 6 Conclusion We have shown that the U-Tree algorithm can be used with options in a hierarchical learning system. [sent-139, score-0.549]
91 Our results suggest that automated option-specific state abstraction performed by the algorithm is an attractive approach to making hierarchical learning systems more effective. [sent-140, score-0.732]
92 Although our testbed was small, we believe this is an important first step toward automated state abstraction in hierarchies. [sent-141, score-0.45]
93 We also incorporated intra-option learning into the U-Tree algorithm, a method that allows a learning agent to extract more information from the training data. [sent-142, score-0.197]
94 Results show that intra-option learning can significantly improve the performance of a learning agent performing option-specific state abstraction. [sent-143, score-0.295]
95 Future work will examine the performance of option-specific state abstraction using the U-Tree algorithm in larger, more realistic tasks. [sent-145, score-0.429]
96 We also plan to develop a version of the U-Tree algorithm that goes back in the tree and reconsiders distinctions. [Figure 3: U-Trees for different policies.] [sent-146, score-0.177]
97 This has the potential to improve the performance of the algorithm by correcting nodes for which incorrect distinctions were made. [sent-147, score-0.418]
98 Hierarchical reinforcement learning with the MAXQ value function decomposition. [sent-154, score-0.19]
99 (1996) Emergent hierarchical control structures: Learning reactive/hierarchical relationships in reinforcement environments. [sent-169, score-0.261]
100 (1999) Between MDPs and semi-MDPs: A framework for temporal abstraction in reinforcement learning. [sent-224, score-0.378]
wordName wordTfidf (topN-words)
[('option', 0.402), ('taxi', 0.33), ('options', 0.29), ('abstraction', 0.268), ('distinctions', 0.257), ('leaf', 0.209), ('lj', 0.174), ('hierarchical', 0.151), ('passenger', 0.144), ('execution', 0.119), ('executed', 0.119), ('actions', 0.117), ('reward', 0.117), ('history', 0.114), ('temporally', 0.113), ('reinforcement', 0.11), ('agent', 0.107), ('action', 0.105), ('policy', 0.1), ('state', 0.098), ('instances', 0.093), ('discounted', 0.093), ('lk', 0.088), ('tt', 0.084), ('automated', 0.084), ('fringe', 0.082), ('navigate', 0.082), ('subtree', 0.082), ('leaming', 0.08), ('dietterich', 0.074), ('nodes', 0.073), ('tree', 0.073), ('transition', 0.072), ('ks', 0.071), ('hierarchy', 0.064), ('algorithm', 0.063), ('ilj', 0.062), ('distinction', 0.06), ('instance', 0.059), ('mccallum', 0.056), ('rewards', 0.056), ('precup', 0.053), ('gp', 0.05), ('delivering', 0.048), ('associated', 0.045), ('learning', 0.045), ('leaves', 0.044), ('task', 0.043), ('reached', 0.043), ('node', 0.042), ('singh', 0.042), ('rf', 0.042), ('policies', 0.041), ('digney', 0.041), ('naviga', 0.041), ('pickup', 0.041), ('subtrees', 0.041), ('ylpr', 0.041), ('future', 0.041), ('regular', 0.04), ('runs', 0.039), ('depth', 0.038), ('assigned', 0.037), ('recorded', 0.037), ('memory', 0.037), ('destination', 0.036), ('executing', 0.036), ('histories', 0.036), ('observation', 0.035), ('sutton', 0.033), ('overall', 0.033), ('softmax', 0.032), ('terminate', 0.032), ('dl', 0.032), ('hierarchies', 0.032), ('maxq', 0.032), ('retains', 0.03), ('keeps', 0.03), ('keams', 0.03), ('aaai', 0.03), ('composed', 0.028), ('selected', 0.028), ('something', 0.028), ('randomly', 0.026), ('modifying', 0.026), ('adapting', 0.026), ('delivered', 0.026), ('national', 0.026), ('te', 0.025), ('incorrect', 0.025), ('modifications', 0.025), ('con', 0.025), ('motivation', 0.025), ('operate', 0.024), ('rr', 0.024), ('target', 0.024), ('solla', 0.024), ('experiments', 0.023), ('ma', 0.023), ('attractive', 0.023)]
simIndex simValue paperId paperTitle
same-paper 1 1.000001 26 nips-2000-Automated State Abstraction for Options using the U-Tree Algorithm
Author: Anders Jonsson, Andrew G. Barto
Abstract: Learning a complex task can be significantly facilitated by defining a hierarchy of subtasks. An agent can learn to choose between various temporally abstract actions, each solving an assigned subtask, to accomplish the overall task. In this paper, we study hierarchical learning using the framework of options. We argue that to take full advantage of hierarchical structure, one should perform option-specific state abstraction, and that if this is to scale to larger tasks, state abstraction should be automated. We adapt McCallum's U-Tree algorithm to automatically build option-specific representations of the state feature space, and we illustrate the resulting algorithm using a simple hierarchical task. Results suggest that automated option-specific state abstraction is an attractive approach to making hierarchical learning systems more effective.
2 0.24258 63 nips-2000-Hierarchical Memory-Based Reinforcement Learning
Author: Natalia Hernandez-Gardiol, Sridhar Mahadevan
Abstract: A key challenge for reinforcement learning is scaling up to large partially observable domains. In this paper, we show how a hierarchy of behaviors can be used to create and select among variable length short-term memories appropriate for a task. At higher levels in the hierarchy, the agent abstracts over lower-level details and looks back over a variable number of high-level decisions in time. We formalize this idea in a framework called Hierarchical Suffix Memory (HSM). HSM uses a memory-based SMDP learning method to rapidly propagate delayed reward across long decision sequences. We describe a detailed experimental study comparing memory vs. hierarchy using the HSM framework on a realistic corridor navigation task. 1
3 0.19797207 69 nips-2000-Incorporating Second-Order Functional Knowledge for Better Option Pricing
Author: Charles Dugas, Yoshua Bengio, François Bélisle, Claude Nadeau, René Garcia
Abstract: Incorporating prior knowledge of a particular task into the architecture of a learning algorithm can greatly improve generalization performance. We study here a case where we know that the function to be learned is non-decreasing in two of its arguments and convex in one of them. For this purpose we propose a class of functions similar to multi-layer neural networks but (1) that has those properties, (2) is a universal approximator of continuous functions with these and other properties. We apply this new class of functions to the task of modeling the price of call options. Experiments show improvements on regressing the price of call options using the new types of function classes that incorporate the a priori constraints.
4 0.18502127 142 nips-2000-Using Free Energies to Represent Q-values in a Multiagent Reinforcement Learning Task
Author: Brian Sallans, Geoffrey E. Hinton
Abstract: The problem of reinforcement learning in large factored Markov decision processes is explored. The Q-value of a state-action pair is approximated by the free energy of a product of experts network. Network parameters are learned on-line using a modified SARSA algorithm which minimizes the inconsistency of the Q-values of consecutive state-action pairs. Actions are chosen based on the current value estimates by fixing the current state and sampling actions from the network using Gibbs sampling. The algorithm is tested on a co-operative multi-agent task. The product of experts model is found to perform comparably to table-based Q-learning for small instances of the task, and continues to perform well when the problem becomes too large for a table-based representation.
5 0.17152911 28 nips-2000-Balancing Multiple Sources of Reward in Reinforcement Learning
Author: Christian R. Shelton
Abstract: For many problems which would be natural for reinforcement learning, the reward signal is not a single scalar value but has multiple scalar components. Examples of such problems include agents with multiple goals and agents with multiple users. Creating a single reward value by combining the multiple components can throwaway vital information and can lead to incorrect solutions. We describe the multiple reward source problem and discuss the problems with applying traditional reinforcement learning. We then present an new algorithm for finding a solution and results on simulated environments.
6 0.14968166 105 nips-2000-Programmable Reinforcement Learning Agents
7 0.114091 112 nips-2000-Reinforcement Learning with Function Approximation Converges to a Region
8 0.10494055 73 nips-2000-Kernel-Based Reinforcement Learning in Average-Cost Problems: An Application to Optimal Portfolio Choice
9 0.10238975 1 nips-2000-APRICODD: Approximate Policy Construction Using Decision Diagrams
10 0.097281851 48 nips-2000-Exact Solutions to Time-Dependent MDPs
11 0.08054357 43 nips-2000-Dopamine Bonuses
12 0.070265494 39 nips-2000-Decomposition of Reinforcement Learning for Admission Control of Self-Similar Call Arrival Processes
13 0.069927938 17 nips-2000-Active Learning for Parameter Estimation in Bayesian Networks
14 0.067963071 140 nips-2000-Tree-Based Modeling and Estimation of Gaussian Processes on Graphs with Cycles
15 0.061035071 113 nips-2000-Robust Reinforcement Learning
16 0.051484562 4 nips-2000-A Linear Programming Approach to Novelty Detection
17 0.051106848 101 nips-2000-Place Cells and Spatial Navigation Based on 2D Visual Feature Extraction, Path Integration, and Reinforcement Learning
18 0.049223065 148 nips-2000-`N-Body' Problems in Statistical Learning
19 0.042228844 41 nips-2000-Discovering Hidden Variables: A Structure-Based Approach
20 0.041257787 19 nips-2000-Adaptive Object Representation with Hierarchically-Distributed Memory Sites
topicId topicWeight
[(0, 0.177), (1, -0.069), (2, 0.131), (3, -0.368), (4, -0.241), (5, 0.04), (6, -0.018), (7, -0.003), (8, -0.032), (9, -0.001), (10, -0.006), (11, 0.016), (12, -0.073), (13, 0.006), (14, 0.012), (15, 0.047), (16, -0.004), (17, 0.005), (18, -0.021), (19, 0.054), (20, -0.095), (21, -0.153), (22, 0.021), (23, -0.036), (24, -0.014), (25, 0.057), (26, 0.08), (27, 0.076), (28, -0.045), (29, -0.266), (30, -0.255), (31, -0.018), (32, 0.117), (33, -0.042), (34, -0.065), (35, 0.042), (36, -0.026), (37, -0.099), (38, -0.176), (39, -0.054), (40, 0.064), (41, 0.04), (42, 0.117), (43, -0.091), (44, -0.056), (45, 0.028), (46, 0.068), (47, -0.074), (48, -0.049), (49, 0.012)]
simIndex simValue paperId paperTitle
same-paper 1 0.95652395 26 nips-2000-Automated State Abstraction for Options using the U-Tree Algorithm
Author: Anders Jonsson, Andrew G. Barto
Abstract: Learning a complex task can be significantly facilitated by defining a hierarchy of subtasks. An agent can learn to choose between various temporally abstract actions, each solving an assigned subtask, to accomplish the overall task. In this paper, we study hierarchical learning using the framework of options. We argue that to take full advantage of hierarchical structure, one should perform option-specific state abstraction, and that if this is to scale to larger tasks, state abstraction should be automated. We adapt McCallum's U-Tree algorithm to automatically build option-specific representations of the state feature space, and we illustrate the resulting algorithm using a simple hierarchical task. Results suggest that automated option-specific state abstraction is an attractive approach to making hierarchical learning systems more effective.
2 0.79616493 63 nips-2000-Hierarchical Memory-Based Reinforcement Learning
Author: Natalia Hernandez-Gardiol, Sridhar Mahadevan
Abstract: A key challenge for reinforcement learning is scaling up to large partially observable domains. In this paper, we show how a hierarchy of behaviors can be used to create and select among variable length short-term memories appropriate for a task. At higher levels in the hierarchy, the agent abstracts over lower-level details and looks back over a variable number of high-level decisions in time. We formalize this idea in a framework called Hierarchical Suffix Memory (HSM). HSM uses a memory-based SMDP learning method to rapidly propagate delayed reward across long decision sequences. We describe a detailed experimental study comparing memory vs. hierarchy using the HSM framework on a realistic corridor navigation task. 1
3 0.69011319 105 nips-2000-Programmable Reinforcement Learning Agents
Author: David Andre, Stuart J. Russell
Abstract: We present an expressive agent design language for reinforcement learning that allows the user to constrain the policies considered by the learning process. The language includes standard features such as parameterized subroutines, temporary interrupts, aborts, and memory variables, but also allows for unspecified choices in the agent program. For learning that which isn't specified, we present provably convergent learning algorithms. We demonstrate by example that agent programs written in the language are concise as well as modular. This facilitates state abstraction and the transferability of learned skills.
4 0.57915395 69 nips-2000-Incorporating Second-Order Functional Knowledge for Better Option Pricing
Author: Charles Dugas, Yoshua Bengio, François Bélisle, Claude Nadeau, René Garcia
Abstract: Incorporating prior knowledge of a particular task into the architecture of a learning algorithm can greatly improve generalization performance. We study here a case where we know that the function to be learned is non-decreasing in two of its arguments and convex in one of them. For this purpose we propose a class of functions similar to multi-layer neural networks but (1) that has those properties, (2) is a universal approximator of continuous functions with these and other properties. We apply this new class of functions to the task of modeling the price of call options. Experiments show improvements on regressing the price of call options using the new types of function classes that incorporate the a priori constraints.
5 0.45149928 142 nips-2000-Using Free Energies to Represent Q-values in a Multiagent Reinforcement Learning Task
Author: Brian Sallans, Geoffrey E. Hinton
Abstract: The problem of reinforcement learning in large factored Markov decision processes is explored. The Q-value of a state-action pair is approximated by the free energy of a product of experts network. Network parameters are learned on-line using a modified SARSA algorithm which minimizes the inconsistency of the Q-values of consecutive state-action pairs. Actions are chosen based on the current value estimates by fixing the current state and sampling actions from the network using Gibbs sampling. The algorithm is tested on a co-operative multi-agent task. The product of experts model is found to perform comparably to table-based Q-learning for small instances of the task, and continues to perform well when the problem becomes too large for a table-based representation.
6 0.42521426 28 nips-2000-Balancing Multiple Sources of Reward in Reinforcement Learning
7 0.36950502 73 nips-2000-Kernel-Based Reinforcement Learning in Average-Cost Problems: An Application to Optimal Portfolio Choice
8 0.2918933 48 nips-2000-Exact Solutions to Time-Dependent MDPs
9 0.2757915 112 nips-2000-Reinforcement Learning with Function Approximation Converges to a Region
10 0.27009225 43 nips-2000-Dopamine Bonuses
11 0.26103908 1 nips-2000-APRICODD: Approximate Policy Construction Using Decision Diagrams
12 0.24126935 39 nips-2000-Decomposition of Reinforcement Learning for Admission Control of Self-Similar Call Arrival Processes
13 0.23800837 17 nips-2000-Active Learning for Parameter Estimation in Bayesian Networks
14 0.21884927 147 nips-2000-Who Does What? A Novel Algorithm to Determine Function Localization
15 0.18099456 148 nips-2000-`N-Body' Problems in Statistical Learning
16 0.17927264 93 nips-2000-On Iterative Krylov-Dogleg Trust-Region Steps for Solving Neural Networks Nonlinear Least Squares Problems
17 0.17866361 19 nips-2000-Adaptive Object Representation with Hierarchically-Distributed Memory Sites
18 0.17846015 41 nips-2000-Discovering Hidden Variables: A Structure-Based Approach
19 0.17821097 66 nips-2000-Hippocampally-Dependent Consolidation in a Hierarchical Model of Neocortex
20 0.17794034 113 nips-2000-Robust Reinforcement Learning
topicId topicWeight
[(10, 0.038), (17, 0.074), (32, 0.017), (33, 0.03), (36, 0.012), (54, 0.02), (55, 0.027), (62, 0.198), (64, 0.26), (65, 0.035), (67, 0.067), (75, 0.014), (76, 0.025), (79, 0.018), (81, 0.015), (90, 0.031), (91, 0.021), (97, 0.018)]
simIndex simValue paperId paperTitle
same-paper 1 0.87249786 26 nips-2000-Automated State Abstraction for Options using the U-Tree Algorithm
Author: Anders Jonsson, Andrew G. Barto
Abstract: Learning a complex task can be significantly facilitated by defining a hierarchy of subtasks. An agent can learn to choose between various temporally abstract actions, each solving an assigned subtask, to accomplish the overall task. In this paper, we study hierarchical learning using the framework of options. We argue that to take full advantage of hierarchical structure, one should perform option-specific state abstraction, and that if this is to scale to larger tasks, state abstraction should be automated. We adapt McCallum's U-Tree algorithm to automatically build option-specific representations of the state feature space, and we illustrate the resulting algorithm using a simple hierarchical task. Results suggest that automated option-specific state abstraction is an attractive approach to making hierarchical learning systems more effective.
2 0.65554523 112 nips-2000-Reinforcement Learning with Function Approximation Converges to a Region
Author: Geoffrey J. Gordon
Abstract: Many algorithms for approximate reinforcement learning are not known to converge. In fact, there are counterexamples showing that the adjustable weights in some algorithms may oscillate within a region rather than converging to a point. This paper shows that, for two popular algorithms, such oscillation is the worst that can happen: the weights cannot diverge, but instead must converge to a bounded region. The algorithms are SARSA(O) and V(O); the latter algorithm was used in the well-known TD-Gammon program. 1
3 0.63831317 86 nips-2000-Model Complexity, Goodness of Fit and Diminishing Returns
Author: Igor V. Cadez, Padhraic Smyth
Abstract: We investigate a general characteristic of the trade-off in learning problems between goodness-of-fit and model complexity. Specifically we characterize a general class of learning problems where the goodness-of-fit function can be shown to be convex within firstorder as a function of model complexity. This general property of
4 0.63316083 63 nips-2000-Hierarchical Memory-Based Reinforcement Learning
Author: Natalia Hernandez-Gardiol, Sridhar Mahadevan
Abstract: A key challenge for reinforcement learning is scaling up to large partially observable domains. In this paper, we show how a hierarchy of behaviors can be used to create and select among variable length short-term memories appropriate for a task. At higher levels in the hierarchy, the agent abstracts over lower-level details and looks back over a variable number of high-level decisions in time. We formalize this idea in a framework called Hierarchical Suffix Memory (HSM). HSM uses a memory-based SMDP learning method to rapidly propagate delayed reward across long decision sequences. We describe a detailed experimental study comparing memory vs. hierarchy using the HSM framework on a realistic corridor navigation task. 1
5 0.59096897 142 nips-2000-Using Free Energies to Represent Q-values in a Multiagent Reinforcement Learning Task
Author: Brian Sallans, Geoffrey E. Hinton
Abstract: The problem of reinforcement learning in large factored Markov decision processes is explored. The Q-value of a state-action pair is approximated by the free energy of a product of experts network. Network parameters are learned on-line using a modified SARSA algorithm which minimizes the inconsistency of the Q-values of consecutive state-action pairs. Actions are chosen based on the current value estimates by fixing the current state and sampling actions from the network using Gibbs sampling. The algorithm is tested on a co-operative multi-agent task. The product of experts model is found to perform comparably to table-based Q-learning for small instances of the task, and continues to perform well when the problem becomes too large for a table-based representation.
6 0.58042896 28 nips-2000-Balancing Multiple Sources of Reward in Reinforcement Learning
7 0.55258358 1 nips-2000-APRICODD: Approximate Policy Construction Using Decision Diagrams
8 0.54380828 105 nips-2000-Programmable Reinforcement Learning Agents
9 0.52358127 80 nips-2000-Learning Switching Linear Models of Human Motion
10 0.5001151 106 nips-2000-Propagation Algorithms for Variational Bayesian Learning
11 0.50000793 92 nips-2000-Occam's Razor
12 0.49934891 98 nips-2000-Partially Observable SDE Models for Image Sequence Recognition Tasks
13 0.49840888 69 nips-2000-Incorporating Second-Order Functional Knowledge for Better Option Pricing
14 0.49804485 48 nips-2000-Exact Solutions to Time-Dependent MDPs
15 0.49694717 108 nips-2000-Recognizing Hand-written Digits Using Hierarchical Products of Experts
16 0.48170549 139 nips-2000-The Use of MDL to Select among Computational Models of Cognition
17 0.4797073 138 nips-2000-The Use of Classifiers in Sequential Inference
18 0.47820532 22 nips-2000-Algorithms for Non-negative Matrix Factorization
19 0.4766002 113 nips-2000-Robust Reinforcement Learning
20 0.4755671 43 nips-2000-Dopamine Bonuses