nips nips2000 nips2000-63 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Natalia Hernandez-Gardiol, Sridhar Mahadevan
Abstract: A key challenge for reinforcement learning is scaling up to large partially observable domains. In this paper, we show how a hierarchy of behaviors can be used to create and select among variable length short-term memories appropriate for a task. At higher levels in the hierarchy, the agent abstracts over lower-level details and looks back over a variable number of high-level decisions in time. We formalize this idea in a framework called Hierarchical Suffix Memory (HSM). HSM uses a memory-based SMDP learning method to rapidly propagate delayed reward across long decision sequences. We describe a detailed experimental study comparing memory vs. hierarchy using the HSM framework on a realistic corridor navigation task. 1
Reference: text
sentIndex sentText sentNum sentScore
1 edu Abstract A key challenge for reinforcement learning is scaling up to large partially observable domains. [sent-5, score-0.31]
2 In this paper, we show how a hierarchy of behaviors can be used to create and select among variable length short-term memories appropriate for a task. [sent-6, score-0.31]
3 At higher levels in the hierarchy, the agent abstracts over lower-level details and looks back over a variable number of high-level decisions in time. [sent-7, score-0.375]
4 HSM uses a memory-based SMDP learning method to rapidly propagate delayed reward across long decision sequences. [sent-9, score-0.359]
5 We describe a detailed experimental study comparing memory vs. [sent-10, score-0.179]
6 hierarchy using the HSM framework on a realistic corridor navigation task. [sent-11, score-0.256]
7 1 Introduction Reinforcement learning encompasses a class of machine learning problems in which an agent learns from experience as it interacts with its environment. [sent-12, score-0.396]
8 One fundamental challenge faced by reinforcement learning agents in real-world problems is that the state space can be very large, and consequently there may be a long delay before reward is received. [sent-13, score-0.385]
9 Previous work has addressed this issue by breaking down a large task into a hierarchy of subtasks or abstract behaviors [1, 3, 5]. [sent-14, score-0.343]
10 Another difficult issue is the problem of perceptual aliasing: different real-world states can often generate the same observations. [sent-15, score-0.103]
11 One strategy to deal with perceptual aliasing is to add memory about past percepts. [sent-16, score-0.574]
12 Short-term memory consisting of a linear (or tree-based) sequence of primitive actions and observations has been shown to be a useful strategy [2]. [sent-17, score-0.545]
13 However, considering short-term memory at a flat, uniform resolution of primitive actions would likely scale poorly to tasks with long decision sequences. [sent-18, score-0.572]
14 Thus, just as spatio-temporal abstraction of the state space improves scaling in completely observable environments, for large partially observable environments a similar benefit may result if we consider the space of past experience at variable resolution. [sent-19, score-0.913]
15 Given a task, we want a hierarchical strategy for rapidly bringing to bear past experience that is appropriate to the grain-size of the decisions being considered. [sent-20, score-0.724]
16-18 [Figure 1 artwork: recoverable panel labels are "abstraction level: navigation" (corner, T-junction, dead end), "abstraction level: traversal", and "abstraction level: primitive"; the remaining figure content is not recoverable as text.] [sent-21, sent-23, sent-32]
19 ~ Figure 1: This figure illustrates memory-based decision making at two levels in the hierarchy of a navigation task. [sent-59, score-0.473]
20 At each level, each decision point (shown with a star) examines its past experience to find states with similar history (shown with shadows). [sent-60, score-0.761]
21 At the abstract (navigation) level, observations and decisions occur at intersections. [sent-61, score-0.231]
22 At the lower (corridor-traversal) level, observations and decisions occur within the corridor. [sent-62, score-0.231]
23 In this paper, we show that considering past experience at a variable, task-appropriate resolution can speed up learning and greatly improve performance under perceptual aliasing. [sent-63, score-0.465]
24 The resulting approach, which we call Hierarchical Suffix Memory (HSM), is a general technique for solving large, perceptually aliased tasks. [sent-64, score-0.033]
25 2 Hierarchical Suffix Memory By employing short-term memory over abstract decisions, each of which involves a hierarchy of behaviors, we can apply memory at a more informative level of abstraction. [sent-65, score-0.658]
26 An important side-effect is that the agent can look at a decision point many steps back in time while ignoring the exact sequence of low-level observations and actions that transpired. [sent-66, score-0.569]
27 The problem of learning under perceptual aliasing can be viewed as discovering an informative sequence of past actions and observations (that is, a history suffix) for a given world state that enables an agent to act optimally in the world. [sent-68, score-1.051]
28 We can think of each situation in which an agent must choose an action (a choice point) as being labeled with a pair $[\sigma, l]$: $l$ refers to the abstraction level and $\sigma$ refers to the history suffix. [sent-69, score-0.894]
29 In the completely observable case, $\sigma$ has a length of one, and decisions are made based on the current observation. [sent-70, score-0.277]
30 In the partially observable case, we must additionally consider past history when making decisions. [sent-71, score-0.519]
31 In this case, the suffix $\sigma$ is some sequence of past observations and actions that must be learned. [sent-72, score-0.879]
32 This idea of representing memory as a variable-length suffix derives from work on learning approximations of probabilistic suffix automata [2, 4]. [sent-73, score-1.139]
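To make the $[\sigma, l]$ labeling concrete, here is a minimal sketch of a suffix-labeled choice point and the suffix-matching idea used throughout; the (observation, action) event layout and all names below are assumptions for illustration, not structures specified in the paper.

```python
# Hypothetical sketch of a choice point labeled with an abstraction level l and a
# variable-length history suffix sigma; the data layout is an assumption.
from dataclasses import dataclass, field
from typing import List, Tuple

Event = Tuple[str, str]  # one (observation, action) pair at a given abstraction level


@dataclass
class ChoicePoint:
    level: int                                          # abstraction level l
    suffix: List[Event] = field(default_factory=list)   # history sigma, most recent event last


def suffix_match_length(a: ChoicePoint, b: ChoicePoint) -> int:
    """How far back the two incoming histories agree, counted from the most recent event."""
    n = 0
    for ea, eb in zip(reversed(a.suffix), reversed(b.suffix)):
        if ea != eb:
            break
        n += 1
    return n
```

In the completely observable case the suffix would hold only the current observation; under partial observability, a match length of this kind is what ranks past choice points in the procedure that follows.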
33 Given an abstraction level $l$ and choice point $s$ within $l$: for each potential future decision, $d$, examine the history at level $l$ to find a set of past choice points that have executed $d$ and whose incoming (suffix) history most closely matches that of the current point. [sent-75, score-1.237]
34 Call this set of instances the "voting set" for decision d. [sent-76, score-0.187]
35 Choose $d_t$ as the decision with the highest average discounted sum of reward over the voting set. [sent-78, score-0.568]
36 Here, t is the event counter of the current choice point at level l. [sent-80, score-0.203]
37 Execute the decision $d_t$ and record: $o_t$, the resulting observation; $r_t$, the reward received; and $n_t$, the duration of abstract action $d_t$ (measured by the number of primitive environment transitions executed by the abstract action). [sent-82, score-1.052]
38 Note that for every environment transition from state $s_{i-1}$ to state $s_i$ with reward $r_i$ and discount $\gamma$, we accumulate the reward and update the discount factor: $r_t \leftarrow r_t + \gamma_t r_i$ and $\gamma_t \leftarrow \gamma\,\gamma_t$. [sent-83, score-0.595]
39 Update the Q-value for the current decision point and for each instance in the voting set using the decision, reward, and duration values recorded along with the instance. [sent-84, score-0.399]
40 Model-free: use an SMDP Q-learning update rule ($\beta$ is the learning rate): $Q_l(s_t, d_t) \leftarrow (1-\beta)\,Q_l(s_t, d_t) + \beta\bigl(r_t + \gamma_t \max_d Q_l(s_{t+n_t}, d)\bigr)$. Model-based: if a state-transition model is being used, a sweep of value iteration can be executed (see footnote 1). [sent-85, score-0.572]
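The following is a minimal sketch (not the authors' implementation) of the execute-accumulate-update procedure just described. It assumes the abstract decision reports the list of primitive rewards it incurred, that Q is a nested dict keyed by history-suffix label and decision, and that the values of beta and gamma are placeholders.

```python
# Sketch of the SMDP-style reward accumulation and Q-learning update described above.
# Assumed interfaces: Q[s][d] maps a suffix label s and decision d to a value;
# primitive_rewards lists the rewards received during the abstract action;
# s_next is the suffix label of the resulting choice point.

def smdp_q_update(Q, s_t, d_t, primitive_rewards, s_next, voting_set,
                  gamma=0.95, beta=0.1):
    r_t, g_t = 0.0, 1.0
    for r_i in primitive_rewards:      # one term per primitive environment transition
        r_t += g_t * r_i               # r_t     <- r_t + gamma_t * r_i
        g_t *= gamma                   # gamma_t <- gamma * gamma_t
    # r_t + gamma_t * max_d Q(s_{t+n_t}, d)
    target = r_t + g_t * max(Q[s_next].values())
    for s in [s_t] + list(voting_set):
        # for brevity the same backup target is applied to every instance; the paper
        # records per-instance decision, reward, and duration values for this update
        Q[s][d_t] = (1.0 - beta) * Q[s][d_t] + beta * target
    return r_t, g_t
```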
41 NSM (Nearest Sequence Memory) records each of its raw experiences as a linear chain. [sent-89, score-0.123]
42 To choose the next action, the agent evaluates the outcomes of the k "nearest" neighbors in the experience chain. [sent-90, score-0.44]
43 NSM evaluates the closeness between two states according to the match length of the suffix chain preceding the states. [sent-91, score-0.664]
44 The chain can either be grown indefinitely, or old experiences can be replaced after the chain reaches a maximum length. [sent-92, score-0.155]
45 With NSM, a model-free learning method, HSM uses an SMDP Q-learning rule as described above. [sent-93, score-0.028]
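A minimal sketch of the NSM-style voting just described, under assumptions not fixed by the paper: experience is a list of (observation, action, value) records, closeness is the length of the matching preceding suffix, and each candidate action is scored by the mean value of its k nearest instances.

```python
# Hypothetical sketch of nearest-sequence voting over a linear experience chain.
# chain: list of (obs, action, value) records in time order;
# history: the current incoming suffix as a list of (obs, action) pairs.

def nsm_vote(chain, history, actions, k=5):
    def match_len(i):
        # length of the common suffix between the chain up to position i
        # and the current history
        n = 0
        while (n < i and n < len(history)
               and (chain[i - 1 - n][0], chain[i - 1 - n][1]) == history[-1 - n]):
            n += 1
        return n

    scores = {}
    for a in actions:
        # the k past points that executed action a, ranked by suffix match length
        voters = sorted((i for i in range(len(chain)) if chain[i][1] == a),
                        key=match_len, reverse=True)[:k]
        if voters:
            scores[a] = sum(chain[i][2] for i in voters) / len(voters)
    return max(scores, key=scores.get) if scores else None
```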
46 USM (Utile Suffix Memory) also records experience in a linear time chain. [sent-94, score-0.222]
47 However, instead of attempting to choose actions based on a greedy history match, USM tries to explicitly determine how much memory is useful for predicting reward. [sent-95, score-0.537]
48 To do this, the agent builds a tree-like structure for state representation online, selectively adding depth to the tree if the additional history distinction helps to predict reward. [sent-96, score-0.451]
49 With USM, which learns a model, HSM updates the Q-values by doing one sweep of value iteration with the leaves of the tree as states. [sent-97, score-0.153]
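A minimal sketch of the model-based update mentioned above: one sweep of value iteration with the tree leaves as states. R and P are assumed to be reward and transition estimates built from the recorded instances; the names are illustrative, not USM's actual structures, and abstract-action durations are ignored for brevity.

```python
# One sweep of value iteration over USM-style leaf states (illustrative assumptions):
# R[s][a] is the estimated expected reward and P[s][a][s2] the estimated probability
# of reaching leaf s2 after taking abstract action a in leaf s.

def value_iteration_sweep(leaves, actions, R, P, Q, gamma=0.95):
    V = {s: max(Q[s][a] for a in actions) for s in leaves}   # current value of each leaf
    for s in leaves:
        for a in actions:
            expected = sum(P[s][a].get(s2, 0.0) * V[s2] for s2 in leaves)
            Q[s][a] = R[s][a] + gamma * expected             # one Bellman backup
    return Q
```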
50 Finally, to implement the hierarchy of behaviors, in principle any hierarchical reinforcement learning method may be used. [sent-98, score-0.362]
51 When executed, an abstract machine (as in the HAM, or Hierarchy of Abstract Machines, framework) executes a partial policy and returns control to the caller upon termination. [sent-100, score-0.033]
52 The HAM architecture uses a Q-learning rule modified for SMDPs. [sent-101, score-0.028]
53 (Footnote 1) In this context, "state" is represented by the history suffix. [sent-102, score-0.195]
54 That is, an instance is in a "state" if the instance's incoming history matches the suffix representing the state. [sent-103, score-0.779]
55 In this case, the voting set is exactly the set of instances in the same state as the current choice point $s_t$. [sent-104, score-0.309]
wordName wordTfidf (topN-words)
[('suffix', 0.465), ('hsm', 0.349), ('history', 0.195), ('past', 0.188), ('dt', 0.183), ('memory', 0.179), ('abstraction', 0.168), ('experience', 0.166), ('hierarchy', 0.161), ('agent', 0.161), ('nsm', 0.155), ('usm', 0.155), ('decision', 0.148), ('reward', 0.146), ('decisions', 0.125), ('executed', 0.124), ('actions', 0.11), ('hierarchical', 0.107), ('smdp', 0.1), ('primitive', 0.099), ('navigation', 0.095), ('reinforcement', 0.094), ('aliasing', 0.091), ('voting', 0.091), ('behaviors', 0.091), ('level', 0.089), ('observable', 0.089), ('action', 0.076), ('perceptual', 0.075), ('observations', 0.073), ('qi', 0.071), ('tt', 0.068), ('experiences', 0.067), ('ham', 0.067), ('state', 0.065), ('evaluates', 0.06), ('duration', 0.058), ('records', 0.056), ('choose', 0.053), ('discount', 0.053), ('informative', 0.05), ('sweep', 0.05), ('challenge', 0.05), ('partially', 0.047), ('choice', 0.046), ('chain', 0.044), ('matches', 0.043), ('st', 0.043), ('sequence', 0.043), ('incoming', 0.042), ('environments', 0.042), ('strategy', 0.041), ('instances', 0.039), ('learns', 0.039), ('refers', 0.038), ('nearest', 0.037), ('illustrates', 0.037), ('rapidly', 0.037), ('resolution', 0.036), ('ri', 0.035), ('environment', 0.035), ('current', 0.034), ('point', 0.034), ('updates', 0.034), ('instance', 0.034), ('executing', 0.033), ('dead', 0.033), ('star', 0.033), ('levell', 0.033), ('closeness', 0.033), ('comer', 0.033), ('executes', 0.033), ('occasionally', 0.033), ('od', 0.033), ('perceptually', 0.033), ('subtasks', 0.033), ('traversal', 0.033), ('occur', 0.033), ('match', 0.033), ('levels', 0.032), ('update', 0.032), ('tree', 0.03), ('encompasses', 0.03), ('breaking', 0.03), ('bringing', 0.03), ('itti', 0.03), ('examines', 0.03), ('lit', 0.03), ('automata', 0.03), ('bear', 0.03), ('faced', 0.03), ('parr', 0.03), ('scaling', 0.03), ('situation', 0.03), ('si', 0.03), ('length', 0.029), ('variable', 0.029), ('uses', 0.028), ('issue', 0.028), ('abstracts', 0.028)]
simIndex simValue paperId paperTitle
same-paper 1 1.0000004 63 nips-2000-Hierarchical Memory-Based Reinforcement Learning
Author: Natalia Hernandez-Gardiol, Sridhar Mahadevan
Abstract: A key challenge for reinforcement learning is scaling up to large partially observable domains. In this paper, we show how a hierarchy of behaviors can be used to create and select among variable length short-term memories appropriate for a task. At higher levels in the hierarchy, the agent abstracts over lower-level details and looks back over a variable number of high-level decisions in time. We formalize this idea in a framework called Hierarchical Suffix Memory (HSM). HSM uses a memory-based SMDP learning method to rapidly propagate delayed reward across long decision sequences. We describe a detailed experimental study comparing memory vs. hierarchy using the HSM framework on a realistic corridor navigation task. 1
2 0.24258 26 nips-2000-Automated State Abstraction for Options using the U-Tree Algorithm
Author: Anders Jonsson, Andrew G. Barto
Abstract: Learning a complex task can be significantly facilitated by defining a hierarchy of subtasks. An agent can learn to choose between various temporally abstract actions, each solving an assigned subtask, to accomplish the overall task. In this paper, we study hierarchical learning using the framework of options. We argue that to take full advantage of hierarchical structure, one should perform option-specific state abstraction, and that if this is to scale to larger tasks, state abstraction should be automated. We adapt McCallum's U-Tree algorithm to automatically build option-specific representations of the state feature space, and we illustrate the resulting algorithm using a simple hierarchical task. Results suggest that automated option-specific state abstraction is an attractive approach to making hierarchical learning systems more effective.
3 0.19752252 105 nips-2000-Programmable Reinforcement Learning Agents
Author: David Andre, Stuart J. Russell
Abstract: We present an expressive agent design language for reinforcement learning that allows the user to constrain the policies considered by the learning process.The language includes standard features such as parameterized subroutines, temporary interrupts, aborts, and memory variables, but also allows for unspecified choices in the agent program. For learning that which isn't specified, we present provably convergent learning algorithms. We demonstrate by example that agent programs written in the language are concise as well as modular. This facilitates state abstraction and the transferability of learned skills.
4 0.16917913 142 nips-2000-Using Free Energies to Represent Q-values in a Multiagent Reinforcement Learning Task
Author: Brian Sallans, Geoffrey E. Hinton
Abstract: The problem of reinforcement learning in large factored Markov decision processes is explored. The Q-value of a state-action pair is approximated by the free energy of a product of experts network. Network parameters are learned on-line using a modified SARSA algorithm which minimizes the inconsistency of the Q-values of consecutive state-action pairs. Actions are chosen based on the current value estimates by fixing the current state and sampling actions from the network using Gibbs sampling. The algorithm is tested on a co-operative multi-agent task. The product of experts model is found to perform comparably to table-based Q-Iearning for small instances of the task, and continues to perform well when the problem becomes too large for a table-based representation.
5 0.16464728 28 nips-2000-Balancing Multiple Sources of Reward in Reinforcement Learning
Author: Christian R. Shelton
Abstract: For many problems which would be natural for reinforcement learning, the reward signal is not a single scalar value but has multiple scalar components. Examples of such problems include agents with multiple goals and agents with multiple users. Creating a single reward value by combining the multiple components can throwaway vital information and can lead to incorrect solutions. We describe the multiple reward source problem and discuss the problems with applying traditional reinforcement learning. We then present an new algorithm for finding a solution and results on simulated environments.
6 0.095644504 73 nips-2000-Kernel-Based Reinforcement Learning in Average-Cost Problems: An Application to Optimal Portfolio Choice
7 0.094189316 48 nips-2000-Exact Solutions to Time-Dependent MDPs
8 0.083449423 19 nips-2000-Adaptive Object Representation with Hierarchically-Distributed Memory Sites
9 0.082694784 112 nips-2000-Reinforcement Learning with Function Approximation Converges to a Region
10 0.080540158 1 nips-2000-APRICODD: Approximate Policy Construction Using Decision Diagrams
11 0.076873265 43 nips-2000-Dopamine Bonuses
12 0.06039032 39 nips-2000-Decomposition of Reinforcement Learning for Admission Control of Self-Similar Call Arrival Processes
13 0.054099586 66 nips-2000-Hippocampally-Dependent Consolidation in a Hierarchical Model of Neocortex
14 0.053350091 113 nips-2000-Robust Reinforcement Learning
15 0.047952998 101 nips-2000-Place Cells and Spatial Navigation Based on 2D Visual Feature Extraction, Path Integration, and Reinforcement Learning
16 0.040748142 108 nips-2000-Recognizing Hand-written Digits Using Hierarchical Products of Experts
17 0.035760745 129 nips-2000-Temporally Dependent Plasticity: An Information Theoretic Account
18 0.035087533 4 nips-2000-A Linear Programming Approach to Novelty Detection
19 0.034448639 75 nips-2000-Large Scale Bayes Point Machines
20 0.0344459 148 nips-2000-`N-Body' Problems in Statistical Learning
topicId topicWeight
[(0, 0.147), (1, -0.073), (2, 0.101), (3, -0.343), (4, -0.273), (5, 0.031), (6, -0.012), (7, -0.007), (8, -0.021), (9, -0.004), (10, 0.033), (11, 0.008), (12, -0.065), (13, -0.003), (14, -0.026), (15, 0.033), (16, 0.012), (17, -0.037), (18, -0.031), (19, 0.051), (20, -0.122), (21, -0.049), (22, 0.052), (23, 0.018), (24, -0.03), (25, 0.088), (26, 0.175), (27, 0.027), (28, -0.095), (29, -0.104), (30, -0.253), (31, 0.055), (32, 0.07), (33, 0.002), (34, -0.021), (35, 0.173), (36, -0.042), (37, 0.004), (38, -0.033), (39, 0.039), (40, 0.024), (41, -0.015), (42, 0.014), (43, -0.049), (44, 0.1), (45, 0.039), (46, 0.138), (47, 0.028), (48, 0.062), (49, 0.027)]
simIndex simValue paperId paperTitle
same-paper 1 0.97719383 63 nips-2000-Hierarchical Memory-Based Reinforcement Learning
Author: Natalia Hernandez-Gardiol, Sridhar Mahadevan
Abstract: A key challenge for reinforcement learning is scaling up to large partially observable domains. In this paper, we show how a hierarchy of behaviors can be used to create and select among variable length short-term memories appropriate for a task. At higher levels in the hierarchy, the agent abstracts over lower-level details and looks back over a variable number of high-level decisions in time. We formalize this idea in a framework called Hierarchical Suffix Memory (HSM). HSM uses a memory-based SMDP learning method to rapidly propagate delayed reward across long decision sequences. We describe a detailed experimental study comparing memory vs. hierarchy using the HSM framework on a realistic corridor navigation task. 1
2 0.8384819 105 nips-2000-Programmable Reinforcement Learning Agents
Author: David Andre, Stuart J. Russell
Abstract: We present an expressive agent design language for reinforcement learning that allows the user to constrain the policies considered by the learning process.The language includes standard features such as parameterized subroutines, temporary interrupts, aborts, and memory variables, but also allows for unspecified choices in the agent program. For learning that which isn't specified, we present provably convergent learning algorithms. We demonstrate by example that agent programs written in the language are concise as well as modular. This facilitates state abstraction and the transferability of learned skills.
3 0.80638987 26 nips-2000-Automated State Abstraction for Options using the U-Tree Algorithm
Author: Anders Jonsson, Andrew G. Barto
Abstract: Learning a complex task can be significantly facilitated by defining a hierarchy of subtasks. An agent can learn to choose between various temporally abstract actions, each solving an assigned subtask, to accomplish the overall task. In this paper, we study hierarchical learning using the framework of options. We argue that to take full advantage of hierarchical structure, one should perform option-specific state abstraction, and that if this is to scale to larger tasks, state abstraction should be automated. We adapt McCallum's U-Tree algorithm to automatically build option-specific representations of the state feature space, and we illustrate the resulting algorithm using a simple hierarchical task. Results suggest that automated option-specific state abstraction is an attractive approach to making hierarchical learning systems more effective.
4 0.53426117 28 nips-2000-Balancing Multiple Sources of Reward in Reinforcement Learning
Author: Christian R. Shelton
Abstract: For many problems which would be natural for reinforcement learning, the reward signal is not a single scalar value but has multiple scalar components. Examples of such problems include agents with multiple goals and agents with multiple users. Creating a single reward value by combining the multiple components can throwaway vital information and can lead to incorrect solutions. We describe the multiple reward source problem and discuss the problems with applying traditional reinforcement learning. We then present an new algorithm for finding a solution and results on simulated environments.
5 0.53181165 142 nips-2000-Using Free Energies to Represent Q-values in a Multiagent Reinforcement Learning Task
Author: Brian Sallans, Geoffrey E. Hinton
Abstract: The problem of reinforcement learning in large factored Markov decision processes is explored. The Q-value of a state-action pair is approximated by the free energy of a product of experts network. Network parameters are learned on-line using a modified SARSA algorithm which minimizes the inconsistency of the Q-values of consecutive state-action pairs. Actions are chosen based on the current value estimates by fixing the current state and sampling actions from the network using Gibbs sampling. The algorithm is tested on a co-operative multi-agent task. The product of experts model is found to perform comparably to table-based Q-Iearning for small instances of the task, and continues to perform well when the problem becomes too large for a table-based representation.
6 0.3923884 19 nips-2000-Adaptive Object Representation with Hierarchically-Distributed Memory Sites
7 0.33046985 48 nips-2000-Exact Solutions to Time-Dependent MDPs
8 0.32386112 1 nips-2000-APRICODD: Approximate Policy Construction Using Decision Diagrams
9 0.30044463 66 nips-2000-Hippocampally-Dependent Consolidation in a Hierarchical Model of Neocortex
10 0.29345682 112 nips-2000-Reinforcement Learning with Function Approximation Converges to a Region
11 0.27864951 43 nips-2000-Dopamine Bonuses
12 0.27512971 73 nips-2000-Kernel-Based Reinforcement Learning in Average-Cost Problems: An Application to Optimal Portfolio Choice
13 0.20788358 125 nips-2000-Stability and Noise in Biochemical Switches
14 0.16821305 69 nips-2000-Incorporating Second-Order Functional Knowledge for Better Option Pricing
15 0.15135285 20 nips-2000-Algebraic Information Geometry for Learning Machines with Singularities
16 0.15068218 147 nips-2000-Who Does What? A Novel Algorithm to Determine Function Localization
18 0.14804624 129 nips-2000-Temporally Dependent Plasticity: An Information Theoretic Account
19 0.14483412 103 nips-2000-Probabilistic Semantic Video Indexing
20 0.14038499 116 nips-2000-Sex with Support Vector Machines
topicId topicWeight
[(10, 0.033), (17, 0.056), (33, 0.022), (55, 0.011), (62, 0.664), (65, 0.02), (67, 0.029), (79, 0.012), (81, 0.019), (90, 0.012), (91, 0.012)]
simIndex simValue paperId paperTitle
same-paper 1 0.98838395 63 nips-2000-Hierarchical Memory-Based Reinforcement Learning
Author: Natalia Hernandez-Gardiol, Sridhar Mahadevan
Abstract: A key challenge for reinforcement learning is scaling up to large partially observable domains. In this paper, we show how a hierarchy of behaviors can be used to create and select among variable length short-term memories appropriate for a task. At higher levels in the hierarchy, the agent abstracts over lower-level details and looks back over a variable number of high-level decisions in time. We formalize this idea in a framework called Hierarchical Suffix Memory (HSM). HSM uses a memory-based SMDP learning method to rapidly propagate delayed reward across long decision sequences. We describe a detailed experimental study comparing memory vs. hierarchy using the HSM framework on a realistic corridor navigation task. 1
2 0.94305015 112 nips-2000-Reinforcement Learning with Function Approximation Converges to a Region
Author: Geoffrey J. Gordon
Abstract: Many algorithms for approximate reinforcement learning are not known to converge. In fact, there are counterexamples showing that the adjustable weights in some algorithms may oscillate within a region rather than converging to a point. This paper shows that, for two popular algorithms, such oscillation is the worst that can happen: the weights cannot diverge, but instead must converge to a bounded region. The algorithms are SARSA(O) and V(O); the latter algorithm was used in the well-known TD-Gammon program. 1
3 0.88852906 86 nips-2000-Model Complexity, Goodness of Fit and Diminishing Returns
Author: Igor V. Cadez, Padhraic Smyth
Abstract: We investigate a general characteristic of the trade-off in learning problems between goodness-of-fit and model complexity. Specifically we characterize a general class of learning problems where the goodness-of-fit function can be shown to be convex within firstorder as a function of model complexity. This general property of
4 0.71113682 26 nips-2000-Automated State Abstraction for Options using the U-Tree Algorithm
Author: Anders Jonsson, Andrew G. Barto
Abstract: Learning a complex task can be significantly facilitated by defining a hierarchy of subtasks. An agent can learn to choose between various temporally abstract actions, each solving an assigned subtask, to accomplish the overall task. In this paper, we study hierarchical learning using the framework of options. We argue that to take full advantage of hierarchical structure, one should perform option-specific state abstraction, and that if this is to scale to larger tasks, state abstraction should be automated. We adapt McCallum's U-Tree algorithm to automatically build option-specific representations of the state feature space, and we illustrate the resulting algorithm using a simple hierarchical task. Results suggest that automated option-specific state abstraction is an attractive approach to making hierarchical learning systems more effective.
5 0.6811806 28 nips-2000-Balancing Multiple Sources of Reward in Reinforcement Learning
Author: Christian R. Shelton
Abstract: For many problems which would be natural for reinforcement learning, the reward signal is not a single scalar value but has multiple scalar components. Examples of such problems include agents with multiple goals and agents with multiple users. Creating a single reward value by combining the multiple components can throwaway vital information and can lead to incorrect solutions. We describe the multiple reward source problem and discuss the problems with applying traditional reinforcement learning. We then present an new algorithm for finding a solution and results on simulated environments.
6 0.65716267 142 nips-2000-Using Free Energies to Represent Q-values in a Multiagent Reinforcement Learning Task
7 0.62540805 105 nips-2000-Programmable Reinforcement Learning Agents
8 0.6214627 1 nips-2000-APRICODD: Approximate Policy Construction Using Decision Diagrams
9 0.54976541 48 nips-2000-Exact Solutions to Time-Dependent MDPs
10 0.54022503 80 nips-2000-Learning Switching Linear Models of Human Motion
11 0.51597446 113 nips-2000-Robust Reinforcement Learning
12 0.47956309 43 nips-2000-Dopamine Bonuses
13 0.47153205 92 nips-2000-Occam's Razor
14 0.46207437 108 nips-2000-Recognizing Hand-written Digits Using Hierarchical Products of Experts
15 0.44908902 98 nips-2000-Partially Observable SDE Models for Image Sequence Recognition Tasks
16 0.44869387 139 nips-2000-The Use of MDL to Select among Computational Models of Cognition
17 0.44452819 138 nips-2000-The Use of Classifiers in Sequential Inference
18 0.43272218 69 nips-2000-Incorporating Second-Order Functional Knowledge for Better Option Pricing
19 0.43221304 39 nips-2000-Decomposition of Reinforcement Learning for Admission Control of Self-Similar Call Arrival Processes
20 0.4187597 106 nips-2000-Propagation Algorithms for Variational Bayesian Learning