nips nips2000 nips2000-105 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: David Andre, Stuart J. Russell
Abstract: We present an expressive agent design language for reinforcement learning that allows the user to constrain the policies considered by the learning process. The language includes standard features such as parameterized subroutines, temporary interrupts, aborts, and memory variables, but also allows for unspecified choices in the agent program. For learning that which isn't specified, we present provably convergent learning algorithms. We demonstrate by example that agent programs written in the language are concise as well as modular. This facilitates state abstraction and the transferability of learned skills.
Reference: text
sentIndex sentText sentNum sentScore
1 Abstract We present an expressive agent design language for reinforcement learning that allows the user to constrain the policies considered by the learning process. [sent-4, score-0.635]
2 The language includes standard features such as parameterized subroutines, temporary interrupts, aborts, and memory variables, but also allows for unspecified choices in the agent program. [sent-5, score-0.647]
3 We demonstrate by example that agent programs written in the language are concise as well as modular. [sent-7, score-0.564]
4 This facilitates state abstraction and the transferability of learned skills. [sent-8, score-0.197]
5 This paper describes extensions to the HAM language that substantially increase its expressive power, using constructs borrowed from programming languages. [sent-12, score-0.363]
6 More importantly, the availability of an expressive language allows the agent to learn and generalize behavioral abstractions that would be far more difficult to learn in a less expressive language. [sent-15, score-0.581]
7 For example, the ability to specify parameterized behaviors allows multiple behaviors such as WalkEast, WalkNorth, WalkWest, and WalkSouth to be combined into a single behavior Walk(d), where d is a direction parameter. [sent-16, score-0.211]
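As a minimal sketch of this kind of parameterization (illustrative only; the Walk routine and direction encoding below are our own, not taken from the paper):

    # Illustrative sketch: one parameterized Walk(d) replaces four direction-specific routines.
    DIRECTIONS = {"east": (1, 0), "north": (0, 1), "west": (-1, 0), "south": (0, -1)}

    def walk(position, d):
        """Take one step from (x, y) in direction d; stands in for WalkEast, WalkNorth, etc."""
        dx, dy = DIRECTIONS[d]
        x, y = position
        return (x + dx, y + dy)

    # Usage: walk((2, 3), "north") plays the role of a dedicated WalkNorth behavior.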
8 Furthermore, if a behavior is appropriately parameterized, decisions within the behavior can be made independently of the "calling context" (the hierarchy of tasks within which the behavior is being executed). [sent-17, score-0.138]
9 Interrupts and aborts in particular are very important in physical behaviors (more so than in computation) and are crucial in allowing for modularity in behavioral descriptions. [sent-22, score-0.159]
10 These features are all common in robot programming languages [2, 3, 5]; the key element of our approach is that behaviors need only be partially described; reinforcement learning does the rest. [sent-23, score-0.317]
11 To tie our extended language to existing reinforcement learning algorithms, we utilize Parr and Russell's [8] notion of the joint semi-Markov decision process (SMDP) created when a HAM is composed with an environment (modeled as an MDP). [sent-24, score-0.405]
12 The joint SMDP state space consists of the cross-product of the machine states in the HAM and the states in the original MDP; the dynamics are created by the application of the HAM in the MDP. [sent-25, score-0.389]
13 Furthermore, Parr and Russell show that the joint SMDP can be reduced to an equivalent SMDP with a state space consisting only of the states where the HAM does not specify an action, which reduces the complexity of the SMDP problem that must be solved. [sent-27, score-0.277]
14 We show that these results hold for our extended language of Programmable HAMs (PHAMs). [sent-28, score-0.234]
15 To demonstrate the usefulness of the new language, we show a small, complete program for a complex environment that would require a much larger program in previous formalisms. [sent-29, score-0.239]
16 A solution to an MDP is an optimal policy π* that maps from S → A and achieves maximum expected discounted reward for the agent. [sent-33, score-0.128]
17 That is, it specifies a distribution over both output states and action durations. [sent-37, score-0.239]
18 The discount factor, β, is generalized to be a function, β(s, a), that represents the expected discount factor when action a is taken in state s. [sent-39, score-0.302]
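The extracted text stops short of the corresponding value equation; one standard SMDP form consistent with this generalized discount (our addition, not quoted from the paper) is

    V^*(s) = \max_a \Big[ r(s,a) + \beta(s,a) \sum_{s'} P(s' \mid s,a)\, V^*(s') \Big]

where r(s, a) is the expected discounted reward accumulated while the temporally extended action a runs.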
19 The HAM language [8] provides for partial specification of agent programs. [sent-41, score-0.457]
20 A HAM program consists of a set of partially specified Moore machines. [sent-42, score-0.154]
21 Transitions in each machine may depend stochastically on (features of) the environment state, and the outputs of each machine are primitive actions or nonrecursive invocations of other machines. [sent-43, score-0.218]
22 The states in each machine can be of four types: {action, call, choice, stop}. [sent-44, score-0.148]
23 Each machine has a single distinguished start state and may have one or more distinguished stop states. [sent-45, score-0.195]
24 When a machine is invoked, control starts at the start state; stop states return control back to the calling machine. [sent-46, score-0.275]
25 A call state invokes another machine as a subroutine. [sent-48, score-0.128]
26 A choice state may have several possible next states; after learning, the choice is reduced to a single next state. [sent-49, score-0.276]
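A minimal sketch of this machine structure (our own illustration in Python; the class and field names are assumptions, not the paper's implementation):

    # Illustrative sketch of a partially specified Moore machine in the HAM style.
    from dataclasses import dataclass, field
    from typing import Optional, List, Dict

    @dataclass
    class MachineState:
        kind: str                      # "start", "stop", "action", "call", or "choice"
        action: Optional[str] = None   # primitive action, for action states
        callee: Optional[str] = None   # machine invoked as a subroutine, for call states
        successors: List[str] = field(default_factory=list)  # choice states list several

    @dataclass
    class Machine:
        name: str
        states: Dict[str, MachineState]  # keyed by state name
        start: str                       # name of the single distinguished start state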
27 3 Programmable HAMs Consider the problem of creating a HAM program for the Deliver-Patrol domain presented in Figure 1, which has 38,400 states. [sent-50, score-0.1]
28 In addition to delivering mail and picking up occasional additional rewards while patrolling (both of which require efficient navigation and safe maneuvering), the robot must keep its battery charged (lest it be stranded) and its camera lens clean (lest it crash). [sent-51, score-0.429]
29 Because all the 5 x 5 "rooms" are similar, one can write a "traverse the room" HAM routine that works in all rooms, but a different routine is needed for each direction (north-south, south-north, east-west, etc.). [sent-53, score-0.212]
30 Such redundancy suggests the need for a "traverse the room" routine that is parameterized by the desired direction. [sent-55, score-0.147]
31 Consider also the fact that the robot should clean its camera lens whenever it gets dirty. [sent-56, score-0.176]
32 Right-facing half-circles are start states, left-facing half-circles are stop states, hexagons are call states, ovals are primitive actions, and squares are choice points. [sent-68, score-0.198]
33 When arguments to call states are in braces, then the choice is over the arguments to pass to the subroutine. [sent-70, score-0.186]
34 The Root() PHAM specifies an interrupt to clean the camera lens whenever it gets dirty; the Work() PHAM interrupts its patrolling whenever there is mail to be delivered. [sent-71, score-0.611]
35 The arrows in the drawing of the room indicate the behavior specified by the p() transition function in ToDoor(dest, sp). [sent-73, score-0.252]
36 In the HAM language, this conditional action must be inserted after every state in every HAM. [sent-80, score-0.182]
37 An interrupt mechanism with appropriate scoping would obviate the need for such widespread mutilation. [sent-81, score-0.106]
38 We provide here an informal summary of the language features that enable concise agent programs to be written. [sent-83, score-0.594]
39 The corresponding HAM program requires 63 machines, many of which have significantly more states than their PHAM counterparts. [sent-85, score-0.212]
40 The PHAM language adds several structured programming constructs to the HAM language. [sent-86, score-0.286]
41 To enable this, we introduce two additional types of states in the PHAM: internal states, which execute an internal computational action (such as setting memory variables to a function of the current state), and null states, which have no direct effect and are used for computational convenience. [sent-87, score-0.425]
42 Parameterization is key for expressing concise agent specifications, as can be seen in the Deliver-Patrol task. [sent-88, score-0.234]
43 θk, the values of which must be filled in by the calling subroutine (and can depend on any function of the machine, parameter, memory, and environment state). [sent-92, score-0.223]
44 The ToDoor(dest, speed) subroutine is for navigating a single room of the agent's building. [sent-95, score-0.207]
45 The p() transition function stores a parameterized policy for getting to each door. [sent-96, score-0.155]
46 As well as the camera-lens interrupt described earlier, the robot needs to abort its current activity if the battery is low and should interrupt its patrolling activity if mail arrives for delivery. [sent-100, score-0.651]
47 The PHAM language allows abort conditions to be specified at the point where a subroutine is invoked within a calling routine; those conditions are in force until the subroutine exits. [sent-101, score-0.755]
48 For each abort condition, an "abort handler" state is specified within the calling routine, to which control returns if the condition becomes true. [sent-102, score-0.33]
49 Graphically, aborts are depicted as labelled dotted lines [sent-104, score-0.117]
50 (e.g., in the DoAll() PHAM in Figure 3), and interrupts are shown as labelled dashed lines with arrows on both ends. [sent-106, score-0.212]
51 Some previous research has been done on using memory variables in reinforcement learning in partially observable domains [10]. [sent-110, score-0.202]
52 For an example of memory use in our language, examine the DoDelivery subroutine in Figure 1(b), where z2 is set to another memory value (set in Nav(dest, sp)). [sent-111, score-0.338]
53 Computational functions such as dest in the Nav(dest, sp) subroutine are restricted to be recursive functions taking effectively zero time. [sent-113, score-0.159]
54 A PHAM is assumed to have a finite number of memory variables, z1, ..., zn, [sent-114, score-0.107]
55 which can be combined to yield the memory state, Z. [sent-117, score-0.107]
56 The agent can set memory variables by using internal states, which are computational action states with actions in the following format: (set z_i ψ(m, θ, x, Z)), where ψ(m, θ, x, Z) is a function taking the machine, parameter, environment, and memory state as parameters. [sent-119, score-0.794]
57 The transition function, parameter-setting functions, and choice functions take the memory state into account as well. [sent-120, score-0.301]
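A small sketch of the memory-update mechanism just described (illustrative; the function ψ and the argument layout are assumptions based on the text, not the paper's API):

    # Illustrative sketch: an internal "set" action updates one memory variable as a
    # function of machine state m, parameters theta, environment state x, and memory Z.
    def apply_set(memory, i, psi, m, theta, x):
        """Execute (set z_i psi(m, theta, x, Z)) and return the updated memory tuple."""
        Z = tuple(memory)
        updated = list(memory)
        updated[i] = psi(m, theta, x, Z)
        return tuple(updated)

    # Usage: record a destination parameter in a memory variable, as in DoDelivery.
    # memory = apply_set(memory, 1, lambda m, theta, x, Z: theta["dest"], m, theta, x)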
58 In summary (see also Figure 4): The composition H ∘ M of a PHAM H with the underlying MDP M is defined using the cross product of states in H and M. [sent-122, score-0.15]
59 Finally, H ∘ M may be reduced to an equivalent SMDP whose states are just the choice points, [sent-125, score-0.222]
60 i.e., the joint states where the machine state is a choice state. [sent-127, score-0.351]
61 Definition 1 (Programmable Hierarchical Abstract Machines: PHAMs) A PHAM is a tuple H = (μ, Θ, δ, ρ, ~, I, μ_I, A, μ_A, Z, Ψ), where μ is the set of machine states in H, Θ is the ... [sent-129, score-0.148]
62 If π is an optimal solution for H ∘ M, then the primitive actions specified by π constitute an optimal policy for M among those consistent with H. [sent-132, score-0.247]
63 First, not all pairs of PHAM and MDP states will be reachable from the initial state; second, the complexity of the induced SMDP is solely determined by the number of reachable choice points. [sent-135, score-0.246]
64 There exists an SMDP called reduce(H ∘ M) with states C such that the optimal policy for reduce(H ∘ M) corresponds to an optimal policy for M among those consistent with H. [sent-137, score-0.284]
65 Alternatively, and much more simply, we can solve it using online, model-free HAMQ-learning [8], which learns directly in the reduced state space of choice points. [sent-139, score-0.202]
66 Starting from a choice state w where the agent takes action a, the agent keeps track of the reward r_tot and discount β_tot accumulated on the way to the next choice point, w'. [sent-140, score-0.818]
67 On each step, the agent encounters reward r_i and discount β_i (note that β_i is 1 exactly when the agent transitions only in the PHAM and not in the MDP), and updates the totals as follows: r_tot ← r_tot + β_tot·r_i; β_tot ← β_tot·β_i. [sent-141, score-0.594]
68 The agent maintains a Q-table, Q(w, a), indexed by choice state and action. [sent-142, score-0.359]
69 When the agent gets to the next choice state, w', it updates the Q-table as follows: Q(w, a) ← (1 − α)Q(w, a) + α[r_tot + β_tot max_u Q(w', u)]. [sent-143, score-0.296]
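A compact sketch of the bookkeeping and update described in the last few sentences (our reconstruction from the text; the data structures and function signature are illustrative):

    # Sketch of the HAMQ-learning update between choice points (not the authors' code).
    from collections import defaultdict

    Q = defaultdict(float)   # Q-table indexed by (choice_state, action)
    alpha = 0.1              # learning rate

    def hamq_update(w, a, transitions, w_next, actions_at_w_next):
        """w, a: previous choice point and the action taken there.
        transitions: (r_i, beta_i) pairs encountered on the way to the next choice point.
        w_next: the next choice point; actions_at_w_next: the choices available there."""
        r_tot, beta_tot = 0.0, 1.0
        for r_i, beta_i in transitions:
            r_tot += beta_tot * r_i      # r_tot <- r_tot + beta_tot * r_i
            beta_tot *= beta_i           # beta_tot <- beta_tot * beta_i
        target = r_tot + beta_tot * max(Q[(w_next, u)] for u in actions_at_w_next)
        Q[(w, a)] = (1 - alpha) * Q[(w, a)] + alpha * target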
70 5 Expressiveness of the PHAM language As shown by Parr [9], the HAM language is at least as expressive as some existing action languages, including options [12] and full-β models [11]. [sent-146, score-0.681]
71 The PHAM language is substantially more expressive than HAMs. [sent-147, score-0.311]
72 As mentioned earlier, the Deliver-Patrol PHAM program has 9 machines whereas the HAM program requires 63. [sent-148, score-0.2]
73 In general, the additional number of states required to express a PHAM as a pure HAM is |V(Z) × C × Θ|, where V(Z) is the memory state space, C is the set of possible abort/interrupt contexts, and Θ is the total parameter space. [sent-149, score-0.311]
74 We also developed a PHAM program for the 3,700-state maze world used by Parr and Russell [8]. [sent-150, score-0.13]
75 The HAM used in their experiments had 37 machines; the PHAM program requires only 7. [sent-151, score-0.1]
76 (1) The top two diagrams are of a PHAM fragment with 1 choice state and 3 action states (of which one, labelled d, is the start state). [sent-153, score-0.426]
77 The MDP has 4 states, and action d always leads to state 1 or 4. [sent-154, score-0.182]
78 Note that there are no incoming arcs to the states < c, 2 > or < c, 3 >. [sent-157, score-0.112]
79 There are only 2 states in the reduced SMDP because there are no incoming arcs to the states < c, 2 > or < c, 3 >. [sent-160, score-0.26]
80 With respect to the induced choice points, the Deliver-Patrol PHAM induces 7,816 choice points in the joint SMDP, compared with 38,400 in the original MDP. [sent-167, score-0.185]
81 Figure 5 shows empirical results for the Deliver-Patrol problem, indicating that Q-learning with a suitable PHAM program is far faster than flat Q-learning. [sent-169, score-0.132]
82 (Parr and Russell observed similar results for the maze world, where HAMQ-learning finds a good policy in 270,000 iterations compared to 9,000,000 for flat Q-learning.) [sent-170, score-0.148]
83 Note that equivalent HAM and PHAM programs yield identical reductions in the number of choice points and identical speedups in Q-learning. [sent-171, score-0.17]
84 However, this would be akin to arguing that the Java programming language offers nothing over Boolean circuits. [sent-173, score-0.286]
85 Ease of expression and greater modularity can greatly simplify the task of coding reinforcement learning agents that take advantage of prior knowledge. [sent-174, score-0.197]
86 The initial PHAM program was constructed on the assumption that the agent should patrol among A, B, C, D unless there is mail to be delivered. [sent-176, score-0.523]
87 However, the specific rewards are such that the optimal behavior is to loiter in the mail room until mail arrives, thereby avoiding costly delays in mail delivery. [sent-177, score-0.635]
88 The PHAM-Q learning agents learned this optimal behavior by "retargeting" the Nav routine to stay in the mail room rather than go to the specified destination. [sent-178, score-0.507]
89 This example demonstrates the difference between constraining behavior through structure and constraining behavior through subgoals: the former method may give the agent greater flexibility but may yield "surprising" results. [sent-179, score-0.285]
90 As expected, the agent learned a suboptimal policy in which Nav had the intended meaning of travelling to a specified destination. [sent-181, score-0.361]
91 This experience suggests a natural debugging cycle in which the agent designer may examine learned behaviors and adjust the PHAM program accordingly. [sent-182, score-0.383]
92 The additional features of the PHAM language allow direct expression of programs from other formalisms that are not easily expressed using HAMs. [sent-183, score-0.33]
93 For example, programs in Dietterich's MAXQ language [4] are written easily as PHAMs, but not as HAMs because the MAXQ language allows parameters. [sent-184, score-0.564]
94 The language of teleo-reactive (TR) programs [7, 2] relies on a prioritized set of condition-action rules to achieve a goal. [sent-185, score-0.33]
95 The TR architecture can be implemented directly in PHAMs using the abort mechanism [1]. [sent-187, score-0.124]
96 This requires state abstraction in order to learn choices within PHAMs that are applicable in large classes of circumstances rather than just to each invocation instance separately. [sent-189, score-0.206]
97 Dietterich [4] has derived conditions under which state abstraction can be done within his MAXQ framework without sacrificing recursive optimality (a weaker form of optimality than hierarchical optimality). [sent-190, score-0.308]
98 This decomposition critically depends on the modularity of the programs introduced by the language extensions presented in this paper. [sent-192, score-0.401]
99 Recently, we have added recursion and complex data structures to the PHAM language, incorporating it into a standard programming language (Lisp). [sent-193, score-0.286]
100 Between MDPs and semi-MDPs: A framework for temporal abstraction in reinforcement learning. [sent-267, score-0.172]
wordName wordTfidf (topN-words)
[('pham', 0.531), ('ham', 0.274), ('language', 0.234), ('smdp', 0.213), ('agent', 0.193), ('phams', 0.177), ('mail', 0.159), ('interrupts', 0.142), ('mdp', 0.14), ('abort', 0.124), ('subroutine', 0.124), ('states', 0.112), ('memory', 0.107), ('interrupt', 0.106), ('routine', 0.106), ('program', 0.1), ('parr', 0.097), ('programs', 0.096), ('reinforcement', 0.095), ('state', 0.092), ('action', 0.09), ('aborts', 0.088), ('hams', 0.088), ('nav', 0.088), ('pram', 0.088), ('programmable', 0.088), ('policy', 0.086), ('room', 0.083), ('abstraction', 0.077), ('expressive', 0.077), ('choice', 0.074), ('handler', 0.071), ('modularity', 0.071), ('patrol', 0.071), ('todoor', 0.071), ('maxq', 0.069), ('russell', 0.065), ('robot', 0.062), ('behaviors', 0.062), ('discount', 0.06), ('calling', 0.06), ('primitive', 0.057), ('hierarchical', 0.057), ('specified', 0.054), ('fjtot', 0.053), ('lens', 0.053), ('patrolling', 0.053), ('rtot', 0.053), ('programming', 0.052), ('actions', 0.05), ('behavior', 0.046), ('dir', 0.046), ('languages', 0.046), ('internal', 0.043), ('il', 0.042), ('reward', 0.042), ('battery', 0.041), ('concise', 0.041), ('arrows', 0.041), ('optimality', 0.041), ('parameterized', 0.041), ('move', 0.041), ('environment', 0.039), ('composition', 0.038), ('stop', 0.038), ('joint', 0.037), ('choices', 0.037), ('specifies', 0.037), ('reduced', 0.036), ('machine', 0.036), ('policies', 0.036), ('dest', 0.035), ('fji', 0.035), ('invoked', 0.035), ('java', 0.035), ('lair', 0.035), ('rooms', 0.035), ('rooto', 0.035), ('subroutines', 0.035), ('unspecified', 0.035), ('worko', 0.035), ('flat', 0.032), ('clean', 0.032), ('agents', 0.031), ('icml', 0.03), ('expressiveness', 0.03), ('ila', 0.03), ('maze', 0.03), ('lest', 0.03), ('reachable', 0.03), ('traverse', 0.03), ('partial', 0.03), ('zl', 0.03), ('enable', 0.03), ('start', 0.029), ('labelled', 0.029), ('gets', 0.029), ('rewards', 0.029), ('learned', 0.028), ('transition', 0.028)]
simIndex simValue paperId paperTitle
same-paper 1 1.0000002 105 nips-2000-Programmable Reinforcement Learning Agents
Author: David Andre, Stuart J. Russell
Abstract: We present an expressive agent design language for reinforcement learning that allows the user to constrain the policies considered by the learning process. The language includes standard features such as parameterized subroutines, temporary interrupts, aborts, and memory variables, but also allows for unspecified choices in the agent program. For learning that which isn't specified, we present provably convergent learning algorithms. We demonstrate by example that agent programs written in the language are concise as well as modular. This facilitates state abstraction and the transferability of learned skills.
2 0.19752252 63 nips-2000-Hierarchical Memory-Based Reinforcement Learning
Author: Natalia Hernandez-Gardiol, Sridhar Mahadevan
Abstract: A key challenge for reinforcement learning is scaling up to large partially observable domains. In this paper, we show how a hierarchy of behaviors can be used to create and select among variable length short-term memories appropriate for a task. At higher levels in the hierarchy, the agent abstracts over lower-level details and looks back over a variable number of high-level decisions in time. We formalize this idea in a framework called Hierarchical Suffix Memory (HSM). HSM uses a memory-based SMDP learning method to rapidly propagate delayed reward across long decision sequences. We describe a detailed experimental study comparing memory vs. hierarchy using the HSM framework on a realistic corridor navigation task. 1
3 0.17212193 28 nips-2000-Balancing Multiple Sources of Reward in Reinforcement Learning
Author: Christian R. Shelton
Abstract: For many problems which would be natural for reinforcement learning, the reward signal is not a single scalar value but has multiple scalar components. Examples of such problems include agents with multiple goals and agents with multiple users. Creating a single reward value by combining the multiple components can throwaway vital information and can lead to incorrect solutions. We describe the multiple reward source problem and discuss the problems with applying traditional reinforcement learning. We then present an new algorithm for finding a solution and results on simulated environments.
4 0.17009327 142 nips-2000-Using Free Energies to Represent Q-values in a Multiagent Reinforcement Learning Task
Author: Brian Sallans, Geoffrey E. Hinton
Abstract: The problem of reinforcement learning in large factored Markov decision processes is explored. The Q-value of a state-action pair is approximated by the free energy of a product of experts network. Network parameters are learned on-line using a modified SARSA algorithm which minimizes the inconsistency of the Q-values of consecutive state-action pairs. Actions are chosen based on the current value estimates by fixing the current state and sampling actions from the network using Gibbs sampling. The algorithm is tested on a co-operative multi-agent task. The product of experts model is found to perform comparably to table-based Q-learning for small instances of the task, and continues to perform well when the problem becomes too large for a table-based representation.
5 0.14968166 26 nips-2000-Automated State Abstraction for Options using the U-Tree Algorithm
Author: Anders Jonsson, Andrew G. Barto
Abstract: Learning a complex task can be significantly facilitated by defining a hierarchy of subtasks. An agent can learn to choose between various temporally abstract actions, each solving an assigned subtask, to accomplish the overall task. In this paper, we study hierarchical learning using the framework of options. We argue that to take full advantage of hierarchical structure, one should perform option-specific state abstraction, and that if this is to scale to larger tasks, state abstraction should be automated. We adapt McCallum's U-Tree algorithm to automatically build option-specific representations of the state feature space, and we illustrate the resulting algorithm using a simple hierarchical task. Results suggest that automated option-specific state abstraction is an attractive approach to making hierarchical learning systems more effective.
6 0.11051309 73 nips-2000-Kernel-Based Reinforcement Learning in Average-Cost Problems: An Application to Optimal Portfolio Choice
7 0.10526069 112 nips-2000-Reinforcement Learning with Function Approximation Converges to a Region
8 0.096253783 48 nips-2000-Exact Solutions to Time-Dependent MDPs
9 0.087954149 1 nips-2000-APRICODD: Approximate Policy Construction Using Decision Diagrams
11 0.062870398 39 nips-2000-Decomposition of Reinforcement Learning for Admission Control of Self-Similar Call Arrival Processes
12 0.055928502 113 nips-2000-Robust Reinforcement Learning
13 0.052503787 6 nips-2000-A Neural Probabilistic Language Model
14 0.050555989 43 nips-2000-Dopamine Bonuses
15 0.04552345 138 nips-2000-The Use of Classifiers in Sequential Inference
16 0.038694374 19 nips-2000-Adaptive Object Representation with Hierarchically-Distributed Memory Sites
17 0.033480734 17 nips-2000-Active Learning for Parameter Estimation in Bayesian Networks
18 0.032538407 106 nips-2000-Propagation Algorithms for Variational Bayesian Learning
19 0.032144178 41 nips-2000-Discovering Hidden Variables: A Structure-Based Approach
20 0.030975185 135 nips-2000-The Manhattan World Assumption: Regularities in Scene Statistics which Enable Bayesian Inference
topicId topicWeight
[(0, 0.147), (1, -0.061), (2, 0.111), (3, -0.32), (4, -0.258), (5, 0.018), (6, -0.027), (7, -0.019), (8, -0.023), (9, -0.018), (10, 0.015), (11, 0.002), (12, -0.052), (13, -0.012), (14, -0.06), (15, 0.043), (16, 0.033), (17, -0.022), (18, -0.027), (19, 0.012), (20, -0.075), (21, -0.028), (22, -0.003), (23, 0.005), (24, -0.037), (25, -0.026), (26, 0.107), (27, 0.01), (28, -0.098), (29, -0.01), (30, -0.114), (31, 0.016), (32, 0.078), (33, -0.012), (34, 0.012), (35, 0.142), (36, -0.037), (37, 0.053), (38, 0.002), (39, 0.073), (40, 0.019), (41, 0.045), (42, -0.011), (43, -0.036), (44, 0.035), (45, -0.053), (46, 0.058), (47, 0.044), (48, 0.012), (49, 0.093)]
simIndex simValue paperId paperTitle
same-paper 1 0.96222538 105 nips-2000-Programmable Reinforcement Learning Agents
Author: David Andre, Stuart J. Russell
Abstract: We present an expressive agent design language for reinforcement learning that allows the user to constrain the policies considered by the learning process. The language includes standard features such as parameterized subroutines, temporary interrupts, aborts, and memory variables, but also allows for unspecified choices in the agent program. For learning that which isn't specified, we present provably convergent learning algorithms. We demonstrate by example that agent programs written in the language are concise as well as modular. This facilitates state abstraction and the transferability of learned skills.
2 0.9068045 63 nips-2000-Hierarchical Memory-Based Reinforcement Learning
Author: Natalia Hernandez-Gardiol, Sridhar Mahadevan
Abstract: A key challenge for reinforcement learning is scaling up to large partially observable domains. In this paper, we show how a hierarchy of behaviors can be used to create and select among variable length short-term memories appropriate for a task. At higher levels in the hierarchy, the agent abstracts over lower-level details and looks back over a variable number of high-level decisions in time. We formalize this idea in a framework called Hierarchical Suffix Memory (HSM). HSM uses a memory-based SMDP learning method to rapidly propagate delayed reward across long decision sequences. We describe a detailed experimental study comparing memory vs. hierarchy using the HSM framework on a realistic corridor navigation task. 1
3 0.76045626 26 nips-2000-Automated State Abstraction for Options using the U-Tree Algorithm
Author: Anders Jonsson, Andrew G. Barto
Abstract: Learning a complex task can be significantly facilitated by defining a hierarchy of subtasks. An agent can learn to choose between various temporally abstract actions, each solving an assigned subtask, to accomplish the overall task. In this paper, we study hierarchical learning using the framework of options. We argue that to take full advantage of hierarchical structure, one should perform option-specific state abstraction, and that if this is to scale to larger tasks, state abstraction should be automated. We adapt McCallum's U-Tree algorithm to automatically build option-specific representations of the state feature space, and we illustrate the resulting algorithm using a simple hierarchical task. Results suggest that automated option-specific state abstraction is an attractive approach to making hierarchical learning systems more effective.
4 0.6788711 28 nips-2000-Balancing Multiple Sources of Reward in Reinforcement Learning
Author: Christian R. Shelton
Abstract: For many problems which would be natural for reinforcement learning, the reward signal is not a single scalar value but has multiple scalar components. Examples of such problems include agents with multiple goals and agents with multiple users. Creating a single reward value by combining the multiple components can throwaway vital information and can lead to incorrect solutions. We describe the multiple reward source problem and discuss the problems with applying traditional reinforcement learning. We then present an new algorithm for finding a solution and results on simulated environments.
5 0.66768867 142 nips-2000-Using Free Energies to Represent Q-values in a Multiagent Reinforcement Learning Task
Author: Brian Sallans, Geoffrey E. Hinton
Abstract: The problem of reinforcement learning in large factored Markov decision processes is explored. The Q-value of a state-action pair is approximated by the free energy of a product of experts network. Network parameters are learned on-line using a modified SARSA algorithm which minimizes the inconsistency of the Q-values of consecutive state-action pairs. Actions are chosen based on the current value estimates by fixing the current state and sampling actions from the network using Gibbs sampling. The algorithm is tested on a co-operative multi-agent task. The product of experts model is found to perform comparably to table-based Q-learning for small instances of the task, and continues to perform well when the problem becomes too large for a table-based representation.
6 0.55801266 112 nips-2000-Reinforcement Learning with Function Approximation Converges to a Region
7 0.46572015 73 nips-2000-Kernel-Based Reinforcement Learning in Average-Cost Problems: An Application to Optimal Portfolio Choice
8 0.42282233 1 nips-2000-APRICODD: Approximate Policy Construction Using Decision Diagrams
9 0.40483874 48 nips-2000-Exact Solutions to Time-Dependent MDPs
10 0.29405296 43 nips-2000-Dopamine Bonuses
12 0.25102797 19 nips-2000-Adaptive Object Representation with Hierarchically-Distributed Memory Sites
13 0.23139556 113 nips-2000-Robust Reinforcement Learning
14 0.22454734 125 nips-2000-Stability and Noise in Biochemical Switches
15 0.21835592 66 nips-2000-Hippocampally-Dependent Consolidation in a Hierarchical Model of Neocortex
16 0.21425724 138 nips-2000-The Use of Classifiers in Sequential Inference
17 0.1929227 96 nips-2000-One Microphone Source Separation
18 0.19014095 135 nips-2000-The Manhattan World Assumption: Regularities in Scene Statistics which Enable Bayesian Inference
19 0.18272227 116 nips-2000-Sex with Support Vector Machines
20 0.17917563 6 nips-2000-A Neural Probabilistic Language Model
topicId topicWeight
[(10, 0.05), (17, 0.082), (32, 0.011), (33, 0.035), (55, 0.025), (62, 0.136), (65, 0.034), (67, 0.042), (75, 0.014), (76, 0.016), (77, 0.354), (79, 0.025), (81, 0.014), (90, 0.031), (97, 0.025)]
simIndex simValue paperId paperTitle
same-paper 1 0.85782754 105 nips-2000-Programmable Reinforcement Learning Agents
Author: David Andre, Stuart J. Russell
Abstract: We present an expressive agent design language for reinforcement learning that allows the user to constrain the policies considered by the learning process. The language includes standard features such as parameterized subroutines, temporary interrupts, aborts, and memory variables, but also allows for unspecified choices in the agent program. For learning that which isn't specified, we present provably convergent learning algorithms. We demonstrate by example that agent programs written in the language are concise as well as modular. This facilitates state abstraction and the transferability of learned skills.
2 0.7379927 100 nips-2000-Permitted and Forbidden Sets in Symmetric Threshold-Linear Networks
Author: Richard H. R. Hahnloser, H. Sebastian Seung
Abstract: Ascribing computational principles to neural feedback circuits is an important problem in theoretical neuroscience. We study symmetric threshold-linear networks and derive stability results that go beyond the insights that can be gained from Lyapunov theory or energy functions. By applying linear analysis to subnetworks composed of coactive neurons, we determine the stability of potential steady states. We find that stability depends on two types of eigenmodes. One type determines global stability and the other type determines whether or not multistability is possible. We can prove the equivalence of our stability criteria with criteria taken from quadratic programming. Also, we show that there are permitted sets of neurons that can be coactive at a steady state and forbidden sets that cannot. Permitted sets are clustered in the sense that subsets of permitted sets are permitted and supersets of forbidden sets are forbidden. By viewing permitted sets as memories stored in the synaptic connections, we can provide a formulation of longterm memory that is more general than the traditional perspective of fixed point attractor networks. A Lyapunov-function can be used to prove that a given set of differential equations is convergent. For example, if a neural network possesses a Lyapunov-function, then for almost any initial condition, the outputs of the neurons converge to a stable steady state. In the past, this stability-property was used to construct attractor networks that associatively recall memorized patterns. Lyapunov theory applies mainly to symmetric networks in which neurons have monotonic activation functions [1, 2]. Here we show that the restriction of activation functions to threshold-linear ones is not a mere limitation, but can yield new insights into the computational behavior of recurrent networks (for completeness, see also [3]). We present three main theorems about the neural responses to constant inputs. The first theorem provides necessary and sufficient conditions on the synaptic weight matrix for the existence of a globally asymptotically stable set of fixed points. These conditions can be expressed in terms of copositivity, a concept from quadratic programming and linear complementarity theory. Alternatively, they can be expressed in terms of certain eigenvalues and eigenvectors of submatrices of the synaptic weight matrix, making a connection to linear systems theory. The theorem guarantees that the network will produce a steady state response to any constant input. We regard this response as the computational output of the network, and its characterization is the topic of the second and third theorems. In the second theorem, we introduce the idea of permitted and forbidden sets. Under certain conditions on the synaptic weight matrix, we show that there exist sets of neurons that are
3 0.57865715 37 nips-2000-Convergence of Large Margin Separable Linear Classification
Author: Tong Zhang
Abstract: Large margin linear classification methods have been successfully applied to many applications. For a linearly separable problem, it is known that under appropriate assumptions, the expected misclassification error of the computed
4 0.53881031 81 nips-2000-Learning Winner-take-all Competition Between Groups of Neurons in Lateral Inhibitory Networks
Author: Xiaohui Xie, Richard H. R. Hahnloser, H. Sebastian Seung
Abstract: It has long been known that lateral inhibition in neural networks can lead to a winner-take-all competition, so that only a single neuron is active at a steady state. Here we show how to organize lateral inhibition so that groups of neurons compete to be active. Given a collection of potentially overlapping groups, the inhibitory connectivity is set by a formula that can be interpreted as arising from a simple learning rule. Our analysis demonstrates that such inhibition generally results in winner-take-all competition between the given groups, with the exception of some degenerate cases. In a broader context, the network serves as a particular illustration of the general distinction between permitted and forbidden sets, which was introduced recently. From this viewpoint, the computational function of our network is to store and retrieve memories as permitted sets of coactive neurons. In traditional winner-take-all networks, lateral inhibition is used to enforce a localized, or
5 0.43216097 112 nips-2000-Reinforcement Learning with Function Approximation Converges to a Region
Author: Geoffrey J. Gordon
Abstract: Many algorithms for approximate reinforcement learning are not known to converge. In fact, there are counterexamples showing that the adjustable weights in some algorithms may oscillate within a region rather than converging to a point. This paper shows that, for two popular algorithms, such oscillation is the worst that can happen: the weights cannot diverge, but instead must converge to a bounded region. The algorithms are SARSA(0) and V(0); the latter algorithm was used in the well-known TD-Gammon program. 1
6 0.43029198 86 nips-2000-Model Complexity, Goodness of Fit and Diminishing Returns
7 0.42556712 26 nips-2000-Automated State Abstraction for Options using the U-Tree Algorithm
8 0.41875455 28 nips-2000-Balancing Multiple Sources of Reward in Reinforcement Learning
9 0.41046178 142 nips-2000-Using Free Energies to Represent Q-values in a Multiagent Reinforcement Learning Task
10 0.40604123 63 nips-2000-Hierarchical Memory-Based Reinforcement Learning
11 0.4044629 1 nips-2000-APRICODD: Approximate Policy Construction Using Decision Diagrams
12 0.39203894 80 nips-2000-Learning Switching Linear Models of Human Motion
13 0.39169088 111 nips-2000-Regularized Winnow Methods
14 0.38060728 98 nips-2000-Partially Observable SDE Models for Image Sequence Recognition Tasks
15 0.37613437 106 nips-2000-Propagation Algorithms for Variational Bayesian Learning
16 0.3736783 69 nips-2000-Incorporating Second-Order Functional Knowledge for Better Option Pricing
17 0.3713859 92 nips-2000-Occam's Razor
18 0.36707905 138 nips-2000-The Use of Classifiers in Sequential Inference
19 0.36670262 48 nips-2000-Exact Solutions to Time-Dependent MDPs
20 0.36460128 139 nips-2000-The Use of MDL to Select among Computational Models of Cognition