nips nips2000 nips2000-28 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Christian R. Shelton
Abstract: For many problems which would be natural for reinforcement learning, the reward signal is not a single scalar value but has multiple scalar components. Examples of such problems include agents with multiple goals and agents with multiple users. Creating a single reward value by combining the multiple components can throw away vital information and can lead to incorrect solutions. We describe the multiple reward source problem and discuss the problems with applying traditional reinforcement learning. We then present a new algorithm for finding a solution and results on simulated environments.
Reference: text
sentIndex sentText sentNum sentScore
1 Abstract: For many problems which would be natural for reinforcement learning, the reward signal is not a single scalar value but has multiple scalar components. [sent-4, score-0.539]
2 Examples of such problems include agents with multiple goals and agents with multiple users. [sent-5, score-0.276]
3 Creating a single reward value by combining the multiple components can throw away vital information and can lead to incorrect solutions. [sent-6, score-0.348]
4 We describe the multiple reward source problem and discuss the problems with applying traditional reinforcement learning. [sent-7, score-0.835]
5 We then present a new algorithm for finding a solution and results on simulated environments. [sent-8, score-0.135]
6 1 Introduction In the traditional reinforcement learning framework, the learning agent is given a single scalar value of reward at each time step. [sent-9, score-0.761]
7 The goal is for the agent to optimize the sum of these rewards over time (the return). [sent-10, score-0.575]
8 We might construct the reward signal to be the total number of people paying attention to the system. [sent-13, score-0.327]
9 However, a reward signal of 2 ignores important information about which two users are watching. [sent-14, score-0.39]
10 The users of the system change as people leave and enter the room. [sent-15, score-0.217]
11 We know which users are contributing to the reward and that only present users can contribute. [sent-18, score-0.524]
12 In other cases, the multiple sources aren't users, but goals. [sent-19, score-0.235]
13 In order to keep from having to relearn the solution from scratch each time the weighting is changed, we need to keep track of which rewards to attribute to which goals. [sent-23, score-0.29]
14 There is a separate difficulty if the rewards are not designed functions of the state but rather are given by other agents or people in the environment. [sent-24, score-0.387]
15 One user may decide to reward the system with values twice as large as those of another, which should not result in that user having twice the control over the entertainment. [sent-27, score-0.381]
16 If the users of the system know they are training it, they will employ all kinds of reward strategies to try to steer the system to the desired behavior [2]. [sent-29, score-0.44]
17 By keeping track of the sources of the rewards, we will derive an algorithm to overcome these difficulties. [sent-30, score-0.199]
18 1 Related Work The work presented here is related to recent work on multiagent reinforcement learning [1,4,5,7] in that multiple reward signals are present and game theory provides a solution. [sent-32, score-0.477]
19 Work in multiple goals (see [3, 8] as examples) is also related but assumes either that the returns of the goals are to be linearly combined for an overall value function or that only one goal is to be solved at a time. [sent-34, score-0.207]
20 π(x, a) is the policy, or the probability the agent will take action a when observing x. [sent-40, score-0.684]
21 At each time step, the agent receives a set of rewards (one for each source in the environment); r_s(t) is the reward at time t from source s. [sent-41, score-1.661]
22 We use the average reward formulation, and so R_s^π = lim_{n→∞} (1/n) E[r_s(1) + r_s(2) + ... + r_s(n) | π]. [sent-42, score-0.26]
23 R_s^π is the expected return from source s for following policy π. [sent-45, score-0.781]
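As a small illustration of the average-reward return just defined (the function name here is ours, not the paper's), the per-source return is simply the long-run mean of that source's reward stream recorded while following the policy:

    import numpy as np

    # Empirical version of R_s^pi: the mean of source s's rewards recorded
    # while following policy pi for a long run.
    def empirical_return(reward_trace):
        return float(np.mean(reward_trace))

    print(empirical_return([0.0, 1.0, 0.0, 1.0]))  # 0.5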
24 We will also assume that the algorithm knows the set of sources present at each time step. [sent-47, score-0.222]
25 Sources which are not present provide a constant reward, regardless of the state or action, which we will assume to be zero. [sent-48, score-0.133]
26 All sums over sources will be assumed to be taken over only the present sources. [sent-49, score-0.193]
27 The goal is to produce an algorithm that will output a policy based on previous experience and the sources present. [sent-50, score-0.587]
28 Each experience is a sequence of observation, action, and reward triplets for a particular run of a particular policy. [sent-52, score-0.299]
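A minimal sketch of how such an experience could be stored (illustrative names only, not the paper's data structures): each run of a policy is a list of observation/action/reward triplets, with one reward value per source.

    from collections import namedtuple

    # One run of one policy: a sequence of observation/action/reward triplets,
    # where the reward entry is a dict with one value per reward source.
    Step = namedtuple("Step", ["observation", "action", "rewards"])
    Experience = namedtuple("Experience", ["policy_id", "steps"])

    run = Experience(policy_id=0,
                     steps=[Step("x0", "a1", {"s1": 0.0, "s2": 1.0}),
                            Step("x1", "a2", {"s1": 1.0, "s2": 0.0})])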
29 1 Policy Votes If rewards are not directly comparable, we need to find a property of the sources which is comparable and a metric to optimize. [sent-54, score-0.419]
30 We begin by noting that we want to limit the amount of control any given source has over the behavior of the agent. [sent-55, score-0.427]
31 To that end, we construct the policy as the average of a set of votes, one for each source present. [sent-56, score-0.706]
32 The votes for a source must sum to 1 and must all be non-negative (thus giving each source an equal "say" in the agent's policy). [sent-57, score-1.147]
33 We will first consider restricting the rewards from a given source to only affect the votes for that source. [sent-58, score-0.903]
34 The form for the policy is therefore given by equation (1), where for each present source s, Σ_x α_s(x) = 1, α_s(x) ≥ 0 for all x, Σ_a v_s(x, a) = 1 for all x, and v_s(x, a) ≥ 0 for all x and a. [sent-59, score-0.729]
35 We have broken apart the vote from a source into two parts, α and v. [sent-60, score-0.628]
36 α_s(x) is how much effort source s is putting into affecting the policy for observation x. [sent-61, score-0.806]
37 v_s(x, a) is the vote by source s for the policy for observation x. [sent-62, score-1.124]
38 Mathematically this is the same as constructing a single vote v'_s(x, a) = α_s(x) v_s(x, a), but we find α and v to be more interpretable. [sent-63, score-0.252]
39 We have constrained the total effort and vote any one source can apply. [sent-64, score-0.653]
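The exact form of equation (1) is not reproduced in this text, so the following is only a plausible reading (plain Python/NumPy, hypothetical names): the policy is the effort-weighted average of the present sources' votes, with any effort a source does not spend on an observation falling back to a uniform choice over actions so that π(x, ·) remains a distribution.

    import numpy as np

    def mixed_policy(alpha, v, present, n_actions):
        """alpha[s][x]: effort of source s on observation x (sums to 1 over x).
        v[s][x]:     vote of source s at x, a length-n_actions distribution.
        present:     indices of the sources currently present."""
        pi = {}
        for x in alpha[present[0]]:
            efforts = np.array([alpha[s][x] for s in present])
            votes = np.array([v[s][x] for s in present])
            mix = (efforts[:, None] * votes).sum(axis=0)
            # effort not spent on x is treated as an indifferent (uniform) vote
            slack = len(present) - efforts.sum()
            pi[x] = (mix + slack / n_actions) / len(present)
        return pi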
40 Unfortunately, these votes are not quite the correct parameters for our policy. [sent-65, score-0.293]
41 They are not invariant to the other sources present. [sent-66, score-0.17]
42 To illustrate this consider the example of a single state with two actions, two sources, and a learning agent with the voting method from above. [sent-67, score-0.547]
43 If s_1 prefers only a_1 and s_2 likes an equal mix of a_1 and a_2, the agent will learn a vote of (1, 0) for s_1, and s_2 can reward the agent to cause it to learn a vote of (0, 1) for s_2, resulting in a policy of (0.5, 0.5). [sent-68, score-1.702]
44 Whether this is the correct final policy depends on the problem definition. [sent-71, score-0.279]
45 The policy reverts to (0, 1), which is far from s_2's (the only present source's) desired (0.5, 0.5). [sent-73, score-0.302]
46 Clearly, the learned votes for s_2 are meaningless when s_1 is not present. [sent-75, score-0.293]
47 Thus, while the voting scheme does limit the control each present source has over the agent, it does not provide a description of the source's preferences which would allow for the removal or addition (or reweighting) of sources. [sent-76, score-0.573]
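Plugging this two-source, single-observation example into the hypothetical mixed_policy sketch above reproduces the failure being described:

    # s1 votes entirely for a1; s2 votes entirely for a2 to drag the average
    # toward its true preference of an even mix.
    alpha = {1: {"x0": 1.0}, 2: {"x0": 1.0}}
    v = {1: {"x0": [1.0, 0.0]}, 2: {"x0": [0.0, 1.0]}}

    print(mixed_policy(alpha, v, present=[1, 2], n_actions=2)["x0"])  # [0.5, 0.5]
    print(mixed_policy(alpha, v, present=[2], n_actions=2)["x0"])     # [0.0, 1.0], far from s2's desired (0.5, 0.5)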
48 2 Returns as Preferences While rewards (or returns) are not comparable across sources, they are comparable within a source. [sent-78, score-0.259]
49 In particular, we know that if R_s^{π_1} > R_s^{π_2}, then source s prefers policy π_1 to policy π_2. [sent-79, score-1.045]
50 We do not know how to weigh that preference against a different source's preference so an explicit tradeoff is still impossible, but we can limit (using the voting scheme of equation 1) how much one source's preference can override another source's preference. [sent-80, score-0.433]
51 We allow a source's preference for a change to prevail insofar as its votes are sufficient to effect the change in the presence of the other sources' votes. [sent-81, score-0.489]
52 We have a type of general-sum game (letting the sources be the players, in game-theory jargon). [sent-82, score-0.342]
53 The value to source s' of the set of all sources' votes is R_{s'}^π, where π is the function of the votes defined in equation 1. [sent-83, score-1.039]
54 Each source s' would like to set its particular votes, α_{s'}(x) and v_{s'}(x, a), to maximize its value (or return). [sent-84, score-0.472]
55 Our algorithm will set each source's vote in this way, thus ensuring that no source could do better by "lying" about its true reward function. [sent-85, score-0.917]
56 In game theory, a "solution" to such a game is called a Nash Equilibrium [6], a point at which each player (source) is playing (voting) its best response to the other players. [sent-86, score-0.278]
57 At a Nash Equilibrium, no single player can change its play and achieve a gain. [sent-87, score-0.132]
58 Because the votes are real-valued, we are looking for the equilibrium of a continuous game. [sent-88, score-0.396]
59 We will derive a fictitious play algorithm to find an equilibrium for this game. [sent-89, score-0.184]
60 To do that, we will pick a parametric form for R_s^π (the estimate of the return): linear in the KL-divergence between a target vote and π. [sent-92, score-0.201]
61 Just as α_s(x) was the amount of vote source s was putting towards the policy for observation x, β_s(x) is the importance for source s of the policy for observation x. [sent-94, score-1.758]
62 And, while v_s(x, a) was the policy vote for observation x for source s, ρ_s(x, a) is the preferred policy for observation x for source s. [sent-95, score-1.709]
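In other words, the return estimate is assumed to be linear in an importance-weighted KL-divergence between each source's preferred policy and the agent's policy. A hedged sketch of that model (the constant term and the function name are our additions, not the paper's):

    import numpy as np
    from scipy.special import rel_entr  # elementwise p * log(p / q)

    # Estimated return of source s under policy pi, assuming the linear-in-KL
    # form described above: a constant minus the importance-weighted divergence
    # between the source's preferred policy rho_s and the agent's policy pi.
    def estimated_return(beta_s, rho_s, pi, c_s=0.0):
        total = c_s
        for x, importance in beta_s.items():
            kl = rel_entr(np.asarray(rho_s[x]), np.asarray(pi[x])).sum()
            total -= importance * kl
        return total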
63 2 Best Response Algorithm To produce an algorithm for finding a Nash Equilibrium, let us start by deriving an algorithm for finding the best response for source s to a set of votes. [sent-106, score-0.64]
64 We need to find the set of α_s(x) and v_s(x, a) that satisfy the constraints on the votes and maximize equation 2, which is the same as minimizing the corresponding importance-weighted KL-divergence. [sent-107, score-0.393]
65 We can find the best response for source s by iterating between the two steps above. [sent-117, score-0.519]
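A very rough sketch of the resulting outer loop. Note this is closer to iterated best response than to textbook fictitious play (which best-responds to the average of past play); the paper's actual update rule and the inner best_response routine are not given in this text, so both are placeholders here.

    import numpy as np

    def max_change(old, new):
        # largest absolute change across two dicts of (scalar or array) values
        return max(float(np.max(np.abs(np.asarray(old[x]) - np.asarray(new[x])))) for x in old)

    def find_equilibrium(sources, alpha, v, best_response, n_rounds=100, tol=1e-4):
        """Cycle through the sources, replacing each one's (alpha_s, v_s) with a
        best response to the others' current votes, until nothing moves much."""
        for _ in range(n_rounds):
            delta = 0.0
            for s in sources:
                new_alpha_s, new_v_s = best_response(s, alpha, v)
                delta = max(delta, max_change(alpha[s], new_alpha_s), max_change(v[s], new_v_s))
                alpha[s], v[s] = new_alpha_s, new_v_s
            if delta < tol:
                break
        return alpha, v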
66 Figure 2: Transfer of the load-unload solution: plots of the same values as in figure 1 (the per-source votes and the final π(5, a)) but with the left source absent. No additional learning was allowed (the left-side plots are the same); the votes, however, change, and thus so does the final policy. [sent-124, score-0.513]
67 We follow π for a fixed number of time steps and record the average reward for each source. [sent-130, score-0.26]
68 We add these average rewards and the empirical estimate of the policy followed as data to the least-squares estimate of the returns. [sent-131, score-0.462]
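One way to read this step (hedged; the paper's exact estimator is not reproduced in this text): because the model is linear in the KL-divergence, each source's return is, up to a constant, a linear function of the log-policy, so ordinary least squares over the recorded (policy, average reward) pairs can recover weights of the form β_s(x)ρ_s(x, a). Variable names below are illustrative.

    import numpy as np

    def fit_return_model(policies, avg_rewards):
        """policies:    list of dicts x -> action distribution, one per recorded run
        avg_rewards: the matching average rewards observed for one source"""
        xs = sorted(policies[0])
        feats = np.array([np.concatenate([np.log(np.asarray(pi[x]) + 1e-12) for x in xs])
                          for pi in policies])
        A = np.hstack([feats, np.ones((len(policies), 1))])   # last column: intercept
        coef, *_ = np.linalg.lstsq(A, np.asarray(avg_rewards), rcond=None)
        w = coef[:-1].reshape(len(xs), -1)                    # w(x, a) ~ beta_s(x) * rho_s(x, a)
        beta = w.sum(axis=1)
        rho = w / np.maximum(beta[:, None], 1e-12)
        return dict(zip(xs, beta)), dict(zip(xs, rho)), float(coef[-1])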
69 The hidden state is whether the agent is currently carrying cargo. [sent-136, score-0.503]
70 Whenever the agent enters the top state (state 1), cargo is placed on the agent. Whenever the agent arrives in any of the boxed states while carrying cargo, the cargo is removed and the agent receives reward. [sent-137, score-1.965]
71 For each boxed state, there is one reward source that rewards only for deliveries to that state (a reward of 1 for a delivery and 0 for all other time steps). [sent-138, score-1.395]
72 In state 5, the agent has the choice of four actions each of which moves the agent to the corresponding state without error. [sent-139, score-1.053]
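An illustrative sketch of the load-unload domain as just described. The particular set of boxed states and the dynamics outside state 5 are our assumptions; only the hub state 5 offers a real choice.

    class LoadUnload:
        """State 1 loads cargo; delivering it to a boxed state pays that state's
        reward source 1 and every other source 0.  The agent observes the state
        but not whether it is carrying cargo."""

        def __init__(self, boxed_states=(2, 3, 4)):
            self.boxed = boxed_states
            self.state, self.has_cargo = 5, False

        def step(self, action):
            # From the hub (state 5) the chosen action selects the next state;
            # from any other state the agent returns to the hub.
            self.state = action if self.state == 5 else 5
            rewards = {s: 0.0 for s in self.boxed}
            if self.state == 1:
                self.has_cargo = True
            elif self.state in self.boxed and self.has_cargo:
                rewards[self.state] = 1.0
                self.has_cargo = False
            return self.state, rewards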
73 Figure 3: One-way door state diagram: at every state there are two actions (right and left) available to the agent. [sent-140, score-0.714]
74 In states 1, 9, 10, and 15, where there is only a single outgoing edge, both actions follow the same edge. [sent-141, score-0.166]
75 Source 1 rewards entering state 1 whereas source 2 rewards entering state 9. [sent-144, score-1.079]
76 Figure 4: One-way door solution, from left to right: the sources' ideal policies, the votes, and the final agent's policy (panels show β_s(x), α_s(x), ρ_s(x, right), v_s(x, right), and π(x, right) for sources s_1 and s_2). [sent-162, score-0.132]
77 Light bars are for states for which both actions lead to the same state. [sent-163, score-0.143]
78 Since the agent can observe neither whether it has cargo nor its history, the optimal policy for state 5 is stochastic. [sent-164, score-0.495]
79 The algorithm set all α- and β-values to 0 for states other than state 5. [sent-165, score-0.202]
80 We ran for 300 epochs of 200 iterations, by which point the algorithm consistently settled on the solution shown in figure 1. [sent-169, score-0.152]
81 For each source, the algorithm found the best solution of randomly picking between the load state and the source's delivery state (as shown by the ρ-values). [sent-170, score-0.43]
82 The votes are heavily weighted towards the delivery actions to overcome the other sources' preferences, resulting in an approximately uniform policy. [sent-171, score-0.536]
83 The important point is that, without additional learning, the policy can be changed if the left source leaves. [sent-172, score-0.774]
84 The learned α- and ρ-values are kept the same, but the Nash Equilibrium is different, resulting in the policy in figure 2. [sent-173, score-0.279]
85 From each state the agent can move to the left or right except in states 1, 9, 10, and 15 where there is only one possible action. [sent-176, score-0.671]
86 Once the agent enters states 1 or 9, it may not pass back through except by going around through state 5. [sent-178, score-0.568]
87 Source 1 gives reward when the agent passes through state 1. [sent-179, score-0.759]
88 Source 2 gives reward when the agent passes through state 9. [sent-180, score-0.759]
89 Source 1 considers the left-side states (2-5 and 11-12) the most important while source 2 considers the right-side states (5-8 and 13-14) the most important. [sent-187, score-0.553]
90 The ideal policies captured by the ρ-values show that source 1 wants the agent to move left and source 2 wants the agent to move right for the upper states (2-8), while the sources agree that for the lower states (11-14) the agent should move towards state 5. [sent-188, score-2.802]
91 Both sources spend most of their vote on state 5, the state they both feel is important and on which they disagree. [sent-190, score-0.646]
92 On the other states (states for which only one source has a strong opinion or on which they agree), they do not need to spend much of their vote. [sent-191, score-0.545]
93 The resulting policy is the natural one: in state 5, the agent randomly picks a direction, after which it moves around the chosen loop quickly to return to state 5. [sent-192, score-1.356]
94 Just as in the load-unload problem, if we remove one source, the agent automatically adapts to the ideal policy for the remaining source (with only one source, s_0, present, π(x, a) = ρ_{s_0}(x, a)). [sent-193, score-1.137]
95 Estimating the optimal policies and then taking the mixture of these two policies would produce a far worse result. [sent-194, score-0.186]
96 For states 2-8, both sources would have differing opinions and the mixture model would produce a uniform policy in those states; the agent would spend most of its time near state 5. [sent-195, score-1.145]
97 Constructing a reward signal that is the sum of the sources' rewards does not lead to a good solution either. [sent-196, score-0.521]
98 The agent will find that circling either the left or right loop is optimal and will have no incentive to ever travel along the other loop. [sent-197, score-0.506]
99 5 Conclusions It is difficult to conceive of a method for providing a single reward signal that would result in the solution shown in figure 4 and still automatically change when one of the reward sources was removed. [sent-198, score-0.885]
100 However, we would like to be able to extend the load-unload result to the situation where the agent has a memory bit. [sent-201, score-0.387]
wordName wordTfidf (topN-words)
[('source', 0.427), ('agent', 0.364), ('votes', 0.293), ('policy', 0.279), ('reward', 0.26), ('vote', 0.201), ('nash', 0.191), ('rewards', 0.183), ('sources', 0.17), ('vs', 0.169), ('ps', 0.121), ('state', 0.11), ('preference', 0.11), ('users', 0.107), ('cargo', 0.106), ('equilibrium', 0.103), ('delivery', 0.091), ('game', 0.086), ('reinforcement', 0.083), ('actions', 0.08), ('return', 0.075), ('multiple', 0.065), ('policies', 0.064), ('boxed', 0.064), ('states', 0.063), ('ts', 0.058), ('solution', 0.055), ('spend', 0.055), ('returns', 0.05), ('bs', 0.05), ('door', 0.05), ('preferences', 0.05), ('voting', 0.05), ('agents', 0.05), ('move', 0.049), ('observation', 0.048), ('goals', 0.046), ('people', 0.044), ('singh', 0.043), ('left', 0.043), ('change', 0.043), ('contracts', 0.042), ('isbell', 0.042), ('player', 0.042), ('sbottom', 0.042), ('shelton', 0.042), ('right', 0.042), ('action', 0.041), ('kearns', 0.039), ('experience', 0.039), ('ideal', 0.039), ('old', 0.039), ('comparable', 0.038), ('ran', 0.037), ('balancing', 0.037), ('entertainment', 0.037), ('multiagent', 0.037), ('produce', 0.035), ('best', 0.035), ('prefers', 0.033), ('entering', 0.033), ('scalar', 0.031), ('epochs', 0.031), ('wants', 0.031), ('agree', 0.031), ('bright', 0.031), ('enters', 0.031), ('algorithm', 0.029), ('response', 0.029), ('loop', 0.029), ('carrying', 0.029), ('find', 0.028), ('finding', 0.028), ('automatically', 0.028), ('optimize', 0.028), ('putting', 0.027), ('whenever', 0.027), ('know', 0.027), ('equation', 0.026), ('keep', 0.026), ('letting', 0.026), ('effort', 0.025), ('passes', 0.025), ('moves', 0.025), ('twice', 0.025), ('changed', 0.025), ('constraints', 0.024), ('play', 0.024), ('impose', 0.024), ('user', 0.024), ('present', 0.023), ('gradient', 0.023), ('would', 0.023), ('diagram', 0.023), ('ap', 0.023), ('system', 0.023), ('signal', 0.023), ('single', 0.023), ('maximize', 0.022), ('towards', 0.022)]
simIndex simValue paperId paperTitle
same-paper 1 0.99999958 28 nips-2000-Balancing Multiple Sources of Reward in Reinforcement Learning
Author: Christian R. Shelton
Abstract: For many problems which would be natural for reinforcement learning, the reward signal is not a single scalar value but has multiple scalar components. Examples of such problems include agents with multiple goals and agents with multiple users. Creating a single reward value by combining the multiple components can throw away vital information and can lead to incorrect solutions. We describe the multiple reward source problem and discuss the problems with applying traditional reinforcement learning. We then present a new algorithm for finding a solution and results on simulated environments.
2 0.26115081 142 nips-2000-Using Free Energies to Represent Q-values in a Multiagent Reinforcement Learning Task
Author: Brian Sallans, Geoffrey E. Hinton
Abstract: The problem of reinforcement learning in large factored Markov decision processes is explored. The Q-value of a state-action pair is approximated by the free energy of a product of experts network. Network parameters are learned on-line using a modified SARSA algorithm which minimizes the inconsistency of the Q-values of consecutive state-action pairs. Actions are chosen based on the current value estimates by fixing the current state and sampling actions from the network using Gibbs sampling. The algorithm is tested on a co-operative multi-agent task. The product of experts model is found to perform comparably to table-based Q-learning for small instances of the task, and continues to perform well when the problem becomes too large for a table-based representation.
3 0.18983559 112 nips-2000-Reinforcement Learning with Function Approximation Converges to a Region
Author: Geoffrey J. Gordon
Abstract: Many algorithms for approximate reinforcement learning are not known to converge. In fact, there are counterexamples showing that the adjustable weights in some algorithms may oscillate within a region rather than converging to a point. This paper shows that, for two popular algorithms, such oscillation is the worst that can happen: the weights cannot diverge, but instead must converge to a bounded region. The algorithms are SARSA(0) and V(0); the latter algorithm was used in the well-known TD-Gammon program.
4 0.17212193 105 nips-2000-Programmable Reinforcement Learning Agents
Author: David Andre, Stuart J. Russell
Abstract: We present an expressive agent design language for reinforcement learning that allows the user to constrain the policies considered by the learning process. The language includes standard features such as parameterized subroutines, temporary interrupts, aborts, and memory variables, but also allows for unspecified choices in the agent program. For learning that which isn't specified, we present provably convergent learning algorithms. We demonstrate by example that agent programs written in the language are concise as well as modular. This facilitates state abstraction and the transferability of learned skills.
5 0.17152911 26 nips-2000-Automated State Abstraction for Options using the U-Tree Algorithm
Author: Anders Jonsson, Andrew G. Barto
Abstract: Learning a complex task can be significantly facilitated by defining a hierarchy of subtasks. An agent can learn to choose between various temporally abstract actions, each solving an assigned subtask, to accomplish the overall task. In this paper, we study hierarchical learning using the framework of options. We argue that to take full advantage of hierarchical structure, one should perform option-specific state abstraction, and that if this is to scale to larger tasks, state abstraction should be automated. We adapt McCallum's U-Tree algorithm to automatically build option-specific representations of the state feature space, and we illustrate the resulting algorithm using a simple hierarchical task. Results suggest that automated option-specific state abstraction is an attractive approach to making hierarchical learning systems more effective.
6 0.16464728 63 nips-2000-Hierarchical Memory-Based Reinforcement Learning
7 0.16463351 46 nips-2000-Ensemble Learning and Linear Response Theory for ICA
8 0.14709416 96 nips-2000-One Microphone Source Separation
9 0.12856671 43 nips-2000-Dopamine Bonuses
10 0.12522967 48 nips-2000-Exact Solutions to Time-Dependent MDPs
11 0.11210589 73 nips-2000-Kernel-Based Reinforcement Learning in Average-Cost Problems: An Application to Optimal Portfolio Choice
12 0.10588822 1 nips-2000-APRICODD: Approximate Policy Construction Using Decision Diagrams
13 0.1021762 113 nips-2000-Robust Reinforcement Learning
14 0.096393175 39 nips-2000-Decomposition of Reinforcement Learning for Admission Control of Self-Similar Call Arrival Processes
15 0.068812408 123 nips-2000-Speech Denoising and Dereverberation Using Probabilistic Models
16 0.058378749 65 nips-2000-Higher-Order Statistical Properties Arising from the Non-Stationarity of Natural Signals
17 0.048223864 126 nips-2000-Stagewise Processing in Error-correcting Codes and Image Restoration
18 0.045057222 49 nips-2000-Explaining Away in Weight Space
19 0.039606396 78 nips-2000-Learning Joint Statistical Models for Audio-Visual Fusion and Segregation
20 0.037750863 101 nips-2000-Place Cells and Spatial Navigation Based on 2D Visual Feature Extraction, Path Integration, and Reinforcement Learning
topicId topicWeight
[(0, 0.182), (1, -0.104), (2, 0.162), (3, -0.367), (4, -0.315), (5, -0.023), (6, -0.132), (7, -0.011), (8, -0.026), (9, -0.064), (10, -0.031), (11, -0.065), (12, -0.121), (13, 0.003), (14, -0.071), (15, -0.013), (16, -0.03), (17, -0.084), (18, -0.054), (19, -0.074), (20, -0.06), (21, 0.078), (22, 0.074), (23, -0.04), (24, 0.121), (25, 0.012), (26, 0.037), (27, -0.045), (28, -0.133), (29, 0.201), (30, 0.029), (31, -0.025), (32, 0.016), (33, 0.034), (34, -0.01), (35, 0.015), (36, -0.006), (37, -0.043), (38, 0.004), (39, -0.055), (40, -0.026), (41, -0.019), (42, -0.101), (43, 0.016), (44, -0.078), (45, -0.016), (46, -0.038), (47, -0.007), (48, 0.059), (49, 0.021)]
simIndex simValue paperId paperTitle
same-paper 1 0.98471802 28 nips-2000-Balancing Multiple Sources of Reward in Reinforcement Learning
Author: Christian R. Shelton
Abstract: For many problems which would be natural for reinforcement learning, the reward signal is not a single scalar value but has multiple scalar components. Examples of such problems include agents with multiple goals and agents with multiple users. Creating a single reward value by combining the multiple components can throw away vital information and can lead to incorrect solutions. We describe the multiple reward source problem and discuss the problems with applying traditional reinforcement learning. We then present a new algorithm for finding a solution and results on simulated environments.
2 0.72798097 105 nips-2000-Programmable Reinforcement Learning Agents
Author: David Andre, Stuart J. Russell
Abstract: We present an expressive agent design language for reinforcement learning that allows the user to constrain the policies considered by the learning process. The language includes standard features such as parameterized subroutines, temporary interrupts, aborts, and memory variables, but also allows for unspecified choices in the agent program. For learning that which isn't specified, we present provably convergent learning algorithms. We demonstrate by example that agent programs written in the language are concise as well as modular. This facilitates state abstraction and the transferability of learned skills.
3 0.66716206 142 nips-2000-Using Free Energies to Represent Q-values in a Multiagent Reinforcement Learning Task
Author: Brian Sallans, Geoffrey E. Hinton
Abstract: The problem of reinforcement learning in large factored Markov decision processes is explored. The Q-value of a state-action pair is approximated by the free energy of a product of experts network. Network parameters are learned on-line using a modified SARSA algorithm which minimizes the inconsistency of the Q-values of consecutive state-action pairs. Actions are chosen based on the current value estimates by fixing the current state and sampling actions from the network using Gibbs sampling. The algorithm is tested on a co-operative multi-agent task. The product of experts model is found to perform comparably to table-based Q-learning for small instances of the task, and continues to perform well when the problem becomes too large for a table-based representation.
4 0.65387917 112 nips-2000-Reinforcement Learning with Function Approximation Converges to a Region
Author: Geoffrey J. Gordon
Abstract: Many algorithms for approximate reinforcement learning are not known to converge. In fact, there are counterexamples showing that the adjustable weights in some algorithms may oscillate within a region rather than converging to a point. This paper shows that, for two popular algorithms, such oscillation is the worst that can happen: the weights cannot diverge, but instead must converge to a bounded region. The algorithms are SARSA(0) and V(0); the latter algorithm was used in the well-known TD-Gammon program.
5 0.59822589 63 nips-2000-Hierarchical Memory-Based Reinforcement Learning
Author: Natalia Hernandez-Gardiol, Sridhar Mahadevan
Abstract: A key challenge for reinforcement learning is scaling up to large partially observable domains. In this paper, we show how a hierarchy of behaviors can be used to create and select among variable length short-term memories appropriate for a task. At higher levels in the hierarchy, the agent abstracts over lower-level details and looks back over a variable number of high-level decisions in time. We formalize this idea in a framework called Hierarchical Suffix Memory (HSM). HSM uses a memory-based SMDP learning method to rapidly propagate delayed reward across long decision sequences. We describe a detailed experimental study comparing memory vs. hierarchy using the HSM framework on a realistic corridor navigation task. 1
6 0.52771157 46 nips-2000-Ensemble Learning and Linear Response Theory for ICA
7 0.51558387 26 nips-2000-Automated State Abstraction for Options using the U-Tree Algorithm
8 0.50827861 48 nips-2000-Exact Solutions to Time-Dependent MDPs
9 0.47883508 96 nips-2000-One Microphone Source Separation
10 0.45137563 43 nips-2000-Dopamine Bonuses
11 0.42222598 1 nips-2000-APRICODD: Approximate Policy Construction Using Decision Diagrams
12 0.40481439 73 nips-2000-Kernel-Based Reinforcement Learning in Average-Cost Problems: An Application to Optimal Portfolio Choice
13 0.38681483 113 nips-2000-Robust Reinforcement Learning
14 0.29968926 39 nips-2000-Decomposition of Reinforcement Learning for Admission Control of Self-Similar Call Arrival Processes
15 0.2356317 126 nips-2000-Stagewise Processing in Error-correcting Codes and Image Restoration
16 0.2192052 123 nips-2000-Speech Denoising and Dereverberation Using Probabilistic Models
17 0.21564388 138 nips-2000-The Use of Classifiers in Sequential Inference
18 0.20248318 44 nips-2000-Efficient Learning of Linear Perceptrons
19 0.19911346 136 nips-2000-The Missing Link - A Probabilistic Model of Document Content and Hypertext Connectivity
20 0.18630831 125 nips-2000-Stability and Noise in Biochemical Switches
topicId topicWeight
[(10, 0.034), (17, 0.085), (32, 0.019), (33, 0.054), (36, 0.016), (55, 0.026), (58, 0.275), (62, 0.175), (65, 0.033), (67, 0.045), (76, 0.024), (79, 0.028), (81, 0.022), (90, 0.028), (97, 0.024)]
simIndex simValue paperId paperTitle
same-paper 1 0.85429585 28 nips-2000-Balancing Multiple Sources of Reward in Reinforcement Learning
Author: Christian R. Shelton
Abstract: For many problems which would be natural for reinforcement learning, the reward signal is not a single scalar value but has multiple scalar components. Examples of such problems include agents with multiple goals and agents with multiple users. Creating a single reward value by combining the multiple components can throw away vital information and can lead to incorrect solutions. We describe the multiple reward source problem and discuss the problems with applying traditional reinforcement learning. We then present a new algorithm for finding a solution and results on simulated environments.
2 0.62876314 4 nips-2000-A Linear Programming Approach to Novelty Detection
Author: Colin Campbell, Kristin P. Bennett
Abstract: Novelty detection involves modeling the normal behaviour of a system hence enabling detection of any divergence from normality. It has potential applications in many areas such as detection of machine damage or highlighting abnormal features in medical data. One approach is to build a hypothesis estimating the support of the normal data i.e. constructing a function which is positive in the region where the data is located and negative elsewhere. Recently kernel methods have been proposed for estimating the support of a distribution and they have performed well in practice - training involves solution of a quadratic programming problem. In this paper we propose a simpler kernel method for estimating the support based on linear programming. The method is easy to implement and can learn large datasets rapidly. We demonstrate the method on medical and fault detection datasets.
3 0.60237211 112 nips-2000-Reinforcement Learning with Function Approximation Converges to a Region
Author: Geoffrey J. Gordon
Abstract: Many algorithms for approximate reinforcement learning are not known to converge. In fact, there are counterexamples showing that the adjustable weights in some algorithms may oscillate within a region rather than converging to a point. This paper shows that, for two popular algorithms, such oscillation is the worst that can happen: the weights cannot diverge, but instead must converge to a bounded region. The algorithms are SARSA(0) and V(0); the latter algorithm was used in the well-known TD-Gammon program.
4 0.59458613 86 nips-2000-Model Complexity, Goodness of Fit and Diminishing Returns
Author: Igor V. Cadez, Padhraic Smyth
Abstract: We investigate a general characteristic of the trade-off in learning problems between goodness-of-fit and model complexity. Specifically we characterize a general class of learning problems where the goodness-of-fit function can be shown to be convex within firstorder as a function of model complexity. This general property of
5 0.57218498 63 nips-2000-Hierarchical Memory-Based Reinforcement Learning
Author: Natalia Hernandez-Gardiol, Sridhar Mahadevan
Abstract: A key challenge for reinforcement learning is scaling up to large partially observable domains. In this paper, we show how a hierarchy of behaviors can be used to create and select among variable length short-term memories appropriate for a task. At higher levels in the hierarchy, the agent abstracts over lower-level details and looks back over a variable number of high-level decisions in time. We formalize this idea in a framework called Hierarchical Suffix Memory (HSM). HSM uses a memory-based SMDP learning method to rapidly propagate delayed reward across long decision sequences. We describe a detailed experimental study comparing memory vs. hierarchy using the HSM framework on a realistic corridor navigation task. 1
6 0.5623849 142 nips-2000-Using Free Energies to Represent Q-values in a Multiagent Reinforcement Learning Task
7 0.56234801 26 nips-2000-Automated State Abstraction for Options using the U-Tree Algorithm
8 0.52927941 1 nips-2000-APRICODD: Approximate Policy Construction Using Decision Diagrams
9 0.51535678 105 nips-2000-Programmable Reinforcement Learning Agents
10 0.51071191 80 nips-2000-Learning Switching Linear Models of Human Motion
11 0.49281824 98 nips-2000-Partially Observable SDE Models for Image Sequence Recognition Tasks
12 0.48875564 106 nips-2000-Propagation Algorithms for Variational Bayesian Learning
13 0.48764876 92 nips-2000-Occam's Razor
14 0.48062265 48 nips-2000-Exact Solutions to Time-Dependent MDPs
15 0.47717634 69 nips-2000-Incorporating Second-Order Functional Knowledge for Better Option Pricing
16 0.47662416 139 nips-2000-The Use of MDL to Select among Computational Models of Cognition
17 0.47576258 138 nips-2000-The Use of Classifiers in Sequential Inference
18 0.47045237 108 nips-2000-Recognizing Hand-written Digits Using Hierarchical Products of Experts
19 0.46248266 43 nips-2000-Dopamine Bonuses
20 0.46182352 22 nips-2000-Algorithms for Non-negative Matrix Factorization