nips nips2000 nips2000-48 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Justin A. Boyan, Michael L. Littman
Abstract: We describe an extension of the Markov decision process model in which a continuous time dimension is included in the state space. This allows for the representation and exact solution of a wide range of problems in which transitions or rewards vary over time. We examine problems based on route planning with public transportation and telescope observation scheduling. 1
Reference: text
sentIndex sentText sentNum sentScore
1 Abstract: We describe an extension of the Markov decision process model in which a continuous time dimension is included in the state space. [sent-7, score-0.269]
2 This allows for the representation and exact solution of a wide range of problems in which transitions or rewards vary over time. [sent-8, score-0.241]
3 We examine problems based on route planning with public transportation and telescope observation scheduling. [sent-9, score-0.486]
4 1 Introduction Imagine trying to plan a route from home to work that minimizes expected time. [sent-10, score-0.15]
5 One approach is to use a tool such as "Mapquest", which annotates maps with information about estimated driving time, then applies a standard graph-search algorithm to produce a shortest route. [sent-11, score-0.239]
6 Even if driving times are stochastic, the annotations can be expected times, so this presents no additional challenge. [sent-12, score-0.337]
7 However, consider what happens if we would like to include public transportation in our route planning. [sent-13, score-0.217]
8 Buses, trains, and subways vary in their expected travel time according to the time of day: buses and subways come more frequently during rush hour; trains leave on or close to scheduled departure times. [sent-14, score-0.676]
9 In fact, even highway driving times vary with time of day, with heavier traffic and longer travel times during rush hour. [sent-15, score-1.003]
10 To formalize this problem, we require a model that includes both stochastic actions, as in a Markov decision process (MDP), and actions with time-dependent stochastic durations. [sent-16, score-0.374]
11 There are a number of models that include some of these attributes: • Directed graphs with shortest path algorithms [2]: State transitions are deterministic; action durations are time independent (deterministic or stochastic). [sent-17, score-0.505]
2 Stochastic Time-Dependent Networks (STDNs) [6]: State transitions are deterministic; action durations are stochastic and can be time dependent. [sent-18, score-0.599]
3 Markov decision processes (MDPs) [5]: State transitions are stochastic; action durations are deterministic. [sent-19, score-0.414]
4 Semi-Markov decision processes (SMDPs) [5]: State transitions are stochastic; action durations are stochastic, but not time dependent. [sent-20, score-0.503]
15 In this paper, we introduce the Time-Dependent MDP (TMDP) model, which generalizes all these models by including both stochastic state transitions and stochastic, time-dependent action durations. [sent-22, score-0.509]
16 At a high level, a TMDP is a special continuous-state MDP [5; 4] consisting of states with both a discrete component and a real-valued time component: (x, t) ∈ X × ℝ. [sent-23, score-0.135]
17 With absolute time as part of the state space, we can model a rich set of domain objectives including minimizing expected time, maximizing the probability of making a deadline, or maximizing the dollar reward of a path subject to a time deadline. [sent-24, score-0.575]
18 In fact, using the time dimension to represent other one-dimensional quantities, TMDPs support planning with non-linear utilities [3] (e.g., ...). [sent-25, score-0.24]
19 We define TMDPs and express their Bellman equations in a functional form that gives, at each state x, the one-step lookahead value at (x, t) for all times in parallel (Section 2). [sent-28, score-0.3]
20 We use the term time-value function to denote a mapping from real-valued times to real-valued future reward. [sent-29, score-0.094]
21 With appropriate restrictions on the form of the stochastic state-time transition function and reward function, we guarantee that the optimal time-value function at each state is a piecewise linear function of time, which can be represented exactly and computed by value iteration (Section 3). [sent-30, score-0.842]
22 We conclude with empirical results on two domains (Section 4). [sent-31, score-0.039]
23 [Figure 1: An illustrative route-planning example TMDP. (Figure residue removed: panels for "Drive on backroad", "Highway - off peak", and "Highway - rush hour", showing likelihood functions L and relative (REL) duration distributions P over time.)] [sent-37, score-0.299]
24 From here, two actions are available: a1, taking the 8am train (a scheduled action); and a2, driving to work via highway then backroads (may be done at any time). [sent-40, score-0.576]
25 Action a1 has two possible outcomes, represented by μ1 and μ2. Outcome μ1 ("Missed the 8am train") is active after 7:50am, whereas outcome μ2 ("Caught the train") is active until 7:50am; this is governed by the likelihood functions L1 and L2 in the model. [sent-41, score-0.162]
26 These outcomes cause deterministic transitions to states x1 and x3, respectively, but take varying amounts of time. [sent-42, score-0.226]
27 In the case of catching the train (μ2), the distribution is absolute: the arrival time (shown in P2) has mean 9:45am no matter what time before 7:50am the action was initiated. [sent-44, score-0.563]
28 (Boarding the train earlier does not allow us to arrive at our destination earlier!) [sent-45, score-0.106]
29 However, missing the train and returning to x1 has a relative distribution: it deterministically takes 15 minutes from our starting time (distribution P1) to return home. [sent-46, score-0.229]
30 Outcome μ3 ("Highway - rush hour") is active with probability 1 during the interval 8am-9am, and with smaller probability outside that interval, as shown by L3. [sent-48, score-0.151]
31 Duration distributions P3 and P4, both relative to the initiation time, show that driving times during rush hour are on average longer than those off peak. [sent-50, score-0.548]
32 The corresponding outcome μ5 ("Drive on backroad") is insensitive to time of day and results in a deterministic transition to state x3 with duration 1 hour. [sent-53, score-0.563]
33 The reward function for arriving at work is +1 before 11am and falls linearly to zero between 11am and noon. [sent-54, score-0.184]
34 The solution to a TMDP such as this is a policy mapping state-time pairs (x, t) to actions so as to maximize expected future reward. [sent-55, score-0.218]
35 As is standard in MDP methods, our approach finds this policy via the value function V*. [sent-56, score-0.122]
36 We represent the value function of a TMDP as a set of time-value functions, one per state: Vi(t) gives the optimal expected future reward from state xi at time t. [sent-57, score-0.471]
37 In our example of Figure 1, the time-value functions for x3 and x2 are shown as V3 and V2. [sent-58, score-0.053]
38 Because of the deterministic one-hour delay of μ5, V2 is identical to V3 shifted back one hour. [sent-59, score-0.071]
39 This wholesale shifting of time-value functions is exploited by our solution algorithm. [sent-60, score-0.088]
40 The model also includes a "dawdling" reward rate K(x, t): the agent can remain in a state for as long as desired, accumulating reward at rate K(x, t) per unit time, before choosing an action. [sent-62, score-0.316]
41 This makes it possible, for example, for an agent to wait at home for rush hour to end before driving to work. [sent-63, score-0.543]
42 We can define the optimal value function for a TMDP in terms of these quantities with the following Bellman equations: [sent-65, score-0.167]
43 $V(x,t) = \sup_{t' \ge t} \bigl( \int_t^{t'} K(x,s)\,ds + \overline{V}(x,t') \bigr)$ (value function, allowing dawdling); $\overline{V}(x,t) = \max_{a \in A} Q(x,t,a)$ (value function, immediate action); $Q(x,t,a) = \sum_{\mu} L(\mu \mid x,a,t)\, U(\mu,t)$ (expected Q value over outcomes). [sent-66, score-0.208]
44 These equations follow straightforwardly from viewing the TMDP as an undiscounted continuous-time MDP. [sent-68, score-0.049]
45 Note that the calculations of U(μ, t) are convolutions of the result-time pdf P with the lookahead value R + V. [sent-69, score-0.227]
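The preceding sentence describes U(μ, t) as a convolution of the result-time pdf with the lookahead value R + V; a hedged reconstruction of that quantity (our notation, not verbatim from the paper; x'μ denotes the resulting state of outcome μ, and ABS/REL distinguish absolute arrival-time distributions from relative duration distributions) is:

```latex
% Reconstructed from the textual description; notation is ours.
\[
  U(\mu, t) =
  \begin{cases}
    \displaystyle \int_{-\infty}^{\infty} P_\mu(t')
      \left[ R(\mu, t, t' - t) + V(x'_\mu, t') \right] dt'
      & \text{(ABS: absolute arrival-time distribution)} \\[10pt]
    \displaystyle \int_{-\infty}^{\infty} P_\mu(\delta)
      \left[ R(\mu, t, \delta) + V(x'_\mu, t + \delta) \right] d\delta
      & \text{(REL: relative duration distribution)}
  \end{cases}
\]
```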
46 In the next section, we discuss a concrete way of representing and manipulating the continuous quantities that appear in these equations. [sent-70, score-0.07]
47 3 Model with piecewise linear value functions In the general model, the time-value functions for each state can be arbitrarily complex and therefore impossible to represent exactly. [sent-71, score-0.611]
48 In this section, we show how to restrict the model to allow value functions to be manipulated exactly. [sent-72, score-0.137]
49 For each state, we represent its time-value function Vi(t) as a piecewise linear function of time. [sent-73, score-0.349]
50 Vi(t) is thus represented by a data structure consisting of a set of distinct times called breakpoints and, for each pair of consecutive breakpoints, the equation of a line defined over the corresponding interval. [sent-74, score-0.144]
51 Linear time-value functions provide an exact representation for minimum-time problems. [sent-76, score-0.194]
52 Piecewise linear time-value functions provide closure under the "max" operator. [sent-77, score-0.103]
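To make the representation and the closure property concrete, here is a minimal Python sketch. It is not the authors' implementation: the class name PWL, the storage of values at breakpoints (equivalent to storing a line per interval), and the flat extension outside the outermost breakpoints are our own choices.

```python
import bisect

class PWL:
    """Piecewise linear function stored by breakpoints and the value at each
    breakpoint; linear between consecutive breakpoints, extended flat outside
    the outermost ones (a simplifying assumption of this sketch)."""

    def __init__(self, ts, vs):
        assert len(ts) == len(vs) and all(a < b for a, b in zip(ts, ts[1:]))
        self.ts, self.vs = list(ts), list(vs)

    def __call__(self, t):
        ts, vs = self.ts, self.vs
        if t <= ts[0]:
            return vs[0]
        if t >= ts[-1]:
            return vs[-1]
        i = bisect.bisect_right(ts, t) - 1
        w = (t - ts[i]) / (ts[i + 1] - ts[i])
        return (1 - w) * vs[i] + w * vs[i + 1]

def pwl_max(f, g):
    """Pointwise max of two PWL functions.  The union of their breakpoints,
    plus any crossing point inside an interval, suffices: on each resulting
    interval one function dominates, so the max is again piecewise linear."""
    ts = sorted(set(f.ts) | set(g.ts))
    knots = set(ts)
    for a, b in zip(ts, ts[1:]):
        d0, d1 = f(a) - g(a), f(b) - g(b)
        if d0 * d1 < 0:                      # f and g cross strictly inside (a, b)
            knots.add(a + (b - a) * d0 / (d0 - d1))
    out = sorted(knots)
    return PWL(out, [max(f(t), g(t)) for t in out])

# Example from Figure 1 (times in hours): arriving at work is worth +1 before
# 11am, falling linearly to 0 by noon; V2 is V3 shifted back by the 1-hour drive.
v3 = PWL([11.0, 12.0], [1.0, 0.0])
v2 = PWL([t - 1.0 for t in v3.ts], v3.vs)
```

The crossing-point bookkeeping in pwl_max is exactly the closure argument used by the "max over actions" step of the Bellman backup.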
53 Rewards must be constrained to be piecewise linear functions of start and arrival times and action durations. [sent-78, score-0.742]
54 We write R(μ, t, δ) = Rs(μ, t) + Ra(μ, t + δ) + Rd(μ, δ), where Rs, Ra, and Rd are piecewise linear functions of start time, arrival time, and duration, respectively. [sent-79, score-0.461]
55 In addition, the dawdling reward K and the outcome probability function L must be piecewise constant. [sent-80, score-0.634]
56 The most significant restriction needed for exact computation is that arrival and duration pdfs be discrete. [sent-81, score-0.31]
57 Convolving a continuous duration pdf (e.g., a uniform distribution) with a piecewise linear time-value function would in general produce a piecewise quadratic time-value function; further convolutions increase the degree with each iteration of value iteration. [sent-85, score-0.826]
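With a discrete relative-duration pdf placing probability p_k on duration δ_k, that convolution instead collapses to a finite weighted sum of time-shifted copies of V (our notation, following the reward decomposition above), which is why piecewise linearity is preserved:

```latex
% Discrete-duration specialization (our notation): each summand is piecewise
% linear in t, so U, Q, and V stay piecewise linear through value iteration.
\[
  U(\mu, t) = \sum_k p_k \left[ R(\mu, t, \delta_k) + V(x'_\mu,\, t + \delta_k) \right].
\]
```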
58 The running time of each operation is linear in the representation size of the time-value functions involved. [sent-88, score-0.224]
59 Seeding the process with an initial piecewise linear time-value function, we can carry out value iteration until convergence. [sent-89, score-0.412]
60 In general, the running time from one iteration to the next can increase, as the number of linear "pieces" being manipulated grows; however, the representations grow only as complex as necessary to represent the value function V exactly. [sent-90, score-0.336]
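The sketch below (again illustrative, building on the PWL and pwl_max helpers above) shows one such sweep for a simplified TMDP interface that we assume for the example: only relative-duration outcomes with discrete pdfs, a constant per-outcome reward, the likelihood L folded into a single outcome probability, and no dawdling.

```python
def pwl_shift(f, delta):
    # g(t) = f(t + delta): shift breakpoints left by delta, keep the values.
    return PWL([t - delta for t in f.ts], list(f.vs))

def pwl_weighted_sum(terms):
    # terms: list of (weight, PWL).  A weighted sum of piecewise linear
    # functions is piecewise linear, with breakpoints in the union of inputs'.
    ts = sorted(set().union(*(set(f.ts) for _, f in terms)))
    return PWL(ts, [sum(w * f(t) for w, f in terms) for t in ts])

def value_iteration_sweep(tmdp, V):
    """One Bellman backup per discrete state.  Hypothetical interface:
    tmdp.outcomes[x][a] is a list of (prob, next_state, duration_pdf, reward)
    tuples, where duration_pdf is a list of (p_k, delta_k).  Assumes every
    state has at least one action and every action at least one outcome."""
    newV = {}
    for x in tmdp.states:
        q_funcs = []
        for a in tmdp.actions[x]:
            terms = []
            for prob, x2, duration_pdf, reward in tmdp.outcomes[x][a]:
                for p_k, delta_k in duration_pdf:
                    shifted = pwl_shift(V[x2], delta_k)        # V(x', t + delta_k)
                    terms.append((prob * p_k,
                                  PWL(shifted.ts, [reward + v for v in shifted.vs])))
            q_funcs.append(pwl_weighted_sum(terms))            # Q(x, ., a)
        best = q_funcs[0]
        for q in q_funcs[1:]:
            best = pwl_max(best, q)                            # max over actions
        newV[x] = best
    return newV
```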
61 4 Experimental domains We present results on two domains: transportation planning and telescope scheduling. [sent-91, score-0.409]
62 For comparison, we also implemented the natural alternative to the piecewise-linear technique: discretizing the time dimension and solving the problem as a standard MDP. [sent-92, score-0.128]
63 Since this paper's focus is on exact computations, we chose a discretization level corresponding to the resolution necessary for exact solution by the MDP at its grid points. [sent-94, score-0.307]
64 An advantage of the discretized MDP is that it is by construction acyclic, so it can be solved by just one sweep of standard value iteration, working backwards in time. [sent-95, score-0.048]
65 The TMDP's advantage is that it directly manipulates entire linear segments of the time-value functions. [sent-96, score-0.036]
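For concreteness, a minimal sketch of the discretized baseline (ours, not the authors' code; the transitions interface is hypothetical): with time split into bins and every action advancing time by at least one bin, the MDP is acyclic in time and a single backward sweep suffices.

```python
import numpy as np

def solve_time_discretized(num_states, num_bins, transitions, terminal_value):
    """Backward induction for the time-discretized comparison MDP.
    transitions[x] is a list of actions; each action is a list of
    (prob, next_state, dt_bins, reward) outcomes with dt_bins >= 1, so every
    action strictly advances time.  Hypothetical interface, illustration only;
    assumes every state has at least one action."""
    V = np.zeros((num_states, num_bins + 1))
    V[:, num_bins] = terminal_value              # value at the final time bin
    for t in range(num_bins - 1, -1, -1):        # one sweep, backwards in time
        for x in range(num_states):
            V[x, t] = max(
                sum(p * (r + V[x2, min(t + dt, num_bins)])
                    for p, x2, dt, r in action)
                for action in transitions[x]
            )
    return V
```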
66 4.1 Transportation planning Figure 2 illustrates an example TMDP for optimizing a commute from San Francisco to NASA Ames. [sent-98, score-0.118]
67 The 14 discrete states model both location and observed traffic conditions: shaded and unshaded circles represent heavy and light traffic, respectively. [sent-99, score-0.753]
[Figure 2: The San Francisco to Ames commuting example.]
[Figure 3: The optimal Q-value functions for action 0 ("drive to Ames") and action 1 ("drive to Bayshore station"), and the optimal policy over time, at state 10 ("US101 & Bayshore / heavy traffic").]
72 Observed transition times and traffic conditions are stochastic, and depend on both the time and traffic conditions at the originating location. [sent-112, score-0.475]
73 At states 5, 6, 11, and 12, the "catch the train" action induces an absolute arrival distribution reflecting the train schedules. [sent-113, score-0.441]
74 The domain objective is to arrive at Ames by 9:00am. [sent-114, score-0.037]
75 We impose a linear penalty for arriving between 9 and noon, and an infinite penalty for arriving after noon. [sent-115, score-0.166]
76 There are also linear penalties on the number of minutes spent driving in light traffic, driving in heavy traffic, and riding on the train; the coefficients of these penalties can be adjusted to reflect the commuter's tastes. [sent-116, score-0.637]
77 Figure 3 presents the optimal time-value functions and policy for state #10, "US101 & Bayshore / heavy traffic." [sent-117, score-0.35]
78 There are two actions from this state, corresponding to driving directly to Ames and driving to the train station to wait for the next train. [sent-118, score-0.724]
79 Driving to the train station is preferred (has higher Q-value) at times that are close- but not too close! [sent-119, score-0.287]
80 The full domain is solved in well under a second by both solvers (see Table 1). [sent-121, score-0.037]
81 The optimal time-value functions in the solution comprise a total of 651 linear segments. [sent-122, score-0.16]
82 4.2 Telescope observation scheduling Next, we consider the problem of scheduling astronomical targets for a telescope to maximize the scientific return of one night's viewing [1]. [sent-124, score-0.394]
83 We are given N possible targets with associated coordinates, scientific value, and time window of visibility. [sent-125, score-0.125]
84 We assume that the reward of an observation is proportional to the duration of viewing the target. [sent-127, score-0.295]
85 Acquiring a target requires two steps of stochastic duration: moving the telescope, taking time roughly proportional to the distance traveled; and calibrating it on the new target. [sent-128, score-0.222]
86 Previous approaches have dealt with this stochasticity heuristically, using a just-in-case scheduling approach [1]. [sent-129, score-0.118]
87 Here, we model the stochasticity directly within the TMDP framework. [sent-130, score-0.039]
88 The TMDP has N + 1 states (corresponding to the N observations and "off") and N actions per state (corresponding to what to observe next). [sent-131, score-0.179]
89 [Table 1 caption:] The three rightmost columns measure solution complexity in terms of the number of sweeps of value iteration before convergence; the number of distinct "pieces" or values in the optimal value function V*; and the running time. [sent-143, score-0.304]
90 Running times are the median of five runs on an UltraSparc II (296MHz CPU, 256Mb RAM). [sent-144, score-0.094]
91 The dawdling reward rate K(x, t) encodes the scientific value of observing x at time t; that value is 0 at times when x is not visible. [sent-145, score-0.56]
92 Relative duration distributions encode the inter-target distances and stochastic calibration times on each transition. [sent-146, score-0.354]
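A small illustrative helper (ours, not from the paper) for assembling such a relative duration distribution: if the slew time and the calibration time are each given as discrete pdfs, the total acquisition time is their discrete convolution.

```python
def switch_duration_pdf(slew_pdf, calib_pdf):
    """Discrete pdf of total acquisition time (slew + calibration), each input a
    list of (probability, minutes).  Independent discrete durations add, so the
    total is their convolution.  Hypothetical helper, for illustration only."""
    total = {}
    for p1, d1 in slew_pdf:
        for p2, d2 in calib_pdf:
            total[d1 + d2] = total.get(d1 + d2, 0.0) + p1 * p2
    return [(p, d) for d, p in sorted(total.items())]

# Example: slew takes 2 or 4 minutes, calibration 1 or 3 minutes.
pdf = switch_duration_pdf([(0.5, 2), (0.5, 4)], [(0.7, 1), (0.3, 3)])
```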
93 Thus, representing the exact solution with a grid required 1301 time bins per state. [sent-150, score-0.305]
94 5 Conclusions In sum, we have presented a new stochastic model for time-dependent MDPs (TMDPs), discussed applications, and shown that dynamic programming with piecewise linear time-value functions can produce optimal policies efficiently. [sent-152, score-0.572]
95 In initial comparisons with the alternative method of discretizing the time dimension, the TMDP approach was empirically faster, used significantly less memory, and solved the problem exactly over continuous t ∈ ℝ rather than just at grid points. [sent-153, score-0.253]
96 In our exact computation model, the requirement of discrete duration distributions seems particularly restrictive. [sent-154, score-0.264]
97 We are currently investigating a way of using our exact algorithm to generate upper and lower bounds on the optimal solution for the case of arbitrary pdfs. [sent-155, score-0.162]
98 This may allow the system to produce an optimal or provably near-optimal policy without having to identify all the twists and turns in the optimal time-value functions. [sent-156, score-0.18]
99 Perhaps the most important advantage of the piecewise linear representation will turn out to be its amenability to bounding and approximation methods. [sent-157, score-0.316]
100 We hope that such advances will enable the solution of city-sized route planning, more realistic telescope scheduling, and other practical time-dependent stochastic problems. [sent-158, score-0.392]
wordName wordTfidf (topN-words)
[('tmdp', 0.403), ('piecewise', 0.28), ('driving', 0.205), ('rel', 0.202), ('action', 0.187), ('highway', 0.151), ('rush', 0.151), ('telescope', 0.151), ('traffic', 0.146), ('jl', 0.135), ('stochastic', 0.133), ('duration', 0.127), ('abs', 0.126), ('dawdling', 0.126), ('mdp', 0.123), ('reward', 0.119), ('planning', 0.118), ('ames', 0.109), ('outcome', 0.109), ('durations', 0.109), ('state', 0.108), ('train', 0.106), ('pit', 0.101), ('transportation', 0.101), ('hour', 0.098), ('times', 0.094), ('vi', 0.093), ('arrival', 0.092), ('exact', 0.091), ('grid', 0.09), ('time', 0.089), ('station', 0.087), ('tit', 0.085), ('transitions', 0.081), ('scheduling', 0.079), ('heavy', 0.079), ('pdf', 0.079), ('tmdps', 0.076), ('policy', 0.074), ('outcomes', 0.074), ('route', 0.073), ('actions', 0.071), ('deterministic', 0.071), ('drive', 0.071), ('pieces', 0.065), ('arriving', 0.065), ('bellman', 0.059), ('day', 0.059), ('absolute', 0.056), ('park', 0.055), ('functions', 0.053), ('backroad', 0.05), ('breakpoints', 0.05), ('buses', 0.05), ('closure', 0.05), ('convolutions', 0.05), ('jllx', 0.05), ('lookahead', 0.05), ('missed', 0.05), ('night', 0.05), ('subways', 0.05), ('timevalue', 0.05), ('wait', 0.05), ('viewing', 0.049), ('iteration', 0.048), ('value', 0.048), ('discrete', 0.046), ('running', 0.046), ('mdps', 0.046), ('public', 0.043), ('nasa', 0.043), ('departure', 0.043), ('scheduled', 0.043), ('boyan', 0.043), ('caught', 0.043), ('sweeps', 0.043), ('domains', 0.039), ('path', 0.039), ('home', 0.039), ('travel', 0.039), ('discretizing', 0.039), ('penalties', 0.039), ('stochasticity', 0.039), ('expected', 0.038), ('decision', 0.037), ('domain', 0.037), ('manipulated', 0.036), ('aaai', 0.036), ('linear', 0.036), ('optimal', 0.036), ('scientific', 0.036), ('continuous', 0.035), ('solution', 0.035), ('quantities', 0.035), ('minutes', 0.034), ('restrictions', 0.034), ('vary', 0.034), ('produce', 0.034), ('represent', 0.033), ('ra', 0.032)]
simIndex simValue paperId paperTitle
same-paper 1 0.99999982 48 nips-2000-Exact Solutions to Time-Dependent MDPs
Author: Justin A. Boyan, Michael L. Littman
Abstract: We describe an extension of the Markov decision process model in which a continuous time dimension is included in the state space. This allows for the representation and exact solution of a wide range of problems in which transitions or rewards vary over time. We examine problems based on route planning with public transportation and telescope observation scheduling. 1
2 0.16687599 142 nips-2000-Using Free Energies to Represent Q-values in a Multiagent Reinforcement Learning Task
Author: Brian Sallans, Geoffrey E. Hinton
Abstract: The problem of reinforcement learning in large factored Markov decision processes is explored. The Q-value of a state-action pair is approximated by the free energy of a product of experts network. Network parameters are learned on-line using a modified SARSA algorithm which minimizes the inconsistency of the Q-values of consecutive state-action pairs. Actions are chosen based on the current value estimates by fixing the current state and sampling actions from the network using Gibbs sampling. The algorithm is tested on a co-operative multi-agent task. The product of experts model is found to perform comparably to table-based Q-Iearning for small instances of the task, and continues to perform well when the problem becomes too large for a table-based representation.
3 0.15822968 39 nips-2000-Decomposition of Reinforcement Learning for Admission Control of Self-Similar Call Arrival Processes
Author: Jakob Carlström
Abstract: This paper presents predictive gain scheduling, a technique for simplifying reinforcement learning problems by decomposition. Link admission control of self-similar call traffic is used to demonstrate the technique. The control problem is decomposed into on-line prediction of near-future call arrival rates, and precomputation of policies for Poisson call arrival processes. At decision time, the predictions are used to select among the policies. Simulations show that this technique results in significantly faster learning without any performance loss, compared to a reinforcement learning controller that does not decompose the problem. 1
4 0.13410461 1 nips-2000-APRICODD: Approximate Policy Construction Using Decision Diagrams
Author: Robert St-Aubin, Jesse Hoey, Craig Boutilier
Abstract: We propose a method of approximate dynamic programming for Markov decision processes (MDPs) using algebraic decision diagrams (ADDs). We produce near-optimal value functions and policies with much lower time and space requirements than exact dynamic programming. Our method reduces the sizes of the intermediate value functions generated during value iteration by replacing the values at the terminals of the ADD with ranges of values. Our method is demonstrated on a class of large MDPs (with up to 34 billion states), and we compare the results with the optimal value functions.
5 0.12522967 28 nips-2000-Balancing Multiple Sources of Reward in Reinforcement Learning
Author: Christian R. Shelton
Abstract: For many problems which would be natural for reinforcement learning, the reward signal is not a single scalar value but has multiple scalar components. Examples of such problems include agents with multiple goals and agents with multiple users. Creating a single reward value by combining the multiple components can throwaway vital information and can lead to incorrect solutions. We describe the multiple reward source problem and discuss the problems with applying traditional reinforcement learning. We then present an new algorithm for finding a solution and results on simulated environments.
6 0.097281851 26 nips-2000-Automated State Abstraction for Options using the U-Tree Algorithm
7 0.096361704 112 nips-2000-Reinforcement Learning with Function Approximation Converges to a Region
8 0.096253783 105 nips-2000-Programmable Reinforcement Learning Agents
9 0.094189316 63 nips-2000-Hierarchical Memory-Based Reinforcement Learning
10 0.078423075 73 nips-2000-Kernel-Based Reinforcement Learning in Average-Cost Problems: An Application to Optimal Portfolio Choice
11 0.07081981 43 nips-2000-Dopamine Bonuses
12 0.058198266 98 nips-2000-Partially Observable SDE Models for Image Sequence Recognition Tasks
13 0.055907823 133 nips-2000-The Kernel Gibbs Sampler
14 0.053333063 79 nips-2000-Learning Segmentation by Random Walks
15 0.050218686 113 nips-2000-Robust Reinforcement Learning
16 0.048766494 140 nips-2000-Tree-Based Modeling and Estimation of Gaussian Processes on Graphs with Cycles
17 0.048696779 76 nips-2000-Learning Continuous Distributions: Simulations With Field Theoretic Priors
18 0.047974132 30 nips-2000-Bayesian Video Shot Segmentation
19 0.047226645 55 nips-2000-Finding the Key to a Synapse
20 0.045220904 69 nips-2000-Incorporating Second-Order Functional Knowledge for Better Option Pricing
topicId topicWeight
[(0, 0.173), (1, -0.063), (2, 0.089), (3, -0.273), (4, -0.185), (5, 0.015), (6, -0.021), (7, 0.048), (8, -0.057), (9, -0.024), (10, -0.039), (11, 0.028), (12, 0.008), (13, 0.022), (14, -0.04), (15, 0.043), (16, -0.004), (17, 0.029), (18, -0.033), (19, 0.025), (20, -0.007), (21, 0.049), (22, -0.059), (23, 0.019), (24, -0.086), (25, 0.065), (26, -0.047), (27, -0.076), (28, 0.157), (29, -0.002), (30, 0.206), (31, 0.105), (32, -0.017), (33, -0.079), (34, 0.05), (35, -0.09), (36, -0.069), (37, -0.112), (38, -0.019), (39, 0.001), (40, -0.035), (41, -0.126), (42, -0.097), (43, 0.183), (44, 0.03), (45, 0.027), (46, 0.061), (47, 0.112), (48, -0.027), (49, -0.109)]
simIndex simValue paperId paperTitle
same-paper 1 0.96381629 48 nips-2000-Exact Solutions to Time-Dependent MDPs
Author: Justin A. Boyan, Michael L. Littman
Abstract: We describe an extension of the Markov decision process model in which a continuous time dimension is included in the state space. This allows for the representation and exact solution of a wide range of problems in which transitions or rewards vary over time. We examine problems based on route planning with public transportation and telescope observation scheduling. 1
2 0.71211135 39 nips-2000-Decomposition of Reinforcement Learning for Admission Control of Self-Similar Call Arrival Processes
Author: Jakob Carlström
Abstract: This paper presents predictive gain scheduling, a technique for simplifying reinforcement learning problems by decomposition. Link admission control of self-similar call traffic is used to demonstrate the technique. The control problem is decomposed into on-line prediction of near-future call arrival rates, and precomputation of policies for Poisson call arrival processes. At decision time, the predictions are used to select among the policies. Simulations show that this technique results in significantly faster learning without any performance loss, compared to a reinforcement learning controller that does not decompose the problem. 1
3 0.69473112 1 nips-2000-APRICODD: Approximate Policy Construction Using Decision Diagrams
Author: Robert St-Aubin, Jesse Hoey, Craig Boutilier
Abstract: We propose a method of approximate dynamic programming for Markov decision processes (MDPs) using algebraic decision diagrams (ADDs). We produce near-optimal value functions and policies with much lower time and space requirements than exact dynamic programming. Our method reduces the sizes of the intermediate value functions generated during value iteration by replacing the values at the terminals of the ADD with ranges of values. Our method is demonstrated on a class of large MDPs (with up to 34 billion states), and we compare the results with the optimal value functions.
4 0.54092157 142 nips-2000-Using Free Energies to Represent Q-values in a Multiagent Reinforcement Learning Task
Author: Brian Sallans, Geoffrey E. Hinton
Abstract: The problem of reinforcement learning in large factored Markov decision processes is explored. The Q-value of a state-action pair is approximated by the free energy of a product of experts network. Network parameters are learned on-line using a modified SARSA algorithm which minimizes the inconsistency of the Q-values of consecutive state-action pairs. Actions are chosen based on the current value estimates by fixing the current state and sampling actions from the network using Gibbs sampling. The algorithm is tested on a co-operative multi-agent task. The product of experts model is found to perform comparably to table-based Q-Iearning for small instances of the task, and continues to perform well when the problem becomes too large for a table-based representation.
5 0.48175558 28 nips-2000-Balancing Multiple Sources of Reward in Reinforcement Learning
Author: Christian R. Shelton
Abstract: For many problems which would be natural for reinforcement learning, the reward signal is not a single scalar value but has multiple scalar components. Examples of such problems include agents with multiple goals and agents with multiple users. Creating a single reward value by combining the multiple components can throwaway vital information and can lead to incorrect solutions. We describe the multiple reward source problem and discuss the problems with applying traditional reinforcement learning. We then present an new algorithm for finding a solution and results on simulated environments.
6 0.46608305 113 nips-2000-Robust Reinforcement Learning
7 0.43279985 105 nips-2000-Programmable Reinforcement Learning Agents
8 0.43181714 112 nips-2000-Reinforcement Learning with Function Approximation Converges to a Region
9 0.3993724 30 nips-2000-Bayesian Video Shot Segmentation
10 0.37833449 73 nips-2000-Kernel-Based Reinforcement Learning in Average-Cost Problems: An Application to Optimal Portfolio Choice
11 0.35496366 63 nips-2000-Hierarchical Memory-Based Reinforcement Learning
12 0.32365382 26 nips-2000-Automated State Abstraction for Options using the U-Tree Algorithm
13 0.31148475 125 nips-2000-Stability and Noise in Biochemical Switches
14 0.30366024 43 nips-2000-Dopamine Bonuses
15 0.30069879 85 nips-2000-Mixtures of Gaussian Processes
16 0.26880077 148 nips-2000-`N-Body' Problems in Statistical Learning
17 0.26343468 79 nips-2000-Learning Segmentation by Random Walks
18 0.26035425 98 nips-2000-Partially Observable SDE Models for Image Sequence Recognition Tasks
19 0.25751472 34 nips-2000-Competition and Arbors in Ocular Dominance
20 0.24860963 55 nips-2000-Finding the Key to a Synapse
topicId topicWeight
[(10, 0.041), (15, 0.389), (17, 0.101), (32, 0.014), (33, 0.051), (55, 0.021), (62, 0.098), (65, 0.023), (67, 0.045), (75, 0.036), (76, 0.02), (79, 0.013), (81, 0.029), (90, 0.022), (97, 0.013)]
simIndex simValue paperId paperTitle
1 0.86984795 39 nips-2000-Decomposition of Reinforcement Learning for Admission Control of Self-Similar Call Arrival Processes
Author: Jakob Carlström
Abstract: This paper presents predictive gain scheduling, a technique for simplifying reinforcement learning problems by decomposition. Link admission control of self-similar call traffic is used to demonstrate the technique. The control problem is decomposed into on-line prediction of near-future call arrival rates, and precomputation of policies for Poisson call arrival processes. At decision time, the predictions are used to select among the policies. Simulations show that this technique results in significantly faster learning without any performance loss, compared to a reinforcement learning controller that does not decompose the problem. 1
same-paper 2 0.83556485 48 nips-2000-Exact Solutions to Time-Dependent MDPs
Author: Justin A. Boyan, Michael L. Littman
Abstract: We describe an extension of the Markov decision process model in which a continuous time dimension is included in the state space. This allows for the representation and exact solution of a wide range of problems in which transitions or rewards vary over time. We examine problems based on route planning with public transportation and telescope observation scheduling. 1
3 0.36535189 28 nips-2000-Balancing Multiple Sources of Reward in Reinforcement Learning
Author: Christian R. Shelton
Abstract: For many problems which would be natural for reinforcement learning, the reward signal is not a single scalar value but has multiple scalar components. Examples of such problems include agents with multiple goals and agents with multiple users. Creating a single reward value by combining the multiple components can throwaway vital information and can lead to incorrect solutions. We describe the multiple reward source problem and discuss the problems with applying traditional reinforcement learning. We then present an new algorithm for finding a solution and results on simulated environments.
4 0.36391851 98 nips-2000-Partially Observable SDE Models for Image Sequence Recognition Tasks
Author: Javier R. Movellan, Paul Mineiro, Ruth J. Williams
Abstract: This paper explores a framework for recognition of image sequences using partially observable stochastic differential equation (SDE) models. Monte-Carlo importance sampling techniques are used for efficient estimation of sequence likelihoods and sequence likelihood gradients. Once the network dynamics are learned, we apply the SDE models to sequence recognition tasks in a manner similar to the way Hidden Markov models (HMMs) are commonly applied. The potential advantage of SDEs over HMMS is the use of continuous state dynamics. We present encouraging results for a video sequence recognition task in which SDE models provided excellent performance when compared to hidden Markov models. 1
5 0.36367694 26 nips-2000-Automated State Abstraction for Options using the U-Tree Algorithm
Author: Anders Jonsson, Andrew G. Barto
Abstract: Learning a complex task can be significantly facilitated by defining a hierarchy of subtasks. An agent can learn to choose between various temporally abstract actions, each solving an assigned subtask, to accomplish the overall task. In this paper, we study hierarchical learning using the framework of options. We argue that to take full advantage of hierarchical structure, one should perform option-specific state abstraction, and that if this is to scale to larger tasks, state abstraction should be automated. We adapt McCallum's U-Tree algorithm to automatically build option-specific representations of the state feature space, and we illustrate the resulting algorithm using a simple hierarchical task. Results suggest that automated option-specific state abstraction is an attractive approach to making hierarchical learning systems more effective.
6 0.359216 80 nips-2000-Learning Switching Linear Models of Human Motion
7 0.35834402 1 nips-2000-APRICODD: Approximate Policy Construction Using Decision Diagrams
8 0.35722324 142 nips-2000-Using Free Energies to Represent Q-values in a Multiagent Reinforcement Learning Task
9 0.35536966 86 nips-2000-Model Complexity, Goodness of Fit and Diminishing Returns
10 0.35358405 30 nips-2000-Bayesian Video Shot Segmentation
11 0.35063338 79 nips-2000-Learning Segmentation by Random Walks
12 0.35056314 106 nips-2000-Propagation Algorithms for Variational Bayesian Learning
13 0.34983736 71 nips-2000-Interactive Parts Model: An Application to Recognition of On-line Cursive Script
14 0.3494373 69 nips-2000-Incorporating Second-Order Functional Knowledge for Better Option Pricing
15 0.34934846 7 nips-2000-A New Approximate Maximal Margin Classification Algorithm
16 0.34548438 112 nips-2000-Reinforcement Learning with Function Approximation Converges to a Region
17 0.34455383 74 nips-2000-Kernel Expansions with Unlabeled Examples
18 0.34345362 139 nips-2000-The Use of MDL to Select among Computational Models of Cognition
19 0.34326005 22 nips-2000-Algorithms for Non-negative Matrix Factorization
20 0.34277529 92 nips-2000-Occam's Razor