nips nips2001 nips2001-175 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Xin Wang, Thomas G. Dietterich
Abstract: We address the problem of non-convergence of online reinforcement learning algorithms (e.g., Q learning and SARSA(λ)) by adopting an incremental-batch approach that separates the exploration process from the function fitting process. Our BFBP (Batch Fit to Best Paths) algorithm alternates between an exploration phase (during which trajectories are generated to try to find fragments of the optimal policy) and a function fitting phase (during which a function approximator is fit to the best known paths from start states to terminal states). An advantage of this approach is that batch value-function fitting is a global process, which allows it to address the tradeoffs in function approximation that cannot be handled by local, online algorithms. This approach was pioneered by Boyan and Moore with their GROWSUPPORT and ROUT algorithms. We show how to improve upon their work by applying a better exploration process and by enriching the function fitting procedure to incorporate Bellman error and advantage error measures into the objective function. The results show improved performance on several benchmark problems. 1
Reference: text
sentIndex sentText sentNum sentScore
1 Abstract We address the problem of non-convergence of online reinforcement learning algorithms (e.g., [sent-5, score-0.191]
2 Q learning and SARSA(λ)) by adopting an incremental-batch approach that separates the exploration process from the function fitting process. [sent-7, score-0.302]
3 An advantage of this approach is that batch value-function fitting is a global process, which allows it to address the tradeoffs in function approximation that cannot be handled by local, online algorithms. [sent-9, score-0.193]
4 We show how to improve upon their work by applying a better exploration process and by enriching the function fitting procedure to incorporate Bellman error and advantage error measures into the objective function. [sent-11, score-0.34]
5 1 Introduction Function approximation is essential for applying value-function-based reinforcement learning (RL) algorithms to solve large Markov decision problems (MDPs). [sent-13, score-0.153]
6 However, online RL algorithms such as SARSA(λ) have been shown experimentally to have difficulty converging when applied with function approximators. [sent-14, score-0.068]
7 The heart of the problem is that the approximate values of different states (e.g., s1 and s2) are coupled by the parameters of the function approximator. [sent-17, score-0.084]
8 The optimal policy at state s1 may require increasing a parameter, while the optimal policy at state s2 may require decreasing it. [sent-20, score-0.714]
9 One approach is to formulate the reinforcement learning problem as a global search through a space of parameterized policies as in the policy gradient algorithms (Williams, 1992; Sutton, McAllester, Singh, & Mansour, 2000; Konda & Tsitsiklis, 2000; Baxter & Bartlett, 2000). [sent-23, score-0.388]
10 This avoids the oscillation problem, but the resulting algorithms are slow and only converge to local optima. [sent-24, score-0.06]
11 We pursue an alternative approach that formulates the function approximation problem as a global supervised learning problem. [sent-25, score-0.055]
12 During the exploration phase, a support set of points is constructed whose optimal values are known within some tolerance. [sent-28, score-0.28]
13 In the function fitting phase, a function approximator V is fit to the support set. [sent-29, score-0.262]
14 First, we replace the support set with the set of states that lie along the best paths found during exploration. [sent-31, score-0.313]
15 Second, we employ a combined error function that includes terms for the supervised error, the Bellman error, and the advantage error (Baird, 1995) into the function fitting process. [sent-32, score-0.176]
16 The resulting BFBP (Batch Fit to Best Paths) method gives significantly better performance on resource-constrained scheduling problems as well as on the mountain car toy benchmark problem. [sent-33, score-0.417]
17 Let s' = a(s) denote the state s' that results from performing action a in s and r(s, a) denote the one-step reward. [sent-35, score-0.119]
18 Both GROWSUPPORT and ROUT build a support set S = {(si, V(si))} of states whose optimal values V(s) are known with reasonable accuracy. [sent-36, score-0.145]
19 Both algorithms initialize S with a set of terminal states (with V(s) = 0). [sent-37, score-0.227]
20 In each iteration, a function approximator V̂ is fit to S to minimize Σi [V̂(si) - V(si)]². [sent-38, score-0.153]
21 Then, an exploration process attempts to identify new points to include in S. [sent-39, score-0.219]
22 In GROWSUPPORT, a sample of points X is initially drawn from the state space. [sent-40, score-0.119]
23 In each iteration, after fitting V, GROWSUPPORT computes a new estimate V(s) for each state s ∈ X according to V(s) = max_a [r(s, a) + V(a(s))], where V(a(s)) is computed by executing the greedy policy with respect to V starting in a(s). [sent-41, score-0.494]
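As a rough illustration of this backup-plus-rollout estimate (a sketch, not the authors' code), the Python below assumes a deterministic episodic simulator step(s, a) -> (next_state, reward, done), an actions(s) enumerator, and a fitted approximator v_hat; all of these names are hypothetical interfaces.

def greedy_action(v_hat, step, actions, s):
    # One-step lookahead: argmax_a [ r(s, a) + V(a(s)) ] under the fitted approximator.
    def score(a):
        s_next, r, done = step(s, a)
        return r + (0.0 if done else v_hat(s_next))
    return max(actions(s), key=score)

def rollout_return(v_hat, step, actions, s, max_steps=10_000):
    # Undiscounted return of the greedy (w.r.t. v_hat) policy run from s to a terminal state.
    total = 0.0
    for _ in range(max_steps):
        s, r, done = step(s, greedy_action(v_hat, step, actions, s))
        total += r
        if done:
            break
    return total

def growsupport_estimate(v_hat, step, actions, s):
    # V(s) = max_a [ r(s, a) + V(a(s)) ], with V(a(s)) estimated by a greedy rollout.
    best = float("-inf")
    for a in actions(s):
        s_next, r, done = step(s, a)
        best = max(best, r + (0.0 if done else rollout_return(v_hat, step, actions, s_next)))
    return best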
24 Let P(s'|s, a) be the probability that action a in state s results in state s' and R(s'|s, a) be the expected one-step reward. [sent-44, score-0.321]
25 If such a state is found, then (s, V(s)) is added to S. [sent-46, score-0.145]
26 Both GROWSUPPORT and ROUT rely on the function approximator to generalize well at the boundaries of the support set. [sent-47, score-0.118]
27 A new state s can only be added to S if V has generalized to all of s's successor states. [sent-48, score-0.145]
28 If this occurs consistently, then eventually the support set will expand to include all of the starting states of the MDP, at which point a satisfactory policy has been found. [sent-49, score-0.347]
29 Unfortunately, most regression methods have high bias and variance near the boundaries of their training data, so failures of boundary generalization are common. [sent-52, score-0.064]
30 In BFBP, the exploration process maintains a data structure S that stores the best known path from the start state to a terminal state and a "tree" of one-step departures from this best path (i.e. [sent-54, score-1.062]
31 , states that can be reached by executing an action in some state on the best path). [sent-55, score-0.39]
32 At each state si ∈ S, the data structure stores the action executed in that state (to reach the next state in the path), the one-step reward ri, and the estimated value V(si). [sent-56, score-0.552]
33 S also stores each action a_ that causes a departure from the best path along with the resulting state s_, reward r_ and estimated value V(s_). [sent-57, score-0.54]
34 We will denote by B the subset of S that constitutes the best path. [sent-58, score-0.078]
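One possible (hypothetical) rendering of this data structure in Python, with B represented as an ordered list of per-state records; the field names are illustrative only and are reused by the later sketches.

from dataclasses import dataclass, field
from typing import Any, List, Optional

@dataclass
class Departure:
    # One-step departure from the best path: action a_, resulting state s_,
    # one-step reward r_, and the rollout estimate V(s_).
    action: Any
    state: Optional[Any]
    reward: float
    value: float

@dataclass
class PathEntry:
    # A state s_i on the best path B: the action executed there, the one-step
    # reward r_i, the estimated value V(s_i), and the departures recorded at s_i.
    state: Any
    action: Any
    reward: float
    value: float
    departures: List[Departure] = field(default_factory=list)

# S is the whole structure; the best path B is the ordered list of PathEntry records.
best_path: List[PathEntry] = []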
35 For states si ∈ B, V(si) is computed by summing the immediate rewards rj for all steps j ≥ i along B. [sent-60, score-0.12]
36 For the one-step departure states s_, V(s_) is computed from an exploration trial in which the greedy policy was followed starting in state s_. [sent-61, score-0.75]
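The suffix-sum computation of V(si) along B might look like the short sketch below, reusing the hypothetical PathEntry records above; the departure values V(s_) would come from greedy rollouts such as rollout_return in the earlier sketch.

def backfill_path_values(best_path):
    # Set V(s_i) on the best path to the sum of rewards r_j over all steps j >= i.
    running = 0.0
    for entry in reversed(best_path):
        running += entry.reward
        entry.value = running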
37 Initially, S is empty, so a random trajectory is generated from the start state s0 to a terminal state, and it becomes the initial best known path. [sent-62, score-0.38]
38 In subsequent iterations, a state si ∈ B is chosen at random, and an action a' different from the action stored at si is chosen and executed to produce state s' and reward r'. [sent-63, score-0.36]
39 Then the greedy policy (with respect to the current V) is executed until a terminal state is reached. [sent-64, score-0.503]
40 The rewards along this new path are summed to produce V(s'). [sent-65, score-0.125]
41 If V(s') + r' > V(si), then the best path is revised as follows. [sent-66, score-0.167]
42 The new best action in state si becomes a' with estimated value V(s') + r'. [sent-67, score-0.31]
43 This improved value is then propagated backwards to update the V estimates for all ancestor states in B. [sent-68, score-0.314]
44 The old best action in state si becomes an inferior action a_ with result state s_. [sent-69, score-0.363]
45 Finally all descendants of s_ along the old best path are deleted. [sent-70, score-0.203]
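A sketch of one such exploration trial and the resulting best-path revision is given below; it reuses the hypothetical PathEntry/Departure records and step/actions interfaces, plus an assumed helper greedy_rollout(s) that runs the greedy policy (with respect to the current V) from s to a terminal state and returns the corresponding PathEntry list (with values filled in, e.g., via backfill_path_values) and the total return.

import random

def lds_departure_trial(best_path, step, actions, greedy_rollout):
    # Try one randomly chosen one-step departure from the best path; revise the
    # path if the departure plus a greedy rollout beats the stored value V(s_i).
    i = random.randrange(len(best_path))
    entry = best_path[i]
    alternatives = [a for a in actions(entry.state) if a != entry.action]
    if not alternatives:
        return
    a_new = random.choice(alternatives)
    s_new, r_new, done = step(entry.state, a_new)
    tail, v_new = ([], 0.0) if done else greedy_rollout(s_new)

    if r_new + v_new > entry.value:
        # The old best action at s_i becomes an inferior departure a_.
        old_tail = best_path[i + 1:]
        entry.departures.append(Departure(
            entry.action,
            old_tail[0].state if old_tail else None,
            entry.reward,
            old_tail[0].value if old_tail else 0.0))
        # Install the improved action and suffix, deleting the old descendants.
        entry.action, entry.reward, entry.value = a_new, r_new, r_new + v_new
        best_path[i + 1:] = tail
        # Propagate the improved value backwards to all ancestors in B.
        for j in range(i - 1, -1, -1):
            best_path[j].value = best_path[j].reward + best_path[j + 1].value
    else:
        # Otherwise just record the departure and its rollout value for use in J.
        entry.departures.append(Departure(a_new, s_new, r_new, v_new))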
46 This method of investigating one-step departures from the best path is inspired by Harvey and Ginsberg's (1995) limited discrepancy search (LDS) algorithm. [sent-71, score-0.283]
47 In each exploration phase, K one-step departure paths are explored. [sent-72, score-0.37]
48 After the exploration phase, the value function approximation V̂ is recomputed with the goal of minimizing a combined error function: J(V̂) = λs Σ_{s ∈ S} (V(s) - V̂(s))² + λb Σ_{s ∈ B} (V̂(s) - [r(s, a*) + V̂(a*(s))])² + λa Σ_{s ∈ B} Σ_{a_} max(0, [r(s, a_) + V̂(a_(s))] - [r(s, a*) + V̂(a*(s))])². [sent-73, score-0.249]
49 The supervised term is the usual squared error between the V(s) values stored in S and the fitted values V̂(s). [sent-76, score-0.082]
50 The Bellman term is the squared error between the fitted value and the backed-up value of the next state on the best path. [sent-77, score-0.284]
51 And the advantage term penalizes any case where the backed-up value of an inferior action a_ is larger than the backed-up value of the best action a*. [sent-78, score-0.389]
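Under the same hypothetical data structure, the three terms might be computed as in the sketch below; lam_s, lam_b, and lam_a stand for λs, λb, and λa, and v_hat is the fitted approximator being evaluated.

def combined_objective(v_hat, best_path, lam_s=1.0, lam_b=0.0, lam_a=0.0):
    # J = lam_s * supervised + lam_b * Bellman + lam_a * advantage, summed over
    # the best path B and the one-step departures stored with it (the set S).
    j_sup = j_bell = j_adv = 0.0
    for i, e in enumerate(best_path):
        # Supervised term: fitted values vs. the values stored in S.
        j_sup += (e.value - v_hat(e.state)) ** 2
        for d in e.departures:
            if d.state is not None:
                j_sup += (d.value - v_hat(d.state)) ** 2
        # Backed-up value of the best action a* executed at s_i.
        v_next = v_hat(best_path[i + 1].state) if i + 1 < len(best_path) else 0.0
        backup_best = e.reward + v_next
        # Bellman term (best-path states only).
        j_bell += (v_hat(e.state) - backup_best) ** 2
        # Advantage term: penalize inferior actions whose backup exceeds a*'s.
        for d in e.departures:
            backup_inferior = d.reward + (v_hat(d.state) if d.state is not None else 0.0)
            j_adv += max(0.0, backup_inferior - backup_best) ** 2
    return lam_s * j_sup + lam_b * j_bell + lam_a * j_adv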
52 Theorem 1 Let M be a deterministic MDP such that (a) there are only a finite number of starting states, (b) there is only a finite set of actions executable in each state, and (c) all policies reach a terminal state. Then BFBP converges. [sent-80, score-0.173]
53 Proof: The LDS exploration process is monotonic, since the data structure S is only updated if a new best path is found. [sent-82, score-0.386]
54 The conditions of the theorem imply that there are only a finite number of possible paths that can be explored from the starting states to the terminal states. [sent-83, score-0.351]
55 Consequently, the value function V fit to S will also converge. [sent-85, score-0.091]
56 There are cycles in our job-shop scheduling problems, but we eliminate them by remembering all states visited along the current trajectory and barring any action that would return to a previously visited state. [sent-90, score-0.43]
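A minimal sketch of this cycle-elimination device, assuming hashable states, the same hypothetical step/actions interfaces, and a policy(s, legal_actions) callable that picks among the remaining actions: any action whose successor was already visited on the current trajectory is simply barred.

def acyclic_trajectory(s0, step, actions, policy, max_steps=10_000):
    # Generate a trajectory in a deterministic MDP that never revisits a state.
    visited, s, path = {s0}, s0, []
    for _ in range(max_steps):
        legal = [a for a in actions(s) if step(s, a)[0] not in visited]
        if not legal:
            break  # every remaining action would return to a previously visited state
        a = policy(s, legal)
        s_next, r, done = step(s, a)
        path.append((s, a, r))
        visited.add(s_next)
        s = s_next
        if done:
            break
    return path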
57 Note also that the theorem applies to MDPs with continuous state spaces provided the action space and the start states are finite. [sent-91, score-0.318]
58 Unfortunately, BFBP does not necessarily converge to an optimal policy. [sent-92, score-0.065]
59 This is because LDS exploration can get stuck in a local optimum such that all one-step departures using the V-greedy policy produce trajectories that do not improve over the current best path. [sent-93, score-0.618]
60 Hence, although BFBP resembles policy iteration, it does not have the same optimality guarantees, [sent-94, score-0.203]
61 because policy iteration evaluates the current greedy policy in all states in the state space. [sent-95, score-0.672]
62 Theoretically, we could prove convergence to the optimal policy under modified conditions. [sent-96, score-0.238]
63 If we replace LDS exploration with ε-greedy exploration, then exploration will converge to the optimal paths with probability 1. [sent-97, score-0.592]
64 When trained on those paths, if the function approximator fits a sufficiently accurate V, then BFBP will converge optimally. [sent-98, score-0.122]
65 In our experiments, however, we have found that ε-greedy gives no improvement over LDS, whereas LDS exploration provides more complete coverage of one-step departures from the current best path, and these are used in J(V). [sent-99, score-0.38]
66 3 Experimental Evaluation We have studied five domains: Grid World and Puddle World (Boyan & Moore, 1995), Mountain Car (Sutton, 1996), and resource-constrained scheduling problems ART-1 and ART-2 (Zhang & Dietterich, 1995). [sent-100, score-0.124]
67 For the final domain, it is difficult to draw a sample of states X from the state space to initialize GROWSUPPORT. [sent-102, score-0.229]
68 As mentioned above, we detected and removed cycles from the scheduling domain (since ROUT requires this). [sent-104, score-0.158]
69 We retained the cycles in the first three problems. [sent-105, score-0.06]
70 On mountain car, we also applied SARSA(λ) with the CMAC function approximator developed by Sutton (1996). [sent-106, score-0.237]
71 We experimented with two function approximators: regression trees (RT) and locally-weighted linear regression (LWLR). [sent-107, score-0.134]
72 Our regression trees employ linear separating planes at the internal nodes and linear surfaces at the leaf nodes. [sent-108, score-0.089]
73 To determine the splitting plane at a node, we choose a state si at random from S, choose one of its inferior children s_, and construct the plane that is the perpendicular bisector of these two points. [sent-110, score-0.267]
74 The splitting plane is evaluated by fitting the resulting child nodes to the data (as leaf nodes) and computing the value of J (V). [sent-111, score-0.185]
75 A number C of parent-child pairs (si, s_) are generated and evaluated, and the best one is retained to be the splitting plane. [sent-112, score-0.147]
76 This process is then repeated recursively until a node contains fewer than M data points. The linear surfaces at the leaves are trained by gradient descent to minimize J(V). [sent-113, score-0.058]
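The candidate-split search might be sketched as follows, assuming states are represented as NumPy feature vectors; pairs holds (si, s_) feature-vector pairs drawn from S, and fit_leaf_and_score is an assumed helper that fits linear leaves to the two sides of a plane and returns the resulting J.

import numpy as np

def bisector_plane(x_parent, x_inferior):
    # Perpendicular bisector of the two points: the plane { x : w . x + b = 0 }.
    w = x_inferior - x_parent
    b = -float(w @ ((x_parent + x_inferior) / 2.0))
    return w, b

def choose_split(node_states, pairs, fit_leaf_and_score, num_candidates=20, seed=None):
    # Generate num_candidates bisector planes from random (s_i, s_-) pairs and
    # keep the plane whose two fitted leaf nodes give the lowest J.
    rng = np.random.default_rng(seed)
    best = None
    for _ in range(num_candidates):
        x_i, x_minus = pairs[rng.integers(len(pairs))]
        w, b = bisector_plane(x_i, x_minus)
        side = node_states @ w + b >= 0.0
        score = fit_leaf_and_score(node_states[side], node_states[~side])
        if best is None or score < best[0]:
            best = (score, w, b)
    return best  # (J value, plane normal w, offset b)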
77 In our experiments, we tried all combinations of the following parameters and report the best results: (a) 11 learning rates (from 0. [sent-115, score-0.078]
78 We experimented with all combinations of the following parameters and report the best results: (a) 29 values (from 0. [sent-127, score-0.111]
79 We execute BFBP for 100 iterations, but it converges much earlier: 36 iterations for the grid world, 3 for puddle world, 10 for mountain car, and 5 for the job-shop scheduling problems. [sent-133, score-0.424]
80 Table 1 compares the results of the algorithms on the toy domains with parameters for each method tuned to give the best results and with λs = 1 and λb = λa = 0. [sent-134, score-0.166]
81 In Mountain Car, in particular, we were pleased that BFBP discovered the optimal policy very quickly. [sent-136, score-0.238]
82 Table 2 compares the results of ROUT and BFBP on job-shop scheduling problem TRNOO from problem set ART-1 (again with λs = 1 and λb = λa = 0). [sent-137, score-0.124]
83 We conjecture that this is because ROUT needs a value function approximator that is conservative near the boundary of the training data, whereas BFBP does not. [sent-140, score-0.148]
84 We report both the best policy found during the iterations and the final policy at convergence. [sent-141, score-0.51]
85 Figure 1 plots the results for ROUT (LWLR) against BFBP (RT) for eight additional scheduling problems from ART-1. [sent-142, score-0.124]
86 The experiments above all employed only the supervised term in the error function J. [sent-145, score-0.055]
87 These experiments demonstrate that LDS exploration gives better training sets than the support set methods of GROWSUPPORT and ROUT. [sent-146, score-0.245]
88 Now we turn to the question of whether the Bellman and advantage terms can provide improved results. [sent-147, score-0.072]
89 For the grid world and puddle world tasks, the supervised term already gives optimal performance, so we focus on the mountain car and job-shop scheduling problems. [sent-148, score-0.715]
90 Table 3 summarizes the results for BFBP on the mountain car problem. [sent-149, score-0.261]
91 [Figure 1 legend residue: best policy explored; best final learned policy; y = x reference line.] [sent-152, score-0.515]
92 [Figure 2 plot residue; x-axis: LDS iteration, 0 to 30.] Figure 2: BFBP on ART-2 using neural nets and regression trees. [sent-204, score-0.072]
93 4 Conclusions This paper has shown that the exploration strategies underlying GROWSUPPORT and ROUT can be improved by simply remembering and training on the best paths found between start and terminal states. [sent-209, score-0.596]
94 In addition, we have shown that the performance of our BFBP algorithm can be further improved (and made more robust) by incorporating a penalty for violations of the Bellman equation or a penalty for preferring inferior actions (an advantage error). [sent-211, score-0.145]
95 Taken together, these results show that incremental-batch value function approximation can be a reliable, convergent method for solving deterministic reinforcement learning problems. [sent-212, score-0.153]
96 The key to the success of the method is the ability to separate the exploration process from the function approximation process and to make the exploration process convergent. [sent-213, score-0.438]
97 Generalization in reinforcement learning: Safely approximating the value function. [sent-251, score-0.153]
98 Policy gradient methods for reinforcement learning with function approximation. [sent-281, score-0.155]
99 Policy gradient methods for reinforcement learning with function approximation. [sent-300, score-0.155]
100 Generalization in reinforcement learning: Successful examples using sparse coarse coding. [sent-305, score-0.123]
wordName wordTfidf (topN-words)
[('bfbp', 0.52), ('rout', 0.395), ('growsupport', 0.291), ('exploration', 0.219), ('policy', 0.203), ('lds', 0.166), ('mountain', 0.145), ('scheduling', 0.124), ('reinforcement', 0.123), ('state', 0.119), ('si', 0.118), ('car', 0.116), ('terminal', 0.113), ('lwlr', 0.104), ('sarsa', 0.104), ('boyan', 0.099), ('approximator', 0.092), ('path', 0.089), ('paths', 0.089), ('states', 0.084), ('fitting', 0.083), ('departures', 0.083), ('puddle', 0.083), ('action', 0.083), ('best', 0.078), ('yes', 0.072), ('moore', 0.068), ('bellman', 0.068), ('departure', 0.062), ('fit', 0.061), ('world', 0.058), ('sutton', 0.057), ('supervised', 0.055), ('rt', 0.055), ('zhang', 0.047), ('inferior', 0.047), ('phase', 0.046), ('ab', 0.045), ('splitting', 0.043), ('stores', 0.043), ('ginsberg', 0.042), ('harvey', 0.042), ('pioneered', 0.042), ('rdf', 0.042), ('grid', 0.041), ('executed', 0.039), ('regression', 0.038), ('advantage', 0.038), ('online', 0.038), ('trajectory', 0.038), ('splits', 0.036), ('corvallis', 0.036), ('oregon', 0.036), ('francisco', 0.036), ('along', 0.036), ('trajectories', 0.035), ('optimal', 0.035), ('iteration', 0.034), ('starting', 0.034), ('batch', 0.034), ('cycles', 0.034), ('improved', 0.034), ('aa', 0.033), ('discrepancy', 0.033), ('approximators', 0.033), ('experimented', 0.033), ('tsitsiklis', 0.033), ('dietterich', 0.032), ('gradient', 0.032), ('toy', 0.032), ('start', 0.032), ('mdp', 0.031), ('explored', 0.031), ('baird', 0.031), ('maxa', 0.031), ('mansour', 0.031), ('remembering', 0.031), ('gordon', 0.031), ('execute', 0.031), ('converge', 0.03), ('morgan', 0.03), ('algorithms', 0.03), ('value', 0.03), ('greedy', 0.029), ('plane', 0.029), ('episodic', 0.029), ('konda', 0.029), ('fitted', 0.027), ('domains', 0.026), ('est', 0.026), ('surfaces', 0.026), ('executing', 0.026), ('retained', 0.026), ('support', 0.026), ('added', 0.026), ('boundary', 0.026), ('actions', 0.026), ('final', 0.026), ('trees', 0.025), ('mcallester', 0.025)]
simIndex simValue paperId paperTitle
same-paper 1 0.99999964 175 nips-2001-Stabilizing Value Function Approximation with the BFBP Algorithm
Author: Xin Wang, Thomas G. Dietterich
Abstract: We address the problem of non-convergence of online reinforcement learning algorithms (e.g., Q learning and SARSA(A)) by adopting an incremental-batch approach that separates the exploration process from the function fitting process. Our BFBP (Batch Fit to Best Paths) algorithm alternates between an exploration phase (during which trajectories are generated to try to find fragments of the optimal policy) and a function fitting phase (during which a function approximator is fit to the best known paths from start states to terminal states). An advantage of this approach is that batch value-function fitting is a global process, which allows it to address the tradeoffs in function approximation that cannot be handled by local, online algorithms. This approach was pioneered by Boyan and Moore with their GROWSUPPORT and ROUT algorithms. We show how to improve upon their work by applying a better exploration process and by enriching the function fitting procedure to incorporate Bellman error and advantage error measures into the objective function. The results show improved performance on several benchmark problems. 1
2 0.24594039 40 nips-2001-Batch Value Function Approximation via Support Vectors
Author: Thomas G. Dietterich, Xin Wang
Abstract: We present three ways of combining linear programming with the kernel trick to find value function approximations for reinforcement learning. One formulation is based on SVM regression; the second is based on the Bellman equation; and the third seeks only to ensure that good moves have an advantage over bad moves. All formulations attempt to minimize the number of support vectors while fitting the data. Experiments in a difficult, synthetic maze problem show that all three formulations give excellent performance, but the advantage formulation is much easier to train. Unlike policy gradient methods, the kernel methods described here can easily 'adjust the complexity of the function approximator to fit the complexity of the value function. 1
3 0.18813409 13 nips-2001-A Natural Policy Gradient
Author: Sham M. Kakade
Abstract: We provide a natural gradient method that represents the steepest descent direction based on the underlying structure of the parameter space. Although gradient methods cannot make large changes in the values of the parameters, we show that the natural gradient is moving toward choosing a greedy optimal action rather than just a better action. These greedy optimal actions are those that would be chosen under one improvement step of policy iteration with approximate, compatible value functions, as defined by Sutton et al. [9]. We then show drastic performance improvements in simple MDPs and in the more challenging MDP of Tetris. 1
4 0.16778156 55 nips-2001-Convergence of Optimistic and Incremental Q-Learning
Author: Eyal Even-dar, Yishay Mansour
Abstract: Vie sho,v the convergence of tV/O deterministic variants of Qlearning. The first is the widely used optimistic Q-learning, which initializes the Q-values to large initial values and then follows a greedy policy with respect to the Q-values. We show that setting the initial value sufficiently large guarantees the converges to an Eoptimal policy. The second is a new and novel algorithm incremental Q-learning, which gradually promotes the values of actions that are not taken. We show that incremental Q-learning converges, in the limit, to the optimal policy. Our incremental Q-learning algorithm can be viewed as derandomization of the E-greedy Q-learning. 1
5 0.16062079 121 nips-2001-Model-Free Least-Squares Policy Iteration
Author: Michail G. Lagoudakis, Ronald Parr
Abstract: We propose a new approach to reinforcement learning which combines least squares function approximation with policy iteration. Our method is model-free and completely off policy. We are motivated by the least squares temporal difference learning algorithm (LSTD), which is known for its efficient use of sample experiences compared to pure temporal difference algorithms. LSTD is ideal for prediction problems, however it heretofore has not had a straightforward application to control problems. Moreover, approximations learned by LSTD are strongly influenced by the visitation distribution over states. Our new algorithm, Least Squares Policy Iteration (LSPI) addresses these issues. The result is an off-policy method which can use (or reuse) data collected from any source. We have tested LSPI on several problems, including a bicycle simulator in which it learns to guide the bicycle to a goal efficiently by merely observing a relatively small number of completely random trials.
7 0.14303567 59 nips-2001-Direct value-approximation for factored MDPs
8 0.11463653 187 nips-2001-The Steering Approach for Multi-Criteria Reinforcement Learning
9 0.10850601 67 nips-2001-Efficient Resources Allocation for Markov Decision Processes
10 0.10265043 161 nips-2001-Reinforcement Learning with Long Short-Term Memory
11 0.093223535 128 nips-2001-Multiagent Planning with Factored MDPs
12 0.081757963 36 nips-2001-Approximate Dynamic Programming via Linear Programming
13 0.079791173 195 nips-2001-Variance Reduction Techniques for Gradient Estimates in Reinforcement Learning
14 0.066982515 35 nips-2001-Analysis of Sparse Bayesian Learning
15 0.066001505 126 nips-2001-Motivated Reinforcement Learning
16 0.064639471 148 nips-2001-Predictive Representations of State
17 0.060136661 123 nips-2001-Modeling Temporal Structure in Classical Conditioning
18 0.054980513 115 nips-2001-Linear-time inference in Hierarchical HMMs
19 0.050215065 146 nips-2001-Playing is believing: The role of beliefs in multi-agent learning
20 0.050001506 181 nips-2001-The Emergence of Multiple Movement Units in the Presence of Noise and Feedback Delay
topicId topicWeight
[(0, -0.169), (1, -0.093), (2, 0.335), (3, -0.0), (4, 0.012), (5, 0.091), (6, -0.012), (7, -0.027), (8, 0.012), (9, 0.0), (10, 0.029), (11, -0.024), (12, -0.004), (13, -0.007), (14, 0.03), (15, -0.045), (16, -0.008), (17, 0.019), (18, 0.019), (19, -0.027), (20, 0.043), (21, -0.052), (22, 0.018), (23, -0.073), (24, 0.056), (25, 0.004), (26, 0.076), (27, 0.005), (28, 0.026), (29, 0.011), (30, 0.108), (31, 0.02), (32, 0.054), (33, -0.034), (34, 0.011), (35, 0.011), (36, 0.049), (37, 0.025), (38, -0.008), (39, -0.037), (40, -0.044), (41, -0.006), (42, -0.024), (43, -0.005), (44, 0.145), (45, 0.106), (46, -0.095), (47, 0.043), (48, 0.073), (49, 0.025)]
simIndex simValue paperId paperTitle
same-paper 1 0.95147246 175 nips-2001-Stabilizing Value Function Approximation with the BFBP Algorithm
Author: Xin Wang, Thomas G. Dietterich
Abstract: We address the problem of non-convergence of online reinforcement learning algorithms (e.g., Q learning and SARSA(A)) by adopting an incremental-batch approach that separates the exploration process from the function fitting process. Our BFBP (Batch Fit to Best Paths) algorithm alternates between an exploration phase (during which trajectories are generated to try to find fragments of the optimal policy) and a function fitting phase (during which a function approximator is fit to the best known paths from start states to terminal states). An advantage of this approach is that batch value-function fitting is a global process, which allows it to address the tradeoffs in function approximation that cannot be handled by local, online algorithms. This approach was pioneered by Boyan and Moore with their GROWSUPPORT and ROUT algorithms. We show how to improve upon their work by applying a better exploration process and by enriching the function fitting procedure to incorporate Bellman error and advantage error measures into the objective function. The results show improved performance on several benchmark problems. 1
2 0.83976597 55 nips-2001-Convergence of Optimistic and Incremental Q-Learning
Author: Eyal Even-dar, Yishay Mansour
Abstract: Vie sho,v the convergence of tV/O deterministic variants of Qlearning. The first is the widely used optimistic Q-learning, which initializes the Q-values to large initial values and then follows a greedy policy with respect to the Q-values. We show that setting the initial value sufficiently large guarantees the converges to an Eoptimal policy. The second is a new and novel algorithm incremental Q-learning, which gradually promotes the values of actions that are not taken. We show that incremental Q-learning converges, in the limit, to the optimal policy. Our incremental Q-learning algorithm can be viewed as derandomization of the E-greedy Q-learning. 1
3 0.79791677 40 nips-2001-Batch Value Function Approximation via Support Vectors
Author: Thomas G. Dietterich, Xin Wang
Abstract: We present three ways of combining linear programming with the kernel trick to find value function approximations for reinforcement learning. One formulation is based on SVM regression; the second is based on the Bellman equation; and the third seeks only to ensure that good moves have an advantage over bad moves. All formulations attempt to minimize the number of support vectors while fitting the data. Experiments in a difficult, synthetic maze problem show that all three formulations give excellent performance, but the advantage formulation is much easier to train. Unlike policy gradient methods, the kernel methods described here can easily 'adjust the complexity of the function approximator to fit the complexity of the value function. 1
4 0.74960738 121 nips-2001-Model-Free Least-Squares Policy Iteration
Author: Michail G. Lagoudakis, Ronald Parr
Abstract: We propose a new approach to reinforcement learning which combines least squares function approximation with policy iteration. Our method is model-free and completely off policy. We are motivated by the least squares temporal difference learning algorithm (LSTD), which is known for its efficient use of sample experiences compared to pure temporal difference algorithms. LSTD is ideal for prediction problems, however it heretofore has not had a straightforward application to control problems. Moreover, approximations learned by LSTD are strongly influenced by the visitation distribution over states. Our new algorithm, Least Squares Policy Iteration (LSPI) addresses these issues. The result is an off-policy method which can use (or reuse) data collected from any source. We have tested LSPI on several problems, including a bicycle simulator in which it learns to guide the bicycle to a goal efficiently by merely observing a relatively small number of completely random trials.
Author: Gregory Z. Grudic, Lyle H. Ungar
Abstract: We address two open theoretical questions in Policy Gradient Reinforcement Learning. The first concerns the efficacy of using function approximation to represent the state action value function, . Theory is presented showing that linear function approximation representations of can degrade the rate of convergence of performance gradient estimates by a factor of relative to when no function approximation of is used, where is the number of possible actions and is the number of basis functions in the function approximation representation. The second concerns the use of a bias term in estimating the state action value function. Theory is presented showing that a non-zero bias term can improve the rate of convergence of performance gradient estimates by , where is the number of possible actions. Experimental evidence is presented showing that these theoretical results lead to significant improvement in the convergence properties of Policy Gradient Reinforcement Learning algorithms.
6 0.73129886 13 nips-2001-A Natural Policy Gradient
7 0.62551683 59 nips-2001-Direct value-approximation for factored MDPs
8 0.58848995 148 nips-2001-Predictive Representations of State
9 0.58406126 187 nips-2001-The Steering Approach for Multi-Criteria Reinforcement Learning
10 0.58282954 126 nips-2001-Motivated Reinforcement Learning
11 0.5610736 67 nips-2001-Efficient Resources Allocation for Markov Decision Processes
12 0.55868143 128 nips-2001-Multiagent Planning with Factored MDPs
13 0.50606251 177 nips-2001-Switch Packet Arbitration via Queue-Learning
14 0.50078547 161 nips-2001-Reinforcement Learning with Long Short-Term Memory
15 0.47424042 36 nips-2001-Approximate Dynamic Programming via Linear Programming
16 0.44937837 195 nips-2001-Variance Reduction Techniques for Gradient Estimates in Reinforcement Learning
17 0.4222025 94 nips-2001-Incremental Learning and Selective Sampling via Parametric Optimization Framework for SVM
18 0.37338012 35 nips-2001-Analysis of Sparse Bayesian Learning
19 0.36838779 51 nips-2001-Cobot: A Social Reinforcement Learning Agent
20 0.31678545 160 nips-2001-Reinforcement Learning and Time Perception -- a Model of Animal Experiments
topicId topicWeight
[(14, 0.027), (16, 0.019), (17, 0.044), (19, 0.04), (27, 0.08), (30, 0.069), (42, 0.319), (59, 0.026), (72, 0.096), (79, 0.048), (83, 0.025), (91, 0.112)]
simIndex simValue paperId paperTitle
same-paper 1 0.80994529 175 nips-2001-Stabilizing Value Function Approximation with the BFBP Algorithm
Author: Xin Wang, Thomas G. Dietterich
Abstract: We address the problem of non-convergence of online reinforcement learning algorithms (e.g., Q learning and SARSA(A)) by adopting an incremental-batch approach that separates the exploration process from the function fitting process. Our BFBP (Batch Fit to Best Paths) algorithm alternates between an exploration phase (during which trajectories are generated to try to find fragments of the optimal policy) and a function fitting phase (during which a function approximator is fit to the best known paths from start states to terminal states). An advantage of this approach is that batch value-function fitting is a global process, which allows it to address the tradeoffs in function approximation that cannot be handled by local, online algorithms. This approach was pioneered by Boyan and Moore with their GROWSUPPORT and ROUT algorithms. We show how to improve upon their work by applying a better exploration process and by enriching the function fitting procedure to incorporate Bellman error and advantage error measures into the objective function. The results show improved performance on several benchmark problems. 1
2 0.65919352 56 nips-2001-Convolution Kernels for Natural Language
Author: Michael Collins, Nigel Duffy
Abstract: We describe the application of kernel methods to Natural Language Processing (NLP) problems. In many NLP tasks the objects being modeled are strings, trees, graphs or other discrete structures which require some mechanism to convert them into feature vectors. We describe kernels for various natural language structures, allowing rich, high dimensional representations of these structures. We show how a kernel over trees can be applied to parsing using the voted perceptron algorithm, and we give experimental results on the ATIS corpus of parse trees.
3 0.54566121 40 nips-2001-Batch Value Function Approximation via Support Vectors
Author: Thomas G. Dietterich, Xin Wang
Abstract: We present three ways of combining linear programming with the kernel trick to find value function approximations for reinforcement learning. One formulation is based on SVM regression; the second is based on the Bellman equation; and the third seeks only to ensure that good moves have an advantage over bad moves. All formulations attempt to minimize the number of support vectors while fitting the data. Experiments in a difficult, synthetic maze problem show that all three formulations give excellent performance, but the advantage formulation is much easier to train. Unlike policy gradient methods, the kernel methods described here can easily 'adjust the complexity of the function approximator to fit the complexity of the value function. 1
4 0.51114923 195 nips-2001-Variance Reduction Techniques for Gradient Estimates in Reinforcement Learning
Author: Evan Greensmith, Peter L. Bartlett, Jonathan Baxter
Abstract: We consider the use of two additive control variate methods to reduce the variance of performance gradient estimates in reinforcement learning problems. The first approach we consider is the baseline method, in which a function of the current state is added to the discounted value estimate. We relate the performance of these methods, which use sample paths, to the variance of estimates based on iid data. We derive the baseline function that minimizes this variance, and we show that the variance for any baseline is the sum of the optimal variance and a weighted squared distance to the optimal baseline. We show that the widely used average discounted value baseline (where the reward is replaced by the difference between the reward and its expectation) is suboptimal. The second approach we consider is the actor-critic method, which uses an approximate value function. We give bounds on the expected squared error of its estimates. We show that minimizing distance to the true value function is suboptimal in general; we provide an example for which the true value function gives an estimate with positive variance, but the optimal value function gives an unbiased estimate with zero variance. Our bounds suggest algorithms to estimate the gradient of the performance of parameterized baseline or value functions. We present preliminary experiments that illustrate the performance improvements on a simple control problem. 1 Introduction, Background, and Preliminary Results In reinforcement learning problems, the aim is to select a controller that will maximize the average reward in some environment. We model the environment as a partially observable Markov decision process (POMDP). Gradient ascent methods (e.g., [7, 12, 15]) estimate the gradient of the average reward, usually using Monte Carlo techniques to cal∗ Most of this work was performed while the authors were with the Research School of Information Sciences and Engineering at the Australian National University. culate an average over a sample path of the controlled POMDP. However such estimates tend to have a high variance, which means many steps are needed to obtain a good estimate. GPOMDP [4] is an algorithm for generating an estimate of the gradient in this way. Compared with other approaches, it is suitable for large systems, when the time between visits to a state is large but the mixing time of the controlled POMDP is short. However, it can suffer from the problem of producing high variance estimates. In this paper, we investigate techniques for variance reduction in GPOMDP. One generic approach to reducing the variance of Monte Carlo estimates of integrals is to use an additive control variate (see, for example, [6]). Suppose we wish to estimate the integral of f : X → R, and we know the integral of another function ϕ : X → R. Since X f = X (f − ϕ) + X ϕ, the integral of f − ϕ can be estimated instead. Obviously if ϕ = f then the variance is zero. More generally, Var(f − ϕ) = Var(f ) − 2Cov(f, ϕ) + Var(ϕ), so that if φ and f are strongly correlated, the variance of the estimate is reduced. In this paper, we consider two approaches of this form. The first (Section 2) is the technique of adding a baseline. We find the optimal baseline and we show that the additional variance of a suboptimal baseline can be expressed as a weighted squared distance from the optimal baseline. Constant baselines, which do not depend on the state or observations, have been widely used [13, 15, 9, 11]. 
In particular, the expectation over all states of the discounted value of the state is a popular constant baseline (where, for example, the reward at each step is replaced by the difference between the reward and the expected reward). We give bounds on the estimation variance that show that, perhaps surprisingly, this may not be the best choice. The second approach (Section 3) is the use of an approximate value function. Such actorcritic methods have been investigated extensively [3, 1, 14, 10]. Generally the idea is to minimize some notion of distance between the fixed value function and the true value function. In this paper we show that this may not be the best approach: selecting the fixed value function to be equal to the true value function is not always the best choice. Even more surprisingly, we give an example for which the use of a fixed value function that is different from the true value function reduces the variance to zero, for no increase in bias. We give a bound on the expected squared error (that is, including the estimation variance) of the gradient estimate produced with a fixed value function. Our results suggest new algorithms to learn the optimum baseline, and to learn a fixed value function that minimizes the bound on the error of the estimate. In Section 5, we describe the results of preliminary experiments, which show that these algorithms give performance improvements. POMDP with Reactive, Parameterized Policy A partially observable Markov decision process (POMDP) consists of a state space, S, a control space, U, an observation space, Y, a set of transition probability matrices {P(u) : u ∈ U}, each with components pij (u) for i, j ∈ S, u ∈ U, an observation process ν : S → PY , where PY is the space of probability distributions over Y, and a reward function r : S → R. We assume that S, U, Y are finite, although all our results extend easily to infinite U and Y, and with more restrictive assumptions can be extended to infinite S. A reactive, parameterized policy for a POMDP is a set of mappings {µ(·, θ) : Y → PU |θ ∈ RK }. Together with the POMDP, this defines the controlled POMDP (S, U, Y, P , ν, r, µ). The joint state, observation and control process, {Xt , Yt , Ut }, is Markov. The state process, {Xt }, is also Markov, with transition probabilities pij (θ) = y∈Y,u∈U νy (i)µu (y, θ)pij (u), where νy (i) denotes the probability of observation y given the state i, and µu (y, θ) denotes the probability of action u given parameters θ and observation y. The Markov chain M(θ) = (S, P(θ)) then describes the behaviour of the process {Xt }. Assumption 1 The controlled POMDP (S, U, Y, P , ν, r, µ) satisfies: For all θ ∈ RK there exists a unique stationary distribution satisfying π (θ) P(θ) = π (θ). There is an R < ∞ such that, for all i ∈ S, |r(i)| ≤ R. There is a B < ∞ such that, for all u ∈ U, y ∈ Y and θ ∈ RK the derivatives ∂µu (y, θ)/∂θk (1 ≤ k ≤ K) exist, and the vector of these derivatives satisfies µu (y, θ)/µu (y, θ) ≤ B, where · denotes the Euclidean norm on RK . def T −1 1 We consider the average reward, η(θ) = limT →∞ E T t=0 r(Xt ) . Assumption 1 implies that this limit exists, and does not depend on the start state X0 . The aim is to def select a policy to maximize this quantity. Define the discounted value function, J β (i, θ) = T −1 t limT →∞ E t=0 β r(Xt ) X0 = i . Throughout the rest of the paper, dependences upon θ are assumed, and dropped in the notation. 
For a random vector A, we denote Var(A) = E (A − E [A])2 , where a2 denotes a a, and a denotes the transpose of the column vector a. GPOMDP Algorithm The GPOMDP algorithm [4] uses a sample path to estimate the gradient approximation def µu(y) η, but the βη = E β η approaches the true gradient µu(y) Jβ (j) . As β → 1, def variance increases. We consider a slight modification [2]: with Jt = def ∆T = 1 T T −1 t=0 2T s=t µUt (Yt ) Jt+1 . µUt (Yt ) β s−t r(Xs ), (1) Throughout this paper the process {Xt , Yt , Ut , Xt+1 } is generally understood to be generated by a controlled POMDP satisfying Assumption 1, with X0 ∼π (ie the initial state distributed according to the stationary distribution). That is, before computing the gradient estimates, we wait until the process has settled down to the stationary distribution. Dependent Samples Correlation terms arise in the variance quantities to be analysed. We show here that considering iid samples gives an upper bound on the variance of the general case. The mixing time of a finite ergodic Markov chain M = (S, P ) is defined as def τ = min t > 1 : max dT V i,j Pt i , Pt j ≤ e−1 , where [P t ]i denotes the ith row of P t and dT V is the total variation distance, dT V (P, Q) = i |P (i) − Q(i)|. Theorem 1 Let M = (S, P ) be a finite ergodic Markov chain, with mixing time τ , and 2|S|e and 0 ≤ α < let π be its stationary distribution. There are constants L < exp(−1/(2τ )), which depend only on M , such that, for all f : S → R and all t, Covπ (t) ≤ Lαt Varπ (f), where Varπ (f) is the variance of f under π, and Covπ (t) is f f the auto-covariance of the process {f(Xt )}, where the process {Xt } is generated by M with initial distribution π. Hence, for some constant Ω∗ ≤ 4Lτ , Var 1 T T −1 f(Xt ) t=0 ≤ Ω∗ Varπ (f). T We call (L, τ ) the mixing constants of the Markov chain M (or of the controlled POMDP D; ie the Markov chain (S, P )). We omit the proof (all proofs are in the full version [8]). Briefly, we show that for a finite ergodic Markov chain M , Covπ (t) ≤ Rt (M )Varπ (f), f 2 t for some Rt (M ). We then show that Rt (M ) < 2|S| exp(− τ ). In fact, for a reversible chain, we can choose L = 1 and α = |λ2 |, the second largest magnitude eigenvalue of P . 2 Baseline We consider an alteration of (1), def ∆T = 1 T T −1 µUt (Yt ) (Jt+1 − Ar (Yt )) . µUt (Yt ) t=0 (2) For any baseline Ar : Y → R, it is easy to show that E [∆T ] = E [∆T ]. Thus, we select Ar to minimize variance. The following theorem shows that this variance is bounded by a variance involving iid samples, with Jt replaced by the exact value function. Theorem 2 Suppose that D = (S, U, Y, P , ν, r, µ) is a controlled POMDP satisfying Assumption 1, D has mixing constants (L, τ ), {Xt , Yt , Ut , Xt+1 } is a process generated by D with X0 ∼π ,Ar : Y → R is a baseline that is uniformly bounded by M, and J (j) has the distribution of ∞ β s r(Xt ), where the states Xt are generated by D starting in s=0 X0 = j. Then there are constants C ≤ 5B2 R(R + M) and Ω ≤ 4Lτ ln(eT ) such that Var 1 T T −1 t=0 µUt (Yt ) Ω (Jt+1 −Ar (Yt )) ≤ Varπ µUt (Yt ) T + Ω E T µu (y) (J (j) − Jβ (j)) µu (y) µu (y) (Jβ (j)−Ar (y)) µu (y) 2 + Ω +1 T C βT , (1 − β)2 where, as always, (i, y, u, j) are generated iid with i∼π, y∼ν(i), u∼µ(y) and j∼P i (u). The proof uses Theorem 1 and [2, Lemma 4.3]. Here we have bounded the variance of (2) with the variance of a quantity we may readily analyse. 
The second term on the right hand side shows the error associated with replacing an unbiased, uncorrelated estimate of the value function with the true value function. This quantity is not dependent on the baseline. The final term on the right hand side arises from the truncation of the discounted reward— and is exponentially decreasing. We now concentrate on minimizing the variance σ 2 = Varπ r µu (y) (Jβ (j) − Ar (y)) , µu (y) (3) which the following lemma relates to the variance σ 2 without a baseline, µu (y) Jβ (j) . µu (y) σ 2 = Varπ Lemma 3 Let D = (S, U, Y, P , ν, r, µ) be a controlled POMDP satisfying Assumption 1. For any baseline Ar : Y → R, and for i∼π, y∼ν(i), u∼µ(y) and j∼Pi (u), σ 2 = σ 2 + E A2 (y) E r r µu (y) µu (y) 2 y − 2Ar (y)E µu (y) µu (y) 2 Jβ (j) y . From Lemma 3 it can be seen that the task of finding the optimal baseline is in effect that of minimizing a quadratic for each observation y ∈ Y. This gives the following theorem. Theorem 4 For the controlled POMDP as in Lemma 3, 2 µu (y) min σ 2 = σ 2 − E E Jβ (j) y r Ar µu (y) 2 /E µu (y) µu (y) 2 y and this minimum is attained with the baseline 2 µu (y) µu (y) A∗ (y) = E r Jβ (j) , 2 µu (y) µu (y) /E y y . Furthermore, the optimal constant baseline is µu (y) µu (y) A∗ = E r 2 Jβ (j) /E µu (y) µu (y) 2 . The following theorem shows that the variance of an estimate with an arbitrary baseline can be expressed as the sum of the variance with the optimal baseline and a certain squared weighted distance between the baseline function and the optimal baseline function. Theorem 5 If Ar : Y → R is a baseline function, A∗ is the optimal baseline defined in r Theorem 4, and σ 2 is the variance of the corresponding estimate, then r∗ µu (y) µu (y) σ 2 = σ2 + E r r∗ 2 (Ar (y) − A∗ (y)) r 2 , where i∼π, y ∼ν(i), and u∼µ(y). Furthermore, the same result is true for the case of constant baselines, with Ar (y) replaced by an arbitrary constant baseline Ar , and A∗ (y) r replaced by A∗ , the optimum constant baseline defined in Theorem 4. r For the constant baseline Ar = E i∼π [Jβ (i)], Theorem 5 implies that σ 2 is equal to r min Ar ∈R σ2 r + E µu (y) µu (y) 2 E [Jβ (j)] − E µu (y) µu (y) 2 2 /E Jβ (j) µu (y) µu (y) 2 . Thus, its performance depends on the random variables ( µu (y)/µu (y))2 and Jβ (j); if they are nearly independent, E [Jβ ] is a good choice. 3 Fixed Value Functions: Actor-Critic Methods We consider an alteration of (1), ˜ def 1 ∆T = T T −1 t=0 µUt (Yt ) ˜ Jβ (Xt+1 ), µUt (Yt ) (4) ˜ for some fixed value function Jβ : S → R. Define ∞ def β k d(Xk , Xk+1 ) Aβ (j) = E X0 = j , k=0 def ˜ ˜ where d(i, j) = r(i) + β Jβ (j) − Jβ (i) is the temporal difference. Then it is easy to show that the estimate (4) has a bias of µu (y) ˜ Aβ (j) . β η − E ∆T = E µu (y) The following theorem gives a bound on the expected squared error of (4). The main tool in the proof is Theorem 1. Theorem 6 Let D = (S, U, Y, P , ν, r, µ) be a controlled POMDP satisfying Assumption 1. For a sample path from D, that is, {X0∼π, Yt∼ν(Xt ), Ut∼µ(Yt ), Xt+1∼PXt (Ut )}, E ˜ ∆T − βη 2 ≤ Ω∗ Varπ T µu (y) ˜ Jβ (j) + E µu (y) µu (y) Aβ (j) µu (y) 2 , where the second expectation is over i∼π, y∼ν(i), u∼µ(y), and j∼P i (u). ˜ If we write Jβ (j) = Jβ (j) + v(j), then by selecting v = (v(1), . . . , v(|S|)) from the right def null space of the K × |S| matrix G, where G = i,y,u πi νy (i) µu (y)Pi (u), (4) will produce an unbiased estimate of β η. An obvious example of such a v is a constant vector, (c, c, . . . , c) : c ∈ R. 
We can use this to construct a trivial example where (4) produces an unbiased estimate with zero variance. Indeed, let D = (S, U, Y, P , ν, r, µ) be a controlled POMDP satisfying Assumption 1, with r(i) = c, for some 0 < c ≤ R. Then Jβ (j) = c/(1 − β) and β η = 0. If we choose v = (−c/(1 − β), . . . , −c/(1 − β)) and ˜ ˜ Jβ (j) = Jβ (j) + v(j), then µµu(y) Jβ (j) = 0 for all y, u, j, and so (4) gives an unbiased u(y) estimate of β η, with zero variance. Furthermore, for any D for which there exists a pair y, u such that µu (y) > 0 and µu (y) = 0, choosing ˜β (j) = Jβ (j) gives a variance J greater than zero—there is a non-zero probability event, (Xt = i, Yt = y, Ut = u, Xt+1 = j), such that µµu(y) Jβ (j) = 0. u(y) 4 Algorithms Given a parameterized class of baseline functions Ar (·, θ) : Y → R θ ∈ RL , we can use Theorem 5 to bound the variance of our estimates. Computing the gradient of this bound with respect to the parameters θ of the baseline function allows a gradient optimization of the baseline. The GDORB Algorithm produces an estimate ∆ S of these gradients from a sample path of length S. Under the assumption that the baseline function and its gradients are uniformly bounded, we can show that these estimates converge to the gradient of σ 2 . We omit the details (see [8]). r GDORB Algorithm: Given: Controlled POMDP (S, U, Y, P , ν, r, µ), parameterized baseline Ar . set z0 = 0 (z0 ∈ RL ), ∆0 = 0 (∆0 ∈ RL ) for all {is , ys , us , is+1 , ys+1 } generated by the POMDP do zs+1 = βzs + ∆s+1 = ∆s + end for Ar (ys ) 1 s+1 µus(ys ) µus(ys ) 2 ((Ar (ys ) − βAr (ys+1 ) − r(xs+1 )) zs+1 − ∆s ) ˜ For a parameterized class of fixed value functions {Jβ (·, θ) : S → R θ ∈ RL }, we can use Theorem 6 to bound the expected squared error of our estimates, and compute the gradient of this bound with respect to the parameters θ of the baseline function. The GBTE Algorithm produces an estimate ∆S of these gradients from a sample path of length S. Under the assumption that the value function and its gradients are uniformly bounded, we can show that these estimates converge to the true gradient. GBTE Algorithm: Given: Controlled POMDP (S, U, Y, P , ν, r, µ), parameterized fixed value function ˜β . J set z0 = 0 (z0 ∈ RK ), ∆A0 = 0 (∆A0 ∈ R1×L ), ∆B 0 = 0 (∆B 0 ∈ RK ), ∆C 0 = 0 (∆C 0 ∈ RK ) and ∆D0 = 0 (∆D0 ∈ RK×L ) for all {is , ys , us , is+1 , is+2 } generated by the POMDP do µ s(y ) zs+1 = βzs + µuu(yss ) s µus(ys ) ˜ µus(ys ) Jβ (is+1 ) µus(ys ) µus(ys ) ˜ Jβ (is+1 ) ∆As+1 = ∆As + 1 s+1 ∆B s+1 = ∆B s + 1 s+1 µus(ys ) ˜ µus(ys ) Jβ (is+1 ) ∆C s+1 = ∆C s + 1 s+1 ˜ ˜ r(is+1 ) + β Jβ (is+2 ) − Jβ (is+1 ) zs+1 − ∆C s ∆Ds+1 = ∆Ds + 1 s+1 µus(ys ) µus(ys ) ∆s+1 = end for Ω∗ T ∆As+1 − − ∆B s ˜ Jβ (is+1 ) Ω∗ T ∆B s+1 ∆D s+1 − ∆As − ∆D s − ∆C s+1 ∆Ds+1 5 Experiments Experimental results comparing these GPOMDP variants for a simple three state MDP (described in [5]) are shown in Figure 1. The exact value function plots show how different choices of baseline and fixed value function compare when all algorithms have access to the exact value function Jβ . Using the expected value function as a baseline was an improvement over GPOMDP. Using the optimum, or optimum constant, baseline was a further improvement, each performing comparably to the other. Using the pre-trained fixed value function was also an improvement over GPOMDP, showing that selecting the true value function was indeed not the best choice in this case. 
The trained fixed value function was not optimal though, as Jβ (j) − A∗ is a valid choice of fixed value function. r The optimum baseline, and fixed value function, will not normally be known. The online plots show experiments where the baseline and fixed value function were trained using online gradient descent whilst the performance gradient was being estimated, using the same data. Clear improvement over GPOMDP is seen for the online trained baseline variant. For the online trained fixed value function, improvement is seen until T becomes—given the simplicity of the system—very large. References [1] L. Baird and A. Moore. Gradient descent for general reinforcement learning. In Advances in Neural Information Processing Systems 11, pages 968–974. MIT Press, 1999. [2] P. L. Bartlett and J. Baxter. Estimation and approximation bounds for gradient-based reinforcement learning. Journal of Computer and Systems Sciences, 2002. To appear. [3] A. G. Barto, R. S. Sutton, and C. W. Anderson. Neuronlike adaptive elements that can solve difficult learning control problems. IEEE Transactions on Systems, Man, and Cybernetics, SMC-13:834–846, 1983. [4] J. Baxter and P. L. Bartlett. Infinite-horizon gradient-based policy search. Journal of Artificial Intelligence Research, 15:319–350, 2001. [5] J. Baxter, P. L. Bartlett, and L. Weaver. Infinite-horizon gradient-based policy search: II. Gradient ascent algorithms and experiments. Journal of Artificial Intelligence Research, 15:351–381, 2001. [6] M. Evans and T. Swartz. Approximating integrals via Monte Carlo and deterministic methods. Oxford University Press, 2000. Exact Value Function—Mean Error Exact Value Function—One Standard Deviation 0.4 0.4 GPOMDP-Jβ BL- [Jβ ] BL-A∗ (y) r BL-A∗ r FVF-pretrain Relative Norm Difference Relative Norm Difference 0.25 GPOMDP-Jβ BL- [Jβ ] BL-A∗ (y) r BL-A∗ r FVF-pretrain 0.35 0.3 0.2 0.15 0.1 0.05 ¡ 0.35 0.3 0.25 0.2 0.15 0.1 0.05 0 0 1 10 100 1000 10000 100000 1e+06 1e+07 1 10 100 1000 T Online—Mean Error 100000 1e+06 1e+07 Online—One Standard Deviation 1 1 GPOMDP BL-online FVF-online 0.8 Relative Norm Difference Relative Norm Difference 10000 T 0.6 0.4 0.2 0 GPOMDP BL-online FVF-online 0.8 0.6 0.4 0.2 0 1 10 100 1000 10000 100000 1e+06 1e+07 1 10 100 T 1000 10000 100000 1e+06 1e+07 T Figure 1: Three state experiments—relative norm error ∆ est − η / η . Exact value function plots compare mean error and standard deviations for gradient estimates (with knowledge of Jβ ) computed by: GPOMDP [GPOMDP-Jβ ]; with baseline Ar = [Jβ ] [BL- [Jβ ]]; with optimum baseline [BL-A∗ (y)]; with optimum constant baseline [BL-A∗ ]; with pre-trained fixed value r r function [FVF-pretrain]. Online plots do a similar comparison of estimates computed by: GPOMDP [GPOMDP]; with online trained baseline [BL-online]; with online trained fixed value function [FVFonline]. The plots were computed over 500 runs (1000 for FVF-online), with β = 0.95. Ω∗ /T was set to 0.001 for FVF-pretrain, and 0.01 for FVF-online. ¢ ¢ [7] P. W. Glynn. Likelihood ratio gradient estimation for stochastic systems. Communications of the ACM, 33:75–84, 1990. [8] E. Greensmith, P. L. Bartlett, and J. Baxter. Variance reduction techniques for gradient estimates in reinforcement learning. Technical report, ANU, 2002. [9] H. Kimura, K. Miyazaki, and S. Kobayashi. Reinforcement learning in POMDPs with function approximation. In D. H. Fisher, editor, Proceedings of the Fourteenth International Conference on Machine Learning (ICML’97), pages 152–160, 1997. [10] V. R. Konda and J. 
N. Tsitsiklis. Actor-Critic Algorithms. In Advances in Neural Information Processing Systems 12, pages 1008–1014. MIT Press, 2000. [11] P. Marbach and J. N. Tsitsiklis. Simulation-Based Optimization of Markov Reward Processes. Technical report, MIT, 1998. [12] R. Y. Rubinstein. How to optimize complex stochastic systems from a single sample path by the score function method. Ann. Oper. Res., 27:175–211, 1991. [13] R. S. Sutton and A. G. Barto. Reinforcement Learning: An Introduction. MIT Press, Cambridge MA, 1998. ISBN 0-262-19398-1. [14] R. S. Sutton, D. McAllester, S. Singh, and Y. Mansour. Policy Gradient Methods for Reinforcement Learning with Function Approximation. In Advances in Neural Information Processing Systems 12, pages 1057–1063. MIT Press, 2000. [15] R. J. Williams. Simple Statistical Gradient-Following Algorithms for Connectionist Reinforcement Learning. Machine Learning, 8:229–256, 1992.
5 0.51035631 13 nips-2001-A Natural Policy Gradient
Author: Sham M. Kakade
Abstract: We provide a natural gradient method that represents the steepest descent direction based on the underlying structure of the parameter space. Although gradient methods cannot make large changes in the values of the parameters, we show that the natural gradient is moving toward choosing a greedy optimal action rather than just a better action. These greedy optimal actions are those that would be chosen under one improvement step of policy iteration with approximate, compatible value functions, as defined by Sutton et al. [9]. We then show drastic performance improvements in simple MDPs and in the more challenging MDP of Tetris. 1
7 0.50694716 28 nips-2001-Adaptive Nearest Neighbor Classification Using Support Vector Machines
8 0.50687802 102 nips-2001-KLD-Sampling: Adaptive Particle Filters
9 0.50632066 1 nips-2001-(Not) Bounding the True Error
10 0.50360835 36 nips-2001-Approximate Dynamic Programming via Linear Programming
11 0.50323343 121 nips-2001-Model-Free Least-Squares Policy Iteration
12 0.50192887 132 nips-2001-Novel iteration schemes for the Cluster Variation Method
13 0.50183487 29 nips-2001-Adaptive Sparseness Using Jeffreys Prior
14 0.50077438 95 nips-2001-Infinite Mixtures of Gaussian Process Experts
15 0.49772403 7 nips-2001-A Dynamic HMM for On-line Segmentation of Sequential Data
16 0.49724931 59 nips-2001-Direct value-approximation for factored MDPs
17 0.49627256 100 nips-2001-Iterative Double Clustering for Unsupervised and Semi-Supervised Learning
18 0.49547124 185 nips-2001-The Method of Quantum Clustering
19 0.49529421 162 nips-2001-Relative Density Nets: A New Way to Combine Backpropagation with HMM's
20 0.49422425 161 nips-2001-Reinforcement Learning with Long Short-Term Memory