nips nips2001 nips2001-40 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Thomas G. Dietterich, Xin Wang
Abstract: We present three ways of combining linear programming with the kernel trick to find value function approximations for reinforcement learning. One formulation is based on SVM regression; the second is based on the Bellman equation; and the third seeks only to ensure that good moves have an advantage over bad moves. All formulations attempt to minimize the number of support vectors while fitting the data. Experiments in a difficult, synthetic maze problem show that all three formulations give excellent performance, but the advantage formulation is much easier to train. Unlike policy gradient methods, the kernel methods described here can easily adjust the complexity of the function approximator to fit the complexity of the value function. 1
Reference: text
sentIndex sentText sentNum sentScore
1 Abstract We present three ways of combining linear programming with the kernel trick to find value function approximations for reinforcement learning. [sent-5, score-0.532]
2 One formulation is based on SVM regression; the second is based on the Bellman equation; and the third seeks only to ensure that good moves have an advantage over bad moves. [sent-6, score-0.421]
3 All formulations attempt to minimize the number of support vectors while fitting the data. [sent-7, score-0.463]
4 Experiments in a difficult, synthetic maze problem show that all three formulations give excellent performance, but the advantage formulation is much easier to train. [sent-8, score-0.767]
5 Unlike policy gradient methods, the kernel methods described here can easily adjust the complexity of the function approximator to fit the complexity of the value function. [sent-9, score-0.607]
6 1 Introduction Virtually all existing work on value function approximation and policy-gradient methods starts with a parameterized formula for the value function or policy and then seeks to find the best policy that can be represented in that parameterized form. [sent-10, score-0.932]
7 This can give rise to very difficult search problems for which the Bellman equation is of little or no use. [sent-11, score-0.115]
8 In this paper, we take a different approach: rather than fixing the form of the function approximator and searching for a representable policy, we instead identify a good policy and then search for a function approximator that can represent it. [sent-12, score-0.638]
9 Our approach exploits the ability of mathematical programming to represent a variety of constraints including those that derive from supervised learning, from advantage learning (Baird, 1993), and from the Bellman equation. [sent-13, score-0.292]
10 By combining the kernel trick with mathematical programming, we obtain a function approximator that seeks to find the smallest number of support vectors sufficient to represent the desired policy. [sent-14, score-0.53]
11 This side-steps the difficult problem of searching for a good policy among those policies representable by a fixed function approximator. [sent-15, score-0.436]
12 Our method applies to any episodic MDP, but it works best in domains (such as resource-constrained scheduling and other combinatorial optimization problems) that are discrete and deterministic. [sent-16, score-0.169]
13 2 Preliminaries There are two distinct reasons for studying value function approximation methods. [sent-17, score-0.144]
14 The primary reason is to be able to generalize from some set of training experiences to produce a policy that can be applied in new states that were not visited during training. [sent-18, score-0.624]
15 For example, in Tesauro's (1995) work on backgammon, even after training on 200,000 games, the TD-gammon system needed to be able to generalize to new board positions that it had not previously visited. [sent-19, score-0.108]
16 Similarly, in Zhang's (1995) work on space shuttle scheduling, each individual scheduling problem visits only a finite number of states, but the goal is to learn from a series of "training" problems and generalize to new states that arise in "test" problems. [sent-20, score-0.516]
17 The second reason to study function approximation is to support learning in continuous state spaces. [sent-22, score-0.236]
18 Specifically, we study the problem of generalizing from a partial policy to construct a complete policy for a Markov Decision Problem (MDP). [sent-27, score-0.498]
19 Formally, consider a discrete time MDP M with probability transition function P(s'|s, a) (probability that state s' will result from executing action a in state s) and expected reward function R(s'|s, a) (expected reward received from executing action a in state s and entering state s'). [sent-28, score-1.028]
20 We will assume that, as in backgammon and space shuttle scheduling, P(s'|s, a) and R(s'|s, a) are known and available to the agent, but that the state space is so large that it prevents methods such as value iteration or policy iteration from being applied. [sent-29, score-0.656]
21 Let L be a set of "training" states for which we have an approximation V(s) to the optimal value function V*(s), s ∈ L. [sent-30, score-0.374]
22 In some cases, we will also assume the availability of a policy π consistent with V(s). [sent-31, score-0.249]
23 The goal is to construct a parameterized approximation V̂(s; θ) that can be applied to all states in M to yield a good policy π̂ via one-step lookahead search. [sent-32, score-0.521]
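The one-step lookahead used here is the standard greedy backup through P and R; a minimal Python sketch (the function name, the dictionary encodings of P and R, and the callable V_approx are illustrative assumptions, not the authors' code):

def greedy_lookahead_action(s, actions, P, R, V_approx):
    """Return argmax_a of sum_{s2} P(s2|s,a) * (R(s2|s,a) + V_approx(s2)).
    P[(s, a)] maps successor states s2 to probabilities, R[(s, a, s2)] gives
    the expected reward, and V_approx is the learned value approximation."""
    best_action, best_value = None, float("-inf")
    for a in actions(s):
        backup = sum(p * (R[(s, a, s2)] + V_approx(s2))
                     for s2, p in P[(s, a)].items())
        if backup > best_value:
            best_action, best_value = a, backup
    return best_action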
24 In the experiments reported below, the set L contains states that lie along trajectories from a small set of "training" starting states S0 to terminal states. [sent-33, score-0.53]
25 A successful learning method will be able to generalize to give a good policy for new starting states not in S0. [sent-34, score-0.583]
26 This was the situation that arose in space shuttle scheduling, where the set L contained states that were visited while solving "training" problems and the learned value function was applied to solve "test" problems. [sent-35, score-0.498]
27 To represent states for function approximation, let X (s) denote a vector of features describing the state s. [sent-36, score-0.321]
28 Let K(X1, X2) be a kernel function (generalized inner product) of the two feature vectors X1 and X2. In our experiments, we have employed the Gaussian kernel K(X1, X2; σ) = exp(-||X1 - X2||^2 / σ^2) with parameter σ. [sent-37, score-0.172]
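A minimal NumPy sketch of this kernel (the function name and array conventions are assumptions):

import numpy as np

def gaussian_kernel(x1, x2, sigma):
    # K(X1, X2; sigma) = exp(-||X1 - X2||^2 / sigma^2)
    d = np.asarray(x1, dtype=float) - np.asarray(x2, dtype=float)
    return np.exp(-np.dot(d, d) / sigma**2)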
29 3 Three LP Formulations of Function Approximation We now introduce three linear programming formulations of the function approximation problem. [sent-38, score-0.454]
30 We first express each of these formulations in terms of a generic fitted function approximator V̂. [sent-39, score-0.413]
31 Then, we implement V̂(s) as the dot product of a weight vector W with the feature vector X(s): V̂(s) = W · X(s). [sent-40, score-0.044]
32 Finally, we apply the "kernel trick" by first rewriting W as a weighted sum of the training points sj ∈ L, W = Σj αj X(sj) (αj ≥ 0), and then replacing all dot products between data points by invocations of the kernel function K. [sent-42, score-0.217]
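Spelling out these two steps, the fitted value function can be written entirely in terms of kernel evaluations on the training points (a worked restatement in the notation above):

V̂(s) = W · X(s) = (Σj αj X(sj)) · X(s) = Σj αj K(X(sj), X(s)),   with αj ≥ 0.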
33 We assume L contains all states along the best paths from S0 to terminal states and also all states that can be reached from these paths in one step and that have been visited during exploration (so that V is known). [sent-43, score-0.767]
34 In all three formulations we have employed linear objective functions, but quadratic objectives like those employed in standard support vector machines could be used instead. [sent-44, score-0.525]
35 All slack variables in these formulations are constrained to be non-negative. [sent-45, score-0.465]
36 The first formulation treats the value function approximation problem as a supervised learning problem and applies the standard ε-insensitive loss function (Vapnik, 2000) to fit the function approximator. [sent-47, score-0.479]
37 minimize Σ_s [u(s) + v(s)] subject to V̂(s) + u(s) ≥ V(s) - ε; V̂(s) - v(s) ≤ V(s) + ε, ∀s ∈ L. In this formulation, u(s) and v(s) are slack variables that are non-zero only if V̂(s) has an absolute deviation from V(s) of more than ε. [sent-48, score-0.373]
38 The objective function seeks to minimize these absolute deviation errors. [sent-49, score-0.269]
39 A key idea of support vector methods is to combine this objective function with a penalty on the norm of the weight vector. [sent-50, score-0.221]
40 We can write this as minimize ||W||_1 + C Σ_s [u(s) + v(s)] subject to W · X(s) + u(s) ≥ V(s) - ε; W · X(s) - v(s) ≤ V(s) + ε, ∀s ∈ L. The parameter C expresses the tradeoff between fitting the data (by driving the slack variables to zero) and minimizing the norm of the weight vector. [sent-51, score-0.445]
41 We have chosen to minimize the 1-norm of the weight vector (||W||_1 = Σ_i |W_i|), since this keeps the optimization a linear program. [sent-52, score-0.105]
42 Of course, if the squared Euclidean norm of W is preferred, then quadratic programming methods could be applied to minimize this. [sent-54, score-0.229]
43 Substituting this into the constraint equations, we obtain: minimize Σj αj + C Σ_s [u(s) + v(s)] subject to Σj αj K(X(sj), X(s)) + u(s) ≥ V(s) - ε and Σj αj K(X(sj), X(s)) - v(s) ≤ V(s) + ε, ∀s ∈ L. [sent-56, score-0.327]
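One way to pose this LP with an off-the-shelf solver is sketched below in Python (scipy.optimize.linprog; the function name, variable ordering, dense matrix construction, and the default eps are illustrative assumptions, not the authors' implementation; the default C of 100,000 matches the value reported to work well in the experiments):

import numpy as np
from scipy.optimize import linprog

def fit_supervised_lp(K, V, C=1e5, eps=0.1):
    """Solve: min sum(alpha) + C*sum(u + v)
       s.t.  K @ alpha + u >= V - eps,   K @ alpha - v <= V + eps,
             alpha, u, v >= 0,
    where K[i, j] = K(X(s_i), X(s_j)) is the kernel matrix over the training
    states and V holds the target values on those states."""
    K = np.asarray(K, dtype=float)
    V = np.asarray(V, dtype=float)
    m = len(V)
    c = np.concatenate([np.ones(m), C * np.ones(2 * m)])   # variables [alpha, u, v]
    A_ub = np.block([
        [-K, -np.eye(m), np.zeros((m, m))],   # -K a - u <= -(V - eps)
        [ K, np.zeros((m, m)), -np.eye(m)],   #  K a - v <=   V + eps
    ])
    b_ub = np.concatenate([-(V - eps), V + eps])
    res = linprog(c, A_ub=A_ub, b_ub=b_ub, bounds=(0, None), method="highs")
    return res.x[:m]   # alpha coefficients, one per training state

Given the returned alphas, the fitted value at a new state s is V̂(s) = Σj αj K(X(sj), X(s)), as in the expansion above.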
44 The second formulation introduces constraints from the Bellman equation V(s) = max_a Σ_s' P(s'|s, a)[R(s'|s, a) + V(s')]. [sent-59, score-0.19]
45 The standard approach to solving MDPs via linear programming is the following. [sent-60, score-0.086]
46 Hence, the minimization of the slack variables implements the maximization operation of the Bellman equation. [sent-62, score-0.194]
47 We attempted to apply this formulation with function approximation, but the errors introduced by the approximation make the linear program infeasible, because V(s) must sometimes be less than the backed-up value Σ_s' P(s'|s, a)[R(s'|s, a) + V(s')]. [sent-63, score-0.285]
48 This led us to the following formulation in which we exploit the approximate value function V to provide "advice" to the LP optimizer about which constraints should be tight and which ones should be loose. [sent-64, score-0.285]
49 We can group the actions available in s into three groups: (a) the "optimal" action a* = π(s) chosen by the approximate policy π, (b) other actions that are tied for optimum (denoted by ao), and (c) actions that are sub-optimal (denoted by a_). [sent-66, score-0.852]
50 The second constraint requires V(s) to be at least as large as the backed-up value of any alternative optimal actions ao. [sent-68, score-0.296]
51 If V(s) is too small, it will be penalized, because the slack variable y(s, ao) will be non-zero. [sent-69, score-0.194]
52 The main effect of this constraint is to drive the value of V(s') downward as necessary to satisfy the first constraint on a*. [sent-71, score-0.191]
53 Finally, the third constraint requires that V(s) be at least ε larger than the backed-up value of all inferior actions a_. [sent-72, score-0.291]
54 If these constraints can be satisfied with all slack variables u, v, y, and z set to zero, then V satisfies the Bellman equation. [sent-73, score-0.286]
55 After applying the kernel trick and introducing the regularization objective, we obtain the following Bellman formulation: minimize Σj αj + C Σ_s [u(s, a*) + v(s, a*) + y(s, ao) + z(s, a_)] subject to the kernelized versions of the constraints described above, for all s ∈ L. [sent-74, score-0.474]
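A hedged reconstruction of the constraint set behind this formulation, read off from the prose above (V denotes the fitted approximator as in the surrounding sentences; the shorthand B(s, a) for the backed-up value and the exact placement of the slack variables are inferred, not quoted):

B(s, a) = Σ_s' P(s'|s, a)[R(s'|s, a) + V(s')]
V(s) + u(s, a*) ≥ B(s, a*)  and  V(s) - v(s, a*) ≤ B(s, a*)    (chosen action a*)
V(s) + y(s, ao) ≥ B(s, ao)                                      (actions ao tied with a*)
V(s) + z(s, a_) ≥ B(s, a_) + ε                                  (inferior actions a_)
for all s ∈ L.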
56 The third formulation focuses on the minimal constraints that must be satisfied to ensure that the greedy policy computed from the fitted V̂ will be identical to the greedy policy computed from the given V (cf. [sent-76, score-0.803]
57 Specifically, we require that the backed up value of the optimal action a* be greater than the backed up values of all other actions a. [sent-78, score-0.508]
58 minimize Σ_{s,a*,a} u(s, a*, a) subject to Σ_s' P(s'|s, a*)[R(s'|s, a*) + V(s')] + u(s, a*, a) ≥ Σ_s' P(s'|s, a)[R(s'|s, a) + V(s')] + ε. [sent-79, score-0.179]
59 There is one constraint and one slack variable u(s, a*, a) for every action executable in state s except for the chosen optimal action a* = π(s). [sent-81, score-0.72]
60 The backed-up value of a* must have an advantage of at least ε over any other action a, even other actions that, according to V, are just as good as a*. [sent-82, score-0.444]
61 4 Experimental Results To compare these three formulations, we generated a set of 10 random maze problems as follows. [sent-84, score-0.259]
62 In a 100 by 100 maze, the agent starts in a randomly-chosen square in the left column, (0, y). [sent-85, score-0.085]
63 Three actions are available in every state, east, northeast, and southeast, which deterministically move the agent one square in the indicated direction. [sent-86, score-0.212]
64 The maze is filled with 3000 rewards (each of value -5) generated randomly from a mixture of a uniform distribution (with probability 0. [sent-87, score-0.306]
65 Multiple rewards generated for a single state are accumulated. [sent-90, score-0.183]
66 In addition, in column 99, terminal rewards are generated according to a distribution that varies from -5 to +15 with minima at (99,0), (99,40), and (99,80) and maxima at (99,20) and (99,60). [sent-91, score-0.145]
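A rough Python sketch of this maze generator (illustrative only: the exact mixture used to place the -5 rewards is not fully specified above, so plain uniform placement is a stand-in, and the interpolation shape of the terminal rewards, the coordinate convention for northeast/southeast, and the boundary handling are assumptions):

import numpy as np

def make_maze(seed=0, size=100, n_penalties=3000, penalty=-5):
    rng = np.random.default_rng(seed)
    R = np.zeros((size, size))                 # per-cell reward, indexed [x, y]
    xs = rng.integers(0, size, n_penalties)
    ys = rng.integers(0, size, n_penalties)
    for x, y in zip(xs, ys):
        R[x, y] += penalty                     # multiple rewards in a cell accumulate
    # terminal rewards in column 99: minima of -5 at y = 0, 40, 80 and
    # maxima of +15 at y = 20, 60 (linear interpolation between them assumed)
    y = np.arange(size)
    terminal = np.interp(y, [0, 20, 40, 60, 80, 99], [-5, 15, -5, 15, -5, -5])
    return R, terminal

def step(x, y, action):
    # deterministic moves: east, northeast, southeast (always one column east)
    dy = {"east": 0, "northeast": 1, "southeast": -1}[action]
    return x + 1, min(max(y + dy, 0), 99)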
67 These maze problems are surprisingly hard because unlike "traditional" mazes, they contain no walls. [sent-93, score-0.196]
68 In traditional mazes, the walls tend to guide the agent to the goal states by reducing what would be a 2-D random walk to a random walk of lower dimension (e. [sent-94, score-0.354]
69 We applied the three LP formulations in an incremental-batch method as shown in Table 1. [sent-101, score-0.334]
70 The learned V̂ giving the best performance on the starting states in S0 over the 20 iterations was saved and evaluated over all 100 possible starting states to obtain a measure of generalization. [sent-103, score-0.486]
71 The values of C and σ were determined by evaluating generalization on a holdout set of 3 start states: (0,30), (0,50), and (0,70). [sent-104, score-0.061]
72 Experimentation showed that C = 100,000 worked well for all three methods. [sent-105, score-0.063]
73 We tuned σ^2 separately for each problem using values of 5, 10, 20, 40, 60, 80, 120, and 160; larger values were preferred in case of ties, since they give better generalization. [sent-106, score-0.033]
74 The figure shows that the three methods give essentially identical performance, and that after 3 examples, all three methods have a regret per start state of about 2 units, which is less than the cost of a single -5 penalty. [sent-108, score-0.379]
75 However, the three formulations differ in their ease of training and in the information they require. [sent-109, score-0.384]
76 A high score on this last measure indicates that the learning algorithm is not converging well, even though it may momentarily attain a good fit to the data. [sent-111, score-0.066]
77 By virtually every measure, the advantage formulation scores better. [sent-112, score-0.272]
78 It requires much less CPU time to train, finds substantially fewer support vectors, and finds function approximators that give a better fit to the training values. [sent-113, score-0.258]
79 Table 2: Measures of the quality of the training process (average over 10 MDPs), comparing the supervised (Sup), Bellman (Bel), and advantage (Adv) formulations on CPU time, number of support vectors, and related measures.
80 In addition, the advantage and Bellman formulations do not require the value of V, but only the policy π. [sent-176, score-0.332]
81 5 Conclusions This paper has presented three formulations. [sent-178, score-0.063]
82 of batch value function approximation by exploiting the power of linear programming to express a variety of constraints and borrowing the kernel trick from support vector machines. [sent-179, score-0.6]
83 All three formulations were able to learn and generalize well on difficult synthetic maze problems. [sent-180, score-0.602]
84 The advantage formulation is easier and more reliable to train, probably because it places fewer constraints on the value function approximation. [sent-181, score-0.382]
85 Figure 2: Comparison of the total regret (optimal total reward - attained total reward) summed over all 100 starting states for the three formulations as a function of the number of start states in S0. [sent-193, score-1.098]
86 The three error bars represent the performance of the supervised, Bellman, and advantage formulations (left-to-right). [sent-194, score-0.431]
87 Average optimal total reward on these problems is 1306. [sent-196, score-0.189]
88 The random policy receives a total reward of -14,475. [sent-197, score-0.361]
wordName wordTfidf (topN-words)
[('sj', 0.28), ('formulations', 0.271), ('policy', 0.249), ('slack', 0.194), ('lp', 0.188), ('states', 0.187), ('bellman', 0.186), ('maze', 0.162), ('action', 0.159), ('formulation', 0.141), ('ao', 0.139), ('iter', 0.136), ('scheduling', 0.135), ('cpu', 0.135), ('actions', 0.127), ('trick', 0.123), ('tie', 0.118), ('approximator', 0.108), ('minimize', 0.105), ('ajx', 0.102), ('laj', 0.102), ('shuttle', 0.102), ('ej', 0.1), ('state', 0.1), ('advantage', 0.097), ('bad', 0.093), ('seeks', 0.09), ('kernel', 0.089), ('programming', 0.086), ('agent', 0.085), ('rewards', 0.083), ('aj', 0.083), ('visited', 0.08), ('reward', 0.078), ('subject', 0.074), ('sv', 0.072), ('mdp', 0.068), ('adv', 0.068), ('backgammon', 0.068), ('bel', 0.068), ('czl', 0.068), ('representable', 0.068), ('saxena', 0.068), ('utgoff', 0.068), ('fit', 0.066), ('constraint', 0.065), ('three', 0.063), ('terminal', 0.062), ('start', 0.061), ('value', 0.061), ('supervised', 0.06), ('ajk', 0.059), ('backed', 0.059), ('corvallis', 0.059), ('moll', 0.059), ('oregon', 0.059), ('perkins', 0.059), ('regret', 0.059), ('generalize', 0.058), ('starting', 0.056), ('batch', 0.056), ('mdps', 0.056), ('penalty', 0.056), ('support', 0.053), ('training', 0.05), ('baird', 0.05), ('prefers', 0.05), ('approximation', 0.049), ('constraints', 0.049), ('employed', 0.049), ('difficult', 0.048), ('tesauro', 0.047), ('sup', 0.047), ('dot', 0.044), ('reinforcement', 0.043), ('satisfied', 0.043), ('executing', 0.043), ('optimal', 0.043), ('walk', 0.041), ('objective', 0.04), ('norm', 0.038), ('zhang', 0.038), ('trajectories', 0.038), ('inferior', 0.038), ('vi', 0.038), ('iteration', 0.038), ('searching', 0.037), ('barto', 0.037), ('finds', 0.036), ('greedy', 0.036), ('parameterized', 0.036), ('combinatorial', 0.034), ('fitting', 0.034), ('virtually', 0.034), ('total', 0.034), ('function', 0.034), ('problems', 0.034), ('give', 0.033), ('find', 0.033), ('paths', 0.032)]
simIndex simValue paperId paperTitle
same-paper 1 1.0000005 40 nips-2001-Batch Value Function Approximation via Support Vectors
Author: Thomas G. Dietterich, Xin Wang
Abstract: We present three ways of combining linear programming with the kernel trick to find value function approximations for reinforcement learning. One formulation is based on SVM regression; the second is based on the Bellman equation; and the third seeks only to ensure that good moves have an advantage over bad moves. All formulations attempt to minimize the number of support vectors while fitting the data. Experiments in a difficult, synthetic maze problem show that all three formulations give excellent performance, but the advantage formulation is much easier to train. Unlike policy gradient methods, the kernel methods described here can easily 'adjust the complexity of the function approximator to fit the complexity of the value function. 1
2 0.2521269 128 nips-2001-Multiagent Planning with Factored MDPs
Author: Carlos Guestrin, Daphne Koller, Ronald Parr
Abstract: We present a principled and efficient planning algorithm for cooperative multiagent dynamic systems. A striking feature of our method is that the coordination and communication between the agents is not imposed, but derived directly from the system dynamics and function approximation architecture. We view the entire multiagent system as a single, large Markov decision process (MDP), which we assume can be represented in a factored way using a dynamic Bayesian network (DBN). The action space of the resulting MDP is the joint action space of the entire set of agents. Our approach is based on the use of factored linear value functions as an approximation to the joint value function. This factorization of the value function allows the agents to coordinate their actions at runtime using a natural message passing scheme. We provide a simple and efficient method for computing such an approximate value function by solving a single linear program, whose size is determined by the interaction between the value function structure and the DBN. We thereby avoid the exponential blowup in the state and action space. We show that our approach compares favorably with approaches based on reward sharing. We also show that our algorithm is an efficient alternative to more complicated algorithms even in the single agent case.
3 0.24594039 175 nips-2001-Stabilizing Value Function Approximation with the BFBP Algorithm
Author: Xin Wang, Thomas G. Dietterich
Abstract: We address the problem of non-convergence of online reinforcement learning algorithms (e.g., Q learning and SARSA(A)) by adopting an incremental-batch approach that separates the exploration process from the function fitting process. Our BFBP (Batch Fit to Best Paths) algorithm alternates between an exploration phase (during which trajectories are generated to try to find fragments of the optimal policy) and a function fitting phase (during which a function approximator is fit to the best known paths from start states to terminal states). An advantage of this approach is that batch value-function fitting is a global process, which allows it to address the tradeoffs in function approximation that cannot be handled by local, online algorithms. This approach was pioneered by Boyan and Moore with their GROWSUPPORT and ROUT algorithms. We show how to improve upon their work by applying a better exploration process and by enriching the function fitting procedure to incorporate Bellman error and advantage error measures into the objective function. The results show improved performance on several benchmark problems. 1
4 0.23930736 59 nips-2001-Direct value-approximation for factored MDPs
Author: Dale Schuurmans, Relu Patrascu
Abstract: We present a simple approach for computing reasonable policies for factored Markov decision processes (MDPs), when the optimal value function can be approximated by a compact linear form. Our method is based on solving a single linear program that approximates the best linear fit to the optimal value function. By applying an efficient constraint generation procedure we obtain an iterative solution method that tackles concise linear programs. This direct linear programming approach experimentally yields a significant reduction in computation time over approximate value- and policy-iteration methods (sometimes reducing several hours to a few seconds). However, the quality of the solutions produced by linear programming is weaker-usually about twice the approximation error for the same approximating class. Nevertheless, the speed advantage allows one to use larger approximation classes to achieve similar error in reasonable time. 1
5 0.23119552 13 nips-2001-A Natural Policy Gradient
Author: Sham M. Kakade
Abstract: We provide a natural gradient method that represents the steepest descent direction based on the underlying structure of the parameter space. Although gradient methods cannot make large changes in the values of the parameters, we show that the natural gradient is moving toward choosing a greedy optimal action rather than just a better action. These greedy optimal actions are those that would be chosen under one improvement step of policy iteration with approximate, compatible value functions, as defined by Sutton et al. [9]. We then show drastic performance improvements in simple MDPs and in the more challenging MDP of Tetris. 1
7 0.21166278 121 nips-2001-Model-Free Least-Squares Policy Iteration
8 0.1920464 55 nips-2001-Convergence of Optimistic and Incremental Q-Learning
9 0.18634306 36 nips-2001-Approximate Dynamic Programming via Linear Programming
10 0.16472253 67 nips-2001-Efficient Resources Allocation for Markov Decision Processes
11 0.16092041 187 nips-2001-The Steering Approach for Multi-Criteria Reinforcement Learning
12 0.12414894 161 nips-2001-Reinforcement Learning with Long Short-Term Memory
13 0.10278042 94 nips-2001-Incremental Learning and Selective Sampling via Parametric Optimization Framework for SVM
14 0.097701475 126 nips-2001-Motivated Reinforcement Learning
15 0.093576357 8 nips-2001-A General Greedy Approximation Algorithm with Applications
16 0.093113489 148 nips-2001-Predictive Representations of State
17 0.088088363 38 nips-2001-Asymptotic Universality for Learning Curves of Support Vector Machines
18 0.083022669 164 nips-2001-Sampling Techniques for Kernel Methods
19 0.082009479 51 nips-2001-Cobot: A Social Reinforcement Learning Agent
20 0.077859454 146 nips-2001-Playing is believing: The role of beliefs in multi-agent learning
topicId topicWeight
[(0, -0.265), (1, -0.089), (2, 0.427), (3, -0.009), (4, 0.064), (5, 0.174), (6, -0.013), (7, -0.047), (8, -0.007), (9, 0.042), (10, 0.006), (11, -0.016), (12, 0.037), (13, -0.005), (14, 0.038), (15, -0.019), (16, 0.006), (17, 0.018), (18, -0.007), (19, -0.006), (20, 0.066), (21, -0.09), (22, -0.024), (23, -0.132), (24, 0.049), (25, 0.051), (26, 0.099), (27, -0.016), (28, 0.05), (29, -0.006), (30, 0.032), (31, -0.018), (32, 0.018), (33, 0.069), (34, 0.019), (35, 0.077), (36, -0.007), (37, -0.02), (38, 0.02), (39, 0.043), (40, -0.008), (41, 0.0), (42, -0.072), (43, 0.013), (44, 0.042), (45, -0.022), (46, -0.109), (47, 0.01), (48, -0.006), (49, -0.039)]
simIndex simValue paperId paperTitle
same-paper 1 0.97191828 40 nips-2001-Batch Value Function Approximation via Support Vectors
Author: Thomas G. Dietterich, Xin Wang
Abstract: We present three ways of combining linear programming with the kernel trick to find value function approximations for reinforcement learning. One formulation is based on SVM regression; the second is based on the Bellman equation; and the third seeks only to ensure that good moves have an advantage over bad moves. All formulations attempt to minimize the number of support vectors while fitting the data. Experiments in a difficult, synthetic maze problem show that all three formulations give excellent performance, but the advantage formulation is much easier to train. Unlike policy gradient methods, the kernel methods described here can easily 'adjust the complexity of the function approximator to fit the complexity of the value function. 1
2 0.87968349 175 nips-2001-Stabilizing Value Function Approximation with the BFBP Algorithm
Author: Xin Wang, Thomas G. Dietterich
Abstract: We address the problem of non-convergence of online reinforcement learning algorithms (e.g., Q learning and SARSA(A)) by adopting an incremental-batch approach that separates the exploration process from the function fitting process. Our BFBP (Batch Fit to Best Paths) algorithm alternates between an exploration phase (during which trajectories are generated to try to find fragments of the optimal policy) and a function fitting phase (during which a function approximator is fit to the best known paths from start states to terminal states). An advantage of this approach is that batch value-function fitting is a global process, which allows it to address the tradeoffs in function approximation that cannot be handled by local, online algorithms. This approach was pioneered by Boyan and Moore with their GROWSUPPORT and ROUT algorithms. We show how to improve upon their work by applying a better exploration process and by enriching the function fitting procedure to incorporate Bellman error and advantage error measures into the objective function. The results show improved performance on several benchmark problems. 1
3 0.85510272 59 nips-2001-Direct value-approximation for factored MDPs
Author: Dale Schuurmans, Relu Patrascu
Abstract: We present a simple approach for computing reasonable policies for factored Markov decision processes (MDPs), when the optimal value function can be approximated by a compact linear form. Our method is based on solving a single linear program that approximates the best linear fit to the optimal value function. By applying an efficient constraint generation procedure we obtain an iterative solution method that tackles concise linear programs. This direct linear programming approach experimentally yields a significant reduction in computation time over approximate value- and policy-iteration methods (sometimes reducing several hours to a few seconds). However, the quality of the solutions produced by linear programming is weaker-usually about twice the approximation error for the same approximating class. Nevertheless, the speed advantage allows one to use larger approximation classes to achieve similar error in reasonable time. 1
4 0.84795296 55 nips-2001-Convergence of Optimistic and Incremental Q-Learning
Author: Eyal Even-dar, Yishay Mansour
Abstract: Vie sho,v the convergence of tV/O deterministic variants of Qlearning. The first is the widely used optimistic Q-learning, which initializes the Q-values to large initial values and then follows a greedy policy with respect to the Q-values. We show that setting the initial value sufficiently large guarantees the converges to an Eoptimal policy. The second is a new and novel algorithm incremental Q-learning, which gradually promotes the values of actions that are not taken. We show that incremental Q-learning converges, in the limit, to the optimal policy. Our incremental Q-learning algorithm can be viewed as derandomization of the E-greedy Q-learning. 1
5 0.83953077 121 nips-2001-Model-Free Least-Squares Policy Iteration
Author: Michail G. Lagoudakis, Ronald Parr
Abstract: We propose a new approach to reinforcement learning which combines least squares function approximation with policy iteration. Our method is model-free and completely off policy. We are motivated by the least squares temporal difference learning algorithm (LSTD), which is known for its efficient use of sample experiences compared to pure temporal difference algorithms. LSTD is ideal for prediction problems, however it heretofore has not had a straightforward application to control problems. Moreover, approximations learned by LSTD are strongly influenced by the visitation distribution over states. Our new algorithm, Least Squares Policy Iteration (LSPI) addresses these issues. The result is an off-policy method which can use (or reuse) data collected from any source. We have tested LSPI on several problems, including a bicycle simulator in which it learns to guide the bicycle to a goal efficiently by merely observing a relatively small number of completely random trials.
6 0.80153382 128 nips-2001-Multiagent Planning with Factored MDPs
7 0.76208502 36 nips-2001-Approximate Dynamic Programming via Linear Programming
9 0.68971384 67 nips-2001-Efficient Resources Allocation for Markov Decision Processes
10 0.68752265 13 nips-2001-A Natural Policy Gradient
11 0.62078297 177 nips-2001-Switch Packet Arbitration via Queue-Learning
12 0.59804994 187 nips-2001-The Steering Approach for Multi-Criteria Reinforcement Learning
13 0.52719778 126 nips-2001-Motivated Reinforcement Learning
14 0.51207638 161 nips-2001-Reinforcement Learning with Long Short-Term Memory
15 0.50828189 94 nips-2001-Incremental Learning and Selective Sampling via Parametric Optimization Framework for SVM
16 0.4681752 148 nips-2001-Predictive Representations of State
17 0.41191539 51 nips-2001-Cobot: A Social Reinforcement Learning Agent
18 0.36107427 38 nips-2001-Asymptotic Universality for Learning Curves of Support Vector Machines
19 0.34380853 115 nips-2001-Linear-time inference in Hierarchical HMMs
20 0.33226848 7 nips-2001-A Dynamic HMM for On-line Segmentation of Sequential Data
topicId topicWeight
[(14, 0.045), (16, 0.146), (17, 0.067), (19, 0.027), (27, 0.088), (30, 0.043), (38, 0.015), (42, 0.042), (59, 0.02), (72, 0.236), (79, 0.027), (83, 0.022), (91, 0.132)]
simIndex simValue paperId paperTitle
same-paper 1 0.92596972 40 nips-2001-Batch Value Function Approximation via Support Vectors
Author: Thomas G. Dietterich, Xin Wang
Abstract: We present three ways of combining linear programming with the kernel trick to find value function approximations for reinforcement learning. One formulation is based on SVM regression; the second is based on the Bellman equation; and the third seeks only to ensure that good moves have an advantage over bad moves. All formulations attempt to minimize the number of support vectors while fitting the data. Experiments in a difficult, synthetic maze problem show that all three formulations give excellent performance, but the advantage formulation is much easier to train. Unlike policy gradient methods, the kernel methods described here can easily 'adjust the complexity of the function approximator to fit the complexity of the value function. 1
2 0.86852068 101 nips-2001-K-Local Hyperplane and Convex Distance Nearest Neighbor Algorithms
Author: Pascal Vincent, Yoshua Bengio
Abstract: Guided by an initial idea of building a complex (non linear) decision surface with maximal local margin in input space, we give a possible geometrical intuition as to why K-Nearest Neighbor (KNN) algorithms often perform more poorly than SVMs on classification tasks. We then propose modified K-Nearest Neighbor algorithms to overcome the perceived problem. The approach is similar in spirit to Tangent Distance, but with invariances inferred from the local neighborhood rather than prior knowledge. Experimental results on real world classification tasks suggest that the modified KNN algorithms often give a dramatic improvement over standard KNN and perform as well or better than SVMs. 1 Motivation The notion of margin for classification tasks has been largely popularized by the success of the Support Vector Machine (SVM) [2, 15] approach. The margin of SVMs has a nice geometric interpretation1: it can be defined informally as (twice) the smallest Euclidean distance between the decision surface and the closest training point. The decision surface produced by the original SVM algorithm is the hyperplane that maximizes this distance while still correctly separating the two classes. While the notion of keeping the largest possible safety margin between the decision surface and the data points seems very reasonable and intuitively appealing, questions arise when extending the approach to building more complex, non-linear decision surfaces. Non-linear SVMs usually use the “kernel trick” to achieve their non-linearity. This conceptually corresponds to first mapping the input into a higher-dimensional feature space with some non-linear transformation and building a maximum-margin hyperplane (a linear decision surface) there. The “trick” is that this mapping is never computed directly, but implicitly induced by a kernel. In this setting, the margin being maximized is still the smallest Euclidean distance between the decision surface and the training points, but this time measured in some strange, sometimes infinite dimensional, kernel-induced feature space rather than the original input space. It is less clear whether maximizing the margin in this new space, is meaningful in general (see [16]). 1 for the purpose of this discussion, we consider the original hard-margin SVM algorithm for two linearly separable classes. A different approach is to try and build a non-linear decision surface with maximal distance to the closest data point as measured directly in input space (as proposed in [14]). We could for instance restrict ourselves to a certain class of decision functions and try to find the function with maximal margin among this class. But let us take this even further. Extending the idea of building a correctly separating non-linear decision surface as far away as possible from the data points, we define the notion of local margin as the Euclidean distance, in input space, between a given point on the decision surface and the closest training point. Now would it be possible to find an algorithm that could produce a decision surface which correctly separates the classes and such that the local margin is everywhere maximal along its surface? Surprisingly, the plain old Nearest Neighbor algorithm (1NN) [5] does precisely this! So why does 1NN in practice often perform worse than SVMs? One typical explanation, is that it has too much capacity, compared to SVM, that the class of function it can produce is too rich. 
But, considering it has infinite capacity (VC-dimension), 1NN is still performing quite well... This study is an attempt to better understand what is happening, based on geometrical intuition, and to derive an improved Nearest Neighbor algorithm from this understanding. 2 Fixing a broken Nearest Neighbor algorithm 2.1 Setting and definitions The setting is that of a classical classification problem in (the input space). wx u r i q ©
3 0.86763144 99 nips-2001-Intransitive Likelihood-Ratio Classifiers
Author: Jeff Bilmes, Gang Ji, Marina Meila
Abstract: In this work, we introduce an information-theoretic based correction term to the likelihood ratio classification method for multiple classes. Under certain conditions, the term is sufficient for optimally correcting the difference between the true and estimated likelihood ratio, and we analyze this in the Gaussian case. We find that the new correction term significantly improves the classification results when tested on medium vocabulary speech recognition tasks. Moreover, the addition of this term makes the class comparisons analogous to an intransitive game and we therefore use several tournament-like strategies to deal with this issue. We find that further small improvements are obtained by using an appropriate tournament. Lastly, we find that intransitivity appears to be a good measure of classification confidence.
4 0.86578912 53 nips-2001-Constructing Distributed Representations Using Additive Clustering
Author: Wheeler Ruml
Abstract: If the promise of computational modeling is to be fully realized in higherlevel cognitive domains such as language processing, principled methods must be developed to construct the semantic representations used in such models. In this paper, we propose the use of an established formalism from mathematical psychology, additive clustering, as a means of automatically constructing binary representations for objects using only pairwise similarity data. However, existing methods for the unsupervised learning of additive clustering models do not scale well to large problems. We present a new algorithm for additive clustering, based on a novel heuristic technique for combinatorial optimization. The algorithm is simpler than previous formulations and makes fewer independence assumptions. Extensive empirical tests on both human and synthetic data suggest that it is more effective than previous methods and that it also scales better to larger problems. By making additive clustering practical, we take a significant step toward scaling connectionist models beyond hand-coded examples.
5 0.85024828 21 nips-2001-A Variational Approach to Learning Curves
Author: Dörthe Malzahn, Manfred Opper
Abstract: We combine the replica approach from statistical physics with a variational approach to analyze learning curves analytically. We apply the method to Gaussian process regression. As a main result we derive approximative relations between empirical error measures, the generalization error and the posterior variance.
6 0.8226878 28 nips-2001-Adaptive Nearest Neighbor Classification Using Support Vector Machines
7 0.80907881 97 nips-2001-Information-Geometrical Significance of Sparsity in Gallager Codes
8 0.80691379 175 nips-2001-Stabilizing Value Function Approximation with the BFBP Algorithm
9 0.788445 29 nips-2001-Adaptive Sparseness Using Jeffreys Prior
10 0.78828561 122 nips-2001-Model Based Population Tracking and Automatic Detection of Distribution Changes
11 0.7878173 69 nips-2001-Escaping the Convex Hull with Extrapolated Vector Machines
12 0.77244264 1 nips-2001-(Not) Bounding the True Error
14 0.76351255 195 nips-2001-Variance Reduction Techniques for Gradient Estimates in Reinforcement Learning
15 0.7555263 61 nips-2001-Distribution of Mutual Information
16 0.75534552 79 nips-2001-Gaussian Process Regression with Mismatched Models
17 0.75466287 16 nips-2001-A Parallel Mixture of SVMs for Very Large Scale Problems
18 0.74970722 25 nips-2001-Active Learning in the Drug Discovery Process
19 0.74914229 128 nips-2001-Multiagent Planning with Factored MDPs
20 0.74659359 155 nips-2001-Quantizing Density Estimators