nips nips2001 nips2001-59 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Dale Schuurmans, Relu Patrascu
Abstract: We present a simple approach for computing reasonable policies for factored Markov decision processes (MDPs), when the optimal value function can be approximated by a compact linear form. Our method is based on solving a single linear program that approximates the best linear fit to the optimal value function. By applying an efficient constraint generation procedure we obtain an iterative solution method that tackles concise linear programs. This direct linear programming approach experimentally yields a significant reduction in computation time over approximate value- and policy-iteration methods (sometimes reducing several hours to a few seconds). However, the quality of the solutions produced by linear programming is weaker: usually about twice the approximation error for the same approximating class. Nevertheless, the speed advantage allows one to use larger approximation classes to achieve similar error in reasonable time. 1
Reference: text
sentIndex sentText sentNum sentScore
1 Direct value-approximation for factored MDPs Dale Schuurmans and Relu Patrascu Department of Computer Science University of Waterloo {dale, rpatrasc}@cs. [sent-1, score-0.351]
2 ca Abstract We present a simple approach for computing reasonable policies for factored Markov decision processes (MDPs), when the optimal value function can be approximated by a compact linear form. [sent-3, score-0.857]
3 Our method is based on solving a single linear program that approximates the best linear fit to the optimal value function. [sent-4, score-0.428]
4 By applying an efficient constraint generation procedure we obtain an iterative solution method that tackles concise linear programs. [sent-5, score-0.34]
5 This direct linear programming approach experimentally yields a significant reduction in computation time over approximate value- and policy-iteration methods (sometimes reducing several hours to a few seconds). [sent-6, score-0.442]
6 However, the quality of the solutions produced by linear programming is weaker: usually about twice the approximation error for the same approximating class. [sent-7, score-0.342]
7 Nevertheless, the speed advantage allows one to use larger approximation classes to achieve similar error in reasonable time. [sent-8, score-0.059]
8 1 Introduction Markov decision processes (MDPs) form a foundation for control in uncertain and stochastic environments and reinforcement learning. [sent-9, score-0.072]
9 The real goal is to achieve solution methods that scale up reasonably in the size of the state description, not the size of the state space itself (which is usually either exponential or infinite). [sent-12, score-0.301]
10 There are two basic premises on which solution methods can scale up: (1) exploiting structure in the MDP model itself (i.e. [sent-13, score-0.104]
11 structure in the reward function and the state transition model); and (2) exploiting structure in an approximate representation of the optimal value function (or policy). [sent-15, score-0.599]
12 Even then, it is surprisingly difficult to formulate an optimization method that can handle large state descriptions and yet simultaneously produce value functions or policies with small approximation errors, or errors that can be bounded tightly. [sent-17, score-0.412]
13 Here we pursue approximately optimal policies based on a simple direct linear programming approach. [sent-19, score-0.492]
14 Specifically, the idea is to approximate the optimal value function by formulating a single linear program and exploiting structure in the MDP and the value function approximation to solve this linear program efficiently. [sent-20, score-0.854]
15 2 Preliminaries We consider MDPs with finite state and action spaces, where the goal is to maximize infinite-horizon discounted reward. [sent-21, score-0.271]
16 In this paper, states will be represented by vectors x of length n, where for simplicity we assume the state variables x_1, ..., x_n are in {0, 1}; hence the total number of states is N = 2^n. [sent-25, score-0.045]
18 The problem is to determine an optimal control policy π*: X → A that achieves maximum expected future discounted reward in every state. [sent-31, score-0.542]
19 To understand the standard solution methods it is useful to define some auxiliary concepts. [sent-32, score-0.072]
20 For any policy π, the value function V^π: X → R denotes the expected future discounted reward achieved by policy π in each state x. [sent-33, score-0.968]
21 Define (B* f)(x) = max_a (B_a f)(x); if π* is an optimal policy and V* denotes its value function, we have π*(x) = π_gre(V*)(x). Given these definitions, the three fundamental methods for calculating π* can be formulated as follows. [sent-35, score-0.064]
22 Policy iteration: Start with an arbitrary policy π^(0). Iterate f^(i) ← V^{π^(i)}, π^(i+1) ← π_gre(f^(i)) until π^(i+1) = π^(i). [sent-36, score-0.332]
23 Value iteration: Start with an arbitrary f^(0). Iterate f^(i+1) ← B* f^(i) until ||f^(i+1) − f^(i)||_∞ < tol. Return π* = π_gre(f^(i+1)). [sent-40, score-0.055]
24 Linear programming: Calculate V* = arg min_f Σ_x f(x) subject to f(x) ≥ (B_a f)(x) for all a and x. Return π* = π_gre(V*). [sent-41, score-0.092]
25 All three methods can be shown to produce optimal policies for the given MDP [1, 10], even though they do so in very different ways. [sent-43, score-0.189]
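As a point of reference for these three formulations, the sketch below (an illustration, not the paper's code; the problem size, random seed, and use of scipy are assumptions) solves a tiny explicitly represented MDP by value iteration and by the linear program above, and checks that they agree.

```python
# Sketch: exact solution of a tiny explicit MDP by value iteration and by LP.
# Illustrative only; explicit representations do not scale in n, which is
# precisely the limitation discussed next in the text.
import numpy as np
from scipy.optimize import linprog

rng = np.random.default_rng(0)
N, A, gamma = 8, 3, 0.95                       # states, actions, discount
P = rng.random((A, N, N))
P /= P.sum(axis=2, keepdims=True)              # P[a, x, y] = P(y | x, a)
R = rng.random((N, A))                         # reward R(x, a)

def backup(f):
    """(B_a f)(x) = R(x, a) + gamma * sum_y P(y | x, a) f(y), for all a and x."""
    return R.T + gamma * np.einsum('axy,y->ax', P, f)

# Value iteration: f <- B* f = max_a B_a f until convergence.
f = np.zeros(N)
while True:
    f_new = backup(f).max(axis=0)
    if np.max(np.abs(f_new - f)) < 1e-10:
        break
    f = f_new

# Linear programming: min_f sum_x f(x)  s.t.  f(x) >= (B_a f)(x) for all a, x.
# Row for (a, x): gamma * sum_y P(y|x,a) f(y) - f(x) <= -R(x, a).
A_ub = np.vstack([gamma * P[a] - np.eye(N) for a in range(A)])
b_ub = np.concatenate([-R[:, a] for a in range(A)])
res = linprog(c=np.ones(N), A_ub=A_ub, b_ub=b_ub, bounds=[(None, None)] * N)

print(np.max(np.abs(res.x - f_new)))           # both methods recover V* (difference ~ 0)
```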
26 However, all three approaches share the same fundamental limitation that they do not scale up feasibly in n, the size of the state descriptions. [sent-44, score-0.131]
27 Instead, all of these approaches work with explicit representations of the policies and value functions that are exponential in n. [sent-45, score-0.222]
28 3 Exploiting structure To scale up to large state spaces it is necessary to exploit substantial structure in the MDP while also adopting some form of approximation for the optimal value function and policy. [sent-46, score-0.323]
29 The two specific structural assumptions we consider in this paper are (1) factored MDPs and (2) linear value function approximations. [sent-47, score-0.507]
30 Neither of these two assumptions alone is sufficient to permit efficient policy optimization for large MDPs. [sent-48, score-0.396]
31 However, combined, the two assumptions allow approximate solutions to be obtained for problems involving trillions of states reasonably quickly. [sent-49, score-0.134]
32 3.1 Factored MDPs In the spirit of [7, 8, 6] we define a factored MDP to be one that can be represented compactly by an additive reward function and a factored state transition model. [sent-51, score-1.127]
33 Let X_{a,i} denote the parents of successor variable x'_i in the DBN for action a. [sent-53, score-0.153]
34 To allow efficient optimization we assume the parent set X_{a,i} contains a small number of state variables from the previous time step. [sent-54, score-0.245]
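For concreteness, here is a minimal sketch of one way such a factored transition model can be stored (the single action, parent sets, and CPT values are hypothetical, not the networks used in the paper): each action keeps one small conditional probability table per variable, and P(x'|x, a) is a product of local terms.

```python
# Sketch of a factored transition model: one small CPT per variable and action.
# Hypothetical data structures for illustration; not the paper's implementation.
import itertools

n = 4                                           # state variables x_1..x_n in {0,1}
# parents[a][i] = indices of the parents X_{a,i} of successor variable x'_i.
parents = {0: {i: [i, (i - 1) % n] for i in range(n)}}
# cpt[a][i][parent_assignment] = P(x'_i = 1 | parents, a)
cpt = {0: {i: {pa: 0.9 if pa[0] == 1 else 0.1
               for pa in itertools.product((0, 1), repeat=len(parents[0][i]))}
           for i in range(n)}}

def transition_prob(x, a, x_next):
    """P(x' | x, a) as a product of per-variable CPT entries."""
    p = 1.0
    for i in range(n):
        pa = tuple(x[j] for j in parents[a][i])
        p1 = cpt[a][i][pa]
        p *= p1 if x_next[i] == 1 else 1.0 - p1
    return p

x = (1, 0, 1, 0)
total = sum(transition_prob(x, 0, y) for y in itertools.product((0, 1), repeat=n))
print(total)                                    # sums to 1.0 over all successors
```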
35 Unfortunately, as pointed out in [7], a factored MDP does not by itself yield a feasible method for determining optimal policies. [sent-56, score-0.526]
36 The main problem is that, even if P and R are factored, the optimal value function generally does not have a compact representation (nor does the optimal policy). [sent-57, score-0.361]
37 Therefore, obtaining an exact solution appears to require a return to explicit representations. [sent-58, score-0.099]
38 However, it turns out that the factored MDP representation interacts very well with linear value function approximations. [sent-59, score-0.553]
39 3.2 Linear approximation One of the central tenets of scaling up is to approximate the optimal value function rather than calculate it exactly. [sent-61, score-0.314]
40 However, the simplest of these is generalized linear functions, which is the form we investigate below. [sent-64, score-0.092]
41 In this case, we consider functions of the form f(x) = Σ_{j=1}^k w_j b_j(x_j), where b_1, ..., b_k are a fixed set of basis functions and x_j denotes the variables on which basis b_j depends. [sent-65, score-0.277]
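A minimal sketch of such a generalized linear value function (the particular basis functions and weights below are made up for illustration): each basis looks only at its own small scope of variables.

```python
# Sketch: f(x) = sum_j w_j b_j(x_j), where each basis b_j looks only at a small
# scope of variables.  Basis choices and weights are illustrative.
basis = [
    (lambda z: 1.0, []),                        # constant basis function
    (lambda z: float(z[0]), [0]),               # indicator on variable x_1
    (lambda z: float(z[0] and z[1]), [1, 2]),   # pair indicator on (x_2, x_3)
]
w = [0.5, 2.0, 1.0]

def f(x):
    return sum(wj * bj(tuple(x[i] for i in scope)) for wj, (bj, scope) in zip(w, basis))

print(f((1, 1, 1, 0)))                          # 0.5 + 2.0 + 1.0 = 3.5
```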
42 Combining linear functions with factored MDPs provides many opportunities for feasible approximation. [sent-66, score-0.587]
43 The first main benefit of combining linear approximation with factored MDPs is that applying the backup operator B_a to a linear function results in a compact representation of the action-value function. [sent-67, score-0.858]
44 The ability to represent the state-action value function in a compact linear form immediately provides a feasible implementation of the greedy policy for f, since π_gre(f)(x) = argmax_a g(x, a) by definition of π_gre, and g(x, a) is efficiently determinable for each x and a. [sent-70, score-0.853]
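The following sketch makes this concrete under the representation assumptions of the earlier snippets (factored CPTs keyed by parent assignments, basis functions given as (function, scope) pairs; all helper names are hypothetical): because each b_j depends only on its own scope and the transitions factor, the expectation inside (B_a f)(x) enumerates assignments of that scope only, so g(x, a) and the greedy action are computable without touching the full state space.

```python
# Sketch: g(x, a) = (B_a f)(x) = R(x, a) + gamma * sum_j w_j E[b_j(x'_j) | x, a],
# computed compactly by enumerating only each basis's own variables.
import itertools

gamma = 0.95

def expected_basis(bj, scope, x, a, parents, cpt):
    """E[b_j(x'_scope) | x, a] under a factored transition model."""
    total = 0.0
    for z in itertools.product((0, 1), repeat=len(scope)):
        p = 1.0
        for zi, i in zip(z, scope):
            pa = tuple(x[j] for j in parents[a][i])
            p1 = cpt[a][i][pa]                  # P(x'_i = 1 | parents, a)
            p *= p1 if zi == 1 else 1.0 - p1
        total += p * bj(z)
    return total

def g(x, a, w, basis, reward, parents, cpt):
    """Compact action-value g(x, a); reward(x, a) is assumed additive/local."""
    return reward(x, a) + gamma * sum(
        wj * expected_basis(bj, scope, x, a, parents, cpt)
        for wj, (bj, scope) in zip(w, basis))

def greedy_action(x, actions, w, basis, reward, parents, cpt):
    """pi_gre(f)(x) = argmax_a g(x, a), computable state by state."""
    return max(actions, key=lambda a: g(x, a, w, basis, reward, parents, cpt))
```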
45 However, it turns out that this is not enough to permit feasible forms of approximate policy- and value-iteration to be easily implemented. [sent-71, score-0.277]
46 The main problem is that even though B_a f has a factored form for fixed a, B* f does not, and therefore neither does π_gre(f). [sent-72, score-0.46]
47 In fact, even if a policy π were concisely represented, B_π f would not necessarily have a compact form, because π usually depends on all the state variables and thus P(x'|x, π(x)) = Π_{i=1}^n P(x'_i | x_{π(x),i}) becomes a product of terms that depend on all the state variables. [sent-73, score-0.805]
48 Here [8, 6] introduce an additional assumption that there is a special "default" action a_d for the MDP such that all other actions a have a factored transition model P(·|·, a) that differs from P(·|·, a_d) only on a small number of state variables. [sent-74, score-0.664]
49 This allows the greedy policy π_gre(f) to have a compact form and moreover allows B_{π_gre(f)} f to be concisely represented. [sent-75, score-0.546]
50 With some effort, it then becomes possible to formulate feasible versions of approximate policy- and value-iteration [8, 6]. [sent-76, score-0.195]
51 Approximate policy iteration: Start with the default policy π^(0)(x) = a_d. [sent-77, score-0.662]
52 Iterate f^(i) ← arg min_f max_x |f(x) − (B_{π^(i)} f)(x)|, π^(i+1) ← π_gre(f^(i)) until π^(i+1) = π^(i). [sent-78, score-0.36]
53 Approximate value iteration: Iterate π^(i) ← π_gre(f^(i)), f^(i+1) ← arg min_f max_x |f(x) − (B_{π^(i)} f)(x)| until ||f^(i+1) − f^(i)||_∞ < tol. [sent-80, score-0.213]
54 The most expensive part of these iterative algorithms is determining arg min_f max_x |f(x) − (B_{π^(i)} f)(x)|, which involves solving a linear program min_{w,ε} ε subject to −ε ≤ f(x) − (B_{π^(i)} f)(x) ≤ ε for all x. [sent-82, score-0.618]
55 This linear program is problematic because it involves an exponential number of constraints. [sent-84, score-0.203]
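To make the inner optimization concrete, here is an illustrative brute-force version of that max-norm projection LP for a small, enumerable state space (the callable helpers and the use of scipy are assumptions; the point of the compact encoding discussed next is precisely to avoid this enumeration).

```python
# Sketch: min_{w, eps} eps  s.t.  |f_w(x) - (B_pi f_w)(x)| <= eps  for all x,
# written in weight space by brute-force enumeration of {0,1}^n.
import itertools
import numpy as np
from scipy.optimize import linprog

def maxnorm_projection(n, k, basis_eval, exp_basis_pi, reward_pi, gamma=0.95):
    """basis_eval(x) -> [b_j(x_j)]_j;  exp_basis_pi(x) -> [E[b_j(x'_j) | x, pi(x)]]_j;
    reward_pi(x) = R(x, pi(x)).  Then
    f_w(x) - (B_pi f_w)(x) = sum_j w_j (b_j(x_j) - gamma E[b_j(x'_j)|x,pi(x)]) - R(x, pi(x))."""
    rows, rhs = [], []
    for x in itertools.product((0, 1), repeat=n):
        d = np.asarray(basis_eval(x)) - gamma * np.asarray(exp_basis_pi(x))
        r = reward_pi(x)
        rows.append(np.append(d, -1.0))         #  d.w - eps <= r
        rhs.append(r)
        rows.append(np.append(-d, -1.0))        # -d.w - eps <= -r
        rhs.append(-r)
    c = np.zeros(k + 1)
    c[-1] = 1.0                                  # minimize eps
    res = linprog(c, A_ub=np.array(rows), b_ub=np.array(rhs),
                  bounds=[(None, None)] * k + [(0, None)])
    return res.x[:-1], res.x[-1]                 # weights w, achieved max-norm error
```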
56 A central achievement of [6] is to show that this system of constraints can be encoded by an equivalent system of constraints that has a much more compact form. [sent-85, score-0.372]
57 The idea behind this construction is to realize that searching for the max or min of a linear function with a compact basis can be conducted in an organized fashion, and such an organized search can be encoded in an equally concise constraint system. [sent-86, score-0.421]
58 This construction allows approximate solutions to MDPs with up to n = 40 state variables (1 trillion states) to be generated in under 7. [sent-87, score-0.274]
59 It turns out that approximate value iteration is less effective because it takes more iterations to converge, and in fact can diverge in theory [6, 13]. [sent-89, score-0.297]
60 Our main observation is that if one has to solve linear programs to conduct the approximate iterations anyway, then it might be much simpler and more efficient to approximate the linear programming approach directly. [sent-90, score-0.705]
61 In fact, it is well known [1, 2] that this yields a linear program in the basis weights w. [sent-92, score-0.248]
62 However, what had not been previously shown is that, given a factored MDP, an equivalent linear program of feasible size can be formulated. [sent-93, score-0.66]
63 First, one can show that the minimization objective can be encoded compactly: Σ_x f(x) = Σ_x Σ_{j=1}^k w_j b_j(x_j) = Σ_{j=1}^k w_j y_j, where y_j = 2^{n−|x_j|} Σ_{x_j} b_j(x_j). Here the y_j components can be easily precomputed by enumerating assignments to the small sets of variables in the basis functions. [sent-95, score-0.219]
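A short sketch of that precomputation, using the illustrative (function, scope) basis representation from the earlier snippets:

```python
# Sketch: y_j = 2^(n - |x_j|) * sum over assignments of b_j's own variables.
import itertools

def objective_coeffs(n, basis):
    y = []
    for bj, scope in basis:
        s = sum(bj(z) for z in itertools.product((0, 1), repeat=len(scope)))
        y.append(2 ** (n - len(scope)) * s)
    return y

# With the small basis sketched earlier and n = 4:
#   constant basis: 2^4 * 1 = 16;  single indicator: 2^3 * 1 = 8;  pair indicator: 2^2 * 1 = 4.
```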
64 Second, as we have seen, the exponentially many constraints have a structured form. [sent-96, score-0.102]
65 The reward in a state is simply the number of systems that are up, with a bonus reward of 1 if system 1 (the server) is up. [sent-102, score-0.361]
66 We considered the network architectures shown in Figure 1 and used the transition probabilities P(x'_i = 1 | x_i, parent(x_i), a = i) = 0. [sent-106, score-0.066]
67 The first basis functions we considered were just the indicators on each variable x_i plus a constant basis function (as reported in [6]). [sent-113, score-0.164]
68 Our approximate linear programming method is labeled ALP and is compared to the approximate policy iteration strategy API described in [6]. [sent-115, score-0.339]
Figure 1: Experimental results (timings on a 750MHz PIII processor, except 2), reporting run time, number of constraints, and UB Bellman / Rmax for API [6], APIgen, ALP, ALPgen, and ALPgen2 at each network size n (with N = 2^n states).
71 Since we did not have the specific probabilities used in [6] and could only estimate the numbers for API from graphs presented in the paper, this comparison is only meant to be loosely indicative of the general run times of the two methods on such problems. [sent-211, score-0.036]
72 As in [6], our implementation is based on Matlab, using CPLEX to solve the linear programs. [sent-213, score-0.13]
73 Our preliminary results appear to support the hypothesis that direct linear programming can be more efficient than approximate policy iteration on problems of this type. [sent-214, score-0.85]
74 A further advantage of the linear programming approach is that it is simpler to program and involves solving only one LP. [sent-215, score-0.361]
75 More importantly, the direct LP approach does not require the MDP to have a special default action, since the greedy policy can be directly extracted using π_gre(f)(x) = argmax_a g(x, a), and g is easily recoverable from f. [sent-216, score-0.252]
76 Before discussing drawbacks, we note that it is possible to solve the linear program even more efficiently by iteratively generating constraints as needed. [sent-217, score-0.454]
77 Specifically, the procedure ALPgen exploits the feasible search techniques for minimizing linear functions discussed previously to efficiently generate a small set of critical constraints, which is iteratively grown until the final solution is identified; see Figure 2. [sent-219, score-0.351]
78 Figure 2: The ALPgen procedure (each round recomputes g(·, a) = B_a f for each a to represent the greedy policy). The rationale for this procedure is that the main bottleneck in the previous methods is generating the constraints, not solving the linear programs [6]. [sent-225, score-0.571]
79 Since only a small number of constraints are active at a solution, and these are likely to be the most violated near the solution, adding only the most violated constraints appears to be a useful way to proceed. [sent-227, score-0.232]
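Here is a sketch of the constraint generation loop just described (an illustration of the idea, not the authors' Matlab/CPLEX code; the callables, box bound, and tolerances are assumptions): solve the LP over the constraints gathered so far, search for the most violated constraint under each action, add them, and repeat until nothing is violated.

```python
# Sketch of an ALPgen-style constraint generation loop for the direct LP
#   min_w  y.w   s.t.  f_w(x) >= (B_a f_w)(x)  for all x, a.
import numpy as np
from scipy.optimize import linprog

def alpgen(y, actions, constraint_for, most_violated,
           tol=1e-8, max_iters=1000, box=1e6):
    """constraint_for(x, a) returns (row, rhs) with the constraint written as row.w <= rhs;
    most_violated(w, a) returns (x, amount) for the most violated state under action a
    (a brute-force or structured search supplied by the caller).  The box bound keeps
    the intermediate LPs bounded while only a few constraints are present."""
    rows, rhs = [], []
    w = np.zeros(len(y))
    for _ in range(max_iters):
        res = linprog(c=np.asarray(y),
                      A_ub=np.array(rows) if rows else None,
                      b_ub=np.array(rhs) if rhs else None,
                      bounds=[(-box, box)] * len(y))
        w = res.x
        new = []
        for a in actions:
            x, amount = most_violated(w, a)
            if amount > tol:
                new.append(constraint_for(x, a))
        if not new:                      # every constraint satisfied: w is optimal
            return w
        rows.extend(r for r, _ in new)
        rhs.extend(b for _, b in new)
    return w
```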
81 Indeed, Figure 1 shows that ALPgen produces the same approximate solutions as ALP in a tiny fraction of the time. [sent-228, score-0.089]
82 In the most extreme case ALPgen produces an approximate solution in 7 seconds while other methods take several hours on the same problem. [sent-229, score-0.178]
83 The reason for this speedup is evident in the results showing the number of constraints generated by each method. [sent-230, score-0.102]
84 Further investigation is also required to fully outline the robustness of the constraint generation method. [sent-231, score-0.078]
85 In fact, one cannot guarantee that a greedy constraint generation scheme like the one proposed here will always produce a feasible number of constraints [9]. [sent-232, score-0.343]
86 Nevertheless, the potential benefits of conservatively generating constraints as needed seem to be clear. [sent-233, score-0.137]
87 Of course, the main drawback of the direct linear programming approach over approximate policy iteration is that ALP incurs larger approximation errors than API. [sent-234, score-0.884]
88 5 Bounding approximation error It turns out that neither API nor ALP is guaranteed to return the best linear approximation to the true value function. [sent-235, score-0.416]
89 Nevertheless, it is possible to efficiently calculate bounds on the approximation errors of these methods, again by exploiting the structure of the problem: a well-known result [14] asserts that max_x V*(x) − V^{π_gre(f)}(x) ≤ (2γ/(1−γ)) max_x f(x) − (B* f)(x) (where in our case f ≈ V*). [sent-236, score-0.659]
90 This upper bound can in turn be bounded by a quantity that is feasible to calculate: max_x f(x) − (B* f)(x) = max_x min_a [f(x) − (B_a f)(x)] ≤ min_a max_x [f(x) − (B_a f)(x)]. [sent-237, score-0.532]
91 Thus an upper bound on the error from the optimal value function can be calculated by performing an efficient search for max_x f(x) − (B_a f)(x) for each a. [sent-238, score-0.406]
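A small sketch of that computable bound (brute force over a small state space here; the paper performs the inner max with an efficient structured search, and the callables are assumptions):

```python
# Sketch: UB Bellman = min_a max_x [f(x) - (B_a f)(x)], plus the implied policy-loss
# bound (2*gamma/(1-gamma)) * UB from the inequalities above.
import itertools

def ub_bellman(n, actions, f, Ba_f, gamma=0.95):
    """f(x) evaluates the linear approximation; Ba_f(x, a) evaluates (B_a f)(x)."""
    per_action = []
    for a in actions:
        worst = max(f(x) - Ba_f(x, a) for x in itertools.product((0, 1), repeat=n))
        per_action.append(worst)
    ub = min(per_action)
    return ub, 2 * gamma / (1 - gamma) * ub
```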
92 Figure 1 shows that the measurable error quantity, max_x f(x) − (B_a f)(x) (reported as UB Bellman), is about a factor of two larger for the linear programming approach than for approximate policy iteration on the same basis. [sent-239, score-0.95]
93 In this respect, API appears to have an inherent advantage (although in the limit of an exhaustive basis both approaches converge to the same optimal value). [sent-240, score-0.114]
94 To get an indication of the computational cost required for ALPgen to achieve a similar bound on approximation error, we repeated the same experiments with a larger basis set that included all four indicators between pairs of connected variables. [sent-241, score-0.14]
95 The results for this model are reported as ALPgen2, and Figure 1 shows that, indeed, the bound on approximation error is reduced substantially, but at the predictable cost of a sizable increase in computation time. [sent-242, score-0.059]
96 However, the run times are still appreciably smaller than those of the policy iteration methods. [sent-243, score-0.398]
97 Paradoxically, linear programming seems to offer computational advantages over policy and value iteration in the context of approximation, even though it is widely held to be an inferior solution strategy for explicitly represented MDPs. [sent-244, score-0.793]
98 Computing factored value functions for policies in structured MDPs. [sent-277, score-0.573]
99 Learning and value function approximation in complex decision processes. [sent-306, score-0.16]
100 Tight performance bounds on greedy policies based on imperfect value functions. [sent-311, score-0.241]
wordName wordTfidf (topN-words)
[('factored', 0.351), ('policy', 0.3), ('alp', 0.273), ('alpgen', 0.273), ('maxx', 0.213), ('mdp', 0.191), ('api', 0.19), ('mdps', 0.179), ('apigen', 0.164), ('programming', 0.158), ('state', 0.131), ('compact', 0.124), ('policies', 0.12), ('reward', 0.115), ('program', 0.111), ('feasible', 0.106), ('constraints', 0.102), ('iteration', 0.098), ('parent', 0.096), ('arg', 0.092), ('linear', 0.092), ('approximate', 0.089), ('xj', 0.085), ('action', 0.082), ('xi', 0.079), ('efficiently', 0.076), ('backup', 0.071), ('concise', 0.071), ('optimal', 0.069), ('iterate', 0.066), ('transition', 0.066), ('concisely', 0.065), ('violated', 0.065), ('exploiting', 0.065), ('value', 0.064), ('specifically', 0.062), ('default', 0.062), ('return', 0.06), ('efficient', 0.06), ('approximation', 0.059), ('discounted', 0.058), ('greedy', 0.057), ('bj', 0.057), ('baf', 0.055), ('gre', 0.055), ('minf', 0.055), ('tole', 0.055), ('variables', 0.054), ('direct', 0.053), ('programs', 0.052), ('koller', 0.052), ('hours', 0.05), ('rmax', 0.047), ('minx', 0.047), ('turns', 0.046), ('states', 0.045), ('constraint', 0.045), ('bellman', 0.045), ('basis', 0.045), ('encoded', 0.044), ('dale', 0.043), ('dbn', 0.043), ('gen', 0.043), ('represented', 0.042), ('maxa', 0.04), ('server', 0.04), ('solution', 0.039), ('xl', 0.038), ('fixed', 0.038), ('athena', 0.038), ('enumerating', 0.038), ('ijcai', 0.038), ('compactly', 0.038), ('successor', 0.038), ('yj', 0.038), ('functions', 0.038), ('solve', 0.038), ('decision', 0.037), ('neither', 0.036), ('permit', 0.036), ('loosely', 0.036), ('indicators', 0.036), ('main', 0.035), ('generating', 0.035), ('start', 0.035), ('reinforcement', 0.035), ('scientific', 0.034), ('specifies', 0.034), ('immediately', 0.034), ('actions', 0.034), ('operator', 0.034), ('ba', 0.034), ('generation', 0.033), ('approximating', 0.033), ('parents', 0.033), ('calculate', 0.033), ('define', 0.033), ('arbitrary', 0.032), ('nevertheless', 0.032), ('dynamic', 0.031)]
simIndex simValue paperId paperTitle
same-paper 1 1.0000004 59 nips-2001-Direct value-approximation for factored MDPs
Author: Dale Schuurmans, Relu Patrascu
Abstract: We present a simple approach for computing reasonable policies for factored Markov decision processes (MDPs), when the optimal value function can be approximated by a compact linear form. Our method is based on solving a single linear program that approximates the best linear fit to the optimal value function. By applying an efficient constraint generation procedure we obtain an iterative solution method that tackles concise linear programs. This direct linear programming approach experimentally yields a significant reduction in computation time over approximate value- and policy-iteration methods (sometimes reducing several hours to a few seconds). However, the quality of the solutions produced by linear programming is weaker-usually about twice the approximation error for the same approximating class. Nevertheless, the speed advantage allows one to use larger approximation classes to achieve similar error in reasonable time. 1
2 0.30881134 128 nips-2001-Multiagent Planning with Factored MDPs
Author: Carlos Guestrin, Daphne Koller, Ronald Parr
Abstract: We present a principled and efficient planning algorithm for cooperative multiagent dynamic systems. A striking feature of our method is that the coordination and communication between the agents is not imposed, but derived directly from the system dynamics and function approximation architecture. We view the entire multiagent system as a single, large Markov decision process (MDP), which we assume can be represented in a factored way using a dynamic Bayesian network (DBN). The action space of the resulting MDP is the joint action space of the entire set of agents. Our approach is based on the use of factored linear value functions as an approximation to the joint value function. This factorization of the value function allows the agents to coordinate their actions at runtime using a natural message passing scheme. We provide a simple and efficient method for computing such an approximate value function by solving a single linear program, whose size is determined by the interaction between the value function structure and the DBN. We thereby avoid the exponential blowup in the state and action space. We show that our approach compares favorably with approaches based on reward sharing. We also show that our algorithm is an efficient alternative to more complicated algorithms even in the single agent case.
3 0.26602462 13 nips-2001-A Natural Policy Gradient
Author: Sham M. Kakade
Abstract: We provide a natural gradient method that represents the steepest descent direction based on the underlying structure of the parameter space. Although gradient methods cannot make large changes in the values of the parameters, we show that the natural gradient is moving toward choosing a greedy optimal action rather than just a better action. These greedy optimal actions are those that would be chosen under one improvement step of policy iteration with approximate, compatible value functions, as defined by Sutton et al. [9]. We then show drastic performance improvements in simple MDPs and in the more challenging MDP of Tetris. 1
4 0.25567478 121 nips-2001-Model-Free Least-Squares Policy Iteration
Author: Michail G. Lagoudakis, Ronald Parr
Abstract: We propose a new approach to reinforcement learning which combines least squares function approximation with policy iteration. Our method is model-free and completely off policy. We are motivated by the least squares temporal difference learning algorithm (LSTD), which is known for its efficient use of sample experiences compared to pure temporal difference algorithms. LSTD is ideal for prediction problems, however it heretofore has not had a straightforward application to control problems. Moreover, approximations learned by LSTD are strongly influenced by the visitation distribution over states. Our new algorithm, Least Squares Policy Iteration (LSPI) addresses these issues. The result is an off-policy method which can use (or reuse) data collected from any source. We have tested LSPI on several problems, including a bicycle simulator in which it learns to guide the bicycle to a goal efficiently by merely observing a relatively small number of completely random trials.
5 0.23930736 40 nips-2001-Batch Value Function Approximation via Support Vectors
Author: Thomas G. Dietterich, Xin Wang
Abstract: We present three ways of combining linear programming with the kernel trick to find value function approximations for reinforcement learning. One formulation is based on SVM regression; the second is based on the Bellman equation; and the third seeks only to ensure that good moves have an advantage over bad moves. All formulations attempt to minimize the number of support vectors while fitting the data. Experiments in a difficult, synthetic maze problem show that all three formulations give excellent performance, but the advantage formulation is much easier to train. Unlike policy gradient methods, the kernel methods described here can easily 'adjust the complexity of the function approximator to fit the complexity of the value function. 1
6 0.2256327 36 nips-2001-Approximate Dynamic Programming via Linear Programming
7 0.20639323 67 nips-2001-Efficient Resources Allocation for Markov Decision Processes
8 0.1914089 187 nips-2001-The Steering Approach for Multi-Criteria Reinforcement Learning
9 0.18662383 55 nips-2001-Convergence of Optimistic and Incremental Q-Learning
11 0.14303567 175 nips-2001-Stabilizing Value Function Approximation with the BFBP Algorithm
12 0.11093906 115 nips-2001-Linear-time inference in Hierarchical HMMs
13 0.094596662 117 nips-2001-MIME: Mutual Information Minimization and Entropy Maximization for Bayesian Belief Propagation
14 0.086144567 148 nips-2001-Predictive Representations of State
15 0.083339095 94 nips-2001-Incremental Learning and Selective Sampling via Parametric Optimization Framework for SVM
16 0.073383018 32 nips-2001-An Efficient, Exact Algorithm for Solving Tree-Structured Graphical Games
17 0.073152147 195 nips-2001-Variance Reduction Techniques for Gradient Estimates in Reinforcement Learning
18 0.072529472 126 nips-2001-Motivated Reinforcement Learning
19 0.071122378 8 nips-2001-A General Greedy Approximation Algorithm with Applications
20 0.070051819 132 nips-2001-Novel iteration schemes for the Cluster Variation Method
topicId topicWeight
[(0, -0.243), (1, -0.123), (2, 0.456), (3, -0.014), (4, 0.044), (5, 0.109), (6, -0.022), (7, -0.073), (8, 0.02), (9, 0.04), (10, 0.032), (11, 0.007), (12, 0.073), (13, 0.009), (14, 0.045), (15, -0.079), (16, -0.052), (17, -0.062), (18, -0.012), (19, 0.002), (20, 0.029), (21, -0.032), (22, -0.021), (23, -0.123), (24, -0.039), (25, -0.047), (26, 0.0), (27, -0.037), (28, 0.082), (29, 0.05), (30, -0.057), (31, -0.021), (32, -0.019), (33, 0.097), (34, 0.005), (35, 0.102), (36, -0.154), (37, -0.061), (38, 0.013), (39, 0.01), (40, 0.018), (41, 0.031), (42, -0.012), (43, -0.043), (44, -0.099), (45, -0.066), (46, 0.018), (47, -0.01), (48, 0.014), (49, -0.058)]
simIndex simValue paperId paperTitle
same-paper 1 0.97169256 59 nips-2001-Direct value-approximation for factored MDPs
Author: Dale Schuurmans, Relu Patrascu
Abstract: We present a simple approach for computing reasonable policies for factored Markov decision processes (MDPs), when the optimal value function can be approximated by a compact linear form. Our method is based on solving a single linear program that approximates the best linear fit to the optimal value function. By applying an efficient constraint generation procedure we obtain an iterative solution method that tackles concise linear programs. This direct linear programming approach experimentally yields a significant reduction in computation time over approximate value- and policy-iteration methods (sometimes reducing several hours to a few seconds). However, the quality of the solutions produced by linear programming is weaker-usually about twice the approximation error for the same approximating class. Nevertheless, the speed advantage allows one to use larger approximation classes to achieve similar error in reasonable time. 1
2 0.85459828 121 nips-2001-Model-Free Least-Squares Policy Iteration
Author: Michail G. Lagoudakis, Ronald Parr
Abstract: We propose a new approach to reinforcement learning which combines least squares function approximation with policy iteration. Our method is model-free and completely off policy. We are motivated by the least squares temporal difference learning algorithm (LSTD), which is known for its efficient use of sample experiences compared to pure temporal difference algorithms. LSTD is ideal for prediction problems, however it heretofore has not had a straightforward application to control problems. Moreover, approximations learned by LSTD are strongly influenced by the visitation distribution over states. Our new algorithm, Least Squares Policy Iteration (LSPI) addresses these issues. The result is an off-policy method which can use (or reuse) data collected from any source. We have tested LSPI on several problems, including a bicycle simulator in which it learns to guide the bicycle to a goal efficiently by merely observing a relatively small number of completely random trials.
3 0.84943664 128 nips-2001-Multiagent Planning with Factored MDPs
Author: Carlos Guestrin, Daphne Koller, Ronald Parr
Abstract: We present a principled and efficient planning algorithm for cooperative multiagent dynamic systems. A striking feature of our method is that the coordination and communication between the agents is not imposed, but derived directly from the system dynamics and function approximation architecture. We view the entire multiagent system as a single, large Markov decision process (MDP), which we assume can be represented in a factored way using a dynamic Bayesian network (DBN). The action space of the resulting MDP is the joint action space of the entire set of agents. Our approach is based on the use of factored linear value functions as an approximation to the joint value function. This factorization of the value function allows the agents to coordinate their actions at runtime using a natural message passing scheme. We provide a simple and efficient method for computing such an approximate value function by solving a single linear program, whose size is determined by the interaction between the value function structure and the DBN. We thereby avoid the exponential blowup in the state and action space. We show that our approach compares favorably with approaches based on reward sharing. We also show that our algorithm is an efficient alternative to more complicated algorithms even in the single agent case.
4 0.83429539 36 nips-2001-Approximate Dynamic Programming via Linear Programming
Author: Daniela Farias, Benjamin V. Roy
Abstract: The curse of dimensionality gives rise to prohibitive computational requirements that render infeasible the exact solution of large- scale stochastic control problems. We study an efficient method based on linear programming for approximating solutions to such problems. The approach
5 0.81108075 40 nips-2001-Batch Value Function Approximation via Support Vectors
Author: Thomas G. Dietterich, Xin Wang
Abstract: We present three ways of combining linear programming with the kernel trick to find value function approximations for reinforcement learning. One formulation is based on SVM regression; the second is based on the Bellman equation; and the third seeks only to ensure that good moves have an advantage over bad moves. All formulations attempt to minimize the number of support vectors while fitting the data. Experiments in a difficult, synthetic maze problem show that all three formulations give excellent performance, but the advantage formulation is much easier to train. Unlike policy gradient methods, the kernel methods described here can easily 'adjust the complexity of the function approximator to fit the complexity of the value function. 1
6 0.76111364 67 nips-2001-Efficient Resources Allocation for Markov Decision Processes
7 0.69854224 55 nips-2001-Convergence of Optimistic and Incremental Q-Learning
8 0.68837827 175 nips-2001-Stabilizing Value Function Approximation with the BFBP Algorithm
10 0.67341089 13 nips-2001-A Natural Policy Gradient
11 0.5902248 187 nips-2001-The Steering Approach for Multi-Criteria Reinforcement Learning
12 0.56014657 177 nips-2001-Switch Packet Arbitration via Queue-Learning
13 0.40344515 115 nips-2001-Linear-time inference in Hierarchical HMMs
14 0.34831199 161 nips-2001-Reinforcement Learning with Long Short-Term Memory
15 0.3378714 120 nips-2001-Minimax Probability Machine
16 0.33628038 148 nips-2001-Predictive Representations of State
17 0.33557171 146 nips-2001-Playing is believing: The role of beliefs in multi-agent learning
18 0.33390948 94 nips-2001-Incremental Learning and Selective Sampling via Parametric Optimization Framework for SVM
19 0.32753554 183 nips-2001-The Infinite Hidden Markov Model
20 0.3204217 117 nips-2001-MIME: Mutual Information Minimization and Entropy Maximization for Bayesian Belief Propagation
topicId topicWeight
[(14, 0.018), (17, 0.097), (19, 0.03), (27, 0.122), (30, 0.051), (38, 0.02), (45, 0.233), (59, 0.018), (72, 0.084), (79, 0.065), (83, 0.033), (91, 0.147)]
simIndex simValue paperId paperTitle
same-paper 1 0.85766405 59 nips-2001-Direct value-approximation for factored MDPs
Author: Dale Schuurmans, Relu Patrascu
Abstract: We present a simple approach for computing reasonable policies for factored Markov decision processes (MDPs), when the optimal value function can be approximated by a compact linear form. Our method is based on solving a single linear program that approximates the best linear fit to the optimal value function. By applying an efficient constraint generation procedure we obtain an iterative solution method that tackles concise linear programs. This direct linear programming approach experimentally yields a significant reduction in computation time over approximate value- and policy-iteration methods (sometimes reducing several hours to a few seconds). However, the quality of the solutions produced by linear programming is weaker-usually about twice the approximation error for the same approximating class. Nevertheless, the speed advantage allows one to use larger approximation classes to achieve similar error in reasonable time. 1
2 0.76886559 150 nips-2001-Probabilistic Inference of Hand Motion from Neural Activity in Motor Cortex
Author: Yun Gao, Michael J. Black, Elie Bienenstock, Shy Shoham, John P. Donoghue
Abstract: Statistical learning and probabilistic inference techniques are used to infer the hand position of a subject from multi-electrode recordings of neural activity in motor cortex. First, an array of electrodes provides training data of neural firing conditioned on hand kinematics. We learn a nonparametric representation of this firing activity using a Bayesian model and rigorously compare it with previous models using cross-validation. Second, we infer a posterior probability distribution over hand motion conditioned on a sequence of neural test data using Bayesian inference. The learned firing models of multiple cells are used to define a nonGaussian likelihood term which is combined with a prior probability for the kinematics. A particle filtering method is used to represent, update, and propagate the posterior distribution over time. The approach is compared with traditional linear filtering methods; the results suggest that it may be appropriate for neural prosthetic applications.
3 0.7031557 13 nips-2001-A Natural Policy Gradient
Author: Sham M. Kakade
Abstract: We provide a natural gradient method that represents the steepest descent direction based on the underlying structure of the parameter space. Although gradient methods cannot make large changes in the values of the parameters, we show that the natural gradient is moving toward choosing a greedy optimal action rather than just a better action. These greedy optimal actions are those that would be chosen under one improvement step of policy iteration with approximate, compatible value functions, as defined by Sutton et al. [9]. We then show drastic performance improvements in simple MDPs and in the more challenging MDP of Tetris. 1
Author: Gregory Z. Grudic, Lyle H. Ungar
Abstract: We address two open theoretical questions in Policy Gradient Reinforcement Learning. The first concerns the efficacy of using function approximation to represent the state action value function, . Theory is presented showing that linear function approximation representations of can degrade the rate of convergence of performance gradient estimates by a factor of relative to when no function approximation of is used, where is the number of possible actions and is the number of basis functions in the function approximation representation. The second concerns the use of a bias term in estimating the state action value function. Theory is presented showing that a non-zero bias term can improve the rate of convergence of performance gradient estimates by , where is the number of possible actions. Experimental evidence is presented showing that these theoretical results lead to significant improvement in the convergence properties of Policy Gradient Reinforcement Learning algorithms. ¤ ¨ ¦ ¢ ©§¥¤£¡ ¦ ¤ ¨ £¡ ¨ ¤¢ ¢
5 0.69437683 128 nips-2001-Multiagent Planning with Factored MDPs
Author: Carlos Guestrin, Daphne Koller, Ronald Parr
Abstract: We present a principled and efficient planning algorithm for cooperative multiagent dynamic systems. A striking feature of our method is that the coordination and communication between the agents is not imposed, but derived directly from the system dynamics and function approximation architecture. We view the entire multiagent system as a single, large Markov decision process (MDP), which we assume can be represented in a factored way using a dynamic Bayesian network (DBN). The action space of the resulting MDP is the joint action space of the entire set of agents. Our approach is based on the use of factored linear value functions as an approximation to the joint value function. This factorization of the value function allows the agents to coordinate their actions at runtime using a natural message passing scheme. We provide a simple and efficient method for computing such an approximate value function by solving a single linear program, whose size is determined by the interaction between the value function structure and the DBN. We thereby avoid the exponential blowup in the state and action space. We show that our approach compares favorably with approaches based on reward sharing. We also show that our algorithm is an efficient alternative to more complicated algorithms even in the single agent case.
6 0.69294357 135 nips-2001-On Spectral Clustering: Analysis and an algorithm
7 0.69214767 121 nips-2001-Model-Free Least-Squares Policy Iteration
8 0.68765247 7 nips-2001-A Dynamic HMM for On-line Segmentation of Sequential Data
9 0.6861825 36 nips-2001-Approximate Dynamic Programming via Linear Programming
10 0.68472886 88 nips-2001-Grouping and dimensionality reduction by locally linear embedding
11 0.68257284 132 nips-2001-Novel iteration schemes for the Cluster Variation Method
12 0.68219674 89 nips-2001-Grouping with Bias
13 0.67973316 55 nips-2001-Convergence of Optimistic and Incremental Q-Learning
14 0.67615974 195 nips-2001-Variance Reduction Techniques for Gradient Estimates in Reinforcement Learning
15 0.67111468 143 nips-2001-PAC Generalization Bounds for Co-training
16 0.66901648 95 nips-2001-Infinite Mixtures of Gaussian Process Experts
17 0.66879427 61 nips-2001-Distribution of Mutual Information
18 0.66810656 171 nips-2001-Spectral Relaxation for K-means Clustering
19 0.66606069 161 nips-2001-Reinforcement Learning with Long Short-Term Memory
20 0.66538519 127 nips-2001-Multi Dimensional ICA to Separate Correlated Sources