nips nips2001 nips2001-36 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Daniela Farias, Benjamin V. Roy
Abstract: The curse of dimensionality gives rise to prohibitive computational requirements that render infeasible the exact solution of large-scale stochastic control problems. We study an efficient method based on linear programming for approximating solutions to such problems. The approach
Reference: text
sentIndex sentText sentNum sentScore
1 Abstract The curse of dimensionality gives rise to prohibitive computational requirements that render infeasible the exact solution of large-scale stochastic control problems. [sent-4, score-0.499]
2 We study an efficient method based on linear programming for approximating solutions to such problems. [sent-5, score-0.417]
3 The approach "fits" a linear combination of pre-selected basis functions to the dynamic programming cost-to-go function. [sent-6, score-0.871]
4 We develop bounds on the approximation error and present experimental results in the domain of queueing network control, providing empirical support for the methodology. [sent-7, score-0.135]
5 1 Introduction Dynamic programming offers a unified approach to solving problems of stochastic control. [sent-8, score-0.583]
6 Central to the methodology is the cost-to-go function, which can be obtained by solving Bellman's equation. [sent-9, score-0.251]
7 The domain of the cost-to-go function is the state space of the system to be controlled, and dynamic programming algorithms compute and store a table consisting of one cost-to-go value per state. [sent-10, score-0.995]
8 Unfortunately, the size of a state space typically grows exponentially in the number of state variables. [sent-11, score-0.236]
9 Known as the curse of dimensionality, this phenomenon renders dynamic programming intractable in the face of problems of practical scale. [sent-12, score-0.664]
10 One approach to dealing with this difficulty is to generate an approximation within a parameterized class of functions, in a spirit similar to that of statistical regression. [sent-13, score-0.136]
11 The focus of this paper is on linearly parameterized functions: one tries to approximate the cost-to-go function J* by a linear combination of prespecified basis functions. [sent-14, score-0.485]
12 Note that this scheme depends on two important preconditions for the development of an effective approximation. [sent-15, score-0.089]
13 First, we need to choose basis functions that can closely approximate the desired cost-to-go function. [sent-16, score-0.24]
14 In this respect, a suitable choice requires some practical experience or theoretical analysis that provides rough information on the shape of the function to be approximated. [sent-17, score-0.052]
15 Second, we need an efficient algorithm that computes an appropriate linear combination. [sent-19, score-0.056]
16 The algorithm we study is based on a linear programming formulation, originally proposed by Schweitzer and Seidmann [5], that generalizes the linear programming approach to exact dynamic programming, originally introduced by Manne [4]. [sent-20, score-1.083]
17 We present an error bound that characterizes the quality of approximations produced by the linear programming approach. [sent-21, score-0.544]
18 The error is characterized in relative terms, compared against the "best possible" approximation of the optimal cost-to-go function given the selection of basis functions. [sent-22, score-0.231]
19 This is the first such error bound for any algorithm that approximates cost-to-go functions of general stochastic control problems by computing weights for arbitrary collections of basis functions. [sent-23, score-0.698]
20 2 Stochastic control and linear programming We consider discrete-time stochastic control problems involving a finite state space S of cardinality |S| = N. [sent-24, score-0.836]
21 For each state x ∈ S, there is a finite set of available actions A_x. [sent-25, score-0.176]
22 Taking action a ∈ A_x when the current state is x incurs a cost g_a(x). [sent-26, score-0.249]
23 State transition probabilities P_a(x,y) represent, for each pair (x,y) of states and each action a ∈ A_x, the probability that the next state will be y given that the current state is x and the current action is a ∈ A_x. [sent-27, score-0.471]
24 A policy u is a mapping from states to actions. [sent-28, score-0.189]
25 Given a policy u, the dynamics of the system follow a Markov chain with transition probabilities P_{u(x)}(x, y). [sent-29, score-0.204]
26 For each policy u, we define a transition matrix P_u whose (x,y)th entry is P_{u(x)}(x,y). [sent-30, score-0.27]
27 The problem of stochastic control amounts to selection of a policy that optimizes a given criterion. [sent-31, score-0.324]
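To make this setup concrete, the short Python sketch below encodes a toy MDP with three states and two actions per state; the costs g_a(x) and transition probabilities P_a(x,y) are hypothetical values chosen only for illustration, not taken from the paper.

```python
# A minimal sketch of the ingredients above: a hypothetical toy MDP with
# |S| = 3 states and A_x = {0, 1} for every state, plus the transition
# matrix P_u and cost vector g_u induced by a policy u.
import numpy as np

N = 3           # number of states
A = 2           # number of actions available in every state

# g[a, x] = cost of taking action a in state x (toy values)
g = np.array([[1.0, 2.0, 0.5],
              [0.5, 1.5, 2.0]])

# P[a, x, y] = probability that the next state is y, given state x and action a
P = np.array([[[0.9, 0.1, 0.0],
               [0.1, 0.8, 0.1],
               [0.0, 0.2, 0.8]],
              [[0.5, 0.5, 0.0],
               [0.0, 0.5, 0.5],
               [0.5, 0.0, 0.5]]])

def policy_matrices(u):
    """For a policy u (length-N array mapping states to actions), return the
    transition matrix P_u and cost vector g_u used in the text."""
    idx = np.arange(N)
    return P[u, idx], g[u, idx]

P_u, g_u = policy_matrices(np.array([0, 1, 0]))
```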
28 In this paper, we will employ as an optimality criterion the infinite-horizon discounted cost J_u(x) = E[ Σ_{t=0}^∞ α^t g_u(x_t) | x_0 = x ], where g_u(x) is used as shorthand for g_{u(x)}(x) and the discount factor α ∈ (0,1) reflects inter-temporal preferences. [sent-32, score-0.175]
29 Optimality is attained by any policy that is greedy with respect to the optimal cost-to-go function J*(x) = min_u J_u(x) (a policy u is called greedy with respect to J if T_u J = TJ). [sent-33, score-0.587]
30 Let us define operators T_u and T by T_u J = g_u + α P_u J and TJ = min_u (g_u + α P_u J). [sent-34, score-0.218]
31 The optimal cost-to-go function is the unique solution to Bellman's equation J = TJ. [sent-35, score-0.054]
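As a rough illustration of these definitions, the sketch below implements T_u and T for the toy MDP from the previous sketch and runs plain value iteration to approach the fixed point J* of T; the discount factor α = 0.95 is an assumed value.

```python
# Sketch of the operators T_u J = g_u + alpha P_u J and TJ = min_u (g_u + alpha P_u J),
# reusing g, P, N and policy_matrices from the toy MDP above.
import numpy as np

alpha = 0.95    # assumed discount factor in (0, 1)

def T_u(J, u):
    P_u, g_u = policy_matrices(u)
    return g_u + alpha * P_u @ J

def T(J):
    # (TJ)(x) = min_a [ g_a(x) + alpha * sum_y P_a(x, y) J(y) ]
    return np.min(g + alpha * np.einsum('axy,y->ax', P, J), axis=0)

def greedy_policy(J):
    return np.argmin(g + alpha * np.einsum('axy,y->ax', P, J), axis=0)

# Value iteration: repeated application of T converges to J*, which solves J = TJ.
J = np.zeros(N)
for _ in range(2000):
    J = T(J)
print("J* ~", J, "  greedy policy:", greedy_policy(J))
```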
32 Dynamic programming offers a number of approaches to solving this equation; one of particular relevance to our paper makes use of linear programming, as we will now discuss. [sent-36, score-0.47]
33 The exact approach solves the problem max_J c'J subject to TJ ≥ J, (1) where c is a vector with positive components, which we will refer to as state-relevance weights. [sent-39, score-0.081]
34 It follows that, for any set of positive weights c, J* is the unique solution to (1). [sent-41, score-0.084]
35 The constraint TJ ≥ J is equivalent to g_a(x) + α Σ_{y∈S} P_a(x,y) J(y) ≥ J(x), ∀ a ∈ A_x, x ∈ S, so that the optimization problem (1) can be represented as an LP, which we refer to as the exact LP. [sent-43, score-0.133]
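Assuming the exact LP takes the componentwise form max c'J subject to (I − αP_a)J ≤ g_a for every action a, a sketch of problem (1) with scipy's linprog on the toy MDP might look as follows; the uniform state-relevance weights c are an arbitrary illustrative choice.

```python
# Sketch of the exact LP (1): maximize c'J subject to TJ >= J,
# i.e. J(x) - alpha * sum_y P_a(x, y) J(y) <= g_a(x) for every (x, a).
# Reuses g, P, N, A and alpha from the toy MDP sketches above.
import numpy as np
from scipy.optimize import linprog

c = np.ones(N) / N                                       # assumed state-relevance weights
I = np.eye(N)
A_ub = np.vstack([I - alpha * P[a] for a in range(A)])   # stacked (I - alpha P_a)
b_ub = np.concatenate([g[a] for a in range(A)])          # stacked g_a

# linprog minimizes, so negate the objective; J is unconstrained in sign.
res = linprog(-c, A_ub=A_ub, b_ub=b_ub, bounds=[(None, None)] * N, method="highs")
J_exact = res.x                                          # coincides with J* for any c > 0
print("exact-LP solution:", J_exact)
```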
36 As mentioned in the introduction, state spaces for practical problems are enormous due to the curse of dimensionality. [sent-44, score-0.323]
37 Consequently, the linear program of interest involves prohibitively large numbers of variables and constraints. [sent-45, score-0.147]
38 The approximation algorithm we study dramatically reduces the number of variables. [sent-46, score-0.04]
39 Let us now introduce the linear programming approach to approximate dynamic programming. [sent-47, score-0.643]
40 With the aim of computing a weight vector r ∈ ℝ^K such that Φr is a close approximation to J*, where Φ is the matrix whose columns are the basis functions, one might pose the following optimization problem: max_r c'Φr subject to TΦr ≥ Φr. (2) [sent-52, score-0.144]
41 Given a solution r, one might then hope to generate near-optimal decisions by using a policy that is greedy with respect to Φr. [sent-55, score-0.359]
42 As with the case of exact dynamic programming, the optimization problem (2) can be recast as a linear program. [sent-56, score-0.334]
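The sketch below illustrates that recast problem (2) on the toy MDP, using a hypothetical basis matrix Φ with a constant column and a linear "ramp" column; it is only an illustration of the formulation, not the paper's implementation.

```python
# Sketch of the approximate LP (2): maximize c'(Phi r) subject to
# T(Phi r) >= Phi r, i.e. (I - alpha P_a) Phi r <= g_a for every action a.
# Reuses g, P, N, A, alpha and c from the sketches above.
import numpy as np
from scipy.optimize import linprog

K = 2
Phi = np.column_stack([np.ones(N), np.arange(N, dtype=float)])   # hypothetical basis

I = np.eye(N)
A_ub = np.vstack([(I - alpha * P[a]) @ Phi for a in range(A)])   # constraints in r
b_ub = np.concatenate([g[a] for a in range(A)])

res = linprog(-(c @ Phi), A_ub=A_ub, b_ub=b_ub,
              bounds=[(None, None)] * K, method="highs")
r_tilde = res.x
J_approx = Phi @ r_tilde                     # approximation to J*

# Near-optimal decisions: act greedily with respect to Phi r_tilde.
u_greedy = np.argmin(g + alpha * np.einsum('axy,y->ax', P, J_approx), axis=0)
print("approximate-LP weights:", r_tilde, "  greedy policy:", u_greedy)
```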
43 We will refer to this problem as the approximate LP. [sent-57, score-0.137]
44 Note that, though the number of variables is reduced to K, the number of constraints remains as large as in the exact LP. [sent-58, score-0.089]
45 Fortunately, we expect that most of the constraints will become irrelevant, and solutions to the linear program can be approximated efficiently, as demonstrated in [3]. [sent-59, score-0.1]
46 3 Error Bounds for the Approximate LP When the optimal cost-to-go function lies within the span of the basis functions, solution of the approximate LP yields the exact optimal cost-to-go function. [sent-60, score-0.715]
47 Unfortunately, it is difficult in practice to select a set of basis functions that contains the optimal cost-to-go function within its span. [sent-61, score-0.41]
48 Instead, basis functions must be based on heuristics and simplified analyses. [sent-62, score-0.147]
49 One can only hope that the span comes close to the desired cost-to-go function. [sent-63, score-0.335]
50 For the approximate LP to be useful, it should deliver good approximations when the cost-to-go function is near the span of selected basis functions. [sent-64, score-0.551]
51 In this section, we present a bound that ensures desirable results of this kind. [sent-65, score-0.086]
52 To set the stage for development of an error bound, let us establish some notation. [sent-66, score-0.09]
53 First, we introduce the weighted norms, defined by ||J||_{1,γ} = Σ_{x∈S} γ(x)|J(x)| and ||J||_{∞,γ} = max_{x∈S} γ(x)|J(x)|, for any γ : S → ℝ_+. [sent-67, score-0.103]
54 Note that both norms allow for uneven weighting of errors across the state space. [sent-68, score-0.242]
55 We also introduce an operator H, defined by (HV)(x) = max_{a∈A_x} Σ_y P_a(x, y) V(y), for all V : S → ℝ. For any V, (HV)(x) represents the maximum expected value of V(y) if the current state is x and y is a random variable representing the next state. [sent-69, score-0.305]
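A small sketch of these definitions on the toy MDP from the earlier code, assuming the notation above; nothing here is taken from the paper's implementation.

```python
# Sketch of the weighted 1- and infinity-norms and of the operator H,
# (HV)(x) = max_a sum_y P_a(x, y) V(y); reuses P from the toy MDP above.
import numpy as np

def norm_1_weighted(J, gamma):
    return np.sum(gamma * np.abs(J))

def norm_inf_weighted(J, gamma):
    return np.max(gamma * np.abs(J))

def H(V):
    return np.max(np.einsum('axy,y->ax', P, V), axis=0)
```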
56 Based on this operator, we define a scalar k_V = max_{x∈S} V(x) / (V(x) − α(HV)(x)) (3) for each V : S → ℝ. [sent-70, score-0.503]
57 We interpret the argument V of H as a "Lyapunov function," while we view k_V as a "Lyapunov stability factor," in a sense that we will now explain. [sent-71, score-0.504]
58 In the upcoming theorem, we will only be concerned with functions V that are positive and that make k_V nonnegative. [sent-72, score-0.575]
59 Also, our error bound for the approximate LP will grow proportionately with k_V, and we therefore want k_V to be small. [sent-73, score-0.664]
60 At a minimum, k_V should be finite, which translates to the condition α(HV)(x) < V(x), ∀x ∈ S. (4) [sent-74, score-0.529]
61 If α were equal to 1, this would look like a Lyapunov stability condition: the maximum expected value (HV)(x) at the next time step must be less than the current value V(x). [sent-75, score-0.106]
62 Note also that k_V becomes smaller as the (HV)(x)'s become small relative to the V(x)'s. [sent-77, score-0.437]
63 Hence, k_V conveys a degree of "stability," with smaller values representing stronger stability. [sent-78, score-0.471]
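Continuing the same sketch, the Lyapunov stability factor of (3) and the finiteness condition (4) can be evaluated numerically as below; the candidate function V is a hypothetical choice, and the formula for k_V follows the reconstruction of (3) given above.

```python
# Sketch of k_V = max_x V(x) / (V(x) - alpha (HV)(x)) and of condition (4),
# reusing H, N and alpha from the sketches above.
import numpy as np

def lyapunov_factor(V):
    HV = H(V)
    if np.any(alpha * HV >= V):     # condition (4) violated: k_V is not finite
        return np.inf
    return np.max(V / (V - alpha * HV))

V = np.ones(N)                      # hypothetical candidate Lyapunov function
print("k_V =", lyapunov_factor(V))
```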
64 For any given function V mapping S to the positive reals, we use 1/V as shorthand for the function x ↦ 1/V(x). [sent-80, score-0.099]
65 Note that on the left-hand side of (5), we measure the approximation error with the weighted norm ||·||_{1,c}. [sent-84, score-0.088]
66 Recall that the weight vector c appears in the objective function of the approximate LP (2) and must be chosen. [sent-85, score-0.129]
67 In approximating the solution to a given stochastic control problem, it seems sensible to weight more heavily portions of the state space that are visited frequently, so that accuracy will be emphasized in such regions. [sent-86, score-0.569]
68 As discussed in [2], it seems reasonable that the weight vector c should be chosen to reflect the relative importance of each state. [sent-87, score-0.073]
69 Finally, note that the Lyapunov function V plays a central role in the bound of Theorem 3. [sent-88, score-0.122]
70 Its choice influences three terms on the right-hand side of the bound: 1. [sent-90, score-0.036]
wordName wordTfidf (topN-words)
[('kv', 0.437), ('programming', 0.317), ('hv', 0.26), ('lyapunov', 0.218), ('go', 0.209), ('xes', 0.19), ('lp', 0.164), ('policy', 0.155), ('dynamic', 0.142), ('state', 0.118), ('iij', 0.114), ('curse', 0.114), ('apuj', 0.109), ('cpk', 0.109), ('minu', 0.109), ('tuj', 0.109), ('pu', 0.099), ('stanford', 0.099), ('ju', 0.095), ('approximate', 0.093), ('stochastic', 0.09), ('basis', 0.089), ('exact', 0.089), ('bound', 0.086), ('norms', 0.081), ('span', 0.08), ('control', 0.079), ('pa', 0.073), ('management', 0.069), ('max', 0.068), ('stability', 0.067), ('optimality', 0.066), ('define', 0.066), ('shorthand', 0.062), ('bellman', 0.06), ('functions', 0.058), ('finite', 0.058), ('greedy', 0.057), ('linear', 0.056), ('offers', 0.055), ('action', 0.054), ('optimal', 0.054), ('originally', 0.053), ('practical', 0.052), ('transition', 0.049), ('error', 0.048), ('farias', 0.047), ('lxo', 0.047), ('minr', 0.047), ('preconditions', 0.047), ('prohibitively', 0.047), ('queueing', 0.047), ('reals', 0.047), ('recast', 0.047), ('schweitzer', 0.047), ('solution', 0.047), ('hope', 0.046), ('operator', 0.045), ('approximating', 0.044), ('program', 0.044), ('refer', 0.044), ('uneven', 0.043), ('deliver', 0.043), ('operators', 0.043), ('upcoming', 0.043), ('solving', 0.042), ('development', 0.042), ('render', 0.04), ('lsi', 0.04), ('ah', 0.04), ('emphasized', 0.04), ('portions', 0.04), ('prohibitive', 0.04), ('regularities', 0.04), ('roy', 0.04), ('tu', 0.04), ('unified', 0.04), ('approximation', 0.04), ('ij', 0.04), ('current', 0.039), ('problems', 0.039), ('theorem', 0.038), ('parameterized', 0.038), ('unfortunately', 0.038), ('ready', 0.038), ('heavily', 0.038), ('incurs', 0.038), ('seems', 0.037), ('approximations', 0.037), ('positive', 0.037), ('influences', 0.036), ('benjamin', 0.036), ('weight', 0.036), ('central', 0.036), ('introduce', 0.035), ('conveys', 0.034), ('fortunately', 0.034), ('translates', 0.034), ('pl', 0.034), ('cardinality', 0.034)]
simIndex simValue paperId paperTitle
same-paper 1 0.99999952 36 nips-2001-Approximate Dynamic Programming via Linear Programming
Author: Daniela Farias, Benjamin V. Roy
Abstract: The curse of dimensionality gives rise to prohibitive computational requirements that render infeasible the exact solution of large-scale stochastic control problems. We study an efficient method based on linear programming for approximating solutions to such problems. The approach
2 0.2256327 59 nips-2001-Direct value-approximation for factored MDPs
Author: Dale Schuurmans, Relu Patrascu
Abstract: We present a simple approach for computing reasonable policies for factored Markov decision processes (MDPs), when the optimal value function can be approximated by a compact linear form. Our method is based on solving a single linear program that approximates the best linear fit to the optimal value function. By applying an efficient constraint generation procedure we obtain an iterative solution method that tackles concise linear programs. This direct linear programming approach experimentally yields a significant reduction in computation time over approximate value- and policy-iteration methods (sometimes reducing several hours to a few seconds). However, the quality of the solutions produced by linear programming is weaker-usually about twice the approximation error for the same approximating class. Nevertheless, the speed advantage allows one to use larger approximation classes to achieve similar error in reasonable time. 1
3 0.18634306 40 nips-2001-Batch Value Function Approximation via Support Vectors
Author: Thomas G. Dietterich, Xin Wang
Abstract: We present three ways of combining linear programming with the kernel trick to find value function approximations for reinforcement learning. One formulation is based on SVM regression; the second is based on the Bellman equation; and the third seeks only to ensure that good moves have an advantage over bad moves. All formulations attempt to minimize the number of support vectors while fitting the data. Experiments in a difficult, synthetic maze problem show that all three formulations give excellent performance, but the advantage formulation is much easier to train. Unlike policy gradient methods, the kernel methods described here can easily 'adjust the complexity of the function approximator to fit the complexity of the value function. 1
4 0.17787988 128 nips-2001-Multiagent Planning with Factored MDPs
Author: Carlos Guestrin, Daphne Koller, Ronald Parr
Abstract: We present a principled and efficient planning algorithm for cooperative multiagent dynamic systems. A striking feature of our method is that the coordination and communication between the agents is not imposed, but derived directly from the system dynamics and function approximation architecture. We view the entire multiagent system as a single, large Markov decision process (MDP), which we assume can be represented in a factored way using a dynamic Bayesian network (DBN). The action space of the resulting MDP is the joint action space of the entire set of agents. Our approach is based on the use of factored linear value functions as an approximation to the joint value function. This factorization of the value function allows the agents to coordinate their actions at runtime using a natural message passing scheme. We provide a simple and efficient method for computing such an approximate value function by solving a single linear program, whose size is determined by the interaction between the value function structure and the DBN. We thereby avoid the exponential blowup in the state and action space. We show that our approach compares favorably with approaches based on reward sharing. We also show that our algorithm is an efficient alternative to more complicated algorithms even in the single agent case.
5 0.13794161 13 nips-2001-A Natural Policy Gradient
Author: Sham M. Kakade
Abstract: We provide a natural gradient method that represents the steepest descent direction based on the underlying structure of the parameter space. Although gradient methods cannot make large changes in the values of the parameters, we show that the natural gradient is moving toward choosing a greedy optimal action rather than just a better action. These greedy optimal actions are those that would be chosen under one improvement step of policy iteration with approximate, compatible value functions, as defined by Sutton et al. [9]. We then show drastic performance improvements in simple MDPs and in the more challenging MDP of Tetris. 1
6 0.12699188 121 nips-2001-Model-Free Least-Squares Policy Iteration
7 0.1209847 67 nips-2001-Efficient Resources Allocation for Markov Decision Processes
8 0.11078092 55 nips-2001-Convergence of Optimistic and Incremental Q-Learning
10 0.093975663 1 nips-2001-(Not) Bounding the True Error
11 0.091998272 187 nips-2001-The Steering Approach for Multi-Criteria Reinforcement Learning
12 0.081757963 175 nips-2001-Stabilizing Value Function Approximation with the BFBP Algorithm
13 0.075239845 8 nips-2001-A General Greedy Approximation Algorithm with Applications
14 0.075199217 119 nips-2001-Means, Correlations and Bounds
15 0.065617912 132 nips-2001-Novel iteration schemes for the Cluster Variation Method
16 0.064631693 164 nips-2001-Sampling Techniques for Kernel Methods
17 0.062290605 81 nips-2001-Generalization Performance of Some Learning Problems in Hilbert Functional Spaces
18 0.061366584 115 nips-2001-Linear-time inference in Hierarchical HMMs
19 0.061285395 94 nips-2001-Incremental Learning and Selective Sampling via Parametric Optimization Framework for SVM
20 0.058559474 38 nips-2001-Asymptotic Universality for Learning Curves of Support Vector Machines
topicId topicWeight
[(0, -0.21), (1, -0.067), (2, 0.273), (3, 0.005), (4, 0.038), (5, 0.035), (6, 0.004), (7, -0.027), (8, 0.011), (9, 0.013), (10, 0.01), (11, 0.025), (12, 0.045), (13, 0.019), (14, -0.035), (15, 0.02), (16, -0.025), (17, -0.052), (18, 0.004), (19, -0.056), (20, 0.011), (21, -0.029), (22, -0.023), (23, -0.217), (24, -0.041), (25, 0.02), (26, 0.014), (27, -0.012), (28, 0.088), (29, 0.047), (30, -0.046), (31, 0.011), (32, 0.021), (33, 0.089), (34, 0.04), (35, 0.11), (36, -0.074), (37, -0.063), (38, -0.065), (39, 0.061), (40, -0.008), (41, -0.028), (42, -0.091), (43, 0.001), (44, -0.107), (45, -0.102), (46, 0.061), (47, -0.037), (48, -0.036), (49, -0.123)]
simIndex simValue paperId paperTitle
same-paper 1 0.9627111 36 nips-2001-Approximate Dynamic Programming via Linear Programming
Author: Daniela Farias, Benjamin V. Roy
Abstract: The curse of dimensionality gives rise to prohibitive computational requirements that render infeasible the exact solution of large- scale stochastic control problems. We study an efficient method based on linear programming for approximating solutions to such problems. The approach
2 0.85631341 59 nips-2001-Direct value-approximation for factored MDPs
Author: Dale Schuurmans, Relu Patrascu
Abstract: We present a simple approach for computing reasonable policies for factored Markov decision processes (MDPs), when the optimal value function can be approximated by a compact linear form. Our method is based on solving a single linear program that approximates the best linear fit to the optimal value function. By applying an efficient constraint generation procedure we obtain an iterative solution method that tackles concise linear programs. This direct linear programming approach experimentally yields a significant reduction in computation time over approximate value- and policy-iteration methods (sometimes reducing several hours to a few seconds). However, the quality of the solutions produced by linear programming is weaker-usually about twice the approximation error for the same approximating class. Nevertheless, the speed advantage allows one to use larger approximation classes to achieve similar error in reasonable time. 1
3 0.76911974 128 nips-2001-Multiagent Planning with Factored MDPs
Author: Carlos Guestrin, Daphne Koller, Ronald Parr
Abstract: We present a principled and efficient planning algorithm for cooperative multiagent dynamic systems. A striking feature of our method is that the coordination and communication between the agents is not imposed, but derived directly from the system dynamics and function approximation architecture. We view the entire multiagent system as a single, large Markov decision process (MDP), which we assume can be represented in a factored way using a dynamic Bayesian network (DBN). The action space of the resulting MDP is the joint action space of the entire set of agents. Our approach is based on the use of factored linear value functions as an approximation to the joint value function. This factorization of the value function allows the agents to coordinate their actions at runtime using a natural message passing scheme. We provide a simple and efficient method for computing such an approximate value function by solving a single linear program, whose size is determined by the interaction between the value function structure and the DBN. We thereby avoid the exponential blowup in the state and action space. We show that our approach compares favorably with approaches based on reward sharing. We also show that our algorithm is an efficient alternative to more complicated algorithms even in the single agent case.
4 0.72603917 40 nips-2001-Batch Value Function Approximation via Support Vectors
Author: Thomas G. Dietterich, Xin Wang
Abstract: We present three ways of combining linear programming with the kernel trick to find value function approximations for reinforcement learning. One formulation is based on SVM regression; the second is based on the Bellman equation; and the third seeks only to ensure that good moves have an advantage over bad moves. All formulations attempt to minimize the number of support vectors while fitting the data. Experiments in a difficult, synthetic maze problem show that all three formulations give excellent performance, but the advantage formulation is much easier to train. Unlike policy gradient methods, the kernel methods described here can easily 'adjust the complexity of the function approximator to fit the complexity of the value function. 1
5 0.70228392 121 nips-2001-Model-Free Least-Squares Policy Iteration
Author: Michail G. Lagoudakis, Ronald Parr
Abstract: We propose a new approach to reinforcement learning which combines least squares function approximation with policy iteration. Our method is model-free and completely off policy. We are motivated by the least squares temporal difference learning algorithm (LSTD), which is known for its efficient use of sample experiences compared to pure temporal difference algorithms. LSTD is ideal for prediction problems, however it heretofore has not had a straightforward application to control problems. Moreover, approximations learned by LSTD are strongly influenced by the visitation distribution over states. Our new algorithm, Least Squares Policy Iteration (LSPI) addresses these issues. The result is an off-policy method which can use (or reuse) data collected from any source. We have tested LSPI on several problems, including a bicycle simulator in which it learns to guide the bicycle to a goal efficiently by merely observing a relatively small number of completely random trials.
6 0.6734578 67 nips-2001-Efficient Resources Allocation for Markov Decision Processes
7 0.59458053 55 nips-2001-Convergence of Optimistic and Incremental Q-Learning
9 0.53558642 175 nips-2001-Stabilizing Value Function Approximation with the BFBP Algorithm
10 0.5119493 13 nips-2001-A Natural Policy Gradient
11 0.50739437 177 nips-2001-Switch Packet Arbitration via Queue-Learning
12 0.42220581 119 nips-2001-Means, Correlations and Bounds
13 0.4192571 120 nips-2001-Minimax Probability Machine
14 0.40418607 187 nips-2001-The Steering Approach for Multi-Criteria Reinforcement Learning
15 0.4020358 94 nips-2001-Incremental Learning and Selective Sampling via Parametric Optimization Framework for SVM
16 0.40023446 19 nips-2001-A Rotation and Translation Invariant Discrete Saliency Network
17 0.39912081 114 nips-2001-Learning from Infinite Data in Finite Time
18 0.38941017 1 nips-2001-(Not) Bounding the True Error
19 0.37547743 143 nips-2001-PAC Generalization Bounds for Co-training
20 0.35454416 115 nips-2001-Linear-time inference in Hierarchical HMMs
topicId topicWeight
[(14, 0.039), (17, 0.066), (19, 0.041), (27, 0.101), (30, 0.067), (32, 0.24), (38, 0.011), (59, 0.033), (72, 0.105), (79, 0.061), (91, 0.154)]
simIndex simValue paperId paperTitle
Author: A. D'avella, M. C. Tresch
Abstract: The question of whether the nervous system produces movement through the combination of a few discrete elements has long been central to the study of motor control. Muscle synergies, i.e. coordinated patterns of muscle activity, have been proposed as possible building blocks. Here we propose a model based on combinations of muscle synergies with a specific amplitude and temporal structure. Time-varying synergies provide a realistic basis for the decomposition of the complex patterns observed in natural behaviors. To extract time-varying synergies from simultaneous recording of EMG activity we developed an algorithm which extends existing non-negative matrix factorization techniques.
same-paper 2 0.8430174 36 nips-2001-Approximate Dynamic Programming via Linear Programming
Author: Daniela Farias, Benjamin V. Roy
Abstract: The curse of dimensionality gives rise to prohibitive computational requirements that render infeasible the exact solution of large-scale stochastic control problems. We study an efficient method based on linear programming for approximating solutions to such problems. The approach
3 0.69326621 13 nips-2001-A Natural Policy Gradient
Author: Sham M. Kakade
Abstract: We provide a natural gradient method that represents the steepest descent direction based on the underlying structure of the parameter space. Although gradient methods cannot make large changes in the values of the parameters, we show that the natural gradient is moving toward choosing a greedy optimal action rather than just a better action. These greedy optimal actions are those that would be chosen under one improvement step of policy iteration with approximate, compatible value functions, as defined by Sutton et al. [9]. We then show drastic performance improvements in simple MDPs and in the more challenging MDP of Tetris. 1
Author: Gregory Z. Grudic, Lyle H. Ungar
Abstract: We address two open theoretical questions in Policy Gradient Reinforcement Learning. The first concerns the efficacy of using function approximation to represent the state action value function, . Theory is presented showing that linear function approximation representations of can degrade the rate of convergence of performance gradient estimates by a factor of relative to when no function approximation of is used, where is the number of possible actions and is the number of basis functions in the function approximation representation. The second concerns the use of a bias term in estimating the state action value function. Theory is presented showing that a non-zero bias term can improve the rate of convergence of performance gradient estimates by , where is the number of possible actions. Experimental evidence is presented showing that these theoretical results lead to significant improvement in the convergence properties of Policy Gradient Reinforcement Learning algorithms.
5 0.68557853 132 nips-2001-Novel iteration schemes for the Cluster Variation Method
Author: Hilbert J. Kappen, Wim Wiegerinck
Abstract: The Cluster Variation method is a class of approximation methods containing the Bethe and Kikuchi approximations as special cases. We derive two novel iteration schemes for the Cluster Variation Method. One is a fixed point iteration scheme which gives a significant improvement over loopy BP, mean field and TAP methods on directed graphical models. The other is a gradient based method, that is guaranteed to converge and is shown to give useful results on random graphs with mild frustration. We conclude that the methods are of significant practical value for large inference problems. 1
6 0.68330085 7 nips-2001-A Dynamic HMM for On-line Segmentation of Sequential Data
7 0.67944795 95 nips-2001-Infinite Mixtures of Gaussian Process Experts
8 0.67935741 121 nips-2001-Model-Free Least-Squares Policy Iteration
9 0.67816776 59 nips-2001-Direct value-approximation for factored MDPs
10 0.67661273 28 nips-2001-Adaptive Nearest Neighbor Classification Using Support Vector Machines
11 0.67509705 1 nips-2001-(Not) Bounding the True Error
12 0.67460674 88 nips-2001-Grouping and dimensionality reduction by locally linear embedding
13 0.67439526 195 nips-2001-Variance Reduction Techniques for Gradient Estimates in Reinforcement Learning
14 0.67328358 100 nips-2001-Iterative Double Clustering for Unsupervised and Semi-Supervised Learning
15 0.67240262 143 nips-2001-PAC Generalization Bounds for Co-training
16 0.67198169 131 nips-2001-Neural Implementation of Bayesian Inference in Population Codes
17 0.67186737 29 nips-2001-Adaptive Sparseness Using Jeffreys Prior
18 0.67182642 122 nips-2001-Model Based Population Tracking and Automatic Detection of Distribution Changes
19 0.67151248 40 nips-2001-Batch Value Function Approximation via Support Vectors
20 0.671489 135 nips-2001-On Spectral Clustering: Analysis and an algorithm