nips nips2001 nips2001-67 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Rémi Munos
Abstract: It is desirable that a complex decision-making problem in an uncertain world be adequately modeled by a Markov Decision Process (MDP) whose structural representation is adaptively designed by a parsimonious resources allocation process. Resources include time and cost of exploration, amount of memory and computational time allowed for the policy or value function representation. Concerned about making the best use of the available resources, we address the problem of efficiently estimating where adding extra resources is highly needed in order to improve the expected performance of the resulting policy. A possible application in reinforcement learning (RL), when real-world exploration is highly costly, concerns the detection of those areas of the state-space that primarily need to be explored in order to improve the policy. Another application concerns approximation of continuous state-space stochastic control problems using adaptive discretization techniques, for which highly efficient grid points allocation is mandatory to survive high dimensionality. Maybe surprisingly, these two problems can be formulated under a common framework: for a given resource allocation, which defines a belief state over possible MDPs, find where adding new resources (thus decreasing the uncertainty of some parameters, such as transition probabilities or rewards) will most likely increase the expected performance of the new policy. To do so, we use sampling techniques for estimating the contribution of each parameter's probability distribution function (pdf) to the expected loss of using an approximate policy (such as the optimal policy of the most probable MDP) instead of the true (but unknown) policy.
Reference: text
sentIndex sentText sentNum sentScore
1 Abstract It is desirable that a complex decision-making problem in an uncertain world be adequately modeled by a Markov Decision Process (MDP) whose structural representation is adaptively designed by a parsimonious resources allocation process. [sent-10, score-0.28]
2 Resources include time and cost of exploration, amount of memory and computational time allowed for the policy or value function representation. [sent-11, score-0.26]
3 Concerned about making the best use of the available resources, we address the problem of efficiently estimating where adding extra resources is highly needed in order to improve the expected performance of the resulting policy. [sent-12, score-0.292]
4 A possible application in reinforcement learning (RL), when real-world exploration is highly costly, concerns the detection of those areas of the state-space that primarily need to be explored in order to improve the policy. [sent-13, score-0.108]
5 Another application concerns approximation of continuous state-space stochastic control problems using adaptive discretization techniques for which highly efficient grid points allocation is mandatory to survive high dimensionality. [sent-14, score-0.327]
6 To do so, we use sampling techniques for estimating the contribution of each parameter's probability distribution function (Pdf) to the expected loss of using an approximate policy (such as the optimal policy of the most probable MDP) instead of the true (but unknown) policy. [sent-16, score-1.064]
7 Introduction. Assume that we model a complex decision-making problem under uncertainty by a finite MDP. [sent-17, score-0.15]
8 Because of the limited resources used, the parameters of the MDP (transition probabilities and rewards) are uncertain: we assume that we only know a belief state over their possible values. [sent-18, score-0.337]
9 If we select the most probable values of the parameters, we can build an MDP and solve it to deduce the corresponding optimal policy. [sent-19, score-0.269]
10 However, because of the uncertainty over the true parameters, this policy may not be the one that maximizes the expected cumulative rewards of the true (but partially unknown) decision-making problem. [sent-20, score-0.646]
11 We can nevertheless use sampling techniques to estimate the expected loss of using this policy. [sent-21, score-0.331]
12 Furthermore, if we assume independence of the parameters (considered as random variables), we are able to derive the contribution of the uncertainty over each parameter to this expected loss. [sent-22, score-0.385]
13 As a consequence, we can predict where adding new resources (thus decreasing the uncertainty over some parameters) will most decrease this loss, thus improving the MDP model of the decision-making problem so as to optimize the expected future rewards. [sent-23, score-0.304]
14 As a possible application, in model-free RL we may wish to minimize the amount of real-world exploration (because each experiment is highly costly). [sent-24, score-0.063]
15 Following [1] we can maintain a Dirichlet pdf over the transition probabilities of the corresponding MDP. [sent-25, score-0.332]
16 Then, our algorithm is able to predict in which parts of the state space we should make new experiments, thus decreasing the uncertainty over some parameters (the posterior distribution being less uncertain than the prior) in order to optimize the expected payoff. [sent-26, score-0.409]
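To make this kind of belief state concrete, here is a minimal Python sketch (our own construction under assumed conventions, not code from the paper or from [1]): it keeps Dirichlet pseudo-counts over next states for every state-action pair, exposes the posterior-mean transitions as a point estimate for the most probable MDP, and draws plausible transition vectors from the current belief. The class name DirichletBelief and its methods are hypothetical.

```python
import numpy as np

class DirichletBelief:
    """Dirichlet belief over p(.|x, a) for a finite MDP, updated from observed transitions."""

    def __init__(self, n_states, n_actions, prior=1.0):
        # alpha[x, a, y] are pseudo-counts for the transition (x, a) -> y
        self.alpha = np.full((n_states, n_actions, n_states), prior, dtype=float)

    def update(self, x, a, y):
        # one real-world experiment tightens the posterior for (x, a)
        self.alpha[x, a, y] += 1.0

    def mean_transitions(self, x, a):
        # posterior-mean transition vector, used here as the point estimate of the most probable MDP
        return self.alpha[x, a] / self.alpha[x, a].sum()

    def sample_transitions(self, x, a, rng):
        # one plausible transition vector drawn from the current belief
        return rng.dirichlet(self.alpha[x, a])
```

Sampling whole transition tables from such a belief is the step reused by the Monte Carlo estimators sketched further below.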
17 Another application concerns the approximation of continuous (or large discrete) state-space control problems using variable resolution grids, which requires an efficient resource allocation process in order to survive the "curse of dimensionality" in high dimensions. [sent-27, score-0.304]
18 For a given grid, because of the interpolation process, the approximate back-up operator introduces a local interpolation error (see [4]) that may be considered as a random variable (for example in the random grids of [6]). [sent-28, score-0.271]
19 The algorithm introduced in this paper allows us to estimate where we should add new grid-points, thus decreasing the uncertainty over the local interpolation error, in order to increase the expected performance of the new grid representation. [sent-29, score-0.406]
20 The main tool developed here is the calculation of the partial derivative of useful global measures (the value function or the loss of using a sub-optimal policy) with respect to each parameter (probabilities and rewards) of an MDP. [sent-30, score-0.683]
21 A transition from a state x, action a, to a next state y occurs with probability p(y|x, a) and the corresponding (deterministic) reward is r(x, a). [sent-32, score-0.273]
22 We introduce the back-up operator T^a defined, for any function W : X → ℝ, as T^a W(x) = γ Σ_y p(y|x, a) W(y) + r(x, a)   (1)   (with some discount factor 0 < γ < 1). [sent-33, score-0.057]
23 The optimal policy π* is the mapping from any state x to the action π*(x) that maximizes the Q-values: π*(x) = arg max_{a ∈ A} Q(x, a). [sent-37, score-0.405]
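As a minimal sketch of the back-up operator (1) and of greedy policy extraction, assuming the finite MDP is stored as NumPy arrays P[x, a, y] = p(y|x, a) and R[x, a] = r(x, a); the array layout and the names backup and solve_mdp are our assumptions, not the paper's.

```python
import numpy as np

def backup(W, P, R, gamma):
    """Back-up operator (1): (T^a W)(x) = gamma * sum_y p(y|x, a) W(y) + r(x, a).
    P has shape (X, A, X) with P[x, a, y] = p(y|x, a); R has shape (X, A).
    Returns the Q-values, shape (X, A)."""
    return gamma * np.einsum('xay,y->xa', P, W) + R

def solve_mdp(P, R, gamma, n_iter=10_000, tol=1e-8):
    """Value iteration; returns V, Q and a greedy optimal policy pi*(x) = argmax_a Q(x, a)."""
    V_new = V = np.zeros(P.shape[0])
    for _ in range(n_iter):
        Q = backup(V, P, R, gamma)
        V_new = Q.max(axis=1)
        if np.max(np.abs(V_new - V)) < tol:
            break
        V = V_new
    Q = backup(V_new, P, R, gamma)
    return V_new, Q, Q.argmax(axis=1)
```

solve_mdp is reused in the sketches below as the solver for each sampled MDP.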
24 The parameters of the MDP - the probability and the reward functions - are not perfectly known: all we know is a pdf over their possible values. [sent-38, score-0.308]
25 This uncertainty comes from the limited amount of allocated resources for estimating those parameters. [sent-39, score-0.295]
26 Let us choose a specific policy π̃ (for example the optimal policy of the MDP with the most probable parameters). [sent-40, score-0.653]
27 We can estimate the expected loss of using π̃ instead of the true (but unknown) optimal policy π*. [sent-41, score-0.649]
28 Let us write μ = {μ_j} for the set of all parameters (the p and r functions) of an MDP. [sent-42, score-0.065]
29 We assume that we know a probability distribution function pdf(μ_j) over their possible values. [sent-43, score-0.029]
30 For an MDP M^μ defined by its parameters μ, we write p^μ(y|x, a), r^μ(x, a), V^μ, Q^μ, and π^μ [sent-45, score-0.109]
31 respectively for its transition probabilities, rewards, value function, Q-values, and optimal policy. [sent-46, score-0.134]
32 By definition, the optimal gain in M^μ is V^μ(x) = J^μ(x; π^μ), which is obtained for the optimal policy π^μ. [sent-49, score-0.444]
33 Let Ṽ^μ(x) = J^μ(x; π̃) be the approximate gain obtained for some approximate policy π̃. [sent-50, score-0.398]
34 We define the loss to occur L^μ(x) from x when one uses the approximate policy π̃ instead of the optimal one π^μ in M^μ: L^μ(x) = V^μ(x) - Ṽ^μ(x)   (3).   An example of approximate policy π̃ [sent-52, score-0.951]
35 would be the optimal policy of the most probable MDP, defined by the most probable parameters p̄(y|x, a) and r̄(x, a). [sent-53, score-0.55]
36 We also consider the problem of maximizing the global gain from a set of initial states chosen according to some probability distribution P(x). [sent-54, score-0.197]
37 Accordingly, we define the global gain of a policy π: J^μ(π) = Σ_x J^μ(x; π) P(x), and the global loss L^μ of using some approximate policy π̃ instead of the optimal one π^μ: L^μ = J^μ(π^μ) - J^μ(π̃)   (4).   Thus, knowing the pdf over all parameters μ, we can define the expected global loss L = E_μ[L^μ]. [sent-55, score-1.907]
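For any fully specified MDP, the loss (3) and the global loss (4) can be computed exactly by evaluating the two policies; a sketch under the same array conventions as above (the function names are ours, not the paper's).

```python
import numpy as np

def evaluate_policy(P, R, gamma, pi):
    """Exact evaluation of a fixed deterministic policy: solves V_pi = r_pi + gamma * P_pi V_pi."""
    X = P.shape[0]
    P_pi = P[np.arange(X), pi]          # (X, X) transition matrix under pi
    r_pi = R[np.arange(X), pi]          # (X,)  rewards under pi
    return np.linalg.solve(np.eye(X) - gamma * P_pi, r_pi)

def global_loss(P, R, gamma, pi_opt, pi_approx, Px):
    """Global loss (4) for this MDP: sum_x P(x) * [J(x; pi_opt) - J(x; pi_approx)]."""
    V_opt = evaluate_policy(P, R, gamma, pi_opt)
    V_app = evaluate_policy(P, R, gamma, pi_approx)
    return float(Px @ (V_opt - V_app))
```

Here Px is the initial-state distribution P(x) of (4), and pi_opt would typically be the policy returned by solve_mdp for that MDP.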
38 Next, we would like to define what the contribution of each parameter's uncertainty to this loss is, so that we know where we should add new resources (thus reducing some parameters' uncertainty) in order to decrease the expected global loss. [sent-56, score-0.738]
39 We would like to estimate, for each parameter μ_j, E[δL | add δu units of resource for μ_j]   (5). [sent-57, score-0.105]
40 1.2 Partial derivative of the loss. In order to quantify (5) we need to be more explicit about the pdf over μ. [sent-58, score-0.569]
41 First, we assume the independence of the parameters μ_j (considered as random variables). [sent-59, score-0.038]
42 Suppose that pdf(μ_j) = N(0, σ_j) (normal distribution of mean 0 and standard deviation σ_j). [sent-60, score-0.188]
43 We would like to estimate the variation δL of the expected loss L when we make a small change of the uncertainty over μ_j (a consequence of adding new resources), for example when changing the standard deviation by δσ_j in pdf(μ_j). [sent-61, score-0.519]
44 At the limit of an infinitesimal variation we obtain the partial derivative ∂L/∂σ_j which, when computed for all parameters μ_j, provides the respective contributions of each parameter's uncertainty to the global loss. [sent-63, score-0.612]
45 Another example is when pdf(μ_j) is a uniform distribution of support [-b_j, b_j]. [sent-65, score-0.032]
46 Then the partial contribution of μ_j's uncertainty to the global loss can be expressed as ∂L/∂b_j. More generally, we can define a finite number of characteristic scalar measurements of the pdf uncertainty (for example the entropy or the moments) and [sent-66, score-1.097]
47 compute the partial derivative of the expected global loss with respect to these coefficients. [sent-67, score-0.722]
48 Finally, knowing the actual resources needed to estimate a parameter μ_j [sent-68, score-0.243]
49 1.3 Unbiased estimator. We sample N sets of parameters {μ^i_j}. [sent-74, score-0.094]
50 For convenience, we use the superscript i to refer to the i-th MDP sample and the subscript j for the j-th parameter of a variable. [sent-82, score-0.073]
51 For each MDP, we compute the global loss L^i of using the policy π̃ and estimate the expected global loss: L ≈ (1/N) Σ_{i=1..N} L^i. [sent-85, score-0.783]
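A sketch of this sampling step, reusing the solve_mdp and global_loss sketches above: draw N parameter sets from the belief, solve each sampled MDP for its own optimal policy, and average the global losses of the fixed approximate policy. The sample_mdp callback is a placeholder for whatever pdfs are actually maintained (for example Dirichlet transitions and Gaussian rewards); all names are our own.

```python
import numpy as np
# Reuses solve_mdp and global_loss from the sketches above.

def expected_global_loss(sample_mdp, pi_approx, Px, gamma, N=100, seed=0):
    """Monte Carlo estimate L ~ (1/N) * sum_i L^i over N MDPs drawn from the belief.
    sample_mdp(rng) must return (P, R) for one sampled MDP."""
    rng = np.random.default_rng(seed)
    losses = []
    for _ in range(N):
        P, R = sample_mdp(rng)                    # one MDP M^i drawn from the pdfs
        _, _, pi_opt = solve_mdp(P, R, gamma)     # optimal policy of the sampled MDP
        losses.append(global_loss(P, R, gamma, pi_opt, pi_approx, Px))
    return float(np.mean(losses))
```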
52 In order to estimate the contribution of a parameter's uncertainty to L, we derive the partial derivative of L with respect to the characteristic coefficients of pdf(μ_j). [sent-86, score-0.771]
53 For a parameter μ_j that follows a normal distribution N(0, σ_j), we can write μ_j = σ_j ε_j with ε_j ~ N(0, 1). [sent-89, score-0.027]
54 The partial derivative of the expected loss L with respect to σ_j is ∂L/∂σ_j = E_{ε_j ~ N(0,1)}[ (∂L/∂μ_j) ε_j ]   (6), [sent-92, score-0.626]
55 from which we deduce the unbiased estimator ∂L/∂σ_j ≈ (1/N) Σ_{i=1..N} (∂L^i/∂μ^i_j)(μ^i_j/σ_j)   (7), [sent-95, score-0.228]
56 where ∂L^i/∂μ^i_j is the partial derivative of the global loss L^i of MDP M^i with respect to the parameter μ^i_j. [sent-97, score-0.683]
57 For other distributions, we can define similar results to (6) and deduce analogous estimators (for uniform distributions, we have the same estimator with b_j instead of σ_j). [sent-100, score-0.307]
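A sketch of the unbiased estimator (7) under the assumptions above. The per-sample gradients ∂L^i/∂μ^i_j must be supplied by the caller; the paper obtains them cheaply through the influence computations of the next section, and the finite-difference helper below is only a brute-force stand-in for checking. The names dL_dsigma, finite_diff_grad and loss_fn are hypothetical.

```python
import numpy as np

def dL_dsigma(grad_L, mu_samples, sigma):
    """Estimator (7): dL/dsigma_j ~ (1/N) * sum_i (dL^i/dmu^i_j) * (mu^i_j / sigma_j).
    grad_L[i, j] = dL^i/dmu^i_j, mu_samples[i, j] = mu^i_j, sigma[j] = std of pdf(mu_j)."""
    return (grad_L * mu_samples / sigma).mean(axis=0)

def finite_diff_grad(loss_fn, mu, eps=1e-5):
    """Brute-force stand-in for dL^i/dmu^i_j; loss_fn(mu) returns the global loss of the
    MDP built from the parameter vector mu (the paper computes these via influences)."""
    g = np.zeros_like(mu)
    for j in range(mu.size):
        up, down = mu.copy(), mu.copy()
        up[j] += eps
        down[j] -= eps
        g[j] = (loss_fn(up) - loss_fn(down)) / (2.0 * eps)
    return g
```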
58 2 Non-local dependencies 2.1 Influence of a Markov chain. In [3] we introduced the notion of influence of a Markov chain as a way to measure value function/rewards correlations between states. [sent-105, score-0.26]
59 Intuitively, I(y|x) measures the expected discounted number of visits of state y starting from x; it is also the partial derivative of the value function V(x) with respect to the reward r(y). [sent-107, score-0.577]
60 Similarly, the influence I(y|f(·)) can be obtained as the limit of the iterations I(y|f(·)) ← γ Σ_w p(y|w) I(w|f(·)) + f(y). Thus the computation of the influence I(y|f(·)) is cheap (equivalent to solving a Markov chain). [sent-109, score-0.304]
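A sketch of this fixed-point iteration, assuming the Markov chain induced by the fixed policy is stored as P_pi[w, y] = p(y|w). Taking f as the indicator of a state x recovers I(y|x), the expected discounted number of visits of y starting from x, which is our reading of the definition above; the function name is ours.

```python
import numpy as np

def influence(P_pi, f, gamma, n_iter=100_000, tol=1e-10):
    """Influence I(y|f(.)) as the fixed point of I <- gamma * P_pi^T I + f,
    where P_pi[w, y] = p(y|w) is the chain induced by the fixed policy."""
    infl = np.zeros(P_pi.shape[0])
    for _ in range(n_iter):
        new = gamma * P_pi.T @ infl + f
        if np.max(np.abs(new - infl)) < tol:
            return new
        infl = new
    return infl
```

For example, influence(P_pi, np.eye(P_pi.shape[0])[x], gamma) returns the vector of I(y|x) over all states y.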
61 2.2 Total derivative of V. We wish to express the contribution of all parameters - transition probabilities p and rewards r - (considered as variables) to the value function V by defining the total derivative of V as a function of those parameters. [sent-111, score-0.776]
62 We recall that the total derivative of a function f of several variables x_1, ..., x_n is df = Σ_i (∂f/∂x_i) dx_i. [sent-112, score-0.195]
63 We already know that the partial derivative of V(x) with respect to the reward r(z) is the influence I(z|x) = ∂V(x)/∂r(z). [sent-118, score-0.56]
64 Now, the dependency with respect to the transition probabilities has to be expressed more carefully because the probabilities p(w|z) for a given z are dependent (they sum to one). [sent-119, score-0.243]
65 A way to express that is provided in the theorem that follows, whose proof is in [2]. [sent-120, score-0.027]
66 Theorem 1 For a given state z, let us alter the probabilities p(w|z), for all w, by some δp(w|z) value, such that Σ_w δp(w|z) = 0. [sent-121, score-0.125]
67 We deduce the total derivative of V: dV(x) = Σ_z I(z|x) [ γ Σ_w V(w) dp(w|z) + dr(z) ], under the constraint Σ_w dp(w|z) = 0 for all z. [sent-123, score-0.331]
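A sketch of Theorem 1 for a Markov chain with transitions P[z, w] = p(w|z) and rewards r[z]: the influences I(z|x) form row x of (I - γP)^{-1}, and dV(x) combines them with perturbations dP (whose rows sum to zero, per the constraint) and dr. The function name and the finite-difference check in the comment are our own additions.

```python
import numpy as np

def dV_theorem1(P, r, gamma, dP, dr, x):
    """Total derivative of V(x): dV(x) = sum_z I(z|x) * (gamma * sum_w V(w) dP[z, w] + dr[z]),
    with P[z, w] = p(w|z), rows of dP summing to zero, and I(z|x) = [(I - gamma P)^-1]_{x, z}."""
    n = P.shape[0]
    V = np.linalg.solve(np.eye(n) - gamma * P, r)                 # V = (I - gamma P)^-1 r
    I_x = np.linalg.solve(np.eye(n) - gamma * P, np.eye(n))[x]    # row x: influences I(z|x)
    return float(I_x @ (gamma * dP @ V + dr))

# Finite-difference check (our own sanity test): for a small eps,
#   V_eps = np.linalg.solve(np.eye(n) - gamma * (P + eps * dP), r + eps * dr)
#   (V_eps[x] - V[x]) / eps  should approach  dV_theorem1(P, r, gamma, dP, dr, x)
```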
68 3 Total derivative of the loss. For a given MDP M with parameters μ [sent-124, score-0.419]
69 (for notation simplification we do not write the μ superscript in what follows), we want to estimate the loss of using an approximate policy π̃. [sent-126, score-0.605]
70 We can prove that L(x) is the expected discounted cumulative one-step losses l(x_k) over reachable states x_k: L(x) = E[ Σ_k γ^k l(x_k) | x_0 = x; π̃ ], with the expectation taken in the same sense as in (2). [sent-129, score-0.207]
71 Similarly, we decompose the one-step loss l(x) = Q(x, π(x)) - Q(x, π̃(x)) = r(x, π(x)) - r(x, π̃(x)) + Σ_y [q(y|x, π(x)) - q(y|x, π̃(x))] r(y, π(y)) [sent-132, score-0.246]
72 = r(x, π(x)) - r(x, π̃(x)) + Σ_y l(y|x) r(y, π(y)), as a function of the partial contributions l(y|x) = q(y|x, π(x)) - q(y|x, π̃(x)) (see figure 1). [sent-133, score-0.167]
73 Figure 1 (panel labels q(y|x, π) and q(y|x, π̃)): the reward r(y, π(y)) at state y contributes to the one-step loss l(x) = Q(x, π(x)) - Q(x, π̃(x)). [sent-134, score-0.33]
74 3.2 Total derivative of the one-step loss and global loss. Similarly to section 2.2, [sent-138, score-0.697]
75 we wish to express the contribution of all parameters - transition probabilities p and rewards r - (considered as variables) to the one-step loss function by defining the total derivative of l as a function of those parameters. [sent-139, score-0.835]
76 Theorem 2 Let us introduce the (formal) differential back-up operator dT^a defined, for any function W : X → ℝ, as dT^a W(x) = γ Σ_y W(y) dp(y|x, a) + dr(x, a) (similar to the back-up operator (1) but using dp and dr instead of p and r). [sent-140, score-0.329]
77 The total derivative of the one-step loss is dl(x). [sent-141, score-0.415]
78 In order to compute the estimator (7) we calculate the partial derivatives ∂L^i/∂μ^i_j, based on the result of the previous section, with the following algorithm. Given the pdf over the parameters μ_j, [sent-146, score-0.444]
79 choose an approximate policy π̃ (for example the optimal policy of the most probable MDP). [sent-148, score-0.393]
80 For i = 1, ..., N, solve each MDP M^i and deduce its value function V^i, Q-values Q^i, and optimal policy π^i. [sent-151, score-0.454]
81 Compute the influence I(x|P(·)) (which depends on the transition probabilities p^i of M^i) and the influence I(l^i(x|·)|P(·)), from which we deduce ∂L^i/∂r. [sent-153, score-0.584]
82 Then calculate L^i(x) by solving Bellman's equation and deduce the partial derivatives ∂L^i/∂p^i(·|x, a) and ∂L^i/∂r^i(x, a). [sent-154, score-0.136]
83 These partial derivatives enable us to compute the unbiased estimator (7). [sent-155, score-0.254]
84 The complexity of solving a discounted MDP with K states, each one connected to M next states, is O(KM), as is the complexity of computing the influences. [sent-156, score-0.061]
85 A relevant resource allocation process would consider adding new computational resources to reduce uncertainty over the true value of those parameters. [sent-160, score-0.484]
86 In the examples given in the introduction, this would mean doing new experiments in model-free RL to define more precisely the transition probabilities of some relevant states. [sent-161, score-0.183]
87 In discretization techniques for continuous control problems, this would mean adding new grid points to improve the quality of the interpolation at relevant areas of the state space, in order to maximize the expected gain of the new policy. [sent-162, score-0.397]
88 Initial experiments for variable resolution discretization using random grids show improved performance compared to [3]. [sent-163, score-0.128]
89 Influence and variance of a Markov chain: Application to adaptive discretizations in optimal control. [sent-173, score-0.166]
90 Rates of convergence for variable resolution schemes in optimal control. [sent-177, score-0.09]
wordName wordTfidf (topN-words)
[('ylx', 0.507), ('policy', 0.26), ('mdp', 0.252), ('loss', 0.22), ('pdf', 0.188), ('derivative', 0.161), ('ylw', 0.16), ('influence', 0.152), ('uncertainty', 0.15), ('resources', 0.145), ('vex', 0.137), ('deduce', 0.136), ('partial', 0.134), ('rewards', 0.123), ('wlz', 0.119), ('jl', 0.116), ('dp', 0.116), ('xk', 0.103), ('global', 0.096), ('remi', 0.092), ('wlx', 0.092), ('ylf', 0.092), ('allocation', 0.087), ('define', 0.083), ('uj', 0.08), ('expected', 0.08), ('zlx', 0.08), ('transition', 0.076), ('contribution', 0.076), ('probable', 0.075), ('li', 0.071), ('lp', 0.069), ('jlj', 0.069), ('lil', 0.069), ('mjl', 0.069), ('munos', 0.069), ('gain', 0.068), ('probabilities', 0.068), ('dl', 0.064), ('interpolation', 0.064), ('resource', 0.064), ('chain', 0.061), ('discounted', 0.061), ('pj', 0.058), ('optimal', 0.058), ('operator', 0.057), ('state', 0.057), ('estimator', 0.056), ('reward', 0.053), ('dr', 0.051), ('grids', 0.051), ('uncertain', 0.048), ('ry', 0.048), ('markov', 0.047), ('dsl', 0.046), ('ecole', 0.046), ('maxaea', 0.046), ('polytechnique', 0.046), ('concerns', 0.045), ('discretization', 0.045), ('grid', 0.045), ('defined', 0.044), ('parameter', 0.041), ('dt', 0.041), ('lxo', 0.04), ('survive', 0.04), ('jr', 0.039), ('defining', 0.039), ('parameters', 0.038), ('adding', 0.038), ('mdps', 0.037), ('bellman', 0.037), ('decreasing', 0.036), ('unbiased', 0.036), ('efficient', 0.036), ('approximate', 0.035), ('rl', 0.034), ('exploration', 0.034), ('total', 0.034), ('contributions', 0.033), ('states', 0.033), ('andrew', 0.033), ('cumulative', 0.033), ('resolution', 0.032), ('bj', 0.032), ('curse', 0.032), ('superscript', 0.032), ('respect', 0.031), ('estimate', 0.031), ('action', 0.03), ('costly', 0.03), ('know', 0.029), ('pl', 0.029), ('highly', 0.029), ('derivatives', 0.028), ('theorem', 0.027), ('write', 0.027), ('ix', 0.027), ('knowing', 0.026), ('decompose', 0.026)]
simIndex simValue paperId paperTitle
same-paper 1 1.0000001 67 nips-2001-Efficient Resources Allocation for Markov Decision Processes
Author: Rémi Munos
Abstract: It is desirable that a complex decision-making problem in an uncertain world be adequately modeled by a Markov Decision Process (MDP) whose structural representation is adaptively designed by a parsimonious resources allocation process. Resources include time and cost of exploration, amount of memory and computational time allowed for the policy or value function representation. Concerned about making the best use of the available resources, we address the problem of efficiently estimating where adding extra resources is highly needed in order to improve the expected performance of the resulting policy. Possible application in reinforcement learning (RL) , when real-world exploration is highly costly, concerns the detection of those areas of the state-space that need primarily to be explored in order to improve the policy. Another application concerns approximation of continuous state-space stochastic control problems using adaptive discretization techniques for which highly efficient grid points allocation is mandatory to survive high dimensionality. Maybe surprisingly these two problems can be formulated under a common framework: for a given resource allocation, which defines a belief state over possible MDPs, find where adding new resources (thus decreasing the uncertainty of some parameters -transition probabilities or rewards) will most likely increase the expected performance of the new policy. To do so, we use sampling techniques for estimating the contribution of each parameter's probability distribution function (Pdf) to the expected loss of using an approximate policy (such as the optimal policy of the most probable MDP) instead of the true (but unknown) policy.
2 0.20639323 59 nips-2001-Direct value-approximation for factored MDPs
Author: Dale Schuurmans, Relu Patrascu
Abstract: We present a simple approach for computing reasonable policies for factored Markov decision processes (MDPs), when the optimal value function can be approximated by a compact linear form. Our method is based on solving a single linear program that approximates the best linear fit to the optimal value function. By applying an efficient constraint generation procedure we obtain an iterative solution method that tackles concise linear programs. This direct linear programming approach experimentally yields a significant reduction in computation time over approximate value- and policy-iteration methods (sometimes reducing several hours to a few seconds). However, the quality of the solutions produced by linear programming is weaker-usually about twice the approximation error for the same approximating class. Nevertheless, the speed advantage allows one to use larger approximation classes to achieve similar error in reasonable time. 1
3 0.19506894 13 nips-2001-A Natural Policy Gradient
Author: Sham M. Kakade
Abstract: We provide a natural gradient method that represents the steepest descent direction based on the underlying structure of the parameter space. Although gradient methods cannot make large changes in the values of the parameters, we show that the natural gradient is moving toward choosing a greedy optimal action rather than just a better action. These greedy optimal actions are those that would be chosen under one improvement step of policy iteration with approximate, compatible value functions, as defined by Sutton et al. [9]. We then show drastic performance improvements in simple MDPs and in the more challenging MDP of Tetris. 1
4 0.18134032 121 nips-2001-Model-Free Least-Squares Policy Iteration
Author: Michail G. Lagoudakis, Ronald Parr
Abstract: We propose a new approach to reinforcement learning which combines least squares function approximation with policy iteration. Our method is model-free and completely off policy. We are motivated by the least squares temporal difference learning algorithm (LSTD), which is known for its efficient use of sample experiences compared to pure temporal difference algorithms. LSTD is ideal for prediction problems, however it heretofore has not had a straightforward application to control problems. Moreover, approximations learned by LSTD are strongly influenced by the visitation distribution over states. Our new algorithm, Least Squares Policy Iteration (LSPI) addresses these issues. The result is an off-policy method which can use (or reuse) data collected from any source. We have tested LSPI on several problems, including a bicycle simulator in which it learns to guide the bicycle to a goal efficiently by merely observing a relatively small number of completely random trials.
5 0.16472253 40 nips-2001-Batch Value Function Approximation via Support Vectors
Author: Thomas G. Dietterich, Xin Wang
Abstract: We present three ways of combining linear programming with the kernel trick to find value function approximations for reinforcement learning. One formulation is based on SVM regression; the second is based on the Bellman equation; and the third seeks only to ensure that good moves have an advantage over bad moves. All formulations attempt to minimize the number of support vectors while fitting the data. Experiments in a difficult, synthetic maze problem show that all three formulations give excellent performance, but the advantage formulation is much easier to train. Unlike policy gradient methods, the kernel methods described here can easily 'adjust the complexity of the function approximator to fit the complexity of the value function. 1
6 0.14502099 128 nips-2001-Multiagent Planning with Factored MDPs
7 0.14146817 55 nips-2001-Convergence of Optimistic and Incremental Q-Learning
8 0.13610949 187 nips-2001-The Steering Approach for Multi-Criteria Reinforcement Learning
10 0.1209847 36 nips-2001-Approximate Dynamic Programming via Linear Programming
11 0.11713091 139 nips-2001-Online Learning with Kernels
12 0.10850601 175 nips-2001-Stabilizing Value Function Approximation with the BFBP Algorithm
13 0.098833278 7 nips-2001-A Dynamic HMM for On-line Segmentation of Sequential Data
14 0.07835833 9 nips-2001-A Generalization of Principal Components Analysis to the Exponential Family
15 0.075211503 195 nips-2001-Variance Reduction Techniques for Gradient Estimates in Reinforcement Learning
16 0.073777974 183 nips-2001-The Infinite Hidden Markov Model
17 0.067756161 81 nips-2001-Generalization Performance of Some Learning Problems in Hilbert Functional Spaces
18 0.066114902 22 nips-2001-A kernel method for multi-labelled classification
19 0.062356267 8 nips-2001-A General Greedy Approximation Algorithm with Applications
20 0.059166618 163 nips-2001-Risk Sensitive Particle Filters
topicId topicWeight
[(0, -0.201), (1, -0.079), (2, 0.327), (3, 0.022), (4, 0.021), (5, 0.043), (6, 0.02), (7, 0.005), (8, 0.03), (9, -0.02), (10, 0.025), (11, 0.033), (12, 0.024), (13, -0.002), (14, -0.013), (15, -0.092), (16, 0.008), (17, 0.011), (18, 0.007), (19, 0.047), (20, 0.097), (21, 0.085), (22, 0.023), (23, -0.127), (24, 0.009), (25, -0.048), (26, -0.076), (27, -0.017), (28, 0.014), (29, 0.115), (30, -0.041), (31, -0.086), (32, 0.064), (33, 0.098), (34, 0.063), (35, 0.07), (36, -0.11), (37, 0.024), (38, -0.065), (39, -0.023), (40, 0.059), (41, 0.084), (42, 0.001), (43, 0.133), (44, -0.003), (45, 0.014), (46, 0.039), (47, 0.011), (48, -0.092), (49, 0.046)]
simIndex simValue paperId paperTitle
same-paper 1 0.97100347 67 nips-2001-Efficient Resources Allocation for Markov Decision Processes
Author: Rémi Munos
Abstract: It is desirable that a complex decision-making problem in an uncertain world be adequately modeled by a Markov Decision Process (MDP) whose structural representation is adaptively designed by a parsimonious resources allocation process. Resources include time and cost of exploration, amount of memory and computational time allowed for the policy or value function representation. Concerned about making the best use of the available resources, we address the problem of efficiently estimating where adding extra resources is highly needed in order to improve the expected performance of the resulting policy. Possible application in reinforcement learning (RL) , when real-world exploration is highly costly, concerns the detection of those areas of the state-space that need primarily to be explored in order to improve the policy. Another application concerns approximation of continuous state-space stochastic control problems using adaptive discretization techniques for which highly efficient grid points allocation is mandatory to survive high dimensionality. Maybe surprisingly these two problems can be formulated under a common framework: for a given resource allocation, which defines a belief state over possible MDPs, find where adding new resources (thus decreasing the uncertainty of some parameters -transition probabilities or rewards) will most likely increase the expected performance of the new policy. To do so, we use sampling techniques for estimating the contribution of each parameter's probability distribution function (Pdf) to the expected loss of using an approximate policy (such as the optimal policy of the most probable MDP) instead of the true (but unknown) policy.
2 0.79654843 121 nips-2001-Model-Free Least-Squares Policy Iteration
Author: Michail G. Lagoudakis, Ronald Parr
Abstract: We propose a new approach to reinforcement learning which combines least squares function approximation with policy iteration. Our method is model-free and completely off policy. We are motivated by the least squares temporal difference learning algorithm (LSTD), which is known for its efficient use of sample experiences compared to pure temporal difference algorithms. LSTD is ideal for prediction problems, however it heretofore has not had a straightforward application to control problems. Moreover, approximations learned by LSTD are strongly influenced by the visitation distribution over states. Our new algorithm, Least Squares Policy Iteration (LSPI) addresses these issues. The result is an off-policy method which can use (or reuse) data collected from any source. We have tested LSPI on several problems, including a bicycle simulator in which it learns to guide the bicycle to a goal efficiently by merely observing a relatively small number of completely random trials.
3 0.73767793 59 nips-2001-Direct value-approximation for factored MDPs
Author: Dale Schuurmans, Relu Patrascu
Abstract: We present a simple approach for computing reasonable policies for factored Markov decision processes (MDPs), when the optimal value function can be approximated by a compact linear form. Our method is based on solving a single linear program that approximates the best linear fit to the optimal value function. By applying an efficient constraint generation procedure we obtain an iterative solution method that tackles concise linear programs. This direct linear programming approach experimentally yields a significant reduction in computation time over approximate value- and policy-iteration methods (sometimes reducing several hours to a few seconds). However, the quality of the solutions produced by linear programming is weaker-usually about twice the approximation error for the same approximating class. Nevertheless, the speed advantage allows one to use larger approximation classes to achieve similar error in reasonable time. 1
4 0.72019637 13 nips-2001-A Natural Policy Gradient
Author: Sham M. Kakade
Abstract: We provide a natural gradient method that represents the steepest descent direction based on the underlying structure of the parameter space. Although gradient methods cannot make large changes in the values of the parameters, we show that the natural gradient is moving toward choosing a greedy optimal action rather than just a better action. These greedy optimal actions are those that would be chosen under one improvement step of policy iteration with approximate, compatible value functions, as defined by Sutton et al. [9]. We then show drastic performance improvements in simple MDPs and in the more challenging MDP of Tetris. 1
5 0.65676028 36 nips-2001-Approximate Dynamic Programming via Linear Programming
Author: Daniela Farias, Benjamin V. Roy
Abstract: The curse of dimensionality gives rise to prohibitive computational requirements that render infeasible the exact solution of large- scale stochastic control problems. We study an efficient method based on linear programming for approximating solutions to such problems. The approach
6 0.6438148 40 nips-2001-Batch Value Function Approximation via Support Vectors
7 0.62068951 55 nips-2001-Convergence of Optimistic and Incremental Q-Learning
8 0.6131258 175 nips-2001-Stabilizing Value Function Approximation with the BFBP Algorithm
10 0.5626936 128 nips-2001-Multiagent Planning with Factored MDPs
11 0.55752796 187 nips-2001-The Steering Approach for Multi-Criteria Reinforcement Learning
12 0.48721111 177 nips-2001-Switch Packet Arbitration via Queue-Learning
13 0.4378663 7 nips-2001-A Dynamic HMM for On-line Segmentation of Sequential Data
14 0.38030589 139 nips-2001-Online Learning with Kernels
15 0.37229794 137 nips-2001-On the Convergence of Leveraging
16 0.37226129 183 nips-2001-The Infinite Hidden Markov Model
17 0.35332131 115 nips-2001-Linear-time inference in Hierarchical HMMs
18 0.33322084 138 nips-2001-On the Generalization Ability of On-Line Learning Algorithms
19 0.31770736 22 nips-2001-A kernel method for multi-labelled classification
20 0.31471604 26 nips-2001-Active Portfolio-Management based on Error Correction Neural Networks
topicId topicWeight
[(14, 0.044), (17, 0.506), (19, 0.02), (27, 0.065), (30, 0.022), (38, 0.015), (59, 0.024), (72, 0.047), (79, 0.039), (83, 0.022), (91, 0.111)]
simIndex simValue paperId paperTitle
same-paper 1 0.92192131 67 nips-2001-Efficient Resources Allocation for Markov Decision Processes
Author: Rémi Munos
Abstract: It is desirable that a complex decision-making problem in an uncertain world be adequately modeled by a Markov Decision Process (MDP) whose structural representation is adaptively designed by a parsimonious resources allocation process. Resources include time and cost of exploration, amount of memory and computational time allowed for the policy or value function representation. Concerned about making the best use of the available resources, we address the problem of efficiently estimating where adding extra resources is highly needed in order to improve the expected performance of the resulting policy. Possible application in reinforcement learning (RL) , when real-world exploration is highly costly, concerns the detection of those areas of the state-space that need primarily to be explored in order to improve the policy. Another application concerns approximation of continuous state-space stochastic control problems using adaptive discretization techniques for which highly efficient grid points allocation is mandatory to survive high dimensionality. Maybe surprisingly these two problems can be formulated under a common framework: for a given resource allocation, which defines a belief state over possible MDPs, find where adding new resources (thus decreasing the uncertainty of some parameters -transition probabilities or rewards) will most likely increase the expected performance of the new policy. To do so, we use sampling techniques for estimating the contribution of each parameter's probability distribution function (Pdf) to the expected loss of using an approximate policy (such as the optimal policy of the most probable MDP) instead of the true (but unknown) policy.
2 0.8992973 136 nips-2001-On the Concentration of Spectral Properties
Author: John Shawe-Taylor, Nello Cristianini, Jaz S. Kandola
Abstract: We consider the problem of measuring the eigenvalues of a randomly drawn sample of points. We show that these values can be reliably estimated as can the sum of the tail of eigenvalues. Furthermore, the residuals when data is projected into a subspace is shown to be reliably estimated on a random sample. Experiments are presented that confirm the theoretical results. 1
3 0.8705529 94 nips-2001-Incremental Learning and Selective Sampling via Parametric Optimization Framework for SVM
Author: Shai Fine, Katya Scheinberg
Abstract: We propose a framework based on a parametric quadratic programming (QP) technique to solve the support vector machine (SVM) training problem. This framework, can be specialized to obtain two SVM optimization methods. The first solves the fixed bias problem, while the second starts with an optimal solution for a fixed bias problem and adjusts the bias until the optimal value is found. The later method can be applied in conjunction with any other existing technique which obtains a fixed bias solution. Moreover, the second method can also be used independently to solve the complete SVM training problem. A combination of these two methods is more flexible than each individual method and, among other things, produces an incremental algorithm which exactly solve the 1-Norm Soft Margin SVM optimization problem. Applying Selective Sampling techniques may further boost convergence. 1
4 0.50660545 55 nips-2001-Convergence of Optimistic and Incremental Q-Learning
Author: Eyal Even-dar, Yishay Mansour
Abstract: We show the convergence of two deterministic variants of Q-learning. The first is the widely used optimistic Q-learning, which initializes the Q-values to large initial values and then follows a greedy policy with respect to the Q-values. We show that setting the initial value sufficiently large guarantees convergence to an ε-optimal policy. The second is a new and novel algorithm, incremental Q-learning, which gradually promotes the values of actions that are not taken. We show that incremental Q-learning converges, in the limit, to the optimal policy. Our incremental Q-learning algorithm can be viewed as a derandomization of the ε-greedy Q-learning. 1
5 0.49447632 59 nips-2001-Direct value-approximation for factored MDPs
Author: Dale Schuurmans, Relu Patrascu
Abstract: We present a simple approach for computing reasonable policies for factored Markov decision processes (MDPs), when the optimal value function can be approximated by a compact linear form. Our method is based on solving a single linear program that approximates the best linear fit to the optimal value function. By applying an efficient constraint generation procedure we obtain an iterative solution method that tackles concise linear programs. This direct linear programming approach experimentally yields a significant reduction in computation time over approximate value- and policy-iteration methods (sometimes reducing several hours to a few seconds). However, the quality of the solutions produced by linear programming is weaker-usually about twice the approximation error for the same approximating class. Nevertheless, the speed advantage allows one to use larger approximation classes to achieve similar error in reasonable time. 1
6 0.48694956 121 nips-2001-Model-Free Least-Squares Policy Iteration
7 0.48110503 134 nips-2001-On Kernel-Target Alignment
8 0.47243041 13 nips-2001-A Natural Policy Gradient
9 0.45535991 135 nips-2001-On Spectral Clustering: Analysis and an algorithm
10 0.4509334 88 nips-2001-Grouping and dimensionality reduction by locally linear embedding
11 0.44971567 36 nips-2001-Approximate Dynamic Programming via Linear Programming
12 0.44658536 195 nips-2001-Variance Reduction Techniques for Gradient Estimates in Reinforcement Learning
13 0.43585068 7 nips-2001-A Dynamic HMM for On-line Segmentation of Sequential Data
15 0.4337711 48 nips-2001-Characterizing Neural Gain Control using Spike-triggered Covariance
16 0.42986614 89 nips-2001-Grouping with Bias
17 0.42377186 24 nips-2001-Active Information Retrieval
18 0.41278183 171 nips-2001-Spectral Relaxation for K-means Clustering
19 0.41098446 40 nips-2001-Batch Value Function Approximation via Support Vectors
20 0.41014796 31 nips-2001-Algorithmic Luckiness