nips nips2011 nips2011-283 knowledge-graph by maker-knowledge-mining

283 nips-2011-The Fixed Points of Off-Policy TD


Source: pdf

Author: J. Z. Kolter

Abstract: Off-policy learning, the ability for an agent to learn about a policy other than the one it is following, is a key element of Reinforcement Learning, and in recent years there has been much work on developing Temporal Difference (TD) algorithms that are guaranteed to converge under off-policy sampling. It has remained an open question, however, whether anything can be said a priori about the quality of the TD solution when off-policy sampling is employed with function approximation. In general the answer is no: for arbitrary off-policy sampling the error of the TD solution can be unboundedly large, even when the approximator can represent the true value function well. In this paper we propose a novel approach to address this problem: we show that by considering a certain convex subset of off-policy distributions we can indeed provide guarantees as to the solution quality similar to the on-policy case. Furthermore, we show that we can efficiently project onto this convex set using only samples generated from the system. The end result is a novel TD algorithm that has approximation guarantees even in the case of off-policy sampling and which empirically outperforms existing TD methods. 1

Reference: text


Summary: the most important sentences generated by tfidf model

sentIndex sentText sentNum sentScore

1 It has remained an open question, however, whether anything can be said a priori about the quality of the TD solution when off-policy sampling is employed with function approximation. [sent-5, score-0.238]

2 In general the answer is no: for arbitrary off-policy sampling the error of the TD solution can be unboundedly large, even when the approximator can represent the true value function well. [sent-6, score-0.242]

3 In this paper we propose a novel approach to address this problem: we show that by considering a certain convex subset of off-policy distributions we can indeed provide guarantees as to the solution quality similar to the on-policy case. [sent-7, score-0.194]

4 Furthermore, we show that we can efficiently project on to this convex set using only samples generated from the system. [sent-8, score-0.061]

5 The end result is a novel TD algorithm that has approximation guarantees even in the case of off-policy sampling and which empirically outperforms existing TD methods. [sent-9, score-0.142]

6 1 Introduction In temporal prediction tasks, Temporal Difference (TD) learning provides a method for learning long-term expected rewards (the “value function”) using only trajectories from the system. [sent-10, score-0.071]

7 The algorithm is ubiquitous in Reinforcement Learning, and there has been a great deal of work studying the convergence properties of the algorithm. [sent-11, score-0.023]

8 When states are sampled on-policy (i.e., when the states are drawn from the stationary distribution of the policy we are trying to evaluate), the algorithm converges to a well-known fixed point that is guaranteed to be close to the optimal projection of the true value function [17]. [sent-15, score-0.497]

9 When states are sampled off-policy, standard TD may diverge when using linear function approximation [1], and this has led in recent years to a number of modified TD algorithms that are guaranteed to converge even in the presence of off-policy sampling [16, 15, 9, 10]. [sent-16, score-0.24]

10 Of equal importance, however, is the actual quality of the TD solution under off-policy sampling. [sent-17, score-0.113]

11 Previous work, as well as an example we present in this paper, show that in general little can be said about this question: the solution found by TD can be arbitrarily poor in the case of off-policy sampling, even when the true value function is well-approximated by a linear basis. [sent-18, score-0.122]

12 Indeed, a long-standing open question in Reinforcement Learning is whether any a priori guarantees can be made about the solution quality for off-policy methods using function approximation. [sent-20, score-0.153]

13 Furthermore, we show that this set of feasible off-policy sampling distributions is convex, representable via a linear matrix inequality (LMI), and we demonstrate how the set can be approximated and projected onto efficiently in the finite sample setting. [sent-22, score-0.242]

14 The resulting method, which we refer to as TD with distribution optimization (TD-DO), is thus able to guarantee a good approximation to the best possible projected value function, even for off-policy sampling. [sent-23, score-0.141]

15 2 Preliminaries and Background: A Markov chain is a tuple (S, P, R, γ), where S is a set of states, P : S × S → R+ is a transition probability function, R : S → R is a reward function, and γ ∈ [0, 1) is a discount factor. [sent-25, score-0.148]

16 For simplicity we assume a finite state space, S = {1, . . . , n}, which allows us to use matrix rather than operator notation. [sent-29, score-0.026]

17 The value function for a Markov chain, V : S → R, maps states to their long-term discounted sum of rewards, and is defined as V(s) = E[ Σ_{t=0}^∞ γ^t R(s_t) | s_0 = s ]. [sent-30, score-0.036]

18 The value function may also be expressed via Bellman's equation (in vector form), V = R + γP V (1), where R, V ∈ R^n represent vectors of all rewards and values respectively, and P ∈ R^{n×n} is the matrix of transition probabilities P_ij = P(s' = j | s = i). [sent-31, score-0.065]
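
As a small illustration of the vector-form Bellman equation (1), the sketch below (Python/NumPy, not from the paper; the chain, rewards, and discount are made-up placeholders) solves V = R + γP V directly as V = (I − γP)^{-1} R.

```python
import numpy as np

# Toy 3-state Markov chain (placeholder numbers, for illustration only).
P = np.array([[0.5, 0.5, 0.0],
              [0.1, 0.6, 0.3],
              [0.0, 0.2, 0.8]])   # transition probabilities, rows sum to 1
R = np.array([1.0, 0.0, 2.0])     # reward per state
gamma = 0.9                       # discount factor

# Bellman equation in vector form: V = R + gamma * P V  <=>  (I - gamma * P) V = R
V = np.linalg.solve(np.eye(len(R)) - gamma * P, R)
print(V)
```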

19 The TD solution is a fixed point of the Bellman operator followed by a projection, i.e.: [sent-33, score-0.056]

20 Φw⋆_D = Π_D(R + γP Φw⋆_D) (2), where Π_D = Φ(Φ^T D Φ)^{-1} Φ^T D is the projection matrix weighted by the diagonal matrix D ∈ R^{n×n}. [sent-35, score-0.095]

21 Rearranging terms gives the analytical solution w⋆_D = (Φ^T D(Φ − γP Φ))^{-1} Φ^T D R. [sent-36, score-0.056]
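
A sketch of this analytical fixed point (again Python/NumPy, not from the paper): given a feature matrix Φ, a weighting d (the diagonal of D), and the chain (P, R, γ), it returns w⋆_D = (Φ^T D(Φ − γP Φ))^{-1} Φ^T D R.

```python
import numpy as np

def td_fixed_point(Phi, P, R, d, gamma):
    """Analytical (LSTD-style) TD fixed point under state weights d = diag(D)."""
    D = np.diag(d)
    A = Phi.T @ D @ (Phi - gamma * P @ Phi)   # Phi^T D (Phi - gamma P Phi)
    b = Phi.T @ D @ R                         # Phi^T D R
    return np.linalg.solve(A, b)

# Usage with the toy chain from the previous sketch (placeholder numbers).
P = np.array([[0.5, 0.5, 0.0], [0.1, 0.6, 0.3], [0.0, 0.2, 0.8]])
R = np.array([1.0, 0.0, 2.0])
Phi = np.array([[1.0, 0.0], [1.0, 1.0], [1.0, 2.0]])   # k = 2 basis functions
d = np.array([0.3, 0.3, 0.4])                          # some sampling distribution
w = td_fixed_point(Phi, P, R, d, gamma=0.9)
print(Phi @ w)   # the approximate value function, Phi w*_D
```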

22 When D is not the stationary distribution of the Markov chain (i.e., we are employing off-policy sampling), the original TD algorithm may diverge. [sent-38, score-0.275]

23 LSTD will still be able to compute the TD fixed point in this case, but it has a greater computational complexity of O(k²). [sent-40, score-0.028]

24 Thus, there has been a great deal of work on developing O(k) algorithms that are guaranteed to converge to the LSTD fixed point even in the case of off-policy sampling [16, 15]. [sent-41, score-0.088]

25 We note that the above formulation avoids any explicit mention of a Markov Decision Process (MDP) or actual policies: rather, we just have tuples of the form {s, r, s′ } where s is drawn from an arbitrary distribution but s′ still follows the “policy” we are trying to evaluate. [sent-42, score-0.097]

26 We use the above notation as it avoids the need for any explicit notation of actions and still captures the off-policy setting completely. [sent-46, score-0.038]

27 1 Error bounds for the TD fixed point: Of course, in addition to the issue of convergence, there is the question as to whether we can say anything about the quality of the approximation at this fixed point. [sent-48, score-0.193]

28 Figure 1: Counterexample for off-policy TD learning: (left) the Markov chain considered for the counterexample; (right) the error of the TD estimate for different off-policy distributions (plotted on a log scale), along with the error of the optimal approximation. [sent-59, score-0.254]

29 (Tsitsiklis and Van Roy [17], Lemma 6) Let w⋆_D be the unique solution to Φw⋆_D = Π_D(R + γP Φw⋆_D), where D is the stationary distribution of P. [sent-61, score-0.224]

30 Then ‖Φw⋆_D − V‖_D ≤ (1/(1−γ)) ‖Π_D V − V‖_D (5). Thus, for on-policy sampling with linear function approximation, not only does TD converge to its fixed point, but we can also bound the error of its approximation relative to ‖Π_D V − V‖_D, the lowest possible approximation error for the class of function approximators. [sent-63, score-0.316]

31 A fundamental property of Markov chains [17, Lemma 1] is that the transition matrix P is non-expansive in the D-norm when D is the stationary distribution: ‖P x‖_D ≤ ‖x‖_D, ∀x. [sent-65, score-0.238]
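
This non-expansiveness can be checked numerically: ‖P‖_D, the matrix norm induced by ‖x‖_D = √(x^T D x), equals the spectral norm of D^{1/2} P D^{-1/2}, and is at most 1 when D is the stationary distribution. A sketch (not from the paper; toy chain assumed):

```python
import numpy as np

P = np.array([[0.5, 0.5, 0.0], [0.1, 0.6, 0.3], [0.0, 0.2, 0.8]])

# Stationary distribution: left eigenvector of P for eigenvalue 1, normalized.
evals, evecs = np.linalg.eig(P.T)
mu = np.real(evecs[:, np.argmin(np.abs(evals - 1.0))])
mu = mu / mu.sum()

# ||P||_D = ||D^{1/2} P D^{-1/2}||_2, the induced matrix norm in the D-weighted norm.
Dh, Dih = np.diag(np.sqrt(mu)), np.diag(1.0 / np.sqrt(mu))
print(np.linalg.norm(Dh @ P @ Dih, 2))   # <= 1 when D is the stationary distribution
```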

32 When D is not the stationary distribution of the Markov chain, then (6) need not hold, and it remains to be seen what, if anything, can be said a priori about the TD fixed point in this situation. [sent-67, score-0.221]

33 An off-policy counterexample: Here we present a simple counterexample which shows, for general off-policy sampling, that the TD fixed point can be an arbitrarily poor approximator of the value function, even if the chosen bases can represent the true value function with low error. [sent-68, score-0.164]

34 …though we here present a concrete numerical example for illustration. [sent-70, score-0.028]

35 Consider the two-state Markov chain shown in Figure 1, with transition probability matrix P = (1/2)11T , discount factor γ = 0. [sent-72, score-0.174]

36 Then for any ε > 0 and C > 0, there exists an off-policy distribution D such that using bases Φ = [1 1. [sent-75, score-0.112]

37 The resulting expression for p in (9) has denominator 4141 + 84840ε + 40400ε²; since this solution is in (0, 1) for all ε, by choosing p close to this value we can make w⋆_D arbitrarily large, which in turn makes the error of the TD estimate arbitrarily large. [sent-80, score-0.181]

38 The approximation factor can be sharpened to 1/√(1−γ²) in some settings [18], though the analysis does not carry over to our off-policy case, so we present here the simpler version. [sent-81, score-0.111]

39 Thus, we argue that simple convergence for an off-policy algorithm is not a sufficient criterion for a good learning system, since even for a convergent algorithm the quality of the actual solution could be arbitrarily poor. [sent-86, score-0.193]
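
The blow-up can be reproduced with a few lines of NumPy. The exact discount and basis values are truncated in the extracted text above, so the numbers below (γ = 0.99, Φ = [1, 1.05]^T, R = [1, 1]^T) are assumptions chosen only to show the qualitative effect: the TD fixed point weight diverges as the off-policy weight p on the first state approaches a critical value.

```python
import numpy as np

P = 0.5 * np.ones((2, 2))           # two-state chain with uniform transitions
gamma = 0.99                        # assumed discount (the exact value is truncated above)
R = np.array([1.0, 1.0])            # assumed rewards, so the true V is constant
Phi = np.array([[1.0], [1.05]])     # assumed single basis, nearly spanning V

V_true = np.linalg.solve(np.eye(2) - gamma * P, R)

def td_weight(p):
    """TD fixed point weight under the off-policy distribution D = diag(p, 1 - p)."""
    D = np.diag([p, 1.0 - p])
    A = Phi.T @ D @ (Phi - gamma * P @ Phi)
    b = Phi.T @ D @ R
    return np.linalg.solve(A, b)

for p in [0.1, 0.5, 0.7, 0.715, 0.9]:
    err = Phi @ td_weight(p) - V_true
    print(f"p = {p:.3f}   weighted error = {np.sqrt(err**2 @ np.array([p, 1 - p])):.2f}")
```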

40 To motivate the approach, we again note that error bounds for the on-policy TD algorithm follow from the Markov chain property that ‖P x‖_D ≤ ‖x‖_D for all x when D is the stationary distribution. [sent-88, score-0.327]

41 However, finding a D that satisfies this condition is no easier than computing the stationary distribution directly and thus is not a feasible approach. [sent-89, score-0.207]

42 Instead, we consider a relaxed contraction property: that the transition matrix P followed by a projection onto the bases will be non-expansive for any function already in the span of Φ. [sent-90, score-0.243]

43 Formally, we want to consider distributions D for which ‖Π_D P Φw‖_D ≤ ‖Φw‖_D (10) for any w ∈ R^k. [sent-91, score-0.045]

44 This defines a convex set of distributions, since ‖Π_D P Φw‖²_D ≤ ‖Φw‖²_D ⇔ w^T Φ^T P^T D Φ (Φ^T D Φ)^{-1} Φ^T D P Φ w ≤ w^T Φ^T D Φ w ⇔ w^T (Φ^T P^T D Φ (Φ^T D Φ)^{-1} Φ^T D P Φ − Φ^T D Φ) w ≤ 0 (11). [sent-92, score-0.04]

45 This holds for all w if and only if Φ^T P^T D Φ (Φ^T D Φ)^{-1} Φ^T D P Φ − Φ^T D Φ ⪯ 0 (12), which in turn holds if and only if F ≡ [Φ^T D Φ, Φ^T D P Φ; Φ^T P^T D Φ, Φ^T D Φ] ⪰ 0 (13). This is a linear matrix inequality (LMI) in D, and thus describes a convex set. [sent-93, score-0.066]
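
A sketch of the LMI check (not the paper's code): build the 2k × 2k matrix F of (13) from Φ, P, and a candidate diagonal weighting d, and test feasibility through its smallest eigenvalue, which is also the quantity λ_min(F) mentioned a few sentences below.

```python
import numpy as np

def lmi_matrix(Phi, P, d):
    """F = [[Phi^T D Phi, Phi^T D P Phi], [Phi^T P^T D Phi, Phi^T D Phi]], as in (13)."""
    D = np.diag(d)
    A = Phi.T @ D @ Phi
    B = Phi.T @ D @ P @ Phi
    return np.block([[A, B], [B.T, A]])

def is_feasible(Phi, P, d, tol=1e-9):
    """d lies in the convex set of distributions iff F(d) is positive semidefinite."""
    return np.linalg.eigvalsh(lmi_matrix(Phi, P, d)).min() >= -tol

# Usage: test two candidate weightings on the toy chain (placeholder numbers).
P = np.array([[0.5, 0.5, 0.0], [0.1, 0.6, 0.3], [0.0, 0.2, 0.8]])
Phi = np.array([[1.0, 0.0], [1.0, 1.0], [1.0, 2.0]])
print(is_feasible(Phi, P, np.array([0.2, 0.3, 0.5])))
print(is_feasible(Phi, P, np.array([0.98, 0.01, 0.01])))
```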

46 Although the distribution D is too high-dimensional to optimize directly, analogous to LSTD, the F matrix defined above is of a representable size (2k × 2k), and can be approximated from samples. [sent-94, score-0.136]

47 We will return to this point in the subsequent section, and for now will continue to use the notation of the true distribution D for simplicity. [sent-95, score-0.037]

48 The chief theoretical result of this section is that if we restrict our attention to off-policy distributions within this convex set, we can prove non-trivial bounds about the approximation error of the TD fixed point. [sent-96, score-0.257]

49 Let w⋆ be the unique solution to Φw⋆ = Π_D(R + γP Φw⋆), where D is any distribution satisfying (13). [sent-98, score-0.093]

50 Further, let D_µ be the stationary distribution of P, and let D̄ ≡ D^{-1/2} D_µ^{1/2}. Then ‖Φw⋆_D − V‖_D ≤ ((1 + γκ(D̄))/(1 − γ)) ‖Π_D V − V‖_D. [sent-99, score-0.168]

51 When D = D_µ, κ(D̄) = 1, so we recover the original bound up to a constant factor. [sent-101, score-0.025]

52 Even though the bound does include this term that depends on the distance from the stationary distribution, no such bound is possible for D that do not satisfy the convex constraint (13), as illustrated by the previous counter-example. [sent-102, score-0.27]

53 Using the Schur complement property that [A, B; B^T, C] ⪰ 0 ⇔ C − B^T A^{-1} B ⪰ 0 (for A ≻ 0) [2, pg 650-651]. [sent-104, score-0.035]

54 Figure 2: Counterexample from Figure 1 shown with the set of all valid distributions for which F ⪰ 0. [sent-115, score-0.045]

55 Restricting the solution to this region avoids the possibility of the high error solution. [sent-116, score-0.184]

56 Similarly, the second term in (15) can be bounded as ‖Π_D P Π_D V − Π_D P V‖_D ≤ ‖P Π_D V − P V‖_D ≤ ‖P‖_D ‖Π_D V − V‖_D, where ‖A‖_D denotes the matrix norm ‖A‖_D ≡ max_{‖x‖_D ≤ 1} ‖Ax‖_D. [sent-120, score-0.026]

57 Substituting these bounds back into (15) gives (1 − γ) ‖Φw⋆_D − V‖_D ≤ (1 + γ‖P‖_D) ‖Π_D V − V‖_D, so all that remains is to show that ‖P‖_D ≤ κ(D̄). [sent-121, score-0.038]

58 Indeed, applying a theorem from this work we can arrive at a slight improvement of the bound above [13], but the focus here is just on the general form and possibility of the bound. [sent-125, score-0.045]

59 Returning to the counterexample from the previous section, we can visualize the feasible region for which F ⪰ 0, shown as the shaded portion in Figure 2, and so constraining the solution to this feasible region avoids the possibility of the high-error solution. [sent-126, score-0.281]

60 Moreover, in this example the optimal TD error occurs exactly at the point where λmin (F ) = 0, so that projecting an off-policy distribution onto this set will give an optimal solution for initially infeasible distributions. [sent-127, score-0.167]

61 Placing a weight d_i on each sample, we could optimize the sum F̂(d) = Σ_{i=1}^m d_i F̂_i and obtain a tractable optimization problem. [sent-130, score-0.11]

62 However, optimizing these weights freely is not permissible, since this procedure allows us to choose d_i ≠ d_j even if s^(i) = s^(j), which violates the weights in the original LMI. [sent-131, score-0.173]

63 However, if we additionally require that s^(i) = s^(j) ⇒ d_i = d_j (or, more appropriately for continuous features and states, for example that |d_i − d_j| → 0 as ‖φ(s^(i)) − φ(s^(j))‖ → 0 according to some norm), then we are free to optimize over these empirical distribution weights. [sent-132, score-0.275]
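
In the finite-sample setting each observed transition contributes its own term F̂_i built from the features φ(s^(i)) and φ(s'^(i)). The sketch below shows one plausible construction of these terms and of F̂(d) = Σ_i d_i F̂_i; the precise per-sample form is an assumption here, chosen only to mirror the block structure of (13).

```python
import numpy as np

def sample_lmi_terms(phis, phis_next):
    """One F_hat_i per transition (phi(s_i), phi(s'_i)); a sample-based analogue of (13).
    The exact per-sample form is assumed, mirroring the Phi^T D Phi and Phi^T D P Phi blocks."""
    terms = []
    for phi, phi_n in zip(phis, phis_next):
        A = np.outer(phi, phi)     # contributes to the Phi^T D Phi blocks
        B = np.outer(phi, phi_n)   # contributes to the Phi^T D P Phi block
        terms.append(np.block([[A, B], [B.T, A]]))
    return terms

def F_hat(d, terms):
    """Weighted empirical LMI matrix F_hat(d) = sum_i d_i * F_hat_i."""
    return sum(di * Fi for di, Fi in zip(d, terms))
```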

64 In practice, we want to constrain this distribution in a manner commensurate with the complexity of the feature space and the number of samples. [sent-133, score-0.055]

65 However, determining the best such distributions to use in practice remains an open problem for future work in this area. [sent-134, score-0.045]

66 Finally, since many empirical distributions will not satisfy F̂(d) ⪰ 0, we propose to "project" the empirical distribution onto this set by minimizing the KL divergence between the observed and optimized distributions, subject to the constraint that F̂(d) ⪰ 0. [sent-135, score-0.126]

67 Since this constraint is guaranteed to hold at the stationary distribution, the intuition here is that by moving closer to this set, we will likely obtain a better solution. [sent-136, score-0.181]

68 Formally, the final optimization problem, which we refer to as the TD-DO method (Temporal Difference Distribution Optimization), is given by min_d −Σ_{i=1}^m p̂_i log d_i subject to F̂(d) ⪰ 0, 1^T d = 1, d ∈ C (22). [sent-137, score-0.086]

69 Here C is some convex set that respects the metric constraints described above. [sent-140, score-0.064]

70 This is a convex optimization problem in d, and thus can be solved efficiently, though off-the-shelf solvers can perform quite poorly, especially for large dimension m. [sent-141, score-0.088]
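
For small problems, (22) can be handed directly to a generic convex solver; the sketch below uses cvxpy purely as an illustration (it is not the paper's solver, which is the first-order method described next, and the constraint set C is omitted).

```python
import cvxpy as cp
import numpy as np

def tddo_project(p_hat, terms):
    """Project the empirical distribution p_hat onto {d : F_hat(d) PSD, 1^T d = 1},
    minimizing the KL divergence (up to a constant) as in (22), with C omitted."""
    m = len(terms)
    d = cp.Variable(m)
    F = sum(d[i] * terms[i] for i in range(m))
    objective = cp.Minimize(-cp.sum(cp.multiply(p_hat, cp.log(d))))
    constraints = [(F + F.T) / 2 >> 0,   # symmetrized so cvxpy accepts the PSD constraint
                   cp.sum(d) == 1, d >= 0]
    cp.Problem(objective, constraints).solve(solver=cp.SCS)
    return d.value
```

As the text notes, generic solvers like this scale poorly in m, which is what motivates the specialized first-order dual method described next.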

71 2 Efficient Optimization: Here we present a first-order optimization method based on solving the dual of (22). [sent-143, score-0.054]

72 By properly exploiting the decomposability of the objective and low-rank structure of the dual problem, we develop an iterative optimization method where each gradient step can be computed very efficiently. [sent-144, score-0.072]

73 For simplicity we present the algorithm ignoring the constraint set C, though we discuss possible additional constraints briefly in the supplementary material. [sent-146, score-0.049]

74 We begin by forming the Lagrangian of (22), introducing Lagrange multipliers Z ∈ R^{2k×2k} for the constraint F̂(d) ⪰ 0 and ν ∈ R for the constraint 1^T d = 1. [sent-147, score-0.042]

75 This leads to the dual optimization problem max_{Z ⪰ 0, ν} min_d { −Σ_{i=1}^m p̂_i log d_i − tr(Z^T F̂(d)) + ν(1^T d − 1) } (23). [sent-148, score-0.12]

76 Treating Z as fixed, we maximize over ν and minimize over d in (23) using an equality-constrained, feasible-start Newton method [2, pg 528]. [sent-149, score-0.074]

77 Figure 3: Average approximation error of the TD methods, using different numbers of basis functions, for the random Markov chain (left) and diffusion chain (right). [sent-163, score-0.071]

78 Figure 4: Average approximation error, using off-policy distributions closer to or further from the stationary distribution (see text), for the random Markov chain (left) and diffusion chain (right). [sent-171, score-0.148]

79 Figure 5: Average approximation error for TD methods computed via sampling, for different numbers of samples, for the random Markov chain (left) and diffusion chain (right). [sent-182, score-0.408]

80 Although this is now a non-convex problem, local optimization of this objective is still guaranteed to give a global solution to the original semidefinite problem, provided we choose the rank of Y to be sufficient to represent the optimal solution [5]. [sent-185, score-0.161]

81 The gradient of this transformed problem is ∇_Y g(Y Y^T) = −2F̂(d)Y, which can be computed in time O(mkp) since each F̂_i term is a low-rank matrix, and we optimize the dual objective via an off-the-shelf LBFGS solver [12, 14]. [sent-186, score-0.058]
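
A sketch of the O(mkp) gradient computation (assumptions: the per-sample terms F̂_i have the block-outer-product form used in the earlier sketch, and the inner minimization over d and ν is taken as already done, so this only illustrates how the low-rank structure is exploited).

```python
import numpy as np

def dual_grad_Y(d, phis, phis_next, Y):
    """Gradient -2 * F_hat(d) @ Y without ever forming F_hat(d), in O(m k p) time.
    Assumes F_hat_i = [[phi phi^T, phi phi'^T], [phi' phi^T, phi phi^T]] per sample."""
    k = phis[0].shape[0]
    Y1, Y2 = Y[:k], Y[k:]                # split Y into its top and bottom k rows
    G = np.zeros_like(Y)
    for di, phi, phi_n in zip(d, phis, phis_next):
        a = phi @ Y1                     # phi^T  Y1, shape (p,)
        b = phi_n @ Y2                   # phi'^T Y2
        c = phi @ Y2                     # phi^T  Y2
        G[:k] += di * (np.outer(phi, a) + np.outer(phi, b))     # A Y1 + B Y2
        G[k:] += di * (np.outer(phi_n, a) + np.outer(phi, c))   # B^T Y1 + A Y2
    return -2.0 * G
```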

82 Though it is difficult to bound p a priori, we can check after the solution that our chosen value was sufficient for the global solution, and we have found that very low values (p = 1 or p = 2) were sufficient in our experiments. [sent-187, score-0.081]

83 5 Experiments: Here we present simple simulation experiments illustrating our proposed approach; while the evaluation is of course small scale, the results highlight the potential of TD-DO to improve TD algorithms both practically and theoretically. [sent-188, score-0.023]

84 Since the benefits of the method are clearest in terms of … [sent-189, score-0.018]

85 Figure 3 shows the average approximation error of the different algorithms with differing numbers of basis functions, over 1000 domains. [sent-206, score-0.116]

86 In this and all experiments other than those evaluating the effect of sampling, we use the full Φ and P matrices to compute the convex set, so that we are evaluating the performance of the approach in the limit of large numbers of samples. [sent-207, score-0.076]

87 We evaluate the approximation error ‖V̂ − V‖_D where D is the off-policy sampling distribution (so as to be as favorable as possible to off-policy TD). [sent-208, score-0.212]

88 In all cases the TD-DO algorithm improves upon the off-policy TD, though the degree of improvement can vary from minor to quite significant. [sent-209, score-0.028]

89 Figure 4 shows a similar result for varying the closeness of the sampling distribution to the stationary distribution; in our experiments, the off-policy distribution is sampled according to D ∼ Dir(1 + Cµ µ) where µ denotes the stationary distribution. [sent-210, score-0.441]
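
A one-line way to generate such off-policy distributions (a sketch; µ below is a placeholder stationary distribution, and larger C_µ concentrates the draw near µ):

```python
import numpy as np

def sample_off_policy(mu, C_mu, rng=None):
    """Draw D ~ Dirichlet(1 + C_mu * mu); larger C_mu pulls D toward mu."""
    rng = np.random.default_rng() if rng is None else rng
    return rng.dirichlet(1.0 + C_mu * np.asarray(mu))

mu = np.array([0.2, 0.3, 0.5])          # placeholder stationary distribution
for C_mu in [1, 10, 100, 1000]:
    print(C_mu, sample_off_policy(mu, C_mu))
```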

90 As expected, the off-policy approaches perform similarly for larger Cµ (approaching the stationary distribution), with TD-DO having a clear advantage when the off-policy distribution is far from the stationary distribution. [sent-211, score-0.299]

91 In Figure 5 we consider the effect of sampling on the algorithms. [sent-212, score-0.059]

92 6 Conclusion: The fundamental idea we have presented in this paper is that by considering a convex subset of off-policy distributions (and one which can be computed efficiently from samples), we can provide performance guarantees for the TD fixed point. [sent-215, score-0.103]

93 Although left for future work, we suspect that the same techniques we present here can also be extended to these other cases, potentially providing a wide range of analogous results that still apply under off-policy sampling. [sent-217, score-0.018]

94 We thank the reviewers for helpful comments and Bruno Scherrer for pointing out a potential improvement to the error bound. [sent-219, score-0.051]

95 Experimental details: For the random Markov chain, rows of P are drawn IID from a Dirichlet distribution, and the reward and bases are random normal, with |S| = 11. [sent-222, score-0.075]

96 On the convergence of stochastic iterative dynamic programming algorithms. [sent-247, score-0.023]

97 Low-rank optimization on the cone of positive semidefinite matrices. [sent-255, score-0.02]

98 Regularization and feature selection in least-squares temporal difference learning. [sent-262, score-0.032]

99 A convergent O(n) algorithm for off-policy temporal-difference learning with linear function approximation. [sent-331, score-0.02]


similar papers computed by tfidf model

tfidf for this paper:

wordName wordTfidf (topN-words)

[('td', 0.83), ('wd', 0.253), ('policy', 0.221), ('stationary', 0.131), ('chain', 0.107), ('lstd', 0.106), ('diffusion', 0.078), ('bases', 0.075), ('lbfgs', 0.071), ('di', 0.066), ('approximation', 0.065), ('kolter', 0.061), ('lmi', 0.06), ('sampling', 0.059), ('solution', 0.056), ('unboundedly', 0.053), ('error', 0.051), ('markov', 0.05), ('szepesvari', 0.049), ('maei', 0.046), ('closeness', 0.046), ('dp', 0.045), ('distributions', 0.045), ('projection', 0.043), ('tsitsiklis', 0.041), ('dj', 0.041), ('convex', 0.04), ('rewards', 0.039), ('feasible', 0.039), ('newton', 0.038), ('avoids', 0.038), ('bounds', 0.038), ('distribution', 0.037), ('arbitrarily', 0.037), ('states', 0.036), ('contraction', 0.036), ('quality', 0.035), ('pg', 0.035), ('bellman', 0.035), ('anything', 0.035), ('dual', 0.034), ('semide', 0.034), ('normalized', 0.033), ('zico', 0.033), ('counter', 0.033), ('temporal', 0.032), ('rk', 0.032), ('representable', 0.031), ('reinforcement', 0.03), ('said', 0.029), ('counterexample', 0.029), ('mu', 0.029), ('guaranteed', 0.029), ('mdp', 0.028), ('though', 0.028), ('bhatnagar', 0.028), ('diverge', 0.028), ('clusters', 0.026), ('matrix', 0.026), ('bound', 0.025), ('optimize', 0.024), ('respects', 0.024), ('priori', 0.024), ('xed', 0.024), ('illustrating', 0.023), ('approximator', 0.023), ('onto', 0.023), ('returning', 0.023), ('px', 0.023), ('convergence', 0.023), ('transition', 0.022), ('chains', 0.022), ('actual', 0.022), ('samples', 0.021), ('constraint', 0.021), ('wt', 0.021), ('convergent', 0.02), ('question', 0.02), ('optimization', 0.02), ('possibility', 0.02), ('sutton', 0.02), ('region', 0.019), ('projected', 0.019), ('discount', 0.019), ('brie', 0.018), ('evaluating', 0.018), ('relaxed', 0.018), ('rn', 0.018), ('analogous', 0.018), ('guarantees', 0.018), ('sharpened', 0.018), ('minfunc', 0.018), ('clearest', 0.018), ('chief', 0.018), ('pursing', 0.018), ('bruno', 0.018), ('countable', 0.018), ('absil', 0.018), ('commensurate', 0.018), ('decomposability', 0.018)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 1.0000004 283 nips-2011-The Fixed Points of Off-Policy TD

Author: J. Z. Kolter

Abstract: Off-policy learning, the ability for an agent to learn about a policy other than the one it is following, is a key element of Reinforcement Learning, and in recent years there has been much work on developing Temporal Difference (TD) algorithms that are guaranteed to converge under off-policy sampling. It has remained an open question, however, whether anything can be said a priori about the quality of the TD solution when off-policy sampling is employed with function approximation. In general the answer is no: for arbitrary off-policy sampling the error of the TD solution can be unboundedly large, even when the approximator can represent the true value function well. In this paper we propose a novel approach to address this problem: we show that by considering a certain convex subset of off-policy distributions we can indeed provide guarantees as to the solution quality similar to the on-policy case. Furthermore, we show that we can efficiently project onto this convex set using only samples generated from the system. The end result is a novel TD algorithm that has approximation guarantees even in the case of off-policy sampling and which empirically outperforms existing TD methods. 1

2 0.72823918 278 nips-2011-TD gamma: Re-evaluating Complex Backups in Temporal Difference Learning

Author: George Konidaris, Scott Niekum, Philip S. Thomas

Abstract: We show that the λ-return target used in the TD(λ) family of algorithms is the maximum likelihood estimator for a specific model of how the variance of an n-step return estimate increases with n. We introduce the γ-return estimator, an alternative target based on a more accurate model of variance, which defines the TDγ family of complex-backup temporal difference learning algorithms. We derive TDγ , the γ-return equivalent of the original TD(λ) algorithm, which eliminates the λ parameter but can only perform updates at the end of an episode and requires time and space proportional to the episode length. We then derive a second algorithm, TDγ (C), with a capacity parameter C. TDγ (C) requires C times more time and memory than TD(λ) and is incremental and online. We show that TDγ outperforms TD(λ) for any setting of λ on 4 out of 5 benchmark domains, and that TDγ (C) performs as well as or better than TDγ for intermediate settings of C. 1

3 0.3742058 88 nips-2011-Environmental statistics and the trade-off between model-based and TD learning in humans

Author: Dylan A. Simon, Nathaniel D. Daw

Abstract: There is much evidence that humans and other animals utilize a combination of model-based and model-free RL methods. Although it has been proposed that these systems may dominate according to their relative statistical efficiency in different circumstances, there is little specific evidence — especially in humans — as to the details of this trade-off. Accordingly, we examine the relative performance of different RL approaches under situations in which the statistics of reward are differentially noisy and volatile. Using theory and simulation, we show that model-free TD learning is relatively most disadvantaged in cases of high volatility and low noise. We present data from a decision-making experiment manipulating these parameters, showing that humans shift learning strategies in accord with these predictions. The statistical circumstances favoring model-based RL are also those that promote a high learning rate, which helps explain why, in psychology, the distinction between these strategies is traditionally conceived in terms of rule-based vs. incremental learning. 1

4 0.21635127 215 nips-2011-Policy Gradient Coagent Networks

Author: Philip S. Thomas

Abstract: We present a novel class of actor-critic algorithms for actors consisting of sets of interacting modules. We present, analyze theoretically, and empirically evaluate an update rule for each module, which requires only local information: the module’s input, output, and the TD error broadcast by a critic. Such updates are necessary when computation of compatible features becomes prohibitively difficult and are also desirable to increase the biological plausibility of reinforcement learning methods. 1

5 0.19433522 16 nips-2011-A reinterpretation of the policy oscillation phenomenon in approximate policy iteration

Author: Paul Wagner

Abstract: A majority of approximate dynamic programming approaches to the reinforcement learning problem can be categorized into greedy value function methods and value-based policy gradient methods. The former approach, although fast, is well known to be susceptible to the policy oscillation phenomenon. We take a fresh view to this phenomenon by casting a considerable subset of the former approach as a limiting special case of the latter. We explain the phenomenon in terms of this view and illustrate the underlying mechanism with artificial examples. We also use it to derive the constrained natural actor-critic algorithm that can interpolate between the aforementioned approaches. In addition, it has been suggested in the literature that the oscillation phenomenon might be subtly connected to the grossly suboptimal performance in the Tetris benchmark problem of all attempted approximate dynamic programming methods. We report empirical evidence against such a connection and in favor of an alternative explanation. Finally, we report scores in the Tetris problem that improve on existing dynamic programming based results. 1

6 0.13259995 18 nips-2011-Action-Gap Phenomenon in Reinforcement Learning

7 0.12664285 10 nips-2011-A Non-Parametric Approach to Dynamic Programming

8 0.11407657 206 nips-2011-Optimal Reinforcement Learning for Gaussian Systems

9 0.10219061 163 nips-2011-MAP Inference for Bayesian Inverse Reinforcement Learning

10 0.090663642 174 nips-2011-Monte Carlo Value Iteration with Macro-Actions

11 0.073138267 245 nips-2011-Selecting the State-Representation in Reinforcement Learning

12 0.072673209 97 nips-2011-Finite Time Analysis of Stratified Sampling for Monte Carlo

13 0.067643546 50 nips-2011-Budgeted Optimization with Concurrent Stochastic-Duration Experiments

14 0.065814175 212 nips-2011-Periodic Finite State Controllers for Efficient POMDP and DEC-POMDP Planning

15 0.061442312 264 nips-2011-Sparse Recovery with Brownian Sensing

16 0.054289468 43 nips-2011-Bayesian Partitioning of Large-Scale Distance Data

17 0.050507732 65 nips-2011-Convergent Fitted Value Iteration with Linear Function Approximation

18 0.050409485 190 nips-2011-Nonlinear Inverse Reinforcement Learning with Gaussian Processes

19 0.050014142 118 nips-2011-High-dimensional regression with noisy and missing data: Provable guarantees with non-convexity

20 0.049858913 229 nips-2011-Query-Aware MCMC


similar papers computed by lsi model

lsi for this paper:

topicId topicWeight

[(0, 0.173), (1, -0.188), (2, 0.115), (3, 0.21), (4, -0.388), (5, 0.077), (6, 0.011), (7, -0.031), (8, 0.004), (9, -0.015), (10, 0.269), (11, 0.022), (12, 0.2), (13, 0.198), (14, 0.142), (15, 0.286), (16, -0.257), (17, -0.288), (18, -0.154), (19, 0.088), (20, 0.064), (21, -0.043), (22, 0.04), (23, 0.02), (24, 0.018), (25, -0.016), (26, 0.02), (27, 0.003), (28, -0.087), (29, -0.015), (30, -0.002), (31, -0.028), (32, -0.002), (33, 0.088), (34, 0.032), (35, 0.025), (36, 0.01), (37, -0.036), (38, 0.014), (39, -0.02), (40, -0.052), (41, 0.048), (42, 0.036), (43, 0.016), (44, -0.011), (45, 0.067), (46, 0.001), (47, -0.037), (48, 0.059), (49, 0.03)]

similar papers list:

simIndex simValue paperId paperTitle

1 0.97186822 278 nips-2011-TD gamma: Re-evaluating Complex Backups in Temporal Difference Learning

Author: George Konidaris, Scott Niekum, Philip S. Thomas

Abstract: We show that the λ-return target used in the TD(λ) family of algorithms is the maximum likelihood estimator for a specific model of how the variance of an n-step return estimate increases with n. We introduce the γ-return estimator, an alternative target based on a more accurate model of variance, which defines the TDγ family of complex-backup temporal difference learning algorithms. We derive TDγ , the γ-return equivalent of the original TD(λ) algorithm, which eliminates the λ parameter but can only perform updates at the end of an episode and requires time and space proportional to the episode length. We then derive a second algorithm, TDγ (C), with a capacity parameter C. TDγ (C) requires C times more time and memory than TD(λ) and is incremental and online. We show that TDγ outperforms TD(λ) for any setting of λ on 4 out of 5 benchmark domains, and that TDγ (C) performs as well as or better than TDγ for intermediate settings of C. 1

same-paper 2 0.93924969 283 nips-2011-The Fixed Points of Off-Policy TD

Author: J. Z. Kolter

Abstract: Off-policy learning, the ability for an agent to learn about a policy other than the one it is following, is a key element of Reinforcement Learning, and in recent years there has been much work on developing Temporal Difference (TD) algorithms that are guaranteed to converge under off-policy sampling. It has remained an open question, however, whether anything can be said a priori about the quality of the TD solution when off-policy sampling is employed with function approximation. In general the answer is no: for arbitrary off-policy sampling the error of the TD solution can be unboundedly large, even when the approximator can represent the true value function well. In this paper we propose a novel approach to address this problem: we show that by considering a certain convex subset of off-policy distributions we can indeed provide guarantees as to the solution quality similar to the on-policy case. Furthermore, we show that we can efficiently project onto this convex set using only samples generated from the system. The end result is a novel TD algorithm that has approximation guarantees even in the case of off-policy sampling and which empirically outperforms existing TD methods. 1

3 0.64790595 88 nips-2011-Environmental statistics and the trade-off between model-based and TD learning in humans

Author: Dylan A. Simon, Nathaniel D. Daw

Abstract: There is much evidence that humans and other animals utilize a combination of model-based and model-free RL methods. Although it has been proposed that these systems may dominate according to their relative statistical efficiency in different circumstances, there is little specific evidence — especially in humans — as to the details of this trade-off. Accordingly, we examine the relative performance of different RL approaches under situations in which the statistics of reward are differentially noisy and volatile. Using theory and simulation, we show that model-free TD learning is relatively most disadvantaged in cases of high volatility and low noise. We present data from a decision-making experiment manipulating these parameters, showing that humans shift learning strategies in accord with these predictions. The statistical circumstances favoring model-based RL are also those that promote a high learning rate, which helps explain why, in psychology, the distinction between these strategies is traditionally conceived in terms of rule-based vs. incremental learning. 1

4 0.46392202 215 nips-2011-Policy Gradient Coagent Networks

Author: Philip S. Thomas

Abstract: We present a novel class of actor-critic algorithms for actors consisting of sets of interacting modules. We present, analyze theoretically, and empirically evaluate an update rule for each module, which requires only local information: the module’s input, output, and the TD error broadcast by a critic. Such updates are necessary when computation of compatible features becomes prohibitively difficult and are also desirable to increase the biological plausibility of reinforcement learning methods. 1

5 0.37946048 36 nips-2011-Analysis and Improvement of Policy Gradient Estimation

Author: Tingting Zhao, Hirotaka Hachiya, Gang Niu, Masashi Sugiyama

Abstract: Policy gradient is a useful model-free reinforcement learning approach, but it tends to suffer from instability of gradient estimates. In this paper, we analyze and improve the stability of policy gradient methods. We first prove that the variance of gradient estimates in the PGPE (policy gradients with parameter-based exploration) method is smaller than that of the classical REINFORCE method under a mild assumption. We then derive the optimal baseline for PGPE, which contributes to further reducing the variance. We also theoretically show that PGPE with the optimal baseline is more preferable than REINFORCE with the optimal baseline in terms of the variance of gradient estimates. Finally, we demonstrate the usefulness of the improved PGPE method through experiments. 1

6 0.36562121 10 nips-2011-A Non-Parametric Approach to Dynamic Programming

7 0.34455529 206 nips-2011-Optimal Reinforcement Learning for Gaussian Systems

8 0.3216854 16 nips-2011-A reinterpretation of the policy oscillation phenomenon in approximate policy iteration

9 0.31888551 300 nips-2011-Variance Reduction in Monte-Carlo Tree Search

10 0.2854096 65 nips-2011-Convergent Fitted Value Iteration with Linear Function Approximation

11 0.27895492 18 nips-2011-Action-Gap Phenomenon in Reinforcement Learning

12 0.27574885 237 nips-2011-Reinforcement Learning using Kernel-Based Stochastic Factorization

13 0.23166899 174 nips-2011-Monte Carlo Value Iteration with Macro-Actions

14 0.22453728 163 nips-2011-MAP Inference for Bayesian Inverse Reinforcement Learning

15 0.2236436 245 nips-2011-Selecting the State-Representation in Reinforcement Learning

16 0.21143511 50 nips-2011-Budgeted Optimization with Concurrent Stochastic-Duration Experiments

17 0.20806304 228 nips-2011-Quasi-Newton Methods for Markov Chain Monte Carlo

18 0.19273834 95 nips-2011-Fast and Accurate k-means For Large Datasets

19 0.19142027 264 nips-2011-Sparse Recovery with Brownian Sensing

20 0.18991978 144 nips-2011-Learning Auto-regressive Models from Sequence and Non-sequence Data


similar papers computed by lda model

lda for this paper:

topicId topicWeight

[(0, 0.025), (4, 0.058), (11, 0.021), (14, 0.146), (20, 0.023), (26, 0.028), (31, 0.158), (33, 0.024), (43, 0.057), (45, 0.111), (57, 0.025), (65, 0.018), (74, 0.072), (83, 0.053), (84, 0.014), (99, 0.073)]

similar papers list:

simIndex simValue paperId paperTitle

1 0.84687746 61 nips-2011-Contextual Gaussian Process Bandit Optimization

Author: Andreas Krause, Cheng S. Ong

Abstract: How should we design experiments to maximize performance of a complex system, taking into account uncontrollable environmental conditions? How should we select relevant documents (ads) to display, given information about the user? These tasks can be formalized as contextual bandit problems, where at each round, we receive context (about the experimental conditions, the query), and have to choose an action (parameters, documents). The key challenge is to trade off exploration by gathering data for estimating the mean payoff function over the context-action space, and to exploit by choosing an action deemed optimal based on the gathered data. We model the payoff function as a sample from a Gaussian process defined over the joint context-action space, and develop CGP-UCB, an intuitive upper-confidence style algorithm. We show that by mixing and matching kernels for contexts and actions, CGP-UCB can handle a variety of practical applications. We further provide generic tools for deriving regret bounds when using such composite kernel functions. Lastly, we evaluate our algorithm on two case studies, in the context of automated vaccine design and sensor management. We show that context-sensitive optimization outperforms no or naive use of context. 1

same-paper 2 0.83367676 283 nips-2011-The Fixed Points of Off-Policy TD

Author: J. Z. Kolter

Abstract: Off-policy learning, the ability for an agent to learn about a policy other than the one it is following, is a key element of Reinforcement Learning, and in recent years there has been much work on developing Temporal Difference (TD) algorithms that are guaranteed to converge under off-policy sampling. It has remained an open question, however, whether anything can be said a priori about the quality of the TD solution when off-policy sampling is employed with function approximation. In general the answer is no: for arbitrary off-policy sampling the error of the TD solution can be unboundedly large, even when the approximator can represent the true value function well. In this paper we propose a novel approach to address this problem: we show that by considering a certain convex subset of off-policy distributions we can indeed provide guarantees as to the solution quality similar to the on-policy case. Furthermore, we show that we can efficiently project onto this convex set using only samples generated from the system. The end result is a novel TD algorithm that has approximation guarantees even in the case of off-policy sampling and which empirically outperforms existing TD methods. 1

3 0.82436562 168 nips-2011-Maximum Margin Multi-Instance Learning

Author: Hua Wang, Heng Huang, Farhad Kamangar, Feiping Nie, Chris H. Ding

Abstract: Multi-instance learning (MIL) considers input as bags of instances, in which labels are assigned to the bags. MIL is useful in many real-world applications. For example, in image categorization semantic meanings (labels) of an image mostly arise from its regions (instances) instead of the entire image (bag). Existing MIL methods typically build their models using the Bag-to-Bag (B2B) distance, which are often computationally expensive and may not truly reflect the semantic similarities. To tackle this, in this paper we approach MIL problems from a new perspective using the Class-to-Bag (C2B) distance, which directly assesses the relationships between the classes and the bags. Taking into account the two major challenges in MIL, high heterogeneity on data and weak label association, we propose a novel Maximum Margin Multi-Instance Learning (M3 I) approach to parameterize the C2B distance by introducing the class specific distance metrics and the locally adaptive significance coefficients. We apply our new approach to the automatic image categorization tasks on three (one single-label and two multilabel) benchmark data sets. Extensive experiments have demonstrated promising results that validate the proposed method.

4 0.82271755 75 nips-2011-Dynamical segmentation of single trials from population neural data

Author: Biljana Petreska, Byron M. Yu, John P. Cunningham, Gopal Santhanam, Stephen I. Ryu, Krishna V. Shenoy, Maneesh Sahani

Abstract: Simultaneous recordings of many neurons embedded within a recurrentlyconnected cortical network may provide concurrent views into the dynamical processes of that network, and thus its computational function. In principle, these dynamics might be identified by purely unsupervised, statistical means. Here, we show that a Hidden Switching Linear Dynamical Systems (HSLDS) model— in which multiple linear dynamical laws approximate a nonlinear and potentially non-stationary dynamical process—is able to distinguish different dynamical regimes within single-trial motor cortical activity associated with the preparation and initiation of hand movements. The regimes are identified without reference to behavioural or experimental epochs, but nonetheless transitions between them correlate strongly with external events whose timing may vary from trial to trial. The HSLDS model also performs better than recent comparable models in predicting the firing rate of an isolated neuron based on the firing rates of others, suggesting that it captures more of the “shared variance” of the data. Thus, the method is able to trace the dynamical processes underlying the coordinated evolution of network activity in a way that appears to reflect its computational role. 1

5 0.82156157 229 nips-2011-Query-Aware MCMC

Author: Michael L. Wick, Andrew McCallum

Abstract: Traditional approaches to probabilistic inference such as loopy belief propagation and Gibbs sampling typically compute marginals for all the unobserved variables in a graphical model. However, in many real-world applications the user’s interests are focused on a subset of the variables, specified by a query. In this case it would be wasteful to uniformly sample, say, one million variables when the query concerns only ten. In this paper we propose a query-specific approach to MCMC that accounts for the query variables and their generalized mutual information with neighboring variables in order to achieve higher computational efficiency. Surprisingly there has been almost no previous work on query-aware MCMC. We demonstrate the success of our approach with positive experimental results on a wide range of graphical models. 1

6 0.81909174 249 nips-2011-Sequence learning with hidden units in spiking neural networks

7 0.81855679 241 nips-2011-Scalable Training of Mixture Models via Coresets

8 0.81828034 66 nips-2011-Crowdclustering

9 0.81619239 37 nips-2011-Analytical Results for the Error in Filtering of Gaussian Processes

10 0.81210041 57 nips-2011-Comparative Analysis of Viterbi Training and Maximum Likelihood Estimation for HMMs

11 0.81154937 180 nips-2011-Multiple Instance Filtering

12 0.81047451 243 nips-2011-Select and Sample - A Model of Efficient Neural Inference and Learning

13 0.80806744 278 nips-2011-TD gamma: Re-evaluating Complex Backups in Temporal Difference Learning

14 0.80595547 206 nips-2011-Optimal Reinforcement Learning for Gaussian Systems

15 0.80406505 301 nips-2011-Variational Gaussian Process Dynamical Systems

16 0.80343837 221 nips-2011-Priors over Recurrent Continuous Time Processes

17 0.80034977 102 nips-2011-Generalised Coupled Tensor Factorisation

18 0.80001163 258 nips-2011-Sparse Bayesian Multi-Task Learning

19 0.79969668 158 nips-2011-Learning unbelievable probabilities

20 0.7977429 300 nips-2011-Variance Reduction in Monte-Carlo Tree Search