nips nips2011 nips2011-278 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: George Konidaris, Scott Niekum, Philip S. Thomas
Abstract: We show that the λ-return target used in the TD(λ) family of algorithms is the maximum likelihood estimator for a specific model of how the variance of an n-step return estimate increases with n. We introduce the γ-return estimator, an alternative target based on a more accurate model of variance, which defines the TDγ family of complex-backup temporal difference learning algorithms. We derive TDγ , the γ-return equivalent of the original TD(λ) algorithm, which eliminates the λ parameter but can only perform updates at the end of an episode and requires time and space proportional to the episode length. We then derive a second algorithm, TDγ (C), with a capacity parameter C. TDγ (C) requires C times more time and memory than TD(λ) and is incremental and online. We show that TDγ outperforms TD(λ) for any setting of λ on 4 out of 5 benchmark domains, and that TDγ (C) performs as well as or better than TDγ for intermediate settings of C. 1
Reference: text
sentIndex sentText sentNum sentScore
1 We show that the λ-return target used in the TD(λ) family of algorithms is the maximum likelihood estimator for a specific model of how the variance of an n-step return estimate increases with n. [sent-6, score-0.454]
2 We introduce the γ-return estimator, an alternative target based on a more accurate model of variance, which defines the TDγ family of complex-backup temporal difference learning algorithms. [sent-7, score-0.129]
3 We derive TDγ , the γ-return equivalent of the original TD(λ) algorithm, which eliminates the λ parameter but can only perform updates at the end of an episode and requires time and space proportional to the episode length. [sent-8, score-0.498]
4 We then derive a second algorithm, TDγ (C), with a capacity parameter C. [sent-9, score-0.025]
5 TDγ (C) requires C times more time and memory than TD(λ) and is incremental and online. [sent-10, score-0.117]
6 We show that TDγ outperforms TD(λ) for any setting of λ on 4 out of 5 benchmark domains, and that TDγ (C) performs as well as or better than TDγ for intermediate settings of C. [sent-11, score-0.067]
7 Efficient value function learning algorithms are crucial to this process and have been the focus of a great deal of reinforcement learning research. [sent-13, score-0.079]
8 The most successful and widely-used family of value function algorithms is the TD(λ) family [2]. [sent-14, score-0.128]
9 The TD(λ) family forms an estimate of return, called the λ-return, that blends low-variance, bootstrapped, biased temporal-difference estimates of return with high-variance, unbiased Monte Carlo estimates of return, using a parameter λ. [sent-15, score-0.694]
10 Our goal is to understand the modeling assumptions implicit in the λ-return formulation and improve them. [sent-17, score-0.019]
11 We show that the λ-return estimator is only a maximum-likelihood estimator of return given a specific model of how the variance of an n-step return estimate increases with n. [sent-18, score-0.69]
12 We propose a more accurate model of that variance increase and use it to obtain an expression for a new return estimator, the γ-return. [sent-19, score-0.274]
13 This results in the TDγ family of algorithms, of which we derive TDγ , the γ-return version of the original TD(λ) algorithm. [sent-20, score-0.08]
14 TDγ is only suitable for the batch setting where updates occur at the end of the episode and requires time and space proportional to the length of the episode. (∗ All three authors are primary authors on this occasion.) [sent-21, score-0.29]
15 We then derive a second algorithm, TDγ (C), which requires C times more time and memory than TD(λ) and can be used in an incremental and online setting. [sent-23, score-0.142]
16 We show that TDγ outperforms TD(λ) for any setting of λ on 4 out of 5 benchmark domains, and that TDγ (C) performs as well as or better than TDγ for intermediate settings of C. [sent-24, score-0.067]
17 Monte Carlo algorithms (for episodic tasks) do not use intermediate estimates but instead use the full return sample directly: R^MC_{s_t} = Σ_{i=0}^{L−1} γ^i r_{t+i}, (2) for an episode L transitions in length after time t. [sent-27, score-0.635]
18 These two types of return estimates can be considered instances of the more general notion of an n-step return sample, for n ≥ 1: R^(n)_{s_t} = r_t + γ r_{t+1} + γ² r_{t+2} + ... + γ^{n−1} r_{t+n−1} + γ^n V(s_{t+n}). (3) [sent-28, score-0.602]
19 Here, n transitions are observed from the MDP and the remaining portion of return is estimated using the value function. [sent-32, score-0.256]
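As a concrete illustration of equations (2) and (3), the sketch below computes the Monte Carlo return and an n-step return from recorded rewards and current value estimates; the function names and the convention that the value list carries a terminal entry are assumptions for illustration, not code from the paper.

```python
def monte_carlo_return(rewards, t, gamma):
    """Equation (2): full sample return from time t for an episode that
    lasts L = len(rewards) - t more transitions."""
    return sum(gamma ** i * r for i, r in enumerate(rewards[t:]))

def n_step_return(rewards, values, t, n, gamma):
    """Equation (3): observe n transitions, then bootstrap with V(s_{t+n}).

    values[t + n] is the current estimate V(s_{t+n}); it is assumed to be 0
    if s_{t+n} is the terminal state."""
    partial = sum(gamma ** i * rewards[t + i] for i in range(n))
    return partial + gamma ** n * values[t + n]
```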
20 The important observation here is that all n-step return estimates can be used simultaneously for learning. [sent-33, score-0.273]
21 The TD(λ) family of algorithms accomplishes this using a parameter λ ∈ [0, 1] to average n-step return estimates, according to the following equation: R^λ_{s_t} = (1 − λ) Σ_{n=0}^{∞} λ^n R^(n+1)_{s_t}. (4) [sent-34, score-0.312]
22 Note that for any episodic MDP we always obtain a finite episode length. [sent-35, score-0.239]
23 The usual formulation of an episodic MDP uses absorbing terminal states—states where only zero-reward self-transitions are available. [sent-36, score-0.129]
24 In such cases the n-step returns past the end of the episode are all equal, and therefore TD(λ) allocates the weights corresponding to all of those return estimates to the final transition. [sent-37, score-0.63]
25 R^λ_{s_t}, known as the λ-return, is an estimator that blends one-step temporal difference estimates (which are biased, but low variance) at λ = 0 and Monte Carlo estimates (which are unbiased, but high variance) at λ = 1. [sent-38, score-0.3]
26 This forms the target for the entire family of TD(λ) algorithms, whose members differ largely in their use of the resulting estimates to update the value function. [sent-39, score-0.106]
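A sketch of equation (4) for a finite episode, including the convention just described of allocating the weight of every n-step return past the end of the episode to the final (Monte Carlo) return; this is an illustrative reconstruction, not the authors' implementation.

```python
def lambda_return(n_step_returns, lam):
    """Equation (4) with the terminal-weight convention for a finite episode.

    n_step_returns[n - 1] holds R^{(n)}_{s_t}; the last entry is the Monte Carlo
    return.  Weights (1 - lam) * lam**(n - 1) are used for n < L, and the
    remaining weight mass lam**(L - 1) goes to the final return, so the weights
    sum to 1."""
    L = len(n_step_returns)
    target = sum((1 - lam) * lam ** (n - 1) * n_step_returns[n - 1]
                 for n in range(1, L))
    return target + lam ** (L - 1) * n_step_returns[-1]
```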
27 Viewing this as a statistical estimation problem where each n-step return is a sample of the true return, under what conditions and for what model is equation (4) a good estimator for return? [sent-42, score-0.362]
28 The most salient feature of the n-step returns is that their variances increase with n. [sent-43, score-0.126]
29 Therefore, consider the following model: each n-step return estimate R^(n)_{s_t} is sampled from a Gaussian distribution centered on the true return, R_{s_t}, with variance k(n) that is some increasing function of n. [sent-44, score-0.294]
30 If we assume the n-step returns are independent, then the likelihood function for return estimate R̂_{s_t} is: L(R̂_{s_t} | R^(1)_{s_t}, ..., R^(L)_{s_t}) = Π_{n=1}^{L} N(R^(n)_{s_t}; R̂_{s_t}, k(n)). (5) [sent-45, score-0.343]
31 We should note that this assumption is not quite true: only the Monte Carlo return is unbiased. [sent-49, score-0.222]
32 Maximizing the log of this equation obtains the maximum likelihood estimator for R̂_{s_t}: R̂_{s_t} = (Σ_{n=1}^{L} k(n)^{−1} R^(n)_{s_t}) / (Σ_{n=1}^{L} k(n)^{−1}). (6) [sent-52, score-0.164]
33 Thus, we obtain a weighted sum: each n-step return is weighted by the inverse of its variance and the entire sum is normalized so that the weights sum to 1. [sent-53, score-0.293]
34 From here we can see that if we let L go to infinity and set k(n) = λ^{−n}, 0 ≤ λ ≤ 1, then we obtain the λ-return estimator in equation (4), since Σ_{n=0}^{∞} λ^n = 1/(1 − λ). [sent-54, score-0.14]
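The weighted-sum form of equation (6) is easy to check numerically; the sketch below (an illustration, with the variance model passed in as a Python callable) recovers the λ-return weights of equation (4) when k(n) = λ^{−n} and the episode is long.

```python
def ml_return_weights(k, L):
    """Equation (6): the weight on the n-step return is proportional to 1/k(n)."""
    inv = [1.0 / k(n) for n in range(1, L + 1)]
    total = sum(inv)
    return [v / total for v in inv]

lam = 0.9
w = ml_return_weights(lambda n: lam ** (-n), L=500)
# For large L these approach the lambda-return weights (1 - lam) * lam**(n - 1):
print(w[0], (1 - lam))          # both approximately 0.1
print(w[1], (1 - lam) * lam)    # both approximately 0.09
```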
35 Thus, λ-return (as used in the TD(λ) family of algorithms) is the maximum-likelihood estimator of return under the following assumptions: [sent-55, score-0.364]
36 1. The n-step returns from a given state are independent. [sent-56, score-0.133]
37 2. The n-step returns from a given state are normally distributed with a mean of the true return. [sent-58, score-0.133]
38 3. The variances of the n-step returns from each state increase according to a geometric progression in n, with common ratio λ^{−1}. [sent-60, score-0.158]
39 In this view, the variance of an n-step sample return increases geometrically with n and λ parametrizes the shape of this geometric increase. [sent-62, score-0.274]
40 Examining the last term, we see that: Cov[R^(n−1)_{s_t}, −γ^{n−1} V(s_{t+n−1}) + γ^{n−1} r_{t+n−1} + γ^n V(s_{t+n})] = Cov[R^(n−1)_{s_t}, R^(n)_{s_t} − R^(n−1)_{s_t}] = Cov[R^(n−1)_{s_t}, R^(n)_{s_t}] − Cov[R^(n−1)_{s_t}, R^(n−1)_{s_t}] = Cov[R^(n−1)_{s_t}, R^(n)_{s_t}] − Var[R^(n−1)_{s_t}]. (9)–(12) [sent-64, score-0.103]
41 Since R^(n−1)_{s_t} and R^(n)_{s_t} are highly correlated—being two successive return samples—we assume that Cov[R^(n−1)_{s_t}, R^(n)_{s_t}] ≈ Var[R^(n−1)_{s_t}] (equality holds when R^(n−1)_{s_t} and R^(n)_{s_t} are perfectly correlated). [sent-65, score-0.222]
42 Hence, equation (8) becomes: Var[R^(n)_{s_t}] ≈ Var[R^(n−1)_{s_t}] + γ^{2(n−1)} Var[V(s_{t+n−1}) − (r_{t+n−1} + γV(s_{t+n}))]. (13) [sent-67, score-0.362]
43 (13) Notice that the final term on the right hand side of equation (13) is the discounted variance of the temporal difference error n-steps after the current state. [sent-68, score-0.198]
44 We assume that this variance is roughly the same for all states; let that value be κ. [sent-69, score-0.052]
45 Since κ also approximates the variance of the 1-step return (i.e., k(1) = κ), we obtain the following simple model of the variance of an n-step sample of return: k(n) = Σ_{i=1}^{n} γ^{2(i−1)} κ. (14) [sent-70, score-0.274] [sent-72, score-0.052]
47 Substituting the variance model of equation (14) into the maximum-likelihood estimator of equation (6) yields the γ-return (equation (15)), in which each n-step return R^(n)_{s_t} is weighted by (Σ_{i=1}^{n} γ^{2(i−1)})^{−1}, normalized so the weights sum to 1; this estimator has the virtue of being parameter-free since the κ values cancel. [sent-74, score-0.087]
48 Therefore, we need not estimate κ— under the assumption of independent, Gaussian n-step returns with variances increasing according to equation (13), equation (15) is the maximum likelihood estimator for any value of κ. [sent-75, score-0.339]
49 We call this estimator the γ-return since it weights the n-step returns according to the discount factor. [sent-76, score-0.236]
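Combining equations (6) and (14), the weight on the n-step return is proportional to (Σ_{i=1}^{n} γ^{2(i−1)})^{−1}, with κ cancelling in the normalization. The sketch below is an illustrative reconstruction of equation (15), usable to reproduce the qualitative weight curves in Figure 1; it is not the authors' code.

```python
def gamma_return_weights(gamma, L):
    """Weights of the gamma-return (equation (15)) for L available n-step returns.

    k(n) is proportional to sum_{i=1}^{n} gamma**(2 * (i - 1)) (equation (14));
    the constant kappa cancels when the weights are normalized."""
    k = [sum(gamma ** (2 * (i - 1)) for i in range(1, n + 1)) for n in range(1, L + 1)]
    inv = [1.0 / kn for kn in k]
    total = sum(inv)
    return [v / total for v in inv]

def gamma_return(n_step_returns, gamma):
    """Equation (15): inverse-variance-weighted average of the available n-step returns."""
    w = gamma_return_weights(gamma, len(n_step_returns))
    return sum(wi * Ri for wi, Ri in zip(w, n_step_returns))
```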
50 Figure 1 compares the weightings obtained using λ-return and γ-return for a few example trajectory lengths. [sent-77, score-0.181]
51 First, the λ-return weightings spike at the end of the trajectory, whereas the γ-return weightings do not. [sent-79, score-0.223]
52 This occurs because even though any sample trajectory has finite length, the λ-return as defined in equation (4) is actually an infinite sum; the remainder of the weight mass is allocated to the Monte Carlo return. [sent-80, score-0.164]
53 This allows the normalizing factor in equation (4) to be a constant, rather than having it depend on the length of the trajectory, as it does in equation (15) for the γ-return. [sent-81, score-0.185]
54 This significantly complicates the problem of obtaining an incremental algorithm using γ-return, as we shall see in later sections. [sent-82, score-0.098]
55 Figure 1: Example weights for trajectories of various lengths for λ-return (left) and γ-return (right); the horizontal axis in both panels is the return estimate length. [sent-102, score-0.065]
56 Second, while the λ-return weightings tend to zero over time, the γ-return weightings tend toward a constant. [sent-103, score-0.182]
57 This means that, for the λ-return, returns from very long trajectories are effectively "cut off" after some point and contribute essentially nothing, whereas for the γ-return all n-step returns beyond a certain length have roughly equal weight. [sent-104, score-0.196]
58 This also complicates the problem of obtaining an incremental algorithm using γ-return: even if we were to assume infinitely many n-step returns past the end of the episode, the normalizing constant would not become finite. [sent-105, score-0.27]
59 Nevertheless, we can use the γ-return estimator to obtain an entire family of TDγ learning algorithms; for any TD(λ) algorithm we can derive an equivalent TDγ algorithm. [sent-106, score-0.167]
60 In the following section, we derive TDγ , the γ-return equivalent of the original TD(λ) algorithm. [sent-107, score-0.025]
61 4 TDγ: Given a set of m trajectories T = {τ1, τ2, ..., τm}, where lτ = |τ| denotes the number of (s^τ_t, r^τ_t, s^τ_{t+1}) tuples in trajectory τ. [sent-108, score-0.046] [sent-111, score-0.197]
62 Using approximator Vθ with parameters θ to approximate V, the objective function for regression is: E(θ) = (1/2) Σ_{τ∈T} Σ_{t=0}^{lτ−1} [R̂^γ_{s^τ_t} − Vθ(s^τ_t)]² = (1/2) Σ_{τ∈T} Σ_{t=0}^{lτ−1} [Σ_{n=1}^{lτ−t} w(n, lτ−t) R^(n)_{s^τ_t} − Vθ(s^τ_t)]². (17) [sent-112, score-0.024]
64 We can substitute in n = u − t (where u is the current time step, s_t is the state we want to update the value estimate of, and n is the length of the n-step return that ends at the current time step) to get: ∆θ = −α Σ_{τ∈T} Σ_{t=0}^{lτ−1} Σ_{u=t+1}^{lτ} −w(u − t, lτ − t) [R^(u−t)_{s^τ_t} − Vθ(s^τ_t)] ∇θ Vθ(s^τ_t). (22) [sent-116, score-0.44]
65 Swapping the sums allows us to derive the backward version of TDγ: ∆θ = −α Σ_{τ∈T} Σ_{u=1}^{lτ} Σ_{t=0}^{u−1} −w(u − t, lτ − t) [R^(u−t)_{s^τ_t} − Vθ(s^τ_t)] ∇θ Vθ(s^τ_t). [sent-117, score-0.046]
66 This leads to TDγ for episodic tasks (given in Algorithm 1), which eliminates the eligibility trace parameter λ. [sent-119, score-0.16]
67 For episode length lτ and feature vector size F , the algorithm can be implemented with time complexity of O(lτ F ) per step and space complexity O(lτ F ). [sent-120, score-0.228]
68 Thus, Algorithm 1 can only be used for batch updates where each episode’s trajectory data is stored until an update is performed at the end of an episode; this is often undesirable, and in continuing tasks, impossible. [sent-122, score-0.152]
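Algorithm 1 itself is not reproduced in this text, but the forward-view update of equation (22) suggests an end-of-episode batch sketch like the one below for a linear approximator; the data layout, variable names, and reuse of gamma_return_weights from the earlier sketch are assumptions for illustration, not the published algorithm.

```python
import numpy as np

def td_gamma_episode_update(theta, episode, gamma, alpha):
    """Batch TD_gamma update applied once an episode has finished.

    episode is a list of (feature_vector, reward) pairs for a linear
    approximator V_theta(s) = theta @ x; V of the terminal state is taken as 0."""
    features, rewards = zip(*episode)
    L = len(episode)
    values = [float(theta @ x) for x in features] + [0.0]
    grad = np.zeros_like(theta)
    for t in range(L):
        horizon = L - t
        w = gamma_return_weights(gamma, horizon)   # sketch after equation (15)
        returns, partial = [], 0.0
        for n in range(1, horizon + 1):            # n-step returns R^{(n)}_{s_t}
            partial += gamma ** (n - 1) * rewards[t + n - 1]
            returns.append(partial + gamma ** n * values[t + n])
        target = sum(wn * Rn for wn, Rn in zip(w, returns))
        grad += (target - values[t]) * features[t]
    return theta + alpha * grad                    # gradient step on 0.5 * squared error
```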
69 First, for TD(λ) the weight normalization is a constant and does not depend on the length of the episode. [sent-124, score-0.07]
70 Thus, TD(λ) need only store the sum of the return estimates from each state, rather than having to store each individually. [sent-126, score-0.375]
71 One approach to deriving an incremental algorithm is to use only the first C n-step returns from each state, similar to truncated temporal differences [8]. [sent-127, score-0.242]
72 This eliminates the first barrier: weight normalization no longer relies on the episode length, except for the final C − 1 states, which can be corrected for after the episode ends. [sent-128, score-0.432]
73 This approach has time complexity O(CF ) and space complexity O(CF )—and is therefore C times more expensive than TD(λ)—and replaces λ with the more intuitive parameter C rather than eliminating it, but it affords the incremental TDγ (C) algorithm given in Algorithm 2. [sent-129, score-0.064]
74 Note that setting C = 1 obtains TD(0), and C ≥ lτ obtains TDγ . [sent-130, score-0.048]
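Algorithm 2 is likewise not reproduced here, but the description above (use only the first C n-step returns from each state, and correct the final C − 1 states once the episode ends) suggests an incremental sketch along the following lines; the sliding-window bookkeeping and all names are assumptions, not the published TDγ(C), and gamma_return_weights again refers to the earlier sketch.

```python
from collections import deque
import numpy as np

class TDGammaCSketch:
    """Incremental sketch in the spirit of TD_gamma(C): each visited state is updated
    from its first C n-step returns, or from all remaining returns if the episode
    ends sooner (the correction for the final C - 1 states)."""

    def __init__(self, num_features, C, gamma, alpha):
        self.theta = np.zeros(num_features)
        self.C, self.gamma, self.alpha = C, gamma, alpha
        self.window = deque()                    # recent (features, reward) pairs

    def value(self, x):
        return float(self.theta @ x)

    def step(self, features, reward, next_features, terminal):
        self.window.append((np.asarray(features), reward))
        if terminal:                             # flush everything; V(terminal) = 0
            while self.window:
                self._update_oldest(next_features, terminal=True)
        elif len(self.window) == self.C:
            self._update_oldest(next_features, terminal=False)

    def _update_oldest(self, next_features, terminal):
        pairs = list(self.window)
        x0 = pairs[0][0]
        w = gamma_return_weights(self.gamma, len(pairs))
        returns, partial = [], 0.0
        for n, (_, r) in enumerate(pairs, start=1):
            partial += self.gamma ** (n - 1) * r
            if n < len(pairs):
                v_next = self.value(pairs[n][0])           # V(s_{t+n}) from the window
            else:
                v_next = 0.0 if terminal else self.value(next_features)
            returns.append(partial + self.gamma ** n * v_next)
        target = sum(wn * Rn for wn, Rn in zip(w, returns))
        self.theta += self.alpha * (target - self.value(x0)) * x0
        self.window.popleft()
```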
75 The first is a 10 × 10 discrete gridworld with the goal in the bottom right, deterministic movement in the four cardinal directions, a terminal reward of +10 and −0. [sent-132, score-0.128]
76 For the gridworld, the agent selected one of the optimal actions with probability 0.4, and each of the other actions with probability 0. [sent-135, score-0.026] [sent-136, score-0.026]
78 The second domain is the 5-state discrete chain domain [1] with random transitions to the left and right, and γ = 0. [sent-138, score-0.105]
79 The third domain is the pendulum swing-up task [7] with a reward of 1.0 for entering a terminal state (the pendulum is nearly vertical) and zero elsewhere, and γ = 0. [sent-140, score-0.102] [sent-141, score-0.128]
81 The fourth domain is mountain car [1] with γ = 0.99, and using actions from a hand-coded policy with probability 0. [sent-145, score-0.068] [sent-146, score-0.065]
83 The fifth and final domain is the acrobot [1] with a terminal reward of 10 and −0. [sent-148, score-0.15]
84 In all cases the start state was selected uniformly from the set of nonterminal states. [sent-152, score-0.032]
85 A 5th order Fourier basis [9] was used as the function approximator for the 3 continuous domains. [sent-153, score-0.024]
86 TDγ outperforms TD(λ) for all settings of λ in 4 out of the 5 domains. [sent-155, score-0.025]
87 In the chain domain TDγ performs better than most settings of λ but slightly worse than the optimal setting. [sent-156, score-0.069]
88 These near-identical estimates will accumulate a large fraction of the weighting (see Figure 1) and come to dominate the γ-return estimate. [sent-159, score-0.051]
89 This suggests that once the returns start to become almost identical they should not be considered independent samples and should instead be discarded. [sent-160, score-0.101]
90 We may be able to calculate bounds on the diminishing differences between n-step returns due to γ, or empirically track the point at which those differences begin to diminish. [sent-162, score-0.168]
91 Another avenue for future research is deriving a version of TDγ or TDγ (C) that provably converges for off-policy data with function approximation, most likely using recent insights on gradient-descent based TD algorithms [10]. [sent-163, score-0.044]
92 We have shown that the widely used λ-return formulation is the maximum-likelihood estimator of return given three assumptions (see section 2). [sent-165, score-0.328]
93 We expect that re-evaluating all three will prove a fruitful avenue for future research. [sent-167, score-0.026]
94 Figure 2 (panels include Mountain Car and Acrobot): Mean squared error (MSE) over sample trajectories from five benchmark domains for TD(λ) with various settings of λ, TDγ, and TDγ(C), for various settings of C. [sent-323, score-0.136]
95 Each result is the minimum MSE (weighted by state visitation frequency) between each algorithm’s approximation and the correct value function (obtained using a very large number of Monte Carlo samples), found by searching over α at increments of 0. [sent-325, score-0.049]
96 Learning to predict by the methods of temporal differences. [sent-340, score-0.055]
97 GQ(λ): A general gradient algorithm for temporal-difference prediction learning with eligibility traces. [sent-358, score-0.047]
98 Temporal difference Bayesian model averaging: A Bayesian perspective on adapting lambda. [sent-363, score-0.041]
99 Truncating temporal differences: On the efficient implementation of TD(λ) for reinforcement learning. [sent-371, score-0.116]
100 Value function approximation in reinforcement learning using the Fourier basis. [sent-379, score-0.061]
wordName wordTfidf (topN-words)
[('td', 0.833), ('return', 0.222), ('episode', 0.179), ('rs', 0.137), ('st', 0.117), ('rt', 0.107), ('ar', 0.103), ('returns', 0.101), ('weightings', 0.091), ('trajectory', 0.09), ('estimator', 0.087), ('mse', 0.074), ('incremental', 0.064), ('cov', 0.062), ('reinforcement', 0.061), ('episodic', 0.06), ('family', 0.055), ('temporal', 0.055), ('eliminates', 0.053), ('memory', 0.053), ('equation', 0.053), ('variance', 0.052), ('store', 0.051), ('terminal', 0.051), ('estimates', 0.051), ('konidaris', 0.05), ('length', 0.049), ('maei', 0.047), ('eligibility', 0.047), ('lambda', 0.047), ('gridworld', 0.047), ('trajectories', 0.046), ('pendulum', 0.045), ('carlo', 0.042), ('acrobot', 0.042), ('backups', 0.042), ('monte', 0.041), ('gamma', 0.041), ('end', 0.041), ('policy', 0.039), ('mdp', 0.039), ('blends', 0.037), ('transitions', 0.034), ('niekum', 0.034), ('complicates', 0.034), ('discard', 0.033), ('state', 0.032), ('ru', 0.03), ('normalizing', 0.03), ('reward', 0.03), ('discount', 0.029), ('silver', 0.027), ('amherst', 0.027), ('domain', 0.027), ('avenue', 0.026), ('actions', 0.026), ('george', 0.026), ('scott', 0.026), ('derive', 0.025), ('variances', 0.025), ('settings', 0.025), ('obtains', 0.024), ('approximator', 0.024), ('cf', 0.024), ('diminishing', 0.023), ('mountain', 0.022), ('differences', 0.022), ('adapting', 0.022), ('intermediate', 0.022), ('ri', 0.022), ('backward', 0.021), ('updates', 0.021), ('afosr', 0.021), ('weight', 0.021), ('sutton', 0.02), ('benchmark', 0.02), ('estimate', 0.02), ('fourier', 0.02), ('domains', 0.02), ('unbiased', 0.019), ('assumptions', 0.019), ('difference', 0.019), ('discounted', 0.019), ('weights', 0.019), ('car', 0.019), ('hamid', 0.018), ('osentoski', 0.018), ('absorbing', 0.018), ('truncating', 0.018), ('algorithms', 0.018), ('action', 0.018), ('chain', 0.017), ('biased', 0.017), ('szepesvari', 0.017), ('accomplishes', 0.017), ('visitation', 0.017), ('allocates', 0.017), ('mahadevan', 0.017), ('sridhar', 0.017), ('rewards', 0.016)]
simIndex simValue paperId paperTitle
same-paper 1 1.0000004 278 nips-2011-TD gamma: Re-evaluating Complex Backups in Temporal Difference Learning
Author: George Konidaris, Scott Niekum, Philip S. Thomas
Abstract: We show that the λ-return target used in the TD(λ) family of algorithms is the maximum likelihood estimator for a specific model of how the variance of an nstep return estimate increases with n. We introduce the γ-return estimator, an alternative target based on a more accurate model of variance, which defines the TDγ family of complex-backup temporal difference learning algorithms. We derive TDγ , the γ-return equivalent of the original TD(λ) algorithm, which eliminates the λ parameter but can only perform updates at the end of an episode and requires time and space proportional to the episode length. We then derive a second algorithm, TDγ (C), with a capacity parameter C. TDγ (C) requires C times more time and memory than TD(λ) and is incremental and online. We show that TDγ outperforms TD(λ) for any setting of λ on 4 out of 5 benchmark domains, and that TDγ (C) performs as well as or better than TDγ for intermediate settings of C. 1
2 0.72823918 283 nips-2011-The Fixed Points of Off-Policy TD
Author: J. Z. Kolter
Abstract: Off-policy learning, the ability for an agent to learn about a policy other than the one it is following, is a key element of Reinforcement Learning, and in recent years there has been much work on developing Temporal Different (TD) algorithms that are guaranteed to converge under off-policy sampling. It has remained an open question, however, whether anything can be said a priori about the quality of the TD solution when off-policy sampling is employed with function approximation. In general the answer is no: for arbitrary off-policy sampling the error of the TD solution can be unboundedly large, even when the approximator can represent the true value function well. In this paper we propose a novel approach to address this problem: we show that by considering a certain convex subset of off-policy distributions we can indeed provide guarantees as to the solution quality similar to the on-policy case. Furthermore, we show that we can efficiently project on to this convex set using only samples generated from the system. The end result is a novel TD algorithm that has approximation guarantees even in the case of off-policy sampling and which empirically outperforms existing TD methods. 1
3 0.41166645 88 nips-2011-Environmental statistics and the trade-off between model-based and TD learning in humans
Author: Dylan A. Simon, Nathaniel D. Daw
Abstract: There is much evidence that humans and other animals utilize a combination of model-based and model-free RL methods. Although it has been proposed that these systems may dominate according to their relative statistical efficiency in different circumstances, there is little specific evidence — especially in humans — as to the details of this trade-off. Accordingly, we examine the relative performance of different RL approaches under situations in which the statistics of reward are differentially noisy and volatile. Using theory and simulation, we show that model-free TD learning is relatively most disadvantaged in cases of high volatility and low noise. We present data from a decision-making experiment manipulating these parameters, showing that humans shift learning strategies in accord with these predictions. The statistical circumstances favoring model-based RL are also those that promote a high learning rate, which helps explain why, in psychology, the distinction between these strategies is traditionally conceived in terms of rulebased vs. incremental learning. 1
4 0.17596577 215 nips-2011-Policy Gradient Coagent Networks
Author: Philip S. Thomas
Abstract: We present a novel class of actor-critic algorithms for actors consisting of sets of interacting modules. We present, analyze theoretically, and empirically evaluate an update rule for each module, which requires only local information: the module’s input, output, and the TD error broadcast by a critic. Such updates are necessary when computation of compatible features becomes prohibitively difficult and are also desirable to increase the biological plausibility of reinforcement learning methods. 1
5 0.11823343 206 nips-2011-Optimal Reinforcement Learning for Gaussian Systems
Author: Philipp Hennig
Abstract: The exploration-exploitation trade-off is among the central challenges of reinforcement learning. The optimal Bayesian solution is intractable in general. This paper studies to what extent analytic statements about optimal learning are possible if all beliefs are Gaussian processes. A first order approximation of learning of both loss and dynamics, for nonlinear, time-varying systems in continuous time and space, subject to a relatively weak restriction on the dynamics, is described by an infinite-dimensional partial differential equation. An approximate finite-dimensional projection gives an impression for how this result may be helpful. 1
6 0.089489445 300 nips-2011-Variance Reduction in Monte-Carlo Tree Search
7 0.078917332 16 nips-2011-A reinterpretation of the policy oscillation phenomenon in approximate policy iteration
8 0.076568969 245 nips-2011-Selecting the State-Representation in Reinforcement Learning
9 0.061844975 210 nips-2011-PAC-Bayesian Analysis of Contextual Bandits
10 0.057059817 163 nips-2011-MAP Inference for Bayesian Inverse Reinforcement Learning
11 0.056820061 18 nips-2011-Action-Gap Phenomenon in Reinforcement Learning
12 0.049911149 264 nips-2011-Sparse Recovery with Brownian Sensing
13 0.048407733 10 nips-2011-A Non-Parametric Approach to Dynamic Programming
14 0.045999404 174 nips-2011-Monte Carlo Value Iteration with Macro-Actions
15 0.044303946 36 nips-2011-Analysis and Improvement of Policy Gradient Estimation
16 0.036594532 145 nips-2011-Learning Eigenvectors for Free
17 0.036537051 219 nips-2011-Predicting response time and error rates in visual search
18 0.035756275 48 nips-2011-Blending Autonomous Exploration and Apprenticeship Learning
19 0.035371128 68 nips-2011-Demixed Principal Component Analysis
20 0.035332929 155 nips-2011-Learning to Agglomerate Superpixel Hierarchies
topicId topicWeight
[(0, 0.131), (1, -0.164), (2, 0.127), (3, 0.206), (4, -0.319), (5, 0.057), (6, 0.009), (7, -0.057), (8, 0.017), (9, -0.016), (10, 0.269), (11, 0.002), (12, 0.221), (13, 0.244), (14, 0.183), (15, 0.317), (16, -0.274), (17, -0.319), (18, -0.158), (19, 0.065), (20, 0.068), (21, -0.015), (22, 0.007), (23, 0.026), (24, 0.06), (25, 0.002), (26, 0.013), (27, 0.023), (28, -0.093), (29, -0.032), (30, -0.021), (31, -0.056), (32, -0.003), (33, 0.068), (34, 0.056), (35, 0.022), (36, 0.028), (37, -0.048), (38, 0.035), (39, -0.009), (40, -0.019), (41, 0.067), (42, 0.061), (43, 0.028), (44, -0.005), (45, 0.042), (46, 0.022), (47, -0.029), (48, 0.045), (49, 0.04)]
simIndex simValue paperId paperTitle
same-paper 1 0.97643322 278 nips-2011-TD gamma: Re-evaluating Complex Backups in Temporal Difference Learning
Author: George Konidaris, Scott Niekum, Philip S. Thomas
Abstract: We show that the λ-return target used in the TD(λ) family of algorithms is the maximum likelihood estimator for a specific model of how the variance of an n-step return estimate increases with n. We introduce the γ-return estimator, an alternative target based on a more accurate model of variance, which defines the TDγ family of complex-backup temporal difference learning algorithms. We derive TDγ , the γ-return equivalent of the original TD(λ) algorithm, which eliminates the λ parameter but can only perform updates at the end of an episode and requires time and space proportional to the episode length. We then derive a second algorithm, TDγ (C), with a capacity parameter C. TDγ (C) requires C times more time and memory than TD(λ) and is incremental and online. We show that TDγ outperforms TD(λ) for any setting of λ on 4 out of 5 benchmark domains, and that TDγ (C) performs as well as or better than TDγ for intermediate settings of C. 1
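As a concrete point of reference for the abstract above (an illustration only, not the authors' TDγ algorithm), the sketch below computes the n-step returns of one recorded episode and then combines them either with the standard λ-return weights or with caller-supplied inverse-variance weights. The variance model passed to the second function is a placeholder assumption; the paper's contribution is precisely the choice of that model.

```python
import numpy as np

def n_step_returns(rewards, values, gamma):
    """All n-step return estimates from the start of one episode.

    rewards[i] is the reward received on step i+1; values[i] is the current
    estimate V(s_{i+1}) of the state reached after i+1 steps. The n-th entry
    of the result is sum_{i=1..n} gamma^(i-1) r_i + gamma^n V(s_n)."""
    T = len(rewards)
    G = np.empty(T)
    discounted_sum = 0.0
    for n in range(1, T + 1):
        discounted_sum += gamma ** (n - 1) * rewards[n - 1]
        # The final (n = T) return is the pure Monte Carlo return (terminal state assumed).
        bootstrap = gamma ** n * values[n - 1] if n < T else 0.0
        G[n - 1] = discounted_sum + bootstrap
    return G

def lambda_return(G, lam):
    """Standard λ-return: geometric (1-λ)λ^(n-1) weights over the n-step returns."""
    T = len(G)
    w = np.array([(1 - lam) * lam ** n for n in range(T)])
    w[-1] = lam ** (T - 1)  # remaining weight mass goes to the Monte Carlo return
    return float(w @ G)

def variance_weighted_return(G, variances):
    """Inverse-variance weighting of the same n-step returns; a generic stand-in
    for a γ-return-style estimator, with the variance model left to the caller."""
    w = 1.0 / np.asarray(variances, dtype=float)
    return float(w @ G / w.sum())
```

Swapping in a different `variances` vector changes which complex backup the second function produces; the abstract's claim is that the λ-return corresponds to one particular variance model and the γ-return to a more accurate one.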
2 0.88731915 283 nips-2011-The Fixed Points of Off-Policy TD
Author: J. Z. Kolter
Abstract: Off-policy learning, the ability for an agent to learn about a policy other than the one it is following, is a key element of Reinforcement Learning, and in recent years there has been much work on developing Temporal Difference (TD) algorithms that are guaranteed to converge under off-policy sampling. It has remained an open question, however, whether anything can be said a priori about the quality of the TD solution when off-policy sampling is employed with function approximation. In general the answer is no: for arbitrary off-policy sampling the error of the TD solution can be unboundedly large, even when the approximator can represent the true value function well. In this paper we propose a novel approach to address this problem: we show that by considering a certain convex subset of off-policy distributions we can indeed provide guarantees as to the solution quality similar to the on-policy case. Furthermore, we show that we can efficiently project onto this convex set using only samples generated from the system. The end result is a novel TD algorithm that has approximation guarantees even in the case of off-policy sampling and which empirically outperforms existing TD methods. 1
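To make the object being discussed concrete: with linear function approximation V(s) ≈ w·φ(s), the TD(0) fixed point under a given sampling distribution solves A w = b with A = E[φ(s)(φ(s) − γφ(s'))ᵀ] and b = E[r φ(s)]. The sketch below estimates that fixed point from transition samples (plain LSTD under whatever distribution generated the data); it is not the convex-projection method proposed in the paper above.

```python
import numpy as np

def lstd_fixed_point(transitions, feature_fn, gamma, ridge=1e-6):
    """Estimate the linear TD(0) fixed point A w = b from sampled transitions.

    transitions: iterable of (s, r, s_next) tuples drawn from some (possibly
    off-policy) sampling distribution; feature_fn maps a state to a 1-D array."""
    samples = [(feature_fn(s), r, feature_fn(s_next)) for s, r, s_next in transitions]
    d = len(samples[0][0])
    A = np.zeros((d, d))
    b = np.zeros(d)
    for phi, r, phi_next in samples:
        A += np.outer(phi, phi - gamma * phi_next)
        b += r * phi
    A /= len(samples)
    b /= len(samples)
    # A small ridge keeps the system solvable when features are redundant.
    return np.linalg.solve(A + ridge * np.eye(d), b)
```

Under on-policy sampling this solution carries the usual approximation-error bound; the paper's point is that for arbitrary off-policy distributions it can be arbitrarily bad unless the sampling distribution is suitably restricted.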
3 0.63859725 88 nips-2011-Environmental statistics and the trade-off between model-based and TD learning in humans
Author: Dylan A. Simon, Nathaniel D. Daw
Abstract: There is much evidence that humans and other animals utilize a combination of model-based and model-free RL methods. Although it has been proposed that these systems may dominate according to their relative statistical efficiency in different circumstances, there is little specific evidence — especially in humans — as to the details of this trade-off. Accordingly, we examine the relative performance of different RL approaches under situations in which the statistics of reward are differentially noisy and volatile. Using theory and simulation, we show that model-free TD learning is relatively most disadvantaged in cases of high volatility and low noise. We present data from a decision-making experiment manipulating these parameters, showing that humans shift learning strategies in accord with these predictions. The statistical circumstances favoring model-based RL are also those that promote a high learning rate, which helps explain why, in psychology, the distinction between these strategies is traditionally conceived in terms of rule-based vs. incremental learning. 1
4 0.35433233 215 nips-2011-Policy Gradient Coagent Networks
Author: Philip S. Thomas
Abstract: We present a novel class of actor-critic algorithms for actors consisting of sets of interacting modules. We present, analyze theoretically, and empirically evaluate an update rule for each module, which requires only local information: the module’s input, output, and the TD error broadcast by a critic. Such updates are necessary when computation of compatible features becomes prohibitively difficult and are also desirable to increase the biological plausibility of reinforcement learning methods. 1
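For readers who want a minimal mental model of an update driven only by a broadcast TD error, the sketch below is a textbook-style actor-critic step in which every learner sees just its own input, output, and the scalar TD error; it is not the coagent update rule derived in the paper, only the flavour of signal locality it builds on.

```python
import numpy as np

def actor_critic_step(theta, w, phi_s, phi_s_next, r, grad_log_pi, gamma,
                      alpha_actor=0.01, alpha_critic=0.1):
    """One generic actor-critic update using only a broadcast scalar TD error.

    w parameterises a linear critic V(s) = w · phi(s); theta parameterises the
    actor, and grad_log_pi is the gradient of log pi(a|s) w.r.t. theta for the
    action actually taken (supplied by the caller). Arguments are NumPy arrays."""
    delta = r + gamma * w @ phi_s_next - w @ phi_s     # TD error, broadcast to all modules
    w = w + alpha_critic * delta * phi_s               # critic: semi-gradient TD(0)
    theta = theta + alpha_actor * delta * grad_log_pi  # actor: policy-gradient step
    return theta, w, delta
```

Roughly, each module in a coagent network applies an update of this local form; the paper's analysis concerns when such purely local updates remain valid for networks of interacting modules.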
5 0.30410212 206 nips-2011-Optimal Reinforcement Learning for Gaussian Systems
Author: Philipp Hennig
Abstract: The exploration-exploitation trade-off is among the central challenges of reinforcement learning. The optimal Bayesian solution is intractable in general. This paper studies to what extent analytic statements about optimal learning are possible if all beliefs are Gaussian processes. A first order approximation of learning of both loss and dynamics, for nonlinear, time-varying systems in continuous time and space, subject to a relatively weak restriction on the dynamics, is described by an infinite-dimensional partial differential equation. An approximate finitedimensional projection gives an impression for how this result may be helpful. 1 Introduction – Optimal Reinforcement Learning Reinforcement learning is about doing two things at once: Optimizing a function while learning about it. These two objectives must be balanced: Ignorance precludes efficient optimization; time spent hunting after irrelevant knowledge incurs unnecessary loss. This dilemma is famously known as the exploration exploitation trade-off. Classic reinforcement learning often considers time cheap; the trade-off then plays a subordinate role to the desire for learning a “correct” model or policy. Many classic reinforcement learning algorithms thus rely on ad-hoc methods to control exploration, such as “ -greedy” [1], or “Thompson sampling” [2]. However, at least since a thesis by Duff [3] it has been known that Bayesian inference allows optimal balance between exploration and exploitation. It requires integration over every possible future trajectory under the current belief about the system’s dynamics, all possible new data acquired along those trajectories, and their effect on decisions taken along the way. This amounts to optimization and integration over a tree, of exponential cost in the size of the state space [4]. The situation is particularly dire for continuous space-times, where both depth and branching factor of the “tree” are uncountably infinite. Several authors have proposed approximating this lookahead through samples [5, 6, 7, 8], or ad-hoc estimators that can be shown to be in some sense close to the Bayes-optimal policy [9]. In a parallel development, recent work by Todorov [10], Kappen [11] and others introduced an idea to reinforcement learning long commonplace in other areas of machine learning: Structural assumptions, while restrictive, can greatly simplify inference problems. In particular, a recent paper by Simpkins et al. [12] showed that it is actually possible to solve the exploration exploitation trade-off locally, by constructing a linear approximation using a Kalman filter. Simpkins and colleagues further assumed to know the loss function, and the dynamics up to Brownian drift. Here, I use their work as inspiration for a study of general optimal reinforcement learning of dynamics and loss functions of an unknown, nonlinear, time-varying system (note that most reinforcement learning algorithms are restricted to time-invariant systems). The core assumption is that all uncertain variables are known up to Gaussian process uncertainty. The main result is a first-order description of optimal reinforcement learning in form of infinite-dimensional differential statements. This kind of description opens up new approaches to reinforcement learning. As an only initial example of such treatments, Section 4 1 presents an approximate Ansatz that affords an explicit reinforcement learning algorithm; tested in some simple but instructive experiments (Section 5). 
An intuitive description of the paper’s results is this: From prior and corresponding choice of learning machinery (Section 2), we construct statements about the dynamics of the learning process (Section 3). The learning machine itself provides a probabilistic description of the dynamics of the physical system. Combining both dynamics yields a joint system, which we aim to control optimally. Doing so amounts to simultaneously controlling exploration (controlling the learning system) and exploitation (controlling the physical system). Because large parts of the analysis rely on concepts from optimal control theory, this paper will use notation from that field. Readers more familiar with the reinforcement learning literature may wish to mentally replace coordinates x with states s, controls u with actions a, dynamics with transitions p(s | s, a) and utilities q with losses (negative rewards) −r. The latter is potentially confusing, so note that optimal control in this paper will attempt to minimize values, rather than to maximize them, as usual in reinforcement learning (these two descriptions are, of course, equivalent). 2 A Class of Learning Problems We consider the task of optimally controlling an uncertain system whose states s ≡ (x, t) ∈ K ≡ RD ×R lie in a D +1 dimensional Euclidean phase space-time: A cost Q (cumulated loss) is acquired at (x, t) with rate dQ/dt = q(x, t), and the first inference problem is to learn this analytic function q. A second, independent learning problem concerns the dynamics of the system. We assume the dynamics separate into free and controlled terms affine to the control: dx(t) = [f (x, t) + g(x, t)u(x, t)] dt (1) where u(x, t) is the control function we seek to optimize, and f, g are analytic functions. To simplify our analysis, we will assume that either f or g are known, while the other may be uncertain (or, alternatively, that it is possible to obtain independent samples from both functions). See Section 3 for a note on how this assumption may be relaxed. W.l.o.g., let f be uncertain and g known. Information about both q(x, t) and f (x, t) = [f1 , . . . , fD ] is acquired stochastically: A Poisson process of constant rate λ produces mutually independent samples yq (x, t) = q(x, t)+ q and yf d (x, t) = fd (x, t)+ fd where q 2 ∼ N (0, σq ); fd 2 ∼ N (0, σf d ). (2) The noise levels σq and σf are presumed known. Let our initial beliefs about q and f be given by D Gaussian processes GP kq (q; µq , Σq ); and independent Gaussian processes d GP kf d (fd ; µf d , Σf d ), respectively, with kernels kr , kf 1 , . . . , kf D over K, and mean / covariance functions µ / Σ. In other words, samples over the belief can be drawn using an infinite vector Ω of i.i.d. Gaussian variables, as 1/2 ˜ fd ([x, t]) = µf d ([x, t])+ 1/2 Σf d ([x, t], [x , t ])Ω(x , t )dx dt = µf d ([x, t])+(Σf d Ω)([x, t]) (3) the second equation demonstrates a compact notation for inner products that will be used throughout. It is important to note that f, q are unknown, but deterministic. At any point during learning, we can use the same samples Ω to describe uncertainty, while µ, Σ change during the learning process. To ensure continuous trajectories, we also need to regularize the control. Following control custom, 1 we introduce a quadratic control cost ρ(u) = 2 u R−1 u with control cost scaling matrix R. Its units [R] = [x/t]/[Q/x] relate the cost of changing location to the utility gained by doing so. 
The overall task is to find the optimal discounted horizon value ∞ v(x, t) = min u t 1 e−(τ −t)/γ q[χ[τ, u(χ, τ )], τ ] + u(χ, τ ) R−1 u(χ, τ ) dτ 2 (4) where χ(τ, u) is the trajectory generated by the dynamics defined in Equation (1), using the control law (policy) u(x, t). The exponential definition of the discount γ > 0 gives the unit of time to γ. Before beginning the analysis, consider the relative generality of this definition: We allow for a continuous phase space. Both loss and dynamics may be uncertain, of rather general nonlinear form, and may change over time. The specific choice of a Poisson process for the generation of samples is 2 somewhat ad-hoc, but some measure is required to quantify the flow of information through time. The Poisson process is in some sense the simplest such measure, assigning uniform probability density. An alternative is to assume that datapoints are acquired at regular intervals of width λ. This results in a quite similar model but, since the system’s dynamics still proceed in continuous time, can complicate notation. A downside is that we had to restrict the form of the dynamics. However, Eq. (1) still covers numerous physical systems studied in control, for example many mechanical systems, from classics like cart-and-pole to realistic models for helicopters [13]. 3 Optimal Control for the Learning Process The optimal solution to the exploration exploitation trade-off is formed by the dual control [14] of a joint representation of the physical system and the beliefs over it. In reinforcement learning, this idea is known as a belief-augmented POMDP [3, 4], but is not usually construed as a control problem. This section constructs the Hamilton-Jacobi-Bellman (HJB) equation of the joint control problem for the system described in Sec. 2, and analytically solves the equation for the optimal control. This necessitates a description of the learning algorithm’s dynamics: At time t = τ , let the system be at phase space-time sτ = (x(τ ), τ ) and have the Gaussian process belief GP(q; µτ (s), Στ (s, s )) over the function q (all derivations in this section will focus on q, and we will drop the sub-script q from many quantities for readability. The forms for f , or g, are entirely analogous, with independent Gaussian processes for each dimension d = 1, . . . , D). This belief stems from a finite number N of samples y 0 = [y1 , . . . , yN ] ∈ RN collected at space-times S 0 = [(x1 , t1 ), . . . , (xN , tN )] ≡ [s1 , . . . , sN ] ∈ KN (note that t1 to tN need not be equally spaced, ordered, or < τ ). For arbitrary points s∗ = (x∗ , t∗ ) ∈ K, the belief over q(s∗ ) is a Gaussian with mean function µτ , and co-variance function Στ [15] 2 µτ (s∗ ) = k(s∗ , S 0 )[K(S 0 , S 0 ) + σq I]−1 y 0 i i (5) 2 Στ (s∗ , s∗ ) = k(s∗ , s∗ ) − k(s∗ , S 0 )[K(S 0 , S 0 ) + σq I]−1 k(S 0 , s∗ ) i j i j i j where K(S 0 , S 0 ) is the Gram matrix with elements Kab = k(sa , sb ). We will abbreviate K0 ≡ 2 [K(S 0 , S 0 ) + σy I] from here on. The co-vector k(s∗ , S 0 ) has elements ki = k(s∗ , si ) and will be shortened to k0 . How does this belief change as time moves from τ to τ + dt? If dt 0, the chance of acquiring a datapoint yτ in this time is λ dt. Marginalising over this Poisson stochasticity, we expect one sample with probability λ dt, two samples with (λ dt)2 and so on. 
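Equation (5) quoted above is the standard Gaussian process regression posterior. Purely as a reference for that one step (square-exponential kernel assumed for concreteness; none of the paper's belief-dynamics or control machinery is reproduced), a minimal sketch:

```python
import numpy as np

def sq_exp_kernel(A, B, theta=1.0, length=1.0):
    """k(a, b) = theta^2 exp(-|a - b|^2 / (2 length^2)) for row-wise inputs A, B."""
    sq_dists = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return theta ** 2 * np.exp(-0.5 * sq_dists / length ** 2)

def gp_posterior(S0, y0, S_star, sigma, kernel=sq_exp_kernel):
    """Posterior mean and covariance at test points S_star given noisy
    observations y0 at inputs S0 (cf. Eq. (5) in the text above)."""
    K0 = kernel(S0, S0) + sigma ** 2 * np.eye(len(S0))   # Gram matrix plus observation noise
    k_star = kernel(S_star, S0)
    mean = k_star @ np.linalg.solve(K0, y0)
    cov = kernel(S_star, S_star) - k_star @ np.linalg.solve(K0, k_star.T)
    return mean, cov
```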
So the mean after dt is expected to be µτ + dt = λ dt (k0 , kτ ) K0 ξτ ξτ κτ −1 y0 −1 + (1 − λ dt − O(λ dt)2 ) · k0 K0 y 0 + O(λ dt)2 (6) yτ where we have defined the map kτ = k(s∗ , sτ ), the vector ξ τ with elements ξτ,i = k(si , sτ ), and 2 the scalar κτ = k(sτ , sτ ) + σq . Algebraic re-formulation yields −1 −1 −1 −1 µτ + dt = k0 K0 y 0 + λ(kt − k0 K0 ξ t )(κt − ξ t K0 ξ t )−1 (yt − ξ t K0 y 0 ) dt. −1 ξ τ K0 y 0 −1 ξ τ K0 ξ τ ) Note that = µ(sτ ), the mean prediction at sτ and (κτ − ¯ ¯ the marginal variance there. Hence, we can define scalars Σ, σ and write −1 (yτ − ξ τ K0 y 0 ) [Σ1/2 Ω](sτ ) + σω ¯τ = ≡ Σ1/2 Ω + στ ω ¯ −1 1/2 [Σ(sτ , sτ ) + σ 2 ]1/2 (κτ − ξ τ K0 ξ τ ) = 2 σq (7) + Σ(sτ , sτ ), with ω ∼ N (0, 1). (8) So the change to the mean consists of a deterministic but uncertain change whose effects accumulate linearly in time, and a stochastic change, caused by the independent noise process, whose variance accumulates linearly in time (in truth, these two points are considerably subtler, a detailed proof is left out for lack of space). We use the Wiener [16] measure dω to write dµsτ (s∗ ) = λ −1 kτ − k0 K0 ξ τ [Σ1/2 Ω](sτ ) + σω ¯τ dt ≡ λLsτ (s∗ )[Σ1/2 Ω dt + στ dω] ¯ −1 (κτ − ξ τ K0 ξ τ )−1/2 [Σ(sτ , sτ ) + σ 2 ]1/2 (9) where we have implicitly defined the innovation function L. Note that L is a function of both s∗ and sτ . A similar argument finds the change of the covariance function to be the deterministic rate dΣsτ (s∗ , s∗ ) = −λLsτ (s∗ )Lsτ (s∗ ) dt. i j i j 3 (10) So the dynamics of learning consist of a deterministic change to the covariance, and both deterministic and stochastic changes to the mean, both of which are samples a Gaussian processes with covariance function proportional to LL . This separation is a fundamental characteristic of GPs (it is the nonparametric version of a more straightforward notion for finite-dimensional Gaussian beliefs, for data with known noise magnitude). We introduce the belief-augmented space H containing states z(τ ) ≡ [x(τ ), τ, µτ (s), µτ 1 , . . . , µτ D , q f f Στ (s, s ), Στ 1 , . . . , Στ D ]. Since the means and covariances are functions, H is infinite-dimensional. q f f Under our beliefs, z(τ ) obeys a stochastic differential equation of the form dz = [A(z) + B(z)u + C(z)Ω] dt + D(z) dω (11) with free dynamics A, controlled dynamics Bu, uncertainty operator C, and noise operator D A = µτ (zx , zt ) , 1 , 0 , 0 , . . . , 0 , −λLq Lq , −λLf 1 Lf 1 , . . . , −λLf D Lf D ; f B = [g(s∗ ), 0, 0, 0, . . . ]; (12) 1/2 ¯q ¯ 1/2 ¯ 1/2 C = diag(Σf τ , 0, λLq Σ1/2 , λLf 1 Σf 1 , . . . , λLf D Σf d , 0, . . . , 0); D = diag(0, 0, λLq σq , λLf 1 σf 1 , . . . , λLf D σf D , 0, . . . , 0) ¯ ¯ ¯ (13) ∗ The value – the expected cost to go – of any state s is given by the Hamilton-Jacobi-Bellman equation, which follows from Bellman’s principle and a first-order expansion, using Eq. (4): 1 (14) µq (sτ ) + Σ1/2 Ωq + σq ωq + u R−1 u dt + v(zτ + dt ) dω dΩ qτ 2 1 v(zτ ) ∂v 1 µτ +Σ1/2 Ωq + u R−1 u+ + +[A+Bu+CΩ] v+ tr[D ( 2 v)D]dΩ dt q qτ 2 dt ∂t 2 v(zτ ) = min u = min u Integration over ω can be performed with ease, and removes the stochasticity from the problem; The uncertainty over Ω is a lot more challenging. Because the distribution over future losses is correlated through space and time, v, 2 v are functions of Ω, and the integral is nontrivial. But there are some obvious approximate approaches. 
For example, if we (inexactly) swap integration and minimisation, draw samples Ωi and solve for the value for each sample, we get an “average optimal controller”. This over-estimates the actual sum of future rewards by assuming the controller has access to the true system. It has the potential advantage of considering the actual optimal controller for every possible system, the disadvantage that the average of optima need not be optimal for any actual solution. On the other hand, if we ignore the correlation between Ω and v, we can integrate (17) locally, all terms in Ω drop out and we are left with an “optimal average controller”, which assumes that the system locally follows its average (mean) dynamics. This cheaper strategy was adopted in the following. Note that it is myopic, but not greedy in a simplistic sense – it does take the effect of learning into account. It amounts to a “global one-step look-ahead”. One could imagine extensions that consider the influence of Ω on v to a higher order, but these will be left for future work. Under this first-order approximation, analytic minimisation over u can be performed in closed form, and bears u(z) = −RB(z) v(z) = −Rg(x, t) x v(z). (15) The optimal Hamilton-Jacobi-Bellman equation is then 1 1 v − [ v] BRB v + tr D ( 2 v)D . 2 2 A more explicit form emerges upon re-inserting the definitions of Eq. (12) into Eq. (16): γ −1 v(z) = µτ + A q γ −1 v(z) = [µτ + µτ (zx , zt ) q f x + t 1 v(z) − [ 2 x v(z)] free drift cost c=q,f1 ,...,fD x v(z) control benefit − λ Lc Lc + g (zx , zt )Rg(zx , zt ) (16) Σc 1 v(z) + λ2 σc Lf d ( ¯2 2 exploration bonus 2 µf d v(z))Lf d (17) diffusion cost Equation (17) is the central result: Given Gaussian priors on nonlinear control-affine dynamic systems, up to a first order approximation, optimal reinforcement learning is described by an infinitedimensional second-order partial differential equation. It can be interpreted as follows (labels in the 4 equation, note the negative signs of “beneficial” terms): The value of a state comprises the immediate utility rate; the effect of the free drift through space-time and the benefit of optimal control; an exploration bonus of learning, and a diffusion cost engendered by the measurement noise. The first two lines of the right hand side describe effects from the phase space-time subspace of the augmented space, while the last line describes effects from the belief part of the augmented space. The former will be called exploitation terms, the latter exploration terms, for the following reason: If the first two lines line dominate the right hand side of Equation (17) in absolute size, then future losses are governed by the physical sub-space – caused by exploiting knowledge to control the physical system. On the other hand, if the last line dominates the value function, exploration is more important than exploitation – the algorithm controls the physical space to increase knowledge. To my knowledge, this is the first differential statement about reinforcement learning’s two objectives. Finally, note the role of the sampling rate λ: If λ is very low, exploration is useless over the discount horizon. Even after these approximations, solving Equation (17) for v remains nontrivial for two reasons: First, although the vector product notation is pleasingly compact, the mean and covariance functions are of course infinite-dimensional, and what looks like straightforward inner vector products are in fact integrals. 
For example, the average exploration bonus for the loss, writ large, reads ∂v(z) (18) −λLq Lq Σq v(z) = − λL(q) (s∗ )L(q) (s∗ ) ds∗ ds∗ . sτ i sτ j ∂Σ(s∗ , s∗ ) i j K i j (note that this object remains a function of the state sτ ). For general kernels k, these integrals may only be solved numerically. However, for at least one specific choice of kernel (square-exponentials) and parametric Ansatz, the required integrals can be solved in closed form. This analytic structure is so interesting, and the square-exponential kernel so widely used that the “numerical” part of the paper (Section 4) will restrict the choice of kernel to this class. The other problem, of course, is that Equation (17) is a nontrivial differential Equation. Section 4 presents one, initial attempt at a numerical solution that should not be mistaken for a definitive answer. Despite all this, Eq. (17) arguably constitutes a useful gain for Bayesian reinforcement learning: It replaces the intractable definition of the value in terms of future trajectories with a differential equation. This raises hope for new approaches to reinforcement learning, based on numerical analysis rather than sampling. Digression: Relaxing Some Assumptions This paper only applies to the specific problem class of Section 2. Any generalisations and extensions are future work, and I do not claim to solve them. But it is instructive to consider some easier extensions, and some harder ones: For example, it is intractable to simultaneously learn both g and f nonparametrically, if only the actual transitions are observed, because the beliefs over the two functions become infinitely dependent when conditioned on data. But if the belief on either g or f is parametric (e.g. a general linear model), a joint belief on g and f is tractable [see 15, §2.7], in fact straightforward. Both the quadratic control cost ∝ u Ru and the control-affine form (g(x, t)u) are relaxable assumptions – other parametric forms are possible, as long as they allow for analytic optimization of Eq. (14). On the question of learning the kernels for Gaussian process regression on q and f or g, it is clear that standard ways of inferring kernels [15, 18] can be used without complication, but that they are not covered by the notion of optimal learning as addressed here. 4 Numerically Solving the Hamilton-Jacobi-Bellman Equation Solving Equation (16) is principally a problem of numerical analysis, and a battery of numerical methods may be considered. This section reports on one specific Ansatz, a Galerkin-type projection analogous to the one used in [12]. For this we break with the generality of previous sections and assume that the kernels k are given by square exponentials k(a, b) = kSE (a, b; θ, S) = 1 θ2 exp(− 2 (a − b) S −1 (a − b)) with parameters θ, S. As discussed above, we approximate by setting Ω = 0. We find an approximate solution through a factorizing parametric Ansatz: Let the value of any point z ∈ H in the belief space be given through a set of parameters w and some nonlinear functionals φ, such that their contributions separate over phase space, mean, and covariance functions: v(z) = φe (ze ) we with φe , we ∈ RNe (19) e=x,Σq ,µq ,Σf ,µf 5 This projection is obviously restrictive, but it should be compared to the use of radial basis functions for function approximation, a similarly restrictive framework widely used in reinforcement learning. The functionals φ have to be chosen conducive to the form of Eq. (17). 
For square exponential kernels, one convenient choice is φa (zs ) = k(sz , sa ; θa , Sa ) s (20) [Σz (s∗ , s∗ ) − k(s∗ , s∗ )]k(s∗ , sb ; θb , Sb )k(s∗ , sb ; θb , Sb ) ds∗ ds∗ i j i j i j i j φb (zΣ ) = Σ and (21) K µz (s∗ )µz (s∗ )k(s∗ , sc , θc , Sc )k(s∗ , sc , θc , Sc ) ds∗ ds∗ i j i j i j φc (zµ ) = µ (22) K (the subtracted term in the first integral serves only numerical purposes). With this choice, the integrals of Equation (17) can be solved analytically (solutions left out due to space constraints). The approximate Ansatz turns Eq. (17) into an algebraic equation quadratic in wx , linear in all other we : 1 w Ψ(zx )wx − q(zx ) + 2 x Ξe (ze )we = 0 (23) e=x,µq ,Σq ,µf ,Σf using co-vectors Ξ and a matrix Ψ with elements Ξx (zs ) = γ −1 φa (zs ) − f (zx ) a s ΞΣ (zΣ ) = γ −1 φa (zΣ ) + λ a Σ a x φs (zs ) − a t φs (zs ) Lsτ (s∗ )Lsτ (s∗ ) i j K Ξµ (zµ ) a = γ −1 φa (zµ ) µ Ψ(z)k = [ k x φs (z)] − λ2 σsτ ¯2 2 ∂φΣ (zΣ ) ds∗ ds∗ ∂Σz (s∗ , s∗ ) i j i j Lsτ (s∗ )Lsτ (s∗ ) i j K g(zx )Rg(zx ) [ ∂ 2 φa (zµ ) µ ds∗ ds∗ ∂µz (s∗ )∂µz (s∗ ) i j i j (24) x φs (z)] Note that Ξµ and ΞΣ are both functions of the physical state, through sτ . It is through this functional dependency that the value of information is associated with the physical phase space-time. To solve for w, we simply choose a number of evaluation points z eval sufficient to constrain the resulting system of quadratic equations, and then find the least-squares solution wopt by function minimisation, using standard methods, such as Levenberg-Marquardt [19]. A disadvantage of this approach is that is has a number of degrees of freedom Θ, such as the kernel parameters, and the number and locations xa of the feature functionals. Our experiments (Section 5) suggest that it is nevertheless possible to get interesting results simply by choosing these parameters heuristically. 5 5.1 Experiments Illustrative Experiment on an Artificial Environment As a simple example system with a one-dimensional state space, f, q were sampled from the model described in Section 2, and g set to the unit function. The state space was tiled regularly, in a bounded region, with 231 square exponential (“radial”) basis functions (Equation 20), initially all with weight i wx = 0. For the information terms, only a single basis function was used for each term (i.e. one single φΣq , one single φµq , and equally for f , all with very large length scales S, covering the entire region of interest). As pointed out above, this does not imply a trivial structure for these terms, because of the functional dependency on Lsτ . Five times the number of parameters, i.e. Neval = 1175 evaluation points zeval were sampled, at each time step, uniformly over the same region. It is not intuitively clear whether each ze should have its own belief (i.e. whether the points must cover the belief space as well as the phase space), but anecdotal evidence from the experiments suggests that it suffices to use the current beliefs for all evaluation points. A more comprehensive evaluation of such aspects will be the subject of a future paper. The discount factor was set to γ = 50s, the sampling rate at λ = 2/s, the control cost at 10m2 /($s). Value and optimal control were evaluated at time steps of δt = 1/λ = 0.5s. Figure 1 shows the situation 50s after initialisation. The most noteworthy aspect is the nontrivial structure of exploration and exploitation terms. Despite the simplistic parameterisation of the corresponding functionals, their functional dependence on sτ induces a complex shape. 
[Figure 1: State after 50 time steps, plotted over phase space-time. Top left: µq (blue is good); the belief over f is not shown, but has similar structure. Top right: value estimate v at current belief; compare to the next two panels to note that the approximation is relatively coarse. Bottom left: exploration terms. Bottom right: exploitation terms. At its current state (black diamond), the system is in the process of switching from exploitation to exploration (the blue region in the bottom right panel is roughly cancelled by the red, forward cone in the bottom left one).] The system constantly balances exploration and exploitation, and the optimal balance depends nontrivially on location, time, and the actual value (as opposed to only uncertainty) of accumulated knowledge. This is an important insight that casts doubt on the usefulness of simple, local exploration boni, used in many reinforcement learning algorithms. Secondly, note that the system's trajectory does not necessarily follow what would be the optimal path under full information. The value estimate reflects this, by assigning low (good) value to regions behind the system's trajectory. This amounts to a sense of "remorse": If the learner would have known about these regions earlier, it would have strived to reach them. But this is not a sign of sub-optimality: Remember that the value is defined on the augmented space. The plots in Figure 1 are merely a slice through that space at some level set in the belief space. 5.2 Comparative Experiment – The Furuta Pendulum The cart-and-pole system is an under-actuated problem widely studied in reinforcement learning. For variation, this experiment uses a cylindrical version, the pendulum on the rotating arm [20]. The task is to swing up the pendulum from the lower resting point. The table in Figure 2 compares the average loss of a controller with access to the true f, g, q, but otherwise using Algorithm 1, to that of an ε-greedy TD(λ) learner with linear function approximation, Simpkins et al.'s [12] Kalman method and the Gaussian process learning controller (Fig. 2). The linear function approximation of TD(λ) used the same radial basis functions as the three other methods. None of these methods is free of assumptions: Note that the sampling frequency influences TD in nontrivial ways rarely studied (for example through the coarseness of the ε-greedy policy). The parameters were set to γ = 5s, λ = 50/s. Note that reinforcement learning experiments often quote total accumulated loss, which differs from the discounted task posed to the learner. Figure 2 reports actual discounted losses. The GP method clearly outperforms the other two learners, which barely explore. Interestingly, none of the tested methods, not even the informed controller, achieve a stable controlled balance, although the GP learner does swing up the pendulum. [Figure 2: The Furuta pendulum system: a pendulum of length ℓ2 is attached to a rotatable arm of length ℓ1; the control input is the torque applied to the arm. Right: cost to go achieved by different methods (lower is better; error measures are one standard deviation over five experiments). Cumulative loss: Full Information (baseline) 4.4 ± 0.3; TD(λ) 6.401 ± 0.001; Kalman filter Optimal Learner 6.408 ± 0.001; Gaussian process optimal learner 4.6 ± 1.4.]
This is due to the random, non-optimal location of basis functions, which means resolution is not necessarily available where it is needed (in regions of high curvature of the value function), and demonstrates a need for better solution methods for Eq. (17). There is of course a large number of other algorithms and methods to potentially compare to, and these results are anything but exhaustive. They should not be misunderstood as a critique of any other method. But they highlight the need for units of measure on every quantity, and show how hard optimal exploration and exploitation truly is. Note that, for time-varying or discounted problems, there is no "conservative" option that could be adopted in place of the Bayesian answer. 6 Conclusion Gaussian process priors provide a nontrivial class of reinforcement learning problems for which optimal reinforcement learning reduces to solving differential equations. Of course, this fact alone does not make the problem easier, as solving nonlinear differential equations is in general intractable. However, the ubiquity of differential descriptions in other fields raises hope that this insight opens new approaches to reinforcement learning. For intuition on how such solutions might work, one specific approximation was presented, using functionals to reduce the problem to finite least-squares parameter estimation. The critical reader will have noted how central the prior is for the arguments in Section 3: The dynamics of the learning process are predictions of future data, thus inherently determined exclusively by prior assumptions. One may find this unappealing, but there is no escape from it. Minimizing future loss requires predicting future loss, and predictions are always in danger of falling victim to incorrect assumptions. A finite initial identification phase may mitigate this problem by replacing prior with posterior uncertainty – but even then, predictions and decisions will depend on the model. The results of this paper raise new questions, theoretical and applied. The most pressing questions concern better solution methods for Eq. (14), in particular better means for taking the expectation over the uncertain dynamics to more than first order. There are also obvious probabilistic issues: Are there other classes of priors that allow similar treatments? (Note some conceptual similarities between this work and the BEETLE algorithm [4]). To what extent can approximate inference methods – widely studied in combination with Gaussian process regression – be used to broaden the utility of these results? Acknowledgments The author wishes to express his gratitude to Carl Rasmussen, Jan Peters, Zoubin Ghahramani, Peter Dayan, and an anonymous reviewer, whose thoughtful comments uncovered several errors and crucially improved this paper. References [1] R.S. Sutton and A.G. Barto. Reinforcement Learning. MIT Press, 1998. [2] W.R. Thompson. On the likelihood that one unknown probability exceeds another in view of two samples. Biometrika, 25:275–294, 1933. [3] M.O.G. Duff. Optimal Learning: Computational procedures for Bayes-adaptive Markov decision processes. PhD thesis, U of Massachusetts, Amherst, 2002. [4] P. Poupart, N. Vlassis, J. Hoey, and K. Regan. An analytic solution to discrete Bayesian reinforcement learning. In Proceedings of the 23rd International Conference on Machine Learning, pages 697–704, 2006. [5] Richard Dearden, Nir Friedman, and David Andre. Model based Bayesian exploration.
In Uncertainty in Artificial Intelligence, pages 150–159, 1999. [6] Malcolm Strens. A Bayesian framework for reinforcement learning. In International Conference on Machine Learning, pages 943–950, 2000. [7] T. Wang, D. Lizotte, M. Bowling, and D. Schuurmans. Bayesian sparse sampling for on-line reward optimization. In International Conference on Machine Learning, pages 956–963, 2005. [8] J. Asmuth, L. Li, M.L. Littman, A. Nouri, and D. Wingate. A Bayesian sampling approach to exploration in reinforcement learning. In Uncertainty in Artificial Intelligence, 2009. [9] J.Z. Kolter and A.Y. Ng. Near-Bayesian exploration in polynomial time. In Proceedings of the 26th International Conference on Machine Learning. Morgan Kaufmann, 2009. [10] E. Todorov. Linearly-solvable Markov decision problems. Advances in Neural Information Processing Systems, 19, 2007. [11] H. J. Kappen. An introduction to stochastic control theory, path integrals and reinforcement learning. In 9th Granada seminar on Computational Physics: Computational and Mathematical Modeling of Cooperative Behavior in Neural Systems., pages 149–181, 2007. [12] A. Simpkins, R. De Callafon, and E. Todorov. Optimal trade-off between exploration and exploitation. In American Control Conference, 2008, pages 33–38, 2008. [13] I. Fantoni and L. Rogelio. Non-linear Control for Underactuated Mechanical Systems. Springer, 1973. [14] A.A. Feldbaum. Dual control theory. Automation and Remote Control, 21(9):874–880, April 1961. [15] C.E. Rasmussen and C.K.I. Williams. Gaussian Processes for Machine Learning. MIT Press, 2006. [16] N. Wiener. Differential space. Journal of Mathematical Physics, 2:131–174, 1923. [17] T. Kailath. An innovations approach to least-squares estimation — part I: Linear filtering in additive white noise. IEEE Transactions on Automatic Control, 13(6):646–655, 1968. [18] I. Murray and R.P. Adams. Slice sampling covariance hyperparameters of latent Gaussian models. arXiv:1006.0868, 2010. [19] D. W. Marquardt. An algorithm for least-squares estimation of nonlinear parameters. Journal of the Society for Industrial and Applied Mathematics, 11(2):431–441, 1963. [20] K. Furuta, M. Yamakita, and S. Kobayashi. Swing-up control of inverted pendulum using pseudo-state feedback. Journal of Systems and Control Engineering, 206(6):263–269, 1992. 9
6 0.2756063 300 nips-2011-Variance Reduction in Monte-Carlo Tree Search
7 0.27175483 36 nips-2011-Analysis and Improvement of Policy Gradient Estimation
8 0.25218436 10 nips-2011-A Non-Parametric Approach to Dynamic Programming
9 0.18282719 16 nips-2011-A reinterpretation of the policy oscillation phenomenon in approximate policy iteration
10 0.17937841 3 nips-2011-A Collaborative Mechanism for Crowdsourcing Prediction Problems
11 0.17929669 34 nips-2011-An Unsupervised Decontamination Procedure For Improving The Reliability Of Human Judgments
12 0.17590579 245 nips-2011-Selecting the State-Representation in Reinforcement Learning
13 0.16643596 237 nips-2011-Reinforcement Learning using Kernel-Based Stochastic Factorization
14 0.16561981 65 nips-2011-Convergent Fitted Value Iteration with Linear Function Approximation
15 0.16510676 18 nips-2011-Action-Gap Phenomenon in Reinforcement Learning
16 0.16253428 38 nips-2011-Anatomically Constrained Decoding of Finger Flexion from Electrocorticographic Signals
17 0.15842488 264 nips-2011-Sparse Recovery with Brownian Sensing
18 0.15516585 210 nips-2011-PAC-Bayesian Analysis of Contextual Bandits
19 0.15327272 228 nips-2011-Quasi-Newton Methods for Markov Chain Monte Carlo
20 0.15130581 292 nips-2011-Two is better than one: distinct roles for familiarity and recollection in retrieving palimpsest memories
topicId topicWeight
[(0, 0.023), (4, 0.03), (11, 0.018), (20, 0.024), (26, 0.031), (31, 0.241), (33, 0.023), (41, 0.144), (43, 0.05), (45, 0.101), (57, 0.05), (74, 0.047), (83, 0.031), (99, 0.08)]
simIndex simValue paperId paperTitle
1 0.87445307 137 nips-2011-Iterative Learning for Reliable Crowdsourcing Systems
Author: David R. Karger, Sewoong Oh, Devavrat Shah
Abstract: Crowdsourcing systems, in which tasks are electronically distributed to numerous "information piece-workers", have emerged as an effective paradigm for human-powered solving of large scale problems in domains such as image classification, data entry, optical character recognition, recommendation, and proofreading. Because these low-paid workers can be unreliable, nearly all crowdsourcers must devise schemes to increase confidence in their answers, typically by assigning each task multiple times and combining the answers in some way such as majority voting. In this paper, we consider a general model of such crowdsourcing tasks, and pose the problem of minimizing the total price (i.e., number of task assignments) that must be paid to achieve a target overall reliability. We give a new algorithm for deciding which tasks to assign to which workers and for inferring correct answers from the workers' answers. We show that our algorithm significantly outperforms majority voting and, in fact, is asymptotically optimal through comparison to an oracle that knows the reliability of every worker. 1
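To ground the baseline mentioned in the abstract: with binary tasks, majority voting just sums each task's ±1 answers, while reliability-aware schemes re-weight workers. The sketch below shows both aggregation styles on an assignment matrix (rows are tasks, columns are workers, zeros mark unassigned pairs); the iterative re-weighting is an illustrative heuristic, not the message-passing algorithm analysed in the paper.

```python
import numpy as np

def majority_vote(answers):
    """answers[i, j] in {-1, 0, +1}; 0 means worker j was not assigned task i."""
    return np.sign(answers.sum(axis=1))

def reliability_weighted_vote(answers, n_iters=10):
    """Alternate between task estimates and per-worker agreement weights.
    Illustrative only; not the algorithm with optimality guarantees in the paper."""
    tasks = np.sign(answers.sum(axis=1))  # start from the majority vote
    for _ in range(n_iters):
        assigned = (answers != 0).sum(axis=0)
        agree = (answers * tasks[:, None] > 0).sum(axis=0)         # agreements per worker
        weights = agree / np.maximum(assigned, 1)                  # crude reliability score
        tasks = np.sign((answers * weights[None, :]).sum(axis=1))  # re-aggregate
    return tasks
```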
same-paper 2 0.8726247 278 nips-2011-TD gamma: Re-evaluating Complex Backups in Temporal Difference Learning
Author: George Konidaris, Scott Niekum, Philip S. Thomas
Abstract: We show that the λ-return target used in the TD(λ) family of algorithms is the maximum likelihood estimator for a specific model of how the variance of an n-step return estimate increases with n. We introduce the γ-return estimator, an alternative target based on a more accurate model of variance, which defines the TDγ family of complex-backup temporal difference learning algorithms. We derive TDγ , the γ-return equivalent of the original TD(λ) algorithm, which eliminates the λ parameter but can only perform updates at the end of an episode and requires time and space proportional to the episode length. We then derive a second algorithm, TDγ (C), with a capacity parameter C. TDγ (C) requires C times more time and memory than TD(λ) and is incremental and online. We show that TDγ outperforms TD(λ) for any setting of λ on 4 out of 5 benchmark domains, and that TDγ (C) performs as well as or better than TDγ for intermediate settings of C. 1
3 0.86598217 240 nips-2011-Robust Multi-Class Gaussian Process Classification
Author: Daniel Hernández-lobato, Jose M. Hernández-lobato, Pierre Dupont
Abstract: Multi-class Gaussian Process Classifiers (MGPCs) are often affected by overfitting problems when labeling errors occur far from the decision boundaries. To prevent this, we investigate a robust MGPC (RMGPC) which considers labeling errors independently of their distance to the decision boundaries. Expectation propagation is used for approximate inference. Experiments with several datasets in which noise is injected in the labels illustrate the benefits of RMGPC. This method performs better than other Gaussian process alternatives based on considering latent Gaussian noise or heavy-tailed processes. When no noise is injected in the labels, RMGPC still performs equal or better than the other methods. Finally, we show how RMGPC can be used for successfully identifying data instances which are difficult to classify correctly in practice. 1
4 0.85982531 241 nips-2011-Scalable Training of Mixture Models via Coresets
Author: Dan Feldman, Matthew Faulkner, Andreas Krause
Abstract: How can we train a statistical mixture model on a massive data set? In this paper, we show how to construct coresets for mixtures of Gaussians and natural generalizations. A coreset is a weighted subset of the data, which guarantees that models fitting the coreset will also provide a good fit for the original data set. We show that, perhaps surprisingly, Gaussian mixtures admit coresets of size independent of the size of the data set. More precisely, we prove that a weighted set of O(dk³/ε²) data points suffices for computing a (1 + ε)-approximation for the optimal model on the original n data points. Moreover, such coresets can be efficiently constructed in a map-reduce style computation, as well as in a streaming setting. Our results rely on a novel reduction of statistical estimation to problems in computational geometry, as well as new complexity results about mixtures of Gaussians. We empirically evaluate our algorithms on several real data sets, including a density estimation problem in the context of earthquake detection using accelerometers in mobile phones. 1
5 0.85941744 23 nips-2011-Active dendrites: adaptation to spike-based communication
Author: Balazs B. Ujfalussy, Máté Lengyel
Abstract: Computational analyses of dendritic computations often assume stationary inputs to neurons, ignoring the pulsatile nature of spike-based communication between neurons and the moment-to-moment fluctuations caused by such spiking inputs. Conversely, circuit computations with spiking neurons are usually formalized without regard to the rich nonlinear nature of dendritic processing. Here we address the computational challenge faced by neurons that compute and represent analogue quantities but communicate with digital spikes, and show that reliable computation of even purely linear functions of inputs can require the interplay of strongly nonlinear subunits within the postsynaptic dendritic tree. Our theory predicts a matching of dendritic nonlinearities and synaptic weight distributions to the joint statistics of presynaptic inputs. This approach suggests normative roles for some puzzling forms of nonlinear dendritic dynamics and plasticity. 1
6 0.85752958 88 nips-2011-Environmental statistics and the trade-off between model-based and TD learning in humans
7 0.85728759 225 nips-2011-Probabilistic amplitude and frequency demodulation
8 0.84860146 94 nips-2011-Facial Expression Transfer with Input-Output Temporal Restricted Boltzmann Machines
9 0.84726566 249 nips-2011-Sequence learning with hidden units in spiking neural networks
10 0.83896863 229 nips-2011-Query-Aware MCMC
11 0.8251105 243 nips-2011-Select and Sample - A Model of Efficient Neural Inference and Learning
12 0.82453126 75 nips-2011-Dynamical segmentation of single trials from population neural data
13 0.82297289 221 nips-2011-Priors over Recurrent Continuous Time Processes
14 0.81415504 300 nips-2011-Variance Reduction in Monte-Carlo Tree Search
15 0.81052506 292 nips-2011-Two is better than one: distinct roles for familiarity and recollection in retrieving palimpsest memories
16 0.81010962 131 nips-2011-Inference in continuous-time change-point models
17 0.80532205 301 nips-2011-Variational Gaussian Process Dynamical Systems
18 0.80492765 66 nips-2011-Crowdclustering
19 0.80441296 206 nips-2011-Optimal Reinforcement Learning for Gaussian Systems
20 0.80316085 215 nips-2011-Policy Gradient Coagent Networks