
237 nips-2011-Reinforcement Learning using Kernel-Based Stochastic Factorization


Source: pdf

Author: Andre S. Barreto, Doina Precup, Joelle Pineau

Abstract: Kernel-based reinforcement-learning (KBRL) is a method for learning a decision policy from a set of sample transitions which stands out for its strong theoretical guarantees. However, the size of the approximator grows with the number of transitions, which makes the approach impractical for large problems. In this paper we introduce a novel algorithm to improve the scalability of KBRL. We resort to a special decomposition of a transition matrix, called stochastic factorization, to fix the size of the approximator while at the same time incorporating all the information contained in the data. The resulting algorithm, kernel-based stochastic factorization (KBSF), is much faster but still converges to a unique solution. We derive a theoretical upper bound for the distance between the value functions computed by KBRL and KBSF. The effectiveness of our method is illustrated with computational experiments on four reinforcement-learning problems, including a difficult task in which the goal is to learn a neurostimulation policy to suppress the occurrence of seizures in epileptic rat brains. We empirically demonstrate that the proposed approach is able to compress the information contained in KBRL’s model. Also, on the tasks studied, KBSF outperforms two of the most prominent reinforcement-learning algorithms, namely least-squares policy iteration and fitted Q-iteration. 1

Reference: text


Summary: the most important sentences generated by the tfidf model

sentIndex sentText sentNum sentScore

1 Kernel-based reinforcement-learning (KBRL) is a method for learning a decision policy from a set of sample transitions which stands out for its strong theoretical guarantees. [sent-9, score-0.225]

2 We resort to a special decomposition of a transition matrix, called stochastic factorization, to fix the size of the approximator while at the same time incorporating all the information contained in the data. [sent-12, score-0.127]

3 The resulting algorithm, kernel-based stochastic factorization (KBSF), is much faster but still converges to a unique solution. [sent-13, score-0.156]

4 The effectiveness of our method is illustrated with computational experiments on four reinforcement-learning problems, including a difficult task in which the goal is to learn a neurostimulation policy to suppress the occurrence of seizures in epileptic rat brains. [sent-15, score-0.332]

5 Also, on the tasks studied, KBSF outperforms two of the most prominent reinforcement-learning algorithms, namely least-squares policy iteration and fitted Q-iteration. [sent-17, score-0.109]

6 1 Introduction: Recent years have witnessed the emergence of several reinforcement-learning techniques that make it possible to learn a decision policy from a batch of sample transitions. [sent-18, score-0.151]

7 Among them, Ormoneit and Sen’s kernel-based reinforcement learning (KBRL) stands out for two reasons [1]. [sent-19, score-0.067]

8 Second, KBRL is consistent in the statistical sense, meaning that adding more data always improves the quality of the resulting policy and eventually leads to optimal performance. [sent-21, score-0.084]

9 Despite its nice theoretical properties, KBRL has not been widely adopted by the reinforcement learning community. [sent-22, score-0.083]

10 As discussed by Ormoneit and Glynn [2], KBRL can be seen as the derivation of a finite Markov decision process whose number of states coincides with the number of sample transitions collected to perform the approximation. [sent-24, score-0.2]

11 This gives rise to a dilemma: on the one hand one wants as much data as possible to describe the dynamics of the decision problem, but on the other hand the number of transitions should be small enough to allow for the numerical solution of the resulting model. [sent-25, score-0.127]

12 We rely on a special decomposition of a transition matrix, called stochastic factorization, to rewrite it as the product of two stochastic matrices of smaller dimension. [sent-27, score-0.162]

13 As we will see, the stochastic factorization possesses a very useful property: if we swap its factors, we obtain another transition matrix which retains some fundamental characteristics of the original one. [sent-28, score-0.184]

14 The resulting algorithm, kernel-based stochastic factorization (KBSF), is much faster than KBRL but still converges to a unique solution. [sent-30, score-0.156]

15 We empirically show that the proposed approach is able to compress the information contained in KBRL’s model, outperforming both the least-squares policy iteration algorithm and fitted Q-iteration on the tasks studied [3, 4]. [sent-33, score-0.149]

16 A finite MDP is defined by a tuple M ≡ (S, A, P^a, r^a, γ) [5]. [sent-35, score-0.164]

17 The matrix P^a ∈ R^{|S|×|S|} gives the transition probabilities associated with action a ∈ A and the vector r^a ∈ R^{|S|} stores the corresponding expected rewards. [sent-37, score-0.236]

18 Let the operator Γ : R^{|S|×|A|} → R^{|S|} be given by ΓQ = v, with v_i = max_j q_{ij}, and define Δ : R^{|S|} → R^{|S|×|A|} as Δv = Q, where the a-th column of Q is given by q^a = r^a + γP^a v. [sent-42, score-0.164]

19 A fundamental result in dynamic programming states that, starting from v^(0) = 0, the expression v^(t) = ΓΔv^(t−1) gives the optimal t-step value function, and as t → ∞ the vector v^(t) approaches v*, from which any optimal decision policy π* can be derived [5]. [sent-43, score-0.215]
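
As a concrete illustration of the Γ and Δ operators above, the sketch below runs value iteration v^(t) = ΓΔv^(t−1) on a small finite MDP; the transition tensor, rewards, and discount are illustrative placeholders, not data from the paper.

```python
import numpy as np

def value_iteration(P, r, gamma, tol=1e-8):
    """Iterate v <- Gamma(Delta(v)) until convergence.

    P: array of shape (|A|, |S|, |S|), one stochastic matrix P^a per action.
    r: array of shape (|A|, |S|), one expected-reward vector r^a per action.
    """
    n_actions, n_states, _ = P.shape
    v = np.zeros(n_states)                                   # v^(0) = 0
    while True:
        # Delta: build Q whose a-th column is q^a = r^a + gamma * P^a v.
        Q = np.stack([r[a] + gamma * P[a] @ v for a in range(n_actions)], axis=1)
        v_new = Q.max(axis=1)                                # Gamma: v_i = max_a q_{ia}
        if np.max(np.abs(v_new - v)) < tol:
            return v_new, Q.argmax(axis=1)                   # value and greedy policy
        v = v_new

# Toy example with made-up numbers: 3 states, 2 actions.
P = np.array([[[0.9, 0.1, 0.0], [0.1, 0.8, 0.1], [0.0, 0.1, 0.9]],
              [[0.5, 0.5, 0.0], [0.0, 0.5, 0.5], [0.5, 0.0, 0.5]]])
r = np.array([[0.0, 0.0, 1.0], [0.1, 0.1, 0.5]])
v_star, greedy = value_iteration(P, r, gamma=0.95)
```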

20 Consider now an MDP with continuous state space S ⊂ R^d and let S^a = {(s_k^a, r_k^a, ŝ_k^a) | k = 1, 2, ..., n_a}. [sent-44, score-0.383]

21 Here S^a is a set of sample transitions associated with action a ∈ A, where s_k^a, ŝ_k^a ∈ S and r_k^a ∈ R. [sent-47, score-0.678]

22 Since only transitions ending in the states ŝ_k^a have a non-zero probability of occurrence, one can solve a finite MDP M̂ whose space is composed solely of these n = Σ_a n_a states [2, 6]. [sent-49, score-0.433]

23 After the optimal value function of M̂ has been found, the value of any state s_i ∈ S can be computed as Q(s_i, a) = Σ_{k=1}^{n_a} κ^a(s_i, s_k^a) [r_k^a + γ V̂*(ŝ_k^a)]. [sent-50, score-0.44]

24 Ormoneit and Sen [1] proved that, if n_a → ∞ for all a ∈ A and the widths of the kernels κ^a shrink at an “admissible” rate, the probability of choosing a suboptimal action based on Q(s_i, a) converges to zero. [sent-51, score-0.113]
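
A minimal sketch of the kernel-averaging step described above: given the sample transitions for one action and a value estimate at the sampled end-states, the Q-value of an arbitrary query state is a kernel-weighted average. The normalized Gaussian kernel and its width are illustrative assumptions of this sketch, not the paper's exact choices.

```python
import numpy as np

def normalized_gaussian_weights(x, centers, width):
    """Weights kappa^a(x, s_k^a): a Gaussian in the distance to each sampled start-state,
    normalized so the weights sum to 1 (i.e., they form a stochastic vector)."""
    sq_dist = np.sum((centers - x) ** 2, axis=1)
    w = np.exp(-sq_dist / (2.0 * width ** 2))
    return w / w.sum()

def kbrl_q_value(x, starts_a, rewards_a, values_at_ends_a, width, gamma):
    """Q(x, a) = sum_k kappa^a(x, s_k^a) * (r_k^a + gamma * V*(shat_k^a))."""
    w = normalized_gaussian_weights(x, starts_a, width)
    return np.dot(w, rewards_a + gamma * values_at_ends_a)
```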

25 3 Stochastic factorization: A stochastic matrix has only non-negative elements and each of its rows sums to 1. [sent-55, score-0.14]

26 That said, we can introduce the concept that will serve as a cornerstone for the rest of the paper: Definition 1. Given a stochastic matrix P ∈ R^{n×p}, the relation P = DK is called a stochastic factorization of P if D ∈ R^{n×m} and K ∈ R^{m×p} are also stochastic matrices. [sent-56, score-0.25]

27 For example, Cohen and Rothblum [7] briefly discuss it as a special case of non-negative matrix factorization, while Cutler and Breiman [8] focus on slightly modified versions of the stochastic factorization for statistical data analysis. [sent-59, score-0.14]

28 However, in this paper we will focus on a useful property of this type of factorization that seems to have passed unnoticed thus far. [sent-60, score-0.068]

29 We call it the “stochastic-factorization trick”: given a stochastic factorization of a square matrix, P = DK, swapping the factors of the factorization yields another transition matrix P̄ = KD, potentially much smaller than the original, which retains the basic topology and properties of P. [sent-61, score-0.21]
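
The trick can be checked numerically: with random stochastic factors D and K, both the product DK and the swapped product KD have non-negative entries and rows summing to one, and KD can be much smaller than DK. The dimensions below are arbitrary; this is only a sanity check of the definition, not part of the paper's experiments.

```python
import numpy as np

rng = np.random.default_rng(0)

def random_stochastic(rows, cols):
    """A random matrix with non-negative entries whose rows sum to 1."""
    M = rng.random((rows, cols))
    return M / M.sum(axis=1, keepdims=True)

n, m = 1000, 20                       # m << n
D = random_stochastic(n, m)           # D in R^{n x m}
K = random_stochastic(m, n)           # K in R^{m x n}

P = D @ K                             # n x n transition matrix
P_swapped = K @ D                     # m x m transition matrix obtained by the trick

assert np.allclose(P.sum(axis=1), 1.0)
assert np.allclose(P_swapped.sum(axis=1), 1.0)
```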

30 In this paper we are interested in exploiting the stochastic-factorization trick to reduce the computational cost of dynamic programming. [sent-67, score-0.09]

31 The idea is straightforward: given stochastic factorizations of the transition matrices Pa , we can apply our trick to obtain a reduced MDP that will be solved in place of the original one. [sent-68, score-0.22]

32 In the most general scenario, we would have one independent factorization P^a = D^a K^a for each action a ∈ A. [sent-69, score-0.094]

33 When all factorizations share the same matrix D, it is easy to derive theoretical guarantees regarding the quality of the solution of the reduced MDP: Proposition 1. Let M ≡ (S, A, P^a, r^a, γ) be a finite MDP with |S| = n and 0 ≤ γ < 1. [sent-71, score-0.253]

34 Let DK^a = P^a be |A| stochastic factorizations of order m and let r̄^a be vectors in R^m such that D r̄^a = r^a for all a ∈ A. [sent-72, score-0.441]

35 Define the MDP M̄ ≡ (S̄, A, P̄^a, r̄^a, γ), with |S̄| = m and P̄^a = K^a D. [sent-73, score-0.164]

36 Since r^a = D r̄^a and D P̄^a = D K^a D = P^a D for all a ∈ A, the stochastic matrix D satisfies Sorg and Singh’s definition of a soft homomorphism between M and M̄ (see equations (25)–(28) in [10]). [sent-76, score-0.277]
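
The two identities used in this argument are easy to verify numerically when the factorization is exact; the sketch below builds P^a = DK^a and r^a = D r̄^a by construction and checks that D P̄^a = P^a D. All dimensions and values are arbitrary, chosen only to exercise the algebra.

```python
import numpy as np

rng = np.random.default_rng(1)

def random_stochastic(rows, cols):
    M = rng.random((rows, cols))
    return M / M.sum(axis=1, keepdims=True)

n, m = 200, 10
D = random_stochastic(n, m)       # shared factor D
K = random_stochastic(m, n)       # K^a for one action
r_bar = rng.random(m)             # reduced reward vector rbar^a

P = D @ K                         # P^a = D K^a   (exact factorization by construction)
P_bar = K @ D                     # Pbar^a = K^a D
r = D @ r_bar                     # r^a = D rbar^a (the reward condition of Proposition 1)

# The soft-homomorphism condition D Pbar^a = D K^a D = P^a D:
assert np.allclose(D @ P_bar, P @ D)
```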

37 Proposition 1 elucidates the basic mechanism through which one can exploit the stochastic-factorization trick to reduce the number of states in an MDP. [sent-79, score-0.14]

38 4 Kernel-based stochastic factorization: In Section 2 we presented KBRL, an approximation scheme for reinforcement learning whose main drawback is its high computational complexity. [sent-83, score-0.19]

39 In Section 3 we discussed how the stochastic-factorization trick can in principle be useful to reduce an MDP, as long as one circumvents the computational burden imposed by the calculation of the matrices involved in the process. [sent-84, score-0.104]

40 We now show how to leverage these two components to produce an algorithm called kernel-based stochastic factorization (KBSF) that overcomes these computational limitations. [sent-85, score-0.123]

41 As outlined in Section 2, KBRL defines the probability of a transition from state ŝ_i^b to state ŝ_k^a via kernel averaging, formally denoted κ^a(ŝ_i^b, s_k^a), where a, b ∈ A. [sent-86, score-0.454]

42 So for each action a ∈ A, the state ŝ_i^b has an associated stochastic vector p̂_i^a ∈ R^{1×n} whose non-zero entries correspond to the function κ^a(ŝ_i^b, ·) evaluated at s_k^a, k = 1, 2, ..., n_a. [sent-87, score-0.536]

43 Recall that we are dealing with a continuous state space, so it is possible to compute an analogous vector for any s_i ∈ S. [sent-91, score-0.188]

44 Therefore, we can link each state of the original MDP with |A| n-dimensional stochastic vectors. [sent-92, score-0.084]

45 The core strategy of KBSF is to find a set of m representative states associated with vectors k_i^a ∈ R^{1×n} whose convex combinations can approximate the rows of the corresponding P̂^a. [sent-93, score-0.237]

46 KBRL’s matrices P̂^a have a very specific structure, since only transitions ending in states ŝ_i^a associated with action a have a non-zero probability of occurrence. [sent-94, score-0.373]

47 Assuming that the matrices K^a have the same structure as P̂^a, when computing P̄^a = K^a D we only have to look at the submatrices of K^a and D corresponding to the n_a non-zero columns of K^a. [sent-96, score-0.073]

48 KBSF computes matrices Ḋ^a and K̇^a with elements ḋ_{ij}^a = κ̄(ŝ_i^a, s̄_j) and k̇_{ij}^a = κ^a(s̄_i, s_j^a), where κ̄ is also a kernel. [sent-102, score-0.309]

49 Depending on how the states s̄_i and the kernels κ̄ and κ^a are defined, we have DK^a ≈ P̂^a for all a ∈ A. [sent-104, score-0.15]

50 The important point here is that the matrices P̂^a = DK^a are never actually computed; instead we solve an MDP with m states whose dynamics are given by P̄^a = K^a D = K̇^a Ḋ^a. [sent-105, score-0.082]

51 Algorithm 1 (KBSF). Input: S^a for all a ∈ A, and m. Select a set of representative states {s̄_1, s̄_2, ..., s̄_m}. [sent-107, score-0.102]

52 For each a ∈ A: compute matrix Ḋ^a with ḋ_{ij}^a = κ̄(ŝ_i^a, s̄_j); compute matrix K̇^a with k̇_{ij}^a = κ^a(s̄_i, s_j^a); compute vector r̄^a with r̄_i^a = Σ_j k̇_{ij}^a r_j^a. Then solve M̄ ≡ (S̄, A, P̄^a, r̄^a, γ), with P̄^a = K̇^a Ḋ^a. [sent-110, score-0.89]

53 Return ṽ = ΓDQ̄*, where D^⊺ = [Ḋ^{a_1⊺} Ḋ^{a_2⊺} ⋯ Ḋ^{a_|A|⊺}]. Observe that we did not describe how to define the representative states s̄_i. [sent-113, score-0.174]
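
The sketch below mirrors the structure of Algorithm 1: it computes the Ḋ^a and K̇^a blocks with Gaussian kernels, solves the reduced m-state MDP by value iteration, and maps the solution back through ΓDQ̄*. The kernel choices, bandwidths, and solver are assumptions made for illustration; this is not the authors' implementation.

```python
import numpy as np

def row_normalized_gaussian(A, B, width):
    """Row-stochastic kernel matrix: entry (i, j) is a Gaussian in ||A_i - B_j||, normalized over j."""
    sq_dist = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    W = np.exp(-sq_dist / (2.0 * width ** 2))
    return W / W.sum(axis=1, keepdims=True)

def kbsf(samples, rep_states, gamma, width_bar, width_a, n_iter=2000):
    """samples: dict mapping action a -> (S_start, R, S_end) arrays of its n_a transitions.
    rep_states: (m, d) array holding the representative states sbar_1..sbar_m."""
    actions = sorted(samples)
    m = rep_states.shape[0]
    P_bar, r_bar, D_blocks = {}, {}, []
    for a in actions:
        S_start, R, S_end = samples[a]
        D_dot = row_normalized_gaussian(S_end, rep_states, width_bar)   # d_ij = kbar(shat_i^a, sbar_j)
        K_dot = row_normalized_gaussian(rep_states, S_start, width_a)   # k_ij = kappa^a(sbar_i, s_j^a)
        P_bar[a] = K_dot @ D_dot                                        # m x m reduced dynamics
        r_bar[a] = K_dot @ R                                            # reduced rewards
        D_blocks.append(D_dot)
    # Solve the reduced MDP Mbar by value iteration.
    v = np.zeros(m)
    for _ in range(n_iter):
        Q = np.stack([r_bar[a] + gamma * P_bar[a] @ v for a in actions], axis=1)
        v = Q.max(axis=1)
    # Map back to the n sampled states: v_tilde = Gamma(D Q*), with D the stacked D-dot blocks.
    D = np.vstack(D_blocks)
    return (D @ Q).max(axis=1), Q
```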

54 Ideally, these states would be linked to vectors k_i^a forming a convex hull which contains the rows of P̂^a. [sent-114, score-0.194]

55 In practice, we can often resort to simple methods to pick states s̄_i in strategic regions of S. [sent-115, score-0.131]

56 Also, the reader might have noticed that the stochastic factorizations computed by KBSF are in fact approximations of the matrices P̂^a. [sent-117, score-0.136]

57 The following proposition extends the result of the previous section to the approximate case: Proposition 2. Let M̂ ≡ (Ŝ, A, P̂^a, r̂^a, γ) be the finite MDP derived by KBRL and let D, K^a, and r̄^a be the matrices and vectors computed by KBSF. [sent-118, score-0.371]

58 Then ‖v̂* − ṽ‖_∞ is bounded as in (2): by (1/(1−γ)) max_a ‖r̂^a − D r̄^a‖_∞ plus a term of order γ/(1−γ)² involving both max_i (1 − max_j d_{ij}) and Ĉ max_a ‖P̂^a − DK^a‖_∞, where Ĉ = max_{a,i} r̂_i^a − min_{a,i} r̂_i^a. [sent-119, score-0.256]

59 The proof applies a result of Whitt [12], with all mappings between M̂ and M̄ taken to be identities, to obtain ‖v̂* − v*‖_∞ ≤ (1/(1−γ)) max_a ‖r̂^a − D r̄^a‖_∞ + (Ĉγ/(2(1−γ))) max_a ‖P̂^a − DK^a‖_∞ (4). Resorting to Proposition 1, we can substitute (1) and (4) in (3) to obtain (2). [sent-125, score-0.164]

60 Notice that when D is deterministic—that is, when all its non-zero elements are 1—expression (2) reduces to Whitt’s classical result regarding state aggregation in dynamic programming [12]. [sent-127, score-0.062]

61 On the other hand, when the stochastic factorizations are exact, we recover (1), which is a computable version of Sorg and Singh’s bound for soft homomorphisms [10]. [sent-128, score-0.128]

62 This also makes sense, since in this case the stochastic-factorization trick gives rise to an exact homomorphism [13]. [sent-130, score-0.081]

63 All problems considered in this section have a continuous state space and a finite number of actions and were modeled as discounted tasks with γ = 0. [sent-138, score-0.069]

64 The algorithms’ results correspond to the performance of the greedy decision policy derived from the final value function computed. [sent-140, score-0.137]

65 In all cases, the decision policies were evaluated on a set of test states from which the tasks cannot be easily solved. [sent-141, score-0.21]

66 The experiments were carried out in the same way for all tasks: first, we collected a set of n sample transitions (s_k^a, r_k^a, ŝ_k^a) using a uniformly-random exploration policy. [sent-143, score-0.427]

67 Then the states ŝ_k^a were grouped by the k-means algorithm into m clusters and a Gaussian kernel κ̄ was positioned at the center of each resulting cluster [14]. [sent-144, score-0.234]
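
A sketch of this representative-state selection, under the assumption that scikit-learn's k-means is an acceptable stand-in for the clustering step; the bandwidth heuristic at the end is also an assumption of this sketch, not a prescription from the paper.

```python
import numpy as np
from scipy.spatial.distance import pdist
from sklearn.cluster import KMeans

def select_representative_states(sampled_states, m, seed=0):
    """Cluster the sampled states into m groups; the cluster centers become sbar_1..sbar_m,
    and a Gaussian kernel kbar is positioned at each center."""
    km = KMeans(n_clusters=m, n_init=10, random_state=seed).fit(sampled_states)
    centers = km.cluster_centers_
    # Heuristic bandwidth for kbar: a fraction of the mean distance between centers.
    width = pdist(centers).mean() / np.sqrt(m)
    return centers, width
```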

68 In Figures 1a and 1b we observe the effect of fixing the number of transitions n and varying the number of representative states m. [sent-155, score-0.176]

69 We first contrast our method with Lagoudakis and Parr’s least-squares policy iteration algorithm (LSPI) [3]. [sent-166, score-0.084]

70 Figure 2 shows the results of LSPI and KBSF on the single and double pole-balancing tasks [16]. [sent-169, score-0.065]

71 The algorithms were evaluated on two sets of states distributed over the region of the state space surrounding the “puddles”. [sent-182, score-0.104]

72 This evaluation is more demanding than the more commonly-used variants in which the decision policies are evaluated on a single state close to the origin. [sent-193, score-0.155]

73 In contrast, KBSF’s decision policies are able to balance the pole in 90% of the attempts, on average, using as few as m = 30 representative states. [sent-195, score-0.183]

74 To put this number in perspective, recall that some of the test states are quite challenging, with the two poles inclined and falling in opposite directions. [sent-199, score-0.059]

75 KBSF(10^6, 200) delivers a decision policy in less than 7 minutes. [sent-202, score-0.137]

76 In addition, the policy-update stage requires the definition of π(ŝ_k^a) for all n states in the set of sample transitions. [sent-205, score-0.073]

77 In contrast, KBSF only performs O(m^3) operations to evaluate a decision policy and O(m^2 |A|) operations to update it. [sent-206, score-0.175]

78 We conclude our empirical evaluation of KBSF by using it to learn a neurostimulation policy for the treatment of epilepsy. [sent-207, score-0.136]

79 The task is simulated with a model [19] based on real data collected from epileptic rat hippocampus slices. [sent-209, score-0.078]

80 Figure 2: Results on the pole-balancing tasks averaged over 50 runs (panel c: performance on double pole-balancing; panel d: run time on double pole-balancing). [sent-221, score-0.105]

81 The values correspond to the fraction of episodes initiated from the test states in which the pole(s) could be balanced for 3000 steps (one minute of simulated time). [sent-222, score-0.091]

82 This model was shown to produce the seizure pattern of the original dynamical system and was later validated through the deployment of a learned treatment policy on a real brain slice [20]. [sent-225, score-0.11]

83 The associated decision problem has a five-dimensional continuous state space and extremely non-linear dynamics. [sent-226, score-0.097]

84 The goal is to suppress seizures while minimizing the total amount of stimulation needed to do so. [sent-228, score-0.13]

85 We use as a baseline for our comparisons the fixed-frequency stimulation policies usually adopted in standard in vitro clinical studies [20]. [sent-229, score-0.166]

86 In particular, we considered policies that apply electrical pulses at frequencies of 0 Hz, 0. [sent-230, score-0.072]

87 This modification made it possible to use m = 50,000 representative states with KBSF. [sent-234, score-0.102]

88 We compare the decision policies returned by KBSF and LSPI with those computed by fitted Q-iteration using Ernst et al. [sent-236, score-0.11]

89 In order to obtain different compromises between the problem’s two conflicting objectives, we varied the relative magnitude of the penalties associated with the occurrence of seizures and with the application of an electrical pulse [19, 20]. [sent-252, score-0.107]

90 As shown in Figure 3a, LSPI’s policies seem to prioritize reduction of stimulation at the expense of higher seizure occurrence, which is clearly sub-optimal from a clinical point of view. [sent-255, score-0.155]

91 In contrast, T20 and KBSF are both able to generate decision policies superior to the 1 Hz policy, which is the most efficient stimulation regime known to date in the clinical literature [21]. [sent-257, score-0.182]

92 The algorithms used n = 500,000 sample transitions to build the approximations. [sent-275, score-0.088]

93 The decision policies were evaluated on episodes of 10^5 transitions starting from a fixed set of 10 test states drawn uniformly at random. [sent-276, score-0.291]

94 Our empirical results show that KBSF is able to learn very good decision policies with relatively low computational cost. [sent-279, score-0.11]

95 It also has predictable behavior, generally improving its performance as the number of sample transitions or the size of its approximation model increases. [sent-280, score-0.088]

96 We also plan to extend KBSF to the on-line scenario, where the intermediate decision policies generated during the learning process guide the collection of new sample transitions. [sent-282, score-0.124]

97 Kernel-based models for reinforcement learning in continuous state spaces. [sent-316, score-0.111]

98 Generalization in reinforcement learning: Successful examples using sparse coarse coding. [sent-366, score-0.067]

99 Manifold embeddings for model-based reinforcement learning of neurostimulation policies. [sent-388, score-0.119]

100 Manifold embeddings for model-based reinforcement learning under partial observability. [sent-393, score-0.067]


similar papers computed by tfidf model

tfidf for this paper:

wordName wordTfidf (topN-words)

[('kbsf', 0.718), ('kbrl', 0.418), ('lspi', 0.241), ('sa', 0.175), ('ra', 0.164), ('pa', 0.146), ('ka', 0.135), ('mdp', 0.092), ('dka', 0.091), ('policy', 0.084), ('transitions', 0.074), ('si', 0.072), ('factorization', 0.068), ('reinforcement', 0.067), ('da', 0.067), ('states', 0.059), ('factorizations', 0.058), ('policies', 0.057), ('trick', 0.055), ('stochastic', 0.055), ('decision', 0.053), ('epileptic', 0.052), ('neurostimulation', 0.052), ('seizures', 0.052), ('stimulation', 0.052), ('na', 0.05), ('ormoneit', 0.049), ('dq', 0.045), ('representative', 0.043), ('bush', 0.042), ('double', 0.04), ('occurrence', 0.04), ('kiaj', 0.039), ('ria', 0.039), ('sorg', 0.039), ('episodes', 0.032), ('mcgill', 0.032), ('pole', 0.03), ('transition', 0.029), ('state', 0.029), ('hz', 0.027), ('barreto', 0.026), ('cutler', 0.026), ('homomorphism', 0.026), ('iaj', 0.026), ('seizure', 0.026), ('shadowed', 0.026), ('stochasticfactorization', 0.026), ('torization', 0.026), ('whitt', 0.026), ('action', 0.026), ('rat', 0.026), ('suppress', 0.026), ('tasks', 0.025), ('tted', 0.024), ('seconds', 0.024), ('matrices', 0.023), ('fac', 0.023), ('montreal', 0.023), ('approximator', 0.023), ('jong', 0.021), ('sen', 0.021), ('vitro', 0.021), ('run', 0.021), ('rk', 0.021), ('proposition', 0.02), ('clinical', 0.02), ('contained', 0.02), ('pineau', 0.02), ('compress', 0.02), ('jk', 0.019), ('kernels', 0.019), ('dynamic', 0.019), ('operations', 0.019), ('converges', 0.018), ('irreducible', 0.018), ('ernst', 0.018), ('lagoudakis', 0.018), ('canada', 0.017), ('sb', 0.017), ('matrix', 0.017), ('width', 0.017), ('evaluated', 0.016), ('adopted', 0.016), ('ending', 0.016), ('cost', 0.016), ('rm', 0.016), ('icting', 0.016), ('electrical', 0.015), ('regular', 0.015), ('soft', 0.015), ('nonnegative', 0.015), ('retains', 0.015), ('continuous', 0.015), ('cohen', 0.015), ('unique', 0.015), ('regarding', 0.014), ('dence', 0.014), ('sample', 0.014), ('di', 0.014)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 1.0000004 237 nips-2011-Reinforcement Learning using Kernel-Based Stochastic Factorization

Author: Andre S. Barreto, Doina Precup, Joelle Pineau

Abstract: Kernel-based reinforcement-learning (KBRL) is a method for learning a decision policy from a set of sample transitions which stands out for its strong theoretical guarantees. However, the size of the approximator grows with the number of transitions, which makes the approach impractical for large problems. In this paper we introduce a novel algorithm to improve the scalability of KBRL. We resort to a special decomposition of a transition matrix, called stochastic factorization, to fix the size of the approximator while at the same time incorporating all the information contained in the data. The resulting algorithm, kernel-based stochastic factorization (KBSF), is much faster but still converges to a unique solution. We derive a theoretical upper bound for the distance between the value functions computed by KBRL and KBSF. The effectiveness of our method is illustrated with computational experiments on four reinforcement-learning problems, including a difficult task in which the goal is to learn a neurostimulation policy to suppress the occurrence of seizures in epileptic rat brains. We empirically demonstrate that the proposed approach is able to compress the information contained in KBRL’s model. Also, on the tasks studied, KBSF outperforms two of the most prominent reinforcement-learning algorithms, namely least-squares policy iteration and fitted Q-iteration. 1

2 0.092991635 16 nips-2011-A reinterpretation of the policy oscillation phenomenon in approximate policy iteration

Author: Paul Wagner

Abstract: A majority of approximate dynamic programming approaches to the reinforcement learning problem can be categorized into greedy value function methods and value-based policy gradient methods. The former approach, although fast, is well known to be susceptible to the policy oscillation phenomenon. We take a fresh view to this phenomenon by casting a considerable subset of the former approach as a limiting special case of the latter. We explain the phenomenon in terms of this view and illustrate the underlying mechanism with artificial examples. We also use it to derive the constrained natural actor-critic algorithm that can interpolate between the aforementioned approaches. In addition, it has been suggested in the literature that the oscillation phenomenon might be subtly connected to the grossly suboptimal performance in the Tetris benchmark problem of all attempted approximate dynamic programming methods. We report empirical evidence against such a connection and in favor of an alternative explanation. Finally, we report scores in the Tetris problem that improve on existing dynamic programming based results. 1

3 0.07800018 18 nips-2011-Action-Gap Phenomenon in Reinforcement Learning

Author: Amir-massoud Farahmand

Abstract: Many practitioners of reinforcement learning problems have observed that oftentimes the performance of the agent reaches very close to the optimal performance even though the estimated (action-)value function is still far from the optimal one. The goal of this paper is to explain and formalize this phenomenon by introducing the concept of the action-gap regularity. As a typical result, we prove that for an agent following the greedy policy π̂ with respect to an action-value function Q̂, the performance loss E[V*(X) − V^π̂(X)] is upper bounded by O(‖Q̂ − Q*‖_∞^{1+ζ}), in which ζ ≥ 0 is the parameter quantifying the action-gap regularity. For ζ > 0, our results indicate smaller performance loss compared to what previous analyses had suggested. Finally, we show how this regularity affects the performance of the family of approximate value iteration algorithms. 1

4 0.067704767 215 nips-2011-Policy Gradient Coagent Networks

Author: Philip S. Thomas

Abstract: We present a novel class of actor-critic algorithms for actors consisting of sets of interacting modules. We present, analyze theoretically, and empirically evaluate an update rule for each module, which requires only local information: the module’s input, output, and the TD error broadcast by a critic. Such updates are necessary when computation of compatible features becomes prohibitively difficult and are also desirable to increase the biological plausibility of reinforcement learning methods. 1

5 0.06622418 206 nips-2011-Optimal Reinforcement Learning for Gaussian Systems

Author: Philipp Hennig

Abstract: The exploration-exploitation trade-off is among the central challenges of reinforcement learning. The optimal Bayesian solution is intractable in general. This paper studies to what extent analytic statements about optimal learning are possible if all beliefs are Gaussian processes. A first order approximation of learning of both loss and dynamics, for nonlinear, time-varying systems in continuous time and space, subject to a relatively weak restriction on the dynamics, is described by an infinite-dimensional partial differential equation. An approximate finitedimensional projection gives an impression for how this result may be helpful. 1 Introduction – Optimal Reinforcement Learning Reinforcement learning is about doing two things at once: Optimizing a function while learning about it. These two objectives must be balanced: Ignorance precludes efficient optimization; time spent hunting after irrelevant knowledge incurs unnecessary loss. This dilemma is famously known as the exploration exploitation trade-off. Classic reinforcement learning often considers time cheap; the trade-off then plays a subordinate role to the desire for learning a “correct” model or policy. Many classic reinforcement learning algorithms thus rely on ad-hoc methods to control exploration, such as “ -greedy” [1], or “Thompson sampling” [2]. However, at least since a thesis by Duff [3] it has been known that Bayesian inference allows optimal balance between exploration and exploitation. It requires integration over every possible future trajectory under the current belief about the system’s dynamics, all possible new data acquired along those trajectories, and their effect on decisions taken along the way. This amounts to optimization and integration over a tree, of exponential cost in the size of the state space [4]. The situation is particularly dire for continuous space-times, where both depth and branching factor of the “tree” are uncountably infinite. Several authors have proposed approximating this lookahead through samples [5, 6, 7, 8], or ad-hoc estimators that can be shown to be in some sense close to the Bayes-optimal policy [9]. In a parallel development, recent work by Todorov [10], Kappen [11] and others introduced an idea to reinforcement learning long commonplace in other areas of machine learning: Structural assumptions, while restrictive, can greatly simplify inference problems. In particular, a recent paper by Simpkins et al. [12] showed that it is actually possible to solve the exploration exploitation trade-off locally, by constructing a linear approximation using a Kalman filter. Simpkins and colleagues further assumed to know the loss function, and the dynamics up to Brownian drift. Here, I use their work as inspiration for a study of general optimal reinforcement learning of dynamics and loss functions of an unknown, nonlinear, time-varying system (note that most reinforcement learning algorithms are restricted to time-invariant systems). The core assumption is that all uncertain variables are known up to Gaussian process uncertainty. The main result is a first-order description of optimal reinforcement learning in form of infinite-dimensional differential statements. This kind of description opens up new approaches to reinforcement learning. As an only initial example of such treatments, Section 4 1 presents an approximate Ansatz that affords an explicit reinforcement learning algorithm; tested in some simple but instructive experiments (Section 5). 
An intuitive description of the paper’s results is this: From prior and corresponding choice of learning machinery (Section 2), we construct statements about the dynamics of the learning process (Section 3). The learning machine itself provides a probabilistic description of the dynamics of the physical system. Combining both dynamics yields a joint system, which we aim to control optimally. Doing so amounts to simultaneously controlling exploration (controlling the learning system) and exploitation (controlling the physical system). Because large parts of the analysis rely on concepts from optimal control theory, this paper will use notation from that field. Readers more familiar with the reinforcement learning literature may wish to mentally replace coordinates x with states s, controls u with actions a, dynamics with transitions p(s | s, a) and utilities q with losses (negative rewards) −r. The latter is potentially confusing, so note that optimal control in this paper will attempt to minimize values, rather than to maximize them, as usual in reinforcement learning (these two descriptions are, of course, equivalent). 2 A Class of Learning Problems We consider the task of optimally controlling an uncertain system whose states s ≡ (x, t) ∈ K ≡ RD ×R lie in a D +1 dimensional Euclidean phase space-time: A cost Q (cumulated loss) is acquired at (x, t) with rate dQ/dt = q(x, t), and the first inference problem is to learn this analytic function q. A second, independent learning problem concerns the dynamics of the system. We assume the dynamics separate into free and controlled terms affine to the control: dx(t) = [f (x, t) + g(x, t)u(x, t)] dt (1) where u(x, t) is the control function we seek to optimize, and f, g are analytic functions. To simplify our analysis, we will assume that either f or g are known, while the other may be uncertain (or, alternatively, that it is possible to obtain independent samples from both functions). See Section 3 for a note on how this assumption may be relaxed. W.l.o.g., let f be uncertain and g known. Information about both q(x, t) and f (x, t) = [f1 , . . . , fD ] is acquired stochastically: A Poisson process of constant rate λ produces mutually independent samples yq (x, t) = q(x, t)+ q and yf d (x, t) = fd (x, t)+ fd where q 2 ∼ N (0, σq ); fd 2 ∼ N (0, σf d ). (2) The noise levels σq and σf are presumed known. Let our initial beliefs about q and f be given by D Gaussian processes GP kq (q; µq , Σq ); and independent Gaussian processes d GP kf d (fd ; µf d , Σf d ), respectively, with kernels kr , kf 1 , . . . , kf D over K, and mean / covariance functions µ / Σ. In other words, samples over the belief can be drawn using an infinite vector Ω of i.i.d. Gaussian variables, as 1/2 ˜ fd ([x, t]) = µf d ([x, t])+ 1/2 Σf d ([x, t], [x , t ])Ω(x , t )dx dt = µf d ([x, t])+(Σf d Ω)([x, t]) (3) the second equation demonstrates a compact notation for inner products that will be used throughout. It is important to note that f, q are unknown, but deterministic. At any point during learning, we can use the same samples Ω to describe uncertainty, while µ, Σ change during the learning process. To ensure continuous trajectories, we also need to regularize the control. Following control custom, 1 we introduce a quadratic control cost ρ(u) = 2 u R−1 u with control cost scaling matrix R. Its units [R] = [x/t]/[Q/x] relate the cost of changing location to the utility gained by doing so. 
The overall task is to find the optimal discounted horizon value ∞ v(x, t) = min u t 1 e−(τ −t)/γ q[χ[τ, u(χ, τ )], τ ] + u(χ, τ ) R−1 u(χ, τ ) dτ 2 (4) where χ(τ, u) is the trajectory generated by the dynamics defined in Equation (1), using the control law (policy) u(x, t). The exponential definition of the discount γ > 0 gives the unit of time to γ. Before beginning the analysis, consider the relative generality of this definition: We allow for a continuous phase space. Both loss and dynamics may be uncertain, of rather general nonlinear form, and may change over time. The specific choice of a Poisson process for the generation of samples is 2 somewhat ad-hoc, but some measure is required to quantify the flow of information through time. The Poisson process is in some sense the simplest such measure, assigning uniform probability density. An alternative is to assume that datapoints are acquired at regular intervals of width λ. This results in a quite similar model but, since the system’s dynamics still proceed in continuous time, can complicate notation. A downside is that we had to restrict the form of the dynamics. However, Eq. (1) still covers numerous physical systems studied in control, for example many mechanical systems, from classics like cart-and-pole to realistic models for helicopters [13]. 3 Optimal Control for the Learning Process The optimal solution to the exploration exploitation trade-off is formed by the dual control [14] of a joint representation of the physical system and the beliefs over it. In reinforcement learning, this idea is known as a belief-augmented POMDP [3, 4], but is not usually construed as a control problem. This section constructs the Hamilton-Jacobi-Bellman (HJB) equation of the joint control problem for the system described in Sec. 2, and analytically solves the equation for the optimal control. This necessitates a description of the learning algorithm’s dynamics: At time t = τ , let the system be at phase space-time sτ = (x(τ ), τ ) and have the Gaussian process belief GP(q; µτ (s), Στ (s, s )) over the function q (all derivations in this section will focus on q, and we will drop the sub-script q from many quantities for readability. The forms for f , or g, are entirely analogous, with independent Gaussian processes for each dimension d = 1, . . . , D). This belief stems from a finite number N of samples y 0 = [y1 , . . . , yN ] ∈ RN collected at space-times S 0 = [(x1 , t1 ), . . . , (xN , tN )] ≡ [s1 , . . . , sN ] ∈ KN (note that t1 to tN need not be equally spaced, ordered, or < τ ). For arbitrary points s∗ = (x∗ , t∗ ) ∈ K, the belief over q(s∗ ) is a Gaussian with mean function µτ , and co-variance function Στ [15] 2 µτ (s∗ ) = k(s∗ , S 0 )[K(S 0 , S 0 ) + σq I]−1 y 0 i i (5) 2 Στ (s∗ , s∗ ) = k(s∗ , s∗ ) − k(s∗ , S 0 )[K(S 0 , S 0 ) + σq I]−1 k(S 0 , s∗ ) i j i j i j where K(S 0 , S 0 ) is the Gram matrix with elements Kab = k(sa , sb ). We will abbreviate K0 ≡ 2 [K(S 0 , S 0 ) + σy I] from here on. The co-vector k(s∗ , S 0 ) has elements ki = k(s∗ , si ) and will be shortened to k0 . How does this belief change as time moves from τ to τ + dt? If dt 0, the chance of acquiring a datapoint yτ in this time is λ dt. Marginalising over this Poisson stochasticity, we expect one sample with probability λ dt, two samples with (λ dt)2 and so on. 
So the mean after dt is expected to be µτ + dt = λ dt (k0 , kτ ) K0 ξτ ξτ κτ −1 y0 −1 + (1 − λ dt − O(λ dt)2 ) · k0 K0 y 0 + O(λ dt)2 (6) yτ where we have defined the map kτ = k(s∗ , sτ ), the vector ξ τ with elements ξτ,i = k(si , sτ ), and 2 the scalar κτ = k(sτ , sτ ) + σq . Algebraic re-formulation yields −1 −1 −1 −1 µτ + dt = k0 K0 y 0 + λ(kt − k0 K0 ξ t )(κt − ξ t K0 ξ t )−1 (yt − ξ t K0 y 0 ) dt. −1 ξ τ K0 y 0 −1 ξ τ K0 ξ τ ) Note that = µ(sτ ), the mean prediction at sτ and (κτ − ¯ ¯ the marginal variance there. Hence, we can define scalars Σ, σ and write −1 (yτ − ξ τ K0 y 0 ) [Σ1/2 Ω](sτ ) + σω ¯τ = ≡ Σ1/2 Ω + στ ω ¯ −1 1/2 [Σ(sτ , sτ ) + σ 2 ]1/2 (κτ − ξ τ K0 ξ τ ) = 2 σq (7) + Σ(sτ , sτ ), with ω ∼ N (0, 1). (8) So the change to the mean consists of a deterministic but uncertain change whose effects accumulate linearly in time, and a stochastic change, caused by the independent noise process, whose variance accumulates linearly in time (in truth, these two points are considerably subtler, a detailed proof is left out for lack of space). We use the Wiener [16] measure dω to write dµsτ (s∗ ) = λ −1 kτ − k0 K0 ξ τ [Σ1/2 Ω](sτ ) + σω ¯τ dt ≡ λLsτ (s∗ )[Σ1/2 Ω dt + στ dω] ¯ −1 (κτ − ξ τ K0 ξ τ )−1/2 [Σ(sτ , sτ ) + σ 2 ]1/2 (9) where we have implicitly defined the innovation function L. Note that L is a function of both s∗ and sτ . A similar argument finds the change of the covariance function to be the deterministic rate dΣsτ (s∗ , s∗ ) = −λLsτ (s∗ )Lsτ (s∗ ) dt. i j i j 3 (10) So the dynamics of learning consist of a deterministic change to the covariance, and both deterministic and stochastic changes to the mean, both of which are samples a Gaussian processes with covariance function proportional to LL . This separation is a fundamental characteristic of GPs (it is the nonparametric version of a more straightforward notion for finite-dimensional Gaussian beliefs, for data with known noise magnitude). We introduce the belief-augmented space H containing states z(τ ) ≡ [x(τ ), τ, µτ (s), µτ 1 , . . . , µτ D , q f f Στ (s, s ), Στ 1 , . . . , Στ D ]. Since the means and covariances are functions, H is infinite-dimensional. q f f Under our beliefs, z(τ ) obeys a stochastic differential equation of the form dz = [A(z) + B(z)u + C(z)Ω] dt + D(z) dω (11) with free dynamics A, controlled dynamics Bu, uncertainty operator C, and noise operator D A = µτ (zx , zt ) , 1 , 0 , 0 , . . . , 0 , −λLq Lq , −λLf 1 Lf 1 , . . . , −λLf D Lf D ; f B = [g(s∗ ), 0, 0, 0, . . . ]; (12) 1/2 ¯q ¯ 1/2 ¯ 1/2 C = diag(Σf τ , 0, λLq Σ1/2 , λLf 1 Σf 1 , . . . , λLf D Σf d , 0, . . . , 0); D = diag(0, 0, λLq σq , λLf 1 σf 1 , . . . , λLf D σf D , 0, . . . , 0) ¯ ¯ ¯ (13) ∗ The value – the expected cost to go – of any state s is given by the Hamilton-Jacobi-Bellman equation, which follows from Bellman’s principle and a first-order expansion, using Eq. (4): 1 (14) µq (sτ ) + Σ1/2 Ωq + σq ωq + u R−1 u dt + v(zτ + dt ) dω dΩ qτ 2 1 v(zτ ) ∂v 1 µτ +Σ1/2 Ωq + u R−1 u+ + +[A+Bu+CΩ] v+ tr[D ( 2 v)D]dΩ dt q qτ 2 dt ∂t 2 v(zτ ) = min u = min u Integration over ω can be performed with ease, and removes the stochasticity from the problem; The uncertainty over Ω is a lot more challenging. Because the distribution over future losses is correlated through space and time, v, 2 v are functions of Ω, and the integral is nontrivial. But there are some obvious approximate approaches. 
For example, if we (inexactly) swap integration and minimisation, draw samples Ωi and solve for the value for each sample, we get an “average optimal controller”. This over-estimates the actual sum of future rewards by assuming the controller has access to the true system. It has the potential advantage of considering the actual optimal controller for every possible system, the disadvantage that the average of optima need not be optimal for any actual solution. On the other hand, if we ignore the correlation between Ω and v, we can integrate (17) locally, all terms in Ω drop out and we are left with an “optimal average controller”, which assumes that the system locally follows its average (mean) dynamics. This cheaper strategy was adopted in the following. Note that it is myopic, but not greedy in a simplistic sense – it does take the effect of learning into account. It amounts to a “global one-step look-ahead”. One could imagine extensions that consider the influence of Ω on v to a higher order, but these will be left for future work. Under this first-order approximation, analytic minimisation over u can be performed in closed form, and bears u(z) = −RB(z) v(z) = −Rg(x, t) x v(z). (15) The optimal Hamilton-Jacobi-Bellman equation is then 1 1 v − [ v] BRB v + tr D ( 2 v)D . 2 2 A more explicit form emerges upon re-inserting the definitions of Eq. (12) into Eq. (16): γ −1 v(z) = µτ + A q γ −1 v(z) = [µτ + µτ (zx , zt ) q f x + t 1 v(z) − [ 2 x v(z)] free drift cost c=q,f1 ,...,fD x v(z) control benefit − λ Lc Lc + g (zx , zt )Rg(zx , zt ) (16) Σc 1 v(z) + λ2 σc Lf d ( ¯2 2 exploration bonus 2 µf d v(z))Lf d (17) diffusion cost Equation (17) is the central result: Given Gaussian priors on nonlinear control-affine dynamic systems, up to a first order approximation, optimal reinforcement learning is described by an infinitedimensional second-order partial differential equation. It can be interpreted as follows (labels in the 4 equation, note the negative signs of “beneficial” terms): The value of a state comprises the immediate utility rate; the effect of the free drift through space-time and the benefit of optimal control; an exploration bonus of learning, and a diffusion cost engendered by the measurement noise. The first two lines of the right hand side describe effects from the phase space-time subspace of the augmented space, while the last line describes effects from the belief part of the augmented space. The former will be called exploitation terms, the latter exploration terms, for the following reason: If the first two lines line dominate the right hand side of Equation (17) in absolute size, then future losses are governed by the physical sub-space – caused by exploiting knowledge to control the physical system. On the other hand, if the last line dominates the value function, exploration is more important than exploitation – the algorithm controls the physical space to increase knowledge. To my knowledge, this is the first differential statement about reinforcement learning’s two objectives. Finally, note the role of the sampling rate λ: If λ is very low, exploration is useless over the discount horizon. Even after these approximations, solving Equation (17) for v remains nontrivial for two reasons: First, although the vector product notation is pleasingly compact, the mean and covariance functions are of course infinite-dimensional, and what looks like straightforward inner vector products are in fact integrals. 
For example, the average exploration bonus for the loss, writ large, reads ∂v(z) (18) −λLq Lq Σq v(z) = − λL(q) (s∗ )L(q) (s∗ ) ds∗ ds∗ . sτ i sτ j ∂Σ(s∗ , s∗ ) i j K i j (note that this object remains a function of the state sτ ). For general kernels k, these integrals may only be solved numerically. However, for at least one specific choice of kernel (square-exponentials) and parametric Ansatz, the required integrals can be solved in closed form. This analytic structure is so interesting, and the square-exponential kernel so widely used that the “numerical” part of the paper (Section 4) will restrict the choice of kernel to this class. The other problem, of course, is that Equation (17) is a nontrivial differential Equation. Section 4 presents one, initial attempt at a numerical solution that should not be mistaken for a definitive answer. Despite all this, Eq. (17) arguably constitutes a useful gain for Bayesian reinforcement learning: It replaces the intractable definition of the value in terms of future trajectories with a differential equation. This raises hope for new approaches to reinforcement learning, based on numerical analysis rather than sampling. Digression: Relaxing Some Assumptions This paper only applies to the specific problem class of Section 2. Any generalisations and extensions are future work, and I do not claim to solve them. But it is instructive to consider some easier extensions, and some harder ones: For example, it is intractable to simultaneously learn both g and f nonparametrically, if only the actual transitions are observed, because the beliefs over the two functions become infinitely dependent when conditioned on data. But if the belief on either g or f is parametric (e.g. a general linear model), a joint belief on g and f is tractable [see 15, §2.7], in fact straightforward. Both the quadratic control cost ∝ u Ru and the control-affine form (g(x, t)u) are relaxable assumptions – other parametric forms are possible, as long as they allow for analytic optimization of Eq. (14). On the question of learning the kernels for Gaussian process regression on q and f or g, it is clear that standard ways of inferring kernels [15, 18] can be used without complication, but that they are not covered by the notion of optimal learning as addressed here. 4 Numerically Solving the Hamilton-Jacobi-Bellman Equation Solving Equation (16) is principally a problem of numerical analysis, and a battery of numerical methods may be considered. This section reports on one specific Ansatz, a Galerkin-type projection analogous to the one used in [12]. For this we break with the generality of previous sections and assume that the kernels k are given by square exponentials k(a, b) = kSE (a, b; θ, S) = 1 θ2 exp(− 2 (a − b) S −1 (a − b)) with parameters θ, S. As discussed above, we approximate by setting Ω = 0. We find an approximate solution through a factorizing parametric Ansatz: Let the value of any point z ∈ H in the belief space be given through a set of parameters w and some nonlinear functionals φ, such that their contributions separate over phase space, mean, and covariance functions: v(z) = φe (ze ) we with φe , we ∈ RNe (19) e=x,Σq ,µq ,Σf ,µf 5 This projection is obviously restrictive, but it should be compared to the use of radial basis functions for function approximation, a similarly restrictive framework widely used in reinforcement learning. The functionals φ have to be chosen conducive to the form of Eq. (17). 
For square exponential kernels, one convenient choice is φa (zs ) = k(sz , sa ; θa , Sa ) s (20) [Σz (s∗ , s∗ ) − k(s∗ , s∗ )]k(s∗ , sb ; θb , Sb )k(s∗ , sb ; θb , Sb ) ds∗ ds∗ i j i j i j i j φb (zΣ ) = Σ and (21) K µz (s∗ )µz (s∗ )k(s∗ , sc , θc , Sc )k(s∗ , sc , θc , Sc ) ds∗ ds∗ i j i j i j φc (zµ ) = µ (22) K (the subtracted term in the first integral serves only numerical purposes). With this choice, the integrals of Equation (17) can be solved analytically (solutions left out due to space constraints). The approximate Ansatz turns Eq. (17) into an algebraic equation quadratic in wx , linear in all other we : 1 w Ψ(zx )wx − q(zx ) + 2 x Ξe (ze )we = 0 (23) e=x,µq ,Σq ,µf ,Σf using co-vectors Ξ and a matrix Ψ with elements Ξx (zs ) = γ −1 φa (zs ) − f (zx ) a s ΞΣ (zΣ ) = γ −1 φa (zΣ ) + λ a Σ a x φs (zs ) − a t φs (zs ) Lsτ (s∗ )Lsτ (s∗ ) i j K Ξµ (zµ ) a = γ −1 φa (zµ ) µ Ψ(z)k = [ k x φs (z)] − λ2 σsτ ¯2 2 ∂φΣ (zΣ ) ds∗ ds∗ ∂Σz (s∗ , s∗ ) i j i j Lsτ (s∗ )Lsτ (s∗ ) i j K g(zx )Rg(zx ) [ ∂ 2 φa (zµ ) µ ds∗ ds∗ ∂µz (s∗ )∂µz (s∗ ) i j i j (24) x φs (z)] Note that Ξµ and ΞΣ are both functions of the physical state, through sτ . It is through this functional dependency that the value of information is associated with the physical phase space-time. To solve for w, we simply choose a number of evaluation points z eval sufficient to constrain the resulting system of quadratic equations, and then find the least-squares solution wopt by function minimisation, using standard methods, such as Levenberg-Marquardt [19]. A disadvantage of this approach is that is has a number of degrees of freedom Θ, such as the kernel parameters, and the number and locations xa of the feature functionals. Our experiments (Section 5) suggest that it is nevertheless possible to get interesting results simply by choosing these parameters heuristically. 5 5.1 Experiments Illustrative Experiment on an Artificial Environment As a simple example system with a one-dimensional state space, f, q were sampled from the model described in Section 2, and g set to the unit function. The state space was tiled regularly, in a bounded region, with 231 square exponential (“radial”) basis functions (Equation 20), initially all with weight i wx = 0. For the information terms, only a single basis function was used for each term (i.e. one single φΣq , one single φµq , and equally for f , all with very large length scales S, covering the entire region of interest). As pointed out above, this does not imply a trivial structure for these terms, because of the functional dependency on Lsτ . Five times the number of parameters, i.e. Neval = 1175 evaluation points zeval were sampled, at each time step, uniformly over the same region. It is not intuitively clear whether each ze should have its own belief (i.e. whether the points must cover the belief space as well as the phase space), but anecdotal evidence from the experiments suggests that it suffices to use the current beliefs for all evaluation points. A more comprehensive evaluation of such aspects will be the subject of a future paper. The discount factor was set to γ = 50s, the sampling rate at λ = 2/s, the control cost at 10m2 /($s). Value and optimal control were evaluated at time steps of δt = 1/λ = 0.5s. Figure 1 shows the situation 50s after initialisation. The most noteworthy aspect is the nontrivial structure of exploration and exploitation terms. Despite the simplistic parameterisation of the corresponding functionals, their functional dependence on sτ induces a complex shape. 
The system 6 0 40 40 0.5 20 −2 20 x 0 0 0 −4 −20 −0.5 −20 −6 −40 −1 −40 −8 0 20 40 60 80 100 0 40 20 40 60 80 100 0 40 20 20 0.5 0 −20 x 0 −20 −40 −40 0 0 20 40 60 80 −0.5 100 −1 0 t 20 40 60 80 100 t Figure 1: State after 50 time steps, plotted over phase space-time. top left: µq (blue is good). The belief over f is not shown, but has similar structure. top right: value estimate v at current belief: compare to next two panels to note that the approximation is relatively coarse. bottom left: exploration terms. bottom right: exploitation terms. At its current state (black diamond), the system is in the process of switching from exploitation to exploration (blue region in bottom right panel is roughly cancelled by red, forward cone in bottom left one). constantly balances exploration and exploitation, and the optimal balance depends nontrivially on location, time, and the actual value (as opposed to only uncertainty) of accumulated knowledge. This is an important insight that casts doubt on the usefulness of simple, local exploration boni, used in many reinforcement learning algorithms. Secondly, note that the system’s trajectory does not necessarily follow what would be the optimal path under full information. The value estimate reflects this, by assigning low (good) value to regions behind the system’s trajectory. This amounts to a sense of “remorse”: If the learner would have known about these regions earlier, it would have strived to reach them. But this is not a sign of sub-optimality: Remember that the value is defined on the augmented space. The plots in Figure 1 are merely a slice through that space at some level set in the belief space. 5.2 Comparative Experiment – The Furuta Pendulum The cart-and-pole system is an under-actuated problem widely studied in reinforcement learning. For variation, this experiment uses a cylindrical version, the pendulum on the rotating arm [20]. The task is to swing up the pendulum from the lower resting point. The table in Figure 2 compares the average loss of a controller with access to the true f, g, q, but otherwise using Algorithm 1, to that of an -greedy TD(λ) learner with linear function approximation, Simpkins’ et al.’s [12] Kalman method and the Gaussian process learning controller (Fig. 2). The linear function approximation of TD(λ) used the same radial basis functions as the three other methods. None of these methods is free of assumptions: Note that the sampling frequency influences TD in nontrivial ways rarely studied (for example through the coarseness of the -greedy policy). The parameters were set to γ = 5s, λ = 50/s. Note that reinforcement learning experiments often quote total accumulated loss, which differs from the discounted task posed to the learner. Figure 2 reports actual discounted losses. The GP method clearly outperforms the other two learners, which barely explore. Interestingly, none of the tested methods, not even the informed controller, achieve a stable controlled balance, although 7 u θ1 1 2 Method cumulative loss Full Information (baseline) TD(λ) Kalman filter Optimal Learner Gaussian process optimal learner 4.4 ±0.3 6.401±0.001 6.408±0.001 4.6 ±1.4 θ2 Figure 2: The Furuta pendulum system: A pendulum of length 2 is attached to a rotatable arm of length 1 . The control input is the torque applied to the arm. Right: cost to go achieved by different methods. Lower is better. Error measures are one standard deviation over five experiments. the GP learner does swing up the pendulum. 
This is due to the random, non-optimal location of basis functions, which means resolution is not necessarily available where it is needed (in regions of high curvature of the value function), and demonstrates a need for better solution methods for Eq. (17). There is of course a large number of other algorithms methods to potentially compare to, and these results are anything but exhaustive. They should not be misunderstood as a critique of any other method. But they highlight the need for units of measure on every quantity, and show how hard optimal exploration and exploitation truly is. Note that, for time-varying or discounted problems, there is no “conservative” option that cold be adopted in place of the Bayesian answer. 6 Conclusion Gaussian process priors provide a nontrivial class of reinforcement learning problems for which optimal reinforcement learning reduces to solving differential equations. Of course, this fact alone does not make the problem easier, as solving nonlinear differential equations is in general intractable. However, the ubiquity of differential descriptions in other fields raises hope that this insight opens new approaches to reinforcement learning. For intuition on how such solutions might work, one specific approximation was presented, using functionals to reduce the problem to finite least-squares parameter estimation. The critical reader will have noted how central the prior is for the arguments in Section 3: The dynamics of the learning process are predictions of future data, thus inherently determined exclusively by prior assumptions. One may find this unappealing, but there is no escape from it. Minimizing future loss requires predicting future loss, and predictions are always in danger of falling victim to incorrect assumptions. A finite initial identification phase may mitigate this problem by replacing prior with posterior uncertainty – but even then, predictions and decisions will depend on the model. The results of this paper raise new questions, theoretical and applied. The most pressing questions concern better solution methods for Eq. (14), in particular better means for taking the expectation over the uncertain dynamics to more than first order. There are also obvious probabilistic issues: Are there other classes of priors that allow similar treatments? (Note some conceptual similarities between this work and the BEETLE algorithm [4]). To what extent can approximate inference methods – widely studied in combination with Gaussian process regression – be used to broaden the utility of these results? Acknowledgments The author wishes to express his gratitude to Carl Rasmussen, Jan Peters, Zoubin Ghahramani, Peter Dayan, and an anonymous reviewer, whose thoughtful comments uncovered several errors and crucially improved this paper. 8 References [1] R.S. Sutton and A.G. Barto. Reinforcement Learning. MIT Press, 1998. [2] W.R. Thompson. On the likelihood that one unknown probability exceeds another in view of two samples. Biometrika, 25:275–294, 1933. [3] M.O.G. Duff. Optimal Learning: Computational procedures for Bayes-adaptive Markov decision processes. PhD thesis, U of Massachusetts, Amherst, 2002. [4] P. Poupart, N. Vlassis, J. Hoey, and K. Regan. An analytic solution to discrete Bayesian reinforcement learning. In Proceedings of the 23rd International Conference on Machine Learning, pages 697–704, 2006. [5] Richard Dearden, Nir Friedman, and David Andre. Model based Bayesian exploration. 
Acknowledgments

The author wishes to express his gratitude to Carl Rasmussen, Jan Peters, Zoubin Ghahramani, Peter Dayan, and an anonymous reviewer, whose thoughtful comments uncovered several errors and crucially improved this paper.

References

[1] R.S. Sutton and A.G. Barto. Reinforcement Learning. MIT Press, 1998.
[2] W.R. Thompson. On the likelihood that one unknown probability exceeds another in view of the evidence of two samples. Biometrika, 25:285–294, 1933.
[3] M.O.G. Duff. Optimal Learning: Computational procedures for Bayes-adaptive Markov decision processes. PhD thesis, University of Massachusetts, Amherst, 2002.
[4] P. Poupart, N. Vlassis, J. Hoey, and K. Regan. An analytic solution to discrete Bayesian reinforcement learning. In Proceedings of the 23rd International Conference on Machine Learning, pages 697–704, 2006.
[5] R. Dearden, N. Friedman, and D. Andre. Model based Bayesian exploration. In Uncertainty in Artificial Intelligence, pages 150–159, 1999.
[6] M. Strens. A Bayesian framework for reinforcement learning. In International Conference on Machine Learning, pages 943–950, 2000.
[7] T. Wang, D. Lizotte, M. Bowling, and D. Schuurmans. Bayesian sparse sampling for on-line reward optimization. In International Conference on Machine Learning, pages 956–963, 2005.
[8] J. Asmuth, L. Li, M.L. Littman, A. Nouri, and D. Wingate. A Bayesian sampling approach to exploration in reinforcement learning. In Uncertainty in Artificial Intelligence, 2009.
[9] J.Z. Kolter and A.Y. Ng. Near-Bayesian exploration in polynomial time. In Proceedings of the 26th International Conference on Machine Learning, 2009.
[10] E. Todorov. Linearly-solvable Markov decision problems. Advances in Neural Information Processing Systems, 19, 2007.
[11] H.J. Kappen. An introduction to stochastic control theory, path integrals and reinforcement learning. In 9th Granada Seminar on Computational Physics: Computational and Mathematical Modeling of Cooperative Behavior in Neural Systems, pages 149–181, 2007.
[12] A. Simpkins, R. De Callafon, and E. Todorov. Optimal trade-off between exploration and exploitation. In American Control Conference, pages 33–38, 2008.
[13] I. Fantoni and R. Lozano. Non-linear Control for Underactuated Mechanical Systems. Springer, 2002.
[14] A.A. Feldbaum. Dual control theory. Automation and Remote Control, 21(9):874–880, 1961.
[15] C.E. Rasmussen and C.K.I. Williams. Gaussian Processes for Machine Learning. MIT Press, 2006.
[16] N. Wiener. Differential space. Journal of Mathematical Physics, 2:131–174, 1923.
[17] T. Kailath. An innovations approach to least-squares estimation – part I: Linear filtering in additive white noise. IEEE Transactions on Automatic Control, 13(6):646–655, 1968.
[18] I. Murray and R.P. Adams. Slice sampling covariance hyperparameters of latent Gaussian models. arXiv:1006.0868, 2010.
[19] D.W. Marquardt. An algorithm for least-squares estimation of nonlinear parameters. Journal of the Society for Industrial and Applied Mathematics, 11(2):431–441, 1963.
[20] K. Furuta, M. Yamakita, and S. Kobayashi. Swing-up control of inverted pendulum using pseudo-state feedback. Journal of Systems and Control Engineering, 206(6):263–269, 1992.

6 0.065647453 10 nips-2011-A Non-Parametric Approach to Dynamic Programming

7 0.064689703 174 nips-2011-Monte Carlo Value Iteration with Macro-Actions

8 0.058802232 245 nips-2011-Selecting the State-Representation in Reinforcement Learning

9 0.055070248 163 nips-2011-MAP Inference for Bayesian Inverse Reinforcement Learning

10 0.052099124 212 nips-2011-Periodic Finite State Controllers for Efficient POMDP and DEC-POMDP Planning

11 0.05096304 256 nips-2011-Solving Decision Problems with Limited Information

12 0.044743784 88 nips-2011-Environmental statistics and the trade-off between model-based and TD learning in humans

13 0.042597026 177 nips-2011-Multi-armed bandits on implicit metric spaces

14 0.041940663 283 nips-2011-The Fixed Points of Off-Policy TD

15 0.040529139 300 nips-2011-Variance Reduction in Monte-Carlo Tree Search

16 0.037453581 210 nips-2011-PAC-Bayesian Analysis of Contextual Bandits

17 0.037249401 190 nips-2011-Nonlinear Inverse Reinforcement Learning with Gaussian Processes

18 0.035984721 65 nips-2011-Convergent Fitted Value Iteration with Linear Function Approximation

19 0.035687122 79 nips-2011-Efficient Offline Communication Policies for Factored Multiagent POMDPs

20 0.035658807 188 nips-2011-Non-conjugate Variational Message Passing for Multinomial and Binary Regression


similar papers computed by lsi model

lsi for this paper:

topicId topicWeight

[(0, 0.1), (1, -0.069), (2, 0.032), (3, 0.073), (4, -0.115), (5, 0.015), (6, -0.007), (7, -0.005), (8, -0.015), (9, 0.004), (10, -0.001), (11, -0.01), (12, -0.007), (13, -0.012), (14, -0.022), (15, -0.063), (16, 0.042), (17, 0.04), (18, 0.013), (19, -0.011), (20, 0.02), (21, -0.028), (22, 0.066), (23, -0.039), (24, -0.014), (25, 0.047), (26, -0.028), (27, -0.033), (28, -0.021), (29, 0.005), (30, -0.017), (31, 0.009), (32, -0.052), (33, -0.007), (34, -0.054), (35, 0.004), (36, -0.0), (37, 0.052), (38, -0.017), (39, 0.009), (40, -0.0), (41, 0.0), (42, -0.042), (43, 0.042), (44, -0.014), (45, -0.036), (46, -0.034), (47, -0.005), (48, -0.031), (49, -0.054)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.88434327 237 nips-2011-Reinforcement Learning using Kernel-Based Stochastic Factorization

Author: Andre S. Barreto, Doina Precup, Joelle Pineau

Abstract: Kernel-based reinforcement-learning (KBRL) is a method for learning a decision policy from a set of sample transitions which stands out for its strong theoretical guarantees. However, the size of the approximator grows with the number of transitions, which makes the approach impractical for large problems. In this paper we introduce a novel algorithm to improve the scalability of KBRL. We resort to a special decomposition of a transition matrix, called stochastic factorization, to fix the size of the approximator while at the same time incorporating all the information contained in the data. The resulting algorithm, kernel-based stochastic factorization (KBSF), is much faster but still converges to a unique solution. We derive a theoretical upper bound for the distance between the value functions computed by KBRL and KBSF. The effectiveness of our method is illustrated with computational experiments on four reinforcement-learning problems, including a difficult task in which the goal is to learn a neurostimulation policy to suppress the occurrence of seizures in epileptic rat brains. We empirically demonstrate that the proposed approach is able to compress the information contained in KBRL’s model. Also, on the tasks studied, KBSF outperforms two of the most prominent reinforcement-learning algorithms, namely least-squares policy iteration and fitted Q-iteration. 1

2 0.74642503 10 nips-2011-A Non-Parametric Approach to Dynamic Programming

Author: Oliver B. Kroemer, Jan R. Peters

Abstract: In this paper, we consider the problem of policy evaluation for continuous-state systems. We present a non-parametric approach to policy evaluation, which uses kernel density estimation to represent the system. The true form of the value function for this model can be determined, and can be computed using Galerkin’s method. Furthermore, we also present a unified view of several well-known policy evaluation methods. In particular, we show that the same Galerkin method can be used to derive Least-Squares Temporal Difference learning, Kernelized Temporal Difference learning, and a discrete-state Dynamic Programming solution, as well as our proposed method. In a numerical evaluation of these algorithms, the proposed approach performed better than the other methods. 1

3 0.72496259 174 nips-2011-Monte Carlo Value Iteration with Macro-Actions

Author: Zhan Lim, Lee Sun, Daniel J. Hsu

Abstract: POMDP planning faces two major computational challenges: large state spaces and long planning horizons. The recently introduced Monte Carlo Value Iteration (MCVI) can tackle POMDPs with very large discrete state spaces or continuous state spaces, but its performance degrades when faced with long planning horizons. This paper presents Macro-MCVI, which extends MCVI by exploiting macro-actions for temporal abstraction. We provide sufficient conditions for Macro-MCVI to inherit the good theoretical properties of MCVI. Macro-MCVI does not require explicit construction of probabilistic models for macro-actions and is thus easy to apply in practice. Experiments show that Macro-MCVI substantially improves the performance of MCVI with suitable macro-actions. 1

4 0.66040355 16 nips-2011-A reinterpretation of the policy oscillation phenomenon in approximate policy iteration

Author: Paul Wagner

Abstract: A majority of approximate dynamic programming approaches to the reinforcement learning problem can be categorized into greedy value function methods and value-based policy gradient methods. The former approach, although fast, is well known to be susceptible to the policy oscillation phenomenon. We take a fresh view to this phenomenon by casting a considerable subset of the former approach as a limiting special case of the latter. We explain the phenomenon in terms of this view and illustrate the underlying mechanism with artificial examples. We also use it to derive the constrained natural actor-critic algorithm that can interpolate between the aforementioned approaches. In addition, it has been suggested in the literature that the oscillation phenomenon might be subtly connected to the grossly suboptimal performance in the Tetris benchmark problem of all attempted approximate dynamic programming methods. We report empirical evidence against such a connection and in favor of an alternative explanation. Finally, we report scores in the Tetris problem that improve on existing dynamic programming based results. 1

5 0.64999592 50 nips-2011-Budgeted Optimization with Concurrent Stochastic-Duration Experiments

Author: Javad Azimi, Alan Fern, Xiaoli Z. Fern

Abstract: Budgeted optimization involves optimizing an unknown function that is costly to evaluate by requesting a limited number of function evaluations at intelligently selected inputs. Typical problem formulations assume that experiments are selected one at a time with a limited total number of experiments, which fail to capture important aspects of many real-world problems. This paper defines a novel problem formulation with the following important extensions: 1) allowing for concurrent experiments; 2) allowing for stochastic experiment durations; and 3) placing constraints on both the total number of experiments and the total experimental time. We develop both offline and online algorithms for selecting concurrent experiments in this new setting and provide experimental results on a number of optimization benchmarks. The results show that our algorithms produce highly effective schedules compared to natural baselines. 1

6 0.64220273 212 nips-2011-Periodic Finite State Controllers for Efficient POMDP and DEC-POMDP Planning

7 0.62185174 215 nips-2011-Policy Gradient Coagent Networks

8 0.59632182 206 nips-2011-Optimal Reinforcement Learning for Gaussian Systems

9 0.58796895 163 nips-2011-MAP Inference for Bayesian Inverse Reinforcement Learning

10 0.57965034 36 nips-2011-Analysis and Improvement of Policy Gradient Estimation

11 0.57753778 300 nips-2011-Variance Reduction in Monte-Carlo Tree Search

12 0.57612634 65 nips-2011-Convergent Fitted Value Iteration with Linear Function Approximation

13 0.54756361 256 nips-2011-Solving Decision Problems with Limited Information

14 0.53641558 18 nips-2011-Action-Gap Phenomenon in Reinforcement Learning

15 0.53056586 79 nips-2011-Efficient Offline Communication Policies for Factored Multiagent POMDPs

16 0.51524448 11 nips-2011-A Reinforcement Learning Theory for Homeostatic Regulation

17 0.46206954 48 nips-2011-Blending Autonomous Exploration and Apprenticeship Learning

18 0.44575781 88 nips-2011-Environmental statistics and the trade-off between model-based and TD learning in humans

19 0.43290102 52 nips-2011-Clustering via Dirichlet Process Mixture Models for Portable Skill Discovery

20 0.43156108 190 nips-2011-Nonlinear Inverse Reinforcement Learning with Gaussian Processes


similar papers computed by lda model

lda for this paper:

topicId topicWeight

[(0, 0.041), (4, 0.03), (11, 0.013), (20, 0.023), (26, 0.03), (31, 0.098), (33, 0.018), (43, 0.035), (45, 0.086), (57, 0.033), (66, 0.334), (74, 0.052), (83, 0.024), (99, 0.061)]

similar papers list:

simIndex simValue paperId paperTitle

1 0.6921131 148 nips-2011-Learning Probabilistic Non-Linear Latent Variable Models for Tracking Complex Activities

Author: Angela Yao, Juergen Gall, Luc V. Gool, Raquel Urtasun

Abstract: A common approach for handling the complexity and inherent ambiguities of 3D human pose estimation is to use pose priors learned from training data. Existing approaches however, are either too simplistic (linear), too complex to learn, or can only learn latent spaces from “simple data”, i.e., single activities such as walking or running. In this paper, we present an efficient stochastic gradient descent algorithm that is able to learn probabilistic non-linear latent spaces composed of multiple activities. Furthermore, we derive an incremental algorithm for the online setting which can update the latent space without extensive relearning. We demonstrate the effectiveness of our approach on the task of monocular and multi-view tracking and show that our approach outperforms the state-of-the-art. 1

same-paper 2 0.68960375 237 nips-2011-Reinforcement Learning using Kernel-Based Stochastic Factorization

Author: Andre S. Barreto, Doina Precup, Joelle Pineau

Abstract: Kernel-based reinforcement-learning (KBRL) is a method for learning a decision policy from a set of sample transitions which stands out for its strong theoretical guarantees. However, the size of the approximator grows with the number of transitions, which makes the approach impractical for large problems. In this paper we introduce a novel algorithm to improve the scalability of KBRL. We resort to a special decomposition of a transition matrix, called stochastic factorization, to fix the size of the approximator while at the same time incorporating all the information contained in the data. The resulting algorithm, kernel-based stochastic factorization (KBSF), is much faster but still converges to a unique solution. We derive a theoretical upper bound for the distance between the value functions computed by KBRL and KBSF. The effectiveness of our method is illustrated with computational experiments on four reinforcement-learning problems, including a difficult task in which the goal is to learn a neurostimulation policy to suppress the occurrence of seizures in epileptic rat brains. We empirically demonstrate that the proposed approach is able to compress the information contained in KBRL’s model. Also, on the tasks studied, KBSF outperforms two of the most prominent reinforcement-learning algorithms, namely least-squares policy iteration and fitted Q-iteration. 1

3 0.5934177 301 nips-2011-Variational Gaussian Process Dynamical Systems

Author: Neil D. Lawrence, Michalis K. Titsias, Andreas Damianou

Abstract: High dimensional time series are endemic in applications of machine learning such as robotics (sensor data), computational biology (gene expression data), vision (video sequences) and graphics (motion capture data). Practical nonlinear probabilistic approaches to this data are required. In this paper we introduce the variational Gaussian process dynamical system. Our work builds on recent variational approximations for Gaussian process latent variable models to allow for nonlinear dimensionality reduction simultaneously with learning a dynamical prior in the latent space. The approach also allows for the appropriate dimensionality of the latent space to be automatically determined. We demonstrate the model on a human motion capture data set and a series of high resolution video sequences. 1

4 0.55271626 185 nips-2011-Newtron: an Efficient Bandit algorithm for Online Multiclass Prediction

Author: Elad Hazan, Satyen Kale

Abstract: We present an efficient algorithm for the problem of online multiclass prediction with bandit feedback in the fully adversarial setting. We measure its regret with respect to the log-loss defined in [AR09], which is parameterized by a scalar α. We prove that the regret of NEWTRON is O(log T) when α is a constant that does not vary with the horizon T, and at most O(T^{2/3}) if α is allowed to increase to infinity with T. For α = O(log T), the regret is bounded by O(√T), thus solving the open problem of [KSST08, AR09]. Our algorithm is based on a novel application of the online Newton method [HAK07]. We test our algorithm and show it to perform well in experiments, even when α is a small constant. 1

5 0.52494109 178 nips-2011-Multiclass Boosting: Theory and Algorithms

Author: Mohammad J. Saberian, Nuno Vasconcelos

Abstract: The problem of multi-class boosting is considered. A new framework, based on multi-dimensional codewords and predictors is introduced. The optimal set of codewords is derived, and a margin enforcing loss proposed. The resulting risk is minimized by gradient descent on a multidimensional functional space. Two algorithms are proposed: 1) CD-MCBoost, based on coordinate descent, updates one predictor component at a time, 2) GD-MCBoost, based on gradient descent, updates all components jointly. The algorithms differ in the weak learners that they support but are both shown to be 1) Bayes consistent, 2) margin enforcing, and 3) convergent to the global minimum of the risk. They also reduce to AdaBoost when there are only two classes. Experiments show that both methods outperform previous multiclass boosting approaches on a number of datasets. 1

6 0.44514385 66 nips-2011-Crowdclustering

7 0.43949106 278 nips-2011-TD gamma: Re-evaluating Complex Backups in Temporal Difference Learning

8 0.43922019 37 nips-2011-Analytical Results for the Error in Filtering of Gaussian Processes

9 0.43908885 229 nips-2011-Query-Aware MCMC

10 0.43870106 186 nips-2011-Noise Thresholds for Spectral Clustering

11 0.43790534 57 nips-2011-Comparative Analysis of Viterbi Training and Maximum Likelihood Estimation for HMMs

12 0.43627107 300 nips-2011-Variance Reduction in Monte-Carlo Tree Search

13 0.43572176 241 nips-2011-Scalable Training of Mixture Models via Coresets

14 0.43488488 180 nips-2011-Multiple Instance Filtering

15 0.43473113 197 nips-2011-On Tracking The Partition Function

16 0.43383288 206 nips-2011-Optimal Reinforcement Learning for Gaussian Systems

17 0.4333894 102 nips-2011-Generalised Coupled Tensor Factorisation

18 0.43317917 16 nips-2011-A reinterpretation of the policy oscillation phenomenon in approximate policy iteration

19 0.43314674 92 nips-2011-Expressive Power and Approximation Errors of Restricted Boltzmann Machines

20 0.43212625 159 nips-2011-Learning with the weighted trace-norm under arbitrary sampling distributions