nips nips2007 nips2007-34 knowledge-graph by maker-knowledge-mining

34 nips-2007-Bayesian Policy Learning with Trans-Dimensional MCMC


Source: pdf

Author: Matthew Hoffman, Arnaud Doucet, Nando D. Freitas, Ajay Jasra

Abstract: A recently proposed formulation of the stochastic planning and control problem as one of parameter estimation for suitable artificial statistical models has led to the adoption of inference algorithms for this notoriously hard problem. At the algorithmic level, the focus has been on developing Expectation-Maximization (EM) algorithms. In this paper, we begin by making the crucial observation that the stochastic control problem can be reinterpreted as one of trans-dimensional inference. With this new interpretation, we are able to propose a novel reversible jump Markov chain Monte Carlo (MCMC) algorithm that is more efficient than its EM counterparts. Moreover, it enables us to implement full Bayesian policy search, without the need for gradients and with one single Markov chain. The new approach involves sampling directly from a distribution that is proportional to the reward and, consequently, performs better than classic simulation methods in situations where the reward is a rare event.

Reference: text


Summary: the most important sentences generated by tfidf model

sentIndex sentText sentNum sentScore

1 A recently proposed formulation of the stochastic planning and control problem as one of parameter estimation for suitable artificial statistical models has led to the adoption of inference algorithms for this notoriously hard problem. [sent-14, score-0.257]

2 In this paper, we begin by making the crucial observation that the stochastic control problem can be reinterpreted as one of trans-dimensional inference. [sent-16, score-0.138]

3 With this new interpretation, we are able to propose a novel reversible jump Markov chain Monte Carlo (MCMC) algorithm that is more efficient than its EM counterparts. [sent-17, score-0.592]

4 Moreover, it enables us to implement full Bayesian policy search, without the need for gradients and with one single Markov chain. [sent-18, score-0.379]

5 The new approach involves sampling directly from a distribution that is proportional to the reward and, consequently, performs better than classic simulation methods in situations where the reward is a rare event. [sent-19, score-0.555]

6 A large number of methods have been proposed in the literature relying on value function approximation and policy search; including [3, 10, 14, 16, 18]. [sent-22, score-0.388]

7 In this paper, we follow the policy learning approach because of its promise and remarkable success in complex domains; see for example [13, 15]. [sent-23, score-0.355]

8 Our work is strongly motivated by a recent formulation of stochastic planning and control problems as inference problems. [sent-24, score-0.193]

9 This line of work appears to have been initiated in [5], where the authors used EM as an alternative to standard stochastic gradient algorithms to maximize an expected cost. [sent-25, score-0.066]

10 In [2], a planning problem under uncertainty was solved using a Viterbi algorithm. [sent-26, score-0.083]

11 In these works, the standard discounted reward control problem was expressed in terms of an infinite mixture of MDPs. [sent-30, score-0.272]

12 To make the problem tractable, the authors proposed to truncate the infinite horizon time. [sent-31, score-0.138]

13 Here, we make the observation that, in this probabilistic interpretation of stochastic control, the objective function can be written as the expectation of a positive function with respect to a transdimensional probability distribution, i. [sent-32, score-0.092]

14 In this paper, we propose a full Bayesian policy search alternative to the EM algorithm. [sent-38, score-0.443]

15 In this approach, we set a prior distribution on the set of policy parameters and derive an artificial posterior distribution which is proportional to the prior times the expected reward. [sent-39, score-0.439]

16 We sample from the resulting artificial posterior distribution using a single transdimensional MCMC algorithm, which only involves a simple modification of the MCMC algorithm developed to implement the EM. [sent-42, score-0.125]

17 Although the Bayesian policy search approach can benefit from gradient information, it does not require gradients. [sent-43, score-0.433]

18 Moreover, since the target is proportional to the expected reward, the simulation is guided to areas of high reward automatically. [sent-44, score-0.257]

19 In the fixed policy case, the value function is often computed using importance sampling. [sent-45, score-0.4]

20 In this context, our algorithm could be reinterpreted as an MCMC algorithm sampling from the optimal importance distribution. [sent-46, score-0.149]

21 2 Model formulation. We consider the following class of discrete-time Markov decision processes (MDPs): $X_1 \sim \mu(\cdot)$, $X_n \mid (X_{n-1} = x, A_{n-1} = a) \sim f_a(\cdot \mid x)$, $R_n \mid (X_n = x, A_n = a) \sim g_a(\cdot \mid x)$, $A_n \mid (X_n = x, \theta) \sim \pi_\theta(\cdot \mid x)$, (1) where n = 1, 2, . [sent-47, score-0.111]

22 is a discrete-time index, $\mu(\cdot)$ is the initial state distribution, $\{X_n\}$ is the $\mathcal{X}$-valued state process, $\{A_n\}$ is the $\mathcal{A}$-valued action process, $\{R_n\}$ is a positive real-valued reward process, $f_a$ denotes the transition density, $g_a$ the reward density and $\pi_\theta$ is a randomized policy. [sent-50, score-0.639]

23 If we have a deterministic policy then $\pi_\theta(a \mid x) = \delta_{\varphi_\theta(x)}(a)$. [sent-51, score-0.389]

24 In this case, the transition model $f_a(\cdot \mid x)$ assumes the parametrization $f_\theta(\cdot \mid x)$. [sent-52, score-0.089]

25 The reward model could also be parameterized as $g_\theta(\cdot \mid x)$. [sent-53, score-0.23]

26 We are here interested in maximizing, with respect to the policy parameters θ, the expected future reward $V_\mu^\pi(\theta) = \mathbb{E}\left[\sum_{n=1}^{\infty} \gamma^{n-1} R_n\right]$, where $0 < \gamma < 1$ is a discount factor and the expectation is with respect to the probabilistic model defined in (1). [sent-55, score-0.556]

27 As shown in [20], it is possible to re-write this objective of optimizing an infinite horizon discounted reward MDP (where the reward happens at each step) as one of optimizing an infinite mixture of finite horizon MDPs (where the reward only happens at the last time step). [sent-56, score-0.834]

28 Specifically, we have $V_\mu^\pi(\theta) = (1-\gamma)^{-1}\, \mathbb{E}_{p_\theta}[R_K] = (1-\gamma)^{-1} \sum_{k=1}^{\infty} \int r_k\, p_\theta(k, x_{1:k}, a_{1:k}, r_k)\, dx_{1:k}\, da_{1:k}\, dr_k$ (3) for a randomized policy. [sent-58, score-0.566]
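
The identity in (3) can be checked numerically on a fixed reward sequence. The sketch below is illustrative only: the reward values and the discount factor are made-up assumptions, not taken from the paper, and both sides are truncated at the same finite horizon.

```python
def discounted_return(rewards, gamma):
    """Left-hand side: sum over n >= 1 of gamma^(n-1) * r_n (finite list)."""
    return sum(gamma ** n * r for n, r in enumerate(rewards))


def mixture_form(rewards, gamma):
    """Right-hand side: (1 - gamma)^(-1) * E[R_K], where the horizon K has
    the geometric weight P(K = k) = (1 - gamma) * gamma^(k-1)."""
    weights = [(1.0 - gamma) * gamma ** k for k in range(len(rewards))]
    expected_terminal_reward = sum(w * r for w, r in zip(weights, rewards))
    return expected_terminal_reward / (1.0 - gamma)


# Hypothetical reward sequence and discount factor, for illustration only.
example_rewards = [0.0, 0.5, 1.0, 0.25, 2.0]
example_gamma = 0.9
lhs = discounted_return(example_rewards, example_gamma)
rhs = mixture_form(example_rewards, example_gamma)
```

Both quantities agree up to floating-point rounding, which is all the mixture representation claims: a reward at every step with discounting is the same as a single terminal reward at a geometrically distributed horizon.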

29 Similarly, for a deterministic policy, the representation (3) also holds for the trans-dimensional probability distribution defined on $\{k\} \times \mathcal{X}^k \times \mathbb{R}^+$ given by $p_\theta(k, x_{1:k}, r_k) = (1-\gamma)\, \gamma^{k-1}\, \mu(x_1)\, g_\theta(r_k \mid x_k) \prod_{n=2}^{k} f_\theta(x_n \mid x_{n-1})$. [sent-59, score-0.852]

30 pθ (K, X1:K , A1:K , RK ))] , e rk pθ (k, x1:k , a1:k , rk ) . [sent-62, score-0.53]

31 The standard Monte Carlo EM approach consists of sampling from the reward-weighted distribution $\tilde p_\theta(k, x_{1:k}, a_{1:k}, r_k) \propto r_k\, p_\theta(k, x_{1:k}, a_{1:k}, r_k)$ using MCMC to obtain a Monte Carlo estimate of the Q function. [sent-65, score-0.29]

32 As the reward-weighted distribution $\tilde p_\theta(k, x_{1:k}, a_{1:k}, r_k)$ is proportional to the reward, the samples will consequently be drawn in regions of high reward. [sent-66, score-0.348]

33 This is a particularly interesting feature in situations where the reward function is concentrated in a region of low probability mass under $p_\theta(k, x_{1:k}, r_k)$, which is often the case in high-dimensional control settings. [sent-67, score-0.537]

34 Note that if we wanted to estimate $V_\mu^\pi(\theta)$ using importance sampling, then the distribution $\tilde p_\theta(k, x_{1:k}, a_{1:k}, r_k)$ corresponds to the optimal zero-variance importance distribution. [sent-68, score-0.445]

35 Alternatively, instead of sampling from $\tilde p_\theta(k, x_{1:k}, a_{1:k}, r_k)$ using MCMC, we could proceed as in [20] and derive forward-backward algorithms for the E-step, which can be implemented here using Sequential Monte Carlo (SMC) techniques. [sent-69, score-0.579]
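
The zero-variance property can be seen on a small discrete example. The sketch below is a toy illustration under stated assumptions: the four outcomes, their probabilities `p` and their rewards `r` are hypothetical, and the proposal `q` is built as the reward-weighted distribution $q(i) = r(i)\, p(i) / \mathbb{E}_p[r]$. Every importance-weighted sample $r(i)\, p(i) / q(i)$ then equals the quantity being estimated.

```python
import random

# Hypothetical discrete toy problem: the reward is large only on a rare outcome.
p = [0.70, 0.25, 0.049, 0.001]   # base probabilities (assumed)
r = [0.001, 0.01, 1.0, 500.0]    # positive rewards (assumed)

true_value = sum(pi * ri for pi, ri in zip(p, r))

# Reward-weighted proposal q(i) = r(i) * p(i) / E_p[r].
q = [pi * ri / true_value for pi, ri in zip(p, r)]


def importance_estimates(num_samples, seed=0):
    """Draw outcomes from q and return the per-sample estimates r * p / q."""
    rng = random.Random(seed)
    draws = rng.choices(range(len(p)), weights=q, k=num_samples)
    return [r[i] * p[i] / q[i] for i in draws]


estimates = importance_estimates(1000)
```

Each entry of `estimates` equals `true_value` up to rounding, so the estimator has zero variance. In practice `q` cannot be normalized without already knowing the answer, which is why the text resorts to MCMC to sample from it.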

36 Finally, we remark that for a deterministic policy, we can introduce the trans-dimensional distribution $\tilde p_\theta(k, x_{1:k}, r_k) = \dfrac{r_k\, p_\theta(k, x_{1:k}, r_k)}{\mathbb{E}_{p_\theta}[R_K]}$. [sent-73, score-0.829]

37 In addition, and for ease of presentation only, we focus the discussion on deterministic policies and reward functions $g_\theta(r_n \mid x_n) = \delta_{r(x_n)}(r_n)$; the extension of our algorithms to the randomized case is straightforward. [sent-74, score-0.404]

38 3 Bayesian policy exploration. The EM algorithm is particularly sensitive to initialization and might get trapped in a severe local maximum of $V_\mu^\pi(\theta)$. [sent-75, score-0.378]

39 Moreover, in the general state-space setting that we are considering, the particle smoothers in the E-step can be very expensive computationally. [sent-76, score-0.058]

40 The idea consists of introducing a vague prior distribution p (θ) on the parameters of the policy θ. [sent-79, score-0.426]

41 By construction, this target distribution admits the following marginal in θ: $\bar p(\theta) \propto V_\mu^\pi(\theta)\, p(\theta)$, and we can select an improper prior distribution $p(\theta) \propto 1$ if $\int_\Theta V_\mu^\pi(\theta)\, d\theta < \infty$. [sent-81, score-0.052]

42 If we could sample from the marginal $\bar p(\theta)$, then the generated samples $\theta^{(i)}$ would concentrate in regions where $V_\mu^\pi(\theta)$ is large. [sent-82, score-0.078]

43 We cannot sample from $\bar p(\theta)$ directly, but we can develop a trans-dimensional MCMC algorithm which will asymptotically generate samples from $\bar p(\theta, k, x_{1:k})$, and hence samples from $\bar p(\theta)$. [sent-83, score-0.157]

44 Assume the current state of the Markov chain targeting $\bar p(\theta, k, x_{1:k})$ is $(\theta, k, x_{1:k})$. [sent-85, score-0.167]
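
To illustrate why samples from a target proportional to the value times the prior concentrate where the value is large, here is a minimal sketch. It is not the trans-dimensional sampler of the paper: it assumes a hypothetical closed-form value function `toy_value` peaked at π/4 and a flat prior on the circle, and it uses plain random-walk Metropolis on θ alone.

```python
import math
import random


def toy_value(theta):
    """Hypothetical stand-in for V(theta): positive and peaked at pi/4."""
    return math.exp(10.0 * math.cos(theta - math.pi / 4.0))


def sample_policy_posterior(num_iters, step=0.5, seed=0):
    """Random-walk Metropolis on [0, 2*pi) targeting toy_value(theta) * 1."""
    rng = random.Random(seed)
    theta = 0.0
    samples = []
    for _ in range(num_iters):
        # Wrapped Gaussian step: a symmetric proposal on the circle.
        proposal = (theta + rng.gauss(0.0, step)) % (2.0 * math.pi)
        # Symmetric proposal and flat prior: the ratio is V(new) / V(old).
        if rng.random() < toy_value(proposal) / toy_value(theta):
            theta = proposal
        samples.append(theta)
    return samples


samples = sample_policy_posterior(20000)
kept = samples[2000:]  # discard an arbitrary burn-in period
# Circular mean, since theta lives on a circle.
circular_mean = math.atan2(
    sum(math.sin(t) for t in kept), sum(math.cos(t) for t in kept)
)
```

In the setting of the text the value function is not available in closed form, which is why the chain is run on the joint space of θ, the horizon and the states instead.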

45 We propose first to update the components $(k, x_{1:k})$ conditional upon θ via a combination of birth, death and update moves, using the reversible jump MCMC algorithm [7, 8, 17]. [sent-86, score-0.849]

46 Then we propose to update θ conditional upon the current value of $(k, x_{1:k})$. [sent-87, score-0.083]

47 The details of the reversible jump algorithm are presented in the following section. [sent-91, score-0.459]

48 • If ($u \le b_k$) – then carry out a “birth” move: increase the horizon length of the MDP, say $k^{(i)} = k^{(i-1)} + 1$, and insert a new state. [sent-96, score-0.291]

49 – else if ($u \le b_k + d_k$) then carry out a “death” move: decrease the horizon length of the MDP, say $k^{(i)} = k^{(i-1)} - 1$, and remove an existing state. [sent-97, score-0.459]

50 – else let $k^{(i)} = k^{(i-1)}$ and generate samples $x^{(i)}_{1:k^{(i)}}$ of the MDP states. [sent-98, score-0.051]

51 • Sample the policy parameters $\theta^{(i)}$ conditional on the samples $(x^{(i)}_{1:k^{(i)}}, k^{(i)})$. [sent-100, score-0.406]

52 Figure 1: Generic reversible jump MCMC for Bayesian policy learning. [sent-101, score-0.814]
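
The move-selection logic of Figure 1 can be sketched as follows. This is only a skeleton under assumptions: the constant probabilities 0.3 for birth and death and the rule that no death move is proposed at horizon 1 are illustrative choices, and the function names are hypothetical.

```python
import random


def move_probabilities(k, birth=0.3, death=0.3):
    """Hypothetical (b_k, d_k): a death move is impossible when k == 1."""
    return (birth, 0.0) if k == 1 else (birth, death)


def choose_move(u, b_k, d_k):
    """Map a uniform draw u in [0, 1) to a move type, as in Figure 1."""
    if u <= b_k:
        return "birth"
    if u <= b_k + d_k:
        return "death"
    return "update"


def sweep_move_types(k_values, seed=0):
    """Pick one move type per horizon value in k_values."""
    rng = random.Random(seed)
    return [choose_move(rng.random(), *move_probabilities(k)) for k in k_values]
```

A full sampler would follow each selected move with the corresponding accept/reject step and then update θ conditional on the current horizon and states.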

53 We note that for a given θ the samples of the states and horizon generated by this Markov chain will also be distributed (asymptotically) according to the trans-dimensional distribution $\tilde p_\theta(k, x_{1:k})$. [sent-102, score-0.277]

54 The trans-dimensional simulation approach has the advantage that the samples will concentrate automatically in regions where $\tilde p_\theta(k)$ has high probability mass. [sent-105, score-0.102]

55 4 Trans-Dimensional Markov chain Monte Carlo We present a simple reversible jump method composed of two reversible moves (birth and death) and several update moves. [sent-107, score-0.931]

56 Assume the current state of the Markov chain targeting $\tilde p_\theta(k, x_{1:k})$ is $(k, x_{1:k})$. [sent-108, score-0.167]

57 With probability $b_k$, we propose a birth move; that is, we sample a location uniformly in the interval {1, . [sent-109, score-0.451]

58 , k + 1}, and propose the candidate $(k + 1, x_{1:j-1}, x^*, x_{j:k})$ where $X^* \sim q_\theta(\cdot \mid x_{j-1:j})$. [sent-117, score-0.069]

59 This candidate is accepted with probability $A_{\text{birth}} = \min\{1, \alpha_{\text{birth}}\}$ where we have for j ∈ {2, . [sent-118, score-0.085]

60 $r(x_k)\, b_k\, q_\theta(x^* \mid x_k)$ With probability $d_k$, we propose a death move; that is, J ∼ U{1, . [sent-123, score-0.792]

61 , k} and we propose the candidate $(k - 1, x_{1:j-1}, x_{j+1:k})$ which is accepted with probability $A_{\text{death}} = \min\{1, \alpha_{\text{death}}\}$ where for j ∈ {2, . [sent-126, score-0.121]

62 $\gamma\, r(x_k)\, f_\theta(x_k \mid x_{k-1})\, d_k$ The $\alpha_{\text{birth}}$ and $\alpha_{\text{death}}$ terms derived above can be thought of as ratios between the distribution over the newly proposed state of the chain (i. [sent-130, score-0.489]

63 These terms must also ensure reversibility and the dimension-matching requirement for reversible jump MCMC. [sent-133, score-0.459]
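
The reversibility requirement can be checked on a toy problem. The sketch below does not reproduce the paper's acceptance ratios: it rederives birth and death ratios from the general Metropolis-Hastings rule for a hypothetical three-state chain, with a state-independent proposal `Q` (a simplification of $q_\theta(\cdot \mid x_{j-1:j})$) and constant move probabilities `B` and `D`. All numerical values are assumptions.

```python
GAMMA = 0.9
MU = [0.5, 0.3, 0.2]          # initial distribution mu(x), assumed
F = [[0.6, 0.3, 0.1],         # F[x][y] = f(y | x), assumed
     [0.2, 0.5, 0.3],
     [0.1, 0.2, 0.7]]
R = [0.1, 1.0, 5.0]           # positive reward r(x), assumed
Q = [1.0 / 3.0] * 3           # state-independent proposal q(x*), assumed
B, D = 0.3, 0.3               # birth and death probabilities, assumed


def target(x):
    """Unnormalized reward-weighted density of a path x (tuple of states)."""
    k = len(x)
    density = (1.0 - GAMMA) * GAMMA ** (k - 1) * MU[x[0]] * R[x[-1]]
    for prev, nxt in zip(x, x[1:]):
        density *= F[prev][nxt]
    return density


def alpha_birth(x, j, x_star):
    """Insert x_star at index j; the uniform location terms cancel."""
    new = x[:j] + (x_star,) + x[j:]
    return new, target(new) * D / (target(x) * B * Q[x_star])


def alpha_death(x, j):
    """Remove the state at index j; the reverse move is the matching birth."""
    if len(x) < 2:
        raise ValueError("a death move needs a horizon of at least 2")
    new = x[:j] + x[j + 1:]
    return new, target(new) * B * Q[x[j]] / (target(x) * D)


path = (0, 2, 1)
longer, a_b = alpha_birth(path, 1, 2)
shorter, a_d = alpha_death(longer, 1)
```

For a matched pair of moves the two ratios are reciprocals, and the probability flow between the two states balances once the acceptance probabilities min{1, α} are applied, which is the detailed-balance condition behind reversibility.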

64 Finally, with probability $u_k = 1 - b_k - d_k$, we propose a standard (fixed-dimensional) move where we update all or a subset of the components $x_{1:k}$ using, say, Metropolis-Hastings or Gibbs moves. [sent-135, score-0.483]

65 If one adopts a simple one-at-a-time Metropolis-Hastings scheme with proposals $q_\theta(x^* \mid x_{j-1:j+1})$ to update the j-th term, then the candidate is accepted with probability $A_{\text{update}} = \min\{1, \alpha_{\text{update}}\}$ where for j ∈ {2, . [sent-138, score-0.132]

66 $r(x_k)\, f_\theta(x_k \mid x_{k-1})\, q_\theta(x^* \mid x_{k-1:k})$ Under weak assumptions on the model, the Markov chain $\{K^{(i)}, X^{(i)}_{1:K}\}$ generated by this transition kernel will be irreducible and aperiodic and hence will asymptotically generate samples from the target distribution $\tilde p_\theta(k, x_{1:k})$. [sent-142, score-0.416]

67 Hence the probability of having the reversible moves accepted will be reasonable. [sent-144, score-0.38]

68 Standard Bayesian applications of reversible jump MCMC usually do not enjoy this property and it makes it more difficult to design fast mixing algorithms. [sent-145, score-0.496]

69 Ten sample points are shown distributed according to µ, the initial distribution, and the contour plot corresponds to the reward function r. [sent-164, score-0.229]

70 The red line denotes the policy parameterized by some angle θ, while a path is drawn in blue sampled from this policy. [sent-165, score-0.384]

71 We will consider state- and action-spaces $\mathcal{X} = \mathcal{A} = \mathbb{R}^2$ such that each state $x \in \mathcal{X}$ is a 2D position and each action $a \in \mathcal{A}$ is a vector corresponding to a change in position. [sent-167, score-0.052]

72 Finally we will let µ be a normal distribution about the origin, and consider a reward (as in [20]) given by an unnormalized Gaussian about some point m, i. [sent-169, score-0.227]

73 For these experiments we chose a simple, stochastic policy parameterized by θ ∈ [0, 2π]. [sent-173, score-0.424]

74 Intuitively, this policy corresponds to choosing a direction θ in which the agent will walk. [sent-175, score-0.355]

75 For a state-space with initial distribution and reward function defined as in Figure 2 the optimal policy corresponds to θ = π/4. [sent-177, score-0.609]
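
A noise-free caricature of this experiment shows why π/4 is optimal when the reward peak lies on the diagonal. Everything in the sketch is an assumption made for illustration: the reward centre (1, 1), the step length, the reward width, the discount factor and the truncation horizon are hypothetical, the walk starts exactly at the origin, and the action noise of the actual stochastic policy is omitted.

```python
import math


def discounted_value(theta, target_point=(1.0, 1.0), step=0.2,
                     sigma=0.5, gamma=0.95, horizon=200):
    """Truncated discounted reward of a noise-free walk in direction theta."""
    value = 0.0
    for n in range(horizon):
        x = n * step * math.cos(theta)
        y = n * step * math.sin(theta)
        sq_dist = (x - target_point[0]) ** 2 + (y - target_point[1]) ** 2
        # Unnormalized Gaussian reward about target_point.
        reward = math.exp(-sq_dist / (2.0 * sigma ** 2))
        value += gamma ** n * reward
    return value


angles = [i * math.pi / 16.0 for i in range(32)]  # grid over [0, 2*pi)
best_angle = max(angles, key=discounted_value)
```

On this grid the maximizer is the angle that points straight at the reward peak; with noise in the actions the value surface would be smoother but peaked at the same direction under these assumptions.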

76 We first implemented a simple SMC-based extension of the EM algorithm described in [20], wherein a particle filter was used for the forwards/backwards filters. [sent-178, score-0.058]

77 The first thing of note is the terrible performance of the SMC-based algorithm; in fact, we had to make the reward broader and closer to the initial position in order to ensure that the algorithm converges in a reasonable amount of time. [sent-181, score-0.201]

78 This comes as no surprise considering the $O(N^2 k_{\max}^2)$ time complexity necessary for computing the importance weights. [sent-182, score-0.136]

79 While there do exist methods [9] for reducing this complexity to $O(N \log N\, k_{\max}^2)$, the discrepancy between this and the reversible jump MCMC method suggests that the MCMC approach may be better suited to this class of problems. [sent-183, score-0.55]

80 In the finite/discrete case it is also possible, as shown by Toussaint et al. (2006), to reduce the $k_{\max}^2$ term to $k_{\max}$ by calculating updates only using messages from the backwards recursion. [sent-184, score-0.212]

81 The SMC method might further be improved by better choices for the artificial distribution ηn (xn ) in the backwards filter. [sent-185, score-0.056]

82 Also shown in Figure 3 is the performance of a Monte Carlo EM algorithm using reversible jump MCMC in the E-step. [sent-188, score-0.459]

84 Figure 3: The left figure shows estimates for the policy parameter θ as a function of the CPU time (in seconds) used to calculate that value. [sent-204, score-0.413]

85 In both plots a red line denotes the known optimal policy parameter of π/4. [sent-208, score-0.405]

86 Finally, we also compared the proposed Bayesian policy exploration method to the PEGASUS [14] approach using a local search method. [sent-211, score-0.407]

87 6 Discussion We believe that formulating stochastic control as a trans-dimensional inference problem is fruitful. [sent-215, score-0.111]

88 This formulation relies on minimal assumptions and allows us to apply modern inference algorithms to solve control problems. [sent-216, score-0.095]

89 We have focused here on Monte Carlo methods and have presented, to the best of our knowledge, the first application of reversible jump MCMC to policy search. [sent-217, score-0.814]

90 Our results, on an illustrative example, showed that this trans-dimensional MCMC algorithm is more effective than standard policy search methods and alternative Monte Carlo methods relying on particle filters. [sent-218, score-0.498]

91 For such scenarios, we expect that it will be necessary to develop more efficient MCMC strategies to explore the policy space efficiently. [sent-220, score-0.355]

92 [Figure 4 plot: “Evolution of policy parameters against transition-model samples”; y-axis: policy parameter (theta); x-axis: number of samples taken from transition-model.] [sent-250, score-0.761]

93 Figure 4: Convergence of PEGASUS and our Bayesian policy search algorithm when started from θ = 0 and converging to the optimum of θ∗ = π/4. [sent-257, score-0.458]

94 For our algorithm we plot samples taken directly from the MCMC algorithm itself: plotting the empirical average would produce an estimate whose convergence is almost immediate, but we also wanted to show the “burn-in” period. [sent-259, score-0.116]

95 For both algorithms lines denoting one standard deviation are shown and performance is plotted against the number of samples taken from the transition model. [sent-260, score-0.088]

96 On solving integral equations using Markov chain Monte Carlo methods. [sent-265, score-0.097]

97 Reversible jump Markov chain Monte Carlo computation and Bayesian model determination. [sent-270, score-0.261]

98 Fast particle smoothing: If I had a million particles. [sent-283, score-0.058]

99 Optimal Bayesian design by inhomogeneous Markov chain simulation. [sent-299, score-0.134]

100 PEGASUS: A policy search method for large MDPs and POMDPs. [sent-320, score-0.407]


similar papers computed by tfidf model

tfidf for this paper:

wordName wordTfidf (topN-words)

[('policy', 0.355), ('reversible', 0.295), ('rk', 0.265), ('xj', 0.245), ('mcmc', 0.242), ('birth', 0.227), ('death', 0.227), ('reward', 0.201), ('bk', 0.188), ('xk', 0.173), ('monte', 0.173), ('dk', 0.168), ('jump', 0.164), ('carlo', 0.164), ('em', 0.147), ('horizon', 0.103), ('chain', 0.097), ('kmax', 0.091), ('pegasus', 0.091), ('xn', 0.089), ('bayesian', 0.085), ('doucet', 0.077), ('mdps', 0.072), ('markov', 0.069), ('toussaint', 0.068), ('cpu', 0.058), ('particle', 0.058), ('planning', 0.058), ('reinterpreted', 0.052), ('transdimensional', 0.052), ('freitas', 0.052), ('fa', 0.052), ('search', 0.052), ('mdp', 0.052), ('accepted', 0.052), ('samples', 0.051), ('update', 0.047), ('ep', 0.046), ('control', 0.046), ('importance', 0.045), ('radians', 0.045), ('targeting', 0.045), ('vague', 0.045), ('rare', 0.045), ('british', 0.045), ('rn', 0.044), ('move', 0.044), ('columbia', 0.043), ('smc', 0.041), ('arnaud', 0.041), ('notoriously', 0.041), ('smoothing', 0.041), ('stochastic', 0.04), ('arti', 0.037), ('design', 0.037), ('transition', 0.037), ('nando', 0.037), ('wanted', 0.037), ('cial', 0.036), ('propose', 0.036), ('randomized', 0.036), ('ga', 0.035), ('truncate', 0.035), ('deterministic', 0.034), ('relying', 0.033), ('candidate', 0.033), ('illustration', 0.033), ('moves', 0.033), ('asymptotically', 0.032), ('proportional', 0.032), ('backwards', 0.03), ('robotics', 0.029), ('parameterized', 0.029), ('plot', 0.028), ('valued', 0.028), ('optimal', 0.027), ('ller', 0.027), ('action', 0.027), ('concentrate', 0.027), ('distribution', 0.026), ('gradient', 0.026), ('inference', 0.025), ('state', 0.025), ('situations', 0.025), ('discounted', 0.025), ('proceeds', 0.025), ('uncertainty', 0.025), ('gure', 0.025), ('sampling', 0.025), ('simulation', 0.024), ('implement', 0.024), ('formulation', 0.024), ('developed', 0.023), ('plots', 0.023), ('initialization', 0.023), ('vlassis', 0.023), ('adoption', 0.023), ('adverse', 0.023), ('andrieu', 0.023), ('briers', 0.023)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 1.0000007 34 nips-2007-Bayesian Policy Learning with Trans-Dimensional MCMC

Author: Matthew Hoffman, Arnaud Doucet, Nando D. Freitas, Ajay Jasra

Abstract: A recently proposed formulation of the stochastic planning and control problem as one of parameter estimation for suitable artificial statistical models has led to the adoption of inference algorithms for this notoriously hard problem. At the algorithmic level, the focus has been on developing Expectation-Maximization (EM) algorithms. In this paper, we begin by making the crucial observation that the stochastic control problem can be reinterpreted as one of trans-dimensional inference. With this new interpretation, we are able to propose a novel reversible jump Markov chain Monte Carlo (MCMC) algorithm that is more efficient than its EM counterparts. Moreover, it enables us to implement full Bayesian policy search, without the need for gradients and with one single Markov chain. The new approach involves sampling directly from a distribution that is proportional to the reward and, consequently, performs better than classic simulation methods in situations where the reward is a rare event.

2 0.25111309 5 nips-2007-A Game-Theoretic Approach to Apprenticeship Learning

Author: Umar Syed, Robert E. Schapire

Abstract: We study the problem of an apprentice learning to behave in an environment with an unknown reward function by observing the behavior of an expert. We follow on the work of Abbeel and Ng [1] who considered a framework in which the true reward function is assumed to be a linear combination of a set of known and observable features. We give a new algorithm that, like theirs, is guaranteed to learn a policy that is nearly as good as the expert’s, given enough examples. However, unlike their algorithm, we show that ours may produce a policy that is substantially better than the expert’s. Moreover, our algorithm is computationally faster, is easier to implement, and can be applied even in the absence of an expert. The method is based on a game-theoretic view of the problem, which leads naturally to a direct application of the multiplicative-weights algorithm of Freund and Schapire [2] for playing repeated matrix games. In addition to our formal presentation and analysis of the new algorithm, we sketch how the method can be applied when the transition function itself is unknown, and we provide an experimental demonstration of the algorithm on a toy video-game environment. 1

3 0.21089767 102 nips-2007-Incremental Natural Actor-Critic Algorithms

Author: Shalabh Bhatnagar, Mohammad Ghavamzadeh, Mark Lee, Richard S. Sutton

Abstract: We present four new reinforcement learning algorithms based on actor-critic and natural-gradient ideas, and provide their convergence proofs. Actor-critic reinforcement learning methods are online approximations to policy iteration in which the value-function parameters are estimated using temporal difference learning and the policy parameters are updated by stochastic gradient descent. Methods based on policy gradients in this way are of special interest because of their compatibility with function approximation methods, which are needed to handle large or infinite state spaces. The use of temporal difference learning in this way is of interest because in many applications it dramatically reduces the variance of the gradient estimates. The use of the natural gradient is of interest because it can produce better conditioned parameterizations and has been shown to further reduce variance in some cases. Our results extend prior two-timescale convergence results for actor-critic methods by Konda and Tsitsiklis by using temporal difference learning in the actor and by incorporating natural gradients, and they extend prior empirical studies of natural actor-critic methods by Peters, Vijayakumar and Schaal by providing the first convergence proofs and the first fully incremental algorithms. 1

4 0.1759776 162 nips-2007-Random Sampling of States in Dynamic Programming

Author: Chris Atkeson, Benjamin Stephens

Abstract: We combine three threads of research on approximate dynamic programming: sparse random sampling of states, value function and policy approximation using local models, and using local trajectory optimizers to globally optimize a policy and associated value function. Our focus is on finding steady state policies for deterministic time invariant discrete time control problems with continuous states and actions often found in robotics. In this paper we show that we can now solve problems we couldn’t solve previously. 1

5 0.17545953 91 nips-2007-Fitted Q-iteration in continuous action-space MDPs

Author: András Antos, Csaba Szepesvári, Rémi Munos

Abstract: We consider continuous state, continuous action batch reinforcement learning where the goal is to learn a good policy from a sufficiently rich trajectory generated by some policy. We study a variant of fitted Q-iteration, where the greedy action selection is replaced by searching for a policy in a restricted set of candidate policies by maximizing the average action values. We provide a rigorous analysis of this algorithm, proving what we believe is the first finite-time bound for value-function based algorithms for continuous state and action problems. 1 Preliminaries We will build on the results from [1, 2, 3] and for this reason we use the same notation as these papers. The unattributed results cited in this section can be found in the book [4]. A discounted MDP is defined by a quintuple (X , A, P, S, γ), where X is the (possible infinite) state space, A is the set of actions, P : X × A → M (X ) is the transition probability kernel with P (·|x, a) defining the next-state distribution upon taking action a from state x, S(·|x, a) gives the corresponding distribution of immediate rewards, and γ ∈ (0, 1) is the discount factor. Here X is a measurable space and M (X ) denotes the set of all probability measures over X . The Lebesguemeasure shall be denoted by λ. We start with the following mild assumption on the MDP: Assumption A1 (MDP Regularity) X is a compact subset of the dX -dimensional Euclidean space, ˆ A is a compact subset of [−A∞ , A∞ ]dA . The random immediate rewards are bounded by Rmax and that the expected immediate reward function, r(x, a) = rS(dr|x, a), is uniformly bounded by Rmax : r ∞ ≤ Rmax . A policy determines the next action given the past observations. Here we shall deal with stationary (Markovian) policies which choose an action in a stochastic way based on the last observation only. 
The value of a policy π when it is started from a state x is defined as the total expected discounted ∞ reward that is encountered while the policy is executed: V π (x) = Eπ [ t=0 γ t Rt |X0 = x]. Here Rt ∼ S(·|Xt , At ) is the reward received at time step t, the state, Xt , evolves according to Xt+1 ∼ ∗ Also with: Computer and Automation Research Inst. of the Hungarian Academy of Sciences Kende u. 13-17, Budapest 1111, Hungary. 1 P (·|Xt , At ), where At is sampled from the distribution determined by π. We use Qπ : X × A → R ∞ to denote the action-value function of policy π: Qπ (x, a) = Eπ [ t=0 γ t Rt |X0 = x, A0 = a]. The goal is to find a policy that attains the best possible values, V ∗ (x) = supπ V π (x), at all states ∗ x ∈ X . Here V ∗ is called the optimal value function and a policy π ∗ that satisfies V π (x) = ∗ ∗ ∗ V (x) for all x ∈ X is called optimal. The optimal action-value function Q (x, a) is Q (x, a) = supπ Qπ (x, a). We say that a (deterministic stationary) policy π is greedy w.r.t. an action-value function Q ∈ B(X × A), and we write π = π (·; Q), if, for all x ∈ X , π(x) ∈ argmaxa∈A Q(x, a). ˆ Under mild technical assumptions, such a greedy policy always exists. Any greedy policy w.r.t. Q∗ is optimal. For π : X → A we define its evaluation operator, T π : B(X × A) → B(X × A), by (T π Q)(x, a) = r(x, a) + γ X Q(y, π(y)) P (dy|x, a). It is known that Qπ = T π Qπ . Further, if we let the Bellman operator, T : B(X × A) → B(X × A), defined by (T Q)(x, a) = r(x, a) + γ X supb∈A Q(y, b) P (dy|x, a) then Q∗ = T Q∗ . It is known that V π and Qπ are bounded by Rmax /(1 − γ), just like Q∗ and V ∗ . For π : X → A, the operator E π : B(X × A) → B(X ) is defined by (E π Q)(x) = Q(x, π(x)), while E : B(X × A) → B(X ) is defined by (EQ)(x) = supa∈A Q(x, a). Throughout the paper F ⊂ {f : X × A → R} will denote a subset of real-valued functions over the state-action space X × A and Π ⊂ AX will be a set of policies. 
For ν ∈ M (X ) and f : X → R p measurable, we let (for p ≥ 1) f p,ν = X |f (x)|p ν(dx). We simply write f ν for f 2,ν . 2 Further, we extend · ν to F by f ν = A X |f |2 (x, a) dν(x) dλA (a), where λA is the uniform distribution over A. We shall use the shorthand notation νf to denote the integral f (x)ν(dx). We denote the space of bounded measurable functions with domain X by B(X ). Further, the space of measurable functions bounded by 0 < K < ∞ shall be denoted by B(X ; K). We let · ∞ denote the supremum norm. 2 Fitted Q-iteration with approximate policy maximization We assume that we are given a finite trajectory, {(Xt , At , Rt )}1≤t≤N , generated by some stochastic stationary policy πb , called the behavior policy: At ∼ πb (·|Xt ), Xt+1 ∼ P (·|Xt , At ), Rt ∼ def S(·|Xt , At ), where πb (·|x) is a density with π0 = inf (x,a)∈X ×A πb (a|x) > 0. The generic recipe for fitted Q-iteration (FQI) [5] is Qk+1 = Regress(Dk (Qk )), (1) where Regress is an appropriate regression procedure and Dk (Qk ) is a dataset defining a regression problem in the form of a list of data-point pairs: Dk (Qk ) = (Xt , At ), Rt + γ max Qk (Xt+1 , b) b∈A 1≤t≤N .1 Fitted Q-iteration can be viewed as approximate value iteration applied to action-value functions. To see this note that value iteration would assign the value (T Qk )(x, a) = r(x, a) + γ maxb∈A Qk (y, b) P (dy|x, a) to Qk+1 (x, a) [6]. Now, remember that the regression function for the jointly distributed random variables (Z, Y ) is defined by the conditional expectation of Y given Z: m(Z) = E [Y |Z]. Since for any fixed function Q, E [Rt + γ maxb∈A Q(Xt+1 , b)|Xt , At ] = (T Q)(Xt , At ), the regression function corresponding to the data Dk (Q) is indeed T Q and hence if FQI solved the regression problem defined by Qk exactly, it would simulate value iteration exactly. However, this argument itself does not directly lead to a rigorous analysis of FQI: Since Qk is obtained based on the data, it is itself a random function. 
Hence, after the first iteration, the “target” function in FQI becomes random. Furthermore, this function depends on the same data that is used to define the regression problem. Will FQI still work despite these issues? To illustrate the potential difficulties consider a dataset where X1 , . . . , XN is a sequence of independent random variables, which are all distributed uniformly at random in [0, 1]. Further, let M be a random integer greater than N which is independent of the dataset (Xt )N . Let U be another random variable, uniformly t=1 distributed in [0, 1]. Now define the regression problem by Yt = fM,U (Xt ), where fM,U (x) = sgn(sin(2M 2π(x + U ))). Then it is not hard to see that no matter how big N is, no procedure can 1 Since the designer controls Qk , we may assume that it is continuous, hence the maximum exists. 2 estimate the regression function fM,U with a small error (in expectation, or with high probability), even if the procedure could exploit the knowledge of the specific form of fM,U . On the other hand, if we restricted M to a finite range then the estimation problem could be solved successfully. The example shows that if the complexity of the random functions defining the regression problem is uncontrolled then successful estimation might be impossible. Amongst the many regression methods in this paper we have chosen to work with least-squares methods. In this case Equation (1) takes the form N Qk+1 = argmin Q∈F t=1 1 πb (At |Xt ) 2 Q(Xt , At ) − Rt + γ max Qk (Xt+1 , b) b∈A . (2) We call this method the least-squares fitted Q-iteration (LSFQI) method. Here we introduced the weighting 1/πb (At |Xt ) since we do not want to give more weight to those actions that are preferred by the behavior policy. Besides this weighting, the only parameter of the method is the function set F. This function set should be chosen carefully, to keep a balance between the representation power and the number of samples. 
As a specific example for $\mathcal{F}$ consider neural networks with some fixed architecture. In this case the function set is generated by assigning weights in all possible ways to the neural net. Then the above minimization becomes the problem of tuning the weights. Another example is to use linearly parameterized function approximation methods with appropriately selected basis functions. In this case the weight tuning problem would be less demanding. Yet another possibility is to let $\mathcal{F}$ be an appropriate restriction of a Reproducing Kernel Hilbert Space (e.g., in a ball). In this case the training procedure becomes similar to LS-SVM training [7]. As indicated above, the analysis of this algorithm is complicated by the fact that the new dataset is defined in terms of the previous iterate, which is already a function of the dataset. Another complication is that the samples in a trajectory are in general correlated and that the bias introduced by the imperfections of the approximation architecture may lead to an explosion of the error of the procedure, as documented in a number of cases in, e.g., [8]. Nevertheless, at least for finite action sets, the tools developed in [1, 3, 2] look suitable to show that under appropriate conditions these problems can be overcome if the function set is chosen in a judicious way. However, the results of these works would become essentially useless in the case of an infinite number of actions since these previous bounds grow to infinity with the number of actions. Actually, we believe that this is not an artifact of the proof techniques of these works, as suggested by the counterexample that involved random targets. The following result elaborates this point further:

Proposition 2.1. Let $\mathcal{F} \subset B(\mathcal{X} \times \mathcal{A})$. Then even if the pseudo-dimension of $\mathcal{F}$ is finite, the fat-shattering function of $\mathcal{F}^{\vee}_{\max} = \{ V_Q : V_Q(\cdot) = \max_{a \in \mathcal{A}} Q(\cdot, a), \; Q \in \mathcal{F} \}$ can be infinite over $(0, 1/2)$ (see footnote 2).
Without going into further details, let us just note that the finiteness of the fat-shattering function is a sufficient and necessary condition for learnability and the finiteness of the fat-shattering function is implied by the finiteness of the pseudo-dimension [9]. The above proposition thus shows that without imposing further special conditions on $\mathcal{F}$, the learning problem may become infeasible. One possibility is of course to discretize the action space, e.g., by using a uniform grid. However, if the action space has a really high dimensionality, this approach becomes infeasible (even enumerating $2^{d_A}$ points could be impossible when $d_A$ is large). Therefore we prefer alternative solutions. Another possibility is to make the functions in $\mathcal{F}$, e.g., uniformly Lipschitz in their state coordinates. Then the same property will hold for functions in $\mathcal{F}^{\vee}_{\max}$ and hence by a classical result we can bound the capacity of this set (cf. pp. 353–357 of [10]). One potential problem with this approach is that this way it might be difficult to get a fine control of the capacity of the resulting set.

Footnote 2: The proof of this and the other results are given in the appendix, available in the extended version of this paper, downloadable from http://hal.inria.fr/inria-00185311/en/.

In the approach explored here we modify the fitted Q-iteration algorithm by introducing a policy set $\Pi$ and a search over this set for an approximately greedy policy in a sense that will be made precise in a minute. Our algorithm thus has four parameters: $\mathcal{F}$, $\Pi$, $K$, $Q_0$. Here $\mathcal{F}$ is as before, $\Pi$ is a user-chosen set of policies (mappings from $\mathcal{X}$ to $\mathcal{A}$), $K$ is the number of iterations and $Q_0$ is an initial value function (a typical choice is $Q_0 \equiv 0$). The algorithm computes a sequence of iterates $(Q_k, \hat{\pi}_k)$, $k = 0, \ldots, K$, defined by the following equations:

$$\hat{\pi}_0 = \operatorname*{argmax}_{\pi \in \Pi} \sum_{t=1}^{N} Q_0(X_t, \pi(X_t)),$$

$$Q_{k+1} = \operatorname*{argmin}_{Q \in \mathcal{F}} \sum_{t=1}^{N} \frac{1}{\pi_b(A_t|X_t)} \Big( Q(X_t, A_t) - \big[ R_t + \gamma Q_k(X_{t+1}, \hat{\pi}_k(X_{t+1})) \big] \Big)^2, \qquad (3)$$

$$\hat{\pi}_{k+1} = \operatorname*{argmax}_{\pi \in \Pi} \sum_{t=1}^{N} Q_{k+1}(X_t, \pi(X_t)). \qquad (4)$$

Thus, (3) is similar to (2), while (4) defines the policy search problem. The policy search will generally be solved by a gradient procedure or some other appropriate method. The cost of this step will be primarily determined by how well-behaved the iterates $Q_{k+1}$ are in their action arguments. For example, if they were quadratic and if $\pi$ was linear then the problem would be a quadratic optimization problem. However, except for special cases (see footnote 3) the action-value functions will be more complicated, in which case this step can be expensive. Still, this cost could be similar to that of searching for the maximizing actions for each $t = 1, \ldots, N$ if the approximately maximizing actions are similar across similar states. This algorithm, which we could also call a fitted actor-critic algorithm, will be shown to overcome the above mentioned complexity control problem provided that the complexity of $\Pi$ is controlled appropriately. Indeed, in this case the set of possible regression problems is determined by the set

$$\mathcal{F}^{\vee}_{\Pi} = \{ V : V(\cdot) = Q(\cdot, \pi(\cdot)), \; Q \in \mathcal{F}, \; \pi \in \Pi \},$$

and the proof will rely on controlling the complexity of $\mathcal{F}^{\vee}_{\Pi}$ by selecting $\mathcal{F}$ and $\Pi$ appropriately.

3 The main theoretical result

3.1 Outline of the analysis

In order to gain some insight into the behavior of the algorithm, we provide a brief summary of its error analysis. The main result will be presented subsequently. For $f, Q \in \mathcal{F}$ and a policy $\pi$, we define the $t$th TD-error as follows:

$$d_t(f; Q, \pi) = R_t + \gamma Q(X_{t+1}, \pi(X_{t+1})) - f(X_t, A_t).$$

Further, we define the empirical loss function by

$$\hat{L}_N(f; Q, \pi) = \frac{1}{N} \sum_{t=1}^{N} \frac{d_t^2(f; Q, \pi)}{\lambda(\mathcal{A}) \, \pi_b(A_t|X_t)},$$

where the normalization with $\lambda(\mathcal{A})$ is introduced for mathematical convenience.
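The policy search step (4) and the empirical loss built from the TD-errors can be illustrated with a short sketch. It assumes, purely for the example, that $\Pi$ is a finite list of candidate policies that can be enumerated; the text instead suggests a gradient procedure or another appropriate method. The function names and argument conventions below are hypothetical.

```python
def policy_search(q, states, policies):
    # Step (4) over a finite candidate set (an assumption of this sketch):
    # return the policy maximizing sum_t q(X_t, pi(X_t)).
    return max(policies, key=lambda pi: sum(q(x, pi(x)) for x in states))

def td_error(f, q, pi, sample, gamma):
    # d_t(f; Q, pi) = R_t + gamma * Q(X_{t+1}, pi(X_{t+1})) - f(X_t, A_t)
    x, a, r, x_next = sample
    return r + gamma * q(x_next, pi(x_next)) - f(x, a)

def empirical_loss(f, q, pi, data, behavior_density, gamma, action_volume):
    # Importance-weighted average of squared TD-errors, normalized by
    # action_volume = lambda(A) and by the behavior density pi_b(A_t | X_t).
    total = 0.0
    for sample in data:
        x, a, _, _ = sample
        d = td_error(f, q, pi, sample, gamma)
        total += d * d / (action_volume * behavior_density(a, x))
    return total / len(data)
```

One round of the modified algorithm then alternates between minimizing `empirical_loss` over the candidate functions `f` (the regression step (3)) and calling `policy_search` on the new iterate (step (4)).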
Then (3) can be written compactly as $Q_{k+1} = \operatorname*{argmin}_{f \in \mathcal{F}} \hat{L}_N(f; Q_k, \hat{\pi}_k)$. The algorithm can then be motivated by the observation that for any $f$, $Q$, and $\pi$, $\hat{L}_N(f; Q, \pi)$ is an unbiased estimate of

$$L(f; Q, \pi) \stackrel{\text{def}}{=} \|f - T^{\pi} Q\|_\nu^2 + L^*(Q, \pi), \qquad (5)$$

where the first term is the error we are interested in and the second term captures the variance of the random samples:

$$L^*(Q, \pi) = \int_{\mathcal{A}} E\big[ \mathrm{Var}[ R_1 + \gamma Q(X_2, \pi(X_2)) \,|\, X_1, A_1 = a ] \big] \, d\lambda_{\mathcal{A}}(a).$$

This result is stated formally by $E[\hat{L}_N(f; Q, \pi)] = L(f; Q, \pi)$. Since the variance term in (5) is independent of $f$, $\operatorname*{argmin}_{f \in \mathcal{F}} L(f; Q, \pi) = \operatorname*{argmin}_{f \in \mathcal{F}} \|f - T^{\pi} Q\|_\nu^2$. Thus, if $\hat{\pi}_k$ were greedy w.r.t. $Q_k$ then $\operatorname*{argmin}_{f \in \mathcal{F}} L(f; Q_k, \hat{\pi}_k) = \operatorname*{argmin}_{f \in \mathcal{F}} \|f - T Q_k\|_\nu^2$. Hence we can still think of the procedure as approximate value iteration over the space of action-value functions, projecting $T Q_k$ using empirical risk minimization on the space $\mathcal{F}$ w.r.t. $\|\cdot\|_\nu$ distances in an approximate manner. Since $\hat{\pi}_k$ is only approximately greedy, we will have to deal with both the error coming from the approximate projection and the error coming from the choice of $\hat{\pi}_k$. To make this clear, we write the iteration in the form

$$Q_{k+1} = T^{\hat{\pi}_k} Q_k + \varepsilon'_k = T Q_k + \varepsilon'_k + (T^{\hat{\pi}_k} Q_k - T Q_k) = T Q_k + \varepsilon_k,$$

where $\varepsilon'_k$ is the error committed while computing $T^{\hat{\pi}_k} Q_k$, $\varepsilon''_k \stackrel{\text{def}}{=} T^{\hat{\pi}_k} Q_k - T Q_k$ is the error committed because the greedy policy is computed approximately and $\varepsilon_k = \varepsilon'_k + \varepsilon''_k$ is the total error of step $k$. Hence, in order to show that the procedure is well behaved, one needs to show that both errors are controlled and that when the errors are propagated through these equations, the resulting error stays controlled, too.

Footnote 3: Linear quadratic regulation is such a nice case. It is interesting to note that in this special case the obvious choices for $\mathcal{F}$ and $\Pi$ yield zero error in the limit, as can be proven based on the main result of this paper.
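The unbiasedness claim for the empirical loss can be checked with a short calculation. The sketch below assumes that $f$, $Q$ and $\pi$ are fixed (not data-dependent), that $X_t \sim \nu$ and $A_t \sim \pi_b(\cdot|X_t)$ as in the steady-state setting of the paper, and it introduces the shorthand $Y_t$ for the regression target, which is notation added here for convenience.

```latex
% Shorthand (introduced for this sketch): Y_t = R_t + \gamma Q(X_{t+1}, \pi(X_{t+1})),
% so that d_t(f;Q,\pi) = Y_t - f(X_t, A_t) and E[Y_t | X_t = x, A_t = a] = (T^\pi Q)(x,a).
\begin{align*}
E\!\left[\frac{d_t^2(f;Q,\pi)}{\lambda(\mathcal{A})\,\pi_b(A_t|X_t)}\right]
  &= \int_{\mathcal{X}}\!\int_{\mathcal{A}}
     E\big[(Y_t - f(x,a))^2 \,\big|\, X_t = x, A_t = a\big]\,
     \frac{\pi_b(a|x)}{\lambda(\mathcal{A})\,\pi_b(a|x)}\, da\, \nu(dx) \\
  &= \int_{\mathcal{X}}\!\int_{\mathcal{A}}
     E\big[(Y_t - f(x,a))^2 \,\big|\, X_t = x, A_t = a\big]\,
     d\lambda_{\mathcal{A}}(a)\, \nu(dx), \\
E\big[(Y_t - f(x,a))^2 \,\big|\, x, a\big]
  &= \big((T^\pi Q)(x,a) - f(x,a)\big)^2 + \mathrm{Var}\big[Y_t \,\big|\, x, a\big].
\end{align*}
% Integrating the last identity against \nu \times \lambda_{\mathcal{A}} gives
% \|f - T^\pi Q\|_\nu^2 + L^*(Q,\pi) = L(f;Q,\pi), and averaging over t = 1,...,N
% preserves the expectation.
```

The importance weight $1/\pi_b(A_t|X_t)$ is what cancels the behavior density in the first line, and the condition $\pi_0 > 0$ keeps this weight bounded.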
Since we are ultimately interested in the performance of the policy obtained, we will also need to show that small action-value approximation errors yield small performance losses. For these we need a number of assumptions that concern either the training data, the MDP, or the function sets used for learning.

3.2 Assumptions

3.2.1 Assumptions on the training data

We shall assume that the data is rich, is in a steady state, and is fast-mixing, where, informally, mixing means that the future depends weakly on the past.

Assumption A2 (Sample Path Properties) Assume that $\{(X_t, A_t, R_t)\}_{t=1,\ldots,N}$ is the sample path of $\pi_b$, a stochastic stationary policy. Further, assume that $\{X_t\}$ is strictly stationary ($X_t \sim \nu \in M(\mathcal{X})$) and exponentially $\beta$-mixing with the actual rate given by the parameters $(\beta, b, \kappa)$ (see footnote 4). We further assume that the sampling policy $\pi_b$ satisfies $\pi_0 = \inf_{(x,a) \in \mathcal{X} \times \mathcal{A}} \pi_b(a|x) > 0$.

The $\beta$-mixing property will be used to establish tail inequalities for certain empirical processes (see footnote 5). Note that the mixing coefficients do not need to be known. In the case when no mixing condition is satisfied, learning might be impossible. To see this just consider the case when $X_1 = X_2 = \ldots = X_N$. Thus, in this case the learner has many copies of the same random variable and successful generalization is thus impossible. We believe that the assumption that the process is in a steady state is not essential for our result, as when the process reaches its steady state quickly then (at the price of a more involved proof) the result would still hold.

Footnote 4: For the definition of $\beta$-mixing, see e.g. [2].

Footnote 5: We say “empirical process” and “empirical measure”, but note that in this work these are based on dependent (mixing) samples.

3.2.2 Assumptions on the MDP

In order to prevent the uncontrolled growth of the errors as they are propagated through the updates, we shall need some assumptions on the MDP. A convenient assumption is the following one [11]:

Assumption A3 (Uniformly stochastic transitions) For all $x \in \mathcal{X}$ and $a \in \mathcal{A}$, assume that $P(\cdot|x, a)$ is absolutely continuous w.r.t. $\nu$ and the Radon-Nikodym derivative of $P$ w.r.t. $\nu$ is bounded uniformly with bound $C_\nu$: $C_\nu \stackrel{\text{def}}{=} \sup_{x \in \mathcal{X}, a \in \mathcal{A}} \left\| \frac{dP(\cdot|x,a)}{d\nu} \right\|_\infty < +\infty$.

Note that by the definition of measure differentiation, Assumption A3 means that $P(\cdot|x, a) \le C_\nu \nu(\cdot)$. This assumption essentially requires the transitions to be noisy. We will also prove (weaker) results under the following, weaker assumption:

Assumption A4 (Discounted-average concentrability of future-state distributions) Given $\rho$, $\nu$, $m \ge 1$ and an arbitrary sequence of stationary policies $\{\pi_m\}_{m \ge 1}$, assume that the future-state distribution $\rho P^{\pi_1} P^{\pi_2} \cdots P^{\pi_m}$ is absolutely continuous w.r.t. $\nu$. Assume that $c(m) \stackrel{\text{def}}{=} \sup_{\pi_1, \ldots, \pi_m} \left\| \frac{d(\rho P^{\pi_1} P^{\pi_2} \cdots P^{\pi_m})}{d\nu} \right\|_\infty$ satisfies $\sum_{m \ge 1} m \gamma^{m-1} c(m) < +\infty$. We shall call $C_{\rho,\nu} \stackrel{\text{def}}{=} \max\left\{ (1 - \gamma)^2 \sum_{m \ge 1} m \gamma^{m-1} c(m), \; (1 - \gamma) \sum_{m \ge 1} \gamma^m c(m) \right\}$ the discounted-average concentrability coefficient of the future-state distributions.

The number $c(m)$ measures how much $\rho$ can get amplified in $m$ steps as compared to the reference distribution $\nu$. Hence, in general we expect $c(m)$ to grow with $m$. In fact, the condition that $C_{\rho,\nu}$ is finite is a growth rate condition on $c(m)$. Thanks to discounting, $C_{\rho,\nu}$ is finite for a reasonably large class of systems (see the discussion in [11]). A related assumption is needed in the error analysis of the approximate greedy step of the algorithm:

Assumption A5 (The random policy “makes no peak-states”) Consider the distribution $\mu = (\nu \times \lambda_{\mathcal{A}})P$ which is the distribution of a state that results from sampling an initial state according to $\nu$ and then executing an action which is selected uniformly at random (see footnote 6). Then $\Gamma_\nu = \|d\mu/d\nu\|_\infty < +\infty$.

Note that under Assumption A3 we have $\Gamma_\nu \le C_\nu$.
This (very mild) assumption means that after one step, starting from $\nu$ and executing this random policy, the probability of the next state being in a set is upper bounded by $\Gamma_\nu$ times the probability of the starting state being in the same set. Besides, we assume that $\mathcal{A}$ has the following regularity property: Let $\mathrm{Py}(a, h, \rho) \stackrel{\text{def}}{=} \{ (a', v) \in \mathbb{R}^{d_A + 1} : \|a' - a\|_1 \le \rho, \; 0 \le v/h \le 1 - \|a' - a\|_1 / \rho \}$ denote the pyramid with height $h$ and base given by the $\ell^1$-ball $B(a, \rho) \stackrel{\text{def}}{=} \{ a' \in \mathbb{R}^{d_A} : \|a' - a\|_1 \le \rho \}$ centered at $a$.

Assumption A6 (Regularity of the action space) We assume that there exists $\alpha > 0$, such that for all $a \in \mathcal{A}$, for all $\rho > 0$,

$$\frac{\lambda(\mathrm{Py}(a, 1, \rho) \cap (\mathcal{A} \times \mathbb{R}))}{\lambda(\mathrm{Py}(a, 1, \rho))} \ge \min\left( \alpha, \frac{\lambda(\mathcal{A})}{\lambda(B(a, \rho))} \right).$$

For example, if $\mathcal{A}$ is an $\ell^1$-ball itself, then this assumption will be satisfied with $\alpha = 2^{-d_A}$. Without assuming any smoothness of the MDP, learning in infinite MDPs looks hard (see, e.g., [12, 13]). Here we employ the following extra condition:

Assumption A7 (Lipschitzness of the MDP in the actions) Assume that the transition probabilities and rewards are Lipschitz w.r.t. their action variable, i.e., there exist $L_P, L_r > 0$ such that for all $(x, a, a') \in \mathcal{X} \times \mathcal{A} \times \mathcal{A}$ and every measurable set $B$ of $\mathcal{X}$, $|P(B|x, a) - P(B|x, a')| \le L_P \|a - a'\|_1$ and $|r(x, a) - r(x, a')| \le L_r \|a - a'\|_1$.

Note that previously Lipschitzness w.r.t. the state variables was used, e.g., in [11] to construct consistent planning algorithms.

3.2.3 Assumptions on the function sets used by the algorithm

These assumptions are less demanding since they are under the control of the user of the algorithm. However, the choice of these function sets will greatly influence the performance of the algorithm, as we shall see from the bounds. The first assumption concerns the class $\mathcal{F}$:

Assumption A8 (Lipschitzness of candidate action-value functions) Assume $\mathcal{F} \subset B(\mathcal{X} \times \mathcal{A})$ and that every element of $\mathcal{F}$ is uniformly Lipschitz in its action argument in the sense that $|Q(x, a) - Q(x, a')| \le L_A \|a - a'\|_1$ holds for any $x \in \mathcal{X}$, $a, a' \in \mathcal{A}$, and $Q \in \mathcal{F}$.
Footnote 6: Remember that $\lambda_{\mathcal{A}}$ denotes the uniform distribution over the action set $\mathcal{A}$.

We shall also need to control the capacity of our function sets. We assume that the reader is familiar with the concept of VC-dimension (see footnote 7). Here we use the pseudo-dimension of function sets that builds upon the concept of VC-dimension:

Definition 3.1 (Pseudo-dimension). The pseudo-dimension $V_{\mathcal{F}^+}$ of $\mathcal{F}$ is defined as the VC-dimension of the subgraphs of functions in $\mathcal{F}$ (hence it is also called the VC-subgraph dimension of $\mathcal{F}$).

Since $\mathcal{A}$ is multidimensional, we define $V_{\Pi^+}$ to be the sum of the pseudo-dimensions of the coordinate projection spaces, $\Pi_k$ of $\Pi$:

$$V_{\Pi^+} = \sum_{k=1}^{d_A} V_{\Pi_k^+}, \qquad \Pi_k = \{ \pi_k : \mathcal{X} \to \mathbb{R} \; : \; \pi = (\pi_1, \ldots, \pi_k, \ldots, \pi_{d_A}) \in \Pi \}.$$

Now we are ready to state our assumptions on our function sets:

Assumption A9 (Capacity of the function and policy sets) Assume that $\mathcal{F} \subset B(\mathcal{X} \times \mathcal{A}; Q_{\max})$ for $Q_{\max} > 0$ and $V_{\mathcal{F}^+} < +\infty$. Also, $\mathcal{A} \subset [-A_\infty, A_\infty]^{d_A}$ and $V_{\Pi^+} < +\infty$.

Besides their capacity, one shall also control the approximation power of the function sets involved. Let us first consider the policy set $\Pi$. Introduce

$$e^*(\mathcal{F}, \Pi) = \sup_{Q \in \mathcal{F}} \inf_{\pi \in \Pi} \nu(EQ - E^{\pi} Q).$$

Note that $\inf_{\pi \in \Pi} \nu(EQ - E^{\pi} Q)$ measures the quality of approximating $\nu EQ$ by $\nu E^{\pi} Q$. Hence, $e^*(\mathcal{F}, \Pi)$ measures the worst-case approximation error of $\nu EQ$ as $Q$ is changed within $\mathcal{F}$. This can be made small by choosing $\Pi$ large. Another related quantity is the one-step Bellman-error of $\mathcal{F}$ w.r.t. $\Pi$. This is defined as follows: For a fixed policy $\pi$, the one-step Bellman-error of $\mathcal{F}$ w.r.t. $T^{\pi}$ is defined as

$$E_1(\mathcal{F}; \pi) = \sup_{Q \in \mathcal{F}} \inf_{Q' \in \mathcal{F}} \|Q' - T^{\pi} Q\|_\nu.$$

Taking again a pessimistic approach, the one-step Bellman-error of $\mathcal{F}$ is defined as

$$E_1(\mathcal{F}, \Pi) = \sup_{\pi \in \Pi} E_1(\mathcal{F}; \pi).$$

Typically by increasing $\mathcal{F}$, $E_1(\mathcal{F}, \Pi)$ can be made smaller (this is discussed at some length in [3]). However, it also holds for both $\Pi$ and $\mathcal{F}$ that making them bigger will increase their capacity (pseudo-dimensions) which leads to an increase of the estimation errors.
Hence, $\mathcal{F}$ and $\Pi$ must be selected to balance the approximation and estimation errors, just like in supervised learning.

3.3 The main result

Theorem 3.2. Let $\pi_K$ be a greedy policy w.r.t. $Q_K$, i.e. $\pi_K(x) \in \operatorname*{argmax}_{a \in \mathcal{A}} Q_K(x, a)$. Then under Assumptions A1, A2, and A5–A9, for all $\delta > 0$ we have with probability at least $1 - \delta$: given Assumption A3 (respectively A4), $\|V^* - V^{\pi_K}\|_\infty$ (resp. $\|V^* - V^{\pi_K}\|_{1,\rho}$), is bounded by

$$C \left\{ \left( E_1(\mathcal{F}, \Pi) + e^*(\mathcal{F}, \Pi) + \frac{(\log N + \log(K/\delta))^{\frac{\kappa+1}{4\kappa}}}{N^{1/4}} \right)^{\frac{1}{d_A + 1}} + \gamma^K \right\},$$

where $C$ depends on $d_A$, $V_{\mathcal{F}^+}$, $(V_{\Pi_k^+})_{k=1}^{d_A}$, $\gamma$, $\kappa$, $b$, $\beta$, $C_\nu$ (resp. $C_{\rho,\nu}$), $\Gamma_\nu$, $L_A$, $L_P$, $L_r$, $\alpha$, $\lambda(\mathcal{A})$, $\pi_0$, $Q_{\max}$, $R_{\max}$, $\hat{R}_{\max}$, and $A_\infty$. In particular, $C$ scales with $V^{\frac{\kappa+1}{4\kappa(d_A+1)}}$, where $V = 2V_{\mathcal{F}^+} + V_{\Pi^+}$ plays the role of the “combined effective” dimension of $\mathcal{F}$ and $\Pi$.

Footnote 7: Readers not familiar with VC-dimension are suggested to consult a book, such as the one by Anthony and Bartlett [14].

4 Discussion

We have presented what we believe are the first finite-time bounds for continuous-state and action-space RL that uses value functions. Further, this is the first analysis of fitted Q-iteration, an algorithm that has proved to be useful in a number of cases, even when used with non-averagers for which no previous theoretical analysis existed (e.g., [15, 16]). In fact, our main motivation was to show that there is a systematic way of making these algorithms work and to point at possible problem sources at the same time. We discussed why it can be difficult to make these algorithms work in practice. We suggested that either the set of action-value candidates has to be carefully controlled (e.g., assuming uniform Lipschitzness w.r.t. the state variables), or a policy search step is needed, just like in actor-critic algorithms. The bound in this paper is similar in many respects to a previous bound of a Bellman-residual minimization algorithm [2].
It appears that the techniques developed here can be used to obtain results for that algorithm when it is applied to continuous action spaces. Finally, although we have not explored them here, consistency results for FQI can be obtained from our results using standard methods, like the method of sieves. We believe that the methods developed here will eventually lead to algorithms where the function approximation methods are chosen based on the data (similar to adaptive regression methods) so as to optimize performance, which in our opinion is one of the biggest open questions in RL. Currently we are exploring this possibility.

Acknowledgments

András Antos would like to acknowledge support for this project from the Hungarian Academy of Sciences (Bolyai Fellowship). Csaba Szepesvári gratefully acknowledges the support received from the Alberta Ingenuity Fund, NSERC, the Computer and Automation Research Institute of the Hungarian Academy of Sciences.

References

[1] A. Antos, Cs. Szepesvári, and R. Munos. Learning near-optimal policies with Bellman-residual minimization based fitted policy iteration and a single sample path. In COLT-19, pages 574–588, 2006.
[2] A. Antos, Cs. Szepesvári, and R. Munos. Learning near-optimal policies with Bellman-residual minimization based fitted policy iteration and a single sample path. Machine Learning, 2007. (accepted).
[3] A. Antos, Cs. Szepesvári, and R. Munos. Value-iteration based fitted policy iteration: learning with a single trajectory. In IEEE ADPRL, pages 330–337, 2007.
[4] D. P. Bertsekas and S. E. Shreve. Stochastic Optimal Control (The Discrete Time Case). Academic Press, New York, 1978.
[5] D. Ernst, P. Geurts, and L. Wehenkel. Tree-based batch mode reinforcement learning. Journal of Machine Learning Research, 6:503–556, 2005.
[6] R. S. Sutton and A. G. Barto. Reinforcement Learning: An Introduction. Bradford Book. MIT Press, 1998.
[7] N. Cristianini and J. Shawe-Taylor. An introduction to support vector machines (and other kernel-based learning methods). Cambridge University Press, 2000.
[8] J. A. Boyan and A. W. Moore. Generalization in reinforcement learning: Safely approximating the value function. In NIPS-7, pages 369–376, 1995.
[9] P. L. Bartlett, P. M. Long, and R. C. Williamson. Fat-shattering and the learnability of real-valued functions. Journal of Computer and System Sciences, 52:434–452, 1996.
[10] A. N. Kolmogorov and V. M. Tihomirov. ε-entropy and ε-capacity of sets in functional space. American Mathematical Society Translations, 17(2):277–364, 1961.
[11] R. Munos and Cs. Szepesvári. Finite time bounds for sampling based fitted value iteration. Technical report, Computer and Automation Research Institute of the Hungarian Academy of Sciences, Kende u. 13-17, Budapest 1111, Hungary, 2006.
[12] A. Y. Ng and M. Jordan. PEGASUS: A policy search method for large MDPs and POMDPs. In Proceedings of the 16th Conference in Uncertainty in Artificial Intelligence, pages 406–415, 2000.
[13] P. L. Bartlett and A. Tewari. Sample complexity of policy search with known dynamics. In NIPS-19. MIT Press, 2007.
[14] M. Anthony and P. L. Bartlett. Neural Network Learning: Theoretical Foundations. Cambridge University Press, 1999.
[15] M. Riedmiller. Neural fitted Q iteration – first experiences with a data efficient neural reinforcement learning method. In 16th European Conference on Machine Learning, pages 317–328, 2005.
[16] S. Kalyanakrishnan and P. Stone. Batch reinforcement learning in a complex domain. In AAMAS-07, 2007.

6 0.16827755 148 nips-2007-Online Linear Regression and Its Application to Model-Based Reinforcement Learning

7 0.16639161 168 nips-2007-Reinforcement Learning in Continuous Action Spaces through Sequential Monte Carlo Methods

8 0.16386756 125 nips-2007-Markov Chain Monte Carlo with People

9 0.16344798 146 nips-2007-On higher-order perceptron algorithms

10 0.1485358 213 nips-2007-Variational Inference for Diffusion Processes

11 0.12929671 163 nips-2007-Receding Horizon Differential Dynamic Programming

12 0.12609148 185 nips-2007-Stable Dual Dynamic Programming

13 0.11963373 30 nips-2007-Bayes-Adaptive POMDPs

14 0.1178023 98 nips-2007-Hierarchical Apprenticeship Learning with Application to Quadruped Locomotion

15 0.11093225 124 nips-2007-Managing Power Consumption and Performance of Computing Systems Using Reinforcement Learning

16 0.10874504 209 nips-2007-Ultrafast Monte Carlo for Statistical Summations

17 0.10500887 151 nips-2007-Optimistic Linear Programming gives Logarithmic Regret for Irreducible MDPs

18 0.10419642 92 nips-2007-Fixing Max-Product: Convergent Message Passing Algorithms for MAP LP-Relaxations

19 0.098791413 203 nips-2007-The rat as particle filter

20 0.098413572 43 nips-2007-Catching Change-points with Lasso


similar papers computed by lsi model

lsi for this paper:

topicId topicWeight

[(0, -0.262), (1, -0.304), (2, 0.05), (3, -0.078), (4, -0.235), (5, 0.041), (6, -0.011), (7, -0.097), (8, -0.061), (9, 0.018), (10, -0.01), (11, -0.09), (12, 0.003), (13, 0.046), (14, 0.034), (15, 0.111), (16, -0.11), (17, 0.101), (18, -0.05), (19, -0.099), (20, 0.157), (21, -0.039), (22, -0.108), (23, 0.065), (24, 0.024), (25, -0.046), (26, -0.02), (27, -0.099), (28, -0.033), (29, 0.084), (30, -0.035), (31, 0.061), (32, 0.056), (33, -0.078), (34, 0.148), (35, 0.024), (36, -0.009), (37, 0.113), (38, -0.097), (39, 0.012), (40, -0.044), (41, 0.041), (42, -0.012), (43, -0.041), (44, -0.093), (45, -0.013), (46, 0.077), (47, 0.198), (48, -0.002), (49, 0.096)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.9718973 34 nips-2007-Bayesian Policy Learning with Trans-Dimensional MCMC

Author: Matthew Hoffman, Arnaud Doucet, Nando D. Freitas, Ajay Jasra

Abstract: A recently proposed formulation of the stochastic planning and control problem as one of parameter estimation for suitable artificial statistical models has led to the adoption of inference algorithms for this notoriously hard problem. At the algorithmic level, the focus has been on developing Expectation-Maximization (EM) algorithms. In this paper, we begin by making the crucial observation that the stochastic control problem can be reinterpreted as one of trans-dimensional inference. With this new interpretation, we are able to propose a novel reversible jump Markov chain Monte Carlo (MCMC) algorithm that is more efficient than its EM counterparts. Moreover, it enables us to implement full Bayesian policy search, without the need for gradients and with one single Markov chain. The new approach involves sampling directly from a distribution that is proportional to the reward and, consequently, performs better than classic simulations methods in situations where the reward is a rare event.

2 0.63213909 5 nips-2007-A Game-Theoretic Approach to Apprenticeship Learning

Author: Umar Syed, Robert E. Schapire

Abstract: We study the problem of an apprentice learning to behave in an environment with an unknown reward function by observing the behavior of an expert. We follow on the work of Abbeel and Ng [1] who considered a framework in which the true reward function is assumed to be a linear combination of a set of known and observable features. We give a new algorithm that, like theirs, is guaranteed to learn a policy that is nearly as good as the expert’s, given enough examples. However, unlike their algorithm, we show that ours may produce a policy that is substantially better than the expert’s. Moreover, our algorithm is computationally faster, is easier to implement, and can be applied even in the absence of an expert. The method is based on a game-theoretic view of the problem, which leads naturally to a direct application of the multiplicative-weights algorithm of Freund and Schapire [2] for playing repeated matrix games. In addition to our formal presentation and analysis of the new algorithm, we sketch how the method can be applied when the transition function itself is unknown, and we provide an experimental demonstration of the algorithm on a toy video-game environment.

3 0.61382002 91 nips-2007-Fitted Q-iteration in continuous action-space MDPs

Author: András Antos, Csaba Szepesvári, Rémi Munos

Abstract: We consider continuous state, continuous action batch reinforcement learning where the goal is to learn a good policy from a sufficiently rich trajectory generated by some policy. We study a variant of fitted Q-iteration, where the greedy action selection is replaced by searching for a policy in a restricted set of candidate policies by maximizing the average action values. We provide a rigorous analysis of this algorithm, proving what we believe is the first finite-time bound for value-function based algorithms for continuous state and action problems.

1 Preliminaries

We will build on the results from [1, 2, 3] and for this reason we use the same notation as these papers. The unattributed results cited in this section can be found in the book [4]. A discounted MDP is defined by a quintuple $(\mathcal{X}, \mathcal{A}, P, S, \gamma)$, where $\mathcal{X}$ is the (possibly infinite) state space, $\mathcal{A}$ is the set of actions, $P : \mathcal{X} \times \mathcal{A} \to M(\mathcal{X})$ is the transition probability kernel with $P(\cdot|x, a)$ defining the next-state distribution upon taking action $a$ from state $x$, $S(\cdot|x, a)$ gives the corresponding distribution of immediate rewards, and $\gamma \in (0, 1)$ is the discount factor. Here $\mathcal{X}$ is a measurable space and $M(\mathcal{X})$ denotes the set of all probability measures over $\mathcal{X}$. The Lebesgue measure shall be denoted by $\lambda$. We start with the following mild assumption on the MDP:

Assumption A1 (MDP Regularity) $\mathcal{X}$ is a compact subset of the $d_X$-dimensional Euclidean space, $\mathcal{A}$ is a compact subset of $[-A_\infty, A_\infty]^{d_A}$. The random immediate rewards are bounded by $\hat{R}_{\max}$ and the expected immediate reward function, $r(x, a) = \int r \, S(dr|x, a)$, is uniformly bounded by $R_{\max}$: $\|r\|_\infty \le R_{\max}$.

A policy determines the next action given the past observations. Here we shall deal with stationary (Markovian) policies which choose an action in a stochastic way based on the last observation only. The value of a policy $\pi$ when it is started from a state $x$ is defined as the total expected discounted reward that is encountered while the policy is executed: $V^{\pi}(x) = E_{\pi}\left[ \sum_{t=0}^{\infty} \gamma^t R_t \,|\, X_0 = x \right]$. Here $R_t \sim S(\cdot|X_t, A_t)$ is the reward received at time step $t$, the state, $X_t$, evolves according to $X_{t+1} \sim P(\cdot|X_t, A_t)$, where $A_t$ is sampled from the distribution determined by $\pi$. We use $Q^{\pi} : \mathcal{X} \times \mathcal{A} \to \mathbb{R}$ to denote the action-value function of policy $\pi$: $Q^{\pi}(x, a) = E_{\pi}\left[ \sum_{t=0}^{\infty} \gamma^t R_t \,|\, X_0 = x, A_0 = a \right]$. The goal is to find a policy that attains the best possible values, $V^*(x) = \sup_{\pi} V^{\pi}(x)$, at all states $x \in \mathcal{X}$. Here $V^*$ is called the optimal value function and a policy $\pi^*$ that satisfies $V^{\pi^*}(x) = V^*(x)$ for all $x \in \mathcal{X}$ is called optimal. The optimal action-value function $Q^*(x, a)$ is $Q^*(x, a) = \sup_{\pi} Q^{\pi}(x, a)$. We say that a (deterministic stationary) policy $\pi$ is greedy w.r.t. an action-value function $Q \in B(\mathcal{X} \times \mathcal{A})$, and we write $\pi = \hat{\pi}(\cdot; Q)$, if, for all $x \in \mathcal{X}$, $\pi(x) \in \operatorname*{argmax}_{a \in \mathcal{A}} Q(x, a)$. Under mild technical assumptions, such a greedy policy always exists. Any greedy policy w.r.t. $Q^*$ is optimal. For $\pi : \mathcal{X} \to \mathcal{A}$ we define its evaluation operator, $T^{\pi} : B(\mathcal{X} \times \mathcal{A}) \to B(\mathcal{X} \times \mathcal{A})$, by $(T^{\pi} Q)(x, a) = r(x, a) + \gamma \int_{\mathcal{X}} Q(y, \pi(y)) \, P(dy|x, a)$. It is known that $Q^{\pi} = T^{\pi} Q^{\pi}$. Further, if we let the Bellman operator, $T : B(\mathcal{X} \times \mathcal{A}) \to B(\mathcal{X} \times \mathcal{A})$, be defined by $(T Q)(x, a) = r(x, a) + \gamma \int_{\mathcal{X}} \sup_{b \in \mathcal{A}} Q(y, b) \, P(dy|x, a)$ then $Q^* = T Q^*$. It is known that $V^{\pi}$ and $Q^{\pi}$ are bounded by $R_{\max}/(1 - \gamma)$, just like $Q^*$ and $V^*$. For $\pi : \mathcal{X} \to \mathcal{A}$, the operator $E^{\pi} : B(\mathcal{X} \times \mathcal{A}) \to B(\mathcal{X})$ is defined by $(E^{\pi} Q)(x) = Q(x, \pi(x))$, while $E : B(\mathcal{X} \times \mathcal{A}) \to B(\mathcal{X})$ is defined by $(EQ)(x) = \sup_{a \in \mathcal{A}} Q(x, a)$. Throughout the paper $\mathcal{F} \subset \{ f : \mathcal{X} \times \mathcal{A} \to \mathbb{R} \}$ will denote a subset of real-valued functions over the state-action space $\mathcal{X} \times \mathcal{A}$ and $\Pi \subset \mathcal{A}^{\mathcal{X}}$ will be a set of policies.

Author footnote (∗): Also with: Computer and Automation Research Inst. of the Hungarian Academy of Sciences, Kende u. 13-17, Budapest 1111, Hungary.
For ν ∈ M (X ) and f : X → R p measurable, we let (for p ≥ 1) f p,ν = X |f (x)|p ν(dx). We simply write f ν for f 2,ν . 2 Further, we extend · ν to F by f ν = A X |f |2 (x, a) dν(x) dλA (a), where λA is the uniform distribution over A. We shall use the shorthand notation νf to denote the integral f (x)ν(dx). We denote the space of bounded measurable functions with domain X by B(X ). Further, the space of measurable functions bounded by 0 < K < ∞ shall be denoted by B(X ; K). We let · ∞ denote the supremum norm. 2 Fitted Q-iteration with approximate policy maximization We assume that we are given a finite trajectory, {(Xt , At , Rt )}1≤t≤N , generated by some stochastic stationary policy πb , called the behavior policy: At ∼ πb (·|Xt ), Xt+1 ∼ P (·|Xt , At ), Rt ∼ def S(·|Xt , At ), where πb (·|x) is a density with π0 = inf (x,a)∈X ×A πb (a|x) > 0. The generic recipe for fitted Q-iteration (FQI) [5] is Qk+1 = Regress(Dk (Qk )), (1) where Regress is an appropriate regression procedure and Dk (Qk ) is a dataset defining a regression problem in the form of a list of data-point pairs: Dk (Qk ) = (Xt , At ), Rt + γ max Qk (Xt+1 , b) b∈A 1≤t≤N .1 Fitted Q-iteration can be viewed as approximate value iteration applied to action-value functions. To see this note that value iteration would assign the value (T Qk )(x, a) = r(x, a) + γ maxb∈A Qk (y, b) P (dy|x, a) to Qk+1 (x, a) [6]. Now, remember that the regression function for the jointly distributed random variables (Z, Y ) is defined by the conditional expectation of Y given Z: m(Z) = E [Y |Z]. Since for any fixed function Q, E [Rt + γ maxb∈A Q(Xt+1 , b)|Xt , At ] = (T Q)(Xt , At ), the regression function corresponding to the data Dk (Q) is indeed T Q and hence if FQI solved the regression problem defined by Qk exactly, it would simulate value iteration exactly. However, this argument itself does not directly lead to a rigorous analysis of FQI: Since Qk is obtained based on the data, it is itself a random function. 
Hence, after the first iteration, the “target” function in FQI becomes random. Furthermore, this function depends on the same data that is used to define the regression problem. Will FQI still work despite these issues? To illustrate the potential difficulties consider a dataset where X1 , . . . , XN is a sequence of independent random variables, which are all distributed uniformly at random in [0, 1]. Further, let M be a random integer greater than N which is independent of the dataset (Xt )N . Let U be another random variable, uniformly t=1 distributed in [0, 1]. Now define the regression problem by Yt = fM,U (Xt ), where fM,U (x) = sgn(sin(2M 2π(x + U ))). Then it is not hard to see that no matter how big N is, no procedure can 1 Since the designer controls Qk , we may assume that it is continuous, hence the maximum exists. 2 estimate the regression function fM,U with a small error (in expectation, or with high probability), even if the procedure could exploit the knowledge of the specific form of fM,U . On the other hand, if we restricted M to a finite range then the estimation problem could be solved successfully. The example shows that if the complexity of the random functions defining the regression problem is uncontrolled then successful estimation might be impossible. Amongst the many regression methods in this paper we have chosen to work with least-squares methods. In this case Equation (1) takes the form N Qk+1 = argmin Q∈F t=1 1 πb (At |Xt ) 2 Q(Xt , At ) − Rt + γ max Qk (Xt+1 , b) b∈A . (2) We call this method the least-squares fitted Q-iteration (LSFQI) method. Here we introduced the weighting 1/πb (At |Xt ) since we do not want to give more weight to those actions that are preferred by the behavior policy. Besides this weighting, the only parameter of the method is the function set F. This function set should be chosen carefully, to keep a balance between the representation power and the number of samples. 
As a specific example for $\mathcal{F}$, consider neural networks with some fixed architecture. In this case the function set is generated by assigning weights in all possible ways to the neural net. Then the above minimization becomes the problem of tuning the weights. Another example is to use linearly parameterized function approximation methods with appropriately selected basis functions. In this case the weight tuning problem would be less demanding. Yet another possibility is to let $\mathcal{F}$ be an appropriate restriction of a Reproducing Kernel Hilbert Space (e.g., in a ball). In this case the training procedure becomes similar to LS-SVM training [7].

As indicated above, the analysis of this algorithm is complicated by the fact that the new dataset is defined in terms of the previous iterate, which is already a function of the dataset. Another complication is that the samples in a trajectory are in general correlated and that the bias introduced by the imperfections of the approximation architecture may lead to an explosion of the error of the procedure, as documented in a number of cases in, e.g., [8].

Nevertheless, at least for finite action sets, the tools developed in [1, 3, 2] look suitable to show that under appropriate conditions these problems can be overcome if the function set is chosen in a judicious way. However, the results of these works would become essentially useless in the case of an infinite number of actions since these previous bounds grow to infinity with the number of actions. Actually, we believe that this is not an artifact of the proof techniques of these works, as suggested by the counterexample that involved random targets. The following result elaborates this point further:

Proposition 2.1. Let $\mathcal{F} \subset B(\mathcal{X} \times \mathcal{A})$. Then even if the pseudo-dimension of $\mathcal{F}$ is finite, the fat-shattering function of

$$\mathcal{F}^{\vee}_{\max} = \left\{ V_Q : V_Q(\cdot) = \max_{a \in \mathcal{A}} Q(\cdot, a),\; Q \in \mathcal{F} \right\}$$

can be infinite over $(0, 1/2)$ (footnote 2).
Without going into further details, let us just note that the finiteness of the fat-shattering function is a sufficient and necessary condition for learnability and the finiteness of the fat-shattering function is implied by the finiteness of the pseudo-dimension [9]. The above proposition thus shows that without imposing further special conditions on $\mathcal{F}$, the learning problem may become infeasible. One possibility is of course to discretize the action space, e.g., by using a uniform grid. However, if the action space has a really high dimensionality, this approach becomes infeasible (even enumerating $2^{d_A}$ points could be impossible when $d_A$ is large). Therefore we prefer alternate solutions. Another possibility is to make the functions in $\mathcal{F}$, e.g., uniformly Lipschitz in their state coordinates. Then the same property will hold for functions in $\mathcal{F}^{\vee}_{\max}$ and hence by a classical result we can bound the capacity of this set (cf. pp. 353–357 of [10]). One potential problem with this approach is that this way it might be difficult to get a fine control of the capacity of the resulting set.

Footnote 2: The proof of this and the other results are given in the appendix, available in the extended version of this paper, downloadable from http://hal.inria.fr/inria-00185311/en/.

In the approach explored here we modify the fitted Q-iteration algorithm by introducing a policy set $\Pi$ and a search over this set for an approximately greedy policy in a sense that will be made precise shortly. Our algorithm thus has four parameters: $\mathcal{F}$, $\Pi$, $K$, $Q_0$. Here $\mathcal{F}$ is as before, $\Pi$ is a user-chosen set of policies (mappings from $\mathcal{X}$ to $\mathcal{A}$), $K$ is the number of iterations and $Q_0$ is an initial value function (a typical choice is $Q_0 \equiv 0$). The algorithm computes a sequence of iterates $(Q_k, \hat{\pi}_k)$, $k = 0, \ldots, K$, defined by the following equations:

$$\hat{\pi}_0 = \arg\max_{\pi \in \Pi} \sum_{t=1}^{N} Q_0(X_t, \pi(X_t)),$$

$$Q_{k+1} = \arg\min_{Q \in \mathcal{F}} \sum_{t=1}^{N} \frac{1}{\pi_b(A_t|X_t)} \left( Q(X_t, A_t) - \left[ R_t + \gamma Q_k(X_{t+1}, \hat{\pi}_k(X_{t+1})) \right] \right)^2, \qquad (3)$$

$$\hat{\pi}_{k+1} = \arg\max_{\pi \in \Pi} \sum_{t=1}^{N} Q_{k+1}(X_t, \pi(X_t)). \qquad (4)$$

Thus, (3) is similar to (2), while (4) defines the policy search problem. The policy search will generally be solved by a gradient procedure or some other appropriate method. The cost of this step will be primarily determined by how well-behaved the iterates $Q_{k+1}$ are in their action arguments. For example, if they were quadratic and if $\pi$ was linear then the problem would be a quadratic optimization problem. However, except for special cases (footnote 3) the action value functions will be more complicated, in which case this step can be expensive. Still, this cost could be similar to that of searching for the maximizing actions for each $t = 1, \ldots, N$ if the approximately maximizing actions are similar across similar states.

This algorithm, which we could also call a fitted actor-critic algorithm, will be shown to overcome the above mentioned complexity control problem provided that the complexity of $\Pi$ is controlled appropriately. Indeed, in this case the set of possible regression problems is determined by the set

$$\mathcal{F}^{\vee}_{\Pi} = \left\{ V : V(\cdot) = Q(\cdot, \pi(\cdot)),\; Q \in \mathcal{F},\; \pi \in \Pi \right\},$$

and the proof will rely on controlling the complexity of $\mathcal{F}^{\vee}_{\Pi}$ by selecting $\mathcal{F}$ and $\Pi$ appropriately.

3 The main theoretical result

3.1 Outline of the analysis

In order to gain some insight into the behavior of the algorithm, we provide a brief summary of its error analysis. The main result will be presented subsequently. For $f, Q \in \mathcal{F}$ and a policy $\pi$, we define the $t$th TD-error as follows:

$$d_t(f; Q, \pi) = R_t + \gamma Q(X_{t+1}, \pi(X_{t+1})) - f(X_t, A_t).$$

Further, we define the empirical loss function by

$$\hat{L}_N(f; Q, \pi) = \frac{1}{N} \sum_{t=1}^{N} \frac{d_t^2(f; Q, \pi)}{\lambda(\mathcal{A}) \, \pi_b(A_t|X_t)},$$

where the normalization with $\lambda(\mathcal{A})$ is introduced for mathematical convenience.
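The alternation between the regression step (3) and the policy search step (4) can be sketched as follows. This is a rough illustration, not the authors' implementation: the policy set is assumed to be a finite list of candidates searched by enumeration (the text suggests a gradient procedure in general), and the regression routine `regress` is a hypothetical user-supplied callable that returns a function $Q(x, a)$ fitted to the given targets. All function names are made up for this sketch.

```python
def policy_search(q, policies, states):
    """Eq. (4) over a finite candidate list: maximize sum_t Q(X_t, pi(X_t))."""
    return max(policies, key=lambda pi: sum(q(x, pi(x)) for x in states))


def td_error(f, q, pi, transition, gamma):
    """d_t(f; Q, pi) = R_t + gamma * Q(X_{t+1}, pi(X_{t+1})) - f(X_t, A_t)."""
    x, a, r, x_next = transition
    return r + gamma * q(x_next, pi(x_next)) - f(x, a)


def empirical_loss(f, q, pi, data, behavior_density, gamma, action_volume):
    """Importance-weighted empirical loss; action_volume stands for lambda(A)."""
    total = 0.0
    for transition in data:
        x, a, _, _ = transition
        d = td_error(f, q, pi, transition, gamma)
        total += d * d / (action_volume * behavior_density(a, x))
    return total / len(data)


def fitted_actor_critic(data, q0, policies, regress, gamma, iterations):
    """Alternate Eq. (3) and Eq. (4).

    regress(data, targets) is assumed to perform the weighted least-squares
    fit of Eq. (3) and to return a callable Q(x, a).
    """
    states = [x for x, _, _, _ in data]
    q = q0
    pi = policy_search(q, policies, states)
    for _ in range(iterations):
        # Targets use the current policy instead of a max over actions.
        targets = [r + gamma * q(x_next, pi(x_next)) for _, _, r, x_next in data]
        q = regress(data, targets)
        pi = policy_search(q, policies, states)
    return q, pi


def linear_policy(theta):
    """Hypothetical one-parameter policy pi(x) = theta * x."""
    return lambda x: theta * x


# Small check of the policy search step with a Q that is quadratic in the action.
candidates = [linear_policy(t) for t in (0.0, 0.25, 0.5, 0.75, 1.0)]
q_demo = lambda x, a: -(a - 0.5 * x) ** 2
best = policy_search(q_demo, candidates, states=[0.2, 0.4, 0.8])
```

In this check the candidate with gain 0.5 attains the maximum, since it makes every term of the sum zero while all other candidates give negative sums. Enumerating a finite list only stands in for the policy search; it does not reflect the capacity considerations for $\Pi$ discussed in the text.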
Then (3) can be written compactly as $Q_{k+1} = \arg\min_{f \in \mathcal{F}} \hat{L}_N(f; Q_k, \hat{\pi}_k)$.

The algorithm can then be motivated by the observation that for any $f$, $Q$, and $\pi$, $\hat{L}_N(f; Q, \pi)$ is an unbiased estimate of

$$L(f; Q, \pi) \stackrel{\mathrm{def}}{=} \|f - T^{\pi} Q\|_{\nu}^{2} + L^{*}(Q, \pi), \qquad (5)$$

where the first term is the error we are interested in and the second term captures the variance of the random samples:

$$L^{*}(Q, \pi) = \int_{\mathcal{A}} E\left[ \mathrm{Var}\left[ R_1 + \gamma Q(X_2, \pi(X_2)) \,|\, X_1, A_1 = a \right] \right] d\lambda_{\mathcal{A}}(a).$$

This result is stated formally by $E[\hat{L}_N(f; Q, \pi)] = L(f; Q, \pi)$.

Footnote 3: Linear quadratic regulation is such a nice case. It is interesting to note that in this special case the obvious choices for $\mathcal{F}$ and $\Pi$ yield zero error in the limit, as can be proven based on the main result of this paper.

Since the variance term in (5) is independent of $f$, $\arg\min_{f \in \mathcal{F}} L(f; Q, \pi) = \arg\min_{f \in \mathcal{F}} \|f - T^{\pi} Q\|_{\nu}^{2}$. Thus, if $\hat{\pi}_k$ were greedy w.r.t. $Q_k$ then $\arg\min_{f \in \mathcal{F}} L(f; Q_k, \hat{\pi}_k) = \arg\min_{f \in \mathcal{F}} \|f - T Q_k\|_{\nu}^{2}$. Hence we can still think of the procedure as approximate value iteration over the space of action-value functions, projecting $T Q_k$ using empirical risk minimization on the space $\mathcal{F}$ w.r.t. $\|\cdot\|_{\nu}$ distances in an approximate manner. Since $\hat{\pi}_k$ is only approximately greedy, we will have to deal with both the error coming from the approximate projection and the error coming from the choice of $\hat{\pi}_k$. To make this clear, we write the iteration in the form

$$Q_{k+1} = T^{\hat{\pi}_k} Q_k + \varepsilon'_k = T Q_k + \varepsilon'_k + (T^{\hat{\pi}_k} Q_k - T Q_k) = T Q_k + \varepsilon_k,$$

where $\varepsilon'_k$ is the error committed while computing $T^{\hat{\pi}_k} Q_k$, $\varepsilon''_k \stackrel{\mathrm{def}}{=} T^{\hat{\pi}_k} Q_k - T Q_k$ is the error committed because the greedy policy is computed approximately and $\varepsilon_k = \varepsilon'_k + \varepsilon''_k$ is the total error of step $k$. Hence, in order to show that the procedure is well behaved, one needs to show that both errors are controlled and that when the errors are propagated through these equations, the resulting error stays controlled, too.
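The unbiasedness claim behind (5) can be checked with a short calculation. The derivation below is a sketch under stated assumptions: the states are stationary with $X_t \sim \nu$, the behavior density satisfies $\pi_b > 0$, $\lambda_{\mathcal{A}}$ has density $1/\lambda(\mathcal{A})$ on $\mathcal{A}$, and $T^{\pi}$ is taken to satisfy $E[R_t + \gamma Q(X_{t+1}, \pi(X_{t+1})) \,|\, X_t = x, A_t = a] = (T^{\pi} Q)(x, a)$, which is the usual definition but is not restated in this section.

```latex
\begin{align*}
E\big[\hat{L}_N(f;Q,\pi)\big]
 &= E\left[\frac{d_t^2(f;Q,\pi)}{\lambda(\mathcal{A})\,\pi_b(A_t|X_t)}\right]
    && \text{(stationarity: each term has the same expectation)} \\
 &= \int_{\mathcal{X}}\int_{\mathcal{A}}
    \frac{E\big[d_t^2 \mid X_t=x, A_t=a\big]}{\lambda(\mathcal{A})\,\pi_b(a|x)}
    \,\pi_b(a|x)\,da\,\nu(dx) \\
 &= \int_{\mathcal{X}}\int_{\mathcal{A}}
    E\big[d_t^2 \mid X_t=x, A_t=a\big]\,d\lambda_{\mathcal{A}}(a)\,\nu(dx),
    && \text{(since } d\lambda_{\mathcal{A}}(a) = da/\lambda(\mathcal{A})\text{)} \\
E\big[d_t^2 \mid X_t=x, A_t=a\big]
 &= \big((T^{\pi}Q)(x,a) - f(x,a)\big)^2
  + \mathrm{Var}\big[R_t + \gamma Q(X_{t+1},\pi(X_{t+1})) \mid X_t=x, A_t=a\big].
\end{align*}
```

Substituting the last identity into the double integral gives $\|f - T^{\pi} Q\|_{\nu}^{2} + L^{*}(Q, \pi)$. The importance weight $1/\pi_b$ is what cancels the behavior density, so the squared error is measured uniformly over actions rather than under the behavior policy.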
Since we are ultimately interested in the performance of the policy obtained, we will also need to show that small action-value approximation errors yield small performance losses. For these we need a number of assumptions that concern either the training data, the MDP, or the function sets used for learning.

3.2 Assumptions

3.2.1 Assumptions on the training data

We shall assume that the data is rich, is in a steady state, and is fast-mixing, where, informally, mixing means that the future depends weakly on the past.

Assumption A2 (Sample Path Properties) Assume that $\{(X_t, A_t, R_t)\}_{t=1,\ldots,N}$ is the sample path of $\pi_b$, a stochastic stationary policy. Further, assume that $\{X_t\}$ is strictly stationary ($X_t \sim \nu \in M(\mathcal{X})$) and exponentially $\beta$-mixing with the actual rate given by the parameters $(\beta, b, \kappa)$ (footnote 4). We further assume that the sampling policy $\pi_b$ satisfies $\pi_0 = \inf_{(x,a) \in \mathcal{X} \times \mathcal{A}} \pi_b(a|x) > 0$.

The $\beta$-mixing property will be used to establish tail inequalities for certain empirical processes (footnote 5). Note that the mixing coefficients do not need to be known. In the case when no mixing condition is satisfied, learning might be impossible. To see this just consider the case when $X_1 = X_2 = \ldots = X_N$. In this case the learner has many copies of the same random variable and successful generalization is thus impossible. We believe that the assumption that the process is in a steady state is not essential for our result, as when the process reaches its steady state quickly then (at the price of a more involved proof) the result would still hold.

Footnote 4: For the definition of $\beta$-mixing, see e.g. [2].

Footnote 5: We say “empirical process” and “empirical measure”, but note that in this work these are based on dependent (mixing) samples.

3.2.2 Assumptions on the MDP

In order to prevent the uncontrolled growth of the errors as they are propagated through the updates, we shall need some assumptions on the MDP. A convenient assumption is the following one [11]:

Assumption A3 (Uniformly stochastic transitions) For all $x \in \mathcal{X}$ and $a \in \mathcal{A}$, assume that $P(\cdot|x, a)$ is absolutely continuous w.r.t. $\nu$ and the Radon-Nikodym derivative of $P$ w.r.t. $\nu$ is bounded uniformly with bound $C_\nu$:

$$C_\nu \stackrel{\mathrm{def}}{=} \sup_{x \in \mathcal{X}, a \in \mathcal{A}} \left\| \frac{dP(\cdot|x, a)}{d\nu} \right\|_\infty < +\infty.$$

Note that by the definition of measure differentiation, Assumption A3 means that $P(\cdot|x, a) \le C_\nu \nu(\cdot)$. This assumption essentially requires the transitions to be noisy. We will also prove (weaker) results under the following, weaker assumption:

Assumption A4 (Discounted-average concentrability of future-state distributions) Given $\rho$, $\nu$, $m \ge 1$ and an arbitrary sequence of stationary policies $\{\pi_m\}_{m \ge 1}$, assume that the future-state distribution $\rho P^{\pi_1} P^{\pi_2} \cdots P^{\pi_m}$ is absolutely continuous w.r.t. $\nu$. Assume that

$$c(m) \stackrel{\mathrm{def}}{=} \sup_{\pi_1, \ldots, \pi_m} \left\| \frac{d(\rho P^{\pi_1} P^{\pi_2} \cdots P^{\pi_m})}{d\nu} \right\|_\infty$$

satisfies $\sum_{m \ge 1} m \gamma^{m-1} c(m) < +\infty$. We shall call

$$C_{\rho,\nu} \stackrel{\mathrm{def}}{=} \max\left\{ (1 - \gamma)^2 \sum_{m \ge 1} m \gamma^{m-1} c(m),\; (1 - \gamma) \sum_{m \ge 1} \gamma^{m} c(m) \right\}$$

the discounted-average concentrability coefficient of the future-state distributions.

The number $c(m)$ measures how much $\rho$ can get amplified in $m$ steps as compared to the reference distribution $\nu$. Hence, in general we expect $c(m)$ to grow with $m$. In fact, the condition that $C_{\rho,\nu}$ is finite is a growth rate condition on $c(m)$. Thanks to discounting, $C_{\rho,\nu}$ is finite for a reasonably large class of systems (see the discussion in [11]).

A related assumption is needed in the error analysis of the approximate greedy step of the algorithm:

Assumption A5 (The random policy “makes no peak-states”) Consider the distribution $\mu = (\nu \times \lambda_{\mathcal{A}}) P$ which is the distribution of a state that results from sampling an initial state according to $\nu$ and then executing an action which is selected uniformly at random (footnote 6). Then $\Gamma_\nu = \|d\mu/d\nu\|_\infty < +\infty$.

Note that under Assumption A3 we have $\Gamma_\nu \le C_\nu$.
This (very mild) assumption means that after one step, starting from $\nu$ and executing this random policy, the probability of the next state being in a set is upper bounded by $\Gamma_\nu$ times the probability of the starting state being in the same set.

Besides, we assume that $\mathcal{A}$ has the following regularity property: Let

$$\mathrm{Py}(a, h, \rho) \stackrel{\mathrm{def}}{=} \left\{ (a', v) \in \mathbb{R}^{d_A + 1} : \|a - a'\|_1 \le \rho,\; 0 \le v/h \le 1 - \|a - a'\|_1 / \rho \right\}$$

denote the pyramid with height $h$ and base given by the $\ell^1$-ball $B(a, \rho) \stackrel{\mathrm{def}}{=} \{ a' \in \mathbb{R}^{d_A} : \|a - a'\|_1 \le \rho \}$ centered at $a$.

Assumption A6 (Regularity of the action space) We assume that there exists $\alpha > 0$, such that for all $a \in \mathcal{A}$, for all $\rho > 0$,

$$\frac{\lambda(\mathrm{Py}(a, 1, \rho) \cap (\mathcal{A} \times \mathbb{R}))}{\lambda(\mathrm{Py}(a, 1, \rho))} \ge \min\left( \alpha, \frac{\lambda(\mathcal{A})}{\lambda(B(a, \rho))} \right).$$

For example, if $\mathcal{A}$ is an $\ell^1$-ball itself, then this assumption will be satisfied with $\alpha = 2^{-d_A}$.

Without assuming any smoothness of the MDP, learning in infinite MDPs looks hard (see, e.g., [12, 13]). Here we employ the following extra condition:

Assumption A7 (Lipschitzness of the MDP in the actions) Assume that the transition probabilities and rewards are Lipschitz w.r.t. their action variable, i.e., there exist $L_P, L_r > 0$ such that for all $(x, a, a') \in \mathcal{X} \times \mathcal{A} \times \mathcal{A}$ and measurable set $B$ of $\mathcal{X}$,

$$|P(B|x, a) - P(B|x, a')| \le L_P \|a - a'\|_1, \qquad |r(x, a) - r(x, a')| \le L_r \|a - a'\|_1.$$

Note that previously Lipschitzness w.r.t. the state variables was used, e.g., in [11] to construct consistent planning algorithms.

3.2.3 Assumptions on the function sets used by the algorithm

These assumptions are less demanding since they are under the control of the user of the algorithm. However, the choice of these function sets will greatly influence the performance of the algorithm, as we shall see from the bounds. The first assumption concerns the class $\mathcal{F}$:

Assumption A8 (Lipschitzness of candidate action-value functions) Assume $\mathcal{F} \subset B(\mathcal{X} \times \mathcal{A})$ and that any element of $\mathcal{F}$ is uniformly Lipschitz in its action-argument in the sense that $|Q(x, a) - Q(x, a')| \le L_A \|a - a'\|_1$ holds for any $x \in \mathcal{X}$, $a, a' \in \mathcal{A}$, and $Q \in \mathcal{F}$.
Footnote 6: Remember that $\lambda_{\mathcal{A}}$ denotes the uniform distribution over the action set $\mathcal{A}$.

We shall also need to control the capacity of our function sets. We assume that the reader is familiar with the concept of VC-dimension (footnote 7). Here we use the pseudo-dimension of function sets that builds upon the concept of VC-dimension:

Definition 3.1 (Pseudo-dimension). The pseudo-dimension $V_{\mathcal{F}^+}$ of $\mathcal{F}$ is defined as the VC-dimension of the subgraphs of functions in $\mathcal{F}$ (hence it is also called the VC-subgraph dimension of $\mathcal{F}$).

Since $\mathcal{A}$ is multidimensional, we define $V_{\Pi^+}$ to be the sum of the pseudo-dimensions of the coordinate projection spaces, $\Pi_k$ of $\Pi$:

$$V_{\Pi^+} = \sum_{k=1}^{d_A} V_{\Pi_k^+}, \qquad \Pi_k = \left\{ \pi_k : \mathcal{X} \to \mathbb{R} \;:\; \pi = (\pi_1, \ldots, \pi_k, \ldots, \pi_{d_A}) \in \Pi \right\}.$$

Now we are ready to state our assumptions on our function sets:

Assumption A9 (Capacity of the function and policy sets) Assume that $\mathcal{F} \subset B(\mathcal{X} \times \mathcal{A}; Q_{\max})$ for $Q_{\max} > 0$ and $V_{\mathcal{F}^+} < +\infty$. Also, $\mathcal{A} \subset [-A_\infty, A_\infty]^{d_A}$ and $V_{\Pi^+} < +\infty$.

Besides their capacity, one shall also control the approximation power of the function sets involved. Let us first consider the policy set $\Pi$. Introduce

$$e^{*}(\mathcal{F}, \Pi) = \sup_{Q \in \mathcal{F}} \inf_{\pi \in \Pi} \nu(E Q - E^{\pi} Q).$$

Note that $\inf_{\pi \in \Pi} \nu(E Q - E^{\pi} Q)$ measures the quality of approximating $\nu E Q$ by $\nu E^{\pi} Q$. Hence, $e^{*}(\mathcal{F}, \Pi)$ measures the worst-case approximation error of $\nu E Q$ as $Q$ is changed within $\mathcal{F}$. This can be made small by choosing $\Pi$ large.

Another related quantity is the one-step Bellman-error of $\mathcal{F}$ w.r.t. $\Pi$. This is defined as follows: For a fixed policy $\pi$, the one-step Bellman-error of $\mathcal{F}$ w.r.t. $T^{\pi}$ is defined as

$$E_1(\mathcal{F}; \pi) = \sup_{Q \in \mathcal{F}} \inf_{Q' \in \mathcal{F}} \|Q' - T^{\pi} Q\|_\nu.$$

Taking again a pessimistic approach, the one-step Bellman-error of $\mathcal{F}$ is defined as

$$E_1(\mathcal{F}, \Pi) = \sup_{\pi \in \Pi} E_1(\mathcal{F}; \pi).$$

Typically by increasing $\mathcal{F}$, $E_1(\mathcal{F}, \Pi)$ can be made smaller (this is discussed at some length in [3]). However, it also holds for both $\Pi$ and $\mathcal{F}$ that making them bigger will increase their capacity (pseudo-dimensions) which leads to an increase of the estimation errors.
Hence, $\mathcal{F}$ and $\Pi$ must be selected to balance the approximation and estimation errors, just like in supervised learning.

Footnote 7: Readers not familiar with VC-dimension are suggested to consult a book, such as the one by Anthony and Bartlett [14].

3.3 The main result

Theorem 3.2. Let $\pi_K$ be a greedy policy w.r.t. $Q_K$, i.e. $\pi_K(x) \in \arg\max_{a \in \mathcal{A}} Q_K(x, a)$. Then under Assumptions A1, A2, and A5–A9, for all $\delta > 0$ we have with probability at least $1 - \delta$: given Assumption A3 (respectively A4), $\|V^{*} - V^{\pi_K}\|_\infty$ (resp. $\|V^{*} - V^{\pi_K}\|_{1,\rho}$) is bounded by

$$C \left\{ \left[ E_1(\mathcal{F}, \Pi) + e^{*}(\mathcal{F}, \Pi) + \frac{(\log N + \log(K/\delta))^{\frac{\kappa+1}{4\kappa}}}{N^{1/4}} \right]^{\frac{1}{d_A + 1}} + \gamma^{K} \right\},$$

where $C$ depends on $d_A$, $V_{\mathcal{F}^+}$, $(V_{\Pi_k^+})_{k=1}^{d_A}$, $\gamma$, $\kappa$, $b$, $\beta$, $C_\nu$ (resp. $C_{\rho,\nu}$), $\Gamma_\nu$, $L_A$, $L_P$, $L_r$, $\alpha$, $\lambda(\mathcal{A})$, $\pi_0$, $Q_{\max}$, $R_{\max}$, $\hat{R}_{\max}$, and $A_\infty$. In particular, $C$ scales with $V^{\frac{\kappa+1}{4\kappa(d_A+1)}}$, where $V = 2 V_{\mathcal{F}^+} + V_{\Pi^+}$ plays the role of the “combined effective” dimension of $\mathcal{F}$ and $\Pi$.

4 Discussion

We have presented what we believe are the first finite-time bounds for continuous-state and action-space RL that uses value functions. Further, this is the first analysis of fitted Q-iteration, an algorithm that has proved to be useful in a number of cases, even when used with non-averagers for which no previous theoretical analysis existed (e.g., [15, 16]). In fact, our main motivation was to show that there is a systematic way of making these algorithms work and to point at possible problem sources at the same time. We discussed why it can be difficult to make these algorithms work in practice. We suggested that either the set of action-value candidates has to be carefully controlled (e.g., assuming uniform Lipschitzness w.r.t. the state variables), or a policy search step is needed, just like in actor-critic algorithms. The bound in this paper is similar in many respects to a previous bound of a Bellman-residual minimization algorithm [2].
It appears that the techniques developed here can be used to obtain results for that algorithm when it is applied to continuous action spaces. Finally, although we have not explored them here, consistency results for FQI can be obtained from our results using standard methods, like the method of sieves. We believe that the methods developed here will eventually lead to algorithms where the function approximation methods are chosen based on the data (similar to adaptive regression methods) so as to optimize performance, which in our opinion is one of the biggest open questions in RL. Currently we are exploring this possibility.

Acknowledgments

András Antos would like to acknowledge support for this project from the Hungarian Academy of Sciences (Bolyai Fellowship). Csaba Szepesvári gratefully acknowledges the support received from the Alberta Ingenuity Fund, NSERC, the Computer and Automation Research Institute of the Hungarian Academy of Sciences.

References

[1] A. Antos, Cs. Szepesvári, and R. Munos. Learning near-optimal policies with Bellman-residual minimization based fitted policy iteration and a single sample path. In COLT-19, pages 574–588, 2006.
[2] A. Antos, Cs. Szepesvári, and R. Munos. Learning near-optimal policies with Bellman-residual minimization based fitted policy iteration and a single sample path. Machine Learning, 2007. (accepted).
[3] A. Antos, Cs. Szepesvári, and R. Munos. Value-iteration based fitted policy iteration: learning with a single trajectory. In IEEE ADPRL, pages 330–337, 2007.
[4] D. P. Bertsekas and S.E. Shreve. Stochastic Optimal Control (The Discrete Time Case). Academic Press, New York, 1978.
[5] D. Ernst, P. Geurts, and L. Wehenkel. Tree-based batch mode reinforcement learning. Journal of Machine Learning Research, 6:503–556, 2005.
[6] R.S. Sutton and A.G. Barto. Reinforcement Learning: An Introduction. Bradford Book. MIT Press, 1998.
[7] N. Cristianini and J. Shawe-Taylor. An introduction to support vector machines (and other kernel-based learning methods). Cambridge University Press, 2000.
[8] J.A. Boyan and A.W. Moore. Generalization in reinforcement learning: Safely approximating the value function. In NIPS-7, pages 369–376, 1995.
[9] P.L. Bartlett, P.M. Long, and R.C. Williamson. Fat-shattering and the learnability of real-valued functions. Journal of Computer and System Sciences, 52:434–452, 1996.
[10] A.N. Kolmogorov and V.M. Tihomirov. ε-entropy and ε-capacity of sets in functional space. American Mathematical Society Translations, 17(2):277–364, 1961.
[11] R. Munos and Cs. Szepesvári. Finite time bounds for sampling based fitted value iteration. Technical report, Computer and Automation Research Institute of the Hungarian Academy of Sciences, Kende u. 13-17, Budapest 1111, Hungary, 2006.
[12] A.Y. Ng and M. Jordan. PEGASUS: A policy search method for large MDPs and POMDPs. In Proceedings of the 16th Conference in Uncertainty in Artificial Intelligence, pages 406–415, 2000.
[13] P.L. Bartlett and A. Tewari. Sample complexity of policy search with known dynamics. In NIPS-19. MIT Press, 2007.
[14] M. Anthony and P. L. Bartlett. Neural Network Learning: Theoretical Foundations. Cambridge University Press, 1999.
[15] M. Riedmiller. Neural fitted Q iteration – first experiences with a data efficient neural reinforcement learning method. In 16th European Conference on Machine Learning, pages 317–328, 2005.
[16] S. Kalyanakrishnan and P. Stone. Batch reinforcement learning in a complex domain. In AAMAS-07, 2007.

4 0.56672502 125 nips-2007-Markov Chain Monte Carlo with People

Author: Adam Sanborn, Thomas L. Griffiths

Abstract: Many formal models of cognition implicitly use subjective probability distributions to capture the assumptions of human learners. Most applications of these models determine these distributions indirectly. We propose a method for directly determining the assumptions of human learners by sampling from subjective probability distributions. Using a correspondence between a model of human choice and Markov chain Monte Carlo (MCMC), we describe a method for sampling from the distributions over objects that people associate with different categories. In our task, subjects choose whether to accept or reject a proposed change to an object. The task is constructed so that these decisions follow an MCMC acceptance rule, defining a Markov chain for which the stationary distribution is the category distribution. We test this procedure for both artificial categories acquired in the laboratory, and natural categories acquired from experience.

5 0.55695009 168 nips-2007-Reinforcement Learning in Continuous Action Spaces through Sequential Monte Carlo Methods

Author: Alessandro Lazaric, Marcello Restelli, Andrea Bonarini

Abstract: Learning in real-world domains often requires dealing with continuous state and action spaces. Although many solutions have been proposed to apply Reinforcement Learning algorithms to continuous state problems, the same techniques can hardly be extended to continuous action spaces, where, besides the computation of a good approximation of the value function, a fast method for the identification of the highest-valued action is needed. In this paper, we propose a novel actor-critic approach in which the policy of the actor is estimated through sequential Monte Carlo methods. The importance sampling step is performed on the basis of the values learned by the critic, while the resampling step modifies the actor’s policy. The proposed approach has been empirically compared to other learning algorithms in several domains; in this paper, we report results obtained in a control problem consisting of steering a boat across a river.

6 0.54929638 162 nips-2007-Random Sampling of States in Dynamic Programming

7 0.53964829 98 nips-2007-Hierarchical Apprenticeship Learning with Application to Quadruped Locomotion

8 0.51289493 185 nips-2007-Stable Dual Dynamic Programming

9 0.51222914 148 nips-2007-Online Linear Regression and Its Application to Model-Based Reinforcement Learning

10 0.50761455 163 nips-2007-Receding Horizon Differential Dynamic Programming

11 0.48483828 102 nips-2007-Incremental Natural Actor-Critic Algorithms

12 0.48432818 209 nips-2007-Ultrafast Monte Carlo for Statistical Summations

13 0.44487897 213 nips-2007-Variational Inference for Diffusion Processes

14 0.40418547 124 nips-2007-Managing Power Consumption and Performance of Computing Systems Using Reinforcement Learning

15 0.40109161 178 nips-2007-Simulated Annealing: Rigorous finite-time guarantees for optimization on continuous domains

16 0.38263774 19 nips-2007-Active Preference Learning with Discrete Choice Data

17 0.38136703 214 nips-2007-Variational inference for Markov jump processes

18 0.38105154 139 nips-2007-Nearest-Neighbor-Based Active Learning for Rare Category Detection

19 0.36450729 52 nips-2007-Competition Adds Complexity

20 0.36331511 146 nips-2007-On higher-order perceptron algorithms


similar papers computed by lda model

lda for this paper:

topicId topicWeight

[(5, 0.03), (9, 0.01), (13, 0.066), (16, 0.024), (18, 0.09), (19, 0.012), (21, 0.086), (31, 0.034), (34, 0.024), (35, 0.021), (47, 0.127), (48, 0.175), (56, 0.014), (83, 0.099), (85, 0.032), (87, 0.019), (90, 0.058)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.81045258 34 nips-2007-Bayesian Policy Learning with Trans-Dimensional MCMC

Author: Matthew Hoffman, Arnaud Doucet, Nando D. Freitas, Ajay Jasra

Abstract: A recently proposed formulation of the stochastic planning and control problem as one of parameter estimation for suitable artificial statistical models has led to the adoption of inference algorithms for this notoriously hard problem. At the algorithmic level, the focus has been on developing Expectation-Maximization (EM) algorithms. In this paper, we begin by making the crucial observation that the stochastic control problem can be reinterpreted as one of trans-dimensional inference. With this new interpretation, we are able to propose a novel reversible jump Markov chain Monte Carlo (MCMC) algorithm that is more efficient than its EM counterparts. Moreover, it enables us to implement full Bayesian policy search, without the need for gradients and with one single Markov chain. The new approach involves sampling directly from a distribution that is proportional to the reward and, consequently, performs better than classic simulations methods in situations where the reward is a rare event.

2 0.73988789 213 nips-2007-Variational Inference for Diffusion Processes

Author: Cédric Archambeau, Manfred Opper, Yuan Shen, Dan Cornford, John S. Shawe-taylor

Abstract: Diffusion processes are a family of continuous-time continuous-state stochastic processes that are in general only partially observed. The joint estimation of the forcing parameters and the system noise (volatility) in these dynamical systems is a crucial, but non-trivial task, especially when the system is nonlinear and multimodal. We propose a variational treatment of diffusion processes, which allows us to compute type II maximum likelihood estimates of the parameters by simple gradient techniques and which is computationally less demanding than most MCMC approaches. We also show how a cheap estimate of the posterior over the parameters can be constructed based on the variational free energy.

3 0.72310543 3 nips-2007-A Bayesian Model of Conditioned Perception

Author: Alan Stocker, Eero P. Simoncelli

Abstract: unknown-abstract

4 0.71007138 93 nips-2007-GRIFT: A graphical model for inferring visual classification features from human data

Author: Michael Ross, Andrew Cohen

Abstract: This paper describes a new model for human visual classification that enables the recovery of image features that explain human subjects’ performance on different visual classification tasks. Unlike previous methods, this algorithm does not model their performance with a single linear classifier operating on raw image pixels. Instead, it represents classification as the combination of multiple feature detectors. This approach extracts more information about human visual classification than previous methods and provides a foundation for further exploration.

5 0.70765543 100 nips-2007-Hippocampal Contributions to Control: The Third Way

Author: Máté Lengyel, Peter Dayan

Abstract: Recent experimental studies have focused on the specialization of different neural structures for different types of instrumental behavior. Recent theoretical work has provided normative accounts for why there should be more than one control system, and how the output of different controllers can be integrated. Two particular controllers have been identified, one associated with a forward model and the prefrontal cortex and a second associated with computationally simpler, habitual, actor-critic methods and part of the striatum. We argue here for the normative appropriateness of an additional, but so far marginalized control system, associated with episodic memory, and involving the hippocampus and medial temporal cortices. We analyze in depth a class of simple environments to show that episodic control should be useful in a range of cases characterized by complexity and inferential noise, and most particularly at the very early stages of learning, long before habitization has set in. We interpret data on the transfer of control from the hippocampus to the striatum in the light of this hypothesis.

6 0.70348233 168 nips-2007-Reinforcement Learning in Continuous Action Spaces through Sequential Monte Carlo Methods

7 0.70228761 86 nips-2007-Exponential Family Predictive Representations of State

8 0.69915956 153 nips-2007-People Tracking with the Laplacian Eigenmaps Latent Variable Model

9 0.6965282 125 nips-2007-Markov Chain Monte Carlo with People

10 0.69545472 203 nips-2007-The rat as particle filter

11 0.68944913 18 nips-2007-A probabilistic model for generating realistic lip movements from speech

12 0.68525249 122 nips-2007-Locality and low-dimensions in the prediction of natural experience from fMRI

13 0.68428034 63 nips-2007-Convex Relaxations of Latent Variable Training

14 0.679591 105 nips-2007-Infinite State Bayes-Nets for Structured Domains

15 0.67528456 2 nips-2007-A Bayesian LDA-based model for semi-supervised part-of-speech tagging

16 0.67451322 148 nips-2007-Online Linear Regression and Its Application to Model-Based Reinforcement Learning

17 0.67392498 94 nips-2007-Gaussian Process Models for Link Analysis and Transfer Learning

18 0.67358851 24 nips-2007-An Analysis of Inference with the Universum

19 0.67305565 154 nips-2007-Predicting Brain States from fMRI Data: Incremental Functional Principal Component Regression

20 0.67272311 47 nips-2007-Collapsed Variational Inference for HDP