nips nips2007 nips2007-185 knowledge-graph by maker-knowledge-mining

185 nips-2007-Stable Dual Dynamic Programming


Source: pdf

Author: Tao Wang, Michael Bowling, Dale Schuurmans, Daniel J. Lizotte

Abstract: Recently, we have introduced a novel approach to dynamic programming and reinforcement learning that is based on maintaining explicit representations of stationary distributions instead of value functions. In this paper, we investigate the convergence properties of these dual algorithms both theoretically and empirically, and show how they can be scaled up by incorporating function approximation. 1

Reference: text


Summary: the most important sentenses genereted by tfidf model

sentIndex sentText sentNum sentScore

1 ca Abstract Recently, we have introduced a novel approach to dynamic programming and reinforcement learning that is based on maintaining explicit representations of stationary distributions instead of value functions. [sent-3, score-0.32]

2 In this paper, we investigate the convergence properties of these dual algorithms both theoretically and empirically, and show how they can be scaled up by incorporating function approximation. [sent-4, score-0.457]

3 1 Introduction Value function representations are dominant in algorithms for dynamic programming (DP) and reinforcement learning (RL). [sent-5, score-0.281]

4 In LP methods, value functions only correspond to the primal formulation of the problem, while in the dual they are replaced by the notion of state (or state-action) visit distributions [1, 2, 3]. [sent-7, score-0.899]

5 Despite the well known LP duality, dual representations have not been widely explored in DP and RL. [sent-8, score-0.445]

6 Recently, we have showed that it is entirely possible to solve DP and RL problems in the dual representation [4]. [sent-9, score-0.378]

7 In this paper, we investigate the convergence properties of these newly proposed dual solution techniques, and show how they can be scaled up by incorporating function approximation. [sent-11, score-0.457]

8 In particular, we find that the standard convergence results for value based approaches also apply to the dual case, even in the presence of function approximation and off-policy updating. [sent-13, score-0.486]

9 The dual approach appears to hold an advantage over the standard primal view of DP/RL in one major sense: since the fundamental objects being represented are normalized probability distributions (i. [sent-14, score-0.75]

10 , belong to a bounded simplex), dual updates cannot diverge. [sent-16, score-0.454]

11 In particular, we find that dual updates converge (i. [sent-17, score-0.488]

12 avoid oscillation) in the very circumstance where primal updates can and often do diverge: gradient-based off-policy updates with linear function approximation [5, 6]. [sent-19, score-0.682]

13 Below, we represent a policy π by an equivalent representation as an |S| × |S||A| matrix Π where Π(s,s a) = π (sa) if s = s, otherwise 0. [sent-22, score-0.263]

14 One can quickly verify that the matrix product ΠP gives the state-to-state transition probabilities induced by the policy π in the environment P , and that P Π gives the state-action to state-action transition probabilities induced by policy π in P . [sent-23, score-0.529]

15 The problem is to compute an optimal policy given either (a) a complete ∗ Current affiliation: Computer Sciences Laboratory, Australian National University, tao. [sent-24, score-0.215]

16 3 Dual Representations Traditionally, DP methods for solving the MDP planning problem are typically expressed in terms of the primal value function. [sent-31, score-0.427]

17 In the primal representation, the policy state-action value function can be specified by an |S||A|×1 ∞ i i vector q = i=0 γ (P Π) r which satisfies q = r + γP Πq. [sent-33, score-0.583]

18 To develop a dual form of stateaction policy evaluation, one considers the linear system d = (1 − γ)ν + γd P Π, where ν is the initial distribution over state-action pairs. [sent-34, score-0.594]

19 Not only is d a proper probability distribution over state-action pairs, it also allows one to easily compute the expected discounted return of the policy π. [sent-35, score-0.242]

20 However, recovering the state-action distribution d is inadequate for policy improvement. [sent-36, score-0.215]

21 The matrix H that satisfies this linear relation is similar to d , in that each row is a probability distribution and the entries H(sa,s a ) correspond to the probability of discounted state-action visits to (s a ) for a policy π starting in state-action pair (sa). [sent-38, score-0.324]

22 For policy improvement, in the primal representation one can derive an improved policy π via the update a∗ (s) = arg maxa q(sa) and π (sa) = 1 if a = a∗ (s), otherwise 0. [sent-41, score-0.965]

23 The dual form of the policy update can be expressed in terms of the state-action matrix H for π is a ∗ (s) = arg maxa H(sa,:) r. [sent-42, score-0.773]

24 In fact, since (1 − γ)q = Hr, the two policy updates given in the primal and dual respectively, must lead to the same resulting policy π . [sent-43, score-1.252]

25 4 DP algorithms and convergence We first investigate whether dynamic programming operators with the dual representations exhibit the same (or better) convergence properties to their primal counterparts. [sent-45, score-1.233]

26 In the tabular case, dynamic programming algorithms can be expressed by operators that are successively applied to current approximations (vectors in the primal case, matrices in the dual), to bring them closer to a target solution; namely, the fixed point of a desired Bellman equation. [sent-47, score-0.835]

27 For a given policy Π, the on-policy operator O is defined as Oq = r + γP Πq and OH = (1 − γ)I + γP ΠH, for the primal and dual cases respectively. [sent-49, score-1.156]

28 The goal of this greedy update is to bring the representations closer to satisfying the optimal-policy Bellman equations q = r + γP Π∗ [q] and H = (1 − γ)I + γP Π∗ [H]. [sent-51, score-0.219]

29 1 On-policy convergence For the on-policy operator O, convergence to the Bellman fixed point is easily proved in the primal case, by establishing a contraction property of O with respect to a specific norm on q vectors. [sent-53, score-0.993]

30 In particular, one defines a weighted 2-norm with weights given by the stationary distribution determined by the policy Π and transition model P : Let z ≥ 0 be a vector such that z P Π = z ; that is, z is the stationary state-action visit distribution for P Π. [sent-54, score-0.538]

31 By the contraction map fixed point theorem [2] there exists a unique fixed point of O in the space of vectors q. [sent-58, score-0.24]

32 Therefore, repeated applications of the on-policy operator converge to a vector qΠ such that qΠ = OqΠ ; that is, qΠ satisfies the policy based Bellman equation. [sent-59, score-0.467]

33 Analogously, for the dual representation H, one can establish convergence of the on-policy operator by first defining an approximate weighted norm over matrices and then verifying that O is a contraction with respect to this norm. [sent-60, score-1.032]

34 Interestingly, this definition allows us to establish the same non-expansion and contraction results as the primal case. [sent-64, score-0.585]

35 We can have P ΠH z,r ≤ H z,r by arguments similar to the primal case. [sent-65, score-0.368]

36 Moreover, the on-policy operator is a contraction with respect to · z,r . [sent-66, score-0.416]

37 This argument shows that on-policy dynamic programming converges in the dual representation, without making direct reference to the primal case. [sent-69, score-0.917]

38 2 Max-policy convergence The strategy for establishing convergence for the nonlinear max operator is similar to the on-policy case, but involves working with a different norm. [sent-72, score-0.36]

39 Instead of considering a 2-norm weighted by the visit probabilities induced by a fixed policy, one simply uses the max-norm in this case: q ∞ = max(sa) |q(sa) |. [sent-73, score-0.208]

40 The contraction property of the M operator with respect to this norm can then be easily established in the primal case: Mq1 − Mq2 ∞ ≤ γ q1 − q2 ∞ (see [2]). [sent-74, score-0.819]

41 As in the on-policy case, contraction suffices to establish the existence of a unique fixed point of M among vectors q, and that repeated application of M converges to this fixed point q∗ such that Mq∗ = q∗ . [sent-75, score-0.281]

42 However, one subtlety here is that the dual fixed point is not unique. [sent-80, score-0.424]

43 This is not a contradiction because the norm on dual representations · z,r is in fact just a pseudo-norm, not a proper norm. [sent-81, score-0.48]

44 5 DP with function approximation Primal and dual updates exhibit strong equivalence in the tabular case, as they should. [sent-85, score-0.579]

45 We next consider the convergence properties of the dynamic programming operators in the context of linear basis approximation. [sent-87, score-0.373]

46 We focus on the on-policy case here, because, famously, the max operator does not always have a fixed point when combined with approximation in the primal case [8], and consequently suffers the risk of divergence [5, 6]. [sent-88, score-0.708]

47 Note that the max operator cannot diverge in the dual case, even with basis approximation, by boundedness alone; although the question of whether max updates always converge in this case remains open. [sent-89, score-0.796]

48 Here we establish that a similar bound on approximation error in the primal case can be proved for the dual approach with respect to the on-policy operator. [sent-90, score-0.846]

49 In the primal case, linear approximation proceeds by fixing a small set of basis functions, forming a |S||A|×k matrix Φ, where k is the number of bases. [sent-91, score-0.508]

50 The approximation of q can be expressed ˆ by a linear combination of bases q = Φw where w is a k ×1 vector of adjustable weights. [sent-92, score-0.238]

51 ˆ To ensure that H remains a nonnegative, row normalized approximation to H, we simply add the ˆ ˆ ˆ constraints that H ∈ simplex(Ψ) ≡ {H : vec(H) = Ψw, Ψ ≥ 0,(1 ⊗I)Ψ = 11 ,w ≥ 0, w 1 = 1} where the operator ⊗ is the Kronecker product. [sent-95, score-0.338]

52 In this section, we first introduce operators (projection and gradient step operators) that ensure the approximations stay representable in the given basis. [sent-96, score-0.287]

53 For the composition of the on-policy update and projection operators, we establish a similar bound on approximation error in the dual case as in the primal case. [sent-98, score-1.081]

54 The subtlety resolved by Tsitsiklis and Van Roy [7] is to identify a particular form of best approximation—weighted least squares—that ensures convergence is still achieved when combined with the on-policy operator O. [sent-102, score-0.326]

55 Unfortunately, the fixed point of this combined update operator is not guaranteed to be the best representable approximation of O’s fixed point, qΠ . [sent-103, score-0.497]

56 The map from a general q vector onto its best approximation in col span(Φ) is defined by another operator, P, which projects q into the column span of ˆ 2 ˆ Φ, Pq = argminq∈col span(Φ) q − q z = Φ(Φ ZΦ)−1 Φ Zq, where q is an approximation for ˆ value function q. [sent-106, score-0.276]

57 The important property of this weighted projection is that it is a non-expansion operator in · z , i. [sent-107, score-0.359]

58 Approximate dynamic programming then proceeds by composing the two operators—the on-policy update O with the subspace projection P—to compute the best representable approximation of the one step update. [sent-110, score-0.601]

59 This combined operator is guaranteed to converge, since composing a 1 non-expansion with a contraction is still a contraction, i. [sent-111, score-0.467]

60 Linear function approximation in the dual case is a bit more complicated because matrices are being represented, not vectors, and moreover the matrices need to satisfy row normalization and nonnegativity constraints. [sent-114, score-0.566]

61 Nevertheless, a very similar approach to the primal case can be successfully applied. [sent-115, score-0.368]

62 Recall that in the dual, the state-action visit distribution H is approximated by a linear combination of bases in Ψ. [sent-116, score-0.294]

63 As in the primal case, there is no reason to expect that an update like OH should keep the matrix in the simplex. [sent-117, score-0.492]

64 Therefore, a projection operator must be constructed that determines the best representable approximation to OH. [sent-118, score-0.475]

65 Define the weighted projection operator P over matrices ˆ PH = argmin H − H z,r2 (3) ˆ H∈simplex(Ψ) The projection could be obtained by solving the above quadratic program. [sent-121, score-0.527]

66 A key result is that this projection operator is a non-expansion with respect to the pseudo-norm · z,r . [sent-122, score-0.349]

67 Note that the last projection into the nonnegative halfspace is equivalent to a projection into a linear subspace for some hyperplane tangent to the simplex. [sent-124, score-0.306]

68 Consider just one of these linear projections P 1 2 2 2 H z,r = P1 H + H − P1 H z,r = P1 Hr + Hr − P1 Hr z 2 2 2 2 = P1 Hr z + Hr − P1 Hr z = P1 H z,r + H − P1 H z,r Since the overall projection is just a composition of non-expansions, it must be a non-expansion. [sent-126, score-0.222]

69 As in the primal, approximate dynamic programming can be implemented by composing the onpolicy update O with the projection operator P. [sent-127, score-0.633]

70 Finally, we can recover an approximation bound that is analogous to the primal bound, which bounds the approximation error between H+ and the best representable approximation to the on-policy fixed point HΠ = OHΠ . [sent-132, score-0.668]

71 To compare the primal and dual results, note that despite the similarity of the bounds, the projection operators do not preserve the tight relationship between primal and dual updates. [sent-137, score-1.668]

72 Automatically, the dual updates cannot diverge with compositions like PO and PM; yet, in the primal case, the update PM is known to not have fixed points in some circumstances [8]. [sent-140, score-0.98]

73 Conveniently, the gradient update and projection operators are independent of the on-policy and off-policy updates and can be applied in either case. [sent-145, score-0.483]

74 However, as we will see below, the gradient update operator causes significant instability in the off-policy update, to the degree that divergence is a common phenomenon (much more so than with full projections). [sent-146, score-0.41]

75 Composing approximation with an off-policy update (max operator) in the primal case can be very dangerous. [sent-147, score-0.527]

76 All other operator combinations are better behaved in practice, and even those that are not known to converge usually behave reasonably. [sent-148, score-0.252]

77 Unfortunately, composing the gradient step with an off-policy update is a common algorithm attempted in reinforcement learning (Q-learning with function approximation), despite being the most unstable. [sent-149, score-0.292]

78 In the dual representation, one can derive a gradient update operator in a similar way to the primal, except that it is important to maintain the constraints on the parameters w, since the basis functions are probability distributions. [sent-150, score-0.794]

79 , h) is determined by the underlying dynamic programming update, this gives the composed updates 1 ˆ ˆ ˆ ˆ GOh = h − αΨ(I − 11 )Γ Z(r ⊗I)(h−O h) and k 1 ˆ ˆ ˆ ˆ GMh = h − αΨ(I − 11 )Γ (r ⊗I)(h−Mh) k respectively for the on-policy and off-policy cases (ignoring the additional equality constraints). [sent-158, score-0.255]

80 6 Experimental Results To investigate the effectiveness of the dual representations, we conducted experiments on various domains, including randomly synthesized MDPs, Baird’s star problem [5], and on the mountain car problem. [sent-163, score-0.751]

81 The algorithms were: tabular on-policy (O), projection on-policy (PO), gradient on-policy (GO), tabular off-policy (M), projection off-policy (PM), and gradient off-policy (GM), for both the primal and the dual. [sent-167, score-0.842]

82 For off-policy algorithms, we measure the difference between the values generated by the resulting policy and the values of the optimal policy. [sent-171, score-0.236]

83 The initial values of state-action value functions q are set according to the standard normal distribution, and state-action visit distributions H are chosen uniformly randomly with row normalization. [sent-174, score-0.232]

84 For the synthesized MDPs, we generated the transition and reward functions of the MDPs randomly—the transition function is uniformly distributed between 0 and 1 and the reward function is drawn from a standard normal. [sent-177, score-0.239]

85 Here we only reported the results of random MDPs with 100 states, 5 actions, and 10 bases, observed consistent convergence of the dual representations on a variety of MDPs, with different numbers of states, actions, and bases. [sent-178, score-0.516]

86 In Figure 1(right), the curve for the gradient off-policy update (GM) in the primal case (dotted line with the circle marker) blows up (diverges), while all the other algorithms in Figure 1 converge. [sent-179, score-0.53]

87 Interestingly, the approximate error of the dual algorithm POH (4. [sent-180, score-0.355]

88 60×10−3 ) is much smaller than the approximate error of the corresponding primal algorithm POq (4. [sent-181, score-0.368]

89 In these experiments, we used the same fixed policy and linear value function approximation as in [5]. [sent-186, score-0.299]

90 In the dual, the number of bases is also set to 14 and the initial values of the state-action visit distribution matrix H are uniformly distributed random numbers between 0 and 1 with row normalization. [sent-187, score-0.328]

91 The gradient off-policy update in the primal case diverges (see the dotted line with the circle marker in Figure 2(right)). [sent-188, score-0.595]

92 However, all the updates with the dual representation algorithms converge. [sent-189, score-0.477]

93 The number of bases was chosen to be 5 for both the primal and dual algorithms. [sent-191, score-0.817]

94 In the primal representations with linear function approximation, we randomly generated basis functions according to the standard normal distribution. [sent-193, score-0.536]

95 In the dual representations, we randomly picked the basis distributions according to the uniform distribution. [sent-194, score-0.409]

96 In Figure 3(right), we again observed divergence of the gradient off-policy update on state-action values in the primal, and the convergence of all the dual algorithms (see Figure 3). [sent-195, score-0.618]

97 Again, the approximation error of the projected on-policy update POH in the dual (1. [sent-196, score-0.514]

98 We extended the dual dynamic programming algorithms with linear function approximation, and studied the convergence properties of the dual algorithms for planning in MDPs. [sent-200, score-0.964]

99 We demonstrated that dual algorithms, since they are based on estimating normalized probability distributions rather than unbounded value functions, avoid divergence even in the presence of approximation and off-policy updates. [sent-201, score-0.494]

100 Moreover, dual algorithms remain stable in situations where standard value function estimation diverges. [sent-202, score-0.355]


similar papers computed by tfidf model

tfidf for this paper:

wordName wordTfidf (topN-words)

[('primal', 0.368), ('dual', 0.355), ('oh', 0.257), ('sa', 0.226), ('operator', 0.218), ('policy', 0.215), ('hr', 0.209), ('oq', 0.185), ('contraction', 0.176), ('visit', 0.176), ('poh', 0.166), ('ph', 0.147), ('mountain', 0.129), ('mq', 0.129), ('operators', 0.113), ('poq', 0.111), ('projection', 0.109), ('pm', 0.102), ('update', 0.099), ('updates', 0.099), ('gm', 0.094), ('bases', 0.094), ('representations', 0.09), ('representable', 0.088), ('simplex', 0.087), ('mdps', 0.082), ('star', 0.076), ('composing', 0.073), ('car', 0.072), ('span', 0.072), ('convergence', 0.071), ('vec', 0.071), ('dp', 0.068), ('dynamic', 0.068), ('programming', 0.066), ('bellman', 0.066), ('synthesized', 0.065), ('tabular', 0.065), ('gradient', 0.063), ('reference', 0.06), ('approximation', 0.06), ('diverge', 0.059), ('matrices', 0.059), ('reinforcement', 0.057), ('col', 0.055), ('po', 0.055), ('reward', 0.05), ('composition', 0.049), ('jh', 0.048), ('pq', 0.048), ('mdp', 0.046), ('maxa', 0.045), ('pythagorean', 0.044), ('mh', 0.044), ('establish', 0.041), ('projections', 0.04), ('diverges', 0.039), ('stationary', 0.039), ('subspace', 0.038), ('mhr', 0.037), ('subtlety', 0.037), ('rl', 0.037), ('transition', 0.037), ('norm', 0.035), ('lp', 0.035), ('converge', 0.034), ('expressed', 0.034), ('xed', 0.033), ('row', 0.033), ('circumstance', 0.032), ('weighted', 0.032), ('point', 0.032), ('basis', 0.031), ('investigate', 0.031), ('divergence', 0.03), ('bring', 0.03), ('bowling', 0.029), ('onto', 0.029), ('actions', 0.029), ('maintain', 0.028), ('zq', 0.027), ('normalized', 0.027), ('discounted', 0.027), ('tsitsiklis', 0.026), ('adjustable', 0.026), ('marker', 0.026), ('nonnegative', 0.026), ('matrix', 0.025), ('planning', 0.025), ('linear', 0.024), ('randomly', 0.023), ('satis', 0.023), ('representation', 0.023), ('stay', 0.023), ('respect', 0.022), ('equality', 0.022), ('steps', 0.022), ('unbounded', 0.022), ('wt', 0.022), ('difference', 0.021)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 1.0000001 185 nips-2007-Stable Dual Dynamic Programming

Author: Tao Wang, Michael Bowling, Dale Schuurmans, Daniel J. Lizotte

Abstract: Recently, we have introduced a novel approach to dynamic programming and reinforcement learning that is based on maintaining explicit representations of stationary distributions instead of value functions. In this paper, we investigate the convergence properties of these dual algorithms both theoretically and empirically, and show how they can be scaled up by incorporating function approximation. 1

2 0.15250193 5 nips-2007-A Game-Theoretic Approach to Apprenticeship Learning

Author: Umar Syed, Robert E. Schapire

Abstract: We study the problem of an apprentice learning to behave in an environment with an unknown reward function by observing the behavior of an expert. We follow on the work of Abbeel and Ng [1] who considered a framework in which the true reward function is assumed to be a linear combination of a set of known and observable features. We give a new algorithm that, like theirs, is guaranteed to learn a policy that is nearly as good as the expert’s, given enough examples. However, unlike their algorithm, we show that ours may produce a policy that is substantially better than the expert’s. Moreover, our algorithm is computationally faster, is easier to implement, and can be applied even in the absence of an expert. The method is based on a game-theoretic view of the problem, which leads naturally to a direct application of the multiplicative-weights algorithm of Freund and Schapire [2] for playing repeated matrix games. In addition to our formal presentation and analysis of the new algorithm, we sketch how the method can be applied when the transition function itself is unknown, and we provide an experimental demonstration of the algorithm on a toy video-game environment. 1

3 0.14517356 102 nips-2007-Incremental Natural Actor-Critic Algorithms

Author: Shalabh Bhatnagar, Mohammad Ghavamzadeh, Mark Lee, Richard S. Sutton

Abstract: We present four new reinforcement learning algorithms based on actor-critic and natural-gradient ideas, and provide their convergence proofs. Actor-critic reinforcement learning methods are online approximations to policy iteration in which the value-function parameters are estimated using temporal difference learning and the policy parameters are updated by stochastic gradient descent. Methods based on policy gradients in this way are of special interest because of their compatibility with function approximation methods, which are needed to handle large or infinite state spaces. The use of temporal difference learning in this way is of interest because in many applications it dramatically reduces the variance of the gradient estimates. The use of the natural gradient is of interest because it can produce better conditioned parameterizations and has been shown to further reduce variance in some cases. Our results extend prior two-timescale convergence results for actor-critic methods by Konda and Tsitsiklis by using temporal difference learning in the actor and by incorporating natural gradients, and they extend prior empirical studies of natural actor-critic methods by Peters, Vijayakumar and Schaal by providing the first convergence proofs and the first fully incremental algorithms. 1

4 0.12609148 34 nips-2007-Bayesian Policy Learning with Trans-Dimensional MCMC

Author: Matthew Hoffman, Arnaud Doucet, Nando D. Freitas, Ajay Jasra

Abstract: A recently proposed formulation of the stochastic planning and control problem as one of parameter estimation for suitable artificial statistical models has led to the adoption of inference algorithms for this notoriously hard problem. At the algorithmic level, the focus has been on developing Expectation-Maximization (EM) algorithms. In this paper, we begin by making the crucial observation that the stochastic control problem can be reinterpreted as one of trans-dimensional inference. With this new interpretation, we are able to propose a novel reversible jump Markov chain Monte Carlo (MCMC) algorithm that is more efficient than its EM counterparts. Moreover, it enables us to implement full Bayesian policy search, without the need for gradients and with one single Markov chain. The new approach involves sampling directly from a distribution that is proportional to the reward and, consequently, performs better than classic simulations methods in situations where the reward is a rare event.

5 0.12485795 91 nips-2007-Fitted Q-iteration in continuous action-space MDPs

Author: András Antos, Csaba Szepesvári, Rémi Munos

Abstract: We consider continuous state, continuous action batch reinforcement learning where the goal is to learn a good policy from a sufficiently rich trajectory generated by some policy. We study a variant of fitted Q-iteration, where the greedy action selection is replaced by searching for a policy in a restricted set of candidate policies by maximizing the average action values. We provide a rigorous analysis of this algorithm, proving what we believe is the first finite-time bound for value-function based algorithms for continuous state and action problems. 1 Preliminaries We will build on the results from [1, 2, 3] and for this reason we use the same notation as these papers. The unattributed results cited in this section can be found in the book [4]. A discounted MDP is defined by a quintuple (X , A, P, S, γ), where X is the (possible infinite) state space, A is the set of actions, P : X × A → M (X ) is the transition probability kernel with P (·|x, a) defining the next-state distribution upon taking action a from state x, S(·|x, a) gives the corresponding distribution of immediate rewards, and γ ∈ (0, 1) is the discount factor. Here X is a measurable space and M (X ) denotes the set of all probability measures over X . The Lebesguemeasure shall be denoted by λ. We start with the following mild assumption on the MDP: Assumption A1 (MDP Regularity) X is a compact subset of the dX -dimensional Euclidean space, ˆ A is a compact subset of [−A∞ , A∞ ]dA . The random immediate rewards are bounded by Rmax and that the expected immediate reward function, r(x, a) = rS(dr|x, a), is uniformly bounded by Rmax : r ∞ ≤ Rmax . A policy determines the next action given the past observations. Here we shall deal with stationary (Markovian) policies which choose an action in a stochastic way based on the last observation only. The value of a policy π when it is started from a state x is defined as the total expected discounted ∞ reward that is encountered while the policy is executed: V π (x) = Eπ [ t=0 γ t Rt |X0 = x]. Here Rt ∼ S(·|Xt , At ) is the reward received at time step t, the state, Xt , evolves according to Xt+1 ∼ ∗ Also with: Computer and Automation Research Inst. of the Hungarian Academy of Sciences Kende u. 13-17, Budapest 1111, Hungary. 1 P (·|Xt , At ), where At is sampled from the distribution determined by π. We use Qπ : X × A → R ∞ to denote the action-value function of policy π: Qπ (x, a) = Eπ [ t=0 γ t Rt |X0 = x, A0 = a]. The goal is to find a policy that attains the best possible values, V ∗ (x) = supπ V π (x), at all states ∗ x ∈ X . Here V ∗ is called the optimal value function and a policy π ∗ that satisfies V π (x) = ∗ ∗ ∗ V (x) for all x ∈ X is called optimal. The optimal action-value function Q (x, a) is Q (x, a) = supπ Qπ (x, a). We say that a (deterministic stationary) policy π is greedy w.r.t. an action-value function Q ∈ B(X × A), and we write π = π (·; Q), if, for all x ∈ X , π(x) ∈ argmaxa∈A Q(x, a). ˆ Under mild technical assumptions, such a greedy policy always exists. Any greedy policy w.r.t. Q∗ is optimal. For π : X → A we define its evaluation operator, T π : B(X × A) → B(X × A), by (T π Q)(x, a) = r(x, a) + γ X Q(y, π(y)) P (dy|x, a). It is known that Qπ = T π Qπ . Further, if we let the Bellman operator, T : B(X × A) → B(X × A), defined by (T Q)(x, a) = r(x, a) + γ X supb∈A Q(y, b) P (dy|x, a) then Q∗ = T Q∗ . It is known that V π and Qπ are bounded by Rmax /(1 − γ), just like Q∗ and V ∗ . For π : X → A, the operator E π : B(X × A) → B(X ) is defined by (E π Q)(x) = Q(x, π(x)), while E : B(X × A) → B(X ) is defined by (EQ)(x) = supa∈A Q(x, a). Throughout the paper F ⊂ {f : X × A → R} will denote a subset of real-valued functions over the state-action space X × A and Π ⊂ AX will be a set of policies. For ν ∈ M (X ) and f : X → R p measurable, we let (for p ≥ 1) f p,ν = X |f (x)|p ν(dx). We simply write f ν for f 2,ν . 2 Further, we extend · ν to F by f ν = A X |f |2 (x, a) dν(x) dλA (a), where λA is the uniform distribution over A. We shall use the shorthand notation νf to denote the integral f (x)ν(dx). We denote the space of bounded measurable functions with domain X by B(X ). Further, the space of measurable functions bounded by 0 < K < ∞ shall be denoted by B(X ; K). We let · ∞ denote the supremum norm. 2 Fitted Q-iteration with approximate policy maximization We assume that we are given a finite trajectory, {(Xt , At , Rt )}1≤t≤N , generated by some stochastic stationary policy πb , called the behavior policy: At ∼ πb (·|Xt ), Xt+1 ∼ P (·|Xt , At ), Rt ∼ def S(·|Xt , At ), where πb (·|x) is a density with π0 = inf (x,a)∈X ×A πb (a|x) > 0. The generic recipe for fitted Q-iteration (FQI) [5] is Qk+1 = Regress(Dk (Qk )), (1) where Regress is an appropriate regression procedure and Dk (Qk ) is a dataset defining a regression problem in the form of a list of data-point pairs: Dk (Qk ) = (Xt , At ), Rt + γ max Qk (Xt+1 , b) b∈A 1≤t≤N .1 Fitted Q-iteration can be viewed as approximate value iteration applied to action-value functions. To see this note that value iteration would assign the value (T Qk )(x, a) = r(x, a) + γ maxb∈A Qk (y, b) P (dy|x, a) to Qk+1 (x, a) [6]. Now, remember that the regression function for the jointly distributed random variables (Z, Y ) is defined by the conditional expectation of Y given Z: m(Z) = E [Y |Z]. Since for any fixed function Q, E [Rt + γ maxb∈A Q(Xt+1 , b)|Xt , At ] = (T Q)(Xt , At ), the regression function corresponding to the data Dk (Q) is indeed T Q and hence if FQI solved the regression problem defined by Qk exactly, it would simulate value iteration exactly. However, this argument itself does not directly lead to a rigorous analysis of FQI: Since Qk is obtained based on the data, it is itself a random function. Hence, after the first iteration, the “target” function in FQI becomes random. Furthermore, this function depends on the same data that is used to define the regression problem. Will FQI still work despite these issues? To illustrate the potential difficulties consider a dataset where X1 , . . . , XN is a sequence of independent random variables, which are all distributed uniformly at random in [0, 1]. Further, let M be a random integer greater than N which is independent of the dataset (Xt )N . Let U be another random variable, uniformly t=1 distributed in [0, 1]. Now define the regression problem by Yt = fM,U (Xt ), where fM,U (x) = sgn(sin(2M 2π(x + U ))). Then it is not hard to see that no matter how big N is, no procedure can 1 Since the designer controls Qk , we may assume that it is continuous, hence the maximum exists. 2 estimate the regression function fM,U with a small error (in expectation, or with high probability), even if the procedure could exploit the knowledge of the specific form of fM,U . On the other hand, if we restricted M to a finite range then the estimation problem could be solved successfully. The example shows that if the complexity of the random functions defining the regression problem is uncontrolled then successful estimation might be impossible. Amongst the many regression methods in this paper we have chosen to work with least-squares methods. In this case Equation (1) takes the form N Qk+1 = argmin Q∈F t=1 1 πb (At |Xt ) 2 Q(Xt , At ) − Rt + γ max Qk (Xt+1 , b) b∈A . (2) We call this method the least-squares fitted Q-iteration (LSFQI) method. Here we introduced the weighting 1/πb (At |Xt ) since we do not want to give more weight to those actions that are preferred by the behavior policy. Besides this weighting, the only parameter of the method is the function set F. This function set should be chosen carefully, to keep a balance between the representation power and the number of samples. As a specific example for F consider neural networks with some fixed architecture. In this case the function set is generated by assigning weights in all possible ways to the neural net. Then the above minimization becomes the problem of tuning the weights. Another example is to use linearly parameterized function approximation methods with appropriately selected basis functions. In this case the weight tuning problem would be less demanding. Yet another possibility is to let F be an appropriate restriction of a Reproducing Kernel Hilbert Space (e.g., in a ball). In this case the training procedure becomes similar to LS-SVM training [7]. As indicated above, the analysis of this algorithm is complicated by the fact that the new dataset is defined in terms of the previous iterate, which is already a function of the dataset. Another complication is that the samples in a trajectory are in general correlated and that the bias introduced by the imperfections of the approximation architecture may yield to an explosion of the error of the procedure, as documented in a number of cases in, e.g., [8]. Nevertheless, at least for finite action sets, the tools developed in [1, 3, 2] look suitable to show that under appropriate conditions these problems can be overcome if the function set is chosen in a judicious way. However, the results of these works would become essentially useless in the case of an infinite number of actions since these previous bounds grow to infinity with the number of actions. Actually, we believe that this is not an artifact of the proof techniques of these works, as suggested by the counterexample that involved random targets. The following result elaborates this point further: Proposition 2.1. Let F ⊂ B(X × A). Then even if the pseudo-dimension of F is finite, the fatshattering function of ∨ Fmax = VQ : VQ (·) = max Q(·, a), Q ∈ F a∈A 2 can be infinite over (0, 1/2). Without going into further details, let us just note that the finiteness of the fat-shattering function is a sufficient and necessary condition for learnability and the finiteness of the fat-shattering function is implied by the finiteness of the pseudo-dimension [9].The above proposition thus shows that without imposing further special conditions on F, the learning problem may become infeasible. One possibility is of course to discretize the action space, e.g., by using a uniform grid. However, if the action space has a really high dimensionality, this approach becomes unfeasible (even enumerating 2dA points could be impossible when dA is large). Therefore we prefer alternate solutions. Another possibility is to make the functions in F, e.g., uniformly Lipschitz in their state coordinates. ∨ Then the same property will hold for functions in Fmax and hence by a classical result we can bound the capacity of this set (cf. pp. 353–357 of [10]). One potential problem with this approach is that this way it might be difficult to get a fine control of the capacity of the resulting set. 2 The proof of this and the other results are given in the appendix, available in the extended version of this paper, downloadable from http://hal.inria.fr/inria-00185311/en/. 3 In the approach explored here we modify the fitted Q-iteration algorithm by introducing a policy set Π and a search over this set for an approximately greedy policy in a sense that will be made precise in a minute. Our algorithm thus has four parameters: F, Π, K, Q0 . Here F is as before, Π is a user-chosen set of policies (mappings from X to A), K is the number of iterations and Q0 is an initial value function (a typical choice is Q0 ≡ 0). The algorithm computes a sequence of iterates (Qk , πk ), k = 0, . . . , K, defined by the following equations: ˆ N π0 ˆ = argmax π∈Π Q0 (Xt , π(Xt )), t=1 N Qk+1 = argmin Q∈F t=1 1 Q(Xt , At ) − Rt + γQk (Xt+1 , πk (Xt+1 )) ˆ πb (At |Xt ) 2 , (3) N πk+1 ˆ = argmax π∈Π Qk+1 (Xt , π(Xt )). (4) t=1 Thus, (3) is similar to (2), while (4) defines the policy search problem. The policy search will generally be solved by a gradient procedure or some other appropriate method. The cost of this step will be primarily determined by how well-behaving the iterates Qk+1 are in their action arguments. For example, if they were quadratic and if π was linear then the problem would be a quadratic optimization problem. However, except for special cases3 the action value functions will be more complicated, in which case this step can be expensive. Still, this cost could be similar to that of searching for the maximizing actions for each t = 1, . . . , N if the approximately maximizing actions are similar across similar states. This algorithm, which we could also call a fitted actor-critic algorithm, will be shown to overcome the above mentioned complexity control problem provided that the complexity of Π is controlled appropriately. Indeed, in this case the set of possible regression problems is determined by the set ∨ FΠ = { V : V (·) = Q(·, π(·)), Q ∈ F, π ∈ Π } , ∨ and the proof will rely on controlling the complexity of FΠ by selecting F and Π appropriately. 3 3.1 The main theoretical result Outline of the analysis In order to gain some insight into the behavior of the algorithm, we provide a brief summary of its error analysis. The main result will be presented subsequently. For f ,Q ∈ F and a policy π, we define the tth TD-error as follows: dt (f ; Q, π) = Rt + γQ(Xt+1 , π(Xt+1 )) − f (Xt , At ). Further, we define the empirical loss function by 1 ˆ LN (f ; Q, π) = N N t=1 d2 (f ; Q, π) t , λ(A)πb (At |Xt ) where the normalization with λ(A) is introduced for mathematical convenience. Then (3) can be ˆ written compactly as Qk+1 = argminf ∈F LN (f ; Qk , πk ). ˆ ˆ The algorithm can then be motivated by the observation that for any f ,Q, and π, LN (f ; Q, π) is an unbiased estimate of def 2 L(f ; Q, π) = f − T π Q ν + L∗ (Q, π), (5) where the first term is the error we are interested in and the second term captures the variance of the random samples: L∗ (Q, π) = E [Var [R1 + γQ(X2 , π(X2 ))|X1 , A1 = a]] dλA (a). A 3 Linear quadratic regulation is such a nice case. It is interesting to note that in this special case the obvious choices for F and Π yield zero error in the limit, as can be proven based on the main result of this paper. 4 ˆ This result is stated formally by E LN (f ; Q, π) = L(f ; Q, π). Since the variance term in (5) is independent of f , argminf ∈F L(f ; Q, π) = 2 π argminf ∈F f − T Q ν . Thus, if πk were greedy w.r.t. Qk then argminf ∈F L(f ; Qk , πk ) = ˆ ˆ 2 argminf ∈F f − T Qk ν . Hence we can still think of the procedure as approximate value iteration over the space of action-value functions, projecting T Qk using empirical risk minimization on the space F w.r.t. · ν distances in an approximate manner. Since πk is only approximately greedy, we ˆ will have to deal with both the error coming from the approximate projection and the error coming from the choice of πk . To make this clear, we write the iteration in the form ˆ ˆ ˆ Qk+1 = T πk Qk + εk = T Qk + εk + (T πk Qk − T Qk ) = T Qk + εk , def ˆ ˆ where εk is the error committed while computing T πk Qk , εk = T πk Qk − T Qk is the error committed because the greedy policy is computed approximately and εk = εk + εk is the total error of step k. Hence, in order to show that the procedure is well behaved, one needs to show that both errors are controlled and that when the errors are propagated through these equations, the resulting error stays controlled, too. Since we are ultimately interested in the performance of the policy obtained, we will also need to show that small action-value approximation errors yield small performance losses. For these we need a number of assumptions that concern either the training data, the MDP, or the function sets used for learning. 3.2 Assumptions 3.2.1 Assumptions on the training data We shall assume that the data is rich, is in a steady state, and is fast-mixing, where, informally, mixing means that future depends weakly on the past. Assumption A2 (Sample Path Properties) Assume that {(Xt , At , Rt )}t=1,...,N is the sample path of πb , a stochastic stationary policy. Further, assume that {Xt } is strictly stationary (Xt ∼ ν ∈ M (X )) and exponentially β-mixing with the actual rate given by the parameters (β, b, κ).4 We further assume that the sampling policy πb satisfies π0 = inf (x,a)∈X ×A πb (a|x) > 0. The β-mixing property will be used to establish tail inequalities for certain empirical processes.5 Note that the mixing coefficients do not need to be known. In the case when no mixing condition is satisfied, learning might be impossible. To see this just consider the case when X1 = X2 = . . . = XN . Thus, in this case the learner has many copies of the same random variable and successful generalization is thus impossible. We believe that the assumption that the process is in a steady state is not essential for our result, as when the process reaches its steady state quickly then (at the price of a more involved proof) the result would still hold. 3.2.2 Assumptions on the MDP In order to prevent the uncontrolled growth of the errors as they are propagated through the updates, we shall need some assumptions on the MDP. A convenient assumption is the following one [11]: Assumption A3 (Uniformly stochastic transitions) For all x ∈ X and a ∈ A, assume that P (·|x, a) is absolutely continuous w.r.t. ν and the Radon-Nikodym derivative of P w.r.t. ν is bounded def < +∞. uniformly with bound Cν : Cν = supx∈X ,a∈A dP (·|x,a) dν ∞ Note that by the definition of measure differentiation, Assumption A3 means that P (·|x, a) ≤ Cν ν(·). This assumption essentially requires the transitions to be noisy. We will also prove (weaker) results under the following, weaker assumption: 4 For the definition of β-mixing, see e.g. [2]. We say “empirical process” and “empirical measure”, but note that in this work these are based on dependent (mixing) samples. 5 5 Assumption A4 (Discounted-average concentrability of future-state distributions) Given ρ, ν, m ≥ 1 and an arbitrary sequence of stationary policies {πm }m≥1 , assume that the futuredef state distribution ρP π1 P π2 . . . P πm is absolutely continuous w.r.t. ν. Assume that c(m) = π1 π2 πm def satisfies m≥1 mγ m−1 c(m) < +∞. We shall call Cρ,ν = supπ1 ,...,πm d(ρP Pdν ...P ) ∞ max (1 − γ)2 m≥1 mγ m−1 c(m), (1 − γ) m≥1 γ m c(m) the discounted-average concentrability coefficient of the future-state distributions. The number c(m) measures how much ρ can get amplified in m steps as compared to the reference distribution ν. Hence, in general we expect c(m) to grow with m. In fact, the condition that Cρ,µ is finite is a growth rate condition on c(m). Thanks to discounting, Cρ,µ is finite for a reasonably large class of systems (see the discussion in [11]). A related assumption is needed in the error analysis of the approximate greedy step of the algorithm: Assumption A5 (The random policy “makes no peak-states”) Consider the distribution µ = (ν × λA )P which is the distribution of a state that results from sampling an initial state according to ν and then executing an action which is selected uniformly at random.6 Then Γν = dµ/dν ∞ < +∞. Note that under Assumption A3 we have Γν ≤ Cν . This (very mild) assumption means that after one step, starting from ν and executing this random policy, the probability of the next state being in a set is upper bounded by Γν -times the probability of the starting state being in the same set. def Besides, we assume that A has the following regularity property: Let Py(a, h, ρ) = (a , v) ∈ RdA +1 : a − a 1 ≤ ρ, 0 ≤ v/h ≤ 1 − a − a 1 /ρ denote the pyramid with hight h and base given by the 1 def -ball B(a, ρ) = a ∈ RdA : a − a 1 ≤ρ centered at a. Assumption A6 (Regularity of the action space) We assume that there exists α > 0, such that for all a ∈ A, for all ρ > 0, λ(Py(a, 1, ρ) ∩ (A × R)) λ(A) ≥ min α, λ(Py(a, 1, ρ)) λ(B(a, ρ)) For example, if A is an 1 . -ball itself, then this assumption will be satisfied with α = 2−dA . Without assuming any smoothness of the MDP, learning in infinite MDPs looks hard (see, e.g., [12, 13]). Here we employ the following extra condition: Assumption A7 (Lipschitzness of the MDP in the actions) Assume that the transition probabilities and rewards are Lipschitz w.r.t. their action variable, i.e., there exists LP , Lr > 0 such that for all (x, a, a ) ∈ X × A × A and measurable set B of X , |P (B|x, a) − P (B|x, a )| ≤ LP a − a 1 , |r(x, a) − r(x, a )| ≤ Lr a − a 1 . Note that previously Lipschitzness w.r.t. the state variables was used, e.g., in [11] to construct consistent planning algorithms. 3.2.3 Assumptions on the function sets used by the algorithm These assumptions are less demanding since they are under the control of the user of the algorithm. However, the choice of these function sets will greatly influence the performance of the algorithm, as we shall see it from the bounds. The first assumption concerns the class F: Assumption A8 (Lipschitzness of candidate action-value functions) Assume F ⊂ B(X × A) and that any elements of F is uniformly Lipschitz in its action-argument in the sense that |Q(x, a) − Q(x, a )| ≤ LA a − a 1 holds for any x ∈ X , a,a ∈ A, and Q ∈ F . 6 Remember that λA denotes the uniform distribution over the action set A. 6 We shall also need to control the capacity of our function sets. We assume that the reader is familiar with the concept of VC-dimension.7 Here we use the pseudo-dimension of function sets that builds upon the concept of VC-dimension: Definition 3.1 (Pseudo-dimension). The pseudo-dimension VF + of F is defined as the VCdimension of the subgraphs of functions in F (hence it is also called the VC-subgraph dimension of F). Since A is multidimensional, we define VΠ+ to be the sum of the pseudo-dimensions of the coordinate projection spaces, Πk of Π: dA V Π+ = VΠ + , k=1 k Πk = { πk : X → R : π = (π1 , . . . , πk , . . . , πdA ) ∈ Π } . Now we are ready to state our assumptions on our function sets: Assumption A9 (Capacity of the function and policy sets) Assume that F ⊂ B(X × A; Qmax ) for Qmax > 0 and VF + < +∞. Also, A ⊂ [−A∞ , A∞ ]dA and VΠ+ < +∞. Besides their capacity, one shall also control the approximation power of the function sets involved. Let us first consider the policy set Π. Introduce e∗ (F, Π) = sup inf ν(EQ − E π Q). Q∈F π∈Π Note that inf π∈Π ν(EQ − E π Q) measures the quality of approximating νEQ by νE π Q. Hence, e∗ (F, Π) measures the worst-case approximation error of νEQ as Q is changed within F. This can be made small by choosing Π large. Another related quantity is the one-step Bellman-error of F w.r.t. Π. This is defined as follows: For a fixed policy π, the one-step Bellman-error of F w.r.t. T π is defined as E1 (F; π) = sup inf Q∈F Q ∈F Q − T πQ ν . Taking again a pessimistic approach, the one-step Bellman-error of F is defined as E1 (F, Π) = sup E1 (F; π). π∈Π Typically by increasing F, E1 (F, Π) can be made smaller (this is discussed at some length in [3]). However, it also holds for both Π and F that making them bigger will increase their capacity (pseudo-dimensions) which leads to an increase of the estimation errors. Hence, F and Π must be selected to balance the approximation and estimation errors, just like in supervised learning. 3.3 The main result Theorem 3.2. Let πK be a greedy policy w.r.t. QK , i.e. πK (x) ∈ argmaxa∈A QK (x, a). Then under Assumptions A1, A2, and A5–A9, for all δ > 0 we have with probability at least 1 − δ: given Assumption A3 (respectively A4), V ∗ − V πK ∞ (resp. V ∗ − V πK 1,ρ ), is bounded by    d 1+1 κ+1   A   4κ (log N + log(K/δ))  + γK , C E1 (F, Π) + e∗ (F, Π) + 1/4   N   where C depends on dA , VF + , (VΠ+ )dA , γ, κ, b, β, Cν (resp. Cρ,ν ), Γν , LA , LP ,Lr , α, λ(A), π0 , k=1 k κ+1 ˆ Qmax , Rmax , Rmax , and A∞ . In particular, C scales with V 4κ(dA +1) , where V = 2VF + + VΠ+ plays the role of the “combined effective” dimension of F and Π. 7 Readers not familiar with VC-dimension are suggested to consult a book, such as the one by Anthony and Bartlett [14]. 7 4 Discussion We have presented what we believe is the first finite-time bounds for continuous-state and actionspace RL that uses value functions. Further, this is the first analysis of fitted Q-iteration, an algorithm that has proved to be useful in a number of cases, even when used with non-averagers for which no previous theoretical analysis existed (e.g., [15, 16]). In fact, our main motivation was to show that there is a systematic way of making these algorithms work and to point at possible problem sources the same time. We discussed why it can be difficult to make these algorithms work in practice. We suggested that either the set of action-value candidates has to be carefully controlled (e.g., assuming uniform Lipschitzness w.r.t. the state variables), or a policy search step is needed, just like in actorcritic algorithms. The bound in this paper is similar in many respects to a previous bound of a Bellman-residual minimization algorithm [2]. It looks that the techniques developed here can be used to obtain results for that algorithm when it is applied to continuous action spaces. Finally, although we have not explored them here, consistency results for FQI can be obtained from our results using standard methods, like the methods of sieves. We believe that the methods developed here will eventually lead to algorithms where the function approximation methods are chosen based on the data (similar to adaptive regression methods) so as to optimize performance, which in our opinion is one of the biggest open questions in RL. Currently we are exploring this possibility. Acknowledgments Andr´ s Antos would like to acknowledge support for this project from the Hungarian Academy of Sciences a (Bolyai Fellowship). Csaba Szepesv´ ri greatly acknowledges the support received from the Alberta Ingenuity a Fund, NSERC, the Computer and Automation Research Institute of the Hungarian Academy of Sciences. References [1] A. Antos, Cs. Szepesv´ ri, and R. Munos. Learning near-optimal policies with Bellman-residual minia mization based fitted policy iteration and a single sample path. In COLT-19, pages 574–588, 2006. [2] A. Antos, Cs. Szepesv´ ri, and R. Munos. Learning near-optimal policies with Bellman-residual minia mization based fitted policy iteration and a single sample path. Machine Learning, 2007. (accepted). [3] A. Antos, Cs. Szepesv´ ri, and R. Munos. Value-iteration based fitted policy iteration: learning with a a single trajectory. In IEEE ADPRL, pages 330–337, 2007. [4] D. P. Bertsekas and S.E. Shreve. Stochastic Optimal Control (The Discrete Time Case). Academic Press, New York, 1978. [5] D. Ernst, P. Geurts, and L. Wehenkel. Tree-based batch mode reinforcement learning. Journal of Machine Learning Research, 6:503–556, 2005. [6] R.S. Sutton and A.G. Barto. Reinforcement Learning: An Introduction. Bradford Book. MIT Press, 1998. [7] N. Cristianini and J. Shawe-Taylor. An introduction to support vector machines (and other kernel-based learning methods). Cambridge University Press, 2000. [8] J.A. Boyan and A.W. Moore. Generalization in reinforcement learning: Safely approximating the value function. In NIPS-7, pages 369–376, 1995. [9] P.L. Bartlett, P.M. Long, and R.C. Williamson. Fat-shattering and the learnability of real-valued functions. Journal of Computer and System Sciences, 52:434–452, 1996. [10] A.N. Kolmogorov and V.M. Tihomirov. -entropy and -capacity of sets in functional space. American Mathematical Society Translations, 17(2):277–364, 1961. [11] R. Munos and Cs. Szepesv´ ri. Finite time bounds for sampling based fitted value iteration. Technical a report, Computer and Automation Research Institute of the Hungarian Academy of Sciences, Kende u. 13-17, Budapest 1111, Hungary, 2006. [12] A.Y. Ng and M. Jordan. PEGASUS: A policy search method for large MDPs and POMDPs. In Proceedings of the 16th Conference in Uncertainty in Artificial Intelligence, pages 406–415, 2000. [13] P.L. Bartlett and A. Tewari. Sample complexity of policy search with known dynamics. In NIPS-19. MIT Press, 2007. [14] M. Anthony and P. L. Bartlett. Neural Network Learning: Theoretical Foundations. Cambridge University Press, 1999. [15] M. Riedmiller. Neural fitted Q iteration – first experiences with a data efficient neural reinforcement learning method. In 16th European Conference on Machine Learning, pages 317–328, 2005. [16] S. Kalyanakrishnan and P. Stone. Batch reinforcement learning in a complex domain. In AAMAS-07, 2007. 8

6 0.12369841 148 nips-2007-Online Linear Regression and Its Application to Model-Based Reinforcement Learning

7 0.12284797 40 nips-2007-Bundle Methods for Machine Learning

8 0.12123439 162 nips-2007-Random Sampling of States in Dynamic Programming

9 0.11272524 92 nips-2007-Fixing Max-Product: Convergent Message Passing Algorithms for MAP LP-Relaxations

10 0.11177061 168 nips-2007-Reinforcement Learning in Continuous Action Spaces through Sequential Monte Carlo Methods

11 0.097551577 30 nips-2007-Bayes-Adaptive POMDPs

12 0.088108331 60 nips-2007-Contraction Properties of VLSI Cooperative Competitive Neural Networks of Spiking Neurons

13 0.079678513 124 nips-2007-Managing Power Consumption and Performance of Computing Systems Using Reinforcement Learning

14 0.079664886 192 nips-2007-Testing for Homogeneity with Kernel Fisher Discriminant Analysis

15 0.079089962 128 nips-2007-Message Passing for Max-weight Independent Set

16 0.071570762 151 nips-2007-Optimistic Linear Programming gives Logarithmic Regret for Irreducible MDPs

17 0.071141683 98 nips-2007-Hierarchical Apprenticeship Learning with Application to Quadruped Locomotion

18 0.069237359 181 nips-2007-Sparse Overcomplete Latent Variable Decomposition of Counts Data

19 0.06533625 190 nips-2007-Support Vector Machine Classification with Indefinite Kernels

20 0.065183446 22 nips-2007-Agreement-Based Learning


similar papers computed by lsi model

lsi for this paper:

topicId topicWeight

[(0, -0.193), (1, -0.2), (2, 0.009), (3, -0.018), (4, -0.167), (5, -0.018), (6, 0.049), (7, 0.027), (8, -0.01), (9, 0.138), (10, 0.088), (11, -0.112), (12, 0.067), (13, 0.005), (14, 0.008), (15, -0.047), (16, 0.017), (17, 0.1), (18, -0.039), (19, -0.016), (20, -0.013), (21, -0.03), (22, -0.069), (23, -0.045), (24, -0.041), (25, -0.09), (26, -0.032), (27, -0.09), (28, 0.049), (29, -0.037), (30, -0.113), (31, 0.021), (32, 0.031), (33, -0.063), (34, -0.071), (35, -0.084), (36, 0.002), (37, -0.083), (38, 0.044), (39, 0.023), (40, 0.105), (41, -0.002), (42, 0.147), (43, 0.021), (44, 0.03), (45, -0.022), (46, 0.083), (47, 0.058), (48, -0.089), (49, 0.045)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.96555763 185 nips-2007-Stable Dual Dynamic Programming

Author: Tao Wang, Michael Bowling, Dale Schuurmans, Daniel J. Lizotte

Abstract: Recently, we have introduced a novel approach to dynamic programming and reinforcement learning that is based on maintaining explicit representations of stationary distributions instead of value functions. In this paper, we investigate the convergence properties of these dual algorithms both theoretically and empirically, and show how they can be scaled up by incorporating function approximation. 1

2 0.60927582 5 nips-2007-A Game-Theoretic Approach to Apprenticeship Learning

Author: Umar Syed, Robert E. Schapire

Abstract: We study the problem of an apprentice learning to behave in an environment with an unknown reward function by observing the behavior of an expert. We follow on the work of Abbeel and Ng [1] who considered a framework in which the true reward function is assumed to be a linear combination of a set of known and observable features. We give a new algorithm that, like theirs, is guaranteed to learn a policy that is nearly as good as the expert’s, given enough examples. However, unlike their algorithm, we show that ours may produce a policy that is substantially better than the expert’s. Moreover, our algorithm is computationally faster, is easier to implement, and can be applied even in the absence of an expert. The method is based on a game-theoretic view of the problem, which leads naturally to a direct application of the multiplicative-weights algorithm of Freund and Schapire [2] for playing repeated matrix games. In addition to our formal presentation and analysis of the new algorithm, we sketch how the method can be applied when the transition function itself is unknown, and we provide an experimental demonstration of the algorithm on a toy video-game environment. 1

3 0.58530623 124 nips-2007-Managing Power Consumption and Performance of Computing Systems Using Reinforcement Learning

Author: Gerald Tesauro, Rajarshi Das, Hoi Chan, Jeffrey Kephart, David Levine, Freeman Rawson, Charles Lefurgy

Abstract: Electrical power management in large-scale IT systems such as commercial datacenters is an application area of rapidly growing interest from both an economic and ecological perspective, with billions of dollars and millions of metric tons of CO2 emissions at stake annually. Businesses want to save power without sacrificing performance. This paper presents a reinforcement learning approach to simultaneous online management of both performance and power consumption. We apply RL in a realistic laboratory testbed using a Blade cluster and dynamically varying HTTP workload running on a commercial web applications middleware platform. We embed a CPU frequency controller in the Blade servers’ firmware, and we train policies for this controller using a multi-criteria reward signal depending on both application performance and CPU power consumption. Our testbed scenario posed a number of challenges to successful use of RL, including multiple disparate reward functions, limited decision sampling rates, and pathologies arising when using multiple sensor readings as state variables. We describe innovative practical solutions to these challenges, and demonstrate clear performance improvements over both hand-designed policies as well as obvious “cookbook” RL implementations. 1

4 0.57999825 91 nips-2007-Fitted Q-iteration in continuous action-space MDPs

Author: András Antos, Csaba Szepesvári, Rémi Munos

Abstract: We consider continuous state, continuous action batch reinforcement learning where the goal is to learn a good policy from a sufficiently rich trajectory generated by some policy. We study a variant of fitted Q-iteration, where the greedy action selection is replaced by searching for a policy in a restricted set of candidate policies by maximizing the average action values. We provide a rigorous analysis of this algorithm, proving what we believe is the first finite-time bound for value-function based algorithms for continuous state and action problems. 1 Preliminaries We will build on the results from [1, 2, 3] and for this reason we use the same notation as these papers. The unattributed results cited in this section can be found in the book [4]. A discounted MDP is defined by a quintuple (X , A, P, S, γ), where X is the (possible infinite) state space, A is the set of actions, P : X × A → M (X ) is the transition probability kernel with P (·|x, a) defining the next-state distribution upon taking action a from state x, S(·|x, a) gives the corresponding distribution of immediate rewards, and γ ∈ (0, 1) is the discount factor. Here X is a measurable space and M (X ) denotes the set of all probability measures over X . The Lebesguemeasure shall be denoted by λ. We start with the following mild assumption on the MDP: Assumption A1 (MDP Regularity) X is a compact subset of the dX -dimensional Euclidean space, ˆ A is a compact subset of [−A∞ , A∞ ]dA . The random immediate rewards are bounded by Rmax and that the expected immediate reward function, r(x, a) = rS(dr|x, a), is uniformly bounded by Rmax : r ∞ ≤ Rmax . A policy determines the next action given the past observations. Here we shall deal with stationary (Markovian) policies which choose an action in a stochastic way based on the last observation only. The value of a policy π when it is started from a state x is defined as the total expected discounted ∞ reward that is encountered while the policy is executed: V π (x) = Eπ [ t=0 γ t Rt |X0 = x]. Here Rt ∼ S(·|Xt , At ) is the reward received at time step t, the state, Xt , evolves according to Xt+1 ∼ ∗ Also with: Computer and Automation Research Inst. of the Hungarian Academy of Sciences Kende u. 13-17, Budapest 1111, Hungary. 1 P (·|Xt , At ), where At is sampled from the distribution determined by π. We use Qπ : X × A → R ∞ to denote the action-value function of policy π: Qπ (x, a) = Eπ [ t=0 γ t Rt |X0 = x, A0 = a]. The goal is to find a policy that attains the best possible values, V ∗ (x) = supπ V π (x), at all states ∗ x ∈ X . Here V ∗ is called the optimal value function and a policy π ∗ that satisfies V π (x) = ∗ ∗ ∗ V (x) for all x ∈ X is called optimal. The optimal action-value function Q (x, a) is Q (x, a) = supπ Qπ (x, a). We say that a (deterministic stationary) policy π is greedy w.r.t. an action-value function Q ∈ B(X × A), and we write π = π (·; Q), if, for all x ∈ X , π(x) ∈ argmaxa∈A Q(x, a). ˆ Under mild technical assumptions, such a greedy policy always exists. Any greedy policy w.r.t. Q∗ is optimal. For π : X → A we define its evaluation operator, T π : B(X × A) → B(X × A), by (T π Q)(x, a) = r(x, a) + γ X Q(y, π(y)) P (dy|x, a). It is known that Qπ = T π Qπ . Further, if we let the Bellman operator, T : B(X × A) → B(X × A), defined by (T Q)(x, a) = r(x, a) + γ X supb∈A Q(y, b) P (dy|x, a) then Q∗ = T Q∗ . It is known that V π and Qπ are bounded by Rmax /(1 − γ), just like Q∗ and V ∗ . For π : X → A, the operator E π : B(X × A) → B(X ) is defined by (E π Q)(x) = Q(x, π(x)), while E : B(X × A) → B(X ) is defined by (EQ)(x) = supa∈A Q(x, a). Throughout the paper F ⊂ {f : X × A → R} will denote a subset of real-valued functions over the state-action space X × A and Π ⊂ AX will be a set of policies. For ν ∈ M (X ) and f : X → R p measurable, we let (for p ≥ 1) f p,ν = X |f (x)|p ν(dx). We simply write f ν for f 2,ν . 2 Further, we extend · ν to F by f ν = A X |f |2 (x, a) dν(x) dλA (a), where λA is the uniform distribution over A. We shall use the shorthand notation νf to denote the integral f (x)ν(dx). We denote the space of bounded measurable functions with domain X by B(X ). Further, the space of measurable functions bounded by 0 < K < ∞ shall be denoted by B(X ; K). We let · ∞ denote the supremum norm. 2 Fitted Q-iteration with approximate policy maximization We assume that we are given a finite trajectory, {(Xt , At , Rt )}1≤t≤N , generated by some stochastic stationary policy πb , called the behavior policy: At ∼ πb (·|Xt ), Xt+1 ∼ P (·|Xt , At ), Rt ∼ def S(·|Xt , At ), where πb (·|x) is a density with π0 = inf (x,a)∈X ×A πb (a|x) > 0. The generic recipe for fitted Q-iteration (FQI) [5] is Qk+1 = Regress(Dk (Qk )), (1) where Regress is an appropriate regression procedure and Dk (Qk ) is a dataset defining a regression problem in the form of a list of data-point pairs: Dk (Qk ) = (Xt , At ), Rt + γ max Qk (Xt+1 , b) b∈A 1≤t≤N .1 Fitted Q-iteration can be viewed as approximate value iteration applied to action-value functions. To see this note that value iteration would assign the value (T Qk )(x, a) = r(x, a) + γ maxb∈A Qk (y, b) P (dy|x, a) to Qk+1 (x, a) [6]. Now, remember that the regression function for the jointly distributed random variables (Z, Y ) is defined by the conditional expectation of Y given Z: m(Z) = E [Y |Z]. Since for any fixed function Q, E [Rt + γ maxb∈A Q(Xt+1 , b)|Xt , At ] = (T Q)(Xt , At ), the regression function corresponding to the data Dk (Q) is indeed T Q and hence if FQI solved the regression problem defined by Qk exactly, it would simulate value iteration exactly. However, this argument itself does not directly lead to a rigorous analysis of FQI: Since Qk is obtained based on the data, it is itself a random function. Hence, after the first iteration, the “target” function in FQI becomes random. Furthermore, this function depends on the same data that is used to define the regression problem. Will FQI still work despite these issues? To illustrate the potential difficulties consider a dataset where X1 , . . . , XN is a sequence of independent random variables, which are all distributed uniformly at random in [0, 1]. Further, let M be a random integer greater than N which is independent of the dataset (Xt )N . Let U be another random variable, uniformly t=1 distributed in [0, 1]. Now define the regression problem by Yt = fM,U (Xt ), where fM,U (x) = sgn(sin(2M 2π(x + U ))). Then it is not hard to see that no matter how big N is, no procedure can 1 Since the designer controls Qk , we may assume that it is continuous, hence the maximum exists. 2 estimate the regression function fM,U with a small error (in expectation, or with high probability), even if the procedure could exploit the knowledge of the specific form of fM,U . On the other hand, if we restricted M to a finite range then the estimation problem could be solved successfully. The example shows that if the complexity of the random functions defining the regression problem is uncontrolled then successful estimation might be impossible. Amongst the many regression methods in this paper we have chosen to work with least-squares methods. In this case Equation (1) takes the form N Qk+1 = argmin Q∈F t=1 1 πb (At |Xt ) 2 Q(Xt , At ) − Rt + γ max Qk (Xt+1 , b) b∈A . (2) We call this method the least-squares fitted Q-iteration (LSFQI) method. Here we introduced the weighting 1/πb (At |Xt ) since we do not want to give more weight to those actions that are preferred by the behavior policy. Besides this weighting, the only parameter of the method is the function set F. This function set should be chosen carefully, to keep a balance between the representation power and the number of samples. As a specific example for F consider neural networks with some fixed architecture. In this case the function set is generated by assigning weights in all possible ways to the neural net. Then the above minimization becomes the problem of tuning the weights. Another example is to use linearly parameterized function approximation methods with appropriately selected basis functions. In this case the weight tuning problem would be less demanding. Yet another possibility is to let F be an appropriate restriction of a Reproducing Kernel Hilbert Space (e.g., in a ball). In this case the training procedure becomes similar to LS-SVM training [7]. As indicated above, the analysis of this algorithm is complicated by the fact that the new dataset is defined in terms of the previous iterate, which is already a function of the dataset. Another complication is that the samples in a trajectory are in general correlated and that the bias introduced by the imperfections of the approximation architecture may yield to an explosion of the error of the procedure, as documented in a number of cases in, e.g., [8]. Nevertheless, at least for finite action sets, the tools developed in [1, 3, 2] look suitable to show that under appropriate conditions these problems can be overcome if the function set is chosen in a judicious way. However, the results of these works would become essentially useless in the case of an infinite number of actions since these previous bounds grow to infinity with the number of actions. Actually, we believe that this is not an artifact of the proof techniques of these works, as suggested by the counterexample that involved random targets. The following result elaborates this point further: Proposition 2.1. Let F ⊂ B(X × A). Then even if the pseudo-dimension of F is finite, the fatshattering function of ∨ Fmax = VQ : VQ (·) = max Q(·, a), Q ∈ F a∈A 2 can be infinite over (0, 1/2). Without going into further details, let us just note that the finiteness of the fat-shattering function is a sufficient and necessary condition for learnability and the finiteness of the fat-shattering function is implied by the finiteness of the pseudo-dimension [9].The above proposition thus shows that without imposing further special conditions on F, the learning problem may become infeasible. One possibility is of course to discretize the action space, e.g., by using a uniform grid. However, if the action space has a really high dimensionality, this approach becomes unfeasible (even enumerating 2dA points could be impossible when dA is large). Therefore we prefer alternate solutions. Another possibility is to make the functions in F, e.g., uniformly Lipschitz in their state coordinates. ∨ Then the same property will hold for functions in Fmax and hence by a classical result we can bound the capacity of this set (cf. pp. 353–357 of [10]). One potential problem with this approach is that this way it might be difficult to get a fine control of the capacity of the resulting set. 2 The proof of this and the other results are given in the appendix, available in the extended version of this paper, downloadable from http://hal.inria.fr/inria-00185311/en/. 3 In the approach explored here we modify the fitted Q-iteration algorithm by introducing a policy set Π and a search over this set for an approximately greedy policy in a sense that will be made precise in a minute. Our algorithm thus has four parameters: F, Π, K, Q0 . Here F is as before, Π is a user-chosen set of policies (mappings from X to A), K is the number of iterations and Q0 is an initial value function (a typical choice is Q0 ≡ 0). The algorithm computes a sequence of iterates (Qk , πk ), k = 0, . . . , K, defined by the following equations: ˆ N π0 ˆ = argmax π∈Π Q0 (Xt , π(Xt )), t=1 N Qk+1 = argmin Q∈F t=1 1 Q(Xt , At ) − Rt + γQk (Xt+1 , πk (Xt+1 )) ˆ πb (At |Xt ) 2 , (3) N πk+1 ˆ = argmax π∈Π Qk+1 (Xt , π(Xt )). (4) t=1 Thus, (3) is similar to (2), while (4) defines the policy search problem. The policy search will generally be solved by a gradient procedure or some other appropriate method. The cost of this step will be primarily determined by how well-behaving the iterates Qk+1 are in their action arguments. For example, if they were quadratic and if π was linear then the problem would be a quadratic optimization problem. However, except for special cases3 the action value functions will be more complicated, in which case this step can be expensive. Still, this cost could be similar to that of searching for the maximizing actions for each t = 1, . . . , N if the approximately maximizing actions are similar across similar states. This algorithm, which we could also call a fitted actor-critic algorithm, will be shown to overcome the above mentioned complexity control problem provided that the complexity of Π is controlled appropriately. Indeed, in this case the set of possible regression problems is determined by the set ∨ FΠ = { V : V (·) = Q(·, π(·)), Q ∈ F, π ∈ Π } , ∨ and the proof will rely on controlling the complexity of FΠ by selecting F and Π appropriately. 3 3.1 The main theoretical result Outline of the analysis In order to gain some insight into the behavior of the algorithm, we provide a brief summary of its error analysis. The main result will be presented subsequently. For f ,Q ∈ F and a policy π, we define the tth TD-error as follows: dt (f ; Q, π) = Rt + γQ(Xt+1 , π(Xt+1 )) − f (Xt , At ). Further, we define the empirical loss function by 1 ˆ LN (f ; Q, π) = N N t=1 d2 (f ; Q, π) t , λ(A)πb (At |Xt ) where the normalization with λ(A) is introduced for mathematical convenience. Then (3) can be ˆ written compactly as Qk+1 = argminf ∈F LN (f ; Qk , πk ). ˆ ˆ The algorithm can then be motivated by the observation that for any f ,Q, and π, LN (f ; Q, π) is an unbiased estimate of def 2 L(f ; Q, π) = f − T π Q ν + L∗ (Q, π), (5) where the first term is the error we are interested in and the second term captures the variance of the random samples: L∗ (Q, π) = E [Var [R1 + γQ(X2 , π(X2 ))|X1 , A1 = a]] dλA (a). A 3 Linear quadratic regulation is such a nice case. It is interesting to note that in this special case the obvious choices for F and Π yield zero error in the limit, as can be proven based on the main result of this paper. 4 ˆ This result is stated formally by E LN (f ; Q, π) = L(f ; Q, π). Since the variance term in (5) is independent of f , argminf ∈F L(f ; Q, π) = 2 π argminf ∈F f − T Q ν . Thus, if πk were greedy w.r.t. Qk then argminf ∈F L(f ; Qk , πk ) = ˆ ˆ 2 argminf ∈F f − T Qk ν . Hence we can still think of the procedure as approximate value iteration over the space of action-value functions, projecting T Qk using empirical risk minimization on the space F w.r.t. · ν distances in an approximate manner. Since πk is only approximately greedy, we ˆ will have to deal with both the error coming from the approximate projection and the error coming from the choice of πk . To make this clear, we write the iteration in the form ˆ ˆ ˆ Qk+1 = T πk Qk + εk = T Qk + εk + (T πk Qk − T Qk ) = T Qk + εk , def ˆ ˆ where εk is the error committed while computing T πk Qk , εk = T πk Qk − T Qk is the error committed because the greedy policy is computed approximately and εk = εk + εk is the total error of step k. Hence, in order to show that the procedure is well behaved, one needs to show that both errors are controlled and that when the errors are propagated through these equations, the resulting error stays controlled, too. Since we are ultimately interested in the performance of the policy obtained, we will also need to show that small action-value approximation errors yield small performance losses. For these we need a number of assumptions that concern either the training data, the MDP, or the function sets used for learning. 3.2 Assumptions 3.2.1 Assumptions on the training data We shall assume that the data is rich, is in a steady state, and is fast-mixing, where, informally, mixing means that future depends weakly on the past. Assumption A2 (Sample Path Properties) Assume that {(Xt , At , Rt )}t=1,...,N is the sample path of πb , a stochastic stationary policy. Further, assume that {Xt } is strictly stationary (Xt ∼ ν ∈ M (X )) and exponentially β-mixing with the actual rate given by the parameters (β, b, κ).4 We further assume that the sampling policy πb satisfies π0 = inf (x,a)∈X ×A πb (a|x) > 0. The β-mixing property will be used to establish tail inequalities for certain empirical processes.5 Note that the mixing coefficients do not need to be known. In the case when no mixing condition is satisfied, learning might be impossible. To see this just consider the case when X1 = X2 = . . . = XN . Thus, in this case the learner has many copies of the same random variable and successful generalization is thus impossible. We believe that the assumption that the process is in a steady state is not essential for our result, as when the process reaches its steady state quickly then (at the price of a more involved proof) the result would still hold. 3.2.2 Assumptions on the MDP In order to prevent the uncontrolled growth of the errors as they are propagated through the updates, we shall need some assumptions on the MDP. A convenient assumption is the following one [11]: Assumption A3 (Uniformly stochastic transitions) For all x ∈ X and a ∈ A, assume that P (·|x, a) is absolutely continuous w.r.t. ν and the Radon-Nikodym derivative of P w.r.t. ν is bounded def < +∞. uniformly with bound Cν : Cν = supx∈X ,a∈A dP (·|x,a) dν ∞ Note that by the definition of measure differentiation, Assumption A3 means that P (·|x, a) ≤ Cν ν(·). This assumption essentially requires the transitions to be noisy. We will also prove (weaker) results under the following, weaker assumption: 4 For the definition of β-mixing, see e.g. [2]. We say “empirical process” and “empirical measure”, but note that in this work these are based on dependent (mixing) samples. 5 5 Assumption A4 (Discounted-average concentrability of future-state distributions) Given ρ, ν, m ≥ 1 and an arbitrary sequence of stationary policies {πm }m≥1 , assume that the futuredef state distribution ρP π1 P π2 . . . P πm is absolutely continuous w.r.t. ν. Assume that c(m) = π1 π2 πm def satisfies m≥1 mγ m−1 c(m) < +∞. We shall call Cρ,ν = supπ1 ,...,πm d(ρP Pdν ...P ) ∞ max (1 − γ)2 m≥1 mγ m−1 c(m), (1 − γ) m≥1 γ m c(m) the discounted-average concentrability coefficient of the future-state distributions. The number c(m) measures how much ρ can get amplified in m steps as compared to the reference distribution ν. Hence, in general we expect c(m) to grow with m. In fact, the condition that Cρ,µ is finite is a growth rate condition on c(m). Thanks to discounting, Cρ,µ is finite for a reasonably large class of systems (see the discussion in [11]). A related assumption is needed in the error analysis of the approximate greedy step of the algorithm: Assumption A5 (The random policy “makes no peak-states”) Consider the distribution µ = (ν × λA )P which is the distribution of a state that results from sampling an initial state according to ν and then executing an action which is selected uniformly at random.6 Then Γν = dµ/dν ∞ < +∞. Note that under Assumption A3 we have Γν ≤ Cν . This (very mild) assumption means that after one step, starting from ν and executing this random policy, the probability of the next state being in a set is upper bounded by Γν -times the probability of the starting state being in the same set. def Besides, we assume that A has the following regularity property: Let Py(a, h, ρ) = (a , v) ∈ RdA +1 : a − a 1 ≤ ρ, 0 ≤ v/h ≤ 1 − a − a 1 /ρ denote the pyramid with hight h and base given by the 1 def -ball B(a, ρ) = a ∈ RdA : a − a 1 ≤ρ centered at a. Assumption A6 (Regularity of the action space) We assume that there exists α > 0, such that for all a ∈ A, for all ρ > 0, λ(Py(a, 1, ρ) ∩ (A × R)) λ(A) ≥ min α, λ(Py(a, 1, ρ)) λ(B(a, ρ)) For example, if A is an 1 . -ball itself, then this assumption will be satisfied with α = 2−dA . Without assuming any smoothness of the MDP, learning in infinite MDPs looks hard (see, e.g., [12, 13]). Here we employ the following extra condition: Assumption A7 (Lipschitzness of the MDP in the actions) Assume that the transition probabilities and rewards are Lipschitz w.r.t. their action variable, i.e., there exists LP , Lr > 0 such that for all (x, a, a ) ∈ X × A × A and measurable set B of X , |P (B|x, a) − P (B|x, a )| ≤ LP a − a 1 , |r(x, a) − r(x, a )| ≤ Lr a − a 1 . Note that previously Lipschitzness w.r.t. the state variables was used, e.g., in [11] to construct consistent planning algorithms. 3.2.3 Assumptions on the function sets used by the algorithm These assumptions are less demanding since they are under the control of the user of the algorithm. However, the choice of these function sets will greatly influence the performance of the algorithm, as we shall see it from the bounds. The first assumption concerns the class F: Assumption A8 (Lipschitzness of candidate action-value functions) Assume F ⊂ B(X × A) and that any elements of F is uniformly Lipschitz in its action-argument in the sense that |Q(x, a) − Q(x, a )| ≤ LA a − a 1 holds for any x ∈ X , a,a ∈ A, and Q ∈ F . 6 Remember that λA denotes the uniform distribution over the action set A. 6 We shall also need to control the capacity of our function sets. We assume that the reader is familiar with the concept of VC-dimension.7 Here we use the pseudo-dimension of function sets that builds upon the concept of VC-dimension: Definition 3.1 (Pseudo-dimension). The pseudo-dimension VF + of F is defined as the VCdimension of the subgraphs of functions in F (hence it is also called the VC-subgraph dimension of F). Since A is multidimensional, we define VΠ+ to be the sum of the pseudo-dimensions of the coordinate projection spaces, Πk of Π: dA V Π+ = VΠ + , k=1 k Πk = { πk : X → R : π = (π1 , . . . , πk , . . . , πdA ) ∈ Π } . Now we are ready to state our assumptions on our function sets: Assumption A9 (Capacity of the function and policy sets) Assume that F ⊂ B(X × A; Qmax ) for Qmax > 0 and VF + < +∞. Also, A ⊂ [−A∞ , A∞ ]dA and VΠ+ < +∞. Besides their capacity, one shall also control the approximation power of the function sets involved. Let us first consider the policy set Π. Introduce e∗ (F, Π) = sup inf ν(EQ − E π Q). Q∈F π∈Π Note that inf π∈Π ν(EQ − E π Q) measures the quality of approximating νEQ by νE π Q. Hence, e∗ (F, Π) measures the worst-case approximation error of νEQ as Q is changed within F. This can be made small by choosing Π large. Another related quantity is the one-step Bellman-error of F w.r.t. Π. This is defined as follows: For a fixed policy π, the one-step Bellman-error of F w.r.t. T π is defined as E1 (F; π) = sup inf Q∈F Q ∈F Q − T πQ ν . Taking again a pessimistic approach, the one-step Bellman-error of F is defined as E1 (F, Π) = sup E1 (F; π). π∈Π Typically by increasing F, E1 (F, Π) can be made smaller (this is discussed at some length in [3]). However, it also holds for both Π and F that making them bigger will increase their capacity (pseudo-dimensions) which leads to an increase of the estimation errors. Hence, F and Π must be selected to balance the approximation and estimation errors, just like in supervised learning. 3.3 The main result Theorem 3.2. Let πK be a greedy policy w.r.t. QK , i.e. πK (x) ∈ argmaxa∈A QK (x, a). Then under Assumptions A1, A2, and A5–A9, for all δ > 0 we have with probability at least 1 − δ: given Assumption A3 (respectively A4), V ∗ − V πK ∞ (resp. V ∗ − V πK 1,ρ ), is bounded by    d 1+1 κ+1   A   4κ (log N + log(K/δ))  + γK , C E1 (F, Π) + e∗ (F, Π) + 1/4   N   where C depends on dA , VF + , (VΠ+ )dA , γ, κ, b, β, Cν (resp. Cρ,ν ), Γν , LA , LP ,Lr , α, λ(A), π0 , k=1 k κ+1 ˆ Qmax , Rmax , Rmax , and A∞ . In particular, C scales with V 4κ(dA +1) , where V = 2VF + + VΠ+ plays the role of the “combined effective” dimension of F and Π. 7 Readers not familiar with VC-dimension are suggested to consult a book, such as the one by Anthony and Bartlett [14]. 7 4 Discussion We have presented what we believe is the first finite-time bounds for continuous-state and actionspace RL that uses value functions. Further, this is the first analysis of fitted Q-iteration, an algorithm that has proved to be useful in a number of cases, even when used with non-averagers for which no previous theoretical analysis existed (e.g., [15, 16]). In fact, our main motivation was to show that there is a systematic way of making these algorithms work and to point at possible problem sources the same time. We discussed why it can be difficult to make these algorithms work in practice. We suggested that either the set of action-value candidates has to be carefully controlled (e.g., assuming uniform Lipschitzness w.r.t. the state variables), or a policy search step is needed, just like in actorcritic algorithms. The bound in this paper is similar in many respects to a previous bound of a Bellman-residual minimization algorithm [2]. It looks that the techniques developed here can be used to obtain results for that algorithm when it is applied to continuous action spaces. Finally, although we have not explored them here, consistency results for FQI can be obtained from our results using standard methods, like the methods of sieves. We believe that the methods developed here will eventually lead to algorithms where the function approximation methods are chosen based on the data (similar to adaptive regression methods) so as to optimize performance, which in our opinion is one of the biggest open questions in RL. Currently we are exploring this possibility. Acknowledgments Andr´ s Antos would like to acknowledge support for this project from the Hungarian Academy of Sciences a (Bolyai Fellowship). Csaba Szepesv´ ri greatly acknowledges the support received from the Alberta Ingenuity a Fund, NSERC, the Computer and Automation Research Institute of the Hungarian Academy of Sciences. References [1] A. Antos, Cs. Szepesv´ ri, and R. Munos. Learning near-optimal policies with Bellman-residual minia mization based fitted policy iteration and a single sample path. In COLT-19, pages 574–588, 2006. [2] A. Antos, Cs. Szepesv´ ri, and R. Munos. Learning near-optimal policies with Bellman-residual minia mization based fitted policy iteration and a single sample path. Machine Learning, 2007. (accepted). [3] A. Antos, Cs. Szepesv´ ri, and R. Munos. Value-iteration based fitted policy iteration: learning with a a single trajectory. In IEEE ADPRL, pages 330–337, 2007. [4] D. P. Bertsekas and S.E. Shreve. Stochastic Optimal Control (The Discrete Time Case). Academic Press, New York, 1978. [5] D. Ernst, P. Geurts, and L. Wehenkel. Tree-based batch mode reinforcement learning. Journal of Machine Learning Research, 6:503–556, 2005. [6] R.S. Sutton and A.G. Barto. Reinforcement Learning: An Introduction. Bradford Book. MIT Press, 1998. [7] N. Cristianini and J. Shawe-Taylor. An introduction to support vector machines (and other kernel-based learning methods). Cambridge University Press, 2000. [8] J.A. Boyan and A.W. Moore. Generalization in reinforcement learning: Safely approximating the value function. In NIPS-7, pages 369–376, 1995. [9] P.L. Bartlett, P.M. Long, and R.C. Williamson. Fat-shattering and the learnability of real-valued functions. Journal of Computer and System Sciences, 52:434–452, 1996. [10] A.N. Kolmogorov and V.M. Tihomirov. -entropy and -capacity of sets in functional space. American Mathematical Society Translations, 17(2):277–364, 1961. [11] R. Munos and Cs. Szepesv´ ri. Finite time bounds for sampling based fitted value iteration. Technical a report, Computer and Automation Research Institute of the Hungarian Academy of Sciences, Kende u. 13-17, Budapest 1111, Hungary, 2006. [12] A.Y. Ng and M. Jordan. PEGASUS: A policy search method for large MDPs and POMDPs. In Proceedings of the 16th Conference in Uncertainty in Artificial Intelligence, pages 406–415, 2000. [13] P.L. Bartlett and A. Tewari. Sample complexity of policy search with known dynamics. In NIPS-19. MIT Press, 2007. [14] M. Anthony and P. L. Bartlett. Neural Network Learning: Theoretical Foundations. Cambridge University Press, 1999. [15] M. Riedmiller. Neural fitted Q iteration – first experiences with a data efficient neural reinforcement learning method. In 16th European Conference on Machine Learning, pages 317–328, 2005. [16] S. Kalyanakrishnan and P. Stone. Batch reinforcement learning in a complex domain. In AAMAS-07, 2007. 8

5 0.53768212 162 nips-2007-Random Sampling of States in Dynamic Programming

Author: Chris Atkeson, Benjamin Stephens

Abstract: We combine three threads of research on approximate dynamic programming: sparse random sampling of states, value function and policy approximation using local models, and using local trajectory optimizers to globally optimize a policy and associated value function. Our focus is on finding steady state policies for deterministic time invariant discrete time control problems with continuous states and actions often found in robotics. In this paper we show that we can now solve problems we couldn’t solve previously. 1

6 0.51567394 40 nips-2007-Bundle Methods for Machine Learning

7 0.50748807 34 nips-2007-Bayesian Policy Learning with Trans-Dimensional MCMC

8 0.50620908 102 nips-2007-Incremental Natural Actor-Critic Algorithms

9 0.50401527 168 nips-2007-Reinforcement Learning in Continuous Action Spaces through Sequential Monte Carlo Methods

10 0.46647674 148 nips-2007-Online Linear Regression and Its Application to Model-Based Reinforcement Learning

11 0.45799401 98 nips-2007-Hierarchical Apprenticeship Learning with Application to Quadruped Locomotion

12 0.40957958 52 nips-2007-Competition Adds Complexity

13 0.39846641 128 nips-2007-Message Passing for Max-weight Independent Set

14 0.39776191 163 nips-2007-Receding Horizon Differential Dynamic Programming

15 0.37146255 206 nips-2007-Topmoumoute Online Natural Gradient Algorithm

16 0.35830763 92 nips-2007-Fixing Max-Product: Convergent Message Passing Algorithms for MAP LP-Relaxations

17 0.35120907 63 nips-2007-Convex Relaxations of Latent Variable Training

18 0.35102716 22 nips-2007-Agreement-Based Learning

19 0.34812394 82 nips-2007-Estimating divergence functionals and the likelihood ratio by penalized convex risk minimization

20 0.34771749 66 nips-2007-Density Estimation under Independent Similarly Distributed Sampling Assumptions


similar papers computed by lda model

lda for this paper:

topicId topicWeight

[(5, 0.049), (13, 0.065), (16, 0.045), (18, 0.011), (21, 0.047), (31, 0.039), (34, 0.01), (35, 0.022), (44, 0.257), (47, 0.095), (49, 0.049), (53, 0.011), (56, 0.014), (83, 0.1), (85, 0.028), (90, 0.067)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.73556304 185 nips-2007-Stable Dual Dynamic Programming

Author: Tao Wang, Michael Bowling, Dale Schuurmans, Daniel J. Lizotte

Abstract: Recently, we have introduced a novel approach to dynamic programming and reinforcement learning that is based on maintaining explicit representations of stationary distributions instead of value functions. In this paper, we investigate the convergence properties of these dual algorithms both theoretically and empirically, and show how they can be scaled up by incorporating function approximation. 1

2 0.55249166 210 nips-2007-Unconstrained On-line Handwriting Recognition with Recurrent Neural Networks

Author: Alex Graves, Marcus Liwicki, Horst Bunke, Jürgen Schmidhuber, Santiago Fernández

Abstract: In online handwriting recognition the trajectory of the pen is recorded during writing. Although the trajectory provides a compact and complete representation of the written output, it is hard to transcribe directly, because each letter is spread over many pen locations. Most recognition systems therefore employ sophisticated preprocessing techniques to put the inputs into a more localised form. However these techniques require considerable human effort, and are specific to particular languages and alphabets. This paper describes a system capable of directly transcribing raw online handwriting data. The system consists of an advanced recurrent neural network with an output layer designed for sequence labelling, combined with a probabilistic language model. In experiments on an unconstrained online database, we record excellent results using either raw or preprocessed data, well outperforming a state-of-the-art HMM based system in both cases. 1

3 0.55048066 63 nips-2007-Convex Relaxations of Latent Variable Training

Author: Yuhong Guo, Dale Schuurmans

Abstract: We investigate a new, convex relaxation of an expectation-maximization (EM) variant that approximates a standard objective while eliminating local minima. First, a cautionary result is presented, showing that any convex relaxation of EM over hidden variables must give trivial results if any dependence on the missing values is retained. Although this appears to be a strong negative outcome, we then demonstrate how the problem can be bypassed by using equivalence relations instead of value assignments over hidden variables. In particular, we develop new algorithms for estimating exponential conditional models that only require equivalence relation information over the variable values. This reformulation leads to an exact expression for EM variants in a wide range of problems. We then develop a semidefinite relaxation that yields global training by eliminating local minima. 1

4 0.54942679 168 nips-2007-Reinforcement Learning in Continuous Action Spaces through Sequential Monte Carlo Methods

Author: Alessandro Lazaric, Marcello Restelli, Andrea Bonarini

Abstract: Learning in real-world domains often requires to deal with continuous state and action spaces. Although many solutions have been proposed to apply Reinforcement Learning algorithms to continuous state problems, the same techniques can be hardly extended to continuous action spaces, where, besides the computation of a good approximation of the value function, a fast method for the identification of the highest-valued action is needed. In this paper, we propose a novel actor-critic approach in which the policy of the actor is estimated through sequential Monte Carlo methods. The importance sampling step is performed on the basis of the values learned by the critic, while the resampling step modifies the actor’s policy. The proposed approach has been empirically compared to other learning algorithms into several domains; in this paper, we report results obtained in a control problem consisting of steering a boat across a river. 1

5 0.54842824 148 nips-2007-Online Linear Regression and Its Application to Model-Based Reinforcement Learning

Author: Alexander L. Strehl, Michael L. Littman

Abstract: We provide a provably efficient algorithm for learning Markov Decision Processes (MDPs) with continuous state and action spaces in the online setting. Specifically, we take a model-based approach and show that a special type of online linear regression allows us to learn MDPs with (possibly kernalized) linearly parameterized dynamics. This result builds on Kearns and Singh’s work that provides a provably efficient algorithm for finite state MDPs. Our approach is not restricted to the linear setting, and is applicable to other classes of continuous MDPs.

6 0.54758573 79 nips-2007-Efficient multiple hyperparameter learning for log-linear models

7 0.54658091 86 nips-2007-Exponential Family Predictive Representations of State

8 0.54461777 138 nips-2007-Near-Maximum Entropy Models for Binary Neural Representations of Natural Images

9 0.54419076 177 nips-2007-Simplified Rules and Theoretical Analysis for Information Bottleneck Optimization and PCA with Spiking Neurons

10 0.54349989 195 nips-2007-The Generalized FITC Approximation

11 0.54294634 140 nips-2007-Neural characterization in partially observed populations of spiking neurons

12 0.54156387 102 nips-2007-Incremental Natural Actor-Critic Algorithms

13 0.54102671 7 nips-2007-A Kernel Statistical Test of Independence

14 0.5402841 34 nips-2007-Bayesian Policy Learning with Trans-Dimensional MCMC

15 0.53999323 84 nips-2007-Expectation Maximization and Posterior Constraints

16 0.53839362 134 nips-2007-Multi-Task Learning via Conic Programming

17 0.53823996 115 nips-2007-Learning the 2-D Topology of Images

18 0.53823179 174 nips-2007-Selecting Observations against Adversarial Objectives

19 0.53753459 18 nips-2007-A probabilistic model for generating realistic lip movements from speech

20 0.53747761 100 nips-2007-Hippocampal Contributions to Control: The Third Way