nips nips2007 nips2007-91 knowledge-graph by maker-knowledge-mining

91 nips-2007-Fitted Q-iteration in continuous action-space MDPs


Source: pdf

Author: András Antos, Csaba Szepesvári, Rémi Munos

Abstract: We consider continuous state, continuous action batch reinforcement learning where the goal is to learn a good policy from a sufficiently rich trajectory generated by some policy. We study a variant of fitted Q-iteration, where the greedy action selection is replaced by searching for a policy in a restricted set of candidate policies by maximizing the average action values. We provide a rigorous analysis of this algorithm, proving what we believe is the first finite-time bound for value-function based algorithms for continuous state and action problems. 1 Preliminaries We will build on the results from [1, 2, 3] and for this reason we use the same notation as these papers. The unattributed results cited in this section can be found in the book [4]. A discounted MDP is defined by a quintuple (X , A, P, S, γ), where X is the (possible infinite) state space, A is the set of actions, P : X × A → M (X ) is the transition probability kernel with P (·|x, a) defining the next-state distribution upon taking action a from state x, S(·|x, a) gives the corresponding distribution of immediate rewards, and γ ∈ (0, 1) is the discount factor. Here X is a measurable space and M (X ) denotes the set of all probability measures over X . The Lebesguemeasure shall be denoted by λ. We start with the following mild assumption on the MDP: Assumption A1 (MDP Regularity) X is a compact subset of the dX -dimensional Euclidean space, ˆ A is a compact subset of [−A∞ , A∞ ]dA . The random immediate rewards are bounded by Rmax and that the expected immediate reward function, r(x, a) = rS(dr|x, a), is uniformly bounded by Rmax : r ∞ ≤ Rmax . A policy determines the next action given the past observations. Here we shall deal with stationary (Markovian) policies which choose an action in a stochastic way based on the last observation only. The value of a policy π when it is started from a state x is defined as the total expected discounted ∞ reward that is encountered while the policy is executed: V π (x) = Eπ [ t=0 γ t Rt |X0 = x]. Here Rt ∼ S(·|Xt , At ) is the reward received at time step t, the state, Xt , evolves according to Xt+1 ∼ ∗ Also with: Computer and Automation Research Inst. of the Hungarian Academy of Sciences Kende u. 13-17, Budapest 1111, Hungary. 1 P (·|Xt , At ), where At is sampled from the distribution determined by π. We use Qπ : X × A → R ∞ to denote the action-value function of policy π: Qπ (x, a) = Eπ [ t=0 γ t Rt |X0 = x, A0 = a]. The goal is to find a policy that attains the best possible values, V ∗ (x) = supπ V π (x), at all states ∗ x ∈ X . Here V ∗ is called the optimal value function and a policy π ∗ that satisfies V π (x) = ∗ ∗ ∗ V (x) for all x ∈ X is called optimal. The optimal action-value function Q (x, a) is Q (x, a) = supπ Qπ (x, a). We say that a (deterministic stationary) policy π is greedy w.r.t. an action-value function Q ∈ B(X × A), and we write π = π (·; Q), if, for all x ∈ X , π(x) ∈ argmaxa∈A Q(x, a). ˆ Under mild technical assumptions, such a greedy policy always exists. Any greedy policy w.r.t. Q∗ is optimal. For π : X → A we define its evaluation operator, T π : B(X × A) → B(X × A), by (T π Q)(x, a) = r(x, a) + γ X Q(y, π(y)) P (dy|x, a). It is known that Qπ = T π Qπ . Further, if we let the Bellman operator, T : B(X × A) → B(X × A), defined by (T Q)(x, a) = r(x, a) + γ X supb∈A Q(y, b) P (dy|x, a) then Q∗ = T Q∗ . It is known that V π and Qπ are bounded by Rmax /(1 − γ), just like Q∗ and V ∗ . For π : X → A, the operator E π : B(X × A) → B(X ) is defined by (E π Q)(x) = Q(x, π(x)), while E : B(X × A) → B(X ) is defined by (EQ)(x) = supa∈A Q(x, a). Throughout the paper F ⊂ {f : X × A → R} will denote a subset of real-valued functions over the state-action space X × A and Π ⊂ AX will be a set of policies. For ν ∈ M (X ) and f : X → R p measurable, we let (for p ≥ 1) f p,ν = X |f (x)|p ν(dx). We simply write f ν for f 2,ν . 2 Further, we extend · ν to F by f ν = A X |f |2 (x, a) dν(x) dλA (a), where λA is the uniform distribution over A. We shall use the shorthand notation νf to denote the integral f (x)ν(dx). We denote the space of bounded measurable functions with domain X by B(X ). Further, the space of measurable functions bounded by 0 < K < ∞ shall be denoted by B(X ; K). We let · ∞ denote the supremum norm. 2 Fitted Q-iteration with approximate policy maximization We assume that we are given a finite trajectory, {(Xt , At , Rt )}1≤t≤N , generated by some stochastic stationary policy πb , called the behavior policy: At ∼ πb (·|Xt ), Xt+1 ∼ P (·|Xt , At ), Rt ∼ def S(·|Xt , At ), where πb (·|x) is a density with π0 = inf (x,a)∈X ×A πb (a|x) > 0. The generic recipe for fitted Q-iteration (FQI) [5] is Qk+1 = Regress(Dk (Qk )), (1) where Regress is an appropriate regression procedure and Dk (Qk ) is a dataset defining a regression problem in the form of a list of data-point pairs: Dk (Qk ) = (Xt , At ), Rt + γ max Qk (Xt+1 , b) b∈A 1≤t≤N .1 Fitted Q-iteration can be viewed as approximate value iteration applied to action-value functions. To see this note that value iteration would assign the value (T Qk )(x, a) = r(x, a) + γ maxb∈A Qk (y, b) P (dy|x, a) to Qk+1 (x, a) [6]. Now, remember that the regression function for the jointly distributed random variables (Z, Y ) is defined by the conditional expectation of Y given Z: m(Z) = E [Y |Z]. Since for any fixed function Q, E [Rt + γ maxb∈A Q(Xt+1 , b)|Xt , At ] = (T Q)(Xt , At ), the regression function corresponding to the data Dk (Q) is indeed T Q and hence if FQI solved the regression problem defined by Qk exactly, it would simulate value iteration exactly. However, this argument itself does not directly lead to a rigorous analysis of FQI: Since Qk is obtained based on the data, it is itself a random function. Hence, after the first iteration, the “target” function in FQI becomes random. Furthermore, this function depends on the same data that is used to define the regression problem. Will FQI still work despite these issues? To illustrate the potential difficulties consider a dataset where X1 , . . . , XN is a sequence of independent random variables, which are all distributed uniformly at random in [0, 1]. Further, let M be a random integer greater than N which is independent of the dataset (Xt )N . Let U be another random variable, uniformly t=1 distributed in [0, 1]. Now define the regression problem by Yt = fM,U (Xt ), where fM,U (x) = sgn(sin(2M 2π(x + U ))). Then it is not hard to see that no matter how big N is, no procedure can 1 Since the designer controls Qk , we may assume that it is continuous, hence the maximum exists. 2 estimate the regression function fM,U with a small error (in expectation, or with high probability), even if the procedure could exploit the knowledge of the specific form of fM,U . On the other hand, if we restricted M to a finite range then the estimation problem could be solved successfully. The example shows that if the complexity of the random functions defining the regression problem is uncontrolled then successful estimation might be impossible. Amongst the many regression methods in this paper we have chosen to work with least-squares methods. In this case Equation (1) takes the form N Qk+1 = argmin Q∈F t=1 1 πb (At |Xt ) 2 Q(Xt , At ) − Rt + γ max Qk (Xt+1 , b) b∈A . (2) We call this method the least-squares fitted Q-iteration (LSFQI) method. Here we introduced the weighting 1/πb (At |Xt ) since we do not want to give more weight to those actions that are preferred by the behavior policy. Besides this weighting, the only parameter of the method is the function set F. This function set should be chosen carefully, to keep a balance between the representation power and the number of samples. As a specific example for F consider neural networks with some fixed architecture. In this case the function set is generated by assigning weights in all possible ways to the neural net. Then the above minimization becomes the problem of tuning the weights. Another example is to use linearly parameterized function approximation methods with appropriately selected basis functions. In this case the weight tuning problem would be less demanding. Yet another possibility is to let F be an appropriate restriction of a Reproducing Kernel Hilbert Space (e.g., in a ball). In this case the training procedure becomes similar to LS-SVM training [7]. As indicated above, the analysis of this algorithm is complicated by the fact that the new dataset is defined in terms of the previous iterate, which is already a function of the dataset. Another complication is that the samples in a trajectory are in general correlated and that the bias introduced by the imperfections of the approximation architecture may yield to an explosion of the error of the procedure, as documented in a number of cases in, e.g., [8]. Nevertheless, at least for finite action sets, the tools developed in [1, 3, 2] look suitable to show that under appropriate conditions these problems can be overcome if the function set is chosen in a judicious way. However, the results of these works would become essentially useless in the case of an infinite number of actions since these previous bounds grow to infinity with the number of actions. Actually, we believe that this is not an artifact of the proof techniques of these works, as suggested by the counterexample that involved random targets. The following result elaborates this point further: Proposition 2.1. Let F ⊂ B(X × A). Then even if the pseudo-dimension of F is finite, the fatshattering function of ∨ Fmax = VQ : VQ (·) = max Q(·, a), Q ∈ F a∈A 2 can be infinite over (0, 1/2). Without going into further details, let us just note that the finiteness of the fat-shattering function is a sufficient and necessary condition for learnability and the finiteness of the fat-shattering function is implied by the finiteness of the pseudo-dimension [9].The above proposition thus shows that without imposing further special conditions on F, the learning problem may become infeasible. One possibility is of course to discretize the action space, e.g., by using a uniform grid. However, if the action space has a really high dimensionality, this approach becomes unfeasible (even enumerating 2dA points could be impossible when dA is large). Therefore we prefer alternate solutions. Another possibility is to make the functions in F, e.g., uniformly Lipschitz in their state coordinates. ∨ Then the same property will hold for functions in Fmax and hence by a classical result we can bound the capacity of this set (cf. pp. 353–357 of [10]). One potential problem with this approach is that this way it might be difficult to get a fine control of the capacity of the resulting set. 2 The proof of this and the other results are given in the appendix, available in the extended version of this paper, downloadable from http://hal.inria.fr/inria-00185311/en/. 3 In the approach explored here we modify the fitted Q-iteration algorithm by introducing a policy set Π and a search over this set for an approximately greedy policy in a sense that will be made precise in a minute. Our algorithm thus has four parameters: F, Π, K, Q0 . Here F is as before, Π is a user-chosen set of policies (mappings from X to A), K is the number of iterations and Q0 is an initial value function (a typical choice is Q0 ≡ 0). The algorithm computes a sequence of iterates (Qk , πk ), k = 0, . . . , K, defined by the following equations: ˆ N π0 ˆ = argmax π∈Π Q0 (Xt , π(Xt )), t=1 N Qk+1 = argmin Q∈F t=1 1 Q(Xt , At ) − Rt + γQk (Xt+1 , πk (Xt+1 )) ˆ πb (At |Xt ) 2 , (3) N πk+1 ˆ = argmax π∈Π Qk+1 (Xt , π(Xt )). (4) t=1 Thus, (3) is similar to (2), while (4) defines the policy search problem. The policy search will generally be solved by a gradient procedure or some other appropriate method. The cost of this step will be primarily determined by how well-behaving the iterates Qk+1 are in their action arguments. For example, if they were quadratic and if π was linear then the problem would be a quadratic optimization problem. However, except for special cases3 the action value functions will be more complicated, in which case this step can be expensive. Still, this cost could be similar to that of searching for the maximizing actions for each t = 1, . . . , N if the approximately maximizing actions are similar across similar states. This algorithm, which we could also call a fitted actor-critic algorithm, will be shown to overcome the above mentioned complexity control problem provided that the complexity of Π is controlled appropriately. Indeed, in this case the set of possible regression problems is determined by the set ∨ FΠ = { V : V (·) = Q(·, π(·)), Q ∈ F, π ∈ Π } , ∨ and the proof will rely on controlling the complexity of FΠ by selecting F and Π appropriately. 3 3.1 The main theoretical result Outline of the analysis In order to gain some insight into the behavior of the algorithm, we provide a brief summary of its error analysis. The main result will be presented subsequently. For f ,Q ∈ F and a policy π, we define the tth TD-error as follows: dt (f ; Q, π) = Rt + γQ(Xt+1 , π(Xt+1 )) − f (Xt , At ). Further, we define the empirical loss function by 1 ˆ LN (f ; Q, π) = N N t=1 d2 (f ; Q, π) t , λ(A)πb (At |Xt ) where the normalization with λ(A) is introduced for mathematical convenience. Then (3) can be ˆ written compactly as Qk+1 = argminf ∈F LN (f ; Qk , πk ). ˆ ˆ The algorithm can then be motivated by the observation that for any f ,Q, and π, LN (f ; Q, π) is an unbiased estimate of def 2 L(f ; Q, π) = f − T π Q ν + L∗ (Q, π), (5) where the first term is the error we are interested in and the second term captures the variance of the random samples: L∗ (Q, π) = E [Var [R1 + γQ(X2 , π(X2 ))|X1 , A1 = a]] dλA (a). A 3 Linear quadratic regulation is such a nice case. It is interesting to note that in this special case the obvious choices for F and Π yield zero error in the limit, as can be proven based on the main result of this paper. 4 ˆ This result is stated formally by E LN (f ; Q, π) = L(f ; Q, π). Since the variance term in (5) is independent of f , argminf ∈F L(f ; Q, π) = 2 π argminf ∈F f − T Q ν . Thus, if πk were greedy w.r.t. Qk then argminf ∈F L(f ; Qk , πk ) = ˆ ˆ 2 argminf ∈F f − T Qk ν . Hence we can still think of the procedure as approximate value iteration over the space of action-value functions, projecting T Qk using empirical risk minimization on the space F w.r.t. · ν distances in an approximate manner. Since πk is only approximately greedy, we ˆ will have to deal with both the error coming from the approximate projection and the error coming from the choice of πk . To make this clear, we write the iteration in the form ˆ ˆ ˆ Qk+1 = T πk Qk + εk = T Qk + εk + (T πk Qk − T Qk ) = T Qk + εk , def ˆ ˆ where εk is the error committed while computing T πk Qk , εk = T πk Qk − T Qk is the error committed because the greedy policy is computed approximately and εk = εk + εk is the total error of step k. Hence, in order to show that the procedure is well behaved, one needs to show that both errors are controlled and that when the errors are propagated through these equations, the resulting error stays controlled, too. Since we are ultimately interested in the performance of the policy obtained, we will also need to show that small action-value approximation errors yield small performance losses. For these we need a number of assumptions that concern either the training data, the MDP, or the function sets used for learning. 3.2 Assumptions 3.2.1 Assumptions on the training data We shall assume that the data is rich, is in a steady state, and is fast-mixing, where, informally, mixing means that future depends weakly on the past. Assumption A2 (Sample Path Properties) Assume that {(Xt , At , Rt )}t=1,...,N is the sample path of πb , a stochastic stationary policy. Further, assume that {Xt } is strictly stationary (Xt ∼ ν ∈ M (X )) and exponentially β-mixing with the actual rate given by the parameters (β, b, κ).4 We further assume that the sampling policy πb satisfies π0 = inf (x,a)∈X ×A πb (a|x) > 0. The β-mixing property will be used to establish tail inequalities for certain empirical processes.5 Note that the mixing coefficients do not need to be known. In the case when no mixing condition is satisfied, learning might be impossible. To see this just consider the case when X1 = X2 = . . . = XN . Thus, in this case the learner has many copies of the same random variable and successful generalization is thus impossible. We believe that the assumption that the process is in a steady state is not essential for our result, as when the process reaches its steady state quickly then (at the price of a more involved proof) the result would still hold. 3.2.2 Assumptions on the MDP In order to prevent the uncontrolled growth of the errors as they are propagated through the updates, we shall need some assumptions on the MDP. A convenient assumption is the following one [11]: Assumption A3 (Uniformly stochastic transitions) For all x ∈ X and a ∈ A, assume that P (·|x, a) is absolutely continuous w.r.t. ν and the Radon-Nikodym derivative of P w.r.t. ν is bounded def < +∞. uniformly with bound Cν : Cν = supx∈X ,a∈A dP (·|x,a) dν ∞ Note that by the definition of measure differentiation, Assumption A3 means that P (·|x, a) ≤ Cν ν(·). This assumption essentially requires the transitions to be noisy. We will also prove (weaker) results under the following, weaker assumption: 4 For the definition of β-mixing, see e.g. [2]. We say “empirical process” and “empirical measure”, but note that in this work these are based on dependent (mixing) samples. 5 5 Assumption A4 (Discounted-average concentrability of future-state distributions) Given ρ, ν, m ≥ 1 and an arbitrary sequence of stationary policies {πm }m≥1 , assume that the futuredef state distribution ρP π1 P π2 . . . P πm is absolutely continuous w.r.t. ν. Assume that c(m) = π1 π2 πm def satisfies m≥1 mγ m−1 c(m) < +∞. We shall call Cρ,ν = supπ1 ,...,πm d(ρP Pdν ...P ) ∞ max (1 − γ)2 m≥1 mγ m−1 c(m), (1 − γ) m≥1 γ m c(m) the discounted-average concentrability coefficient of the future-state distributions. The number c(m) measures how much ρ can get amplified in m steps as compared to the reference distribution ν. Hence, in general we expect c(m) to grow with m. In fact, the condition that Cρ,µ is finite is a growth rate condition on c(m). Thanks to discounting, Cρ,µ is finite for a reasonably large class of systems (see the discussion in [11]). A related assumption is needed in the error analysis of the approximate greedy step of the algorithm: Assumption A5 (The random policy “makes no peak-states”) Consider the distribution µ = (ν × λA )P which is the distribution of a state that results from sampling an initial state according to ν and then executing an action which is selected uniformly at random.6 Then Γν = dµ/dν ∞ < +∞. Note that under Assumption A3 we have Γν ≤ Cν . This (very mild) assumption means that after one step, starting from ν and executing this random policy, the probability of the next state being in a set is upper bounded by Γν -times the probability of the starting state being in the same set. def Besides, we assume that A has the following regularity property: Let Py(a, h, ρ) = (a , v) ∈ RdA +1 : a − a 1 ≤ ρ, 0 ≤ v/h ≤ 1 − a − a 1 /ρ denote the pyramid with hight h and base given by the 1 def -ball B(a, ρ) = a ∈ RdA : a − a 1 ≤ρ centered at a. Assumption A6 (Regularity of the action space) We assume that there exists α > 0, such that for all a ∈ A, for all ρ > 0, λ(Py(a, 1, ρ) ∩ (A × R)) λ(A) ≥ min α, λ(Py(a, 1, ρ)) λ(B(a, ρ)) For example, if A is an 1 . -ball itself, then this assumption will be satisfied with α = 2−dA . Without assuming any smoothness of the MDP, learning in infinite MDPs looks hard (see, e.g., [12, 13]). Here we employ the following extra condition: Assumption A7 (Lipschitzness of the MDP in the actions) Assume that the transition probabilities and rewards are Lipschitz w.r.t. their action variable, i.e., there exists LP , Lr > 0 such that for all (x, a, a ) ∈ X × A × A and measurable set B of X , |P (B|x, a) − P (B|x, a )| ≤ LP a − a 1 , |r(x, a) − r(x, a )| ≤ Lr a − a 1 . Note that previously Lipschitzness w.r.t. the state variables was used, e.g., in [11] to construct consistent planning algorithms. 3.2.3 Assumptions on the function sets used by the algorithm These assumptions are less demanding since they are under the control of the user of the algorithm. However, the choice of these function sets will greatly influence the performance of the algorithm, as we shall see it from the bounds. The first assumption concerns the class F: Assumption A8 (Lipschitzness of candidate action-value functions) Assume F ⊂ B(X × A) and that any elements of F is uniformly Lipschitz in its action-argument in the sense that |Q(x, a) − Q(x, a )| ≤ LA a − a 1 holds for any x ∈ X , a,a ∈ A, and Q ∈ F . 6 Remember that λA denotes the uniform distribution over the action set A. 6 We shall also need to control the capacity of our function sets. We assume that the reader is familiar with the concept of VC-dimension.7 Here we use the pseudo-dimension of function sets that builds upon the concept of VC-dimension: Definition 3.1 (Pseudo-dimension). The pseudo-dimension VF + of F is defined as the VCdimension of the subgraphs of functions in F (hence it is also called the VC-subgraph dimension of F). Since A is multidimensional, we define VΠ+ to be the sum of the pseudo-dimensions of the coordinate projection spaces, Πk of Π: dA V Π+ = VΠ + , k=1 k Πk = { πk : X → R : π = (π1 , . . . , πk , . . . , πdA ) ∈ Π } . Now we are ready to state our assumptions on our function sets: Assumption A9 (Capacity of the function and policy sets) Assume that F ⊂ B(X × A; Qmax ) for Qmax > 0 and VF + < +∞. Also, A ⊂ [−A∞ , A∞ ]dA and VΠ+ < +∞. Besides their capacity, one shall also control the approximation power of the function sets involved. Let us first consider the policy set Π. Introduce e∗ (F, Π) = sup inf ν(EQ − E π Q). Q∈F π∈Π Note that inf π∈Π ν(EQ − E π Q) measures the quality of approximating νEQ by νE π Q. Hence, e∗ (F, Π) measures the worst-case approximation error of νEQ as Q is changed within F. This can be made small by choosing Π large. Another related quantity is the one-step Bellman-error of F w.r.t. Π. This is defined as follows: For a fixed policy π, the one-step Bellman-error of F w.r.t. T π is defined as E1 (F; π) = sup inf Q∈F Q ∈F Q − T πQ ν . Taking again a pessimistic approach, the one-step Bellman-error of F is defined as E1 (F, Π) = sup E1 (F; π). π∈Π Typically by increasing F, E1 (F, Π) can be made smaller (this is discussed at some length in [3]). However, it also holds for both Π and F that making them bigger will increase their capacity (pseudo-dimensions) which leads to an increase of the estimation errors. Hence, F and Π must be selected to balance the approximation and estimation errors, just like in supervised learning. 3.3 The main result Theorem 3.2. Let πK be a greedy policy w.r.t. QK , i.e. πK (x) ∈ argmaxa∈A QK (x, a). Then under Assumptions A1, A2, and A5–A9, for all δ > 0 we have with probability at least 1 − δ: given Assumption A3 (respectively A4), V ∗ − V πK ∞ (resp. V ∗ − V πK 1,ρ ), is bounded by    d 1+1 κ+1   A   4κ (log N + log(K/δ))  + γK , C E1 (F, Π) + e∗ (F, Π) + 1/4   N   where C depends on dA , VF + , (VΠ+ )dA , γ, κ, b, β, Cν (resp. Cρ,ν ), Γν , LA , LP ,Lr , α, λ(A), π0 , k=1 k κ+1 ˆ Qmax , Rmax , Rmax , and A∞ . In particular, C scales with V 4κ(dA +1) , where V = 2VF + + VΠ+ plays the role of the “combined effective” dimension of F and Π. 7 Readers not familiar with VC-dimension are suggested to consult a book, such as the one by Anthony and Bartlett [14]. 7 4 Discussion We have presented what we believe is the first finite-time bounds for continuous-state and actionspace RL that uses value functions. Further, this is the first analysis of fitted Q-iteration, an algorithm that has proved to be useful in a number of cases, even when used with non-averagers for which no previous theoretical analysis existed (e.g., [15, 16]). In fact, our main motivation was to show that there is a systematic way of making these algorithms work and to point at possible problem sources the same time. We discussed why it can be difficult to make these algorithms work in practice. We suggested that either the set of action-value candidates has to be carefully controlled (e.g., assuming uniform Lipschitzness w.r.t. the state variables), or a policy search step is needed, just like in actorcritic algorithms. The bound in this paper is similar in many respects to a previous bound of a Bellman-residual minimization algorithm [2]. It looks that the techniques developed here can be used to obtain results for that algorithm when it is applied to continuous action spaces. Finally, although we have not explored them here, consistency results for FQI can be obtained from our results using standard methods, like the methods of sieves. We believe that the methods developed here will eventually lead to algorithms where the function approximation methods are chosen based on the data (similar to adaptive regression methods) so as to optimize performance, which in our opinion is one of the biggest open questions in RL. Currently we are exploring this possibility. Acknowledgments Andr´ s Antos would like to acknowledge support for this project from the Hungarian Academy of Sciences a (Bolyai Fellowship). Csaba Szepesv´ ri greatly acknowledges the support received from the Alberta Ingenuity a Fund, NSERC, the Computer and Automation Research Institute of the Hungarian Academy of Sciences. References [1] A. Antos, Cs. Szepesv´ ri, and R. Munos. Learning near-optimal policies with Bellman-residual minia mization based fitted policy iteration and a single sample path. In COLT-19, pages 574–588, 2006. [2] A. Antos, Cs. Szepesv´ ri, and R. Munos. Learning near-optimal policies with Bellman-residual minia mization based fitted policy iteration and a single sample path. Machine Learning, 2007. (accepted). [3] A. Antos, Cs. Szepesv´ ri, and R. Munos. Value-iteration based fitted policy iteration: learning with a a single trajectory. In IEEE ADPRL, pages 330–337, 2007. [4] D. P. Bertsekas and S.E. Shreve. Stochastic Optimal Control (The Discrete Time Case). Academic Press, New York, 1978. [5] D. Ernst, P. Geurts, and L. Wehenkel. Tree-based batch mode reinforcement learning. Journal of Machine Learning Research, 6:503–556, 2005. [6] R.S. Sutton and A.G. Barto. Reinforcement Learning: An Introduction. Bradford Book. MIT Press, 1998. [7] N. Cristianini and J. Shawe-Taylor. An introduction to support vector machines (and other kernel-based learning methods). Cambridge University Press, 2000. [8] J.A. Boyan and A.W. Moore. Generalization in reinforcement learning: Safely approximating the value function. In NIPS-7, pages 369–376, 1995. [9] P.L. Bartlett, P.M. Long, and R.C. Williamson. Fat-shattering and the learnability of real-valued functions. Journal of Computer and System Sciences, 52:434–452, 1996. [10] A.N. Kolmogorov and V.M. Tihomirov. -entropy and -capacity of sets in functional space. American Mathematical Society Translations, 17(2):277–364, 1961. [11] R. Munos and Cs. Szepesv´ ri. Finite time bounds for sampling based fitted value iteration. Technical a report, Computer and Automation Research Institute of the Hungarian Academy of Sciences, Kende u. 13-17, Budapest 1111, Hungary, 2006. [12] A.Y. Ng and M. Jordan. PEGASUS: A policy search method for large MDPs and POMDPs. In Proceedings of the 16th Conference in Uncertainty in Artificial Intelligence, pages 406–415, 2000. [13] P.L. Bartlett and A. Tewari. Sample complexity of policy search with known dynamics. In NIPS-19. MIT Press, 2007. [14] M. Anthony and P. L. Bartlett. Neural Network Learning: Theoretical Foundations. Cambridge University Press, 1999. [15] M. Riedmiller. Neural fitted Q iteration – first experiences with a data efficient neural reinforcement learning method. In 16th European Conference on Machine Learning, pages 317–328, 2005. [16] S. Kalyanakrishnan and P. Stone. Batch reinforcement learning in a complex domain. In AAMAS-07, 2007. 8

Reference: text


Summary: the most important sentenses genereted by tfidf model

sentIndex sentText sentNum sentScore

1 Fitted Q-iteration in continuous action-space MDPs Andr´ s Antos a Computer and Automation Research Inst. [sent-1, score-0.05]

2 ca Abstract We consider continuous state, continuous action batch reinforcement learning where the goal is to learn a good policy from a sufficiently rich trajectory generated by some policy. [sent-8, score-0.706]

3 We study a variant of fitted Q-iteration, where the greedy action selection is replaced by searching for a policy in a restricted set of candidate policies by maximizing the average action values. [sent-9, score-0.778]

4 We provide a rigorous analysis of this algorithm, proving what we believe is the first finite-time bound for value-function based algorithms for continuous state and action problems. [sent-10, score-0.281]

5 Here X is a measurable space and M (X ) denotes the set of all probability measures over X . [sent-14, score-0.068]

6 We start with the following mild assumption on the MDP: Assumption A1 (MDP Regularity) X is a compact subset of the dX -dimensional Euclidean space, ˆ A is a compact subset of [−A∞ , A∞ ]dA . [sent-16, score-0.124]

7 The random immediate rewards are bounded by Rmax and that the expected immediate reward function, r(x, a) = rS(dr|x, a), is uniformly bounded by Rmax : r ∞ ≤ Rmax . [sent-17, score-0.256]

8 A policy determines the next action given the past observations. [sent-18, score-0.477]

9 Here we shall deal with stationary (Markovian) policies which choose an action in a stochastic way based on the last observation only. [sent-19, score-0.37]

10 The value of a policy π when it is started from a state x is defined as the total expected discounted ∞ reward that is encountered while the policy is executed: V π (x) = Eπ [ t=0 γ t Rt |X0 = x]. [sent-20, score-0.75]

11 We use Qπ : X × A → R ∞ to denote the action-value function of policy π: Qπ (x, a) = Eπ [ t=0 γ t Rt |X0 = x, A0 = a]. [sent-25, score-0.334]

12 The goal is to find a policy that attains the best possible values, V ∗ (x) = supπ V π (x), at all states ∗ x ∈ X . [sent-26, score-0.334]

13 Here V ∗ is called the optimal value function and a policy π ∗ that satisfies V π (x) = ∗ ∗ ∗ V (x) for all x ∈ X is called optimal. [sent-27, score-0.334]

14 We say that a (deterministic stationary) policy π is greedy w. [sent-29, score-0.422]

15 ˆ Under mild technical assumptions, such a greedy policy always exists. [sent-33, score-0.469]

16 It is known that V π and Qπ are bounded by Rmax /(1 − γ), just like Q∗ and V ∗ . [sent-41, score-0.04]

17 We shall use the shorthand notation νf to denote the integral f (x)ν(dx). [sent-47, score-0.106]

18 We denote the space of bounded measurable functions with domain X by B(X ). [sent-48, score-0.108]

19 Further, the space of measurable functions bounded by 0 < K < ∞ shall be denoted by B(X ; K). [sent-49, score-0.214]

20 1 Fitted Q-iteration can be viewed as approximate value iteration applied to action-value functions. [sent-53, score-0.051]

21 To see this note that value iteration would assign the value (T Qk )(x, a) = r(x, a) + γ maxb∈A Qk (y, b) P (dy|x, a) to Qk+1 (x, a) [6]. [sent-54, score-0.051]

22 Now, remember that the regression function for the jointly distributed random variables (Z, Y ) is defined by the conditional expectation of Y given Z: m(Z) = E [Y |Z]. [sent-55, score-0.094]

23 Since for any fixed function Q, E [Rt + γ maxb∈A Q(Xt+1 , b)|Xt , At ] = (T Q)(Xt , At ), the regression function corresponding to the data Dk (Q) is indeed T Q and hence if FQI solved the regression problem defined by Qk exactly, it would simulate value iteration exactly. [sent-56, score-0.203]

24 Furthermore, this function depends on the same data that is used to define the regression problem. [sent-59, score-0.061]

25 , XN is a sequence of independent random variables, which are all distributed uniformly at random in [0, 1]. [sent-64, score-0.051]

26 Let U be another random variable, uniformly t=1 distributed in [0, 1]. [sent-66, score-0.051]

27 Now define the regression problem by Yt = fM,U (Xt ), where fM,U (x) = sgn(sin(2M 2π(x + U ))). [sent-67, score-0.061]

28 2 estimate the regression function fM,U with a small error (in expectation, or with high probability), even if the procedure could exploit the knowledge of the specific form of fM,U . [sent-69, score-0.09]

29 The example shows that if the complexity of the random functions defining the regression problem is uncontrolled then successful estimation might be impossible. [sent-71, score-0.097]

30 Amongst the many regression methods in this paper we have chosen to work with least-squares methods. [sent-72, score-0.061]

31 Here we introduced the weighting 1/πb (At |Xt ) since we do not want to give more weight to those actions that are preferred by the behavior policy. [sent-75, score-0.058]

32 Another complication is that the samples in a trajectory are in general correlated and that the bias introduced by the imperfections of the approximation architecture may yield to an explosion of the error of the procedure, as documented in a number of cases in, e. [sent-88, score-0.063]

33 Nevertheless, at least for finite action sets, the tools developed in [1, 3, 2] look suitable to show that under appropriate conditions these problems can be overcome if the function set is chosen in a judicious way. [sent-91, score-0.143]

34 However, the results of these works would become essentially useless in the case of an infinite number of actions since these previous bounds grow to infinity with the number of actions. [sent-92, score-0.058]

35 Without going into further details, let us just note that the finiteness of the fat-shattering function is a sufficient and necessary condition for learnability and the finiteness of the fat-shattering function is implied by the finiteness of the pseudo-dimension [9]. [sent-98, score-0.039]

36 One possibility is of course to discretize the action space, e. [sent-100, score-0.143]

37 However, if the action space has a really high dimensionality, this approach becomes unfeasible (even enumerating 2dA points could be impossible when dA is large). [sent-103, score-0.143]

38 ∨ Then the same property will hold for functions in Fmax and hence by a classical result we can bound the capacity of this set (cf. [sent-108, score-0.09]

39 One potential problem with this approach is that this way it might be difficult to get a fine control of the capacity of the resulting set. [sent-111, score-0.089]

40 3 In the approach explored here we modify the fitted Q-iteration algorithm by introducing a policy set Π and a search over this set for an approximately greedy policy in a sense that will be made precise in a minute. [sent-115, score-0.785]

41 Here F is as before, Π is a user-chosen set of policies (mappings from X to A), K is the number of iterations and Q0 is an initial value function (a typical choice is Q0 ≡ 0). [sent-117, score-0.07]

42 (4) t=1 Thus, (3) is similar to (2), while (4) defines the policy search problem. [sent-122, score-0.363]

43 The policy search will generally be solved by a gradient procedure or some other appropriate method. [sent-123, score-0.363]

44 The cost of this step will be primarily determined by how well-behaving the iterates Qk+1 are in their action arguments. [sent-124, score-0.143]

45 However, except for special cases3 the action value functions will be more complicated, in which case this step can be expensive. [sent-126, score-0.143]

46 Still, this cost could be similar to that of searching for the maximizing actions for each t = 1, . [sent-127, score-0.058]

47 , N if the approximately maximizing actions are similar across similar states. [sent-130, score-0.058]

48 This algorithm, which we could also call a fitted actor-critic algorithm, will be shown to overcome the above mentioned complexity control problem provided that the complexity of Π is controlled appropriately. [sent-131, score-0.07]

49 Indeed, in this case the set of possible regression problems is determined by the set ∨ FΠ = { V : V (·) = Q(·, π(·)), Q ∈ F, π ∈ Π } , ∨ and the proof will rely on controlling the complexity of FΠ by selecting F and Π appropriately. [sent-132, score-0.061]

50 For f ,Q ∈ F and a policy π, we define the tth TD-error as follows: dt (f ; Q, π) = Rt + γQ(Xt+1 , π(Xt+1 )) − f (Xt , At ). [sent-136, score-0.334]

51 Then (3) can be ˆ written compactly as Qk+1 = argminf ∈F LN (f ; Qk , πk ). [sent-138, score-0.107]

52 Since the variance term in (5) is independent of f , argminf ∈F L(f ; Q, π) = 2 π argminf ∈F f − T Q ν . [sent-143, score-0.214]

53 Qk then argminf ∈F L(f ; Qk , πk ) = ˆ ˆ 2 argminf ∈F f − T Qk ν . [sent-147, score-0.214]

54 Hence we can still think of the procedure as approximate value iteration over the space of action-value functions, projecting T Qk using empirical risk minimization on the space F w. [sent-148, score-0.051]

55 Since πk is only approximately greedy, we ˆ will have to deal with both the error coming from the approximate projection and the error coming from the choice of πk . [sent-152, score-0.116]

56 Hence, in order to show that the procedure is well behaved, one needs to show that both errors are controlled and that when the errors are propagated through these equations, the resulting error stays controlled, too. [sent-154, score-0.161]

57 Since we are ultimately interested in the performance of the policy obtained, we will also need to show that small action-value approximation errors yield small performance losses. [sent-155, score-0.363]

58 For these we need a number of assumptions that concern either the training data, the MDP, or the function sets used for learning. [sent-156, score-0.045]

59 1 Assumptions on the training data We shall assume that the data is rich, is in a steady state, and is fast-mixing, where, informally, mixing means that future depends weakly on the past. [sent-160, score-0.186]

60 ,N is the sample path of πb , a stochastic stationary policy. [sent-164, score-0.051]

61 Further, assume that {Xt } is strictly stationary (Xt ∼ ν ∈ M (X )) and exponentially β-mixing with the actual rate given by the parameters (β, b, κ). [sent-165, score-0.051]

62 4 We further assume that the sampling policy πb satisfies π0 = inf (x,a)∈X ×A πb (a|x) > 0. [sent-166, score-0.389]

63 5 Note that the mixing coefficients do not need to be known. [sent-168, score-0.038]

64 We believe that the assumption that the process is in a steady state is not essential for our result, as when the process reaches its steady state quickly then (at the price of a more involved proof) the result would still hold. [sent-175, score-0.303]

65 2 Assumptions on the MDP In order to prevent the uncontrolled growth of the errors as they are propagated through the updates, we shall need some assumptions on the MDP. [sent-178, score-0.279]

66 A convenient assumption is the following one [11]: Assumption A3 (Uniformly stochastic transitions) For all x ∈ X and a ∈ A, assume that P (·|x, a) is absolutely continuous w. [sent-179, score-0.157]

67 uniformly with bound Cν : Cν = supx∈X ,a∈A dP (·|x,a) dν ∞ Note that by the definition of measure differentiation, Assumption A3 means that P (·|x, a) ≤ Cν ν(·). [sent-186, score-0.051]

68 This assumption essentially requires the transitions to be noisy. [sent-187, score-0.077]

69 5 5 Assumption A4 (Discounted-average concentrability of future-state distributions) Given ρ, ν, m ≥ 1 and an arbitrary sequence of stationary policies {πm }m≥1 , assume that the futuredef state distribution ρP π1 P π2 . [sent-192, score-0.224]

70 Assume that c(m) = π1 π2 πm def satisfies m≥1 mγ m−1 c(m) < +∞. [sent-199, score-0.098]

71 P ) ∞ max (1 − γ)2 m≥1 mγ m−1 c(m), (1 − γ) m≥1 γ m c(m) the discounted-average concentrability coefficient of the future-state distributions. [sent-206, score-0.049]

72 This (very mild) assumption means that after one step, starting from ν and executing this random policy, the probability of the next state being in a set is upper bounded by Γν -times the probability of the starting state being in the same set. [sent-214, score-0.256]

73 def Besides, we assume that A has the following regularity property: Let Py(a, h, ρ) = (a , v) ∈ RdA +1 : a − a 1 ≤ ρ, 0 ≤ v/h ≤ 1 − a − a 1 /ρ denote the pyramid with hight h and base given by the 1 def -ball B(a, ρ) = a ∈ RdA : a − a 1 ≤ρ centered at a. [sent-215, score-0.248]

74 Assumption A6 (Regularity of the action space) We assume that there exists α > 0, such that for all a ∈ A, for all ρ > 0, λ(Py(a, 1, ρ) ∩ (A × R)) λ(A) ≥ min α, λ(Py(a, 1, ρ)) λ(B(a, ρ)) For example, if A is an 1 . [sent-216, score-0.143]

75 -ball itself, then this assumption will be satisfied with α = 2−dA . [sent-217, score-0.077]

76 , there exists LP , Lr > 0 such that for all (x, a, a ) ∈ X × A × A and measurable set B of X , |P (B|x, a) − P (B|x, a )| ≤ LP a − a 1 , |r(x, a) − r(x, a )| ≤ Lr a − a 1 . [sent-226, score-0.068]

77 3 Assumptions on the function sets used by the algorithm These assumptions are less demanding since they are under the control of the user of the algorithm. [sent-235, score-0.074]

78 However, the choice of these function sets will greatly influence the performance of the algorithm, as we shall see it from the bounds. [sent-236, score-0.106]

79 The first assumption concerns the class F: Assumption A8 (Lipschitzness of candidate action-value functions) Assume F ⊂ B(X × A) and that any elements of F is uniformly Lipschitz in its action-argument in the sense that |Q(x, a) − Q(x, a )| ≤ LA a − a 1 holds for any x ∈ X , a,a ∈ A, and Q ∈ F . [sent-237, score-0.128]

80 6 Remember that λA denotes the uniform distribution over the action set A. [sent-238, score-0.143]

81 6 We shall also need to control the capacity of our function sets. [sent-239, score-0.195]

82 Now we are ready to state our assumptions on our function sets: Assumption A9 (Capacity of the function and policy sets) Assume that F ⊂ B(X × A; Qmax ) for Qmax > 0 and VF + < +∞. [sent-251, score-0.433]

83 Besides their capacity, one shall also control the approximation power of the function sets involved. [sent-253, score-0.135]

84 Q∈F π∈Π Note that inf π∈Π ν(EQ − E π Q) measures the quality of approximating νEQ by νE π Q. [sent-256, score-0.055]

85 This is defined as follows: For a fixed policy π, the one-step Bellman-error of F w. [sent-263, score-0.334]

86 T π is defined as E1 (F; π) = sup inf Q∈F Q ∈F Q − T πQ ν . [sent-266, score-0.113]

87 Taking again a pessimistic approach, the one-step Bellman-error of F is defined as E1 (F, Π) = sup E1 (F; π). [sent-267, score-0.058]

88 However, it also holds for both Π and F that making them bigger will increase their capacity (pseudo-dimensions) which leads to an increase of the estimation errors. [sent-269, score-0.06]

89 V ∗ − V πK 1,ρ ), is bounded by    d 1+1 κ+1   A   4κ (log N + log(K/δ))  + γK , C E1 (F, Π) + e∗ (F, Π) + 1/4   N   where C depends on dA , VF + , (VΠ+ )dA , γ, κ, b, β, Cν (resp. [sent-281, score-0.04]

90 We suggested that either the set of action-value candidates has to be carefully controlled (e. [sent-291, score-0.041]

91 the state variables), or a policy search step is needed, just like in actorcritic algorithms. [sent-296, score-0.417]

92 It looks that the techniques developed here can be used to obtain results for that algorithm when it is applied to continuous action spaces. [sent-298, score-0.221]

93 We believe that the methods developed here will eventually lead to algorithms where the function approximation methods are chosen based on the data (similar to adaptive regression methods) so as to optimize performance, which in our opinion is one of the biggest open questions in RL. [sent-300, score-0.095]

94 Csaba Szepesv´ ri greatly acknowledges the support received from the Alberta Ingenuity a Fund, NSERC, the Computer and Automation Research Institute of the Hungarian Academy of Sciences. [sent-303, score-0.054]

95 Learning near-optimal policies with Bellman-residual minia mization based fitted policy iteration and a single sample path. [sent-308, score-0.543]

96 Learning near-optimal policies with Bellman-residual minia mization based fitted policy iteration and a single sample path. [sent-314, score-0.543]

97 Value-iteration based fitted policy iteration: learning with a a single trajectory. [sent-321, score-0.334]

98 Generalization in reinforcement learning: Safely approximating the value function. [sent-354, score-0.057]

99 PEGASUS: A policy search method for large MDPs and POMDPs. [sent-382, score-0.363]

100 Neural fitted Q iteration – first experiences with a data efficient neural reinforcement learning method. [sent-399, score-0.108]


similar papers computed by tfidf model

tfidf for this paper:

wordName wordTfidf (topN-words)

[('qk', 0.601), ('policy', 0.334), ('xt', 0.269), ('fqi', 0.147), ('action', 0.143), ('tted', 0.131), ('antos', 0.128), ('da', 0.122), ('rmax', 0.117), ('szepesv', 0.117), ('rt', 0.117), ('hungarian', 0.107), ('argminf', 0.107), ('shall', 0.106), ('lipschitzness', 0.098), ('def', 0.098), ('greedy', 0.088), ('mdp', 0.085), ('assumption', 0.077), ('fitted', 0.074), ('kende', 0.074), ('qmax', 0.074), ('policies', 0.07), ('measurable', 0.068), ('academy', 0.068), ('budapest', 0.064), ('eq', 0.062), ('regression', 0.061), ('capacity', 0.06), ('actions', 0.058), ('sup', 0.058), ('reinforcement', 0.057), ('automation', 0.056), ('inf', 0.055), ('vf', 0.055), ('ri', 0.054), ('state', 0.054), ('dk', 0.053), ('regularity', 0.052), ('stationary', 0.051), ('uniformly', 0.051), ('iteration', 0.051), ('continuous', 0.05), ('concentrability', 0.049), ('maxb', 0.049), ('minia', 0.049), ('regress', 0.049), ('niteness', 0.049), ('py', 0.047), ('mild', 0.047), ('assumptions', 0.045), ('dy', 0.043), ('rda', 0.043), ('andr', 0.043), ('anthony', 0.043), ('argmaxa', 0.043), ('csaba', 0.043), ('hungary', 0.043), ('steady', 0.042), ('controlled', 0.041), ('lipschitz', 0.041), ('mdps', 0.041), ('bounded', 0.04), ('learnability', 0.039), ('committed', 0.039), ('fmax', 0.039), ('vq', 0.039), ('mization', 0.039), ('munos', 0.039), ('batch', 0.038), ('mixing', 0.038), ('uncontrolled', 0.036), ('sciences', 0.036), ('bartlett', 0.036), ('lp', 0.035), ('lr', 0.034), ('trajectory', 0.034), ('believe', 0.034), ('rewards', 0.033), ('besides', 0.033), ('remember', 0.033), ('propagated', 0.033), ('ln', 0.033), ('dx', 0.032), ('immediate', 0.032), ('executing', 0.031), ('book', 0.03), ('absolutely', 0.03), ('growth', 0.03), ('hence', 0.03), ('error', 0.029), ('search', 0.029), ('errors', 0.029), ('coming', 0.029), ('la', 0.029), ('alberta', 0.029), ('operator', 0.029), ('control', 0.029), ('reward', 0.028), ('looks', 0.028)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.99999994 91 nips-2007-Fitted Q-iteration in continuous action-space MDPs

Author: András Antos, Csaba Szepesvári, Rémi Munos

Abstract: We consider continuous state, continuous action batch reinforcement learning where the goal is to learn a good policy from a sufficiently rich trajectory generated by some policy. We study a variant of fitted Q-iteration, where the greedy action selection is replaced by searching for a policy in a restricted set of candidate policies by maximizing the average action values. We provide a rigorous analysis of this algorithm, proving what we believe is the first finite-time bound for value-function based algorithms for continuous state and action problems. 1 Preliminaries We will build on the results from [1, 2, 3] and for this reason we use the same notation as these papers. The unattributed results cited in this section can be found in the book [4]. A discounted MDP is defined by a quintuple (X , A, P, S, γ), where X is the (possible infinite) state space, A is the set of actions, P : X × A → M (X ) is the transition probability kernel with P (·|x, a) defining the next-state distribution upon taking action a from state x, S(·|x, a) gives the corresponding distribution of immediate rewards, and γ ∈ (0, 1) is the discount factor. Here X is a measurable space and M (X ) denotes the set of all probability measures over X . The Lebesguemeasure shall be denoted by λ. We start with the following mild assumption on the MDP: Assumption A1 (MDP Regularity) X is a compact subset of the dX -dimensional Euclidean space, ˆ A is a compact subset of [−A∞ , A∞ ]dA . The random immediate rewards are bounded by Rmax and that the expected immediate reward function, r(x, a) = rS(dr|x, a), is uniformly bounded by Rmax : r ∞ ≤ Rmax . A policy determines the next action given the past observations. Here we shall deal with stationary (Markovian) policies which choose an action in a stochastic way based on the last observation only. The value of a policy π when it is started from a state x is defined as the total expected discounted ∞ reward that is encountered while the policy is executed: V π (x) = Eπ [ t=0 γ t Rt |X0 = x]. Here Rt ∼ S(·|Xt , At ) is the reward received at time step t, the state, Xt , evolves according to Xt+1 ∼ ∗ Also with: Computer and Automation Research Inst. of the Hungarian Academy of Sciences Kende u. 13-17, Budapest 1111, Hungary. 1 P (·|Xt , At ), where At is sampled from the distribution determined by π. We use Qπ : X × A → R ∞ to denote the action-value function of policy π: Qπ (x, a) = Eπ [ t=0 γ t Rt |X0 = x, A0 = a]. The goal is to find a policy that attains the best possible values, V ∗ (x) = supπ V π (x), at all states ∗ x ∈ X . Here V ∗ is called the optimal value function and a policy π ∗ that satisfies V π (x) = ∗ ∗ ∗ V (x) for all x ∈ X is called optimal. The optimal action-value function Q (x, a) is Q (x, a) = supπ Qπ (x, a). We say that a (deterministic stationary) policy π is greedy w.r.t. an action-value function Q ∈ B(X × A), and we write π = π (·; Q), if, for all x ∈ X , π(x) ∈ argmaxa∈A Q(x, a). ˆ Under mild technical assumptions, such a greedy policy always exists. Any greedy policy w.r.t. Q∗ is optimal. For π : X → A we define its evaluation operator, T π : B(X × A) → B(X × A), by (T π Q)(x, a) = r(x, a) + γ X Q(y, π(y)) P (dy|x, a). It is known that Qπ = T π Qπ . Further, if we let the Bellman operator, T : B(X × A) → B(X × A), defined by (T Q)(x, a) = r(x, a) + γ X supb∈A Q(y, b) P (dy|x, a) then Q∗ = T Q∗ . It is known that V π and Qπ are bounded by Rmax /(1 − γ), just like Q∗ and V ∗ . For π : X → A, the operator E π : B(X × A) → B(X ) is defined by (E π Q)(x) = Q(x, π(x)), while E : B(X × A) → B(X ) is defined by (EQ)(x) = supa∈A Q(x, a). Throughout the paper F ⊂ {f : X × A → R} will denote a subset of real-valued functions over the state-action space X × A and Π ⊂ AX will be a set of policies. For ν ∈ M (X ) and f : X → R p measurable, we let (for p ≥ 1) f p,ν = X |f (x)|p ν(dx). We simply write f ν for f 2,ν . 2 Further, we extend · ν to F by f ν = A X |f |2 (x, a) dν(x) dλA (a), where λA is the uniform distribution over A. We shall use the shorthand notation νf to denote the integral f (x)ν(dx). We denote the space of bounded measurable functions with domain X by B(X ). Further, the space of measurable functions bounded by 0 < K < ∞ shall be denoted by B(X ; K). We let · ∞ denote the supremum norm. 2 Fitted Q-iteration with approximate policy maximization We assume that we are given a finite trajectory, {(Xt , At , Rt )}1≤t≤N , generated by some stochastic stationary policy πb , called the behavior policy: At ∼ πb (·|Xt ), Xt+1 ∼ P (·|Xt , At ), Rt ∼ def S(·|Xt , At ), where πb (·|x) is a density with π0 = inf (x,a)∈X ×A πb (a|x) > 0. The generic recipe for fitted Q-iteration (FQI) [5] is Qk+1 = Regress(Dk (Qk )), (1) where Regress is an appropriate regression procedure and Dk (Qk ) is a dataset defining a regression problem in the form of a list of data-point pairs: Dk (Qk ) = (Xt , At ), Rt + γ max Qk (Xt+1 , b) b∈A 1≤t≤N .1 Fitted Q-iteration can be viewed as approximate value iteration applied to action-value functions. To see this note that value iteration would assign the value (T Qk )(x, a) = r(x, a) + γ maxb∈A Qk (y, b) P (dy|x, a) to Qk+1 (x, a) [6]. Now, remember that the regression function for the jointly distributed random variables (Z, Y ) is defined by the conditional expectation of Y given Z: m(Z) = E [Y |Z]. Since for any fixed function Q, E [Rt + γ maxb∈A Q(Xt+1 , b)|Xt , At ] = (T Q)(Xt , At ), the regression function corresponding to the data Dk (Q) is indeed T Q and hence if FQI solved the regression problem defined by Qk exactly, it would simulate value iteration exactly. However, this argument itself does not directly lead to a rigorous analysis of FQI: Since Qk is obtained based on the data, it is itself a random function. Hence, after the first iteration, the “target” function in FQI becomes random. Furthermore, this function depends on the same data that is used to define the regression problem. Will FQI still work despite these issues? To illustrate the potential difficulties consider a dataset where X1 , . . . , XN is a sequence of independent random variables, which are all distributed uniformly at random in [0, 1]. Further, let M be a random integer greater than N which is independent of the dataset (Xt )N . Let U be another random variable, uniformly t=1 distributed in [0, 1]. Now define the regression problem by Yt = fM,U (Xt ), where fM,U (x) = sgn(sin(2M 2π(x + U ))). Then it is not hard to see that no matter how big N is, no procedure can 1 Since the designer controls Qk , we may assume that it is continuous, hence the maximum exists. 2 estimate the regression function fM,U with a small error (in expectation, or with high probability), even if the procedure could exploit the knowledge of the specific form of fM,U . On the other hand, if we restricted M to a finite range then the estimation problem could be solved successfully. The example shows that if the complexity of the random functions defining the regression problem is uncontrolled then successful estimation might be impossible. Amongst the many regression methods in this paper we have chosen to work with least-squares methods. In this case Equation (1) takes the form N Qk+1 = argmin Q∈F t=1 1 πb (At |Xt ) 2 Q(Xt , At ) − Rt + γ max Qk (Xt+1 , b) b∈A . (2) We call this method the least-squares fitted Q-iteration (LSFQI) method. Here we introduced the weighting 1/πb (At |Xt ) since we do not want to give more weight to those actions that are preferred by the behavior policy. Besides this weighting, the only parameter of the method is the function set F. This function set should be chosen carefully, to keep a balance between the representation power and the number of samples. As a specific example for F consider neural networks with some fixed architecture. In this case the function set is generated by assigning weights in all possible ways to the neural net. Then the above minimization becomes the problem of tuning the weights. Another example is to use linearly parameterized function approximation methods with appropriately selected basis functions. In this case the weight tuning problem would be less demanding. Yet another possibility is to let F be an appropriate restriction of a Reproducing Kernel Hilbert Space (e.g., in a ball). In this case the training procedure becomes similar to LS-SVM training [7]. As indicated above, the analysis of this algorithm is complicated by the fact that the new dataset is defined in terms of the previous iterate, which is already a function of the dataset. Another complication is that the samples in a trajectory are in general correlated and that the bias introduced by the imperfections of the approximation architecture may yield to an explosion of the error of the procedure, as documented in a number of cases in, e.g., [8]. Nevertheless, at least for finite action sets, the tools developed in [1, 3, 2] look suitable to show that under appropriate conditions these problems can be overcome if the function set is chosen in a judicious way. However, the results of these works would become essentially useless in the case of an infinite number of actions since these previous bounds grow to infinity with the number of actions. Actually, we believe that this is not an artifact of the proof techniques of these works, as suggested by the counterexample that involved random targets. The following result elaborates this point further: Proposition 2.1. Let F ⊂ B(X × A). Then even if the pseudo-dimension of F is finite, the fatshattering function of ∨ Fmax = VQ : VQ (·) = max Q(·, a), Q ∈ F a∈A 2 can be infinite over (0, 1/2). Without going into further details, let us just note that the finiteness of the fat-shattering function is a sufficient and necessary condition for learnability and the finiteness of the fat-shattering function is implied by the finiteness of the pseudo-dimension [9].The above proposition thus shows that without imposing further special conditions on F, the learning problem may become infeasible. One possibility is of course to discretize the action space, e.g., by using a uniform grid. However, if the action space has a really high dimensionality, this approach becomes unfeasible (even enumerating 2dA points could be impossible when dA is large). Therefore we prefer alternate solutions. Another possibility is to make the functions in F, e.g., uniformly Lipschitz in their state coordinates. ∨ Then the same property will hold for functions in Fmax and hence by a classical result we can bound the capacity of this set (cf. pp. 353–357 of [10]). One potential problem with this approach is that this way it might be difficult to get a fine control of the capacity of the resulting set. 2 The proof of this and the other results are given in the appendix, available in the extended version of this paper, downloadable from http://hal.inria.fr/inria-00185311/en/. 3 In the approach explored here we modify the fitted Q-iteration algorithm by introducing a policy set Π and a search over this set for an approximately greedy policy in a sense that will be made precise in a minute. Our algorithm thus has four parameters: F, Π, K, Q0 . Here F is as before, Π is a user-chosen set of policies (mappings from X to A), K is the number of iterations and Q0 is an initial value function (a typical choice is Q0 ≡ 0). The algorithm computes a sequence of iterates (Qk , πk ), k = 0, . . . , K, defined by the following equations: ˆ N π0 ˆ = argmax π∈Π Q0 (Xt , π(Xt )), t=1 N Qk+1 = argmin Q∈F t=1 1 Q(Xt , At ) − Rt + γQk (Xt+1 , πk (Xt+1 )) ˆ πb (At |Xt ) 2 , (3) N πk+1 ˆ = argmax π∈Π Qk+1 (Xt , π(Xt )). (4) t=1 Thus, (3) is similar to (2), while (4) defines the policy search problem. The policy search will generally be solved by a gradient procedure or some other appropriate method. The cost of this step will be primarily determined by how well-behaving the iterates Qk+1 are in their action arguments. For example, if they were quadratic and if π was linear then the problem would be a quadratic optimization problem. However, except for special cases3 the action value functions will be more complicated, in which case this step can be expensive. Still, this cost could be similar to that of searching for the maximizing actions for each t = 1, . . . , N if the approximately maximizing actions are similar across similar states. This algorithm, which we could also call a fitted actor-critic algorithm, will be shown to overcome the above mentioned complexity control problem provided that the complexity of Π is controlled appropriately. Indeed, in this case the set of possible regression problems is determined by the set ∨ FΠ = { V : V (·) = Q(·, π(·)), Q ∈ F, π ∈ Π } , ∨ and the proof will rely on controlling the complexity of FΠ by selecting F and Π appropriately. 3 3.1 The main theoretical result Outline of the analysis In order to gain some insight into the behavior of the algorithm, we provide a brief summary of its error analysis. The main result will be presented subsequently. For f ,Q ∈ F and a policy π, we define the tth TD-error as follows: dt (f ; Q, π) = Rt + γQ(Xt+1 , π(Xt+1 )) − f (Xt , At ). Further, we define the empirical loss function by 1 ˆ LN (f ; Q, π) = N N t=1 d2 (f ; Q, π) t , λ(A)πb (At |Xt ) where the normalization with λ(A) is introduced for mathematical convenience. Then (3) can be ˆ written compactly as Qk+1 = argminf ∈F LN (f ; Qk , πk ). ˆ ˆ The algorithm can then be motivated by the observation that for any f ,Q, and π, LN (f ; Q, π) is an unbiased estimate of def 2 L(f ; Q, π) = f − T π Q ν + L∗ (Q, π), (5) where the first term is the error we are interested in and the second term captures the variance of the random samples: L∗ (Q, π) = E [Var [R1 + γQ(X2 , π(X2 ))|X1 , A1 = a]] dλA (a). A 3 Linear quadratic regulation is such a nice case. It is interesting to note that in this special case the obvious choices for F and Π yield zero error in the limit, as can be proven based on the main result of this paper. 4 ˆ This result is stated formally by E LN (f ; Q, π) = L(f ; Q, π). Since the variance term in (5) is independent of f , argminf ∈F L(f ; Q, π) = 2 π argminf ∈F f − T Q ν . Thus, if πk were greedy w.r.t. Qk then argminf ∈F L(f ; Qk , πk ) = ˆ ˆ 2 argminf ∈F f − T Qk ν . Hence we can still think of the procedure as approximate value iteration over the space of action-value functions, projecting T Qk using empirical risk minimization on the space F w.r.t. · ν distances in an approximate manner. Since πk is only approximately greedy, we ˆ will have to deal with both the error coming from the approximate projection and the error coming from the choice of πk . To make this clear, we write the iteration in the form ˆ ˆ ˆ Qk+1 = T πk Qk + εk = T Qk + εk + (T πk Qk − T Qk ) = T Qk + εk , def ˆ ˆ where εk is the error committed while computing T πk Qk , εk = T πk Qk − T Qk is the error committed because the greedy policy is computed approximately and εk = εk + εk is the total error of step k. Hence, in order to show that the procedure is well behaved, one needs to show that both errors are controlled and that when the errors are propagated through these equations, the resulting error stays controlled, too. Since we are ultimately interested in the performance of the policy obtained, we will also need to show that small action-value approximation errors yield small performance losses. For these we need a number of assumptions that concern either the training data, the MDP, or the function sets used for learning. 3.2 Assumptions 3.2.1 Assumptions on the training data We shall assume that the data is rich, is in a steady state, and is fast-mixing, where, informally, mixing means that future depends weakly on the past. Assumption A2 (Sample Path Properties) Assume that {(Xt , At , Rt )}t=1,...,N is the sample path of πb , a stochastic stationary policy. Further, assume that {Xt } is strictly stationary (Xt ∼ ν ∈ M (X )) and exponentially β-mixing with the actual rate given by the parameters (β, b, κ).4 We further assume that the sampling policy πb satisfies π0 = inf (x,a)∈X ×A πb (a|x) > 0. The β-mixing property will be used to establish tail inequalities for certain empirical processes.5 Note that the mixing coefficients do not need to be known. In the case when no mixing condition is satisfied, learning might be impossible. To see this just consider the case when X1 = X2 = . . . = XN . Thus, in this case the learner has many copies of the same random variable and successful generalization is thus impossible. We believe that the assumption that the process is in a steady state is not essential for our result, as when the process reaches its steady state quickly then (at the price of a more involved proof) the result would still hold. 3.2.2 Assumptions on the MDP In order to prevent the uncontrolled growth of the errors as they are propagated through the updates, we shall need some assumptions on the MDP. A convenient assumption is the following one [11]: Assumption A3 (Uniformly stochastic transitions) For all x ∈ X and a ∈ A, assume that P (·|x, a) is absolutely continuous w.r.t. ν and the Radon-Nikodym derivative of P w.r.t. ν is bounded def < +∞. uniformly with bound Cν : Cν = supx∈X ,a∈A dP (·|x,a) dν ∞ Note that by the definition of measure differentiation, Assumption A3 means that P (·|x, a) ≤ Cν ν(·). This assumption essentially requires the transitions to be noisy. We will also prove (weaker) results under the following, weaker assumption: 4 For the definition of β-mixing, see e.g. [2]. We say “empirical process” and “empirical measure”, but note that in this work these are based on dependent (mixing) samples. 5 5 Assumption A4 (Discounted-average concentrability of future-state distributions) Given ρ, ν, m ≥ 1 and an arbitrary sequence of stationary policies {πm }m≥1 , assume that the futuredef state distribution ρP π1 P π2 . . . P πm is absolutely continuous w.r.t. ν. Assume that c(m) = π1 π2 πm def satisfies m≥1 mγ m−1 c(m) < +∞. We shall call Cρ,ν = supπ1 ,...,πm d(ρP Pdν ...P ) ∞ max (1 − γ)2 m≥1 mγ m−1 c(m), (1 − γ) m≥1 γ m c(m) the discounted-average concentrability coefficient of the future-state distributions. The number c(m) measures how much ρ can get amplified in m steps as compared to the reference distribution ν. Hence, in general we expect c(m) to grow with m. In fact, the condition that Cρ,µ is finite is a growth rate condition on c(m). Thanks to discounting, Cρ,µ is finite for a reasonably large class of systems (see the discussion in [11]). A related assumption is needed in the error analysis of the approximate greedy step of the algorithm: Assumption A5 (The random policy “makes no peak-states”) Consider the distribution µ = (ν × λA )P which is the distribution of a state that results from sampling an initial state according to ν and then executing an action which is selected uniformly at random.6 Then Γν = dµ/dν ∞ < +∞. Note that under Assumption A3 we have Γν ≤ Cν . This (very mild) assumption means that after one step, starting from ν and executing this random policy, the probability of the next state being in a set is upper bounded by Γν -times the probability of the starting state being in the same set. def Besides, we assume that A has the following regularity property: Let Py(a, h, ρ) = (a , v) ∈ RdA +1 : a − a 1 ≤ ρ, 0 ≤ v/h ≤ 1 − a − a 1 /ρ denote the pyramid with hight h and base given by the 1 def -ball B(a, ρ) = a ∈ RdA : a − a 1 ≤ρ centered at a. Assumption A6 (Regularity of the action space) We assume that there exists α > 0, such that for all a ∈ A, for all ρ > 0, λ(Py(a, 1, ρ) ∩ (A × R)) λ(A) ≥ min α, λ(Py(a, 1, ρ)) λ(B(a, ρ)) For example, if A is an 1 . -ball itself, then this assumption will be satisfied with α = 2−dA . Without assuming any smoothness of the MDP, learning in infinite MDPs looks hard (see, e.g., [12, 13]). Here we employ the following extra condition: Assumption A7 (Lipschitzness of the MDP in the actions) Assume that the transition probabilities and rewards are Lipschitz w.r.t. their action variable, i.e., there exists LP , Lr > 0 such that for all (x, a, a ) ∈ X × A × A and measurable set B of X , |P (B|x, a) − P (B|x, a )| ≤ LP a − a 1 , |r(x, a) − r(x, a )| ≤ Lr a − a 1 . Note that previously Lipschitzness w.r.t. the state variables was used, e.g., in [11] to construct consistent planning algorithms. 3.2.3 Assumptions on the function sets used by the algorithm These assumptions are less demanding since they are under the control of the user of the algorithm. However, the choice of these function sets will greatly influence the performance of the algorithm, as we shall see it from the bounds. The first assumption concerns the class F: Assumption A8 (Lipschitzness of candidate action-value functions) Assume F ⊂ B(X × A) and that any elements of F is uniformly Lipschitz in its action-argument in the sense that |Q(x, a) − Q(x, a )| ≤ LA a − a 1 holds for any x ∈ X , a,a ∈ A, and Q ∈ F . 6 Remember that λA denotes the uniform distribution over the action set A. 6 We shall also need to control the capacity of our function sets. We assume that the reader is familiar with the concept of VC-dimension.7 Here we use the pseudo-dimension of function sets that builds upon the concept of VC-dimension: Definition 3.1 (Pseudo-dimension). The pseudo-dimension VF + of F is defined as the VCdimension of the subgraphs of functions in F (hence it is also called the VC-subgraph dimension of F). Since A is multidimensional, we define VΠ+ to be the sum of the pseudo-dimensions of the coordinate projection spaces, Πk of Π: dA V Π+ = VΠ + , k=1 k Πk = { πk : X → R : π = (π1 , . . . , πk , . . . , πdA ) ∈ Π } . Now we are ready to state our assumptions on our function sets: Assumption A9 (Capacity of the function and policy sets) Assume that F ⊂ B(X × A; Qmax ) for Qmax > 0 and VF + < +∞. Also, A ⊂ [−A∞ , A∞ ]dA and VΠ+ < +∞. Besides their capacity, one shall also control the approximation power of the function sets involved. Let us first consider the policy set Π. Introduce e∗ (F, Π) = sup inf ν(EQ − E π Q). Q∈F π∈Π Note that inf π∈Π ν(EQ − E π Q) measures the quality of approximating νEQ by νE π Q. Hence, e∗ (F, Π) measures the worst-case approximation error of νEQ as Q is changed within F. This can be made small by choosing Π large. Another related quantity is the one-step Bellman-error of F w.r.t. Π. This is defined as follows: For a fixed policy π, the one-step Bellman-error of F w.r.t. T π is defined as E1 (F; π) = sup inf Q∈F Q ∈F Q − T πQ ν . Taking again a pessimistic approach, the one-step Bellman-error of F is defined as E1 (F, Π) = sup E1 (F; π). π∈Π Typically by increasing F, E1 (F, Π) can be made smaller (this is discussed at some length in [3]). However, it also holds for both Π and F that making them bigger will increase their capacity (pseudo-dimensions) which leads to an increase of the estimation errors. Hence, F and Π must be selected to balance the approximation and estimation errors, just like in supervised learning. 3.3 The main result Theorem 3.2. Let πK be a greedy policy w.r.t. QK , i.e. πK (x) ∈ argmaxa∈A QK (x, a). Then under Assumptions A1, A2, and A5–A9, for all δ > 0 we have with probability at least 1 − δ: given Assumption A3 (respectively A4), V ∗ − V πK ∞ (resp. V ∗ − V πK 1,ρ ), is bounded by    d 1+1 κ+1   A   4κ (log N + log(K/δ))  + γK , C E1 (F, Π) + e∗ (F, Π) + 1/4   N   where C depends on dA , VF + , (VΠ+ )dA , γ, κ, b, β, Cν (resp. Cρ,ν ), Γν , LA , LP ,Lr , α, λ(A), π0 , k=1 k κ+1 ˆ Qmax , Rmax , Rmax , and A∞ . In particular, C scales with V 4κ(dA +1) , where V = 2VF + + VΠ+ plays the role of the “combined effective” dimension of F and Π. 7 Readers not familiar with VC-dimension are suggested to consult a book, such as the one by Anthony and Bartlett [14]. 7 4 Discussion We have presented what we believe is the first finite-time bounds for continuous-state and actionspace RL that uses value functions. Further, this is the first analysis of fitted Q-iteration, an algorithm that has proved to be useful in a number of cases, even when used with non-averagers for which no previous theoretical analysis existed (e.g., [15, 16]). In fact, our main motivation was to show that there is a systematic way of making these algorithms work and to point at possible problem sources the same time. We discussed why it can be difficult to make these algorithms work in practice. We suggested that either the set of action-value candidates has to be carefully controlled (e.g., assuming uniform Lipschitzness w.r.t. the state variables), or a policy search step is needed, just like in actorcritic algorithms. The bound in this paper is similar in many respects to a previous bound of a Bellman-residual minimization algorithm [2]. It looks that the techniques developed here can be used to obtain results for that algorithm when it is applied to continuous action spaces. Finally, although we have not explored them here, consistency results for FQI can be obtained from our results using standard methods, like the methods of sieves. We believe that the methods developed here will eventually lead to algorithms where the function approximation methods are chosen based on the data (similar to adaptive regression methods) so as to optimize performance, which in our opinion is one of the biggest open questions in RL. Currently we are exploring this possibility. Acknowledgments Andr´ s Antos would like to acknowledge support for this project from the Hungarian Academy of Sciences a (Bolyai Fellowship). Csaba Szepesv´ ri greatly acknowledges the support received from the Alberta Ingenuity a Fund, NSERC, the Computer and Automation Research Institute of the Hungarian Academy of Sciences. References [1] A. Antos, Cs. Szepesv´ ri, and R. Munos. Learning near-optimal policies with Bellman-residual minia mization based fitted policy iteration and a single sample path. In COLT-19, pages 574–588, 2006. [2] A. Antos, Cs. Szepesv´ ri, and R. Munos. Learning near-optimal policies with Bellman-residual minia mization based fitted policy iteration and a single sample path. Machine Learning, 2007. (accepted). [3] A. Antos, Cs. Szepesv´ ri, and R. Munos. Value-iteration based fitted policy iteration: learning with a a single trajectory. In IEEE ADPRL, pages 330–337, 2007. [4] D. P. Bertsekas and S.E. Shreve. Stochastic Optimal Control (The Discrete Time Case). Academic Press, New York, 1978. [5] D. Ernst, P. Geurts, and L. Wehenkel. Tree-based batch mode reinforcement learning. Journal of Machine Learning Research, 6:503–556, 2005. [6] R.S. Sutton and A.G. Barto. Reinforcement Learning: An Introduction. Bradford Book. MIT Press, 1998. [7] N. Cristianini and J. Shawe-Taylor. An introduction to support vector machines (and other kernel-based learning methods). Cambridge University Press, 2000. [8] J.A. Boyan and A.W. Moore. Generalization in reinforcement learning: Safely approximating the value function. In NIPS-7, pages 369–376, 1995. [9] P.L. Bartlett, P.M. Long, and R.C. Williamson. Fat-shattering and the learnability of real-valued functions. Journal of Computer and System Sciences, 52:434–452, 1996. [10] A.N. Kolmogorov and V.M. Tihomirov. -entropy and -capacity of sets in functional space. American Mathematical Society Translations, 17(2):277–364, 1961. [11] R. Munos and Cs. Szepesv´ ri. Finite time bounds for sampling based fitted value iteration. Technical a report, Computer and Automation Research Institute of the Hungarian Academy of Sciences, Kende u. 13-17, Budapest 1111, Hungary, 2006. [12] A.Y. Ng and M. Jordan. PEGASUS: A policy search method for large MDPs and POMDPs. In Proceedings of the 16th Conference in Uncertainty in Artificial Intelligence, pages 406–415, 2000. [13] P.L. Bartlett and A. Tewari. Sample complexity of policy search with known dynamics. In NIPS-19. MIT Press, 2007. [14] M. Anthony and P. L. Bartlett. Neural Network Learning: Theoretical Foundations. Cambridge University Press, 1999. [15] M. Riedmiller. Neural fitted Q iteration – first experiences with a data efficient neural reinforcement learning method. In 16th European Conference on Machine Learning, pages 317–328, 2005. [16] S. Kalyanakrishnan and P. Stone. Batch reinforcement learning in a complex domain. In AAMAS-07, 2007. 8

2 0.23046651 148 nips-2007-Online Linear Regression and Its Application to Model-Based Reinforcement Learning

Author: Alexander L. Strehl, Michael L. Littman

Abstract: We provide a provably efficient algorithm for learning Markov Decision Processes (MDPs) with continuous state and action spaces in the online setting. Specifically, we take a model-based approach and show that a special type of online linear regression allows us to learn MDPs with (possibly kernalized) linearly parameterized dynamics. This result builds on Kearns and Singh’s work that provides a provably efficient algorithm for finite state MDPs. Our approach is not restricted to the linear setting, and is applicable to other classes of continuous MDPs.

3 0.21696736 168 nips-2007-Reinforcement Learning in Continuous Action Spaces through Sequential Monte Carlo Methods

Author: Alessandro Lazaric, Marcello Restelli, Andrea Bonarini

Abstract: Learning in real-world domains often requires to deal with continuous state and action spaces. Although many solutions have been proposed to apply Reinforcement Learning algorithms to continuous state problems, the same techniques can be hardly extended to continuous action spaces, where, besides the computation of a good approximation of the value function, a fast method for the identification of the highest-valued action is needed. In this paper, we propose a novel actor-critic approach in which the policy of the actor is estimated through sequential Monte Carlo methods. The importance sampling step is performed on the basis of the values learned by the critic, while the resampling step modifies the actor’s policy. The proposed approach has been empirically compared to other learning algorithms into several domains; in this paper, we report results obtained in a control problem consisting of steering a boat across a river. 1

4 0.20622902 5 nips-2007-A Game-Theoretic Approach to Apprenticeship Learning

Author: Umar Syed, Robert E. Schapire

Abstract: We study the problem of an apprentice learning to behave in an environment with an unknown reward function by observing the behavior of an expert. We follow on the work of Abbeel and Ng [1] who considered a framework in which the true reward function is assumed to be a linear combination of a set of known and observable features. We give a new algorithm that, like theirs, is guaranteed to learn a policy that is nearly as good as the expert’s, given enough examples. However, unlike their algorithm, we show that ours may produce a policy that is substantially better than the expert’s. Moreover, our algorithm is computationally faster, is easier to implement, and can be applied even in the absence of an expert. The method is based on a game-theoretic view of the problem, which leads naturally to a direct application of the multiplicative-weights algorithm of Freund and Schapire [2] for playing repeated matrix games. In addition to our formal presentation and analysis of the new algorithm, we sketch how the method can be applied when the transition function itself is unknown, and we provide an experimental demonstration of the algorithm on a toy video-game environment. 1

5 0.19362479 102 nips-2007-Incremental Natural Actor-Critic Algorithms

Author: Shalabh Bhatnagar, Mohammad Ghavamzadeh, Mark Lee, Richard S. Sutton

Abstract: We present four new reinforcement learning algorithms based on actor-critic and natural-gradient ideas, and provide their convergence proofs. Actor-critic reinforcement learning methods are online approximations to policy iteration in which the value-function parameters are estimated using temporal difference learning and the policy parameters are updated by stochastic gradient descent. Methods based on policy gradients in this way are of special interest because of their compatibility with function approximation methods, which are needed to handle large or infinite state spaces. The use of temporal difference learning in this way is of interest because in many applications it dramatically reduces the variance of the gradient estimates. The use of the natural gradient is of interest because it can produce better conditioned parameterizations and has been shown to further reduce variance in some cases. Our results extend prior two-timescale convergence results for actor-critic methods by Konda and Tsitsiklis by using temporal difference learning in the actor and by incorporating natural gradients, and they extend prior empirical studies of natural actor-critic methods by Peters, Vijayakumar and Schaal by providing the first convergence proofs and the first fully incremental algorithms. 1

6 0.18890733 162 nips-2007-Random Sampling of States in Dynamic Programming

7 0.17545953 34 nips-2007-Bayesian Policy Learning with Trans-Dimensional MCMC

8 0.14434764 21 nips-2007-Adaptive Online Gradient Descent

9 0.14004034 124 nips-2007-Managing Power Consumption and Performance of Computing Systems Using Reinforcement Learning

10 0.1325777 151 nips-2007-Optimistic Linear Programming gives Logarithmic Regret for Irreducible MDPs

11 0.12536715 146 nips-2007-On higher-order perceptron algorithms

12 0.12485795 185 nips-2007-Stable Dual Dynamic Programming

13 0.1213284 17 nips-2007-A neural network implementing optimal state estimation based on dynamic spike train decoding

14 0.11634382 215 nips-2007-What makes some POMDP problems easy to approximate?

15 0.11060454 204 nips-2007-Theoretical Analysis of Heuristic Search Methods for Online POMDPs

16 0.097021222 98 nips-2007-Hierarchical Apprenticeship Learning with Application to Quadruped Locomotion

17 0.091028407 213 nips-2007-Variational Inference for Diffusion Processes

18 0.086652867 163 nips-2007-Receding Horizon Differential Dynamic Programming

19 0.081712119 192 nips-2007-Testing for Homogeneity with Kernel Fisher Discriminant Analysis

20 0.07863424 176 nips-2007-Sequential Hypothesis Testing under Stochastic Deadlines


similar papers computed by lsi model

lsi for this paper:

topicId topicWeight

[(0, -0.234), (1, -0.359), (2, 0.091), (3, -0.025), (4, -0.146), (5, 0.057), (6, 0.017), (7, -0.086), (8, -0.046), (9, 0.044), (10, 0.092), (11, 0.05), (12, 0.038), (13, 0.086), (14, -0.04), (15, 0.064), (16, -0.039), (17, 0.02), (18, 0.012), (19, -0.03), (20, 0.044), (21, -0.032), (22, -0.043), (23, -0.034), (24, 0.009), (25, -0.076), (26, 0.138), (27, -0.079), (28, -0.043), (29, 0.07), (30, -0.091), (31, 0.113), (32, -0.039), (33, 0.07), (34, -0.061), (35, -0.055), (36, -0.022), (37, 0.021), (38, -0.018), (39, -0.037), (40, 0.061), (41, -0.012), (42, 0.031), (43, 0.039), (44, 0.034), (45, 0.036), (46, -0.069), (47, 0.004), (48, -0.047), (49, -0.002)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.96036464 91 nips-2007-Fitted Q-iteration in continuous action-space MDPs

Author: András Antos, Csaba Szepesvári, Rémi Munos

Abstract: We consider continuous state, continuous action batch reinforcement learning where the goal is to learn a good policy from a sufficiently rich trajectory generated by some policy. We study a variant of fitted Q-iteration, where the greedy action selection is replaced by searching for a policy in a restricted set of candidate policies by maximizing the average action values. We provide a rigorous analysis of this algorithm, proving what we believe is the first finite-time bound for value-function based algorithms for continuous state and action problems. 1 Preliminaries We will build on the results from [1, 2, 3] and for this reason we use the same notation as these papers. The unattributed results cited in this section can be found in the book [4]. A discounted MDP is defined by a quintuple (X , A, P, S, γ), where X is the (possible infinite) state space, A is the set of actions, P : X × A → M (X ) is the transition probability kernel with P (·|x, a) defining the next-state distribution upon taking action a from state x, S(·|x, a) gives the corresponding distribution of immediate rewards, and γ ∈ (0, 1) is the discount factor. Here X is a measurable space and M (X ) denotes the set of all probability measures over X . The Lebesguemeasure shall be denoted by λ. We start with the following mild assumption on the MDP: Assumption A1 (MDP Regularity) X is a compact subset of the dX -dimensional Euclidean space, ˆ A is a compact subset of [−A∞ , A∞ ]dA . The random immediate rewards are bounded by Rmax and that the expected immediate reward function, r(x, a) = rS(dr|x, a), is uniformly bounded by Rmax : r ∞ ≤ Rmax . A policy determines the next action given the past observations. Here we shall deal with stationary (Markovian) policies which choose an action in a stochastic way based on the last observation only. The value of a policy π when it is started from a state x is defined as the total expected discounted ∞ reward that is encountered while the policy is executed: V π (x) = Eπ [ t=0 γ t Rt |X0 = x]. Here Rt ∼ S(·|Xt , At ) is the reward received at time step t, the state, Xt , evolves according to Xt+1 ∼ ∗ Also with: Computer and Automation Research Inst. of the Hungarian Academy of Sciences Kende u. 13-17, Budapest 1111, Hungary. 1 P (·|Xt , At ), where At is sampled from the distribution determined by π. We use Qπ : X × A → R ∞ to denote the action-value function of policy π: Qπ (x, a) = Eπ [ t=0 γ t Rt |X0 = x, A0 = a]. The goal is to find a policy that attains the best possible values, V ∗ (x) = supπ V π (x), at all states ∗ x ∈ X . Here V ∗ is called the optimal value function and a policy π ∗ that satisfies V π (x) = ∗ ∗ ∗ V (x) for all x ∈ X is called optimal. The optimal action-value function Q (x, a) is Q (x, a) = supπ Qπ (x, a). We say that a (deterministic stationary) policy π is greedy w.r.t. an action-value function Q ∈ B(X × A), and we write π = π (·; Q), if, for all x ∈ X , π(x) ∈ argmaxa∈A Q(x, a). ˆ Under mild technical assumptions, such a greedy policy always exists. Any greedy policy w.r.t. Q∗ is optimal. For π : X → A we define its evaluation operator, T π : B(X × A) → B(X × A), by (T π Q)(x, a) = r(x, a) + γ X Q(y, π(y)) P (dy|x, a). It is known that Qπ = T π Qπ . Further, if we let the Bellman operator, T : B(X × A) → B(X × A), defined by (T Q)(x, a) = r(x, a) + γ X supb∈A Q(y, b) P (dy|x, a) then Q∗ = T Q∗ . It is known that V π and Qπ are bounded by Rmax /(1 − γ), just like Q∗ and V ∗ . For π : X → A, the operator E π : B(X × A) → B(X ) is defined by (E π Q)(x) = Q(x, π(x)), while E : B(X × A) → B(X ) is defined by (EQ)(x) = supa∈A Q(x, a). Throughout the paper F ⊂ {f : X × A → R} will denote a subset of real-valued functions over the state-action space X × A and Π ⊂ AX will be a set of policies. For ν ∈ M (X ) and f : X → R p measurable, we let (for p ≥ 1) f p,ν = X |f (x)|p ν(dx). We simply write f ν for f 2,ν . 2 Further, we extend · ν to F by f ν = A X |f |2 (x, a) dν(x) dλA (a), where λA is the uniform distribution over A. We shall use the shorthand notation νf to denote the integral f (x)ν(dx). We denote the space of bounded measurable functions with domain X by B(X ). Further, the space of measurable functions bounded by 0 < K < ∞ shall be denoted by B(X ; K). We let · ∞ denote the supremum norm. 2 Fitted Q-iteration with approximate policy maximization We assume that we are given a finite trajectory, {(Xt , At , Rt )}1≤t≤N , generated by some stochastic stationary policy πb , called the behavior policy: At ∼ πb (·|Xt ), Xt+1 ∼ P (·|Xt , At ), Rt ∼ def S(·|Xt , At ), where πb (·|x) is a density with π0 = inf (x,a)∈X ×A πb (a|x) > 0. The generic recipe for fitted Q-iteration (FQI) [5] is Qk+1 = Regress(Dk (Qk )), (1) where Regress is an appropriate regression procedure and Dk (Qk ) is a dataset defining a regression problem in the form of a list of data-point pairs: Dk (Qk ) = (Xt , At ), Rt + γ max Qk (Xt+1 , b) b∈A 1≤t≤N .1 Fitted Q-iteration can be viewed as approximate value iteration applied to action-value functions. To see this note that value iteration would assign the value (T Qk )(x, a) = r(x, a) + γ maxb∈A Qk (y, b) P (dy|x, a) to Qk+1 (x, a) [6]. Now, remember that the regression function for the jointly distributed random variables (Z, Y ) is defined by the conditional expectation of Y given Z: m(Z) = E [Y |Z]. Since for any fixed function Q, E [Rt + γ maxb∈A Q(Xt+1 , b)|Xt , At ] = (T Q)(Xt , At ), the regression function corresponding to the data Dk (Q) is indeed T Q and hence if FQI solved the regression problem defined by Qk exactly, it would simulate value iteration exactly. However, this argument itself does not directly lead to a rigorous analysis of FQI: Since Qk is obtained based on the data, it is itself a random function. Hence, after the first iteration, the “target” function in FQI becomes random. Furthermore, this function depends on the same data that is used to define the regression problem. Will FQI still work despite these issues? To illustrate the potential difficulties consider a dataset where X1 , . . . , XN is a sequence of independent random variables, which are all distributed uniformly at random in [0, 1]. Further, let M be a random integer greater than N which is independent of the dataset (Xt )N . Let U be another random variable, uniformly t=1 distributed in [0, 1]. Now define the regression problem by Yt = fM,U (Xt ), where fM,U (x) = sgn(sin(2M 2π(x + U ))). Then it is not hard to see that no matter how big N is, no procedure can 1 Since the designer controls Qk , we may assume that it is continuous, hence the maximum exists. 2 estimate the regression function fM,U with a small error (in expectation, or with high probability), even if the procedure could exploit the knowledge of the specific form of fM,U . On the other hand, if we restricted M to a finite range then the estimation problem could be solved successfully. The example shows that if the complexity of the random functions defining the regression problem is uncontrolled then successful estimation might be impossible. Amongst the many regression methods in this paper we have chosen to work with least-squares methods. In this case Equation (1) takes the form N Qk+1 = argmin Q∈F t=1 1 πb (At |Xt ) 2 Q(Xt , At ) − Rt + γ max Qk (Xt+1 , b) b∈A . (2) We call this method the least-squares fitted Q-iteration (LSFQI) method. Here we introduced the weighting 1/πb (At |Xt ) since we do not want to give more weight to those actions that are preferred by the behavior policy. Besides this weighting, the only parameter of the method is the function set F. This function set should be chosen carefully, to keep a balance between the representation power and the number of samples. As a specific example for F consider neural networks with some fixed architecture. In this case the function set is generated by assigning weights in all possible ways to the neural net. Then the above minimization becomes the problem of tuning the weights. Another example is to use linearly parameterized function approximation methods with appropriately selected basis functions. In this case the weight tuning problem would be less demanding. Yet another possibility is to let F be an appropriate restriction of a Reproducing Kernel Hilbert Space (e.g., in a ball). In this case the training procedure becomes similar to LS-SVM training [7]. As indicated above, the analysis of this algorithm is complicated by the fact that the new dataset is defined in terms of the previous iterate, which is already a function of the dataset. Another complication is that the samples in a trajectory are in general correlated and that the bias introduced by the imperfections of the approximation architecture may yield to an explosion of the error of the procedure, as documented in a number of cases in, e.g., [8]. Nevertheless, at least for finite action sets, the tools developed in [1, 3, 2] look suitable to show that under appropriate conditions these problems can be overcome if the function set is chosen in a judicious way. However, the results of these works would become essentially useless in the case of an infinite number of actions since these previous bounds grow to infinity with the number of actions. Actually, we believe that this is not an artifact of the proof techniques of these works, as suggested by the counterexample that involved random targets. The following result elaborates this point further: Proposition 2.1. Let F ⊂ B(X × A). Then even if the pseudo-dimension of F is finite, the fatshattering function of ∨ Fmax = VQ : VQ (·) = max Q(·, a), Q ∈ F a∈A 2 can be infinite over (0, 1/2). Without going into further details, let us just note that the finiteness of the fat-shattering function is a sufficient and necessary condition for learnability and the finiteness of the fat-shattering function is implied by the finiteness of the pseudo-dimension [9].The above proposition thus shows that without imposing further special conditions on F, the learning problem may become infeasible. One possibility is of course to discretize the action space, e.g., by using a uniform grid. However, if the action space has a really high dimensionality, this approach becomes unfeasible (even enumerating 2dA points could be impossible when dA is large). Therefore we prefer alternate solutions. Another possibility is to make the functions in F, e.g., uniformly Lipschitz in their state coordinates. ∨ Then the same property will hold for functions in Fmax and hence by a classical result we can bound the capacity of this set (cf. pp. 353–357 of [10]). One potential problem with this approach is that this way it might be difficult to get a fine control of the capacity of the resulting set. 2 The proof of this and the other results are given in the appendix, available in the extended version of this paper, downloadable from http://hal.inria.fr/inria-00185311/en/. 3 In the approach explored here we modify the fitted Q-iteration algorithm by introducing a policy set Π and a search over this set for an approximately greedy policy in a sense that will be made precise in a minute. Our algorithm thus has four parameters: F, Π, K, Q0 . Here F is as before, Π is a user-chosen set of policies (mappings from X to A), K is the number of iterations and Q0 is an initial value function (a typical choice is Q0 ≡ 0). The algorithm computes a sequence of iterates (Qk , πk ), k = 0, . . . , K, defined by the following equations: ˆ N π0 ˆ = argmax π∈Π Q0 (Xt , π(Xt )), t=1 N Qk+1 = argmin Q∈F t=1 1 Q(Xt , At ) − Rt + γQk (Xt+1 , πk (Xt+1 )) ˆ πb (At |Xt ) 2 , (3) N πk+1 ˆ = argmax π∈Π Qk+1 (Xt , π(Xt )). (4) t=1 Thus, (3) is similar to (2), while (4) defines the policy search problem. The policy search will generally be solved by a gradient procedure or some other appropriate method. The cost of this step will be primarily determined by how well-behaving the iterates Qk+1 are in their action arguments. For example, if they were quadratic and if π was linear then the problem would be a quadratic optimization problem. However, except for special cases3 the action value functions will be more complicated, in which case this step can be expensive. Still, this cost could be similar to that of searching for the maximizing actions for each t = 1, . . . , N if the approximately maximizing actions are similar across similar states. This algorithm, which we could also call a fitted actor-critic algorithm, will be shown to overcome the above mentioned complexity control problem provided that the complexity of Π is controlled appropriately. Indeed, in this case the set of possible regression problems is determined by the set ∨ FΠ = { V : V (·) = Q(·, π(·)), Q ∈ F, π ∈ Π } , ∨ and the proof will rely on controlling the complexity of FΠ by selecting F and Π appropriately. 3 3.1 The main theoretical result Outline of the analysis In order to gain some insight into the behavior of the algorithm, we provide a brief summary of its error analysis. The main result will be presented subsequently. For f ,Q ∈ F and a policy π, we define the tth TD-error as follows: dt (f ; Q, π) = Rt + γQ(Xt+1 , π(Xt+1 )) − f (Xt , At ). Further, we define the empirical loss function by 1 ˆ LN (f ; Q, π) = N N t=1 d2 (f ; Q, π) t , λ(A)πb (At |Xt ) where the normalization with λ(A) is introduced for mathematical convenience. Then (3) can be ˆ written compactly as Qk+1 = argminf ∈F LN (f ; Qk , πk ). ˆ ˆ The algorithm can then be motivated by the observation that for any f ,Q, and π, LN (f ; Q, π) is an unbiased estimate of def 2 L(f ; Q, π) = f − T π Q ν + L∗ (Q, π), (5) where the first term is the error we are interested in and the second term captures the variance of the random samples: L∗ (Q, π) = E [Var [R1 + γQ(X2 , π(X2 ))|X1 , A1 = a]] dλA (a). A 3 Linear quadratic regulation is such a nice case. It is interesting to note that in this special case the obvious choices for F and Π yield zero error in the limit, as can be proven based on the main result of this paper. 4 ˆ This result is stated formally by E LN (f ; Q, π) = L(f ; Q, π). Since the variance term in (5) is independent of f , argminf ∈F L(f ; Q, π) = 2 π argminf ∈F f − T Q ν . Thus, if πk were greedy w.r.t. Qk then argminf ∈F L(f ; Qk , πk ) = ˆ ˆ 2 argminf ∈F f − T Qk ν . Hence we can still think of the procedure as approximate value iteration over the space of action-value functions, projecting T Qk using empirical risk minimization on the space F w.r.t. · ν distances in an approximate manner. Since πk is only approximately greedy, we ˆ will have to deal with both the error coming from the approximate projection and the error coming from the choice of πk . To make this clear, we write the iteration in the form ˆ ˆ ˆ Qk+1 = T πk Qk + εk = T Qk + εk + (T πk Qk − T Qk ) = T Qk + εk , def ˆ ˆ where εk is the error committed while computing T πk Qk , εk = T πk Qk − T Qk is the error committed because the greedy policy is computed approximately and εk = εk + εk is the total error of step k. Hence, in order to show that the procedure is well behaved, one needs to show that both errors are controlled and that when the errors are propagated through these equations, the resulting error stays controlled, too. Since we are ultimately interested in the performance of the policy obtained, we will also need to show that small action-value approximation errors yield small performance losses. For these we need a number of assumptions that concern either the training data, the MDP, or the function sets used for learning. 3.2 Assumptions 3.2.1 Assumptions on the training data We shall assume that the data is rich, is in a steady state, and is fast-mixing, where, informally, mixing means that future depends weakly on the past. Assumption A2 (Sample Path Properties) Assume that {(Xt , At , Rt )}t=1,...,N is the sample path of πb , a stochastic stationary policy. Further, assume that {Xt } is strictly stationary (Xt ∼ ν ∈ M (X )) and exponentially β-mixing with the actual rate given by the parameters (β, b, κ).4 We further assume that the sampling policy πb satisfies π0 = inf (x,a)∈X ×A πb (a|x) > 0. The β-mixing property will be used to establish tail inequalities for certain empirical processes.5 Note that the mixing coefficients do not need to be known. In the case when no mixing condition is satisfied, learning might be impossible. To see this just consider the case when X1 = X2 = . . . = XN . Thus, in this case the learner has many copies of the same random variable and successful generalization is thus impossible. We believe that the assumption that the process is in a steady state is not essential for our result, as when the process reaches its steady state quickly then (at the price of a more involved proof) the result would still hold. 3.2.2 Assumptions on the MDP In order to prevent the uncontrolled growth of the errors as they are propagated through the updates, we shall need some assumptions on the MDP. A convenient assumption is the following one [11]: Assumption A3 (Uniformly stochastic transitions) For all x ∈ X and a ∈ A, assume that P (·|x, a) is absolutely continuous w.r.t. ν and the Radon-Nikodym derivative of P w.r.t. ν is bounded def < +∞. uniformly with bound Cν : Cν = supx∈X ,a∈A dP (·|x,a) dν ∞ Note that by the definition of measure differentiation, Assumption A3 means that P (·|x, a) ≤ Cν ν(·). This assumption essentially requires the transitions to be noisy. We will also prove (weaker) results under the following, weaker assumption: 4 For the definition of β-mixing, see e.g. [2]. We say “empirical process” and “empirical measure”, but note that in this work these are based on dependent (mixing) samples. 5 5 Assumption A4 (Discounted-average concentrability of future-state distributions) Given ρ, ν, m ≥ 1 and an arbitrary sequence of stationary policies {πm }m≥1 , assume that the futuredef state distribution ρP π1 P π2 . . . P πm is absolutely continuous w.r.t. ν. Assume that c(m) = π1 π2 πm def satisfies m≥1 mγ m−1 c(m) < +∞. We shall call Cρ,ν = supπ1 ,...,πm d(ρP Pdν ...P ) ∞ max (1 − γ)2 m≥1 mγ m−1 c(m), (1 − γ) m≥1 γ m c(m) the discounted-average concentrability coefficient of the future-state distributions. The number c(m) measures how much ρ can get amplified in m steps as compared to the reference distribution ν. Hence, in general we expect c(m) to grow with m. In fact, the condition that Cρ,µ is finite is a growth rate condition on c(m). Thanks to discounting, Cρ,µ is finite for a reasonably large class of systems (see the discussion in [11]). A related assumption is needed in the error analysis of the approximate greedy step of the algorithm: Assumption A5 (The random policy “makes no peak-states”) Consider the distribution µ = (ν × λA )P which is the distribution of a state that results from sampling an initial state according to ν and then executing an action which is selected uniformly at random.6 Then Γν = dµ/dν ∞ < +∞. Note that under Assumption A3 we have Γν ≤ Cν . This (very mild) assumption means that after one step, starting from ν and executing this random policy, the probability of the next state being in a set is upper bounded by Γν -times the probability of the starting state being in the same set. def Besides, we assume that A has the following regularity property: Let Py(a, h, ρ) = (a , v) ∈ RdA +1 : a − a 1 ≤ ρ, 0 ≤ v/h ≤ 1 − a − a 1 /ρ denote the pyramid with hight h and base given by the 1 def -ball B(a, ρ) = a ∈ RdA : a − a 1 ≤ρ centered at a. Assumption A6 (Regularity of the action space) We assume that there exists α > 0, such that for all a ∈ A, for all ρ > 0, λ(Py(a, 1, ρ) ∩ (A × R)) λ(A) ≥ min α, λ(Py(a, 1, ρ)) λ(B(a, ρ)) For example, if A is an 1 . -ball itself, then this assumption will be satisfied with α = 2−dA . Without assuming any smoothness of the MDP, learning in infinite MDPs looks hard (see, e.g., [12, 13]). Here we employ the following extra condition: Assumption A7 (Lipschitzness of the MDP in the actions) Assume that the transition probabilities and rewards are Lipschitz w.r.t. their action variable, i.e., there exists LP , Lr > 0 such that for all (x, a, a ) ∈ X × A × A and measurable set B of X , |P (B|x, a) − P (B|x, a )| ≤ LP a − a 1 , |r(x, a) − r(x, a )| ≤ Lr a − a 1 . Note that previously Lipschitzness w.r.t. the state variables was used, e.g., in [11] to construct consistent planning algorithms. 3.2.3 Assumptions on the function sets used by the algorithm These assumptions are less demanding since they are under the control of the user of the algorithm. However, the choice of these function sets will greatly influence the performance of the algorithm, as we shall see it from the bounds. The first assumption concerns the class F: Assumption A8 (Lipschitzness of candidate action-value functions) Assume F ⊂ B(X × A) and that any elements of F is uniformly Lipschitz in its action-argument in the sense that |Q(x, a) − Q(x, a )| ≤ LA a − a 1 holds for any x ∈ X , a,a ∈ A, and Q ∈ F . 6 Remember that λA denotes the uniform distribution over the action set A. 6 We shall also need to control the capacity of our function sets. We assume that the reader is familiar with the concept of VC-dimension.7 Here we use the pseudo-dimension of function sets that builds upon the concept of VC-dimension: Definition 3.1 (Pseudo-dimension). The pseudo-dimension VF + of F is defined as the VCdimension of the subgraphs of functions in F (hence it is also called the VC-subgraph dimension of F). Since A is multidimensional, we define VΠ+ to be the sum of the pseudo-dimensions of the coordinate projection spaces, Πk of Π: dA V Π+ = VΠ + , k=1 k Πk = { πk : X → R : π = (π1 , . . . , πk , . . . , πdA ) ∈ Π } . Now we are ready to state our assumptions on our function sets: Assumption A9 (Capacity of the function and policy sets) Assume that F ⊂ B(X × A; Qmax ) for Qmax > 0 and VF + < +∞. Also, A ⊂ [−A∞ , A∞ ]dA and VΠ+ < +∞. Besides their capacity, one shall also control the approximation power of the function sets involved. Let us first consider the policy set Π. Introduce e∗ (F, Π) = sup inf ν(EQ − E π Q). Q∈F π∈Π Note that inf π∈Π ν(EQ − E π Q) measures the quality of approximating νEQ by νE π Q. Hence, e∗ (F, Π) measures the worst-case approximation error of νEQ as Q is changed within F. This can be made small by choosing Π large. Another related quantity is the one-step Bellman-error of F w.r.t. Π. This is defined as follows: For a fixed policy π, the one-step Bellman-error of F w.r.t. T π is defined as E1 (F; π) = sup inf Q∈F Q ∈F Q − T πQ ν . Taking again a pessimistic approach, the one-step Bellman-error of F is defined as E1 (F, Π) = sup E1 (F; π). π∈Π Typically by increasing F, E1 (F, Π) can be made smaller (this is discussed at some length in [3]). However, it also holds for both Π and F that making them bigger will increase their capacity (pseudo-dimensions) which leads to an increase of the estimation errors. Hence, F and Π must be selected to balance the approximation and estimation errors, just like in supervised learning. 3.3 The main result Theorem 3.2. Let πK be a greedy policy w.r.t. QK , i.e. πK (x) ∈ argmaxa∈A QK (x, a). Then under Assumptions A1, A2, and A5–A9, for all δ > 0 we have with probability at least 1 − δ: given Assumption A3 (respectively A4), V ∗ − V πK ∞ (resp. V ∗ − V πK 1,ρ ), is bounded by    d 1+1 κ+1   A   4κ (log N + log(K/δ))  + γK , C E1 (F, Π) + e∗ (F, Π) + 1/4   N   where C depends on dA , VF + , (VΠ+ )dA , γ, κ, b, β, Cν (resp. Cρ,ν ), Γν , LA , LP ,Lr , α, λ(A), π0 , k=1 k κ+1 ˆ Qmax , Rmax , Rmax , and A∞ . In particular, C scales with V 4κ(dA +1) , where V = 2VF + + VΠ+ plays the role of the “combined effective” dimension of F and Π. 7 Readers not familiar with VC-dimension are suggested to consult a book, such as the one by Anthony and Bartlett [14]. 7 4 Discussion We have presented what we believe is the first finite-time bounds for continuous-state and actionspace RL that uses value functions. Further, this is the first analysis of fitted Q-iteration, an algorithm that has proved to be useful in a number of cases, even when used with non-averagers for which no previous theoretical analysis existed (e.g., [15, 16]). In fact, our main motivation was to show that there is a systematic way of making these algorithms work and to point at possible problem sources the same time. We discussed why it can be difficult to make these algorithms work in practice. We suggested that either the set of action-value candidates has to be carefully controlled (e.g., assuming uniform Lipschitzness w.r.t. the state variables), or a policy search step is needed, just like in actorcritic algorithms. The bound in this paper is similar in many respects to a previous bound of a Bellman-residual minimization algorithm [2]. It looks that the techniques developed here can be used to obtain results for that algorithm when it is applied to continuous action spaces. Finally, although we have not explored them here, consistency results for FQI can be obtained from our results using standard methods, like the methods of sieves. We believe that the methods developed here will eventually lead to algorithms where the function approximation methods are chosen based on the data (similar to adaptive regression methods) so as to optimize performance, which in our opinion is one of the biggest open questions in RL. Currently we are exploring this possibility. Acknowledgments Andr´ s Antos would like to acknowledge support for this project from the Hungarian Academy of Sciences a (Bolyai Fellowship). Csaba Szepesv´ ri greatly acknowledges the support received from the Alberta Ingenuity a Fund, NSERC, the Computer and Automation Research Institute of the Hungarian Academy of Sciences. References [1] A. Antos, Cs. Szepesv´ ri, and R. Munos. Learning near-optimal policies with Bellman-residual minia mization based fitted policy iteration and a single sample path. In COLT-19, pages 574–588, 2006. [2] A. Antos, Cs. Szepesv´ ri, and R. Munos. Learning near-optimal policies with Bellman-residual minia mization based fitted policy iteration and a single sample path. Machine Learning, 2007. (accepted). [3] A. Antos, Cs. Szepesv´ ri, and R. Munos. Value-iteration based fitted policy iteration: learning with a a single trajectory. In IEEE ADPRL, pages 330–337, 2007. [4] D. P. Bertsekas and S.E. Shreve. Stochastic Optimal Control (The Discrete Time Case). Academic Press, New York, 1978. [5] D. Ernst, P. Geurts, and L. Wehenkel. Tree-based batch mode reinforcement learning. Journal of Machine Learning Research, 6:503–556, 2005. [6] R.S. Sutton and A.G. Barto. Reinforcement Learning: An Introduction. Bradford Book. MIT Press, 1998. [7] N. Cristianini and J. Shawe-Taylor. An introduction to support vector machines (and other kernel-based learning methods). Cambridge University Press, 2000. [8] J.A. Boyan and A.W. Moore. Generalization in reinforcement learning: Safely approximating the value function. In NIPS-7, pages 369–376, 1995. [9] P.L. Bartlett, P.M. Long, and R.C. Williamson. Fat-shattering and the learnability of real-valued functions. Journal of Computer and System Sciences, 52:434–452, 1996. [10] A.N. Kolmogorov and V.M. Tihomirov. -entropy and -capacity of sets in functional space. American Mathematical Society Translations, 17(2):277–364, 1961. [11] R. Munos and Cs. Szepesv´ ri. Finite time bounds for sampling based fitted value iteration. Technical a report, Computer and Automation Research Institute of the Hungarian Academy of Sciences, Kende u. 13-17, Budapest 1111, Hungary, 2006. [12] A.Y. Ng and M. Jordan. PEGASUS: A policy search method for large MDPs and POMDPs. In Proceedings of the 16th Conference in Uncertainty in Artificial Intelligence, pages 406–415, 2000. [13] P.L. Bartlett and A. Tewari. Sample complexity of policy search with known dynamics. In NIPS-19. MIT Press, 2007. [14] M. Anthony and P. L. Bartlett. Neural Network Learning: Theoretical Foundations. Cambridge University Press, 1999. [15] M. Riedmiller. Neural fitted Q iteration – first experiences with a data efficient neural reinforcement learning method. In 16th European Conference on Machine Learning, pages 317–328, 2005. [16] S. Kalyanakrishnan and P. Stone. Batch reinforcement learning in a complex domain. In AAMAS-07, 2007. 8

2 0.81532359 148 nips-2007-Online Linear Regression and Its Application to Model-Based Reinforcement Learning

Author: Alexander L. Strehl, Michael L. Littman

Abstract: We provide a provably efficient algorithm for learning Markov Decision Processes (MDPs) with continuous state and action spaces in the online setting. Specifically, we take a model-based approach and show that a special type of online linear regression allows us to learn MDPs with (possibly kernalized) linearly parameterized dynamics. This result builds on Kearns and Singh’s work that provides a provably efficient algorithm for finite state MDPs. Our approach is not restricted to the linear setting, and is applicable to other classes of continuous MDPs.

3 0.73586446 124 nips-2007-Managing Power Consumption and Performance of Computing Systems Using Reinforcement Learning

Author: Gerald Tesauro, Rajarshi Das, Hoi Chan, Jeffrey Kephart, David Levine, Freeman Rawson, Charles Lefurgy

Abstract: Electrical power management in large-scale IT systems such as commercial datacenters is an application area of rapidly growing interest from both an economic and ecological perspective, with billions of dollars and millions of metric tons of CO2 emissions at stake annually. Businesses want to save power without sacrificing performance. This paper presents a reinforcement learning approach to simultaneous online management of both performance and power consumption. We apply RL in a realistic laboratory testbed using a Blade cluster and dynamically varying HTTP workload running on a commercial web applications middleware platform. We embed a CPU frequency controller in the Blade servers’ firmware, and we train policies for this controller using a multi-criteria reward signal depending on both application performance and CPU power consumption. Our testbed scenario posed a number of challenges to successful use of RL, including multiple disparate reward functions, limited decision sampling rates, and pathologies arising when using multiple sensor readings as state variables. We describe innovative practical solutions to these challenges, and demonstrate clear performance improvements over both hand-designed policies as well as obvious “cookbook” RL implementations. 1

4 0.72461426 5 nips-2007-A Game-Theoretic Approach to Apprenticeship Learning

Author: Umar Syed, Robert E. Schapire

Abstract: We study the problem of an apprentice learning to behave in an environment with an unknown reward function by observing the behavior of an expert. We follow on the work of Abbeel and Ng [1] who considered a framework in which the true reward function is assumed to be a linear combination of a set of known and observable features. We give a new algorithm that, like theirs, is guaranteed to learn a policy that is nearly as good as the expert’s, given enough examples. However, unlike their algorithm, we show that ours may produce a policy that is substantially better than the expert’s. Moreover, our algorithm is computationally faster, is easier to implement, and can be applied even in the absence of an expert. The method is based on a game-theoretic view of the problem, which leads naturally to a direct application of the multiplicative-weights algorithm of Freund and Schapire [2] for playing repeated matrix games. In addition to our formal presentation and analysis of the new algorithm, we sketch how the method can be applied when the transition function itself is unknown, and we provide an experimental demonstration of the algorithm on a toy video-game environment. 1

5 0.68856847 168 nips-2007-Reinforcement Learning in Continuous Action Spaces through Sequential Monte Carlo Methods

Author: Alessandro Lazaric, Marcello Restelli, Andrea Bonarini

Abstract: Learning in real-world domains often requires to deal with continuous state and action spaces. Although many solutions have been proposed to apply Reinforcement Learning algorithms to continuous state problems, the same techniques can be hardly extended to continuous action spaces, where, besides the computation of a good approximation of the value function, a fast method for the identification of the highest-valued action is needed. In this paper, we propose a novel actor-critic approach in which the policy of the actor is estimated through sequential Monte Carlo methods. The importance sampling step is performed on the basis of the values learned by the critic, while the resampling step modifies the actor’s policy. The proposed approach has been empirically compared to other learning algorithms into several domains; in this paper, we report results obtained in a control problem consisting of steering a boat across a river. 1

6 0.6073451 185 nips-2007-Stable Dual Dynamic Programming

7 0.59846145 52 nips-2007-Competition Adds Complexity

8 0.59374082 34 nips-2007-Bayesian Policy Learning with Trans-Dimensional MCMC

9 0.58784956 98 nips-2007-Hierarchical Apprenticeship Learning with Application to Quadruped Locomotion

10 0.57961529 162 nips-2007-Random Sampling of States in Dynamic Programming

11 0.54164696 102 nips-2007-Incremental Natural Actor-Critic Algorithms

12 0.5231244 146 nips-2007-On higher-order perceptron algorithms

13 0.49201748 151 nips-2007-Optimistic Linear Programming gives Logarithmic Regret for Irreducible MDPs

14 0.45514137 21 nips-2007-Adaptive Online Gradient Descent

15 0.45475918 171 nips-2007-Scan Strategies for Meteorological Radars

16 0.41967252 191 nips-2007-Temporal Difference Updating without a Learning Rate

17 0.41705754 17 nips-2007-A neural network implementing optimal state estimation based on dynamic spike train decoding

18 0.41333181 163 nips-2007-Receding Horizon Differential Dynamic Programming

19 0.40517145 48 nips-2007-Collective Inference on Markov Models for Modeling Bird Migration

20 0.40117097 206 nips-2007-Topmoumoute Online Natural Gradient Algorithm


similar papers computed by lda model

lda for this paper:

topicId topicWeight

[(5, 0.035), (13, 0.052), (16, 0.018), (18, 0.015), (19, 0.015), (21, 0.063), (28, 0.255), (31, 0.037), (34, 0.046), (35, 0.021), (47, 0.132), (49, 0.035), (83, 0.128), (90, 0.063)]

similar papers list:

simIndex simValue paperId paperTitle

1 0.78182572 79 nips-2007-Efficient multiple hyperparameter learning for log-linear models

Author: Chuan-sheng Foo, Chuong B. Do, Andrew Y. Ng

Abstract: In problems where input features have varying amounts of noise, using distinct regularization hyperparameters for different features provides an effective means of managing model complexity. While regularizers for neural networks and support vector machines often rely on multiple hyperparameters, regularizers for structured prediction models (used in tasks such as sequence labeling or parsing) typically rely only on a single shared hyperparameter for all features. In this paper, we consider the problem of choosing regularization hyperparameters for log-linear models, a class of structured prediction probabilistic models which includes conditional random fields (CRFs). Using an implicit differentiation trick, we derive an efficient gradient-based method for learning Gaussian regularization priors with multiple hyperparameters. In both simulations and the real-world task of computational RNA secondary structure prediction, we find that multiple hyperparameter learning can provide a significant boost in accuracy compared to using only a single regularization hyperparameter. 1

2 0.77691954 15 nips-2007-A general agnostic active learning algorithm

Author: Sanjoy Dasgupta, Claire Monteleoni, Daniel J. Hsu

Abstract: We present an agnostic active learning algorithm for any hypothesis class of bounded VC dimension under arbitrary data distributions. Most previous work on active learning either makes strong distributional assumptions, or else is computationally prohibitive. Our algorithm extends the simple scheme of Cohn, Atlas, and Ladner [1] to the agnostic setting, using reductions to supervised learning that harness generalization bounds in a simple but subtle manner. We provide a fall-back guarantee that bounds the algorithm’s label complexity by the agnostic PAC sample complexity. Our analysis yields asymptotic label complexity improvements for certain hypothesis classes and distributions. We also demonstrate improvements experimentally. 1

same-paper 3 0.73769075 91 nips-2007-Fitted Q-iteration in continuous action-space MDPs

Author: András Antos, Csaba Szepesvári, Rémi Munos

Abstract: We consider continuous state, continuous action batch reinforcement learning where the goal is to learn a good policy from a sufficiently rich trajectory generated by some policy. We study a variant of fitted Q-iteration, where the greedy action selection is replaced by searching for a policy in a restricted set of candidate policies by maximizing the average action values. We provide a rigorous analysis of this algorithm, proving what we believe is the first finite-time bound for value-function based algorithms for continuous state and action problems. 1 Preliminaries We will build on the results from [1, 2, 3] and for this reason we use the same notation as these papers. The unattributed results cited in this section can be found in the book [4]. A discounted MDP is defined by a quintuple (X , A, P, S, γ), where X is the (possible infinite) state space, A is the set of actions, P : X × A → M (X ) is the transition probability kernel with P (·|x, a) defining the next-state distribution upon taking action a from state x, S(·|x, a) gives the corresponding distribution of immediate rewards, and γ ∈ (0, 1) is the discount factor. Here X is a measurable space and M (X ) denotes the set of all probability measures over X . The Lebesguemeasure shall be denoted by λ. We start with the following mild assumption on the MDP: Assumption A1 (MDP Regularity) X is a compact subset of the dX -dimensional Euclidean space, ˆ A is a compact subset of [−A∞ , A∞ ]dA . The random immediate rewards are bounded by Rmax and that the expected immediate reward function, r(x, a) = rS(dr|x, a), is uniformly bounded by Rmax : r ∞ ≤ Rmax . A policy determines the next action given the past observations. Here we shall deal with stationary (Markovian) policies which choose an action in a stochastic way based on the last observation only. The value of a policy π when it is started from a state x is defined as the total expected discounted ∞ reward that is encountered while the policy is executed: V π (x) = Eπ [ t=0 γ t Rt |X0 = x]. Here Rt ∼ S(·|Xt , At ) is the reward received at time step t, the state, Xt , evolves according to Xt+1 ∼ ∗ Also with: Computer and Automation Research Inst. of the Hungarian Academy of Sciences Kende u. 13-17, Budapest 1111, Hungary. 1 P (·|Xt , At ), where At is sampled from the distribution determined by π. We use Qπ : X × A → R ∞ to denote the action-value function of policy π: Qπ (x, a) = Eπ [ t=0 γ t Rt |X0 = x, A0 = a]. The goal is to find a policy that attains the best possible values, V ∗ (x) = supπ V π (x), at all states ∗ x ∈ X . Here V ∗ is called the optimal value function and a policy π ∗ that satisfies V π (x) = ∗ ∗ ∗ V (x) for all x ∈ X is called optimal. The optimal action-value function Q (x, a) is Q (x, a) = supπ Qπ (x, a). We say that a (deterministic stationary) policy π is greedy w.r.t. an action-value function Q ∈ B(X × A), and we write π = π (·; Q), if, for all x ∈ X , π(x) ∈ argmaxa∈A Q(x, a). ˆ Under mild technical assumptions, such a greedy policy always exists. Any greedy policy w.r.t. Q∗ is optimal. For π : X → A we define its evaluation operator, T π : B(X × A) → B(X × A), by (T π Q)(x, a) = r(x, a) + γ X Q(y, π(y)) P (dy|x, a). It is known that Qπ = T π Qπ . Further, if we let the Bellman operator, T : B(X × A) → B(X × A), defined by (T Q)(x, a) = r(x, a) + γ X supb∈A Q(y, b) P (dy|x, a) then Q∗ = T Q∗ . It is known that V π and Qπ are bounded by Rmax /(1 − γ), just like Q∗ and V ∗ . For π : X → A, the operator E π : B(X × A) → B(X ) is defined by (E π Q)(x) = Q(x, π(x)), while E : B(X × A) → B(X ) is defined by (EQ)(x) = supa∈A Q(x, a). Throughout the paper F ⊂ {f : X × A → R} will denote a subset of real-valued functions over the state-action space X × A and Π ⊂ AX will be a set of policies. For ν ∈ M (X ) and f : X → R p measurable, we let (for p ≥ 1) f p,ν = X |f (x)|p ν(dx). We simply write f ν for f 2,ν . 2 Further, we extend · ν to F by f ν = A X |f |2 (x, a) dν(x) dλA (a), where λA is the uniform distribution over A. We shall use the shorthand notation νf to denote the integral f (x)ν(dx). We denote the space of bounded measurable functions with domain X by B(X ). Further, the space of measurable functions bounded by 0 < K < ∞ shall be denoted by B(X ; K). We let · ∞ denote the supremum norm. 2 Fitted Q-iteration with approximate policy maximization We assume that we are given a finite trajectory, {(Xt , At , Rt )}1≤t≤N , generated by some stochastic stationary policy πb , called the behavior policy: At ∼ πb (·|Xt ), Xt+1 ∼ P (·|Xt , At ), Rt ∼ def S(·|Xt , At ), where πb (·|x) is a density with π0 = inf (x,a)∈X ×A πb (a|x) > 0. The generic recipe for fitted Q-iteration (FQI) [5] is Qk+1 = Regress(Dk (Qk )), (1) where Regress is an appropriate regression procedure and Dk (Qk ) is a dataset defining a regression problem in the form of a list of data-point pairs: Dk (Qk ) = (Xt , At ), Rt + γ max Qk (Xt+1 , b) b∈A 1≤t≤N .1 Fitted Q-iteration can be viewed as approximate value iteration applied to action-value functions. To see this note that value iteration would assign the value (T Qk )(x, a) = r(x, a) + γ maxb∈A Qk (y, b) P (dy|x, a) to Qk+1 (x, a) [6]. Now, remember that the regression function for the jointly distributed random variables (Z, Y ) is defined by the conditional expectation of Y given Z: m(Z) = E [Y |Z]. Since for any fixed function Q, E [Rt + γ maxb∈A Q(Xt+1 , b)|Xt , At ] = (T Q)(Xt , At ), the regression function corresponding to the data Dk (Q) is indeed T Q and hence if FQI solved the regression problem defined by Qk exactly, it would simulate value iteration exactly. However, this argument itself does not directly lead to a rigorous analysis of FQI: Since Qk is obtained based on the data, it is itself a random function. Hence, after the first iteration, the “target” function in FQI becomes random. Furthermore, this function depends on the same data that is used to define the regression problem. Will FQI still work despite these issues? To illustrate the potential difficulties consider a dataset where X1 , . . . , XN is a sequence of independent random variables, which are all distributed uniformly at random in [0, 1]. Further, let M be a random integer greater than N which is independent of the dataset (Xt )N . Let U be another random variable, uniformly t=1 distributed in [0, 1]. Now define the regression problem by Yt = fM,U (Xt ), where fM,U (x) = sgn(sin(2M 2π(x + U ))). Then it is not hard to see that no matter how big N is, no procedure can 1 Since the designer controls Qk , we may assume that it is continuous, hence the maximum exists. 2 estimate the regression function fM,U with a small error (in expectation, or with high probability), even if the procedure could exploit the knowledge of the specific form of fM,U . On the other hand, if we restricted M to a finite range then the estimation problem could be solved successfully. The example shows that if the complexity of the random functions defining the regression problem is uncontrolled then successful estimation might be impossible. Amongst the many regression methods in this paper we have chosen to work with least-squares methods. In this case Equation (1) takes the form N Qk+1 = argmin Q∈F t=1 1 πb (At |Xt ) 2 Q(Xt , At ) − Rt + γ max Qk (Xt+1 , b) b∈A . (2) We call this method the least-squares fitted Q-iteration (LSFQI) method. Here we introduced the weighting 1/πb (At |Xt ) since we do not want to give more weight to those actions that are preferred by the behavior policy. Besides this weighting, the only parameter of the method is the function set F. This function set should be chosen carefully, to keep a balance between the representation power and the number of samples. As a specific example for F consider neural networks with some fixed architecture. In this case the function set is generated by assigning weights in all possible ways to the neural net. Then the above minimization becomes the problem of tuning the weights. Another example is to use linearly parameterized function approximation methods with appropriately selected basis functions. In this case the weight tuning problem would be less demanding. Yet another possibility is to let F be an appropriate restriction of a Reproducing Kernel Hilbert Space (e.g., in a ball). In this case the training procedure becomes similar to LS-SVM training [7]. As indicated above, the analysis of this algorithm is complicated by the fact that the new dataset is defined in terms of the previous iterate, which is already a function of the dataset. Another complication is that the samples in a trajectory are in general correlated and that the bias introduced by the imperfections of the approximation architecture may yield to an explosion of the error of the procedure, as documented in a number of cases in, e.g., [8]. Nevertheless, at least for finite action sets, the tools developed in [1, 3, 2] look suitable to show that under appropriate conditions these problems can be overcome if the function set is chosen in a judicious way. However, the results of these works would become essentially useless in the case of an infinite number of actions since these previous bounds grow to infinity with the number of actions. Actually, we believe that this is not an artifact of the proof techniques of these works, as suggested by the counterexample that involved random targets. The following result elaborates this point further: Proposition 2.1. Let F ⊂ B(X × A). Then even if the pseudo-dimension of F is finite, the fatshattering function of ∨ Fmax = VQ : VQ (·) = max Q(·, a), Q ∈ F a∈A 2 can be infinite over (0, 1/2). Without going into further details, let us just note that the finiteness of the fat-shattering function is a sufficient and necessary condition for learnability and the finiteness of the fat-shattering function is implied by the finiteness of the pseudo-dimension [9].The above proposition thus shows that without imposing further special conditions on F, the learning problem may become infeasible. One possibility is of course to discretize the action space, e.g., by using a uniform grid. However, if the action space has a really high dimensionality, this approach becomes unfeasible (even enumerating 2dA points could be impossible when dA is large). Therefore we prefer alternate solutions. Another possibility is to make the functions in F, e.g., uniformly Lipschitz in their state coordinates. ∨ Then the same property will hold for functions in Fmax and hence by a classical result we can bound the capacity of this set (cf. pp. 353–357 of [10]). One potential problem with this approach is that this way it might be difficult to get a fine control of the capacity of the resulting set. 2 The proof of this and the other results are given in the appendix, available in the extended version of this paper, downloadable from http://hal.inria.fr/inria-00185311/en/. 3 In the approach explored here we modify the fitted Q-iteration algorithm by introducing a policy set Π and a search over this set for an approximately greedy policy in a sense that will be made precise in a minute. Our algorithm thus has four parameters: F, Π, K, Q0 . Here F is as before, Π is a user-chosen set of policies (mappings from X to A), K is the number of iterations and Q0 is an initial value function (a typical choice is Q0 ≡ 0). The algorithm computes a sequence of iterates (Qk , πk ), k = 0, . . . , K, defined by the following equations: ˆ N π0 ˆ = argmax π∈Π Q0 (Xt , π(Xt )), t=1 N Qk+1 = argmin Q∈F t=1 1 Q(Xt , At ) − Rt + γQk (Xt+1 , πk (Xt+1 )) ˆ πb (At |Xt ) 2 , (3) N πk+1 ˆ = argmax π∈Π Qk+1 (Xt , π(Xt )). (4) t=1 Thus, (3) is similar to (2), while (4) defines the policy search problem. The policy search will generally be solved by a gradient procedure or some other appropriate method. The cost of this step will be primarily determined by how well-behaving the iterates Qk+1 are in their action arguments. For example, if they were quadratic and if π was linear then the problem would be a quadratic optimization problem. However, except for special cases3 the action value functions will be more complicated, in which case this step can be expensive. Still, this cost could be similar to that of searching for the maximizing actions for each t = 1, . . . , N if the approximately maximizing actions are similar across similar states. This algorithm, which we could also call a fitted actor-critic algorithm, will be shown to overcome the above mentioned complexity control problem provided that the complexity of Π is controlled appropriately. Indeed, in this case the set of possible regression problems is determined by the set ∨ FΠ = { V : V (·) = Q(·, π(·)), Q ∈ F, π ∈ Π } , ∨ and the proof will rely on controlling the complexity of FΠ by selecting F and Π appropriately. 3 3.1 The main theoretical result Outline of the analysis In order to gain some insight into the behavior of the algorithm, we provide a brief summary of its error analysis. The main result will be presented subsequently. For f ,Q ∈ F and a policy π, we define the tth TD-error as follows: dt (f ; Q, π) = Rt + γQ(Xt+1 , π(Xt+1 )) − f (Xt , At ). Further, we define the empirical loss function by 1 ˆ LN (f ; Q, π) = N N t=1 d2 (f ; Q, π) t , λ(A)πb (At |Xt ) where the normalization with λ(A) is introduced for mathematical convenience. Then (3) can be ˆ written compactly as Qk+1 = argminf ∈F LN (f ; Qk , πk ). ˆ ˆ The algorithm can then be motivated by the observation that for any f ,Q, and π, LN (f ; Q, π) is an unbiased estimate of def 2 L(f ; Q, π) = f − T π Q ν + L∗ (Q, π), (5) where the first term is the error we are interested in and the second term captures the variance of the random samples: L∗ (Q, π) = E [Var [R1 + γQ(X2 , π(X2 ))|X1 , A1 = a]] dλA (a). A 3 Linear quadratic regulation is such a nice case. It is interesting to note that in this special case the obvious choices for F and Π yield zero error in the limit, as can be proven based on the main result of this paper. 4 ˆ This result is stated formally by E LN (f ; Q, π) = L(f ; Q, π). Since the variance term in (5) is independent of f , argminf ∈F L(f ; Q, π) = 2 π argminf ∈F f − T Q ν . Thus, if πk were greedy w.r.t. Qk then argminf ∈F L(f ; Qk , πk ) = ˆ ˆ 2 argminf ∈F f − T Qk ν . Hence we can still think of the procedure as approximate value iteration over the space of action-value functions, projecting T Qk using empirical risk minimization on the space F w.r.t. · ν distances in an approximate manner. Since πk is only approximately greedy, we ˆ will have to deal with both the error coming from the approximate projection and the error coming from the choice of πk . To make this clear, we write the iteration in the form ˆ ˆ ˆ Qk+1 = T πk Qk + εk = T Qk + εk + (T πk Qk − T Qk ) = T Qk + εk , def ˆ ˆ where εk is the error committed while computing T πk Qk , εk = T πk Qk − T Qk is the error committed because the greedy policy is computed approximately and εk = εk + εk is the total error of step k. Hence, in order to show that the procedure is well behaved, one needs to show that both errors are controlled and that when the errors are propagated through these equations, the resulting error stays controlled, too. Since we are ultimately interested in the performance of the policy obtained, we will also need to show that small action-value approximation errors yield small performance losses. For these we need a number of assumptions that concern either the training data, the MDP, or the function sets used for learning. 3.2 Assumptions 3.2.1 Assumptions on the training data We shall assume that the data is rich, is in a steady state, and is fast-mixing, where, informally, mixing means that future depends weakly on the past. Assumption A2 (Sample Path Properties) Assume that {(Xt , At , Rt )}t=1,...,N is the sample path of πb , a stochastic stationary policy. Further, assume that {Xt } is strictly stationary (Xt ∼ ν ∈ M (X )) and exponentially β-mixing with the actual rate given by the parameters (β, b, κ).4 We further assume that the sampling policy πb satisfies π0 = inf (x,a)∈X ×A πb (a|x) > 0. The β-mixing property will be used to establish tail inequalities for certain empirical processes.5 Note that the mixing coefficients do not need to be known. In the case when no mixing condition is satisfied, learning might be impossible. To see this just consider the case when X1 = X2 = . . . = XN . Thus, in this case the learner has many copies of the same random variable and successful generalization is thus impossible. We believe that the assumption that the process is in a steady state is not essential for our result, as when the process reaches its steady state quickly then (at the price of a more involved proof) the result would still hold. 3.2.2 Assumptions on the MDP In order to prevent the uncontrolled growth of the errors as they are propagated through the updates, we shall need some assumptions on the MDP. A convenient assumption is the following one [11]: Assumption A3 (Uniformly stochastic transitions) For all x ∈ X and a ∈ A, assume that P (·|x, a) is absolutely continuous w.r.t. ν and the Radon-Nikodym derivative of P w.r.t. ν is bounded def < +∞. uniformly with bound Cν : Cν = supx∈X ,a∈A dP (·|x,a) dν ∞ Note that by the definition of measure differentiation, Assumption A3 means that P (·|x, a) ≤ Cν ν(·). This assumption essentially requires the transitions to be noisy. We will also prove (weaker) results under the following, weaker assumption: 4 For the definition of β-mixing, see e.g. [2]. We say “empirical process” and “empirical measure”, but note that in this work these are based on dependent (mixing) samples. 5 5 Assumption A4 (Discounted-average concentrability of future-state distributions) Given ρ, ν, m ≥ 1 and an arbitrary sequence of stationary policies {πm }m≥1 , assume that the futuredef state distribution ρP π1 P π2 . . . P πm is absolutely continuous w.r.t. ν. Assume that c(m) = π1 π2 πm def satisfies m≥1 mγ m−1 c(m) < +∞. We shall call Cρ,ν = supπ1 ,...,πm d(ρP Pdν ...P ) ∞ max (1 − γ)2 m≥1 mγ m−1 c(m), (1 − γ) m≥1 γ m c(m) the discounted-average concentrability coefficient of the future-state distributions. The number c(m) measures how much ρ can get amplified in m steps as compared to the reference distribution ν. Hence, in general we expect c(m) to grow with m. In fact, the condition that Cρ,µ is finite is a growth rate condition on c(m). Thanks to discounting, Cρ,µ is finite for a reasonably large class of systems (see the discussion in [11]). A related assumption is needed in the error analysis of the approximate greedy step of the algorithm: Assumption A5 (The random policy “makes no peak-states”) Consider the distribution µ = (ν × λA )P which is the distribution of a state that results from sampling an initial state according to ν and then executing an action which is selected uniformly at random.6 Then Γν = dµ/dν ∞ < +∞. Note that under Assumption A3 we have Γν ≤ Cν . This (very mild) assumption means that after one step, starting from ν and executing this random policy, the probability of the next state being in a set is upper bounded by Γν -times the probability of the starting state being in the same set. def Besides, we assume that A has the following regularity property: Let Py(a, h, ρ) = (a , v) ∈ RdA +1 : a − a 1 ≤ ρ, 0 ≤ v/h ≤ 1 − a − a 1 /ρ denote the pyramid with hight h and base given by the 1 def -ball B(a, ρ) = a ∈ RdA : a − a 1 ≤ρ centered at a. Assumption A6 (Regularity of the action space) We assume that there exists α > 0, such that for all a ∈ A, for all ρ > 0, λ(Py(a, 1, ρ) ∩ (A × R)) λ(A) ≥ min α, λ(Py(a, 1, ρ)) λ(B(a, ρ)) For example, if A is an 1 . -ball itself, then this assumption will be satisfied with α = 2−dA . Without assuming any smoothness of the MDP, learning in infinite MDPs looks hard (see, e.g., [12, 13]). Here we employ the following extra condition: Assumption A7 (Lipschitzness of the MDP in the actions) Assume that the transition probabilities and rewards are Lipschitz w.r.t. their action variable, i.e., there exists LP , Lr > 0 such that for all (x, a, a ) ∈ X × A × A and measurable set B of X , |P (B|x, a) − P (B|x, a )| ≤ LP a − a 1 , |r(x, a) − r(x, a )| ≤ Lr a − a 1 . Note that previously Lipschitzness w.r.t. the state variables was used, e.g., in [11] to construct consistent planning algorithms. 3.2.3 Assumptions on the function sets used by the algorithm These assumptions are less demanding since they are under the control of the user of the algorithm. However, the choice of these function sets will greatly influence the performance of the algorithm, as we shall see it from the bounds. The first assumption concerns the class F: Assumption A8 (Lipschitzness of candidate action-value functions) Assume F ⊂ B(X × A) and that any elements of F is uniformly Lipschitz in its action-argument in the sense that |Q(x, a) − Q(x, a )| ≤ LA a − a 1 holds for any x ∈ X , a,a ∈ A, and Q ∈ F . 6 Remember that λA denotes the uniform distribution over the action set A. 6 We shall also need to control the capacity of our function sets. We assume that the reader is familiar with the concept of VC-dimension.7 Here we use the pseudo-dimension of function sets that builds upon the concept of VC-dimension: Definition 3.1 (Pseudo-dimension). The pseudo-dimension VF + of F is defined as the VCdimension of the subgraphs of functions in F (hence it is also called the VC-subgraph dimension of F). Since A is multidimensional, we define VΠ+ to be the sum of the pseudo-dimensions of the coordinate projection spaces, Πk of Π: dA V Π+ = VΠ + , k=1 k Πk = { πk : X → R : π = (π1 , . . . , πk , . . . , πdA ) ∈ Π } . Now we are ready to state our assumptions on our function sets: Assumption A9 (Capacity of the function and policy sets) Assume that F ⊂ B(X × A; Qmax ) for Qmax > 0 and VF + < +∞. Also, A ⊂ [−A∞ , A∞ ]dA and VΠ+ < +∞. Besides their capacity, one shall also control the approximation power of the function sets involved. Let us first consider the policy set Π. Introduce e∗ (F, Π) = sup inf ν(EQ − E π Q). Q∈F π∈Π Note that inf π∈Π ν(EQ − E π Q) measures the quality of approximating νEQ by νE π Q. Hence, e∗ (F, Π) measures the worst-case approximation error of νEQ as Q is changed within F. This can be made small by choosing Π large. Another related quantity is the one-step Bellman-error of F w.r.t. Π. This is defined as follows: For a fixed policy π, the one-step Bellman-error of F w.r.t. T π is defined as E1 (F; π) = sup inf Q∈F Q ∈F Q − T πQ ν . Taking again a pessimistic approach, the one-step Bellman-error of F is defined as E1 (F, Π) = sup E1 (F; π). π∈Π Typically by increasing F, E1 (F, Π) can be made smaller (this is discussed at some length in [3]). However, it also holds for both Π and F that making them bigger will increase their capacity (pseudo-dimensions) which leads to an increase of the estimation errors. Hence, F and Π must be selected to balance the approximation and estimation errors, just like in supervised learning. 3.3 The main result Theorem 3.2. Let πK be a greedy policy w.r.t. QK , i.e. πK (x) ∈ argmaxa∈A QK (x, a). Then under Assumptions A1, A2, and A5–A9, for all δ > 0 we have with probability at least 1 − δ: given Assumption A3 (respectively A4), V ∗ − V πK ∞ (resp. V ∗ − V πK 1,ρ ), is bounded by    d 1+1 κ+1   A   4κ (log N + log(K/δ))  + γK , C E1 (F, Π) + e∗ (F, Π) + 1/4   N   where C depends on dA , VF + , (VΠ+ )dA , γ, κ, b, β, Cν (resp. Cρ,ν ), Γν , LA , LP ,Lr , α, λ(A), π0 , k=1 k κ+1 ˆ Qmax , Rmax , Rmax , and A∞ . In particular, C scales with V 4κ(dA +1) , where V = 2VF + + VΠ+ plays the role of the “combined effective” dimension of F and Π. 7 Readers not familiar with VC-dimension are suggested to consult a book, such as the one by Anthony and Bartlett [14]. 7 4 Discussion We have presented what we believe is the first finite-time bounds for continuous-state and actionspace RL that uses value functions. Further, this is the first analysis of fitted Q-iteration, an algorithm that has proved to be useful in a number of cases, even when used with non-averagers for which no previous theoretical analysis existed (e.g., [15, 16]). In fact, our main motivation was to show that there is a systematic way of making these algorithms work and to point at possible problem sources the same time. We discussed why it can be difficult to make these algorithms work in practice. We suggested that either the set of action-value candidates has to be carefully controlled (e.g., assuming uniform Lipschitzness w.r.t. the state variables), or a policy search step is needed, just like in actorcritic algorithms. The bound in this paper is similar in many respects to a previous bound of a Bellman-residual minimization algorithm [2]. It looks that the techniques developed here can be used to obtain results for that algorithm when it is applied to continuous action spaces. Finally, although we have not explored them here, consistency results for FQI can be obtained from our results using standard methods, like the methods of sieves. We believe that the methods developed here will eventually lead to algorithms where the function approximation methods are chosen based on the data (similar to adaptive regression methods) so as to optimize performance, which in our opinion is one of the biggest open questions in RL. Currently we are exploring this possibility. Acknowledgments Andr´ s Antos would like to acknowledge support for this project from the Hungarian Academy of Sciences a (Bolyai Fellowship). Csaba Szepesv´ ri greatly acknowledges the support received from the Alberta Ingenuity a Fund, NSERC, the Computer and Automation Research Institute of the Hungarian Academy of Sciences. References [1] A. Antos, Cs. Szepesv´ ri, and R. Munos. Learning near-optimal policies with Bellman-residual minia mization based fitted policy iteration and a single sample path. In COLT-19, pages 574–588, 2006. [2] A. Antos, Cs. Szepesv´ ri, and R. Munos. Learning near-optimal policies with Bellman-residual minia mization based fitted policy iteration and a single sample path. Machine Learning, 2007. (accepted). [3] A. Antos, Cs. Szepesv´ ri, and R. Munos. Value-iteration based fitted policy iteration: learning with a a single trajectory. In IEEE ADPRL, pages 330–337, 2007. [4] D. P. Bertsekas and S.E. Shreve. Stochastic Optimal Control (The Discrete Time Case). Academic Press, New York, 1978. [5] D. Ernst, P. Geurts, and L. Wehenkel. Tree-based batch mode reinforcement learning. Journal of Machine Learning Research, 6:503–556, 2005. [6] R.S. Sutton and A.G. Barto. Reinforcement Learning: An Introduction. Bradford Book. MIT Press, 1998. [7] N. Cristianini and J. Shawe-Taylor. An introduction to support vector machines (and other kernel-based learning methods). Cambridge University Press, 2000. [8] J.A. Boyan and A.W. Moore. Generalization in reinforcement learning: Safely approximating the value function. In NIPS-7, pages 369–376, 1995. [9] P.L. Bartlett, P.M. Long, and R.C. Williamson. Fat-shattering and the learnability of real-valued functions. Journal of Computer and System Sciences, 52:434–452, 1996. [10] A.N. Kolmogorov and V.M. Tihomirov. -entropy and -capacity of sets in functional space. American Mathematical Society Translations, 17(2):277–364, 1961. [11] R. Munos and Cs. Szepesv´ ri. Finite time bounds for sampling based fitted value iteration. Technical a report, Computer and Automation Research Institute of the Hungarian Academy of Sciences, Kende u. 13-17, Budapest 1111, Hungary, 2006. [12] A.Y. Ng and M. Jordan. PEGASUS: A policy search method for large MDPs and POMDPs. In Proceedings of the 16th Conference in Uncertainty in Artificial Intelligence, pages 406–415, 2000. [13] P.L. Bartlett and A. Tewari. Sample complexity of policy search with known dynamics. In NIPS-19. MIT Press, 2007. [14] M. Anthony and P. L. Bartlett. Neural Network Learning: Theoretical Foundations. Cambridge University Press, 1999. [15] M. Riedmiller. Neural fitted Q iteration – first experiences with a data efficient neural reinforcement learning method. In 16th European Conference on Machine Learning, pages 317–328, 2005. [16] S. Kalyanakrishnan and P. Stone. Batch reinforcement learning in a complex domain. In AAMAS-07, 2007. 8

4 0.63667274 168 nips-2007-Reinforcement Learning in Continuous Action Spaces through Sequential Monte Carlo Methods

Author: Alessandro Lazaric, Marcello Restelli, Andrea Bonarini

Abstract: Learning in real-world domains often requires to deal with continuous state and action spaces. Although many solutions have been proposed to apply Reinforcement Learning algorithms to continuous state problems, the same techniques can be hardly extended to continuous action spaces, where, besides the computation of a good approximation of the value function, a fast method for the identification of the highest-valued action is needed. In this paper, we propose a novel actor-critic approach in which the policy of the actor is estimated through sequential Monte Carlo methods. The importance sampling step is performed on the basis of the values learned by the critic, while the resampling step modifies the actor’s policy. The proposed approach has been empirically compared to other learning algorithms into several domains; in this paper, we report results obtained in a control problem consisting of steering a boat across a river. 1

5 0.62615567 148 nips-2007-Online Linear Regression and Its Application to Model-Based Reinforcement Learning

Author: Alexander L. Strehl, Michael L. Littman

Abstract: We provide a provably efficient algorithm for learning Markov Decision Processes (MDPs) with continuous state and action spaces in the online setting. Specifically, we take a model-based approach and show that a special type of online linear regression allows us to learn MDPs with (possibly kernalized) linearly parameterized dynamics. This result builds on Kearns and Singh’s work that provides a provably efficient algorithm for finite state MDPs. Our approach is not restricted to the linear setting, and is applicable to other classes of continuous MDPs.

6 0.6246435 43 nips-2007-Catching Change-points with Lasso

7 0.62450308 86 nips-2007-Exponential Family Predictive Representations of State

8 0.6243006 34 nips-2007-Bayesian Policy Learning with Trans-Dimensional MCMC

9 0.61955178 63 nips-2007-Convex Relaxations of Latent Variable Training

10 0.61854541 122 nips-2007-Locality and low-dimensions in the prediction of natural experience from fMRI

11 0.61746758 18 nips-2007-A probabilistic model for generating realistic lip movements from speech

12 0.61558557 177 nips-2007-Simplified Rules and Theoretical Analysis for Information Bottleneck Optimization and PCA with Spiking Neurons

13 0.61533511 100 nips-2007-Hippocampal Contributions to Control: The Third Way

14 0.61398906 138 nips-2007-Near-Maximum Entropy Models for Binary Neural Representations of Natural Images

15 0.61195904 158 nips-2007-Probabilistic Matrix Factorization

16 0.61159992 115 nips-2007-Learning the 2-D Topology of Images

17 0.61147338 45 nips-2007-Classification via Minimum Incremental Coding Length (MICL)

18 0.61137933 69 nips-2007-Discriminative Batch Mode Active Learning

19 0.61114269 93 nips-2007-GRIFT: A graphical model for inferring visual classification features from human data

20 0.61010295 24 nips-2007-An Analysis of Inference with the Universum