nips nips2007 nips2007-148 knowledge-graph by maker-knowledge-mining

148 nips-2007-Online Linear Regression and Its Application to Model-Based Reinforcement Learning


Source: pdf

Author: Alexander L. Strehl, Michael L. Littman

Abstract: We provide a provably efficient algorithm for learning Markov Decision Processes (MDPs) with continuous state and action spaces in the online setting. Specifically, we take a model-based approach and show that a special type of online linear regression allows us to learn MDPs with (possibly kernelized) linearly parameterized dynamics. This result builds on Kearns and Singh’s work that provides a provably efficient algorithm for finite state MDPs. Our approach is not restricted to the linear setting, and is applicable to other classes of continuous MDPs.

Reference: text


Summary: the most important sentences generated by the tfidf model

sentIndex sentText sentNum sentScore

1 We provide a provably efficient algorithm for learning Markov Decision Processes (MDPs) with continuous state and action spaces in the online setting. [sent-7, score-0.576]

2 Specifically, we take a model-based approach and show that a special type of online linear regression allows us to learn MDPs with (possibly kernelized) linearly parameterized dynamics. [sent-8, score-0.265]

3 This result builds on Kearns and Singh’s work that provides a provably efficient algorithm for finite state MDPs. [sent-9, score-0.269]

4 Applied RL has been very successful, producing world-class computer backgammon players (Tesauro, 1994) and model helicopter flyers (Ng et al.). [sent-12, score-0.14]

5 Provably efficient RL for finite state and action spaces was accomplished by Kearns and Singh (2002) and contributes greatly to our understanding of the relationship between exploration and sequential decision making. [sent-16, score-0.388]

6 The achievement of the current paper is to provide an efficient RL algorithm that learns in Markov Decision Processes (MDPs) with continuous state and action spaces. [sent-17, score-0.45]

7 We prove that it learns linearly-parameterized MDPs, a model introduced by Abbeel and Ng (2005), with sample (or experience) complexity that grows only polynomially with the number of state space dimensions. [sent-18, score-0.225]

8 Our new RL algorithm utilizes a special linear regressor, based on least-squares regression, whose analysis may be of interest to the online learning and statistics communities. [sent-19, score-0.248]

9 The linear dynamics case should be viewed as only an interesting example of our approach, which makes substantial progress toward the goal of understanding the relationship between exploration and generalization in RL. [sent-21, score-0.209]

10 In Section 1, we discuss online linear regression and pose a new online learning framework that requires an algorithm to not only provide predictions for new data points but also provide formal guarantees about its predictions. [sent-23, score-0.415]

11 Suppose we are given data (x1, y1), …, (xm, ym), where xi ∈ Rn and yi ∈ R for i = 1, …, m. [sent-32, score-0.201]

12 Further, suppose that the data satisfies a linear relationship, that is, yi ≈ θT xi for all i ∈ {1, …, m}. [sent-36, score-0.214]

13 When a new input x arrives, we would like to make a prediction of the corresponding output by estimating θ from our data. [sent-40, score-0.229]

14 A standard approach is to approximate θ with the least-squares estimator θ̂ defined by θ̂ = (XT X)−1 XT y, where X ∈ Rm×n is a matrix whose ith row is the transpose of the ith input xi and y ∈ Rm is a vector whose ith component is the ith output yi. [sent-41, score-0.94]
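
As a concrete illustration of the estimator above, here is a minimal numpy sketch; the variable names and the synthetic data are illustrative, not from the paper:

```python
import numpy as np

def least_squares_estimate(X: np.ndarray, y: np.ndarray) -> np.ndarray:
    """Ordinary least-squares estimate theta_hat for y ≈ X @ theta.

    X has shape (m, n): the ith row is the transpose of the ith input x_i.
    y has shape (m,):   the ith component is the ith output y_i.
    """
    # Mathematically theta_hat = inv(X.T @ X) @ X.T @ y; lstsq computes the
    # same quantity in a numerically safer way.
    theta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
    return theta_hat

# Tiny usage example with synthetic data.
rng = np.random.default_rng(0)
theta = np.array([0.5, -0.3, 0.2])
X = rng.normal(size=(200, 3))
y = X @ theta + 0.05 * rng.normal(size=200)
print(least_squares_estimate(X, y))  # close to [0.5, -0.3, 0.2]
```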

15 Although there are many analyses of the linear regression problem, none is quite right for an application to model-based reinforcement learning (MBRL). [sent-42, score-0.286]

16 In particular, in MBRL, we cannot assume that X is fixed ahead of time, and we require not just a prediction of θ but also knowledge of whether this prediction is sufficiently accurate. [sent-43, score-0.205]

17 With this idea in mind, we present the following online learning problem related to linear regression. [sent-47, score-0.151]

18 On each timestep t, an input vector xt ∈ Rn satisfying ||xt|| ≤ 1 and an output number yt ∈ [−1, 1] are provided. [sent-52, score-0.499]

19 The input xt may be chosen in any way that depends on the previous inputs and outputs (x1, y1), …, (xt−1, yt−1). [sent-53, score-0.414]

20 The output yt is chosen probabilistically from a distribution that depends only on xt and satisfies E[yt] = θT xt and Var[yt] ≤ σ2, where θ ∈ Rn is an unknown parameter vector satisfying ||θ|| ≤ 1 and σ ∈ R is a known constant. [sent-57, score-0.773]
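
The generative assumptions above (bounded inputs, E[yt] = θT xt, variance at most σ2, yt ∈ [−1, 1]) can be simulated with a short sketch like the one below. The random choice of xt is an assumption made only for illustration, since in the actual problem the inputs may be chosen adversarially:

```python
import numpy as np

def kwik_stream(theta: np.ndarray, sigma: float, T: int, seed: int = 0):
    """Yield (x_t, y_t) pairs consistent with the setup above.

    Illustrative choices: inputs are random (a real adversary may pick x_t
    from the history), and the noise is uniform with its width chosen so that
    E[y_t] = theta^T x_t, Var[y_t] <= sigma^2, and y_t stays in [-1, 1].
    """
    rng = np.random.default_rng(seed)
    assert np.linalg.norm(theta) <= 1.0
    for _ in range(T):
        x = rng.normal(size=theta.shape[0])
        x /= max(np.linalg.norm(x), 1.0)        # enforce ||x_t|| <= 1
        mean = float(theta @ x)                  # lies in [-1, 1] by Cauchy-Schwarz
        half_width = min(sigma * np.sqrt(3.0), 1.0 - abs(mean))
        yield x, mean + rng.uniform(-half_width, half_width)
```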

21 After observing xt and before observing yt, the learning algorithm must produce an output ŷt ∈ [−1, 1] ∪ {∅} (a prediction of E[yt|xt]). [sent-58, score-0.71]

22 Furthermore, it should be able to provide an output ŷ(x) for any input vector x ∈ {0, 1}n. [sent-59, score-0.146]

23 A key aspect of our problem that distinguishes it from other online learning models is that the algorithm is allowed to output a special value ∅ rather than make a valid prediction (an output other than ∅). [sent-60, score-0.46]

24 The algorithm would like to minimize the number of times it predicts ∅, and, furthermore, when it does make a valid prediction the prediction must be accurate, with high probability. [sent-62, score-0.287]

25 Definition 2 We define an admissible algorithm for the KWIK Linear Regression Problem to be one that takes two inputs 0 ≤ ε ≤ 1 and 0 ≤ δ < 1 and, with probability at least 1 − δ, satisfies the following conditions: [sent-64, score-0.251]

26 Whenever the algorithm predicts ŷt(x) ∈ [−1, 1], we have that |ŷt(x) − θT x| ≤ ε. [sent-65, score-0.192]

27 The number of timesteps t for which ŷt(xt) = ∅ is bounded by some function ζ(ε, δ, n), polynomial in n, 1/ε, and 1/δ, called the sample complexity of the algorithm. [sent-67, score-0.412]
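
Definition 2 suggests a simple programmatic interface, sketched below with ∅ represented as None; the class and method names are mine, not the paper's:

```python
from typing import Optional, Protocol
import numpy as np

class KWIKRegressor(Protocol):
    """Interface implied by Definition 2 (names are illustrative, not the paper's).

    An admissible implementation, instantiated with accuracy eps and confidence
    delta, must -- with probability at least 1 - delta -- keep every non-None
    prediction within eps of theta^T x, and may return None (the symbol ∅) on
    at most zeta(eps, delta, n) timesteps.
    """

    def predict(self, x: np.ndarray) -> Optional[float]:
        """Return a prediction of E[y | x] in [-1, 1], or None to abstain."""
        ...

    def update(self, x: np.ndarray, y: float) -> None:
        """Incorporate the observed output y for input x."""
        ...
```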

28 Let X denote an m × n matrix whose rows we interpret as transposed input vectors. [sent-70, score-0.141]

29 We let X(i) denote the transpose of the ith row of X. [sent-71, score-0.245]

30 For a fixed input xt (a new input provided to the algorithm at time t), define q̄ := X̄ Ū Λ̄−1 ŪT xt ∈ Rm (2), and define v̄ as in Equation (3). [sent-91, score-0.245]

31 2: for t = 1, 2, 3, · · · do 3: Let xt denote the input at time t. [sent-99, score-0.283]

32 5: if ||q̄|| ≤ α1 and ||v̄|| ≤ α2 then 6: Choose θ̂ ∈ Rn that minimizes ∑i [ȳ(i) − θ̂T X̄(i)]2 subject to ||θ̂|| ≤ 1, where X̄(i) is the transpose of the ith row of X̄ and ȳ(i) is the ith component of ȳ. [sent-101, score-0.389]

33 Theorem 1 With appropriate parameter settings, Algorithm 1 is an admissible algorithm for the KWIK Linear Regression Problem with a sample complexity bound of Õ(n3/ε4). [sent-109, score-0.222]

34 Given a new input xt, the algorithm considers making a prediction of the output yt using the norm-constrained least-squares estimator (specifically, θ̂ defined in line 6 of Algorithm 1). [sent-111, score-0.684]

35 When both norms are small, the estimate is trusted and a valid prediction is made. [sent-113, score-0.235]

36 When either norm is large, the estimate is not trusted and the algorithm produces an output of ∅. [sent-114, score-0.223]

37 Thus, x can be written as a linear combination, whose coefficients make up q̄, of the rows of X (the previously experienced input vectors). [sent-118, score-0.2]
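
The sketch below puts these pieces together: abstain (return None) when the new input is not well supported by past data, otherwise predict with the norm-constrained least-squares fit. Because Equations (2) and (3) are only partly reproduced above, the construction of q̄ and v̄ here (q̄ from the eigendecomposition of X̄T X̄, v̄ as the component of the input outside the span of past inputs) and the thresholds are assumptions for illustration rather than the paper's exact Algorithm 1:

```python
import numpy as np

def constrained_lstsq(X, y, radius=1.0, iters=60):
    """Least squares subject to ||theta|| <= radius, via bisection on a ridge
    penalty; one standard approach, not necessarily the paper's implementation."""
    theta, *_ = np.linalg.lstsq(X, y, rcond=None)
    if np.linalg.norm(theta) <= radius:
        return theta
    A, b = X.T @ X, X.T @ y
    I = np.eye(X.shape[1])
    lo, hi = 0.0, 1.0
    while np.linalg.norm(np.linalg.solve(A + hi * I, b)) > radius:
        hi *= 2.0                                # grow the penalty until feasible
    for _ in range(iters):                       # bisect on the multiplier
        mid = 0.5 * (lo + hi)
        theta = np.linalg.solve(A + mid * I, b)
        lo, hi = (mid, hi) if np.linalg.norm(theta) > radius else (lo, mid)
    return theta

class KWIKLinearRegression:
    """Sketch of Algorithm 1: predict only when the new input is well supported."""

    def __init__(self, n, alpha1, alpha2):
        self.alpha1, self.alpha2 = alpha1, alpha2
        self.X = np.zeros((0, n))
        self.y = np.zeros(0)

    def predict(self, x):
        if len(self.y) == 0:
            return None                          # nothing learned yet -> ∅
        # q_bar: coefficients expressing x in terms of past inputs (rows of X),
        # built from the eigendecomposition X^T X = U diag(lam) U^T.
        lam, U = np.linalg.eigh(self.X.T @ self.X)
        keep = lam > 1e-9
        Uk = U[:, keep]
        q_bar = self.X @ Uk @ np.diag(1.0 / lam[keep]) @ Uk.T @ x
        # v_bar (assumption): the part of x outside the span of past inputs.
        v_bar = x - Uk @ (Uk.T @ x)
        if np.linalg.norm(q_bar) > self.alpha1 or np.linalg.norm(v_bar) > self.alpha2:
            return None                          # estimate not trusted -> ∅
        theta_hat = constrained_lstsq(self.X, self.y)
        return float(np.clip(theta_hat @ x, -1.0, 1.0))

    def update(self, x, y):
        # The paper trains only on samples gathered when it abstained; for
        # brevity this sketch simply stores every observed pair.
        self.X = np.vstack([self.X, x])
        self.y = np.append(self.y, y)
```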

38 Suppose (x1, y1), …, (xm, ym) is any sequence of samples satisfying xi ∈ Rn, yi ∈ R, ||xi|| ≤ 1, yi ∈ [−1, 1], E[yi|xi] = θT xi, and Var[yi|xi] ≤ σ2. [sent-134, score-0.427]

39 For any 0 < δ < 1 and fixed positive constant z, if ∑i=1..m [(θ̂ − θ)T xi]2 ≥ 2√(8m ln(2/δ)) + z (4), then ∑i=1..m (yi − θ̂T xi)2 > ∑i=1..m (yi − θT xi)2 + z (5), with probability at least 1 − 2δ. [sent-135, score-0.204]

40 The following lemma, whose proof is fairly straightforward and therefore omitted, relates the error of an estimate θ̂T x for a fixed input x, based on an inaccurate estimator θ̂, to the quantities ||q̄||, ||v̄||, and ∑i=1..m [(θ − θ̂)T X(i)]2. [sent-136, score-0.147]

41 Suppose (x1, y1), …, (xm, ym) is any sequence of samples satisfying xi ∈ Rn, yi ∈ R, ||xi|| ≤ 1, yi ∈ [−1, 1]. [sent-143, score-0.359]

42 The first is to bound the sample complexity of the algorithm (the number of times the algorithm makes a prediction of ∅), in terms of the input parameters α1 and α2 . [sent-149, score-0.311]

43 The third is to show that, with high probability, every valid prediction made by the algorithm is accurate. [sent-151, score-0.204]

44 Observing that the algorithm trains on only those samples experienced during precisely these timesteps, and applying Lemma 13 from the paper by Auer (2002), we have that m̄ = O( n ln(n/α1)/α1² + n/α2² ). [sent-153, score-0.188]

45 Step 3 Consider some fixed timestep t during the execution of Algorithm 1 such that the algorithm makes a valid prediction (not ∅). [sent-155, score-0.282]

46 Suppose not, namely that |(θ̂ − θ)T xt| > ε, and consider the quantity ∆E(θ̂)2 = ∑i=1..m [(θ̂ − θ)T X(i)]2, where m denotes the number of rows of the matrix X (equivalently, the number of samples used by the algorithm for training, which we upper-bounded by m̄), and X(i) denotes the transpose of the ith row of X. [sent-160, score-0.341]

47 3 Notes In our formulation of KLRP we assumed an upper bound of 1 on the two-norm of the inputs xi, outputs yi, and the true parameter vector θ. [sent-166, score-0.286]

48 Our analysis of Algorithm 1 showed that it is possible to solve KLRP with polynomial sample complexity (where the sample complexity is defined as the number of timesteps t that the algorithm outputs ∅ for the current input xt ), with high probability. [sent-168, score-0.721]

49 We note that the algorithm also has polynomial computational complexity per timestep, given the tractability of solving norm-constrained least-squares problems (see Chapter 12 of the book by Golub and Van Loan (1996)). [sent-169, score-0.206]

50 4 Related Work Work on linear regression is abundant in the statistics community (Seber & Lee, 2003). [sent-171, score-0.173]

51 Our analysis differs from that by Auer (2002) because we do not assume that the input vectors xi are fixed ahead of time, but rather that they may be chosen in an adversarial manner. [sent-173, score-0.171]

52 However, a crucial difference between our framework and analysis and previous work is the use of the output ∅ to signify uncertainty in the current estimate, which allows for efficient exploration in the application to RL as described in the next section. [sent-176, score-0.196]

53 The goal of the agent is to maximize an external reward signal by acting in an unknown environment. [sent-178, score-0.161]

54 The algorithm uses, as a subroutine, any admissible algorithm for the KWIK Linear Regression Problem introduced in Section 1. [sent-185, score-0.232]

55 The main difference is that we consider discounted rather than undiscounted MDPs and we don’t require the agent to have a “reset” action that takes it to a specified start state (or distribution). [sent-189, score-0.434]

56 The environment is described by a discounted MDP M = ⟨S, A, T, R, γ⟩, where S = RnS is the state space, A = RnA is the action space, T : S × A → PS is the unknown transition dynamics (PS denotes the set of probability distributions over S), γ ∈ [0, 1) is the discount factor, and R : S × A → R is the known reward function. [sent-191, score-0.526]

57 For each timestep t, let xt ∈ S denote the current state. (All of our results can easily be extended to the case of an unknown reward function with a suitable linearity assumption.) [sent-192, score-0.505]

58 The transition dynamics T satisfy xt+1 = M φ(xt, ut) + wt, (8) where xt+1 ∈ S, φ(·, ·) : RnS+nA → Rn is a (basis or kernel) function satisfying ||φ(·, ·)|| ≤ 1, and M is an nS × n matrix. [sent-194, score-0.196]

59 If an MDP satisfies the above conditions we say that it is linearly parameterized, because the next-state xt+1 is a linear function of the vector φ(xt , ut ) (which describes the current state and action) plus a noise term. [sent-200, score-0.233]
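
A linearly parameterized MDP of this form is easy to simulate; the feature map φ, the noise model, and the dimensions below are illustrative stand-ins:

```python
import numpy as np

rng = np.random.default_rng(1)
n_s, n_a = 2, 1                  # state and action dimensions (illustrative)
n = n_s + n_a + 1                # feature dimension of phi

def phi(x, u):
    """A (basis or kernel) feature map with ||phi(x, u)|| <= 1; this particular
    choice is a stand-in, not the paper's."""
    z = np.concatenate([x, u, [1.0]])
    return z / max(np.linalg.norm(z), 1.0)

M = rng.normal(size=(n_s, n))
M /= max(np.linalg.norm(M, axis=1).max(), 1.0)   # keep row norms bounded

def step(x, u, sigma=0.01):
    """One transition of a linearly parameterized MDP: x' = M phi(x, u) + w."""
    w = rng.normal(0.0, sigma, size=n_s)          # zero-mean noise term w_t
    return M @ phi(x, u) + w

x = np.zeros(n_s)
for _ in range(3):
    x = step(x, rng.uniform(-1.0, 1.0, size=n_a))
print(x)
```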

60 The agent always occupies a single state s of the MDP M . [sent-203, score-0.252]

61 It then receives an immediate reward r ∼ R(s, a) and is transported to a next state s′ ∼ T(s, a). [sent-205, score-0.247]

62 The first state occupied by the agent may be chosen arbitrarily. [sent-207, score-0.252]

63 For any policy π, let VπM(s) (QπM(s, a)) denote the discounted, infinite-horizon value (action-value) function for π in M (which may be omitted from the notation) from state s. [sent-210, score-0.286]

64 Specifically, let st and rt be the tth encountered state and received reward, respectively, resulting from execution of policy π in some MDP M from state s0. [sent-211, score-0.42]

65 The optimal policy is denoted π∗ and has value functions V∗M(s) and Q∗M(s, a). [sent-213, score-0.152]

66 Note that a policy cannot have a value greater than vmax := 1/(1 − γ), by the assumption of a maximum reward of 1. [sent-214, score-0.341]
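
For reference, the standard discounted definitions these sentences rely on can be written out as follows (standard notation; the conditioning on π means the trajectory is generated by following π from the start state):

```latex
V^{\pi}_{M}(s) = \mathbb{E}\!\left[\sum_{j=0}^{\infty} \gamma^{j} r_{j} \,\middle|\, s_{0}=s,\ \pi\right],
\qquad
Q^{\pi}_{M}(s,a) = \mathbb{E}\!\left[\sum_{j=0}^{\infty} \gamma^{j} r_{j} \,\middle|\, s_{0}=s,\ a_{0}=a,\ \pi\right],
\qquad
V^{*}_{M}(s) = \sup_{\pi} V^{\pi}_{M}(s) \le v_{\max} := \frac{1}{1-\gamma}.
```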

67 2 Algorithm First, we discuss how to use an admissible learning algorithm for KLRP to construct an MDP model. [sent-216, score-0.174]

68 Given a fixed state-action pair (s, a), we need to estimate the next-state distribution of the MDP from past experience, which consists of input state-action pairs (transformed by the nonlinear function φ) and output next states. [sent-218, score-0.146]

69 The learned model uses an estimate of the matrix M whose ith row is equal to the transpose of the approximate parameter vector θ̂ maintained by the ith KLRP instance. If any instance of our KLRP algorithm predicts ∅ for state-action pair (s, a), then we cannot estimate the next-state distribution. [sent-224, score-0.342]

70 Following the terminology introduced by Kearns and Singh (2002), we call such a state (state-action) an “unknown” state (state-action) and we ensure that the value function of our model assigns vmax (maximum possible) to state s. [sent-226, score-0.478]

71 The standard way to satisfy this condition for finite MDPs is to make the transition function for action a from state s a self-loop with reward 1 (yielding a value of vmax = 1/(1 − γ) for state s). [sent-227, score-0.646]

72 We can effect the exact same result in a continuous MDP by adding a component to each state vector s and to each vector φ(s, a) for every state-action pair (s, a). [sent-228, score-0.246]

73 We add an additional row and column to M that preserves this extra component (during the transformation from φ(s, a) to the next state s′) and otherwise doesn’t change the next-state distribution. [sent-230, score-0.223]

74 Finally, we give a reward of 1 to any unknown state, leaving rewards for the known states unchanged. [sent-231, score-0.161]
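
One way to read the augmentation described above is sketched below: the estimated dynamics matrix gains one extra row and column that simply carry an appended constant component through each transition. The exact layout is an assumption, not code from the paper:

```python
import numpy as np

def augment_model(M_hat: np.ndarray) -> np.ndarray:
    """Append one row and one column to an estimated dynamics matrix so that an
    extra constant component of the state and feature vectors is carried through
    every transition unchanged; the layout is one reading of the construction in
    the text, not code from the paper."""
    n_s, n = M_hat.shape
    M_aug = np.zeros((n_s + 1, n + 1))
    M_aug[:n_s, :n] = M_hat       # original dynamics are unaffected
    M_aug[-1, -1] = 1.0           # the appended component maps to itself
    return M_aug

# Unknown state-actions (those for which some KLRP instance outputs ∅) are then
# modelled as reward-1 self-loops, giving them the optimistic value 1/(1 - γ).
```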

75 Theorem 2 For any ε and δ, the KWIK-RMAX algorithm executes a policy that is not ε-optimal on at most a polynomial (in n, nS, 1/ε, 1/δ, and 1/(1 − γ)) number of steps, with probability at least 1 − δ. [sent-233, score-0.31]

76 The algorithm can be modified to deal with bounds (on the norms of the rows of M) that are larger than one. [sent-234, score-0.142]

77 This is easily dealt with by ignoring any extremely large (or small) outputs and showing that the resulting norm of the truncated normal distribution learned by each instance Ai is very close to the norm of the untruncated distribution. [sent-236, score-0.171]

78 Algorithm 2 (KWIK-RMAX Algorithm) 0: Inputs: nS, nA, n, R, φ(·, ·), σ, γ, ε, δ, and admissible learning algorithm ModelLearn. [sent-237, score-0.174]

79 1: for i ∈ {1, …, nS} do 2: Initialize a new instantiation of ModelLearn, denoted Ai, with inputs Cε(1−γ)²/√n and δ/nS for the inputs ε and δ, respectively, of Definition 2, where C is some constant determined by the analysis. [sent-241, score-0.154]

80 3: end for 4: Initialize an MDP Model with state space S, action space A, reward function R, discount factor γ, and transition function specified by Ai for i ∈ {1, …, nS}. [sent-242, score-0.436]

81 7: Choose action a := π̂∗(s), where π̂∗ is the optimal policy of the MDP Model. [sent-247, score-0.292]

82 8: Let s′ be the next state after executing action a. [sent-248, score-0.274]
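
Putting Algorithm 2 together, a hedged sketch of the main loop is shown below; the environment interface, the factory for the KLRP learners Ai, and the planner plan_optimal_policy are placeholders I introduce for illustration (the paper leaves the planner abstract, as discussed after the analysis):

```python
import numpy as np

def kwik_rmax(env, phi, n_s, model_learn_factory, plan_optimal_policy, T):
    """Hedged sketch of the KWIK-RMAX loop.  `model_learn_factory()` builds one
    admissible KLRP learner A_i per state dimension; `plan_optimal_policy` stands
    in for the (approximate) planner, which the paper leaves abstract.  The
    environment interface and all names here are illustrative."""
    learners = [model_learn_factory() for _ in range(n_s)]   # A_1, ..., A_{n_S}
    s = env.reset()
    for _ in range(T):
        def predict_next(state, action):
            feats = phi(state, action)
            preds = [A.predict(feats) for A in learners]
            if any(p is None for p in preds):    # some dimension is unknown
                return None                      # -> treat (state, action) as unknown
            return np.array(preds)
        model = {"predict_next": predict_next, "reward": env.known_reward}
        policy = plan_optimal_policy(model)      # unknown pairs get value v_max
        a = policy(s)
        s_next, _ = env.step(a)                  # reward function is known, ignored here
        feats = phi(s, a)
        for i, A in enumerate(learners):         # A_i learns the ith row of M
            A.update(feats, float(s_next[i]))
        s = s_next
```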

83 3 Analysis Proof sketch (of Theorem 2): It can be shown that, with high probability, policy π̂∗ is either an ε-optimal policy (Vπ̂∗(s) ≥ V∗(s) − ε) or it is very likely to lead to an unknown state. [sent-255, score-0.352]

84 In any case, discretization of the state space can be used, which yields computational complexity that is exponential in the number of (state and action) dimensions of the problem, similar to the work of Chow and Tsitsiklis (1991). [sent-262, score-0.221]

85 Alternatively, sparse sampling can be used, whose complexity has no dependence on the size of the state space but depends exponentially on the time horizon (≈ 1/(1 − γ)) (Kearns et al.). [sent-263, score-0.221]

86 5 Related Work The general exploration problem in continuous state spaces was considered by Kakade et al. (2003). [sent-268, score-0.323]

87 Applying Kakade et al.’s (2003) algorithm to linearly parameterized MDPs results in an algorithm whose sample complexity scales exponentially, rather than polynomially, with the state-space dimension. [sent-271, score-0.203]

88 Reinforcement learning in continuous MDPs with linear dynamics was studied by Fiechter (1997). [sent-273, score-0.17]

89 However, an exact linear relationship between the current state and next state is required for this analysis to go through, while we allow the current state to be transformed (for instance, adding non-linear state features) through non-linear function φ. [sent-274, score-0.595]

90 Furthermore, Fiechter’s algorithm relied on the existence of a “reset” action and a specific form of reward function. [sent-275, score-0.311]

91 These assumptions admit a solution that follows a fixed policy and doesn’t depend on the actual history of the agent or the underlying MDP. [sent-276, score-0.27]

92 In that work, a provably efficient algorithm was developed in the apprenticeship RL setting. [sent-278, score-0.178]

93 In this setting, the algorithm is given limited access (polynomial number of calls) to a fixed policy (called the teacher’s policy). [sent-279, score-0.21]

94 With high probability, a policy is learned that is nearly as good as the teacher’s policy. [sent-280, score-0.152]

95 Although this framework is interesting and perhaps more useful for certain applications (such as helicopter flying), it requires a priori expert knowledge (to construct the teacher) and alleviates the problem of exploration altogether. [sent-281, score-0.196]

96 Conclusion We have provided a provably efficient RL algorithm that learns a very rich and important class of MDPs with continuous state and action spaces. [sent-283, score-0.527]

97 Our RL algorithm utilized a specific online linear regression algorithm. [sent-285, score-0.323]

98 We have identified certain interesting and general properties (see Definition 2) of this particular algorithm that support online exploration. [sent-286, score-0.15]

99 We believe the approach used with linear regression can be repeated for other important classes, but we leave the details as interesting future work. [sent-289, score-0.173]

100 R-MAX—a general polynomial time algorithm for near-optimal reinforcement learning. [sent-307, score-0.271]


similar papers computed by tfidf model

tfidf for this paper:

wordName wordTfidf (topN-words)

[('mdps', 0.258), ('rl', 0.231), ('xt', 0.219), ('mdp', 0.217), ('klrp', 0.204), ('kwik', 0.204), ('rn', 0.167), ('policy', 0.152), ('action', 0.14), ('ns', 0.138), ('state', 0.134), ('yt', 0.134), ('kearns', 0.133), ('timesteps', 0.13), ('agent', 0.118), ('admissible', 0.116), ('regression', 0.114), ('exploration', 0.114), ('reinforcement', 0.113), ('reward', 0.113), ('abbeel', 0.111), ('auer', 0.107), ('ith', 0.107), ('kakade', 0.102), ('polynomial', 0.1), ('ng', 0.096), ('online', 0.092), ('fiechter', 0.087), ('mbrl', 0.087), ('yi', 0.087), ('transpose', 0.086), ('prediction', 0.083), ('output', 0.082), ('helicopter', 0.082), ('timestep', 0.078), ('inputs', 0.077), ('provably', 0.077), ('vmax', 0.076), ('brafman', 0.076), ('tennenholtz', 0.076), ('continuous', 0.075), ('ai', 0.074), ('lemma', 0.073), ('singh', 0.072), ('satisfying', 0.071), ('teacher', 0.069), ('xi', 0.068), ('vm', 0.065), ('input', 0.064), ('valid', 0.063), ('pseudocode', 0.061), ('linear', 0.059), ('backgammon', 0.058), ('rns', 0.058), ('seber', 0.058), ('reset', 0.058), ('barto', 0.058), ('algorithm', 0.058), ('vn', 0.056), ('outputs', 0.054), ('row', 0.052), ('ying', 0.051), ('append', 0.051), ('transition', 0.049), ('ln', 0.048), ('unknown', 0.048), ('complexity', 0.048), ('linearity', 0.047), ('rutgers', 0.046), ('chow', 0.046), ('loan', 0.046), ('tesauro', 0.046), ('na', 0.046), ('norms', 0.046), ('ym', 0.046), ('eigenvalues', 0.045), ('estimator', 0.044), ('learns', 0.043), ('sutton', 0.043), ('apprenticeship', 0.043), ('trusted', 0.043), ('golub', 0.043), ('strehl', 0.043), ('discounted', 0.042), ('sketch', 0.042), ('var', 0.042), ('ps', 0.041), ('tsitsiklis', 0.041), ('squared', 0.041), ('ut', 0.04), ('norm', 0.04), ('whose', 0.039), ('ahead', 0.039), ('discretization', 0.039), ('planning', 0.039), ('rows', 0.038), ('xm', 0.038), ('dealt', 0.037), ('component', 0.037), ('dynamics', 0.036)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 1.0000006 148 nips-2007-Online Linear Regression and Its Application to Model-Based Reinforcement Learning

Author: Alexander L. Strehl, Michael L. Littman

Abstract: We provide a provably efficient algorithm for learning Markov Decision Processes (MDPs) with continuous state and action spaces in the online setting. Specifically, we take a model-based approach and show that a special type of online linear regression allows us to learn MDPs with (possibly kernelized) linearly parameterized dynamics. This result builds on Kearns and Singh’s work that provides a provably efficient algorithm for finite state MDPs. Our approach is not restricted to the linear setting, and is applicable to other classes of continuous MDPs.

2 0.23046651 91 nips-2007-Fitted Q-iteration in continuous action-space MDPs

Author: András Antos, Csaba Szepesvári, Rémi Munos

Abstract: We consider continuous state, continuous action batch reinforcement learning where the goal is to learn a good policy from a sufficiently rich trajectory generated by some policy. We study a variant of fitted Q-iteration, where the greedy action selection is replaced by searching for a policy in a restricted set of candidate policies by maximizing the average action values. We provide a rigorous analysis of this algorithm, proving what we believe is the first finite-time bound for value-function based algorithms for continuous state and action problems.

3 0.22742474 5 nips-2007-A Game-Theoretic Approach to Apprenticeship Learning

Author: Umar Syed, Robert E. Schapire

Abstract: We study the problem of an apprentice learning to behave in an environment with an unknown reward function by observing the behavior of an expert. We follow on the work of Abbeel and Ng [1] who considered a framework in which the true reward function is assumed to be a linear combination of a set of known and observable features. We give a new algorithm that, like theirs, is guaranteed to learn a policy that is nearly as good as the expert’s, given enough examples. However, unlike their algorithm, we show that ours may produce a policy that is substantially better than the expert’s. Moreover, our algorithm is computationally faster, is easier to implement, and can be applied even in the absence of an expert. The method is based on a game-theoretic view of the problem, which leads naturally to a direct application of the multiplicative-weights algorithm of Freund and Schapire [2] for playing repeated matrix games. In addition to our formal presentation and analysis of the new algorithm, we sketch how the method can be applied when the transition function itself is unknown, and we provide an experimental demonstration of the algorithm on a toy video-game environment. 1

4 0.22350986 168 nips-2007-Reinforcement Learning in Continuous Action Spaces through Sequential Monte Carlo Methods

Author: Alessandro Lazaric, Marcello Restelli, Andrea Bonarini

Abstract: Learning in real-world domains often requires to deal with continuous state and action spaces. Although many solutions have been proposed to apply Reinforcement Learning algorithms to continuous state problems, the same techniques can be hardly extended to continuous action spaces, where, besides the computation of a good approximation of the value function, a fast method for the identification of the highest-valued action is needed. In this paper, we propose a novel actor-critic approach in which the policy of the actor is estimated through sequential Monte Carlo methods. The importance sampling step is performed on the basis of the values learned by the critic, while the resampling step modifies the actor’s policy. The proposed approach has been empirically compared to other learning algorithms into several domains; in this paper, we report results obtained in a control problem consisting of steering a boat across a river. 1

5 0.1746095 98 nips-2007-Hierarchical Apprenticeship Learning with Application to Quadruped Locomotion

Author: J. Z. Kolter, Pieter Abbeel, Andrew Y. Ng

Abstract: We consider apprenticeship learning—learning from expert demonstrations—in the setting of large, complex domains. Past work in apprenticeship learning requires that the expert demonstrate complete trajectories through the domain. However, in many problems even an expert has difficulty controlling the system, which makes this approach infeasible. For example, consider the task of teaching a quadruped robot to navigate over extreme terrain; demonstrating an optimal policy (i.e., an optimal set of foot locations over the entire terrain) is a highly non-trivial task, even for an expert. In this paper we propose a method for hierarchical apprenticeship learning, which allows the algorithm to accept isolated advice at different hierarchical levels of the control task. This type of advice is often feasible for experts to give, even if the expert is unable to demonstrate complete trajectories. This allows us to extend the apprenticeship learning paradigm to much larger, more challenging domains. In particular, in this paper we apply the hierarchical apprenticeship learning algorithm to the task of quadruped locomotion over extreme terrain, and achieve, to the best of our knowledge, results superior to any previously published work. 1

6 0.16827755 34 nips-2007-Bayesian Policy Learning with Trans-Dimensional MCMC

7 0.162398 146 nips-2007-On higher-order perceptron algorithms

8 0.16016218 102 nips-2007-Incremental Natural Actor-Critic Algorithms

9 0.15128137 151 nips-2007-Optimistic Linear Programming gives Logarithmic Regret for Irreducible MDPs

10 0.15085325 21 nips-2007-Adaptive Online Gradient Descent

11 0.14877999 124 nips-2007-Managing Power Consumption and Performance of Computing Systems Using Reinforcement Learning

12 0.13846695 17 nips-2007-A neural network implementing optimal state estimation based on dynamic spike train decoding

13 0.13242815 30 nips-2007-Bayes-Adaptive POMDPs

14 0.12844783 162 nips-2007-Random Sampling of States in Dynamic Programming

15 0.12550761 204 nips-2007-Theoretical Analysis of Heuristic Search Methods for Online POMDPs

16 0.12369841 185 nips-2007-Stable Dual Dynamic Programming

17 0.10624776 215 nips-2007-What makes some POMDP problems easy to approximate?

18 0.10241339 100 nips-2007-Hippocampal Contributions to Control: The Third Way

19 0.1020698 177 nips-2007-Simplified Rules and Theoretical Analysis for Information Bottleneck Optimization and PCA with Spiking Neurons

20 0.099713713 194 nips-2007-The Epoch-Greedy Algorithm for Multi-armed Bandits with Side Information


similar papers computed by lsi model

lsi for this paper:

topicId topicWeight

[(0, -0.313), (1, -0.353), (2, 0.09), (3, 0.017), (4, -0.108), (5, 0.049), (6, 0.033), (7, -0.076), (8, -0.032), (9, 0.073), (10, 0.102), (11, 0.02), (12, 0.016), (13, 0.089), (14, -0.075), (15, 0.047), (16, -0.046), (17, -0.027), (18, 0.033), (19, -0.052), (20, 0.053), (21, -0.042), (22, -0.062), (23, -0.056), (24, -0.036), (25, -0.054), (26, 0.06), (27, -0.002), (28, -0.026), (29, 0.094), (30, 0.005), (31, 0.089), (32, -0.136), (33, 0.095), (34, -0.087), (35, -0.019), (36, -0.049), (37, 0.014), (38, -0.034), (39, 0.045), (40, 0.053), (41, -0.031), (42, -0.032), (43, -0.014), (44, 0.014), (45, -0.044), (46, -0.094), (47, -0.055), (48, 0.014), (49, -0.028)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.95496172 148 nips-2007-Online Linear Regression and Its Application to Model-Based Reinforcement Learning

Author: Alexander L. Strehl, Michael L. Littman

Abstract: We provide a provably efficient algorithm for learning Markov Decision Processes (MDPs) with continuous state and action spaces in the online setting. Specifically, we take a model-based approach and show that a special type of online linear regression allows us to learn MDPs with (possibly kernelized) linearly parameterized dynamics. This result builds on Kearns and Singh’s work that provides a provably efficient algorithm for finite state MDPs. Our approach is not restricted to the linear setting, and is applicable to other classes of continuous MDPs.

2 0.91145271 91 nips-2007-Fitted Q-iteration in continuous action-space MDPs

Author: András Antos, Csaba Szepesvári, Rémi Munos

Abstract: We consider continuous state, continuous action batch reinforcement learning where the goal is to learn a good policy from a sufficiently rich trajectory generated by some policy. We study a variant of fitted Q-iteration, where the greedy action selection is replaced by searching for a policy in a restricted set of candidate policies by maximizing the average action values. We provide a rigorous analysis of this algorithm, proving what we believe is the first finite-time bound for value-function based algorithms for continuous state and action problems.
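
A minimal sketch of generic least-squares fitted Q-iteration on a batch of transitions, with a linear-in-features critic and a small grid of candidate actions standing in for the policy-search step described in the abstract; the feature map, action grid, and all names are illustrative assumptions.

```python
import numpy as np

def featurize(s, a):
    # illustrative quadratic features over a scalar state and scalar action
    return np.array([1.0, s, a, s * a, s * s, a * a])

def fitted_q_iteration(transitions, action_grid, gamma=0.95, n_iters=50, ridge=1e-3):
    """transitions: list of (s, a, r, s_next) tuples collected by some behavior policy."""
    dim = featurize(0.0, 0.0).shape[0]
    w = np.zeros(dim)
    Phi = np.array([featurize(s, a) for (s, a, _, _) in transitions])
    for _ in range(n_iters):
        # regression targets: r + gamma * max_a' Q(s', a') under the current weights
        targets = np.array([
            r + gamma * max(featurize(s_next, ap) @ w for ap in action_grid)
            for (_, _, r, s_next) in transitions
        ])
        # least-squares fit of the next iterate (ridge term for numerical stability)
        w = np.linalg.solve(Phi.T @ Phi + ridge * np.eye(dim), Phi.T @ targets)
    return w

def greedy_action(w, s, action_grid):
    return max(action_grid, key=lambda a: featurize(s, a) @ w)

# toy usage: transitions from a noisy linear system s' = 0.9*s + a + noise
rng = np.random.default_rng(0)
data, s = [], 0.0
for _ in range(500):
    a = rng.uniform(-1.0, 1.0)
    s_next = 0.9 * s + a + 0.05 * rng.normal()
    data.append((s, a, -s_next ** 2, s_next))   # reward for staying near the origin
    s = s_next
w = fitted_q_iteration(data, action_grid=np.linspace(-1.0, 1.0, 11))
print(greedy_action(w, 1.0, np.linspace(-1.0, 1.0, 11)))   # greedy action at s = 1.0
```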

3 0.73866057 5 nips-2007-A Game-Theoretic Approach to Apprenticeship Learning

Author: Umar Syed, Robert E. Schapire

Abstract: We study the problem of an apprentice learning to behave in an environment with an unknown reward function by observing the behavior of an expert. We follow on the work of Abbeel and Ng [1] who considered a framework in which the true reward function is assumed to be a linear combination of a set of known and observable features. We give a new algorithm that, like theirs, is guaranteed to learn a policy that is nearly as good as the expert’s, given enough examples. However, unlike their algorithm, we show that ours may produce a policy that is substantially better than the expert’s. Moreover, our algorithm is computationally faster, is easier to implement, and can be applied even in the absence of an expert. The method is based on a game-theoretic view of the problem, which leads naturally to a direct application of the multiplicative-weights algorithm of Freund and Schapire [2] for playing repeated matrix games. In addition to our formal presentation and analysis of the new algorithm, we sketch how the method can be applied when the transition function itself is unknown, and we provide an experimental demonstration of the algorithm on a toy video-game environment. 1
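
The multiplicative-weights routine mentioned in the abstract, sketched for a generic zero-sum matrix game; the game matrix, learning rate, and best-responding opponent below are illustrative assumptions rather than the paper's apprenticeship-learning construction.

```python
import numpy as np

def multiplicative_weights(G, n_rounds=500, eta=0.1):
    """Row player runs multiplicative weights against a best-responding column player.

    G[i, j] is the row player's payoff; returns the time-averaged mixed strategy,
    which approximates a maximin strategy of the zero-sum game.
    """
    n_rows, _ = G.shape
    w = np.ones(n_rows)
    avg_strategy = np.zeros(n_rows)
    for _ in range(n_rounds):
        p = w / w.sum()                    # current mixed strategy
        j = int(np.argmin(p @ G))          # column player best-responds (minimizes row payoff)
        w *= np.exp(eta * G[:, j])         # reward-based multiplicative update
        avg_strategy += p
    return avg_strategy / n_rounds

# toy usage: matching pennies
G = np.array([[1.0, -1.0], [-1.0, 1.0]])
print(multiplicative_weights(G))           # approaches the (0.5, 0.5) equilibrium strategy
```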

4 0.71322197 124 nips-2007-Managing Power Consumption and Performance of Computing Systems Using Reinforcement Learning

Author: Gerald Tesauro, Rajarshi Das, Hoi Chan, Jeffrey Kephart, David Levine, Freeman Rawson, Charles Lefurgy

Abstract: Electrical power management in large-scale IT systems such as commercial datacenters is an application area of rapidly growing interest from both an economic and ecological perspective, with billions of dollars and millions of metric tons of CO2 emissions at stake annually. Businesses want to save power without sacrificing performance. This paper presents a reinforcement learning approach to simultaneous online management of both performance and power consumption. We apply RL in a realistic laboratory testbed using a Blade cluster and dynamically varying HTTP workload running on a commercial web applications middleware platform. We embed a CPU frequency controller in the Blade servers’ firmware, and we train policies for this controller using a multi-criteria reward signal depending on both application performance and CPU power consumption. Our testbed scenario posed a number of challenges to successful use of RL, including multiple disparate reward functions, limited decision sampling rates, and pathologies arising when using multiple sensor readings as state variables. We describe innovative practical solutions to these challenges, and demonstrate clear performance improvements over both hand-designed policies as well as obvious “cookbook” RL implementations. 1

5 0.69542468 98 nips-2007-Hierarchical Apprenticeship Learning with Application to Quadruped Locomotion

Author: J. Z. Kolter, Pieter Abbeel, Andrew Y. Ng

Abstract: We consider apprenticeship learning—learning from expert demonstrations—in the setting of large, complex domains. Past work in apprenticeship learning requires that the expert demonstrate complete trajectories through the domain. However, in many problems even an expert has difficulty controlling the system, which makes this approach infeasible. For example, consider the task of teaching a quadruped robot to navigate over extreme terrain; demonstrating an optimal policy (i.e., an optimal set of foot locations over the entire terrain) is a highly non-trivial task, even for an expert. In this paper we propose a method for hierarchical apprenticeship learning, which allows the algorithm to accept isolated advice at different hierarchical levels of the control task. This type of advice is often feasible for experts to give, even if the expert is unable to demonstrate complete trajectories. This allows us to extend the apprenticeship learning paradigm to much larger, more challenging domains. In particular, in this paper we apply the hierarchical apprenticeship learning algorithm to the task of quadruped locomotion over extreme terrain, and achieve, to the best of our knowledge, results superior to any previously published work. 1

6 0.66653657 168 nips-2007-Reinforcement Learning in Continuous Action Spaces through Sequential Monte Carlo Methods

7 0.61576015 146 nips-2007-On higher-order perceptron algorithms

8 0.60429913 162 nips-2007-Random Sampling of States in Dynamic Programming

9 0.57832932 52 nips-2007-Competition Adds Complexity

10 0.54575491 151 nips-2007-Optimistic Linear Programming gives Logarithmic Regret for Irreducible MDPs

11 0.5423941 34 nips-2007-Bayesian Policy Learning with Trans-Dimensional MCMC

12 0.538899 185 nips-2007-Stable Dual Dynamic Programming

13 0.52916729 21 nips-2007-Adaptive Online Gradient Descent

14 0.4923546 102 nips-2007-Incremental Natural Actor-Critic Algorithms

15 0.47260305 17 nips-2007-A neural network implementing optimal state estimation based on dynamic spike train decoding

16 0.4722974 163 nips-2007-Receding Horizon Differential Dynamic Programming

17 0.46045813 48 nips-2007-Collective Inference on Markov Models for Modeling Bird Migration

18 0.45675457 68 nips-2007-Discovering Weakly-Interacting Factors in a Complex Stochastic Process

19 0.45227903 191 nips-2007-Temporal Difference Updating without a Learning Rate

20 0.44279116 171 nips-2007-Scan Strategies for Meteorological Radars


similar papers computed by lda model

lda for this paper:

topicId topicWeight

[(5, 0.048), (13, 0.058), (16, 0.022), (19, 0.013), (21, 0.068), (31, 0.015), (34, 0.043), (35, 0.037), (47, 0.137), (49, 0.039), (56, 0.234), (83, 0.125), (85, 0.018), (87, 0.012), (90, 0.059)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.79481196 148 nips-2007-Online Linear Regression and Its Application to Model-Based Reinforcement Learning

Author: Alexander L. Strehl, Michael L. Littman

Abstract: We provide a provably efficient algorithm for learning Markov Decision Processes (MDPs) with continuous state and action spaces in the online setting. Specifically, we take a model-based approach and show that a special type of online linear regression allows us to learn MDPs with (possibly kernelized) linearly parameterized dynamics. This result builds on Kearns and Singh’s work that provides a provably efficient algorithm for finite state MDPs. Our approach is not restricted to the linear setting, and is applicable to other classes of continuous MDPs.

2 0.66542625 168 nips-2007-Reinforcement Learning in Continuous Action Spaces through Sequential Monte Carlo Methods

Author: Alessandro Lazaric, Marcello Restelli, Andrea Bonarini

Abstract: Learning in real-world domains often requires dealing with continuous state and action spaces. Although many solutions have been proposed to apply Reinforcement Learning algorithms to continuous state problems, the same techniques can hardly be extended to continuous action spaces, where, besides the computation of a good approximation of the value function, a fast method for the identification of the highest-valued action is needed. In this paper, we propose a novel actor-critic approach in which the policy of the actor is estimated through sequential Monte Carlo methods. The importance sampling step is performed on the basis of the values learned by the critic, while the resampling step modifies the actor’s policy. The proposed approach has been empirically compared to other learning algorithms in several domains; in this paper, we report results obtained in a control problem consisting of steering a boat across a river.
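
A minimal sketch of a particle-style actor update in the spirit of the abstract: candidate actions are reweighted by critic values (the importance-sampling step) and then resampled with jitter; the softmax weighting, Gaussian jitter, and critic interface are assumptions made for illustration.

```python
import numpy as np

def smc_actor_step(particles, q_values, temperature=1.0, jitter=0.05, rng=None):
    """One resampling step for a particle-based actor in a 1-D action space.

    particles: array of candidate actions; q_values: critic estimates Q(s, a_i).
    Returns a new particle set concentrated on high-value actions.
    """
    rng = np.random.default_rng() if rng is None else rng
    logits = np.asarray(q_values) / temperature
    weights = np.exp(logits - logits.max())       # softmax weights from critic values
    weights /= weights.sum()
    idx = rng.choice(len(particles), size=len(particles), p=weights)   # resample
    return particles[idx] + jitter * rng.normal(size=len(particles))   # perturb to keep diversity

# toy usage: a critic that prefers actions near 0.7
rng = np.random.default_rng(1)
actions = rng.uniform(-1.0, 1.0, size=100)
for _ in range(10):
    actions = smc_actor_step(actions, -np.abs(actions - 0.7), rng=rng)
print(actions.mean())   # the particle cloud drifts toward 0.7
```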

3 0.6643545 86 nips-2007-Exponential Family Predictive Representations of State

Author: David Wingate, Satinder S. Baveja

Abstract: In order to represent state in controlled, partially observable, stochastic dynamical systems, some sort of sufficient statistic for history is necessary. Predictive representations of state (PSRs) capture state as statistics of the future. We introduce a new model of such systems called the “Exponential family PSR,” which defines as state the time-varying parameters of an exponential family distribution which models n sequential observations in the future. This choice of state representation explicitly connects PSRs to state-of-the-art probabilistic modeling, which allows us to take advantage of current efforts in high-dimensional density estimation, and in particular, graphical models and maximum entropy models. We present a parameter learning algorithm based on maximum likelihood, and we show how a variety of current approximate inference methods apply. We evaluate the quality of our model with reinforcement learning by directly evaluating the control performance of the model. 1

4 0.66186959 151 nips-2007-Optimistic Linear Programming gives Logarithmic Regret for Irreducible MDPs

Author: Ambuj Tewari, Peter L. Bartlett

Abstract: We present an algorithm called Optimistic Linear Programming (OLP) for learning to optimize average reward in an irreducible but otherwise unknown Markov decision process (MDP). OLP uses its experience so far to estimate the MDP. It chooses actions by optimistically maximizing estimated future rewards over a set of next-state transition probabilities that are close to the estimates, a computation that corresponds to solving linear programs. We show that the total expected reward obtained by OLP up to time T is within C(P ) log T of the reward obtained by the optimal policy, where C(P ) is an explicit, MDP-dependent constant. OLP is closely related to an algorithm proposed by Burnetas and Katehakis with four key differences: OLP is simpler, it does not require knowledge of the supports of transition probabilities, the proof of the regret bound is simpler, but our regret bound is a constant factor larger than the regret of their algorithm. OLP is also similar in flavor to an algorithm recently proposed by Auer and Ortner. But OLP is simpler and its regret bound has a better dependence on the size of the MDP. 1
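
A minimal sketch of the kind of inner optimistic step the abstract describes: maximizing the expected next-state value over transition probability vectors within an L1 ball around the empirical estimate, posed as a linear program; the L1 radius and the SciPy-based formulation are illustrative assumptions, not the OLP algorithm itself.

```python
import numpy as np
from scipy.optimize import linprog

def optimistic_value(p_hat, v, radius):
    """Maximize sum_s p[s]*v[s] over probability vectors p with ||p - p_hat||_1 <= radius.

    Uses auxiliary variables t[s] >= |p[s] - p_hat[s]| so the L1 constraint is linear.
    """
    n = len(p_hat)
    c = np.concatenate([-np.asarray(v, dtype=float), np.zeros(n)])   # minimize -v.p
    A_ub = np.zeros((2 * n + 1, 2 * n))
    b_ub = np.zeros(2 * n + 1)
    for s in range(n):
        A_ub[s, s], A_ub[s, n + s] = 1.0, -1.0            # p_s - t_s <= p_hat_s
        b_ub[s] = p_hat[s]
        A_ub[n + s, s], A_ub[n + s, n + s] = -1.0, -1.0   # -p_s - t_s <= -p_hat_s
        b_ub[n + s] = -p_hat[s]
    A_ub[2 * n, n:] = 1.0                                 # sum_s t_s <= radius
    b_ub[2 * n] = radius
    A_eq = np.concatenate([np.ones(n), np.zeros(n)]).reshape(1, -1)   # sum_s p_s = 1
    b_eq = np.array([1.0])
    res = linprog(c, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=b_eq,
                  bounds=[(0, None)] * (2 * n))
    return -res.fun, res.x[:n]

# toy usage: the empirical estimate slightly favors a low-value next state
val, p_opt = optimistic_value(p_hat=[0.6, 0.4], v=[0.0, 1.0], radius=0.2)
print(val, p_opt)   # probability mass shifts toward the high-value state within the L1 ball
```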

5 0.66094476 138 nips-2007-Near-Maximum Entropy Models for Binary Neural Representations of Natural Images

Author: Matthias Bethge, Philipp Berens

Abstract: Maximum entropy analysis of binary variables provides an elegant way for studying the role of pairwise correlations in neural populations. Unfortunately, these approaches suffer from their poor scalability to high dimensions. In sensory coding, however, high-dimensional data is ubiquitous. Here, we introduce a new approach using a near-maximum entropy model that makes this type of analysis feasible for very high-dimensional data: the model parameters can be derived in closed form and sampling is easy. Therefore, our NearMaxEnt approach can serve as a tool for testing predictions from a pairwise maximum entropy model not only for low-dimensional marginals, but also for high-dimensional measurements of more than a thousand units. We demonstrate its usefulness by studying natural images with dichotomized pixel intensities. Our results indicate that the statistics of such higher-dimensional measurements exhibit additional structure that is not predicted by pairwise correlations, despite the fact that pairwise correlations explain the lower-dimensional marginal statistics surprisingly well, up to the limit of dimensionality where estimation of the full joint distribution is feasible.
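
A rough sketch of one common way to obtain correlated binary patterns of this kind: threshold a latent multivariate Gaussian, so that the binary means and pairwise correlations are controlled by the latent mean and covariance; the specific parameters below are illustrative assumptions, not the authors' fitted model.

```python
import numpy as np

def sample_thresholded_gaussian(mean, cov, n_samples, rng=None):
    """Draw binary patterns by thresholding a latent multivariate Gaussian at zero.

    The binary marginals and pairwise correlations are set implicitly by `mean`
    and `cov`; fitting them to match measured statistics is omitted in this sketch.
    """
    rng = np.random.default_rng() if rng is None else rng
    z = rng.multivariate_normal(mean, cov, size=n_samples)
    return (z > 0.0).astype(int)

# toy usage: three positively correlated "pixels"
mean = np.zeros(3)
cov = np.array([[1.0, 0.5, 0.5],
                [0.5, 1.0, 0.5],
                [0.5, 0.5, 1.0]])
samples = sample_thresholded_gaussian(mean, cov, n_samples=10000,
                                      rng=np.random.default_rng(2))
print(samples.mean(axis=0))     # marginal "on" probabilities, close to 0.5
print(np.corrcoef(samples.T))   # positive pairwise correlations
```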

6 0.65949255 63 nips-2007-Convex Relaxations of Latent Variable Training

7 0.6586991 18 nips-2007-A probabilistic model for generating realistic lip movements from speech

8 0.65832257 98 nips-2007-Hierarchical Apprenticeship Learning with Application to Quadruped Locomotion

9 0.65547729 115 nips-2007-Learning the 2-D Topology of Images

10 0.65333724 158 nips-2007-Probabilistic Matrix Factorization

11 0.65301728 43 nips-2007-Catching Change-points with Lasso

12 0.65295392 100 nips-2007-Hippocampal Contributions to Control: The Third Way

13 0.65212142 177 nips-2007-Simplified Rules and Theoretical Analysis for Information Bottleneck Optimization and PCA with Spiking Neurons

14 0.6508677 34 nips-2007-Bayesian Policy Learning with Trans-Dimensional MCMC

15 0.65046656 24 nips-2007-An Analysis of Inference with the Universum

16 0.64956129 69 nips-2007-Discriminative Batch Mode Active Learning

17 0.6486817 93 nips-2007-GRIFT: A graphical model for inferring visual classification features from human data

18 0.64810896 122 nips-2007-Locality and low-dimensions in the prediction of natural experience from fMRI

19 0.64619988 45 nips-2007-Classification via Minimum Incremental Coding Length (MICL)

20 0.64597058 134 nips-2007-Multi-Task Learning via Conic Programming