nips nips2007 nips2007-5 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Umar Syed, Robert E. Schapire
Abstract: We study the problem of an apprentice learning to behave in an environment with an unknown reward function by observing the behavior of an expert. We follow on the work of Abbeel and Ng [1] who considered a framework in which the true reward function is assumed to be a linear combination of a set of known and observable features. We give a new algorithm that, like theirs, is guaranteed to learn a policy that is nearly as good as the expert’s, given enough examples. However, unlike their algorithm, we show that ours may produce a policy that is substantially better than the expert’s. Moreover, our algorithm is computationally faster, is easier to implement, and can be applied even in the absence of an expert. The method is based on a game-theoretic view of the problem, which leads naturally to a direct application of the multiplicative-weights algorithm of Freund and Schapire [2] for playing repeated matrix games. In addition to our formal presentation and analysis of the new algorithm, we sketch how the method can be applied when the transition function itself is unknown, and we provide an experimental demonstration of the algorithm on a toy video-game environment.
Reference: text
sentIndex sentText sentNum sentScore
1 We study the problem of an apprentice learning to behave in an environment with an unknown reward function by observing the behavior of an expert. [sent-6, score-0.542]
2 We follow on the work of Abbeel and Ng [1] who considered a framework in which the true reward function is assumed to be a linear combination of a set of known and observable features. [sent-7, score-0.212]
3 We give a new algorithm that, like theirs, is guaranteed to learn a policy that is nearly as good as the expert’s, given enough examples. [sent-8, score-0.506]
4 However, unlike their algorithm, we show that ours may produce a policy that is substantially better than the expert’s. [sent-9, score-0.468]
5 Once an MDP has been provided, the usual objective is to find a policy (i.e., [sent-15, score-0.468]
6 a mapping from states to actions) that maximizes expected cumulative reward collected by the agent. [sent-17, score-0.234]
7 One reason is that it is often hard to correctly describe the environment’s true reward function, and yet the behavior of the agent is quite sensitive to this description. [sent-19, score-0.273]
8 In practice, reward functions are frequently tweaked and tuned to elicit what is thought to be the desired behavior. [sent-20, score-0.212]
9 Instead of maximizing reward, another approach often taken is to observe and follow the behavior of an expert in the same environment. [sent-21, score-0.407]
10 Learning how to behave by observing an expert has been called apprenticeship learning, with the agent in the role of the apprentice. [sent-22, score-0.64]
11 In this framework, the reward function, while unknown to the apprentice, is assumed to be equal to a linear combination of a set of known features. [sent-24, score-0.245]
12 They argued that while it may be difficult to correctly describe the reward function, it is usually much easier to specify the features on which the reward function depends. [sent-25, score-0.465]
13 With this setting in mind, Abbeel and Ng [1] described an efficient algorithm that, given enough examples of the expert’s behavior, produces a policy that does at least as well as the expert with respect to the unknown reward function. [sent-26, score-1.176]
14 The number of examples their algorithm requires from the expert depends only moderately on the number of features. [sent-27, score-0.423]
15 While impressive, a drawback of their results is that the performance of the apprentice is both upper- and lower-bounded by the performance of the expert. [sent-28, score-0.177]
16 If the behavior of the expert is far from optimal, the same will hold for the apprentice. [sent-30, score-0.428]
17 We pose the problem as learning to play a two-player zero-sum game in which the apprentice chooses a policy, and the environment chooses a reward function. [sent-32, score-0.667]
18 The goal of the apprentice is to maximize performance relative to the expert, even though the reward function may be adversarially selected by the environment with respect to this goal. [sent-33, score-0.471]
19 A key property of our algorithm is that it is able to leverage prior beliefs about the relationship between the features and the reward function. [sent-34, score-0.276]
20 Specifically, if it is known whether a feature is “good” (related to reward) or “bad” (inversely related to reward), then the apprentice can use that knowledge to improve its performance. [sent-35, score-0.233]
21 As a result, our algorithm produces policies that can be significantly better than the expert’s policy with respect to the unknown reward function, while at the same time are guaranteed to be no worse. [sent-36, score-0.847]
22 It turns out that our apprenticeship learning setting can be viewed as a game with this property. [sent-39, score-0.355]
23 Moreover, our algorithm requires less computational expense: specifically, we are able to achieve their performance guarantee after only O(ln k) iterations, instead of O(k ln k) iterations, where k is the number of features on which the reward function depends. [sent-41, score-0.397]
24 In that case, our algorithm produces a policy that is optimal in a certain conservative sense. [sent-43, score-0.552]
25 Ratliff et al. [3] formulated a problem related to apprenticeship learning, in which the goal is to find a reward function whose optimal policy is similar to the expert’s policy. [sent-46, score-0.885]
26 Quite different from our work, mimicking the expert was an explicit goal of their approach. [sent-47, score-0.431]
27 We are given an infinite-horizon Markov Decision Process in which the reward function has been replaced by a set of features. [sent-49, score-0.212]
28 For any policy π in M, the value of π (with respect to the initial state distribution) is defined by $V(\pi) \triangleq E\left[\sum_{t=0}^{\infty} \gamma^t R^*(s_t) \,\middle|\, \pi, \theta, D\right]$. [sent-53, score-0.489]
29 Likewise, we say that a policy $\hat{\pi}$ is $\epsilon$-optimal for M if $|V(\hat{\pi}) - V(\pi^*)| \le \epsilon$, where $\pi^*$ is an optimal policy for M. [sent-59, score-0.963]
30 We also assume that there is a policy $\pi_E$, called the expert’s policy, which we are able to observe executing in M. [sent-62, score-0.484]
31 Following Abbeel and Ng [1], our goal is to find a policy π such that $V(\pi) \ge V(\pi_E) - \epsilon$, even though the true reward function $R^*$ is unknown. [sent-63, score-0.7]
32 We also have the additional goal of finding a policy when no observations from the expert’s policy are available. [sent-64, score-0.956]
33 In that case, we find a policy that is optimal in a certain conservative sense. [sent-65, score-0.514]
34 Like Abbeel and Ng [1], the policy we find will not necessarily be stationary, but will instead be a mixed policy. [sent-66, score-0.617]
35 A mixed policy ψ is a distribution over Π, the set of all deterministic stationary policies in M . [sent-67, score-0.764]
36 A mixed policy ψ is executed by randomly selecting the policy $\pi^i \in \Pi$ at time 0 with probability ψ(i), and exclusively following $\pi^i$ thereafter. [sent-73, score-1.085]
37 It should be noted that the definitions of value and feature expectations apply to mixed policies as well: $V(\psi) = E_{i \sim \psi}[V(\pi^i)]$ and $\mu(\psi) = E_{i \sim \psi}[\mu(\pi^i)]$. [sent-74, score-0.368]
38 Also note that mixed policies do not have any advantage over stationary policies in terms of value: if $\pi^*$ is an optimal stationary policy for M, and $\psi^*$ is an optimal mixed policy, then $V(\psi^*) = V(\pi^*)$. [sent-75, score-1.056]
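The definitions of a mixed policy's value and feature expectations can be illustrated with a minimal sketch. The function names and the toy numbers below are hypothetical and chosen only for illustration; the per-policy values and feature expectations are assumed to be already known.

```python
import random

def mixed_value(psi, values):
    """V(psi) = E_{i~psi}[V(pi_i)]: probability-weighted average of per-policy values."""
    return sum(p * v for p, v in zip(psi, values))

def mixed_feature_expectations(psi, mus):
    """mu(psi) = E_{i~psi}[mu(pi_i)], computed component by component."""
    k = len(mus[0])
    return [sum(p * mu[i] for p, mu in zip(psi, mus)) for i in range(k)]

def sample_policy_index(psi, rng=random):
    """Draw, once at time 0, the index of the policy to follow for the whole run."""
    return rng.choices(range(len(psi)), weights=psi, k=1)[0]

# Hypothetical toy numbers: two deterministic policies, two features.
psi = [0.25, 0.75]
values = [4.0, 8.0]
mus = [[1.0, 0.0], [0.0, 2.0]]
v_mix = mixed_value(psi, values)                # 0.25 * 4 + 0.75 * 8 = 7.0
mu_mix = mixed_feature_expectations(psi, mus)   # [0.25, 1.5]
```

Note that the randomization happens only once, at time 0; the sampled deterministic policy is then followed without further mixing.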
39 The observations from the expert’s policy πE are in the form of m independent trajectories in M , each for simplicity of the same length H. [sent-76, score-0.528]
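One natural way to use the m trajectories is to average the discounted feature sums along each of them. The sketch below assumes this empirical-average estimator; the function name, the one-hot feature map `phi`, and the toy trajectories are hypothetical and are not taken from the text.

```python
def estimate_feature_expectations(trajectories, phi, gamma):
    """Average over trajectories of sum_t gamma^t * phi(s_t).

    trajectories: list of state sequences (all of the same length H).
    phi: function mapping a state to a list of k feature values.
    """
    m = len(trajectories)
    k = len(phi(trajectories[0][0]))
    mu_hat = [0.0] * k
    for traj in trajectories:
        discount = 1.0
        for s in traj:
            f = phi(s)
            for i in range(k):
                mu_hat[i] += discount * f[i] / m
            discount *= gamma
    return mu_hat

# Hypothetical toy data: two states with one-hot features, m = 2, H = 3.
phi = lambda s: [1.0, 0.0] if s == 0 else [0.0, 1.0]
trajs = [[0, 0, 1], [1, 1, 1]]
mu_E_hat = estimate_feature_expectations(trajs, phi, gamma=0.5)  # [0.75, 1.0]
```

Because the trajectories are truncated at length H, such an estimate ignores the discounted tail beyond H, which is why the trajectory length matters for accuracy.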
40 Review of the Projection Algorithm We compare our approach to the “projection algorithm” of Abbeel and Ng [1], which finds a policy that is at least as good as the expert’s policy with respect to the unknown reward function. [sent-83, score-1.221]
41 Given m independent trajectories from the expert’s policy, the projection algorithm runs for T iterations. [sent-85, score-0.197]
42 It returns a mixed policy ψ such that $\|\mu(\psi) - \mu_E\|_2 \le \epsilon$ as long as T and m are sufficiently large. [sent-86, score-0.617]
43 Note that this is weaker than the standard definition of optimality, as the policy only needs to be optimal with respect to the initial state distribution, and not necessarily at every state simultaneously. [sent-90, score-0.516]
44 Given an MDP\R, and m independent trajectories from an expert’s policy πE . [sent-97, score-0.528]
45 Suppose we execute the projection algorithm for T iterations. [sent-98, score-0.176]
46 Let ψ be the mixed policy returned by the algorithm. [sent-99, score-0.654]
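47 Then in order for $|V(\psi) - V(\pi_E)| \le \epsilon$ (2) to hold with probability at least $1 - \delta$, it suffices that $T \ge O\left(\frac{k}{(\epsilon(1-\gamma))^2} \ln \frac{k}{\epsilon(1-\gamma)}\right)$ and $m \ge \frac{2k}{(\epsilon(1-\gamma))^2} \ln \frac{2k}{\delta}$. [sent-100, score-0.228]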
48 Find an optimal policy with respect to a given reward function. [sent-102, score-0.728]
49 In particular, suppose that each sample trajectory has length $H \ge (1/(1-\gamma)) \ln(1/(\epsilon_H(1-\gamma)))$, and that an $\epsilon_P$-optimal policy is found in each iteration of the projection algorithm (see Step 1 above). [sent-109, score-0.66]
50 4 Apprenticeship Learning via Game Playing Notice the two-sided bound in Theorem 1: the theorem guarantees that the apprentice will do almost as well as the expert, but also almost as badly. [sent-114, score-0.221]
51 This is because the value of a policy is a linear combination of its feature expectations, and the goal of the projection algorithm is to match the expert’s feature expectations. [sent-115, score-0.737]
52 $v^* = \max_{\psi \in \Psi} \min_{w \in S_k} [w \cdot \mu(\psi) - w \cdot \mu_E]$ (3). Our goal will be to find (actually, to approximate) the mixed policy $\psi^*$ that achieves $v^*$. [sent-122, score-0.637]
53 Since V (ψ) = w∗ · µ(ψ) for all ψ, we have that ψ ∗ is the policy in Ψ that maximizes V (ψ) − V (πE ) with respect to the worst-case possibility for w∗ . [sent-123, score-0.53]
54 The quantity v ∗ is typically called the game value. [sent-130, score-0.213]
55 In this game, the “min player” specifies a reward function by choosing w, and the “max player” chooses a mixed policy ψ. [sent-131, score-0.849]
56 The goal of the min player is to cause the max player’s policy to perform as poorly as possible relative to the expert, and the max player’s goal is just the opposite. [sent-132, score-0.72]
57 In our case, the game matrix is the k × |Π| matrix G(i, j) = µj (i) − µE (i) (4) where µ(i) is the ith component of µ and we have let µj = µ(π j ) be the vector of feature expectations for the jth deterministic policy π j . [sent-134, score-0.855]
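As a small illustration of the game matrix, the sketch below builds G for a hypothetical set of three deterministic policies and evaluates how a mixed policy fares against the worst-case weight vector. The function names and numbers are assumptions for illustration. The min player's best response can be taken to be a single feature, because a linear function over the simplex attains its minimum at a vertex.

```python
def game_matrix(mus, mu_E):
    """G[i][j] = mu_j(i) - mu_E(i): one row per feature, one column per policy."""
    k = len(mu_E)
    return [[mu[i] - mu_E[i] for mu in mus] for i in range(k)]

def worst_case_payoff(G, psi):
    """min over w in the simplex of w^T G psi, i.e. the smallest entry of G psi."""
    return min(sum(g * p for g, p in zip(row, psi)) for row in G)

# Hypothetical feature expectations for three deterministic policies.
mus = [[1.0, 0.0], [0.0, 1.0], [0.75, 0.75]]
mu_E = [0.5, 0.5]
G = game_matrix(mus, mu_E)   # [[0.5, -0.5, 0.25], [-0.5, 0.5, 0.25]]

pure_first = worst_case_payoff(G, [1.0, 0.0, 0.0])   # -0.5
even_mix = worst_case_payoff(G, [0.5, 0.5, 0.0])     # 0.0
pure_third = worst_case_payoff(G, [0.0, 0.0, 1.0])   # 0.25
```

In this toy instance the third policy has a positive worst-case margin over the expert, which corresponds to a strictly positive game value.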
58 In Eqs. (3) and (5), the max player plays first, suggesting that the min player has an advantage. [sent-138, score-0.269]
59 that ψ ∗ will do at least as well as the expert’s policy with respect to the worst-case possibility for w∗ . [sent-144, score-0.527]
60 This fact is not immediately clear, since we are restricting ourselves to mixtures of deterministic policies, while we do not assume that the expert’s policy is deterministic. [sent-145, score-0.497]
61 In Eq. (6), the maximization over Ψ is done after w, and hence the reward function, has been fixed. [sent-147, score-0.212]
62 So the maximum is achieved by the best policy in Ψ with respect to this fixed reward function. [sent-148, score-0.701]
63 So when v ∗ > 0, the mixed policy ψ ∗ to some extent ignores the expert, and instead exploits prior knowledge about the true reward function encoded by the features. [sent-159, score-0.829]
64 5 The Multiplicative Weights for Apprenticeship Learning (MWAL) Algorithm In the previous section, we motivated the goal of finding the mixed policy ψ ∗ that achieves the maximum in Eq. [sent-161, score-0.637]
65 In the terminology of game theory, w and ψ are called strategies for the min and max player, respectively, and $\psi^*$ is called an optimal strategy for the max player. [sent-166, score-0.496]
66 Typically, one finds an optimal strategy for a two-player zero-sum game by solving a linear program. [sent-168, score-0.252]
67 However, the complexity of that approach scales with the size of the game matrix. [sent-169, score-0.197]
68 In our case, the game matrix G is huge, since it has as many columns as the number of deterministic policies in the MDP\R. [sent-170, score-0.301]
69 Freund and Schapire [2] described a multiplicative weights algorithm for finding approximately optimal strategies in games with large or even unknown game matrices. [sent-171, score-0.369]
70 To apply their algorithm to a game matrix G, it suffices to be able to efficiently perform the following two steps: 1. [sent-172, score-0.235]
71 Given a max player strategy ψ, compute $w^T G \psi$ for each pure strategy w. [sent-175, score-0.215]
72 Observe that these two steps are equivalent to the two steps of the projection algorithm from Section 3. [sent-176, score-0.179]
73 Step 1 amounts to finding the optimal policy in a standard MDP with a known reward function. [sent-177, score-0.707]
74 There is a huge array of techniques available for this, such as value iteration and policy iteration. [sent-178, score-0.511]
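For concreteness, here is a minimal value-iteration sketch for a small tabular MDP with a state-based reward. The data layout (`P[s][a]` as a list of `(next_state, probability)` pairs), the function name, and the two-state example are assumptions made for illustration.

```python
def value_iteration(P, R, gamma, tol=1e-10):
    """Return (V, policy) for a tabular MDP with state rewards R[s].

    P[s][a] is a list of (next_state, probability) pairs.
    """
    n = len(P)
    V = [0.0] * n
    while True:
        # One Bellman backup: Q[s][a] = R(s) + gamma * E[V(s') | s, a].
        Q = [[R[s] + gamma * sum(p * V[s2] for s2, p in P[s][a])
              for a in range(len(P[s]))] for s in range(n)]
        V_new = [max(q) for q in Q]
        delta = max(abs(x - y) for x, y in zip(V_new, V))
        V = V_new
        if delta < tol:
            policy = [max(range(len(q)), key=q.__getitem__) for q in Q]
            return V, policy

# Hypothetical two-state MDP: action 0 stays, action 1 switches state.
P = [[[(0, 1.0)], [(1, 1.0)]],
     [[(1, 1.0)], [(0, 1.0)]]]
R = [0.0, 1.0]
V, policy = value_iteration(P, R, gamma=0.5)
# The greedy policy moves to state 1 and stays there: V is close to [1.0, 2.0].
```

The cost of each sweep depends on the number of states and actions only, not on the number of deterministic policies.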
75 Importantly, the complexity of both steps scales with the size of the MDP\R, and not with the size of the game matrix G. [sent-181, score-0.218]
76 The algorithm is essentially the MW algorithm of Freund and Schapire [2], applied to a game matrix very similar to G. [sent-184, score-0.273]
77 We have also slightly extended their results to allow the MWAL algorithm, in lines 7 and 8, to estimate the optimal policy and its feature expectations, rather than requiring that they be computed exactly. [sent-185, score-0.551]
78 Algorithm 1 The MWAL algorithm. 1: Given: An MDP\R M and an estimate of the expert’s feature expectations $\hat{\mu}_E$. [sent-186, score-0.182]
79 7: Compute an $\epsilon_P$-optimal policy $\hat{\pi}^{(t)}$ for M with respect to reward function $R(s) = w^{(t)} \cdot \phi(s)$. [sent-200, score-0.701]
80 10: end for. 11: Post-processing: Return the mixed policy ψ that assigns probability $\frac{1}{T}$ to $\hat{\pi}^{(t)}$, for all t ∈ {1, . [sent-206, score-0.617]
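The overall loop can be sketched as follows. This is a hedged illustration, not the authors' implementation: the choice of β, the rescaling of payoffs into [0, 1] (which assumes features in [-1, 1]), the fixed number of sweeps used in place of ε-optimal solvers, and the two-state example are all assumptions in the spirit of a multiplicative-weights update, since the extracted text above does not spell out those steps.

```python
import math

def solve_mdp(P, reward, gamma, sweeps=200):
    """Value iteration; P[s][a] is a list of (next_state, probability) pairs."""
    n = len(P)
    V = [0.0] * n

    def backup(s, a):
        return sum(p * V[s2] for s2, p in P[s][a])

    for _ in range(sweeps):
        V[:] = [reward[s] + gamma * max(backup(s, a) for a in range(len(P[s])))
                for s in range(n)]
    return [max(range(len(P[s])), key=lambda a, s=s: backup(s, a)) for s in range(n)]

def feature_expectations(P, phi, policy, D, gamma, sweeps=200):
    """mu(pi) = E[sum_t gamma^t phi(s_t)] with s_0 ~ D, by iterative evaluation."""
    n, k = len(P), len(phi[0])
    M = [[0.0] * k for _ in range(n)]
    for _ in range(sweeps):
        M = [[phi[s][i] + gamma * sum(p * M[s2][i] for s2, p in P[s][policy[s]])
              for i in range(k)] for s in range(n)]
    return [sum(D[s] * M[s][i] for s in range(n)) for i in range(k)]

def mwal(P, phi, D, mu_E, gamma, T):
    k = len(mu_E)
    beta = 1.0 / (1.0 + math.sqrt(2.0 * math.log(k) / T))  # assumed step parameter
    W = [1.0] * k
    policies, mus = [], []
    for _ in range(T):
        total = sum(W)
        w = [x / total for x in W]                      # current weight vector in the simplex
        reward = [sum(w[i] * phi[s][i] for i in range(k)) for s in range(len(P))]
        pi = solve_mdp(P, reward, gamma)                # best response of the max player
        mu = feature_expectations(P, phi, pi, D, gamma)
        # Assumed rescaling of the payoff into [0, 1] for features in [-1, 1].
        loss = [((1.0 - gamma) * (mu[i] - mu_E[i]) + 2.0) / 4.0 for i in range(k)]
        W = [W[i] * beta ** loss[i] for i in range(k)]  # down-weight features where we do well
        policies.append(pi)
        mus.append(mu)
    mu_mix = [sum(m[i] for m in mus) / T for i in range(k)]  # uniform mixture over iterations
    return policies, mu_mix

# Hypothetical example: the expert stays in the featureless state 0 forever.
P = [[[(0, 1.0)], [(1, 1.0)]], [[(1, 1.0)], [(0, 1.0)]]]
phi = [[0.0, 0.0], [1.0, 0.5]]
D = [1.0, 0.0]
mu_E = [0.0, 0.0]
policies, mu_mix = mwal(P, phi, D, mu_E, gamma=0.5, T=20)
margin = min(m - e for m, e in zip(mu_mix, mu_E))
```

In this toy instance every weight vector in the simplex rewards state 1, so each iteration returns the policy that moves there, and the returned mixture exceeds the expert's feature expectations in both components.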
81 Theorem 2 below provides a performance guarantee for the mixed policy ψ returned by the MWAL algorithm, relative to the performance of the expert and the game value v ∗ . [sent-210, score-1.263]
82 Given an MDP\R M , and m independent trajectories from an expert’s policy πE . [sent-214, score-0.528]
83 Let ψ be the mixed policy returned by the algorithm. [sent-216, score-0.654]
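84 Let $v^* = \max_{\psi \in \Psi} \min_{w \in S_k} [w \cdot \mu(\psi) - w \cdot \mu_E]$ be the game value. [sent-220, score-0.197]
85 Then in order for $V(\psi) \ge V(\pi_E) + v^* - \epsilon$ (7) to hold with probability at least $1 - \delta$, it suffices that $T \ge \frac{9 \ln k}{2(\epsilon'(1-\gamma))^2}$ (8) and $m \ge \frac{2}{(\epsilon'(1-\gamma))^2} \ln \frac{2k}{\delta}$ (9), where $\epsilon' \le \frac{\epsilon - (2\epsilon_F + \epsilon_P + 2\epsilon_H + 2\epsilon_R/(1-\gamma))}{3}$ (10). [sent-221, score-0.228]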
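To get a feel for these bounds, the sketch below evaluates them numerically, assuming they read T ≥ 9 ln k / (2(ε′(1−γ))²) and m ≥ (2 / (ε′(1−γ))²) ln(2k/δ). The function names and parameter values are hypothetical.

```python
import math

def mwal_iterations(k, eps_prime, gamma):
    """Smallest integer T with T >= 9 ln k / (2 (eps' (1 - gamma))^2)."""
    return math.ceil(9.0 * math.log(k) / (2.0 * (eps_prime * (1.0 - gamma)) ** 2))

def mwal_samples(k, eps_prime, gamma, delta):
    """Smallest integer m with m >= 2 / (eps' (1 - gamma))^2 * ln(2k / delta)."""
    return math.ceil(2.0 / (eps_prime * (1.0 - gamma)) ** 2 * math.log(2.0 * k / delta))

T_small = mwal_iterations(k=2, eps_prime=0.1, gamma=0.9)
T_large = mwal_iterations(k=1024, eps_prime=0.1, gamma=0.9)
m_needed = mwal_samples(k=2, eps_prime=0.1, gamma=0.9, delta=0.05)
```

Going from k = 2 to k = 1024 features multiplies the iteration bound by about ten, which reflects its logarithmic dependence on k.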
86 Because v ∗ ≥ 0, the guarantee of the MWAL algorithm in (7) is at least as strong as the guarantee of the projection algorithm in (2), and has the further benefit of being one-sided. [sent-227, score-0.248]
87 This not only implies a faster run time, but also implies that the mixed policy output by the MWAL algorithm consists of fewer stationary policies. [sent-229, score-0.698]
88 And if a purely stationary policy is desired, it is not hard to show that the guarantee in (7) must hold for at least one of the stationary policies in the mixed policy (this is also true of the projection algorithm [1]). [sent-230, score-1.375]
89 1 When no expert is available Our game-playing approach can be very naturally and easily extended to the case where we do not have data from an expert. [sent-233, score-0.385]
90 (3), we find a policy $\psi^*$ that achieves $\max_{\psi \in \Psi} \min_{w \in S_k} [w \cdot \mu(\psi)]$ (12). [sent-235, score-0.563]
91 Here $\psi^*$ is the best policy for the worst-case possibility for $w^*$. [sent-236, score-0.487]
92 The MWAL algorithm can be trivially adapted to find this policy just by setting µE = 0 (compare (12) to (3)). [sent-237, score-0.506]
93 Let ψ be the mixed policy returned by the algorithm. [sent-242, score-0.654]
94 The same class of reward functions can be expressed under either set of assumptions by roughly doubling the number of features. [sent-248, score-0.243]
95 Notably, employing this reduction forces the game value v ∗ to be zero, ensuring that the MWAL algorithm, like the projection algorithm, will mimic the expert. [sent-262, score-0.296]
96 7 Experiments For ease of comparison, we tested the MWAL algorithm and the projection algorithm in a car driving simulator that resembled the experimental setup from Abbeel and Ng [1]. [sent-274, score-0.295]
97 In our simulator, the apprentice must navigate a car through randomly-generated traffic on a three-lane highway. [sent-276, score-0.216]
98 In both cases, the MWAL algorithm leverages information encoded in the features to produce a policy that is significantly better than the expert’s policy. [sent-288, score-0.532]
99 In that case, it produced a policy that drives as fast as possible without risking any collisions or off-roads. [sent-291, score-0.603]
100 Given our features, this is indeed the best policy for the worst-case choice of reward function. [sent-292, score-0.68]
wordName wordTfidf (topN-words)
[('policy', 0.468), ('mwal', 0.442), ('expert', 0.385), ('abbeel', 0.248), ('reward', 0.212), ('game', 0.197), ('apprentice', 0.177), ('apprenticeship', 0.158), ('mixed', 0.149), ('mdp', 0.132), ('ng', 0.13), ('projection', 0.099), ('player', 0.098), ('ln', 0.094), ('expectations', 0.088), ('schapire', 0.081), ('sk', 0.077), ('policies', 0.075), ('collisions', 0.071), ('mz', 0.071), ('supplement', 0.07), ('minw', 0.068), ('trajectories', 0.06), ('feature', 0.056), ('maxs', 0.053), ('freund', 0.051), ('theorem', 0.044), ('stationary', 0.043), ('drives', 0.042), ('environment', 0.041), ('max', 0.041), ('st', 0.04), ('execute', 0.039), ('bk', 0.039), ('car', 0.039), ('princeton', 0.039), ('algorithm', 0.038), ('games', 0.037), ('returned', 0.037), ('multiplicative', 0.037), ('syed', 0.035), ('driving', 0.035), ('wt', 0.035), ('rk', 0.033), ('unknown', 0.033), ('playing', 0.033), ('visited', 0.033), ('min', 0.032), ('fk', 0.031), ('doubling', 0.031), ('olden', 0.031), ('behave', 0.03), ('kearns', 0.03), ('transition', 0.03), ('deterministic', 0.029), ('simulator', 0.028), ('strategy', 0.028), ('optimal', 0.027), ('observing', 0.027), ('guarantee', 0.027), ('singh', 0.026), ('mimicking', 0.026), ('ratliff', 0.026), ('features', 0.026), ('speed', 0.025), ('bad', 0.025), ('speeds', 0.025), ('agent', 0.024), ('si', 0.024), ('medium', 0.023), ('ces', 0.023), ('maximizes', 0.022), ('iteration', 0.022), ('behavior', 0.022), ('fast', 0.022), ('respect', 0.021), ('steps', 0.021), ('huge', 0.021), ('hold', 0.021), ('reinforcement', 0.021), ('things', 0.02), ('chooses', 0.02), ('goal', 0.02), ('pure', 0.02), ('strict', 0.02), ('possibility', 0.019), ('conservative', 0.019), ('sec', 0.019), ('least', 0.019), ('ease', 0.018), ('likewise', 0.018), ('let', 0.017), ('ei', 0.017), ('sketch', 0.017), ('suppose', 0.017), ('trajectory', 0.016), ('called', 0.016), ('nding', 0.016), ('correctly', 0.015), ('collision', 0.015)]
simIndex simValue paperId paperTitle
same-paper 1 1.0000007 5 nips-2007-A Game-Theoretic Approach to Apprenticeship Learning
2 0.33918735 98 nips-2007-Hierarchical Apprenticeship Learning with Application to Quadruped Locomotion
Author: J. Z. Kolter, Pieter Abbeel, Andrew Y. Ng
Abstract: We consider apprenticeship learning (learning from expert demonstrations) in the setting of large, complex domains. Past work in apprenticeship learning requires that the expert demonstrate complete trajectories through the domain. However, in many problems even an expert has difficulty controlling the system, which makes this approach infeasible. For example, consider the task of teaching a quadruped robot to navigate over extreme terrain; demonstrating an optimal policy (i.e., an optimal set of foot locations over the entire terrain) is a highly non-trivial task, even for an expert. In this paper we propose a method for hierarchical apprenticeship learning, which allows the algorithm to accept isolated advice at different hierarchical levels of the control task. This type of advice is often feasible for experts to give, even if the expert is unable to demonstrate complete trajectories. This allows us to extend the apprenticeship learning paradigm to much larger, more challenging domains. In particular, in this paper we apply the hierarchical apprenticeship learning algorithm to the task of quadruped locomotion over extreme terrain, and achieve, to the best of our knowledge, results superior to any previously published work.
3 0.27069959 102 nips-2007-Incremental Natural Actor-Critic Algorithms
Author: Shalabh Bhatnagar, Mohammad Ghavamzadeh, Mark Lee, Richard S. Sutton
Abstract: We present four new reinforcement learning algorithms based on actor-critic and natural-gradient ideas, and provide their convergence proofs. Actor-critic reinforcement learning methods are online approximations to policy iteration in which the value-function parameters are estimated using temporal difference learning and the policy parameters are updated by stochastic gradient descent. Methods based on policy gradients in this way are of special interest because of their compatibility with function approximation methods, which are needed to handle large or infinite state spaces. The use of temporal difference learning in this way is of interest because in many applications it dramatically reduces the variance of the gradient estimates. The use of the natural gradient is of interest because it can produce better conditioned parameterizations and has been shown to further reduce variance in some cases. Our results extend prior two-timescale convergence results for actor-critic methods by Konda and Tsitsiklis by using temporal difference learning in the actor and by incorporating natural gradients, and they extend prior empirical studies of natural actor-critic methods by Peters, Vijayakumar and Schaal by providing the first convergence proofs and the first fully incremental algorithms.
4 0.25111309 34 nips-2007-Bayesian Policy Learning with Trans-Dimensional MCMC
Author: Matthew Hoffman, Arnaud Doucet, Nando de Freitas, Ajay Jasra
Abstract: A recently proposed formulation of the stochastic planning and control problem as one of parameter estimation for suitable artificial statistical models has led to the adoption of inference algorithms for this notoriously hard problem. At the algorithmic level, the focus has been on developing Expectation-Maximization (EM) algorithms. In this paper, we begin by making the crucial observation that the stochastic control problem can be reinterpreted as one of trans-dimensional inference. With this new interpretation, we are able to propose a novel reversible jump Markov chain Monte Carlo (MCMC) algorithm that is more efficient than its EM counterparts. Moreover, it enables us to implement full Bayesian policy search, without the need for gradients and with one single Markov chain. The new approach involves sampling directly from a distribution that is proportional to the reward and, consequently, performs better than classic simulations methods in situations where the reward is a rare event.
5 0.22742474 148 nips-2007-Online Linear Regression and Its Application to Model-Based Reinforcement Learning
Author: Alexander L. Strehl, Michael L. Littman
Abstract: We provide a provably efficient algorithm for learning Markov Decision Processes (MDPs) with continuous state and action spaces in the online setting. Specifically, we take a model-based approach and show that a special type of online linear regression allows us to learn MDPs with (possibly kernelized) linearly parameterized dynamics. This result builds on Kearns and Singh’s work that provides a provably efficient algorithm for finite state MDPs. Our approach is not restricted to the linear setting, and is applicable to other classes of continuous MDPs.
6 0.20622902 91 nips-2007-Fitted Q-iteration in continuous action-space MDPs
7 0.19414766 162 nips-2007-Random Sampling of States in Dynamic Programming
8 0.15613241 168 nips-2007-Reinforcement Learning in Continuous Action Spaces through Sequential Monte Carlo Methods
9 0.15250193 185 nips-2007-Stable Dual Dynamic Programming
10 0.1465814 124 nips-2007-Managing Power Consumption and Performance of Computing Systems Using Reinforcement Learning
11 0.13350014 151 nips-2007-Optimistic Linear Programming gives Logarithmic Regret for Irreducible MDPs
12 0.11946229 55 nips-2007-Computing Robust Counter-Strategies
13 0.1163986 165 nips-2007-Regret Minimization in Games with Incomplete Information
14 0.09810292 215 nips-2007-What makes some POMDP problems easy to approximate?
15 0.093465693 86 nips-2007-Exponential Family Predictive Representations of State
16 0.084077768 52 nips-2007-Competition Adds Complexity
17 0.079799861 163 nips-2007-Receding Horizon Differential Dynamic Programming
18 0.079495192 204 nips-2007-Theoretical Analysis of Heuristic Search Methods for Online POMDPs
19 0.078878306 30 nips-2007-Bayes-Adaptive POMDPs
20 0.068447754 194 nips-2007-The Epoch-Greedy Algorithm for Multi-armed Bandits with Side Information
topicId topicWeight
[(0, -0.196), (1, -0.422), (2, 0.088), (3, -0.079), (4, -0.142), (5, 0.116), (6, 0.014), (7, -0.048), (8, 0.044), (9, 0.196), (10, -0.074), (11, -0.17), (12, 0.048), (13, -0.007), (14, 0.012), (15, 0.067), (16, -0.041), (17, 0.11), (18, -0.026), (19, 0.008), (20, 0.039), (21, 0.071), (22, -0.066), (23, 0.019), (24, -0.005), (25, -0.065), (26, 0.027), (27, -0.077), (28, -0.001), (29, 0.002), (30, -0.139), (31, 0.062), (32, -0.111), (33, 0.043), (34, -0.054), (35, -0.071), (36, 0.06), (37, 0.002), (38, -0.088), (39, -0.015), (40, 0.067), (41, -0.015), (42, -0.09), (43, -0.036), (44, -0.179), (45, -0.063), (46, -0.003), (47, -0.1), (48, -0.018), (49, -0.003)]
simIndex simValue paperId paperTitle
same-paper 1 0.9741109 5 nips-2007-A Game-Theoretic Approach to Apprenticeship Learning
2 0.84652913 98 nips-2007-Hierarchical Apprenticeship Learning with Application to Quadruped Locomotion
Author: J. Z. Kolter, Pieter Abbeel, Andrew Y. Ng
Abstract: We consider apprenticeship learning (learning from expert demonstrations) in the setting of large, complex domains. Past work in apprenticeship learning requires that the expert demonstrate complete trajectories through the domain. However, in many problems even an expert has difficulty controlling the system, which makes this approach infeasible. For example, consider the task of teaching a quadruped robot to navigate over extreme terrain; demonstrating an optimal policy (i.e., an optimal set of foot locations over the entire terrain) is a highly non-trivial task, even for an expert. In this paper we propose a method for hierarchical apprenticeship learning, which allows the algorithm to accept isolated advice at different hierarchical levels of the control task. This type of advice is often feasible for experts to give, even if the expert is unable to demonstrate complete trajectories. This allows us to extend the apprenticeship learning paradigm to much larger, more challenging domains. In particular, in this paper we apply the hierarchical apprenticeship learning algorithm to the task of quadruped locomotion over extreme terrain, and achieve, to the best of our knowledge, results superior to any previously published work.
3 0.66973567 124 nips-2007-Managing Power Consumption and Performance of Computing Systems Using Reinforcement Learning
Author: Gerald Tesauro, Rajarshi Das, Hoi Chan, Jeffrey Kephart, David Levine, Freeman Rawson, Charles Lefurgy
Abstract: Electrical power management in large-scale IT systems such as commercial datacenters is an application area of rapidly growing interest from both an economic and ecological perspective, with billions of dollars and millions of metric tons of CO2 emissions at stake annually. Businesses want to save power without sacrificing performance. This paper presents a reinforcement learning approach to simultaneous online management of both performance and power consumption. We apply RL in a realistic laboratory testbed using a Blade cluster and dynamically varying HTTP workload running on a commercial web applications middleware platform. We embed a CPU frequency controller in the Blade servers’ firmware, and we train policies for this controller using a multi-criteria reward signal depending on both application performance and CPU power consumption. Our testbed scenario posed a number of challenges to successful use of RL, including multiple disparate reward functions, limited decision sampling rates, and pathologies arising when using multiple sensor readings as state variables. We describe innovative practical solutions to these challenges, and demonstrate clear performance improvements over both hand-designed policies as well as obvious “cookbook” RL implementations.
4 0.63594842 91 nips-2007-Fitted Q-iteration in continuous action-space MDPs
Author: András Antos, Csaba Szepesvári, Rémi Munos
Abstract: We consider continuous state, continuous action batch reinforcement learning where the goal is to learn a good policy from a sufficiently rich trajectory generated by some policy. We study a variant of fitted Q-iteration, where the greedy action selection is replaced by searching for a policy in a restricted set of candidate policies by maximizing the average action values. We provide a rigorous analysis of this algorithm, proving what we believe is the first finite-time bound for value-function based algorithms for continuous state and action problems. 1 Preliminaries We will build on the results from [1, 2, 3] and for this reason we use the same notation as these papers. The unattributed results cited in this section can be found in the book [4]. A discounted MDP is defined by a quintuple (X , A, P, S, γ), where X is the (possible infinite) state space, A is the set of actions, P : X × A → M (X ) is the transition probability kernel with P (·|x, a) defining the next-state distribution upon taking action a from state x, S(·|x, a) gives the corresponding distribution of immediate rewards, and γ ∈ (0, 1) is the discount factor. Here X is a measurable space and M (X ) denotes the set of all probability measures over X . The Lebesguemeasure shall be denoted by λ. We start with the following mild assumption on the MDP: Assumption A1 (MDP Regularity) X is a compact subset of the dX -dimensional Euclidean space, ˆ A is a compact subset of [−A∞ , A∞ ]dA . The random immediate rewards are bounded by Rmax and that the expected immediate reward function, r(x, a) = rS(dr|x, a), is uniformly bounded by Rmax : r ∞ ≤ Rmax . A policy determines the next action given the past observations. Here we shall deal with stationary (Markovian) policies which choose an action in a stochastic way based on the last observation only. 
The value of a policy $\pi$ when it is started from a state $x$ is defined as the total expected discounted reward that is encountered while the policy is executed: $V^\pi(x) = E_\pi\left[\sum_{t=0}^{\infty} \gamma^t R_t \mid X_0 = x\right]$. Here $R_t \sim S(\cdot|X_t, A_t)$ is the reward received at time step $t$, and the state $X_t$ evolves according to $X_{t+1} \sim P(\cdot|X_t, A_t)$, where $A_t$ is sampled from the distribution determined by $\pi$. We use $Q^\pi : \mathcal{X} \times \mathcal{A} \to \mathbb{R}$ to denote the action-value function of policy $\pi$: $Q^\pi(x,a) = E_\pi\left[\sum_{t=0}^{\infty} \gamma^t R_t \mid X_0 = x, A_0 = a\right]$.

∗ Also with: Computer and Automation Research Inst. of the Hungarian Academy of Sciences, Kende u. 13-17, Budapest 1111, Hungary.

The goal is to find a policy that attains the best possible values, $V^*(x) = \sup_\pi V^\pi(x)$, at all states $x \in \mathcal{X}$. Here $V^*$ is called the optimal value function and a policy $\pi^*$ that satisfies $V^{\pi^*}(x) = V^*(x)$ for all $x \in \mathcal{X}$ is called optimal. The optimal action-value function is $Q^*(x,a) = \sup_\pi Q^\pi(x,a)$. We say that a (deterministic stationary) policy $\pi$ is greedy w.r.t. an action-value function $Q \in B(\mathcal{X} \times \mathcal{A})$, and we write $\pi = \hat{\pi}(\cdot; Q)$, if, for all $x \in \mathcal{X}$, $\pi(x) \in \operatorname{argmax}_{a \in \mathcal{A}} Q(x,a)$. Under mild technical assumptions, such a greedy policy always exists. Any greedy policy w.r.t. $Q^*$ is optimal.

For $\pi : \mathcal{X} \to \mathcal{A}$ we define its evaluation operator, $T^\pi : B(\mathcal{X} \times \mathcal{A}) \to B(\mathcal{X} \times \mathcal{A})$, by $(T^\pi Q)(x,a) = r(x,a) + \gamma \int_{\mathcal{X}} Q(y, \pi(y)) \, P(dy|x,a)$. It is known that $Q^\pi = T^\pi Q^\pi$. Further, if we let the Bellman operator, $T : B(\mathcal{X} \times \mathcal{A}) \to B(\mathcal{X} \times \mathcal{A})$, be defined by $(TQ)(x,a) = r(x,a) + \gamma \int_{\mathcal{X}} \sup_{b \in \mathcal{A}} Q(y,b) \, P(dy|x,a)$, then $Q^* = TQ^*$. It is known that $V^\pi$ and $Q^\pi$ are bounded by $R_{\max}/(1-\gamma)$, just like $Q^*$ and $V^*$. For $\pi : \mathcal{X} \to \mathcal{A}$, the operator $E^\pi : B(\mathcal{X} \times \mathcal{A}) \to B(\mathcal{X})$ is defined by $(E^\pi Q)(x) = Q(x, \pi(x))$, while $E : B(\mathcal{X} \times \mathcal{A}) \to B(\mathcal{X})$ is defined by $(EQ)(x) = \sup_{a \in \mathcal{A}} Q(x,a)$.

Throughout the paper $\mathcal{F} \subset \{f : \mathcal{X} \times \mathcal{A} \to \mathbb{R}\}$ will denote a subset of real-valued functions over the state-action space $\mathcal{X} \times \mathcal{A}$ and $\Pi \subset \mathcal{A}^{\mathcal{X}}$ will be a set of policies.
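The operators $T^\pi$, $T$, $E^\pi$ and $E$ defined above can be made concrete on a small finite MDP, where the integrals become sums. The sketch below is only an illustration under that assumption (the paper itself works with continuous states and actions); the array layout `P[x, a, y]` and the function names are choices made here, not notation from the paper.

```python
import numpy as np

def evaluation_operator(Q, r, P, pi, gamma):
    """(T^pi Q)(x, a) = r(x, a) + gamma * sum_y P(y | x, a) * Q(y, pi(y))."""
    n_states = Q.shape[0]
    v_pi = Q[np.arange(n_states), pi]       # (E^pi Q)(y) = Q(y, pi(y))
    return r + gamma * (P @ v_pi)           # P has shape (states, actions, states)

def bellman_operator(Q, r, P, gamma):
    """(T Q)(x, a) = r(x, a) + gamma * sum_y P(y | x, a) * max_b Q(y, b)."""
    return r + gamma * (P @ Q.max(axis=1))  # (E Q)(y) = max_b Q(y, b)

def greedy_policy(Q):
    """A deterministic policy that is greedy w.r.t. Q."""
    return Q.argmax(axis=1)

def value_iteration(r, P, gamma, n_iter=500):
    """Iterate Q <- T Q from Q = 0; the limit is the fixed point Q* = T Q*."""
    Q = np.zeros_like(r, dtype=float)
    for _ in range(n_iter):
        Q = bellman_operator(Q, r, P, gamma)
    return Q
```

On a two-state example where action 0 stays put, action 1 switches state, and only staying in state 0 pays reward 1, iterating with $\gamma = 0.5$ converges to a $Q^*$ whose greedy policy $\pi$ satisfies $T^\pi Q^* = TQ^* = Q^*$, and all values stay below $R_{\max}/(1-\gamma) = 2$.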
For $\nu \in M(\mathcal{X})$ and measurable $f : \mathcal{X} \to \mathbb{R}$, we let (for $p \ge 1$) $\|f\|_{p,\nu}^p = \int_{\mathcal{X}} |f(x)|^p \, \nu(dx)$. We simply write $\|f\|_\nu$ for $\|f\|_{2,\nu}$. Further, we extend $\|\cdot\|_\nu$ to $\mathcal{F}$ by $\|f\|_\nu^2 = \int_{\mathcal{A}} \int_{\mathcal{X}} |f|^2(x,a) \, d\nu(x) \, d\lambda_{\mathcal{A}}(a)$, where $\lambda_{\mathcal{A}}$ is the uniform distribution over $\mathcal{A}$. We shall use the shorthand notation $\nu f$ to denote the integral $\int f(x) \, \nu(dx)$. We denote the space of bounded measurable functions with domain $\mathcal{X}$ by $B(\mathcal{X})$. Further, the space of measurable functions bounded by $0 < K < \infty$ shall be denoted by $B(\mathcal{X}; K)$. We let $\|\cdot\|_\infty$ denote the supremum norm.

2 Fitted Q-iteration with approximate policy maximization

We assume that we are given a finite trajectory, $\{(X_t, A_t, R_t)\}_{1 \le t \le N}$, generated by some stochastic stationary policy $\pi_b$, called the behavior policy: $A_t \sim \pi_b(\cdot|X_t)$, $X_{t+1} \sim P(\cdot|X_t, A_t)$, $R_t \sim S(\cdot|X_t, A_t)$, where $\pi_b(\cdot|x)$ is a density with $\pi_0 \stackrel{\text{def}}{=} \inf_{(x,a) \in \mathcal{X} \times \mathcal{A}} \pi_b(a|x) > 0$.

The generic recipe for fitted Q-iteration (FQI) [5] is

$$Q_{k+1} = \mathrm{Regress}(D_k(Q_k)), \qquad (1)$$

where Regress is an appropriate regression procedure and $D_k(Q_k)$ is a dataset defining a regression problem in the form of a list of data-point pairs (footnote 1):

$$D_k(Q_k) = \left\{ \left( (X_t, A_t), \; R_t + \gamma \max_{b \in \mathcal{A}} Q_k(X_{t+1}, b) \right) \right\}_{1 \le t \le N}.$$

Fitted Q-iteration can be viewed as approximate value iteration applied to action-value functions. To see this note that value iteration would assign the value $(TQ_k)(x,a) = r(x,a) + \gamma \int \max_{b \in \mathcal{A}} Q_k(y,b) \, P(dy|x,a)$ to $Q_{k+1}(x,a)$ [6]. Now, remember that the regression function for the jointly distributed random variables $(Z, Y)$ is defined by the conditional expectation of $Y$ given $Z$: $m(Z) = E[Y|Z]$. Since for any fixed function $Q$, $E[R_t + \gamma \max_{b \in \mathcal{A}} Q(X_{t+1}, b) \mid X_t, A_t] = (TQ)(X_t, A_t)$, the regression function corresponding to the data $D_k(Q)$ is indeed $TQ$ and hence if FQI solved the regression problem defined by $Q_k$ exactly, it would simulate value iteration exactly. However, this argument itself does not directly lead to a rigorous analysis of FQI: Since $Q_k$ is obtained based on the data, it is itself a random function.
Hence, after the first iteration, the “target” function in FQI becomes random. Furthermore, this function depends on the same data that is used to define the regression problem. Will FQI still work despite these issues? To illustrate the potential difficulties consider a dataset where $X_1, \ldots, X_N$ is a sequence of independent random variables, which are all distributed uniformly at random in $[0,1]$. Further, let $M$ be a random integer greater than $N$ which is independent of the dataset $(X_t)_{t=1}^N$. Let $U$ be another random variable, uniformly distributed in $[0,1]$. Now define the regression problem by $Y_t = f_{M,U}(X_t)$, where $f_{M,U}(x) = \mathrm{sgn}(\sin(2^M 2\pi(x + U)))$. Then it is not hard to see that no matter how big $N$ is, no procedure can estimate the regression function $f_{M,U}$ with a small error (in expectation, or with high probability), even if the procedure could exploit the knowledge of the specific form of $f_{M,U}$. On the other hand, if we restricted $M$ to a finite range then the estimation problem could be solved successfully. The example shows that if the complexity of the random functions defining the regression problem is uncontrolled then successful estimation might be impossible.

Footnote 1: Since the designer controls $Q_k$, we may assume that it is continuous, hence the maximum exists.

Amongst the many regression methods in this paper we have chosen to work with least-squares methods. In this case Equation (1) takes the form

$$Q_{k+1} = \operatorname*{argmin}_{Q \in \mathcal{F}} \sum_{t=1}^{N} \frac{1}{\pi_b(A_t|X_t)} \left( Q(X_t, A_t) - \left[ R_t + \gamma \max_{b \in \mathcal{A}} Q_k(X_{t+1}, b) \right] \right)^2. \qquad (2)$$

We call this method the least-squares fitted Q-iteration (LSFQI) method. Here we introduced the weighting $1/\pi_b(A_t|X_t)$ since we do not want to give more weight to those actions that are preferred by the behavior policy. Besides this weighting, the only parameter of the method is the function set $\mathcal{F}$. This function set should be chosen carefully, to keep a balance between the representation power and the number of samples.
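A single LSFQI update of the form (2) can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes a linearly parameterized function set $Q(x,a) = \phi(x,a)^\top w$ and approximates the maximum over actions on a finite grid, and the names `phi`, `q_prev` and `action_grid` are hypothetical.

```python
import numpy as np

def lsfqi_step(phi, q_prev, transitions, behavior_probs, action_grid, gamma):
    """One weighted least-squares FQI step for Q(x, a) = phi(x, a) . w.

    transitions:    list of (x, a, r, x_next) tuples from one trajectory.
    behavior_probs: behavior_probs[t] = pi_b(A_t | X_t).
    action_grid:    finite set of actions used to approximate the max
                    (an assumption of this sketch).
    """
    features = np.array([phi(x, a) for x, a, _, _ in transitions], dtype=float)
    # Regression targets R_t + gamma * max_b Q_k(X_{t+1}, b).
    targets = np.array([
        r + gamma * max(q_prev(y, b) for b in action_grid)
        for _, _, r, y in transitions
    ], dtype=float)
    # Weighting each squared residual by 1 / pi_b(A_t | X_t) is the same as
    # ordinary least squares on rows rescaled by the square root of the weight.
    scale = 1.0 / np.sqrt(np.asarray(behavior_probs, dtype=float))
    w, *_ = np.linalg.lstsq(features * scale[:, None], targets * scale, rcond=None)
    return w
```

With a single constant feature the solution is the importance-weighted mean of the targets, which shows directly how samples of rarely chosen actions are up-weighted.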
As a specific example for $\mathcal{F}$ consider neural networks with some fixed architecture. In this case the function set is generated by assigning weights in all possible ways to the neural net. Then the above minimization becomes the problem of tuning the weights. Another example is to use linearly parameterized function approximation methods with appropriately selected basis functions. In this case the weight tuning problem would be less demanding. Yet another possibility is to let $\mathcal{F}$ be an appropriate restriction of a Reproducing Kernel Hilbert Space (e.g., in a ball). In this case the training procedure becomes similar to LS-SVM training [7].

As indicated above, the analysis of this algorithm is complicated by the fact that the new dataset is defined in terms of the previous iterate, which is already a function of the dataset. Another complication is that the samples in a trajectory are in general correlated and that the bias introduced by the imperfections of the approximation architecture may lead to an explosion of the error of the procedure, as documented in a number of cases in, e.g., [8]. Nevertheless, at least for finite action sets, the tools developed in [1, 3, 2] look suitable to show that under appropriate conditions these problems can be overcome if the function set is chosen in a judicious way. However, the results of these works would become essentially useless in the case of an infinite number of actions since these previous bounds grow to infinity with the number of actions. Actually, we believe that this is not an artifact of the proof techniques of these works, as suggested by the counterexample that involved random targets. The following result elaborates this point further:

Proposition 2.1. Let $\mathcal{F} \subset B(\mathcal{X} \times \mathcal{A})$. Then even if the pseudo-dimension of $\mathcal{F}$ is finite, the fat-shattering function of

$$\mathcal{F}^{\vee}_{\max} = \left\{ V_Q : V_Q(\cdot) = \max_{a \in \mathcal{A}} Q(\cdot, a), \; Q \in \mathcal{F} \right\}$$

can be infinite over $(0, 1/2)$ (footnote 2).
Without going into further details, let us just note that the finiteness of the fat-shattering function is a sufficient and necessary condition for learnability and the finiteness of the fat-shattering function is implied by the finiteness of the pseudo-dimension [9]. The above proposition thus shows that without imposing further special conditions on $\mathcal{F}$, the learning problem may become infeasible. One possibility is of course to discretize the action space, e.g., by using a uniform grid. However, if the action space has a really high dimensionality, this approach becomes infeasible (even enumerating $2^{d_A}$ points could be impossible when $d_A$ is large). Therefore we prefer alternate solutions. Another possibility is to make the functions in $\mathcal{F}$, e.g., uniformly Lipschitz in their state coordinates. Then the same property will hold for functions in $\mathcal{F}^{\vee}_{\max}$ and hence by a classical result we can bound the capacity of this set (cf. pp. 353–357 of [10]). One potential problem with this approach is that this way it might be difficult to get a fine control of the capacity of the resulting set.

Footnote 2: The proofs of this and the other results are given in the appendix, available in the extended version of this paper, downloadable from http://hal.inria.fr/inria-00185311/en/.

In the approach explored here we modify the fitted Q-iteration algorithm by introducing a policy set $\Pi$ and a search over this set for an approximately greedy policy in a sense that will be made precise shortly. Our algorithm thus has four parameters: $\mathcal{F}$, $\Pi$, $K$, $Q_0$. Here $\mathcal{F}$ is as before, $\Pi$ is a user-chosen set of policies (mappings from $\mathcal{X}$ to $\mathcal{A}$), $K$ is the number of iterations and $Q_0$ is an initial value function (a typical choice is $Q_0 \equiv 0$). The algorithm computes a sequence of iterates $(Q_k, \hat{\pi}_k)$, $k = 0, \ldots, K$, defined by the following equations:

$$\hat{\pi}_0 = \operatorname*{argmax}_{\pi \in \Pi} \sum_{t=1}^{N} Q_0(X_t, \pi(X_t)),$$

$$Q_{k+1} = \operatorname*{argmin}_{Q \in \mathcal{F}} \sum_{t=1}^{N} \frac{1}{\pi_b(A_t|X_t)} \left( Q(X_t, A_t) - \left[ R_t + \gamma Q_k(X_{t+1}, \hat{\pi}_k(X_{t+1})) \right] \right)^2, \qquad (3)$$

$$\hat{\pi}_{k+1} = \operatorname*{argmax}_{\pi \in \Pi} \sum_{t=1}^{N} Q_{k+1}(X_t, \pi(X_t)). \qquad (4)$$

Thus, (3) is similar to (2), while (4) defines the policy search problem. The policy search will generally be solved by a gradient procedure or some other appropriate method. The cost of this step will be primarily determined by how well-behaved the iterates $Q_{k+1}$ are in their action arguments. For example, if they were quadratic and if $\pi$ was linear then the problem would be a quadratic optimization problem. However, except for special cases (footnote 3) the action-value functions will be more complicated, in which case this step can be expensive. Still, this cost could be similar to that of searching for the maximizing actions for each $t = 1, \ldots, N$ if the approximately maximizing actions are similar across similar states.

This algorithm, which we could also call a fitted actor-critic algorithm, will be shown to overcome the above-mentioned complexity control problem provided that the complexity of $\Pi$ is controlled appropriately. Indeed, in this case the set of possible regression problems is determined by the set

$$\mathcal{F}^{\vee}_{\Pi} = \left\{ V : V(\cdot) = Q(\cdot, \pi(\cdot)), \; Q \in \mathcal{F}, \; \pi \in \Pi \right\},$$

and the proof will rely on controlling the complexity of $\mathcal{F}^{\vee}_{\Pi}$ by selecting $\mathcal{F}$ and $\Pi$ appropriately.

3 The main theoretical result

3.1 Outline of the analysis

In order to gain some insight into the behavior of the algorithm, we provide a brief summary of its error analysis. The main result will be presented subsequently. For $f, Q \in \mathcal{F}$ and a policy $\pi$, we define the $t$th TD-error as follows:

$$d_t(f; Q, \pi) = R_t + \gamma Q(X_{t+1}, \pi(X_{t+1})) - f(X_t, A_t).$$

Further, we define the empirical loss function by

$$\hat{L}_N(f; Q, \pi) = \frac{1}{N} \sum_{t=1}^{N} \frac{d_t^2(f; Q, \pi)}{\lambda(\mathcal{A}) \, \pi_b(A_t|X_t)},$$

where the normalization with $\lambda(\mathcal{A})$ is introduced for mathematical convenience.
Then (3) can be written compactly as $Q_{k+1} = \operatorname{argmin}_{f \in \mathcal{F}} \hat{L}_N(f; Q_k, \hat{\pi}_k)$.

The algorithm can then be motivated by the observation that for any $f$, $Q$, and $\pi$, $\hat{L}_N(f; Q, \pi)$ is an unbiased estimate of

$$L(f; Q, \pi) \stackrel{\text{def}}{=} \|f - T^\pi Q\|_\nu^2 + L^*(Q, \pi), \qquad (5)$$

where the first term is the error we are interested in and the second term captures the variance of the random samples:

$$L^*(Q, \pi) = \int_{\mathcal{A}} E\left[ \mathrm{Var}\left[ R_1 + \gamma Q(X_2, \pi(X_2)) \mid X_1, A_1 = a \right] \right] d\lambda_{\mathcal{A}}(a).$$

This result is stated formally by $E[\hat{L}_N(f; Q, \pi)] = L(f; Q, \pi)$.

Footnote 3: Linear quadratic regulation is such a nice case. It is interesting to note that in this special case the obvious choices for $\mathcal{F}$ and $\Pi$ yield zero error in the limit, as can be proven based on the main result of this paper.

Since the variance term in (5) is independent of $f$, $\operatorname{argmin}_{f \in \mathcal{F}} L(f; Q, \pi) = \operatorname{argmin}_{f \in \mathcal{F}} \|f - T^\pi Q\|_\nu^2$. Thus, if $\hat{\pi}_k$ were greedy w.r.t. $Q_k$ then $\operatorname{argmin}_{f \in \mathcal{F}} L(f; Q_k, \hat{\pi}_k) = \operatorname{argmin}_{f \in \mathcal{F}} \|f - TQ_k\|_\nu^2$. Hence we can still think of the procedure as approximate value iteration over the space of action-value functions, projecting $TQ_k$ using empirical risk minimization on the space $\mathcal{F}$ w.r.t. $\|\cdot\|_\nu$ distances in an approximate manner. Since $\hat{\pi}_k$ is only approximately greedy, we will have to deal with both the error coming from the approximate projection and the error coming from the choice of $\hat{\pi}_k$. To make this clear, we write the iteration in the form

$$Q_{k+1} = T^{\hat{\pi}_k} Q_k + \varepsilon'_k = TQ_k + \varepsilon'_k + (T^{\hat{\pi}_k} Q_k - TQ_k) = TQ_k + \varepsilon_k,$$

where $\varepsilon'_k$ is the error committed while computing $T^{\hat{\pi}_k} Q_k$, $\varepsilon''_k \stackrel{\text{def}}{=} T^{\hat{\pi}_k} Q_k - TQ_k$ is the error committed because the greedy policy is computed approximately, and $\varepsilon_k = \varepsilon'_k + \varepsilon''_k$ is the total error of step $k$. Hence, in order to show that the procedure is well behaved, one needs to show that both errors are controlled and that when the errors are propagated through these equations, the resulting error stays controlled, too.
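The TD-errors, the empirical loss $\hat{L}_N$ and the policy search step (4) can be sketched in a few lines. The code below is an illustration under simplifying assumptions: the policy set $\Pi$ is replaced by a finite list of candidate policies searched exhaustively (the paper suggests a gradient procedure or another appropriate method), and all function names are hypothetical.

```python
import numpy as np

def td_errors(f, q, policy, states, actions, rewards, next_states, gamma):
    """d_t(f; Q, pi) = R_t + gamma * Q(X_{t+1}, pi(X_{t+1})) - f(X_t, A_t)."""
    return np.array([
        r + gamma * q(y, policy(y)) - f(x, a)
        for x, a, r, y in zip(states, actions, rewards, next_states)
    ], dtype=float)

def empirical_loss(d, behavior_probs, action_volume):
    """L_hat_N = (1/N) * sum_t d_t^2 / (lambda(A) * pi_b(A_t | X_t))."""
    d = np.asarray(d, dtype=float)
    probs = np.asarray(behavior_probs, dtype=float)
    return float(np.mean(d ** 2 / (action_volume * probs)))

def policy_search(q, states, candidate_policies):
    """Step (4) over a finite candidate set: maximize sum_t Q(X_t, pi(X_t))."""
    scores = [sum(q(x, pi(x)) for x in states) for pi in candidate_policies]
    return candidate_policies[int(np.argmax(scores))]
```

Alternating a minimization of `empirical_loss` over $f \in \mathcal{F}$ with a call to `policy_search` on the new iterate gives the loop defined by (3) and (4); the capacity of the candidate policy set plays the role of the complexity of $\Pi$ in the analysis.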
Since we are ultimately interested in the performance of the policy obtained, we will also need to show that small action-value approximation errors yield small performance losses. For these we need a number of assumptions that concern either the training data, the MDP, or the function sets used for learning.

3.2 Assumptions

3.2.1 Assumptions on the training data

We shall assume that the data is rich, is in a steady state, and is fast-mixing, where, informally, mixing means that the future depends weakly on the past.

Assumption A2 (Sample Path Properties) Assume that $\{(X_t, A_t, R_t)\}_{t=1,\ldots,N}$ is the sample path of $\pi_b$, a stochastic stationary policy. Further, assume that $\{X_t\}$ is strictly stationary ($X_t \sim \nu \in M(\mathcal{X})$) and exponentially β-mixing with the actual rate given by the parameters $(\beta, b, \kappa)$ (footnote 4). We further assume that the sampling policy $\pi_b$ satisfies $\pi_0 = \inf_{(x,a) \in \mathcal{X} \times \mathcal{A}} \pi_b(a|x) > 0$.

The β-mixing property will be used to establish tail inequalities for certain empirical processes (footnote 5). Note that the mixing coefficients do not need to be known. In the case when no mixing condition is satisfied, learning might be impossible. To see this just consider the case when $X_1 = X_2 = \ldots = X_N$. In this case the learner has many copies of the same random variable and successful generalization is thus impossible. We believe that the assumption that the process is in a steady state is not essential for our result: when the process reaches its steady state quickly then (at the price of a more involved proof) the result would still hold.

3.2.2 Assumptions on the MDP

In order to prevent the uncontrolled growth of the errors as they are propagated through the updates, we shall need some assumptions on the MDP. A convenient assumption is the following one [11]:

Assumption A3 (Uniformly stochastic transitions) For all $x \in \mathcal{X}$ and $a \in \mathcal{A}$, assume that $P(\cdot|x,a)$ is absolutely continuous w.r.t. $\nu$ and the Radon-Nikodym derivative of $P$ w.r.t. $\nu$ is bounded uniformly with bound $C_\nu$:

$$C_\nu \stackrel{\text{def}}{=} \sup_{x \in \mathcal{X}, a \in \mathcal{A}} \left\| \frac{dP(\cdot|x,a)}{d\nu} \right\|_\infty < +\infty.$$

Note that by the definition of measure differentiation, Assumption A3 means that $P(\cdot|x,a) \le C_\nu \, \nu(\cdot)$. This assumption essentially requires the transitions to be noisy. We will also prove (weaker) results under the following, weaker assumption:

Footnote 4: For the definition of β-mixing, see e.g. [2].

Footnote 5: We say “empirical process” and “empirical measure”, but note that in this work these are based on dependent (mixing) samples.

Assumption A4 (Discounted-average concentrability of future-state distributions) Given $\rho$, $\nu$, $m \ge 1$ and an arbitrary sequence of stationary policies $\{\pi_m\}_{m \ge 1}$, assume that the future-state distribution $\rho P^{\pi_1} P^{\pi_2} \cdots P^{\pi_m}$ is absolutely continuous w.r.t. $\nu$. Assume that

$$c(m) \stackrel{\text{def}}{=} \sup_{\pi_1, \ldots, \pi_m} \left\| \frac{d(\rho P^{\pi_1} P^{\pi_2} \cdots P^{\pi_m})}{d\nu} \right\|_\infty$$

satisfies $\sum_{m \ge 1} m \gamma^{m-1} c(m) < +\infty$. We shall call

$$C_{\rho,\nu} \stackrel{\text{def}}{=} \max\left\{ (1-\gamma)^2 \sum_{m \ge 1} m \gamma^{m-1} c(m), \; (1-\gamma) \sum_{m \ge 1} \gamma^m c(m) \right\}$$

the discounted-average concentrability coefficient of the future-state distributions.

The number $c(m)$ measures how much $\rho$ can get amplified in $m$ steps as compared to the reference distribution $\nu$. Hence, in general we expect $c(m)$ to grow with $m$. In fact, the condition that $C_{\rho,\nu}$ is finite is a growth rate condition on $c(m)$. Thanks to discounting, $C_{\rho,\nu}$ is finite for a reasonably large class of systems (see the discussion in [11]).

A related assumption is needed in the error analysis of the approximate greedy step of the algorithm:

Assumption A5 (The random policy “makes no peak-states”) Consider the distribution $\mu = (\nu \times \lambda_{\mathcal{A}})P$ which is the distribution of a state that results from sampling an initial state according to $\nu$ and then executing an action which is selected uniformly at random (footnote 6). Then $\Gamma_\nu = \|d\mu/d\nu\|_\infty < +\infty$.

Note that under Assumption A3 we have $\Gamma_\nu \le C_\nu$.
This (very mild) assumption means that after one step, starting from $\nu$ and executing this random policy, the probability of the next state being in a set is upper bounded by $\Gamma_\nu$ times the probability of the starting state being in the same set.

Besides, we assume that $\mathcal{A}$ has the following regularity property: Let

$$\mathrm{Py}(a, h, \rho) \stackrel{\text{def}}{=} \left\{ (a', v) \in \mathbb{R}^{d_A + 1} : \|a - a'\|_1 \le \rho, \; 0 \le v/h \le 1 - \|a - a'\|_1/\rho \right\}$$

denote the pyramid with height $h$ and base given by the $\ell^1$-ball $B(a, \rho) \stackrel{\text{def}}{=} \{ a' \in \mathbb{R}^{d_A} : \|a - a'\|_1 \le \rho \}$ centered at $a$.

Assumption A6 (Regularity of the action space) We assume that there exists $\alpha > 0$, such that for all $a \in \mathcal{A}$, for all $\rho > 0$,

$$\frac{\lambda(\mathrm{Py}(a, 1, \rho) \cap (\mathcal{A} \times \mathbb{R}))}{\lambda(\mathrm{Py}(a, 1, \rho))} \ge \min\left( \alpha, \; \frac{\lambda(\mathcal{A})}{\lambda(B(a, \rho))} \right).$$

For example, if $\mathcal{A}$ is an $\ell^1$-ball itself, then this assumption will be satisfied with $\alpha = 2^{-d_A}$.

Without assuming any smoothness of the MDP, learning in infinite MDPs looks hard (see, e.g., [12, 13]). Here we employ the following extra condition:

Assumption A7 (Lipschitzness of the MDP in the actions) Assume that the transition probabilities and rewards are Lipschitz w.r.t. their action variable, i.e., there exist $L_P, L_r > 0$ such that for all $(x, a, a') \in \mathcal{X} \times \mathcal{A} \times \mathcal{A}$ and every measurable set $B$ of $\mathcal{X}$,

$$|P(B|x,a) - P(B|x,a')| \le L_P \|a - a'\|_1, \qquad |r(x,a) - r(x,a')| \le L_r \|a - a'\|_1.$$

Note that previously Lipschitzness w.r.t. the state variables was used, e.g., in [11] to construct consistent planning algorithms.

3.2.3 Assumptions on the function sets used by the algorithm

These assumptions are less demanding since they are under the control of the user of the algorithm. However, the choice of these function sets will greatly influence the performance of the algorithm, as we shall see from the bounds. The first assumption concerns the class $\mathcal{F}$:

Assumption A8 (Lipschitzness of candidate action-value functions) Assume $\mathcal{F} \subset B(\mathcal{X} \times \mathcal{A})$ and that every element of $\mathcal{F}$ is uniformly Lipschitz in its action argument in the sense that $|Q(x,a) - Q(x,a')| \le L_A \|a - a'\|_1$ holds for any $x \in \mathcal{X}$, $a, a' \in \mathcal{A}$, and $Q \in \mathcal{F}$.
Footnote 6: Remember that $\lambda_{\mathcal{A}}$ denotes the uniform distribution over the action set $\mathcal{A}$.

We shall also need to control the capacity of our function sets. We assume that the reader is familiar with the concept of VC-dimension (footnote 7). Here we use the pseudo-dimension of function sets that builds upon the concept of VC-dimension:

Definition 3.1 (Pseudo-dimension). The pseudo-dimension $V_{\mathcal{F}^+}$ of $\mathcal{F}$ is defined as the VC-dimension of the subgraphs of functions in $\mathcal{F}$ (hence it is also called the VC-subgraph dimension of $\mathcal{F}$).

Since $\mathcal{A}$ is multidimensional, we define $V_{\Pi^+}$ to be the sum of the pseudo-dimensions of the coordinate projection spaces, $\Pi_k$, of $\Pi$:

$$V_{\Pi^+} = \sum_{k=1}^{d_A} V_{\Pi_k^+}, \qquad \Pi_k = \left\{ \pi_k : \mathcal{X} \to \mathbb{R} \; : \; \pi = (\pi_1, \ldots, \pi_k, \ldots, \pi_{d_A}) \in \Pi \right\}.$$

Now we are ready to state our assumptions on our function sets:

Assumption A9 (Capacity of the function and policy sets) Assume that $\mathcal{F} \subset B(\mathcal{X} \times \mathcal{A}; Q_{\max})$ for $Q_{\max} > 0$ and $V_{\mathcal{F}^+} < +\infty$. Also, $\mathcal{A} \subset [-A_\infty, A_\infty]^{d_A}$ and $V_{\Pi^+} < +\infty$.

Besides their capacity, one shall also control the approximation power of the function sets involved. Let us first consider the policy set $\Pi$. Introduce

$$e^*(\mathcal{F}, \Pi) = \sup_{Q \in \mathcal{F}} \inf_{\pi \in \Pi} \nu(EQ - E^\pi Q).$$

Note that $\inf_{\pi \in \Pi} \nu(EQ - E^\pi Q)$ measures the quality of approximating $\nu EQ$ by $\nu E^\pi Q$. Hence, $e^*(\mathcal{F}, \Pi)$ measures the worst-case approximation error of $\nu EQ$ as $Q$ is changed within $\mathcal{F}$. This can be made small by choosing $\Pi$ large.

Another related quantity is the one-step Bellman-error of $\mathcal{F}$ w.r.t. $\Pi$. This is defined as follows: For a fixed policy $\pi$, the one-step Bellman-error of $\mathcal{F}$ w.r.t. $T^\pi$ is defined as

$$E_1(\mathcal{F}; \pi) = \sup_{Q \in \mathcal{F}} \inf_{Q' \in \mathcal{F}} \|Q' - T^\pi Q\|_\nu.$$

Taking again a pessimistic approach, the one-step Bellman-error of $\mathcal{F}$ is defined as

$$E_1(\mathcal{F}, \Pi) = \sup_{\pi \in \Pi} E_1(\mathcal{F}; \pi).$$

Typically by increasing $\mathcal{F}$, $E_1(\mathcal{F}, \Pi)$ can be made smaller (this is discussed at some length in [3]). However, it also holds for both $\Pi$ and $\mathcal{F}$ that making them bigger will increase their capacity (pseudo-dimensions) which leads to an increase of the estimation errors.
Hence, $\mathcal{F}$ and $\Pi$ must be selected to balance the approximation and estimation errors, just like in supervised learning.

3.3 The main result

Theorem 3.2. Let $\pi_K$ be a greedy policy w.r.t. $Q_K$, i.e. $\pi_K(x) \in \operatorname{argmax}_{a \in \mathcal{A}} Q_K(x,a)$. Then under Assumptions A1, A2, and A5–A9, for all $\delta > 0$ we have with probability at least $1 - \delta$: given Assumption A3 (respectively A4), $\|V^* - V^{\pi_K}\|_\infty$ (resp. $\|V^* - V^{\pi_K}\|_{1,\rho}$) is bounded by

$$C \left( E_1(\mathcal{F}, \Pi) + e^*(\mathcal{F}, \Pi) + \frac{(\log N + \log(K/\delta))^{\frac{\kappa+1}{4\kappa}}}{N^{1/4}} \right)^{\frac{1}{d_A + 1}} + \gamma^K,$$

where $C$ depends on $d_A$, $V_{\mathcal{F}^+}$, $(V_{\Pi_k^+})_{k=1}^{d_A}$, $\gamma$, $\kappa$, $b$, $\beta$, $C_\nu$ (resp. $C_{\rho,\nu}$), $\Gamma_\nu$, $L_A$, $L_P$, $L_r$, $\alpha$, $\lambda(\mathcal{A})$, $\pi_0$, $Q_{\max}$, $R_{\max}$, $\hat{R}_{\max}$, and $A_\infty$. In particular, $C$ scales with $V^{\frac{\kappa+1}{4\kappa(d_A+1)}}$, where $V = 2V_{\mathcal{F}^+} + V_{\Pi^+}$ plays the role of the “combined effective” dimension of $\mathcal{F}$ and $\Pi$.

Footnote 7: Readers not familiar with VC-dimension are advised to consult a book, such as the one by Anthony and Bartlett [14].

4 Discussion

We have presented what we believe are the first finite-time bounds for continuous-state and action-space RL that uses value functions. Further, this is the first analysis of fitted Q-iteration, an algorithm that has proved to be useful in a number of cases, even when used with non-averagers for which no previous theoretical analysis existed (e.g., [15, 16]). In fact, our main motivation was to show that there is a systematic way of making these algorithms work and to point at possible problem sources at the same time. We discussed why it can be difficult to make these algorithms work in practice. We suggested that either the set of action-value candidates has to be carefully controlled (e.g., assuming uniform Lipschitzness w.r.t. the state variables), or a policy search step is needed, just like in actor-critic algorithms. The bound in this paper is similar in many respects to a previous bound for a Bellman-residual minimization algorithm [2]. It appears that the techniques developed here can be used to obtain results for that algorithm when it is applied to continuous action spaces.
Finally, although we have not explored them here, consistency results for FQI can be obtained from our results using standard methods, like the method of sieves. We believe that the methods developed here will eventually lead to algorithms where the function approximation methods are chosen based on the data (similar to adaptive regression methods) so as to optimize performance, which in our opinion is one of the biggest open questions in RL. Currently we are exploring this possibility.

Acknowledgments

András Antos would like to acknowledge support for this project from the Hungarian Academy of Sciences (Bolyai Fellowship). Csaba Szepesvári gratefully acknowledges the support received from the Alberta Ingenuity Fund, NSERC, and the Computer and Automation Research Institute of the Hungarian Academy of Sciences.

References

[1] A. Antos, Cs. Szepesvári, and R. Munos. Learning near-optimal policies with Bellman-residual minimization based fitted policy iteration and a single sample path. In COLT-19, pages 574–588, 2006.
[2] A. Antos, Cs. Szepesvári, and R. Munos. Learning near-optimal policies with Bellman-residual minimization based fitted policy iteration and a single sample path. Machine Learning, 2007. (accepted).
[3] A. Antos, Cs. Szepesvári, and R. Munos. Value-iteration based fitted policy iteration: learning with a single trajectory. In IEEE ADPRL, pages 330–337, 2007.
[4] D. P. Bertsekas and S. E. Shreve. Stochastic Optimal Control (The Discrete Time Case). Academic Press, New York, 1978.
[5] D. Ernst, P. Geurts, and L. Wehenkel. Tree-based batch mode reinforcement learning. Journal of Machine Learning Research, 6:503–556, 2005.
[6] R. S. Sutton and A. G. Barto. Reinforcement Learning: An Introduction. Bradford Book. MIT Press, 1998.
[7] N. Cristianini and J. Shawe-Taylor. An introduction to support vector machines (and other kernel-based learning methods). Cambridge University Press, 2000.
[8] J. A. Boyan and A. W. Moore. Generalization in reinforcement learning: Safely approximating the value function. In NIPS-7, pages 369–376, 1995.
[9] P. L. Bartlett, P. M. Long, and R. C. Williamson. Fat-shattering and the learnability of real-valued functions. Journal of Computer and System Sciences, 52:434–452, 1996.
[10] A. N. Kolmogorov and V. M. Tihomirov. ε-entropy and ε-capacity of sets in functional space. American Mathematical Society Translations, 17(2):277–364, 1961.
[11] R. Munos and Cs. Szepesvári. Finite time bounds for sampling based fitted value iteration. Technical report, Computer and Automation Research Institute of the Hungarian Academy of Sciences, Kende u. 13-17, Budapest 1111, Hungary, 2006.
[12] A. Y. Ng and M. Jordan. PEGASUS: A policy search method for large MDPs and POMDPs. In Proceedings of the 16th Conference in Uncertainty in Artificial Intelligence, pages 406–415, 2000.
[13] P. L. Bartlett and A. Tewari. Sample complexity of policy search with known dynamics. In NIPS-19. MIT Press, 2007.
[14] M. Anthony and P. L. Bartlett. Neural Network Learning: Theoretical Foundations. Cambridge University Press, 1999.
[15] M. Riedmiller. Neural fitted Q iteration – first experiences with a data efficient neural reinforcement learning method. In 16th European Conference on Machine Learning, pages 317–328, 2005.
[16] S. Kalyanakrishnan and P. Stone. Batch reinforcement learning in a complex domain. In AAMAS-07, 2007.
5 0.62486774 52 nips-2007-Competition Adds Complexity
Author: Judy Goldsmith, Martin Mundhenk
Abstract: It is known that determining whether a DEC-POMDP, namely, a cooperative partially observable stochastic game (POSG), has a cooperative strategy with positive expected reward is complete for NEXP. It was not known until now how cooperation affected that complexity. We show that, for competitive POSGs, the complexity of determining whether one team has a positive-expected-reward strategy is complete for $\mathrm{NEXP}^{\mathrm{NP}}$.
6 0.60159278 162 nips-2007-Random Sampling of States in Dynamic Programming
7 0.58907944 148 nips-2007-Online Linear Regression and Its Application to Model-Based Reinforcement Learning
8 0.56750864 34 nips-2007-Bayesian Policy Learning with Trans-Dimensional MCMC
9 0.56737083 102 nips-2007-Incremental Natural Actor-Critic Algorithms
10 0.55315518 185 nips-2007-Stable Dual Dynamic Programming
11 0.55085224 168 nips-2007-Reinforcement Learning in Continuous Action Spaces through Sequential Monte Carlo Methods
12 0.42196137 163 nips-2007-Receding Horizon Differential Dynamic Programming
13 0.42001 151 nips-2007-Optimistic Linear Programming gives Logarithmic Regret for Irreducible MDPs
14 0.36934495 191 nips-2007-Temporal Difference Updating without a Learning Rate
15 0.33673355 55 nips-2007-Computing Robust Counter-Strategies
16 0.28588024 165 nips-2007-Regret Minimization in Games with Incomplete Information
17 0.28391168 194 nips-2007-The Epoch-Greedy Algorithm for Multi-armed Bandits with Side Information
18 0.26259434 176 nips-2007-Sequential Hypothesis Testing under Stochastic Deadlines
19 0.2510561 86 nips-2007-Exponential Family Predictive Representations of State
20 0.24872233 215 nips-2007-What makes some POMDP problems easy to approximate?
topicId topicWeight
[(5, 0.031), (13, 0.064), (16, 0.011), (21, 0.068), (29, 0.012), (31, 0.024), (34, 0.038), (35, 0.034), (46, 0.033), (47, 0.122), (49, 0.012), (83, 0.093), (84, 0.255), (85, 0.026), (87, 0.01), (90, 0.063)]
simIndex simValue paperId paperTitle
same-paper 1 0.72313082 5 nips-2007-A Game-Theoretic Approach to Apprenticeship Learning
Author: Umar Syed, Robert E. Schapire
Abstract: We study the problem of an apprentice learning to behave in an environment with an unknown reward function by observing the behavior of an expert. We follow on the work of Abbeel and Ng [1] who considered a framework in which the true reward function is assumed to be a linear combination of a set of known and observable features. We give a new algorithm that, like theirs, is guaranteed to learn a policy that is nearly as good as the expert’s, given enough examples. However, unlike their algorithm, we show that ours may produce a policy that is substantially better than the expert’s. Moreover, our algorithm is computationally faster, is easier to implement, and can be applied even in the absence of an expert. The method is based on a game-theoretic view of the problem, which leads naturally to a direct application of the multiplicative-weights algorithm of Freund and Schapire [2] for playing repeated matrix games. In addition to our formal presentation and analysis of the new algorithm, we sketch how the method can be applied when the transition function itself is unknown, and we provide an experimental demonstration of the algorithm on a toy video-game environment.
2 0.574705 86 nips-2007-Exponential Family Predictive Representations of State
Author: David Wingate, Satinder S. Baveja
Abstract: In order to represent state in controlled, partially observable, stochastic dynamical systems, some sort of sufficient statistic for history is necessary. Predictive representations of state (PSRs) capture state as statistics of the future. We introduce a new model of such systems called the “Exponential family PSR,” which defines as state the time-varying parameters of an exponential family distribution which models n sequential observations in the future. This choice of state representation explicitly connects PSRs to state-of-the-art probabilistic modeling, which allows us to take advantage of current efforts in high-dimensional density estimation, and in particular, graphical models and maximum entropy models. We present a parameter learning algorithm based on maximum likelihood, and we show how a variety of current approximate inference methods apply. We evaluate the quality of our model with reinforcement learning by directly evaluating the control performance of the model.
3 0.57427233 168 nips-2007-Reinforcement Learning in Continuous Action Spaces through Sequential Monte Carlo Methods
Author: Alessandro Lazaric, Marcello Restelli, Andrea Bonarini
Abstract: Learning in real-world domains often requires dealing with continuous state and action spaces. Although many solutions have been proposed to apply Reinforcement Learning algorithms to continuous state problems, the same techniques can hardly be extended to continuous action spaces, where, besides the computation of a good approximation of the value function, a fast method for the identification of the highest-valued action is needed. In this paper, we propose a novel actor-critic approach in which the policy of the actor is estimated through sequential Monte Carlo methods. The importance sampling step is performed on the basis of the values learned by the critic, while the resampling step modifies the actor’s policy. The proposed approach has been empirically compared to other learning algorithms in several domains; in this paper, we report results obtained in a control problem consisting of steering a boat across a river.
4 0.56730002 63 nips-2007-Convex Relaxations of Latent Variable Training
Author: Yuhong Guo, Dale Schuurmans
Abstract: We investigate a new, convex relaxation of an expectation-maximization (EM) variant that approximates a standard objective while eliminating local minima. First, a cautionary result is presented, showing that any convex relaxation of EM over hidden variables must give trivial results if any dependence on the missing values is retained. Although this appears to be a strong negative outcome, we then demonstrate how the problem can be bypassed by using equivalence relations instead of value assignments over hidden variables. In particular, we develop new algorithms for estimating exponential conditional models that only require equivalence relation information over the variable values. This reformulation leads to an exact expression for EM variants in a wide range of problems. We then develop a semidefinite relaxation that yields global training by eliminating local minima.
5 0.56462687 100 nips-2007-Hippocampal Contributions to Control: The Third Way
Author: Máté Lengyel, Peter Dayan
Abstract: Recent experimental studies have focused on the specialization of different neural structures for different types of instrumental behavior. Recent theoretical work has provided normative accounts for why there should be more than one control system, and how the output of different controllers can be integrated. Two particular controllers have been identified, one associated with a forward model and the prefrontal cortex and a second associated with computationally simpler, habitual, actor-critic methods and part of the striatum. We argue here for the normative appropriateness of an additional, but so far marginalized control system, associated with episodic memory, and involving the hippocampus and medial temporal cortices. We analyze in depth a class of simple environments to show that episodic control should be useful in a range of cases characterized by complexity and inferential noise, and most particularly at the very early stages of learning, long before habitization has set in. We interpret data on the transfer of control from the hippocampus to the striatum in the light of this hypothesis.
6 0.56330806 148 nips-2007-Online Linear Regression and Its Application to Model-Based Reinforcement Learning
7 0.56280488 34 nips-2007-Bayesian Policy Learning with Trans-Dimensional MCMC
8 0.56135547 18 nips-2007-A probabilistic model for generating realistic lip movements from speech
9 0.5570299 138 nips-2007-Near-Maximum Entropy Models for Binary Neural Representations of Natural Images
10 0.55648804 93 nips-2007-GRIFT: A graphical model for inferring visual classification features from human data
11 0.55641085 158 nips-2007-Probabilistic Matrix Factorization
12 0.55635971 69 nips-2007-Discriminative Batch Mode Active Learning
13 0.55577797 84 nips-2007-Expectation Maximization and Posterior Constraints
14 0.55390054 169 nips-2007-Retrieved context and the discovery of semantic structure
15 0.55239689 105 nips-2007-Infinite State Bayes-Nets for Structured Domains
16 0.55158389 122 nips-2007-Locality and low-dimensions in the prediction of natural experience from fMRI
17 0.55129665 45 nips-2007-Classification via Minimum Incremental Coding Length (MICL)
18 0.55118656 134 nips-2007-Multi-Task Learning via Conic Programming
19 0.55079699 24 nips-2007-An Analysis of Inference with the Universum
20 0.55053782 153 nips-2007-People Tracking with the Laplacian Eigenmaps Latent Variable Model