nips nips2010 nips2010-78 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Amir-massoud Farahmand, Csaba Szepesvári, Rémi Munos
Abstract: We address the question of how the approximation error/Bellman residual at each iteration of the Approximate Policy/Value Iteration algorithms influences the quality of the resulted policy. We quantify the performance loss as the Lp norm of the approximation error/Bellman residual at each iteration. Moreover, we show that the performance loss depends on the expectation of the squared Radon-Nikodym derivative of a certain distribution rather than its supremum – as opposed to what has been suggested by the previous results. Also our results indicate that the contribution of the approximation/Bellman error to the performance loss is more prominent in the later iterations of API/AVI, and the effect of an error term in the earlier iterations decays exponentially fast. 1
Reference: text
sentIndex sentText sentNum sentScore
1 Error Propagation for Approximate Policy and Value Iteration Rémi Munos, Sequel Project, INRIA Lille, Lille, France. remi. [sent-1, score-0.056]
2 ca Csaba Szepesvári∗, Department of Computing Science, University of Alberta, Edmonton, Canada, T6G 2E8. szepesva@ualberta. [sent-4, score-0.049]
3 ca Abstract We address the question of how the approximation error/Bellman residual at each iteration of the Approximate Policy/Value Iteration algorithms influences the quality of the resulted policy. [sent-5, score-0.244]
4 We quantify the performance loss as the Lp norm of the approximation error/Bellman residual at each iteration. [sent-6, score-0.233]
5 Moreover, we show that the performance loss depends on the expectation of the squared Radon-Nikodym derivative of a certain distribution rather than its supremum – as opposed to what has been suggested by the previous results. [sent-7, score-0.126]
6 Also our results indicate that the contribution of the approximation/Bellman error to the performance loss is more prominent in the later iterations of API/AVI, and the effect of an error term in the earlier iterations decays exponentially fast. [sent-8, score-0.236]
7 1 Introduction The exact solution of reinforcement learning (RL) and planning problems with large state spaces is difficult or impossible to obtain, so one usually has to aim for approximate solutions. [sent-9, score-0.124]
8 AVI starts from an initial value function V0 (or Q0), and iteratively applies an approximation of T∗, the Bellman optimality operator (or Tπ for the policy evaluation problem), to the previous estimate, i. [sent-12, score-0.47]
9 In general, Vk+1 is not equal to T∗Vk because (1) we do not have direct access to the Bellman operator but only some samples from it, and (2) the function space to which V belongs is not representative enough. [sent-15, score-0.069]
10 Thus there is an approximation error εk = T∗Vk − Vk+1 between the result of exact VI and that of AVI. [sent-16, score-0.081]
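To make this loop concrete, here is a minimal sketch (not the paper's setup) of AVI on a small synthetic MDP: the exact backup T∗Vk is computed, and the approximation error εk is simulated as additive noise; the MDP size, rewards, and noise level are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Small synthetic MDP (assumed sizes and random dynamics, for illustration only).
nS, nA, gamma = 20, 2, 0.95
P = rng.dirichlet(np.ones(nS), size=(nS, nA))   # P[s, a] is a distribution over next states
R = rng.uniform(0.0, 1.0, size=(nS, nA))        # expected immediate rewards r(s, a)

def bellman_optimality(V):
    """Exact backup (T* V)(s) = max_a [ r(s,a) + gamma * sum_s' P(s'|s,a) V(s') ]."""
    return np.max(R + gamma * P @ V, axis=1)

V = np.zeros(nS)
for k in range(50):
    target = bellman_optimality(V)
    # AVI: V_{k+1} only approximates T* V_k; the gap eps_k = T* V_k - V_{k+1}
    # is simulated here as additive noise.
    eps = rng.normal(0.0, 0.05, size=nS)
    V = target - eps

greedy = np.argmax(R + gamma * P @ V, axis=1)
print("approximate V (first 5 states):", np.round(V[:5], 3))
print("greedy policy (first 5 states):", greedy[:5])
```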
11 See the work of Munos and Szepesvári [4] for more information about AVI. [sent-20, score-0.049]
12 ∗ Csaba Szepesvári is on leave from MTA SZTAKI. [sent-21, score-0.049]
13 API is another iterative algorithm to find an approximate solution to the fixed point of the Bellman optimality operator. [sent-24, score-0.069]
14 It starts from a policy π0 , and then approximately evaluates that policy π0 , i. [sent-25, score-0.79]
15 Afterwards, it performs a policy improvement step, which is to calculate the greedy policy with respect to (w. [sent-28, score-0.872]
16 ) the most recent action-value function, to get a new policy π1 , i. [sent-31, score-0.395]
17 The policy iteration algorithm continues by approximately evaluating the newly obtained policy π1 to get Q1 and repeating the whole process again, generating a sequence of policies and their corresponding approximate action-value functions Q0 → π1 → Q1 → π2 → · · · . [sent-34, score-1.006]
18 As in AVI, we may encounter a difference between the approximate solution Qk (TπkQk ≈ Qk) and the true value of the policy, Qπk, which is the solution of the fixed-point equation TπkQπk = Qπk. [sent-35, score-0.425]
19 Two convenient ways to describe this error are the Bellman residual of Qk (εk = Qk − TπkQk) and the policy evaluation approximation error (εk = Qk − Qπk). [sent-36, score-0.61]
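For comparison with the AVI sketch above, here is a minimal sketch of the API loop on the same kind of synthetic MDP (again an illustration under assumed dynamics, not the paper's setup), with the approximation error εAEk simulated as noise added to the exact Qπk.

```python
import numpy as np

rng = np.random.default_rng(1)

# Small synthetic MDP (assumed sizes and random dynamics, for illustration only).
nS, nA, gamma = 20, 2, 0.95
P = rng.dirichlet(np.ones(nS), size=(nS, nA))   # P(s'|s,a)
R = rng.uniform(0.0, 1.0, size=(nS, nA))        # r(s,a)

def q_of_policy(pi):
    """Exact Q^pi obtained by solving the linear system (I - gamma P^pi) V = r^pi."""
    P_pi = P[np.arange(nS), pi]                  # (nS, nS) transition matrix under pi
    r_pi = R[np.arange(nS), pi]
    V_pi = np.linalg.solve(np.eye(nS) - gamma * P_pi, r_pi)
    return R + gamma * P @ V_pi                  # Q^pi(s, a)

pi = np.zeros(nS, dtype=int)                     # initial policy pi_0
for k in range(10):
    # Approximate policy evaluation: Q_k = Q^{pi_k} + eps_k^{AE}.
    Q = q_of_policy(pi) + rng.normal(0.0, 0.05, size=(nS, nA))
    pi = np.argmax(Q, axis=1)                    # greedy improvement -> pi_{k+1}

print("policy after 10 API iterations (first 10 states):", pi[:10])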
20 One well-known algorithm is LSPI of Lagoudakis and Parr [5], which combines the Least-Squares Temporal Difference (LSTD) algorithm (Bradtke and Barto [6]) with a policy improvement step. [sent-38, score-0.43]
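As a reminder of what such a policy evaluation step looks like in code, here is a minimal LSTD sketch on simulated transitions; the feature map, the synthetic data, and the small ridge term are illustrative assumptions, not details taken from LSPI [5] or BRM [7].

```python
import numpy as np

rng = np.random.default_rng(2)

# Simulated transitions (s_t, r_t, s_{t+1}) gathered while following some fixed policy;
# the feature vectors below are random placeholders for phi(s_t) and phi(s_{t+1}).
n, d, gamma = 500, 8, 0.95
phi = rng.normal(size=(n, d))
phi_next = rng.normal(size=(n, d))
rewards = rng.uniform(size=n)

# LSTD fixed point: solve A w = b with A = Phi^T (Phi - gamma Phi'), b = Phi^T r.
A = phi.T @ (phi - gamma * phi_next)
b = phi.T @ rewards
w = np.linalg.solve(A + 1e-6 * np.eye(d), b)   # tiny ridge term for numerical stability (assumption)

# Linear value estimate V_hat(s) = w^T phi(s).
print("LSTD weight vector:", np.round(w, 3))
```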
21 Another API method is to use the Bellman Residual Minimization (BRM) and its variants for policy evaluation and iteratively apply the policy improvement step (Antos et al. [sent-39, score-0.856]
22 A crucial question in the applicability of API/AVI, which is the main topic of this work, is to understand how either the approximation error or the Bellman residual at each iteration of API or AVI affects the quality of the resulting policy. [sent-49, score-0.289]
23 Suppose we run API/AVI for K iterations to obtain a policy πK . [sent-50, score-0.445]
24 If so, how do the errors that occur at a certain iteration k propagate through the iterations of API/AVI and affect the final performance loss? [sent-52, score-0.205]
25 2 of Bertsekas and Tsitsiklis [16] shows that for API applied to a finite MDP, we have lim supk→∞ ‖V∗ − Vπk‖∞ ≤ (2γ/(1−γ)²) lim supk→∞ ‖Vπk − Vk‖∞, where γ is the discount factor. [sent-55, score-0.086]
26 Similarly for AVI, if the approximation errors are uniformly bounded (‖T∗Vk − Vk+1‖∞ ≤ ε), we have lim supk→∞ ‖V∗ − Vπk‖∞ ≤ (2γ/(1−γ)²) ε (Munos [17]). [sent-56, score-0.124]
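For a sense of scale (simple arithmetic, not a result from the paper), the multiplicative constant 2γ/(1−γ)² in these sup-norm bounds blows up quickly as γ approaches 1:

```python
# The constant 2*gamma / (1 - gamma)^2 appearing in the classical sup-norm bounds.
for gamma in (0.9, 0.95, 0.99):
    print(f"gamma = {gamma}: 2*gamma/(1-gamma)^2 = {2 * gamma / (1 - gamma) ** 2:.0f}")
# gamma = 0.9  ->   180
# gamma = 0.95 ->   760
# gamma = 0.99 -> 19800
```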
27 One reason is that they are expressed as the supremum norm of the approximation errors ‖Vπk − Vk‖∞ or the Bellman error ‖Qk − TπkQk‖∞. [sent-58, score-0.268]
28 Compared to Lp norms, the supremum norm is conservative. [sent-59, score-0.142]
29 It is quite possible that the result of a learning algorithm has a small Lp norm but a very large L∞ norm. [sent-60, score-0.062]
30 Therefore, it is desirable to have a result expressed in terms of the Lp norm of the approximation/Bellman residual εk. [sent-61, score-0.151]
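A toy numerical illustration of this gap (not from the paper, with made-up numbers): an error function that is zero everywhere except for one tall spike on a low-probability state has a tiny L2(ν) norm but a huge L∞ norm.

```python
import numpy as np

N = 10_000
nu = np.full(N, 1.0 / N)       # uniform measure over N discrete states

eps = np.zeros(N)
eps[0] = 100.0                 # a single tall spike

sup_norm = np.max(np.abs(eps))                  # L_infinity norm
l2_nu_norm = np.sqrt(np.sum(nu * eps ** 2))     # L_2(nu) norm

print(f"sup norm = {sup_norm:.1f}, L2(nu) norm = {l2_nu_norm:.1f}")
# sup norm = 100.0 while L2(nu) norm = 1.0: a sup-norm bound is far more pessimistic here.
```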
31 In the past couple of years, there have been attempts to extend L∞ norm results to Lp ones [18, 17, 7]. [sent-62, score-0.062]
32 A mapping π : X → A is called a deterministic Markov stationary policy, or just a policy for short. [sent-70, score-0.427]
33 Following a policy π in an MDP means that at each time step At = π(Xt ). [sent-71, score-0.395]
34 Upon taking action At at Xt, we receive reward Rt ∼ R(·|Xt, At), and the Markov chain evolves according to Xt+1 ∼ P(·|Xt, At). [sent-72, score-0.044]
35 We denote the probability transition kernel of following a policy π by P π , i. [sent-73, score-0.395]
36 The value function Vπ for a policy π is defined as Vπ(x) ≜ E[ Σ∞t=0 γ^t Rt | X0 = x ] and the action-value function is defined as Qπ(x, a) ≜ E[ Σ∞t=0 γ^t Rt | X0 = x, A0 = a ]. [sent-76, score-0.395]
37 We say that a policy π∗ is optimal if it achieves the best value in every state, i. [sent-78, score-0.395]
38 Greedy policies are important because a greedy policy w. [sent-89, score-0.512]
39 We define the Bellman operator for a policy π as (TπV)(x) ≜ r(x, π(x)) + γ ∫ V(x′) P(dx′|x, π(x)) and (TπQ)(x, a) ≜ r(x, a) + γ ∫ Q(x′, π(x′)) P(dx′|x, a). [sent-96, score-0.436]
40 Similarly, the Bellman optimality operator is defined as (T∗V)(x) ≜ maxa { r(x, a) + γ ∫ V(x′) P(dx′|x, a) } and (T∗Q)(x, a) ≜ r(x, a) + γ ∫ maxa′ Q(x′, a′) P(dx′|x, a). [sent-97, score-0.194]
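For a finite MDP these operators are simple matrix expressions; the sketch below (with assumed sizes and random dynamics, for illustration only) mirrors the two definitions above.

```python
import numpy as np

rng = np.random.default_rng(3)
nS, nA, gamma = 5, 3, 0.9
P = rng.dirichlet(np.ones(nS), size=(nS, nA))   # P[s, a, s'] = P(s'|s,a)
r = rng.uniform(size=(nS, nA))                  # r(s, a)

def T_pi(Q, pi):
    """(T^pi Q)(s,a) = r(s,a) + gamma * sum_s' P(s'|s,a) Q(s', pi(s'))."""
    return r + gamma * P @ Q[np.arange(nS), pi]

def T_star(Q):
    """(T* Q)(s,a) = r(s,a) + gamma * sum_s' P(s'|s,a) max_a' Q(s', a')."""
    return r + gamma * P @ Q.max(axis=1)

Q0 = np.zeros((nS, nA))
pi0 = np.zeros(nS, dtype=int)
print("T^pi Q0:\n", np.round(T_pi(Q0, pi0), 3))
print("T* Q0:\n", np.round(T_star(Q0), 3))
```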
41 For a measurable space X , with a σ-algebra σX , we define M(X ) as the set of all probability measures over σX . [sent-98, score-0.055]
42 In what follows we shall use ‖V‖p,ν to denote the Lp(ν)-norm of a measurable function V : X → R: ‖V‖p,ν ≜ ( ∫ |V(x)|^p dν(x) )^{1/p}. [sent-101, score-0.055]
43 3 Approximate Policy Iteration Consider the API procedure and the sequence Q0 → π1 → Q1 → π2 → · · · → QK−1 → πK, where πk is the greedy policy w. [sent-103, score-0.481]
44 Qk−1 and Qk is the approximate action-value function for policy πk . [sent-106, score-0.425]
45 For the sequence {Qk}K−1k=0, denote the Bellman Residual (BR) and policy Approximation Error (AE) at each iteration by εBRk = Qk − TπkQk (1) and εAEk = Qk − Qπk (2). [sent-107, score-0.511]
46 The goal of this section is to study the effect of the ν-weighted L2p norm of the Bellman residual sequence {εBRk}K−1k=0 or the policy evaluation approximation error sequence {εAEk}K−1k=0 on the performance loss ‖Q∗ − QπK‖p,ρ of the outcome policy πK. [sent-108, score-1.146]
47 The choice of ρ and ν is arbitrary; however, a natural choice for ν is the sampling distribution of the data, which is used by the policy evaluation module. [sent-109, score-0.426]
48 Because of the dynamical nature of the MDP, the performance loss ‖Q∗ − QπK‖p,ρ depends on the difference between the sampling distribution ν and future-state distributions of the form ρPπ1Pπ2 · · ·. [sent-112, score-0.077]
49 Before stating the results, we need to define the following concentrability coefficients. [sent-114, score-0.305]
50 Given ρ, ν ∈ M(X), ν ≪ λ (λ is the Lebesgue measure), m ≥ 0, and an arbitrary sequence of stationary policies {πm}m≥1, let ρPπ1Pπ2 . [sent-116, score-0.141]
51 P πm ∈ M(X ) denote the future-state distribution obtained when the first state is distributed according to ρ and then we follow the sequence of policies {πk }m . [sent-119, score-0.14]
52 with the understanding that if the future-state distribution ρ(Pπ∗)^{m1}(Pπ)^{m2} (or ρ(Pπ∗)^{m1}(Pπ1)^{m2}Pπ2 or ρPπ∗) is not absolutely continuous w. [sent-121, score-0.072]
53 Also define the following concentrability coefficient that is used in the AVI analysis: cVI,ρ,ν(m1, m2; π) ≜ ( EX∼ν [ | d(ρ(Pπ∗)^{m1}(Pπ)^{m2})/dν (X) |² ] )^{1/2}, with the understanding that if the future-state distribution ρ(Pπ∗)^{m1}(Pπ)^{m2} is not absolutely continuous w. [sent-125, score-0.377]
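For discrete distributions both notions can be computed directly; the following sketch (with made-up distributions standing in for the future-state distribution and for ν) contrasts the supremum of the Radon-Nikodym derivative with the square root of its squared expectation, the quantity used in cVI,ρ,ν above.

```python
import numpy as np

# Illustrative discrete distributions over N states (not from the paper):
# mu stands in for a future-state distribution such as rho (P^{pi*})^{m1} (P^pi)^{m2},
# nu is the sampling distribution.
N = 1000
nu = np.full(N, 1.0 / N)          # uniform sampling distribution
mu = np.full(N, 0.2 / (N - 1))
mu[0] = 0.8                       # most of the future-state mass sits on one state

ratio = mu / nu                   # Radon-Nikodym derivative d(mu)/d(nu), state by state

sup_based = ratio.max()                          # supremum-based coefficient
exp_based = np.sqrt(np.sum(nu * ratio ** 2))     # ( E_{X~nu}[ |d(mu)/d(nu)(X)|^2 ] )^{1/2}

print(f"sup-based coefficient:         {sup_based:.1f}")   # ~800
print(f"expectation-based coefficient: {exp_based:.1f}")   # ~25
```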
54 Then for any sequence {Qk}K−1k=0 ⊂ B(X × A, Qmax) (the space of Qmax-bounded measurable functions defined on X × A) and the corresponding sequence {εk}K−1k=0 defined in (1) or (2), we have ‖Q∗ − QπK‖p,ρ ≤ 2γ/(1 − γ)² [ · · · ], where E(ε0, . [sent-132, score-0.133]
55 (a) If εk = εBRk for all 0 ≤ k < K, we have CPI(BR),ρ,ν(K; r) = ((1−γ)/2)² supπ0,. [sent-140, score-0.064]
56 (b) If εk = εAEk for all 0 ≤ k < K, we have CPI(AE),ρ,ν(K; r, s) = ((1−γ)/2)² supπ0,. [sent-144, score-0.064]
57 Denote the approximation error caused at each iteration by εk = T∗Vk − Vk+1 (3). [sent-150, score-0.158]
58 The goal of this section is to analyze the AVI procedure and to relate the approximation error sequence {εk}K−1k=0 to the performance loss ‖V∗ − VπK‖p,ρ of the obtained policy πK, which is the greedy policy w. [sent-151, score-1.034]
59 Then for any sequence {Vk}K−1k=0 ⊂ B(X, Vmax), and the corresponding sequence {εk}K−1k=0 defined in (3), we have ‖V∗ − VπK‖p,ρ ≤ 2γ/(1 − γ)² [ infr∈[0,1] CVI,ρ,ν(K; r)^{1/(2p)} E^{1/(2p)}(ε0, . [sent-158, score-0.078]
60 , εK−1; r) + (2/(1−γ)) γ^{K/p} Rmax ], where CVI,ρ,ν(K; r) = ((1−γ)/2)² supπ and E(ε0, . [sent-161, score-0.064]
61 1 Lp norm instead of L∞ norm As opposed to most error upper bounds, Theorems 3 and 4 relate ‖V∗ − VπK‖p,ρ to the Lp norm of the approximation or Bellman errors ‖εk‖2p,ν across the iterations of API/AVI. [sent-167, score-0.393]
62 The use of the Lp norm is not only a huge improvement over the conservative supremum norm, but also allows us to benefit, in the context of RL/Planning problems, from the vast literature on supervised learning techniques, which usually provides error upper bounds in the form of Lp norms. [sent-170, score-0.292]
63 This is especially interesting for the case of p = 1, as the performance loss ‖V∗ − VπK‖1,ρ is the difference between the expected return of the optimal policy and that of the resulting policy πK when the initial state distribution is ρ. [sent-171, score-0.909]
64 Conveniently enough, the errors appearing in the upper bound are in the form of ‖εk‖2,ν, which is very common in the supervised learning literature. [sent-172, score-0.045]
65 2 Expected versus supremum concentrability of the future-state distribution The concentrability coefficients (Definition 2) reflect the effect of the future-state distribution on the performance loss ‖V∗ − VπK‖p,ρ. [sent-175, score-0.736]
66 Previously it was thought that the key contributing factor to the performance loss is the supremum of the Radon-Nikodym derivative of these two distributions. [sent-176, score-0.155]
67 Nevertheless, it turns out that the key contributing factor that determines the performance loss is the expectation of the squared Radon-Nikodym derivative instead of its supremum. [sent-178, score-0.075]
68 Intuitively this implies that even if the ratio d(ρ(Pπ)^m)/dν is large on some subset X′ ⊂ X but the probability ν(X′) is very small, the performance loss due to it is still small. [sent-179, score-0.046]
69 As an illustration of this difference, consider a Chain Walk with 1000 states with a single policy that drifts toward state 1 of the chain. [sent-181, score-0.426]
70 The growth and the final value of the expectation-based concentrability coefficient are much smaller than those of the supremum-based one. [sent-189, score-0.305]
71 [Figure: infinity norm-based vs. expectation-based concentrability coefficients; L1 error under the uniform and exponential sampling schedules.] [sent-190, score-0.305]
72 It is easy to show that if the Chain Walk has N states and the policy has the same concentrating behavior and ν is uniform, then ‖d(ρ(Pπ)^m)/dν‖∞ → N, while (EX∼ν |d(ρ(Pπ)^m)/dν|²)^{1/2} → √N when m → ∞. [sent-200, score-0.395]
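A short calculation (not from the paper, under this concentrating-chain assumption) makes the two limits explicit: with ν uniform over the N states and μm = ρ(Pπ)^m putting essentially all of its mass on state 1 as m grows,

```latex
\[
\Big\| \tfrac{d\mu_m}{d\nu} \Big\|_\infty = \max_x N\,\mu_m(x) \xrightarrow[m\to\infty]{} N,
\qquad
\Big( \mathbb{E}_{X\sim\nu}\Big[ \big|\tfrac{d\mu_m}{d\nu}(X)\big|^2 \Big] \Big)^{1/2}
= \Big( \tfrac{1}{N}\sum_{x=1}^{N} \big(N\,\mu_m(x)\big)^2 \Big)^{1/2}
\xrightarrow[m\to\infty]{} \Big( \tfrac{N^2}{N} \Big)^{1/2} = \sqrt{N}.
\]
```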
73 Neglecting all other differences between our results and the previous ones, we get a performance upper bound of the form ‖Q∗ − QπK‖1,ρ ≤ c1(γ) O(N^{1/4}) supk ‖εk‖2,ν, while Proposition 1 implies that ‖Q∗ − QπK‖1,ρ ≤ c2(γ) O(N^{1/2}) supk ‖εk‖2,ν. [sent-206, score-0.284]
74 3 Error decaying property Theorems 3 and 4 show that the dependence of the performance loss ‖V∗ − VπK‖p,ρ (or ‖Q∗ − QπK‖p,ρ) on {εk}K−1k=0 is through the term E(ε0, . [sent-209, score-0.046]
75 , εK−1; r) = ΣK−1k=0 αk^{2r} ‖εk‖2p,ν^{2p}. This has a very special structure in that the approximation errors at later iterations contribute more to the final performance loss. [sent-213, score-0.131]
76 This behavior is obscured in previous results such as [17, 7], where the dependence of the final performance loss is expressed as E(ε0, . [sent-214, score-0.046]
77 It says that it is better to put more effort into having a lower Bellman or approximation error at later iterations of API/AVI. [sent-222, score-0.131]
78 This can be done, for instance, by gradually increasing the number of samples throughout the iterations, or by using more powerful, and possibly computationally more expensive, function approximators for the later iterations of API/AVI. [sent-223, score-0.115]
79 To illustrate this property, we compare two different sampling schedules on a simple MDP. [sent-224, score-0.068]
80 The MDP is a 100-state, 2-action chain similar to the Chain Walk problem in the work of Lagoudakis and Parr [5]. [sent-225, score-0.044]
81 In the first sampling schedule, every 20 iterations we generate a fixed number of fresh samples by following a uniformly random walk on the chain (this means that we throw away old samples). [sent-227, score-0.196]
82 In the exponential strategy, we again generate new samples every 20 iterations, but the number of samples at the kth iteration is ckγ. [sent-229, score-0.183]
83 The improvement of the exponential sampling schedule is evident. [sent-234, score-0.098]
84 Of course, one may think of more sophisticated sampling schedules, but this simple illustration should be sufficient to attract the attention of practitioners to this phenomenon. [sent-235, score-0.068]
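For concreteness, here is a minimal sketch of the two kinds of schedules compared above; the total budget, the horizon, and the particular exponential profile (samples growing like γ^{-k}) are illustrative assumptions and not the exact schedule used in the experiment.

```python
import numpy as np

gamma, K, budget = 0.95, 10, 10_000      # assumed discount, number of sample batches, total samples

# Uniform schedule: the same number of fresh samples in every batch.
uniform = np.full(K, budget / K)

# Exponential schedule: later batches get more samples, since errors committed late
# carry more weight (roughly gamma^(K-k-1)) in the final performance bound.
w = gamma ** -np.arange(K)               # grows with the batch index k
exponential = budget * w / w.sum()

for k in range(K):
    print(f"batch {k:2d}: uniform {uniform[k]:7.1f}   exponential {exponential[k]:7.1f}")
```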
85 4 Restricted search over policy space One interesting feature of our results is that they put more structure and restrictions on the way policies may be selected. [sent-237, score-0.465]
86 Comparing CPI,ρ,ν(K; r) (Theorem 3) and CVI,ρ,ν(K; r) (Theorem 4) with Cρ,ν (Proposition 1), we see that: (1) each concentrability coefficient in the definition of CPI,ρ,ν(K; r) depends only on one or two policies (e. [sent-238, score-0.375]
87 (2) The operator sup in CPI,ρ,ν and CVI,ρ,ν appears outside the summation. [sent-246, score-0.105]
88 On the other hand, the sup appears inside the summation in the definition of Cρ,ν. [sent-251, score-0.064]
89 One may construct an MDP in which this difference in the ordering of sup leads to an arbitrarily large ratio between the two ways of defining the concentrability coefficients. [sent-252, score-0.369]
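A tiny numeric example (not from the paper) of why the position of sup matters: with sup outside the sum a single policy has to be simultaneously bad for every term, whereas with sup inside each term gets to pick its own worst-case policy.

```python
# f[m][pi]: a made-up quantity indexed by the summation index m and a policy pi.
f = [
    {"pi_a": 10.0, "pi_b": 1.0},   # term m = 0 is large only under pi_a
    {"pi_a": 1.0, "pi_b": 10.0},   # term m = 1 is large only under pi_b
]

sup_outside = max(sum(term[pi] for term in f) for pi in ("pi_a", "pi_b"))  # sup_pi sum_m f(m, pi)
sup_inside = sum(max(term.values()) for term in f)                         # sum_m sup_pi f(m, pi)

print("sup outside the sum:", sup_outside)   # 11.0
print("sup inside the sum: ", sup_inside)    # 20.0
```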
90 For general MDPs, the computation of concentrability coefficients in Definition 2 is difficult, as it is for similar coefficients defined in [18, 17, 7]. [sent-256, score-0.305]
91 In our Theorems 3 and 4, we have provided upper bounds that relate the errors at each iteration of API/AVI to the performance loss of the whole procedure. [sent-267, score-0.226]
92 These bounds are qualitatively tighter than the previous results such as those reported by [18, 17, 7], and provide a better understanding of what factors contribute to the difficulty of the problem. [sent-268, score-0.056]
93 Perhaps the most important one is to study the behavior of the concentrability coefficients cPI1,ρ,ν(m1, m2; π), cPI2,ρ,ν(m1, m2; π1, π2), and cVI,ρ,ν(m1, m2; π) as functions of m1, m2, and of course the transition kernel P of the MDP. [sent-271, score-0.305]
94 A better understanding of this question alongside a good understanding of the way each term εk in E(ε0 , . [sent-272, score-0.058]
95 , εK−1; r) behaves would help us gain more insight into the error convergence behavior of RL/Planning algorithms. [sent-275, score-0.045]
96 Neural fitted Q iteration – first experiences with a data efficient neural reinforcement learning method. [sent-280, score-0.14]
97 [7] András Antos, Csaba Szepesvári, and Rémi Munos. [sent-297, score-0.056]
98 Learning near-optimal policies with a a e Bellman-residual minimization based fitted policy iteration and a single sample path. [sent-298, score-0.542]
99 [8] Odalric Maillard, R´ mi Munos, Alessandro Lazaric, and Mohammad Ghavamzadeh. [sent-300, score-0.056]
100 Performance bounds in lp norm for approximate value iteration. [sent-346, score-0.242]
wordName wordTfidf (topN-words)
[('policy', 0.395), ('api', 0.324), ('qk', 0.32), ('concentrability', 0.305), ('vk', 0.215), ('bellman', 0.206), ('avi', 0.205), ('cvi', 0.203), ('supk', 0.142), ('cpi', 0.139), ('lp', 0.123), ('csaba', 0.111), ('szepesv', 0.104), ('ex', 0.102), ('qmax', 0.102), ('ae', 0.093), ('residual', 0.089), ('farahmand', 0.088), ('supremum', 0.08), ('munos', 0.079), ('rmax', 0.078), ('iteration', 0.077), ('policies', 0.07), ('brm', 0.07), ('proposition', 0.069), ('mdp', 0.068), ('br', 0.067), ('antos', 0.066), ('sup', 0.064), ('coef', 0.064), ('reinforcement', 0.063), ('norm', 0.062), ('dx', 0.06), ('bertsekas', 0.057), ('maxa', 0.057), ('mi', 0.056), ('lspi', 0.056), ('maillard', 0.056), ('measurable', 0.055), ('mohammad', 0.053), ('iterations', 0.05), ('ri', 0.049), ('lagoudakis', 0.048), ('fitted', 0.048), ('greedy', 0.047), ('geramifard', 0.046), ('ilstd', 0.046), ('loss', 0.046), ('parr', 0.046), ('error', 0.045), ('cients', 0.045), ('errors', 0.045), ('chain', 0.044), ('walk', 0.043), ('lim', 0.043), ('absolutely', 0.043), ('resulted', 0.042), ('operator', 0.041), ('slt', 0.041), ('vmax', 0.041), ('tted', 0.04), ('sequence', 0.039), ('theorems', 0.039), ('optimality', 0.039), ('approximators', 0.037), ('shie', 0.037), ('ernst', 0.037), ('schedules', 0.037), ('lille', 0.037), ('approximation', 0.036), ('improvement', 0.035), ('mahadevan', 0.035), ('bradtke', 0.035), ('jung', 0.035), ('propagate', 0.033), ('stationary', 0.032), ('dy', 0.032), ('edmonton', 0.032), ('dimitri', 0.032), ('schedule', 0.032), ('ghavamzadeh', 0.032), ('lstd', 0.032), ('et', 0.031), ('propagation', 0.031), ('relate', 0.031), ('state', 0.031), ('sampling', 0.031), ('tsitsiklis', 0.03), ('approximate', 0.03), ('steven', 0.029), ('contributing', 0.029), ('ronald', 0.029), ('understanding', 0.029), ('freedom', 0.029), ('rl', 0.029), ('samples', 0.028), ('alberta', 0.028), ('bounds', 0.027), ('rt', 0.027), ('xt', 0.027)]
simIndex simValue paperId paperTitle
same-paper 1 1.0000002 78 nips-2010-Error Propagation for Approximate Policy and Value Iteration
Author: Amir-massoud Farahmand, Csaba Szepesvári, Rémi Munos
Abstract: We address the question of how the approximation error/Bellman residual at each iteration of the Approximate Policy/Value Iteration algorithms influences the quality of the resulted policy. We quantify the performance loss as the Lp norm of the approximation error/Bellman residual at each iteration. Moreover, we show that the performance loss depends on the expectation of the squared Radon-Nikodym derivative of a certain distribution rather than its supremum – as opposed to what has been suggested by the previous results. Also our results indicate that the contribution of the approximation/Bellman error to the performance loss is more prominent in the later iterations of API/AVI, and the effect of an error term in the earlier iterations decays exponentially fast. 1
2 0.31402352 134 nips-2010-LSTD with Random Projections
Author: Mohammad Ghavamzadeh, Alessandro Lazaric, Odalric Maillard, Rémi Munos
Abstract: We consider the problem of reinforcement learning in high-dimensional spaces when the number of features is bigger than the number of samples. In particular, we study the least-squares temporal difference (LSTD) learning algorithm when a space of low dimension is generated with a random projection from a highdimensional space. We provide a thorough theoretical analysis of the LSTD with random projections and derive performance bounds for the resulting algorithm. We also show how the error of LSTD with random projections is propagated through the iterations of a policy iteration algorithm and provide a performance bound for the resulting least-squares policy iteration (LSPI) algorithm. 1
3 0.29490611 189 nips-2010-On a Connection between Importance Sampling and the Likelihood Ratio Policy Gradient
Author: Tang Jie, Pieter Abbeel
Abstract: Likelihood ratio policy gradient methods have been some of the most successful reinforcement learning algorithms, especially for learning on physical systems. We describe how the likelihood ratio policy gradient can be derived from an importance sampling perspective. This derivation highlights how likelihood ratio methods under-use past experience by (i) using the past experience to estimate only the gradient of the expected return U (θ) at the current policy parameterization θ, rather than to obtain a more complete estimate of U (θ), and (ii) using past experience under the current policy only rather than using all past experience to improve the estimates. We present a new policy search method, which leverages both of these observations as well as generalized baselines—a new technique which generalizes commonly used baseline techniques for policy gradient methods. Our algorithm outperforms standard likelihood ratio policy gradient algorithms on several testbeds. 1
4 0.28598681 208 nips-2010-Policy gradients in linearly-solvable MDPs
Author: Emanuel Todorov
Abstract: We present policy gradient results within the framework of linearly-solvable MDPs. For the first time, compatible function approximators and natural policy gradients are obtained by estimating the cost-to-go function, rather than the (much larger) state-action advantage function as is necessary in traditional MDPs. We also develop the first compatible function approximators and natural policy gradients for continuous-time stochastic systems.
5 0.27023819 160 nips-2010-Linear Complementarity for Regularized Policy Evaluation and Improvement
Author: Jeffrey Johns, Christopher Painter-wakefield, Ronald Parr
Abstract: Recent work in reinforcement learning has emphasized the power of L1 regularization to perform feature selection and prevent overfitting. We propose formulating the L1 regularized linear fixed point problem as a linear complementarity problem (LCP). This formulation offers several advantages over the LARS-inspired formulation, LARS-TD. The LCP formulation allows the use of efficient off-theshelf solvers, leads to a new uniqueness result, and can be initialized with starting points from similar problems (warm starts). We demonstrate that warm starts, as well as the efficiency of LCP solvers, can speed up policy iteration. Moreover, warm starts permit a form of modified policy iteration that can be used to approximate a “greedy” homotopy path, a generalization of the LARS-TD homotopy path that combines policy evaluation and optimization. 1
6 0.26205698 152 nips-2010-Learning from Logged Implicit Exploration Data
7 0.24778543 14 nips-2010-A Reduction from Apprenticeship Learning to Classification
8 0.24211764 179 nips-2010-Natural Policy Gradient Methods with Parameter-based Exploration for Control Tasks
9 0.22462733 184 nips-2010-Nonparametric Bayesian Policy Priors for Reinforcement Learning
10 0.18614462 196 nips-2010-Online Markov Decision Processes under Bandit Feedback
11 0.14910035 201 nips-2010-PAC-Bayesian Model Selection for Reinforcement Learning
12 0.14545643 43 nips-2010-Bootstrapping Apprenticeship Learning
13 0.12860404 221 nips-2010-Random Projections for $k$-means Clustering
14 0.12253992 212 nips-2010-Predictive State Temporal Difference Learning
15 0.12032703 38 nips-2010-Batch Bayesian Optimization via Simulation Matching
16 0.1111659 269 nips-2010-Throttling Poisson Processes
17 0.10743821 64 nips-2010-Distributionally Robust Markov Decision Processes
18 0.10649352 37 nips-2010-Basis Construction from Power Series Expansions of Value Functions
19 0.10133915 130 nips-2010-Interval Estimation for Reinforcement-Learning Algorithms in Continuous-State Domains
20 0.09472385 93 nips-2010-Feature Construction for Inverse Reinforcement Learning
topicId topicWeight
[(0, 0.23), (1, -0.407), (2, -0.015), (3, -0.009), (4, 0.03), (5, -0.086), (6, 0.1), (7, 0.01), (8, 0.074), (9, -0.164), (10, 0.031), (11, -0.105), (12, 0.006), (13, 0.043), (14, -0.021), (15, 0.03), (16, 0.129), (17, -0.036), (18, 0.009), (19, 0.002), (20, 0.063), (21, 0.008), (22, 0.012), (23, 0.027), (24, -0.09), (25, 0.038), (26, -0.005), (27, -0.045), (28, -0.031), (29, 0.006), (30, 0.024), (31, -0.076), (32, -0.086), (33, 0.034), (34, 0.108), (35, -0.04), (36, 0.026), (37, 0.038), (38, -0.083), (39, 0.008), (40, 0.035), (41, -0.043), (42, 0.021), (43, -0.055), (44, 0.03), (45, -0.029), (46, 0.096), (47, -0.063), (48, -0.058), (49, 0.003)]
simIndex simValue paperId paperTitle
same-paper 1 0.95748341 78 nips-2010-Error Propagation for Approximate Policy and Value Iteration
Author: Amir-massoud Farahmand, Csaba Szepesvári, Rémi Munos
Abstract: We address the question of how the approximation error/Bellman residual at each iteration of the Approximate Policy/Value Iteration algorithms influences the quality of the resulted policy. We quantify the performance loss as the Lp norm of the approximation error/Bellman residual at each iteration. Moreover, we show that the performance loss depends on the expectation of the squared Radon-Nikodym derivative of a certain distribution rather than its supremum – as opposed to what has been suggested by the previous results. Also our results indicate that the contribution of the approximation/Bellman error to the performance loss is more prominent in the later iterations of API/AVI, and the effect of an error term in the earlier iterations decays exponentially fast. 1
2 0.85788804 160 nips-2010-Linear Complementarity for Regularized Policy Evaluation and Improvement
Author: Jeffrey Johns, Christopher Painter-wakefield, Ronald Parr
Abstract: Recent work in reinforcement learning has emphasized the power of L1 regularization to perform feature selection and prevent overfitting. We propose formulating the L1 regularized linear fixed point problem as a linear complementarity problem (LCP). This formulation offers several advantages over the LARS-inspired formulation, LARS-TD. The LCP formulation allows the use of efficient off-theshelf solvers, leads to a new uniqueness result, and can be initialized with starting points from similar problems (warm starts). We demonstrate that warm starts, as well as the efficiency of LCP solvers, can speed up policy iteration. Moreover, warm starts permit a form of modified policy iteration that can be used to approximate a “greedy” homotopy path, a generalization of the LARS-TD homotopy path that combines policy evaluation and optimization. 1
3 0.79907465 189 nips-2010-On a Connection between Importance Sampling and the Likelihood Ratio Policy Gradient
Author: Tang Jie, Pieter Abbeel
Abstract: Likelihood ratio policy gradient methods have been some of the most successful reinforcement learning algorithms, especially for learning on physical systems. We describe how the likelihood ratio policy gradient can be derived from an importance sampling perspective. This derivation highlights how likelihood ratio methods under-use past experience by (i) using the past experience to estimate only the gradient of the expected return U (θ) at the current policy parameterization θ, rather than to obtain a more complete estimate of U (θ), and (ii) using past experience under the current policy only rather than using all past experience to improve the estimates. We present a new policy search method, which leverages both of these observations as well as generalized baselines—a new technique which generalizes commonly used baseline techniques for policy gradient methods. Our algorithm outperforms standard likelihood ratio policy gradient algorithms on several testbeds. 1
4 0.78556466 14 nips-2010-A Reduction from Apprenticeship Learning to Classification
Author: Umar Syed, Robert E. Schapire
Abstract: We provide new theoretical results for apprenticeship learning, a variant of reinforcement learning in which the true reward function is unknown, and the goal is to perform well relative to an observed expert. We study a common approach to learning from expert demonstrations: using a classification algorithm to learn to imitate the expert’s behavior. Although this straightforward learning strategy is widely-used in practice, it has been subject to very little formal analysis. We prove that, if the learned classifier has error rate ǫ, the difference between the √ value of the apprentice’s policy and the expert’s policy is O( ǫ). Further, we prove that this difference is only O(ǫ) when the expert’s policy is close to optimal. This latter result has an important practical consequence: Not only does imitating a near-optimal expert result in a better policy, but far fewer demonstrations are required to successfully imitate such an expert. This suggests an opportunity for substantial savings whenever the expert is known to be good, but demonstrations are expensive or difficult to obtain. 1
5 0.78405684 134 nips-2010-LSTD with Random Projections
Author: Mohammad Ghavamzadeh, Alessandro Lazaric, Odalric Maillard, Rémi Munos
Abstract: We consider the problem of reinforcement learning in high-dimensional spaces when the number of features is bigger than the number of samples. In particular, we study the least-squares temporal difference (LSTD) learning algorithm when a space of low dimension is generated with a random projection from a highdimensional space. We provide a thorough theoretical analysis of the LSTD with random projections and derive performance bounds for the resulting algorithm. We also show how the error of LSTD with random projections is propagated through the iterations of a policy iteration algorithm and provide a performance bound for the resulting least-squares policy iteration (LSPI) algorithm. 1
6 0.77786195 152 nips-2010-Learning from Logged Implicit Exploration Data
7 0.77088016 208 nips-2010-Policy gradients in linearly-solvable MDPs
8 0.7547313 179 nips-2010-Natural Policy Gradient Methods with Parameter-based Exploration for Control Tasks
9 0.64328814 184 nips-2010-Nonparametric Bayesian Policy Priors for Reinforcement Learning
10 0.62460858 201 nips-2010-PAC-Bayesian Model Selection for Reinforcement Learning
11 0.60947269 212 nips-2010-Predictive State Temporal Difference Learning
12 0.60278231 38 nips-2010-Batch Bayesian Optimization via Simulation Matching
13 0.55766422 43 nips-2010-Bootstrapping Apprenticeship Learning
14 0.49911812 64 nips-2010-Distributionally Robust Markov Decision Processes
15 0.47512221 50 nips-2010-Constructing Skill Trees for Reinforcement Learning Agents from Demonstration Trajectories
16 0.46801615 37 nips-2010-Basis Construction from Power Series Expansions of Value Functions
17 0.46092713 196 nips-2010-Online Markov Decision Processes under Bandit Feedback
18 0.41531315 221 nips-2010-Random Projections for $k$-means Clustering
19 0.41502899 269 nips-2010-Throttling Poisson Processes
20 0.39172593 93 nips-2010-Feature Construction for Inverse Reinforcement Learning
topicId topicWeight
[(13, 0.044), (14, 0.308), (27, 0.043), (30, 0.061), (35, 0.019), (39, 0.01), (45, 0.171), (50, 0.034), (51, 0.028), (52, 0.051), (60, 0.048), (77, 0.036), (78, 0.017), (90, 0.045)]
simIndex simValue paperId paperTitle
same-paper 1 0.7308569 78 nips-2010-Error Propagation for Approximate Policy and Value Iteration
Author: Amir-massoud Farahmand, Csaba Szepesvári, Rémi Munos
Abstract: We address the question of how the approximation error/Bellman residual at each iteration of the Approximate Policy/Value Iteration algorithms influences the quality of the resulted policy. We quantify the performance loss as the Lp norm of the approximation error/Bellman residual at each iteration. Moreover, we show that the performance loss depends on the expectation of the squared Radon-Nikodym derivative of a certain distribution rather than its supremum – as opposed to what has been suggested by the previous results. Also our results indicate that the contribution of the approximation/Bellman error to the performance loss is more prominent in the later iterations of API/AVI, and the effect of an error term in the earlier iterations decays exponentially fast. 1
2 0.69228327 191 nips-2010-On the Theory of Learnining with Privileged Information
Author: Dmitry Pechyony, Vladimir Vapnik
Abstract: In Learning Using Privileged Information (LUPI) paradigm, along with the standard training data in the decision space, a teacher supplies a learner with the privileged information in the correcting space. The goal of the learner is to find a classifier with a low generalization error in the decision space. We consider an empirical risk minimization algorithm, called Privileged ERM, that takes into account the privileged information in order to find a good function in the decision space. We outline the conditions on the correcting space that, if satisfied, allow Privileged ERM to have much faster learning rate in the decision space than the one of the regular empirical risk minimization. 1
3 0.55769455 134 nips-2010-LSTD with Random Projections
Author: Mohammad Ghavamzadeh, Alessandro Lazaric, Odalric Maillard, Rémi Munos
Abstract: We consider the problem of reinforcement learning in high-dimensional spaces when the number of features is bigger than the number of samples. In particular, we study the least-squares temporal difference (LSTD) learning algorithm when a space of low dimension is generated with a random projection from a highdimensional space. We provide a thorough theoretical analysis of the LSTD with random projections and derive performance bounds for the resulting algorithm. We also show how the error of LSTD with random projections is propagated through the iterations of a policy iteration algorithm and provide a performance bound for the resulting least-squares policy iteration (LSPI) algorithm. 1
4 0.55725414 92 nips-2010-Fast global convergence rates of gradient methods for high-dimensional statistical recovery
Author: Alekh Agarwal, Sahand Negahban, Martin J. Wainwright
Abstract: Many statistical M -estimators are based on convex optimization problems formed by the weighted sum of a loss function with a norm-based regularizer. We analyze the convergence rates of first-order gradient methods for solving such problems within a high-dimensional framework that allows the data dimension d to grow with (and possibly exceed) the sample size n. This high-dimensional structure precludes the usual global assumptions— namely, strong convexity and smoothness conditions—that underlie classical optimization analysis. We define appropriately restricted versions of these conditions, and show that they are satisfied with high probability for various statistical models. Under these conditions, our theory guarantees that Nesterov’s first-order method [12] has a globally geometric rate of convergence up to the statistical precision of the model, meaning the typical Euclidean distance between the true unknown parameter θ∗ and the optimal solution θ. This globally linear rate is substantially faster than previous analyses of global convergence for specific methods that yielded only sublinear rates. Our analysis applies to a wide range of M -estimators and statistical models, including sparse linear regression using Lasso ( 1 regularized regression), group Lasso, block sparsity, and low-rank matrix recovery using nuclear norm regularization. Overall, this result reveals an interesting connection between statistical precision and computational efficiency in high-dimensional estimation. 1
5 0.55632687 193 nips-2010-Online Learning: Random Averages, Combinatorial Parameters, and Learnability
Author: Alexander Rakhlin, Karthik Sridharan, Ambuj Tewari
Abstract: We develop a theory of online learning by defining several complexity measures. Among them are analogues of Rademacher complexity, covering numbers and fatshattering dimension from statistical learning theory. Relationship among these complexity measures, their connection to online learning, and tools for bounding them are provided. We apply these results to various learning problems. We provide a complete characterization of online learnability in the supervised setting. 1
6 0.55572307 117 nips-2010-Identifying graph-structured activation patterns in networks
7 0.55515856 265 nips-2010-The LASSO risk: asymptotic results and real world examples
8 0.55378735 280 nips-2010-Unsupervised Kernel Dimension Reduction
9 0.55319494 148 nips-2010-Learning Networks of Stochastic Differential Equations
10 0.55312747 63 nips-2010-Distributed Dual Averaging In Networks
11 0.55242354 7 nips-2010-A Family of Penalty Functions for Structured Sparsity
12 0.55208737 243 nips-2010-Smoothness, Low Noise and Fast Rates
13 0.55072582 31 nips-2010-An analysis on negative curvature induced by singularity in multi-layer neural-network learning
14 0.55061817 199 nips-2010-Optimal learning rates for Kernel Conjugate Gradient regression
15 0.55034304 260 nips-2010-Sufficient Conditions for Generating Group Level Sparsity in a Robust Minimax Framework
16 0.55006915 87 nips-2010-Extended Bayesian Information Criteria for Gaussian Graphical Models
17 0.54992884 210 nips-2010-Practical Large-Scale Optimization for Max-norm Regularization
18 0.54973817 290 nips-2010-t-logistic regression
19 0.54966128 80 nips-2010-Estimation of Renyi Entropy and Mutual Information Based on Generalized Nearest-Neighbor Graphs
20 0.54899472 70 nips-2010-Efficient Optimization for Discriminative Latent Class Models