nips nips2011 nips2011-36 knowledge-graph by maker-knowledge-mining

36 nips-2011-Analysis and Improvement of Policy Gradient Estimation


Source: pdf

Author: Tingting Zhao, Hirotaka Hachiya, Gang Niu, Masashi Sugiyama

Abstract: Policy gradient is a useful model-free reinforcement learning approach, but it tends to suffer from instability of gradient estimates. In this paper, we analyze and improve the stability of policy gradient methods. We first prove that the variance of gradient estimates in the PGPE (policy gradients with parameter-based exploration) method is smaller than that of the classical REINFORCE method under a mild assumption. We then derive the optimal baseline for PGPE, which contributes to further reducing the variance. We also theoretically show that PGPE with the optimal baseline is more preferable than REINFORCE with the optimal baseline in terms of the variance of gradient estimates. Finally, we demonstrate the usefulness of the improved PGPE method through experiments. 1

Reference: text


Summary: the most important sentences generated by the tfidf model

sentIndex sentText sentNum sentScore

1 Abstract Policy gradient is a useful model-free reinforcement learning approach, but it tends to suffer from instability of gradient estimates. [sent-7, score-0.256]

2 In this paper, we analyze and improve the stability of policy gradient methods. [sent-8, score-0.276]

3 We first prove that the variance of gradient estimates in the PGPE (policy gradients with parameter-based exploration) method is smaller than that of the classical REINFORCE method under a mild assumption. [sent-9, score-0.287]

4 We then derive the optimal baseline for PGPE, which contributes to further reducing the variance. [sent-10, score-0.15]

5 We also theoretically show that PGPE with the optimal baseline is more preferable than REINFORCE with the optimal baseline in terms of the variance of gradient estimates. [sent-11, score-0.461]

6 1 Introduction The goal of reinforcement learning (RL) is to find an optimal decision-making policy that maximizes the return (i. [sent-13, score-0.263]

7 Policy iteration and policy search are two popular formulations of model-free RL. [sent-17, score-0.18]

8 In the policy iteration approach [6], the value function is first estimated and then policies are determined based on the learned value function. [sent-18, score-0.209]

9 Although policy iteration can naturally deal with continuous states by function approximation [8], continuous actions are hard to handle due to the difficulty of finding maximizers of value functions with respect to actions. [sent-20, score-0.224]

10 Another limitation of policy iteration, especially in physical control tasks, is that control policies can vary drastically in each iteration. [sent-22, score-0.243]

11 Policy search is another approach to model-free RL that can overcome the limitations of policy iteration [18, 4, 7]. [sent-24, score-0.18]

12 In the policy search approach, control policies are directly learned so that the return is maximized, for example, via a gradient method (called the REINFORCE method) [18], an EM algorithm [4], and a natural gradient method [7]. [sent-25, score-0.41]

13 However, since the REINFORCE method tends to have a large variance in the estimation of the gradient directions, its naive implementation converges slowly [9, 10, 12]. [sent-28, score-0.203]

14 Subtraction of the optimal baseline [16, 5] can ease this problem to some extent, but the variance of gradient estimates is still large. [sent-29, score-0.359]

15 To cope with this problem, a novel policy gradient method called policy gradients with parameter-based exploration (PGPE) was proposed recently [12]. [sent-31, score-0.487]

16 In PGPE, an initial policy is drawn from a prior probability distribution, and then actions are chosen deterministically. [sent-32, score-0.195]

17 This construction contributes to mitigating the problem of initial policy choice and stabilizing gradient estimates. [sent-33, score-0.316]

18 Moreover, by subtracting a moving-average baseline, the variance of gradient estimates can be further reduced. [sent-34, score-0.253]

19 More specifically, we first give bounds of the gradient estimates of the REINFORCE and PGPE methods. [sent-37, score-0.152]

20 Our theoretical analysis shows that gradient estimates for PGPE have smaller variance than those for REINFORCE under a mild condition. [sent-38, score-0.251]

21 We then show that the moving-average baseline for PGPE adopted in the original paper [12] has excess variance; we give the optimal baseline for PGPE that minimizes the variance, following the line of [16, 5]. [sent-39, score-0.25]

22 We further theoretically show that PGPE with the optimal baseline is more preferable than REINFORCE with the optimal baseline in terms of the variance of gradient estimates. [sent-40, score-0.461]

23 2 Policy Gradients for Reinforcement Learning In this section, we review policy gradient methods. [sent-42, score-0.251]

24 Let p(a|s, θ) be a stochastic policy with parameter θ, which represents the conditional probability density of taking action a in state s. [sent-45, score-0.193]

25 The goal of reinforcement learning is to find the optimal policy parameter θ∗ that maximizes the expected return J(θ): θ∗ := arg max_θ J(θ). [sent-54, score-0.263]

26 2 Review of the REINFORCE Algorithm In the REINFORCE algorithm [18], the policy parameter θ is updated via gradient ascent: θ ← θ + ε∇_θ J(θ), where ε is a small positive constant. [sent-56, score-0.251]

27 The gradient ∇_θ J(θ) is given by ∇_θ J(θ) = ∫ ∇_θ p(h|θ) R(h) dh = ∫ p(h|θ) ∇_θ log p(h|θ) R(h) dh = ∫ p(h|θ) Σ_{t=1}^T ∇_θ log p(a_t|s_t, θ) R(h) dh, where we used the so-called ‘log trick’: ∇_θ p(h|θ) = p(h|θ) ∇_θ log p(h|θ). [sent-57, score-0.145]

28 Let us employ the Gaussian policy model with parameter θ = (µ, σ), where µ is the mean vector and σ is the standard deviation: p(a|s, θ) = (1/(σ√(2π))) exp(−(a − µ⊤s)² / (2σ²)). [sent-62, score-0.163]

29 Then the policy gradients are explicitly given as ∇_µ log p(a|s, θ) = ((a − µ⊤s)/σ²) s and ∇_σ log p(a|s, θ) = ((a − µ⊤s)² − σ²)/σ³. [sent-63, score-0.237]

30 A drawback of REINFORCE is that the variance of the above policy gradients is large [10, 11], which leads to slow convergence. [sent-64, score-0.295]
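
As a rough illustration of the review above, the following NumPy sketch estimates the REINFORCE gradients for the Gaussian policy with mean µ⊤s and standard deviation σ, using the explicit gradient formulas just given. The environment object `env` with `reset()` and `step(a)` returning `(next_state, reward, done)` is a hypothetical interface, and the defaults for N and T are illustrative, not values from the paper.

```python
import numpy as np

def reinforce_gradient(env, mu, sigma, N=100, T=50):
    """Monte-Carlo estimate of the REINFORCE gradients for a Gaussian policy
    a ~ N(mu^T s, sigma^2), averaged over N sampled trajectories of length <= T."""
    grad_mu, grad_sigma = np.zeros_like(mu), 0.0
    for _ in range(N):
        s = env.reset()                                   # hypothetical interface
        elig_mu, elig_sigma, R = np.zeros_like(mu), 0.0, 0.0
        for _ in range(T):
            a = np.random.normal(mu @ s, sigma)           # stochastic Gaussian policy
            # characteristic eligibilities (explicit formulas above)
            elig_mu += (a - mu @ s) / sigma**2 * s
            elig_sigma += ((a - mu @ s)**2 - sigma**2) / sigma**3
            s, r, done = env.step(a)                      # hypothetical interface
            R += r
            if done:
                break
        grad_mu += elig_mu * R
        grad_sigma += elig_sigma * R
    return grad_mu / N, grad_sigma / N
```

The parameters would then be updated by gradient ascent, µ ← µ + ε·ĝ_µ and σ ← σ + ε·ĝ_σ, as in the update rule quoted earlier.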

31 3 Review of the PGPE Algorithm One of the reasons for large variance of policy gradients in the REINFORCE algorithm is that the empirical average is taken at each time step, which is caused by stochasticity of policies. [sent-66, score-0.306]

32 In order to mitigate this problem, another method called policy gradients with parameter-based exploration (PGPE) was proposed recently [11]. [sent-67, score-0.217]

33 In PGPE, a linear deterministic policy, π(a|s, θ) = θ⊤s, is adopted, and stochasticity is introduced by considering a prior distribution over the policy parameter θ with hyper-parameter ρ: p(θ|ρ). [sent-68, score-0.174]

34 Since the entire history h is solely determined by a single sample of the parameter θ in this formulation, it is expected that the variance of gradient estimates can be reduced. [sent-69, score-0.238]
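
For comparison, here is a sketch of the PGPE estimator under the same hypothetical environment interface: one parameter vector θ is drawn per episode from a factorized Gaussian prior p(θ|ρ) with hyper-parameters η (mean) and τ (standard deviation), the linear deterministic policy a = θ⊤s is rolled out, and the characteristic eligibilities of the Gaussian prior are weighted by the return. The Gaussian eligibility formulas follow from the prior density; the function name and defaults are illustrative.

```python
import numpy as np

def pgpe_gradient(env, eta, tau, N=100, T=50):
    """Monte-Carlo estimate of the PGPE gradients with respect to the
    hyper-parameters rho = (eta, tau) of the factorized Gaussian prior
    p(theta | rho); eta and tau are arrays of the same shape."""
    grad_eta, grad_tau = np.zeros_like(eta), np.zeros_like(tau)
    for _ in range(N):
        theta = np.random.normal(eta, tau)   # one parameter sample per episode
        s = env.reset()
        R = 0.0
        for _ in range(T):
            a = theta @ s                    # deterministic linear policy
            s, r, done = env.step(a)
            R += r
            if done:
                break
        # characteristic eligibilities of the Gaussian prior, weighted by the return
        grad_eta += (theta - eta) / tau**2 * R
        grad_tau += ((theta - eta)**2 - tau**2) / tau**3 * R
    return grad_eta / N, grad_tau / N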

35 3 Variance of Gradient Estimates In this section, we theoretically investigate the variance of gradient estimates in REINFORCE and PGPE. [sent-75, score-0.275]

36 For multi-dimensional state space, we consider the trace of the covariance matrix of gradient vectors. [sent-76, score-0.109]

37 Assumption (C): For δ > 0, there exist two series {c_t}_{t=1}^T and {d_t}_{t=1}^T such that ‖s_t‖_2 ≥ c_t and ‖s_t‖_2 ≤ d_t hold with probability at least (1 − δ)^{1/(2N)}, respectively, over the choice of sample paths, where ‖·‖_2 denotes the ℓ_2-norm. [sent-84, score-0.145]

38 First, we analyze the variance of gradient estimates in PGPE (the proofs of all the theorems are provided in the supplementary material): Theorem 1. [sent-87, score-0.251]

39 The upper bound of the variance of ∇_τ J(ρ) is twice as large as that of ∇_η J(ρ). [sent-89, score-0.124]

40 Next, we analyze the variance of gradient estimates in REINFORCE: Theorem 2. [sent-91, score-0.251]

41 Under Assumption (A), Theorem 2 gives upper bounds on the variance of gradient estimates in REINFORCE. These upper bounds are similar to those for PGPE, but they are monotone increasing with respect to trajectory length T. [sent-95, score-0.114]

42 The lower bound for the variance of ∇_µ J(θ) will be non-trivial if it is positive, i. [sent-96, score-0.107]

43 Deriving a lower bound of the variance of ∇_σ J(θ) is left open as future work. [sent-102, score-0.107]

44 Finally, we compare the variance of gradient estimates in REINFORCE and PGPE: Theorem 3. [sent-103, score-0.238]

45 The above theorem means that PGPE is more favorable than REINFORCE in terms of the variance of gradient estimates of the mean, if trajectory length T is large. [sent-106, score-0.319]

46 4 Variance Reduction by Subtracting Baseline In this section, we give a method to reduce the variance of gradient estimates in PGPE and analyze its theoretical properties. [sent-108, score-0.251]

47 The adaptive reinforcement baseline [18] was derived as the exponential moving average of the past experience: b(n) = γR(h_{n−1}) + (1 − γ)b(n − 1), where 0 < γ ≤ 1. [sent-111, score-0.153]

48 Based on this, an empirical gradient estimate with the moving-average baseline was proposed for REINFORCE [18] and PGPE [12]. [sent-112, score-0.186]
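
A minimal sketch of that moving-average baseline; `gamma_b` plays the role of γ in b(n) = γR(h_{n−1}) + (1 − γ)b(n − 1), and its value here is only illustrative.

```python
def update_moving_average_baseline(b, R_prev, gamma_b=0.9):
    """b(n) = gamma * R(h_{n-1}) + (1 - gamma) * b(n-1); gamma_b is illustrative."""
    return gamma_b * R_prev + (1.0 - gamma_b) * b
```

In either estimator sketched above, R would then be replaced by R − b when the gradient terms are accumulated.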

49 The above moving-average baseline contributes to reducing the variance of gradient estimates. [sent-113, score-0.311]

50 However, it was shown [5, 16] that the moving-average baseline is not optimal; the optimal baseline is, by definition, given as the minimizer of the variance of gradient estimates with respect to a baseline. [sent-114, score-0.467]

51 Following this formulation, the optimal baseline for REINFORCE is given as follows [10]: b*_REINFORCE := arg min_b Var[∇_θ J_b(θ)] = E[R(h) ‖Σ_{t=1}^T ∇_θ log p(a_t|s_t, θ)‖²] / E[‖Σ_{t=1}^T ∇_θ log p(a_t|s_t, θ)‖²]. [sent-115, score-0.174]
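
An empirical version of this optimal baseline, assuming the return R(h_n) and the summed eligibility g_n = Σ_{t=1}^T ∇_θ log p(a_t|s_t, θ) of each of the N trajectories have been stored; this is a sketch of the ratio of sample averages, not the authors' implementation.

```python
import numpy as np

def optimal_baseline_reinforce(returns, eligibilities):
    """b* = E[R(h) * ||sum_t grad_theta log p(a_t|s_t,theta)||^2]
            / E[||sum_t grad_theta log p(a_t|s_t,theta)||^2], from N trajectories.
    returns:       shape (N,)   -- return R(h_n) of each trajectory
    eligibilities: shape (N, d) -- summed eligibility vector of each trajectory"""
    sq_norms = np.sum(eligibilities**2, axis=1)
    return np.sum(returns * sq_norms) / np.sum(sq_norms)
```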

52 However, only the moving-average baseline was introduced to PGPE so far [12], which is suboptimal. [sent-116, score-0.098]

53 Below, we derive the optimal baseline for PGPE, and study its theoretical properties. [sent-117, score-0.121]

54 2 Optimal Baseline for PGPE Let b*_PGPE be the optimal baseline for PGPE that minimizes the variance: b*_PGPE := arg min_b Var[∇_ρ J_b(ρ)]. [sent-119, score-0.136]

55 Then the following theorem gives the optimal baseline for PGPE: Theorem 4. [sent-120, score-0.145]

56 The optimal baseline for PGPE is given by b*_PGPE = E[R(h) ‖∇_ρ log p(θ|ρ)‖²] / E[‖∇_ρ log p(θ|ρ)‖²], and the excess variance for a baseline b is given by Var[∇_ρ J_b(ρ)] − Var[∇_ρ J_{b*_PGPE}(ρ)] = ((b − b*_PGPE)² / N) E[‖∇_ρ log p(θ|ρ)‖²]. [sent-121, score-0.403]
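
The corresponding empirical estimate for PGPE, again a sketch assuming the per-episode eligibilities g_n = ∇_ρ log p(θ_n|ρ) and returns R(h_n) are available; the second function evaluates the excess-variance expression of Theorem 4 for an arbitrary baseline b.

```python
import numpy as np

def optimal_baseline_pgpe(returns, eligibilities):
    """b*_PGPE = E[R(h) * ||grad_rho log p(theta|rho)||^2]
                 / E[||grad_rho log p(theta|rho)||^2], from N episodes."""
    sq_norms = np.sum(eligibilities**2, axis=1)
    return np.sum(returns * sq_norms) / np.sum(sq_norms)

def excess_variance_pgpe(b, returns, eligibilities):
    """Empirical version of the excess variance in Theorem 4:
    (b - b*_PGPE)^2 / N * E[||grad_rho log p(theta|rho)||^2]."""
    b_star = optimal_baseline_pgpe(returns, eligibilities)
    mean_sq_norm = np.mean(np.sum(eligibilities**2, axis=1))
    return (b - b_star)**2 / len(returns) * mean_sq_norm
```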

57 The above theorem gives an analytic-form expression of the optimal baseline for PGPE. [sent-122, score-0.145]

58 When the expected return R(h) and the squared norm of the characteristic eligibility ‖∇_ρ log p(θ|ρ)‖² are independent of each other, the optimal baseline reduces to the average expected return E[R(h)]. [sent-123, score-0.231]

59 However, the optimal baseline is generally different from the average expected return. [sent-124, score-0.121]

60 The above theorem also shows that the excess variance is proportional to the squared difference of baselines (b − b*_PGPE)² and to the expected squared norm of the characteristic eligibility E[‖∇_ρ log p(θ|ρ)‖²], and is inversely proportional to the sample size N. [sent-125, score-0.246]

61 Next, we analyze the contribution of the optimal baseline to the variance with respect to mean parameter η in PGPE: Theorem 5. [sent-126, score-0.24]

62 This theorem shows that the lower and upper bounds of the excess variance are proportional to α² and β² (the bounds of the squared immediate rewards), B (the trace of the inverse Gaussian covariance), and (1 − γ^T)²/(1 − γ)², and are inversely proportional to the sample size N. [sent-129, score-0.222]

63 3 Comparison with REINFORCE Next, we analyze the contribution of the optimal baseline for REINFORCE, and compare it with that for PGPE. [sent-132, score-0.134]

64 It was shown [5, 16] that the excess variance for a baseline b in REINFORCE is given by Var[∇_θ J_b(θ)] − Var[∇_θ J_{b*_REINFORCE}(θ)] = ((b − b*_REINFORCE)² / N) E[‖Σ_{t=1}^T ∇_θ log p(a_t|s_t, θ)‖²]. [sent-133, score-0.244]

65 The above theorem shows that the lower and upper bounds of the excess variance are monotone increasing with respect to trajectory length T . [sent-136, score-0.265]

66 In terms of the amount of variance reduction in the gradient estimates, Theorem 5 and Theorem 6 show that the optimal baseline for REINFORCE contributes more than that for PGPE. [sent-137, score-0.334]

67 This theorem shows that the upper bound of the variance of gradient estimates for REINFORCE with the optimal baseline is still monotone increasing with respect to trajectory length T . [sent-140, score-0.498]

68 On the other hand, since (1 − γ^T)² ≤ 1, the above upper bound of the variance of gradient estimates in PGPE with the optimal baseline can be further upper-bounded as Var[∇_η J_{b*_PGPE}(ρ)] ≤ (β² − α²)B / (N(1 − γ)²), which is independent of T. [sent-141, score-0.387]

69 Thus, when trajectory length T is large, the variance of gradient estimates in REINFORCE with the optimal baseline may be significantly larger than the variance of gradient estimates in PGPE with the optimal baseline. [sent-142, score-0.677]

70 Here, we compare the following five methods: REINFORCE without any baselines, REINFORCE with the optimal baseline (OB), PGPE without any baselines, PGPE with the moving-average baseline (MB), and PGPE with the optimal baseline (OB). [sent-152, score-0.34]

71 We then calculate the variance of gradient estimates over 100 runs. [sent-155, score-0.238]
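
One way to reproduce this kind of measurement (a sketch of the general idea, not necessarily the authors' exact protocol): repeat the gradient estimation many times and report the trace of the empirical covariance of the resulting gradient vectors, the multi-dimensional variance measure mentioned in Section 3.

```python
import numpy as np

def gradient_variance(estimate_gradient, runs=100):
    """Trace of the empirical covariance matrix of repeated gradient estimates.
    estimate_gradient: zero-argument callable returning one gradient vector,
    e.g. a closure around reinforce_gradient or pgpe_gradient above."""
    grads = np.array([np.atleast_1d(estimate_gradient()) for _ in range(runs)])
    cov = np.atleast_2d(np.cov(grads, rowvar=False))
    return float(np.trace(cov))
```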

72 Table 1 summarizes the results, showing that the variance of REINFORCE is overall larger than that of PGPE. [sent-156, score-0.106]

73 A notable difference between REINFORCE and PGPE is that the variance of REINFORCE ... Table 1: Variance and bias of estimated gradients for the illustrative data. [sent-157, score-0.165]

74 (a) Good initial policy. [sent-180, score-0.183]

75 (d) REINFORCE-OB and PGPE-OB. Figure 1: Variance of gradient estimates with respect to the mean parameter through policy-update iterations for the illustrative data. [sent-209, score-0.171]

76 (b) Poor initial policy. Figure 2: Return as functions of the number of episodic samples N for the illustrative data. [sent-211, score-0.248]

77 The results also show that the variance of PGPE-OB (the proposed method) is much smaller than that of PGPE-MB. [sent-214, score-0.096]

78 REINFORCE-OB contributes highly to reducing the variance, especially when T is large, which also agrees well with our theory. [sent-215, score-0.134]

79 We also investigate the bias of gradient estimates of each method. [sent-217, score-0.173]

80 We regard gradients estimated with N = 1000 as true gradients, and compute the bias of gradient estimates when N = 100. [sent-218, score-0.192]

81 Next, we investigate the variance of gradient estimates when policy parameters are updated over iterations. [sent-220, score-0.418]

82 In order to evaluate the variance in a stable manner, we repeat the above experiments 20 times with random choice of initial mean parameter µ from [−3. [sent-223, score-0.128]

83 1], and investigate the average variance of gradient estimates with respect to mean parameter µ over 20 trials, in log10 -scale. [sent-225, score-0.265]

84 Figure 1(a) compares the variance of REINFORCE with/without baselines, whereas Figure 1(b) compares the variance of PGPE with/without baselines. [sent-227, score-0.224]

85 These plots show that introduction of baselines contributes highly to the reduction of the variance over iterations. [sent-228, score-0.161]

86 Figure 1(c) compares the variance of REINFORCE and PGPE without baselines, showing that PGPE provides much more stable gradient estimates than REINFORCE. [sent-229, score-0.276]

87 Figure 1(d) compares the variance of REINFORCE and PGPE with the optimal baselines, showing that the variance of gradient estimates obtained by PGPE-OB is much smaller than that of REINFORCE-OB. [sent-230, score-0.287]

88 Overall, in terms of the variance of gradient estimates, the proposed PGPE-OB compares favorably with other methods. [sent-231, score-0.2]

89 Plain REINFORCE is highly unstable, which is caused by the huge variance of gradient estimates (see Figure 1 again). [sent-241, score-0.238]

90 A pole is hung from the roof of a cart, and the goal is to swing up the pole by moving the cart properly and to keep the pole at the top. [sent-251, score-0.222]

91 We use the Gaussian policy model for REINFORCE and linear policy model for PGPE, where state s is non-linearly transformed to a feature space via a basis function vector. [sent-254, score-0.337]
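
A possible form of such a basis function vector is a Gaussian radial basis expansion; the centers and bandwidth below are illustrative placeholders, since the summary does not specify which basis functions were used.

```python
import numpy as np

def rbf_features(s, centers, bandwidth=1.0):
    """Gaussian radial-basis feature map of a raw state s.
    centers: shape (K, dim(s)) array of basis-function centers (illustrative)."""
    diffs = centers - s                       # broadcast over the K centers
    return np.exp(-np.sum(diffs**2, axis=1) / (2.0 * bandwidth**2))

# The linear PGPE policy then acts on the features: a = theta @ rbf_features(s, centers)
```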

92 We set the problem parameters as: the mass of the cart W = 8[kg], the mass of the pole w = 2[kg], and the length of the pole l = 0. [sent-261, score-0.172]

93 The initial policy is chosen randomly, and the initial-state probability density is set to be uniform. [sent-268, score-0.183]

94 The return at each trial is computed over 100 test episodic samples (which are not used for policy learning). [sent-273, score-0.229]

95 Conclusion In this paper, we analyzed and improved the stability of the policy gradient method called PGPE (policy gradients with parameter-based exploration). [sent-276, score-0.318]

96 We theoretically showed that, under a mild condition, PGPE provides more stable gradient estimates than the classical REINFORCE method. [sent-277, score-0.187]

97 We also derived the optimal baseline for PGPE, and theoretically showed that PGPE with the optimal baseline is more preferable than REINFORCE with the optimal baseline in terms of the variance of gradient estimates. [sent-278, score-0.582]

98 Finally, we demonstrated the usefulness of PGPE with optimal baseline through experiments. [sent-279, score-0.156]

99 Variance reduction techniques for gradient estimates in reinforcement learning. [sent-323, score-0.187]

100 Approximate gradient methods in policy-space optimization of Markov reward processes. [sent-353, score-0.105]


similar papers computed by tfidf model

tfidf for this paper:

wordName wordTfidf (topN-words)

[('pgpe', 0.757), ('reinforce', 0.551), ('policy', 0.163), ('var', 0.113), ('baseline', 0.098), ('variance', 0.096), ('gradient', 0.088), ('ob', 0.085), ('pole', 0.057), ('estimates', 0.054), ('reinforcement', 0.045), ('st', 0.044), ('cart', 0.041), ('trajectory', 0.04), ('gradients', 0.036), ('baselines', 0.036), ('mb', 0.035), ('episodic', 0.034), ('ct', 0.032), ('return', 0.032), ('excess', 0.031), ('policies', 0.029), ('contributes', 0.029), ('dhd', 0.028), ('dt', 0.025), ('usefulness', 0.025), ('theorem', 0.024), ('optimal', 0.023), ('plain', 0.022), ('hn', 0.022), ('dh', 0.021), ('theoretically', 0.02), ('initial', 0.02), ('monotone', 0.02), ('illustrative', 0.019), ('log', 0.019), ('ckstiess', 0.019), ('kg', 0.019), ('parameterbased', 0.019), ('sehnke', 0.019), ('tingting', 0.019), ('action', 0.019), ('tends', 0.019), ('rewards', 0.018), ('exploration', 0.018), ('peters', 0.018), ('investigate', 0.017), ('length', 0.017), ('iteration', 0.017), ('upper', 0.017), ('sn', 0.017), ('reward', 0.017), ('gang', 0.016), ('instability', 0.016), ('osendorfer', 0.016), ('stabilizing', 0.016), ('weaver', 0.016), ('compares', 0.016), ('velocity', 0.015), ('preferable', 0.015), ('subtracting', 0.015), ('graves', 0.015), ('hachiya', 0.015), ('minb', 0.015), ('bias', 0.014), ('eligibility', 0.014), ('physical', 0.014), ('angular', 0.013), ('analyze', 0.013), ('rl', 0.013), ('discount', 0.013), ('wl', 0.013), ('mext', 0.013), ('mild', 0.013), ('squared', 0.013), ('sugiyama', 0.012), ('stable', 0.012), ('stability', 0.012), ('actions', 0.012), ('episodes', 0.012), ('sin', 0.012), ('bound', 0.011), ('state', 0.011), ('continuous', 0.011), ('immediate', 0.011), ('indirectly', 0.011), ('balancing', 0.011), ('stochasticity', 0.011), ('respect', 0.01), ('showing', 0.01), ('bounds', 0.01), ('cos', 0.01), ('assumptions', 0.01), ('demonstrated', 0.01), ('unstable', 0.01), ('moving', 0.01), ('trick', 0.01), ('control', 0.01), ('trace', 0.01), ('agrees', 0.009)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.99999994 36 nips-2011-Analysis and Improvement of Policy Gradient Estimation

Author: Tingting Zhao, Hirotaka Hachiya, Gang Niu, Masashi Sugiyama

Abstract: Policy gradient is a useful model-free reinforcement learning approach, but it tends to suffer from instability of gradient estimates. In this paper, we analyze and improve the stability of policy gradient methods. We first prove that the variance of gradient estimates in the PGPE (policy gradients with parameter-based exploration) method is smaller than that of the classical REINFORCE method under a mild assumption. We then derive the optimal baseline for PGPE, which contributes to further reducing the variance. We also theoretically show that PGPE with the optimal baseline is more preferable than REINFORCE with the optimal baseline in terms of the variance of gradient estimates. Finally, we demonstrate the usefulness of the improved PGPE method through experiments. 1

2 0.14272735 16 nips-2011-A reinterpretation of the policy oscillation phenomenon in approximate policy iteration

Author: Paul Wagner

Abstract: A majority of approximate dynamic programming approaches to the reinforcement learning problem can be categorized into greedy value function methods and value-based policy gradient methods. The former approach, although fast, is well known to be susceptible to the policy oscillation phenomenon. We take a fresh view to this phenomenon by casting a considerable subset of the former approach as a limiting special case of the latter. We explain the phenomenon in terms of this view and illustrate the underlying mechanism with artificial examples. We also use it to derive the constrained natural actor-critic algorithm that can interpolate between the aforementioned approaches. In addition, it has been suggested in the literature that the oscillation phenomenon might be subtly connected to the grossly suboptimal performance in the Tetris benchmark problem of all attempted approximate dynamic programming methods. We report empirical evidence against such a connection and in favor of an alternative explanation. Finally, we report scores in the Tetris problem that improve on existing dynamic programming based results. 1

3 0.10913777 215 nips-2011-Policy Gradient Coagent Networks

Author: Philip S. Thomas

Abstract: We present a novel class of actor-critic algorithms for actors consisting of sets of interacting modules. We present, analyze theoretically, and empirically evaluate an update rule for each module, which requires only local information: the module’s input, output, and the TD error broadcast by a critic. Such updates are necessary when computation of compatible features becomes prohibitively difficult and are also desirable to increase the biological plausibility of reinforcement learning methods. 1

4 0.081858143 163 nips-2011-MAP Inference for Bayesian Inverse Reinforcement Learning

Author: Jaedeug Choi, Kee-eung Kim

Abstract: The difficulty in inverse reinforcement learning (IRL) arises in choosing the best reward function since there are typically an infinite number of reward functions that yield the given behaviour data as optimal. Using a Bayesian framework, we address this challenge by using the maximum a posteriori (MAP) estimation for the reward function, and show that most of the previous IRL algorithms can be modeled into our framework. We also present a gradient method for the MAP estimation based on the (sub)differentiability of the posterior distribution. We show the effectiveness of our approach by comparing the performance of the proposed method to those of the previous algorithms. 1

5 0.076533154 18 nips-2011-Action-Gap Phenomenon in Reinforcement Learning

Author: Amir-massoud Farahmand

Abstract: Many practitioners of reinforcement learning problems have observed that oftentimes the performance of the agent reaches very close to the optimal performance even though the estimated (action-)value function is still far from the optimal one. The goal of this paper is to explain and formalize this phenomenon by introducing the concept of the action-gap regularity. As a typical result, we prove that for an agent following the greedy policy π̂ with respect to an action-value function Q̂, the performance loss E[V∗(X) − V^π̂(X)] is upper bounded by O(‖Q̂ − Q∗‖_∞^{1+ζ}), in which ζ ≥ 0 is the parameter quantifying the action-gap regularity. For ζ > 0, our results indicate smaller performance loss compared to what previous analyses had suggested. Finally, we show how this regularity affects the performance of the family of approximate value iteration algorithms. 1

6 0.065798827 174 nips-2011-Monte Carlo Value Iteration with Macro-Actions

7 0.062752999 300 nips-2011-Variance Reduction in Monte-Carlo Tree Search

8 0.052341875 10 nips-2011-A Non-Parametric Approach to Dynamic Programming

9 0.051841442 245 nips-2011-Selecting the State-Representation in Reinforcement Learning

10 0.0517128 206 nips-2011-Optimal Reinforcement Learning for Gaussian Systems

11 0.051220775 88 nips-2011-Environmental statistics and the trade-off between model-based and TD learning in humans

12 0.049566649 212 nips-2011-Periodic Finite State Controllers for Efficient POMDP and DEC-POMDP Planning

13 0.049252886 50 nips-2011-Budgeted Optimization with Concurrent Stochastic-Duration Experiments

14 0.047280904 283 nips-2011-The Fixed Points of Off-Policy TD

15 0.044303946 278 nips-2011-TD gamma: Re-evaluating Complex Backups in Temporal Difference Learning

16 0.043108057 210 nips-2011-PAC-Bayesian Analysis of Contextual Bandits

17 0.038716361 118 nips-2011-High-dimensional regression with noisy and missing data: Provable guarantees with non-convexity

18 0.038480692 190 nips-2011-Nonlinear Inverse Reinforcement Learning with Gaussian Processes

19 0.031565353 144 nips-2011-Learning Auto-regressive Models from Sequence and Non-sequence Data

20 0.028082252 72 nips-2011-Distributed Delayed Stochastic Optimization


similar papers computed by lsi model

lsi for this paper:

topicId topicWeight

[(0, 0.076), (1, -0.093), (2, 0.055), (3, 0.088), (4, -0.138), (5, 0.038), (6, 0.014), (7, -0.009), (8, -0.015), (9, 0.016), (10, 0.024), (11, 0.02), (12, -0.026), (13, -0.027), (14, -0.049), (15, -0.005), (16, 0.019), (17, 0.05), (18, 0.003), (19, 0.022), (20, -0.037), (21, -0.056), (22, -0.006), (23, -0.021), (24, -0.031), (25, -0.01), (26, -0.011), (27, 0.036), (28, 0.048), (29, -0.013), (30, 0.042), (31, 0.041), (32, 0.035), (33, -0.023), (34, -0.019), (35, 0.047), (36, 0.047), (37, 0.057), (38, -0.016), (39, 0.02), (40, -0.042), (41, 0.055), (42, 0.038), (43, 0.001), (44, -0.001), (45, 0.012), (46, -0.036), (47, 0.007), (48, -0.01), (49, 0.074)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.93243402 36 nips-2011-Analysis and Improvement of Policy Gradient Estimation

Author: Tingting Zhao, Hirotaka Hachiya, Gang Niu, Masashi Sugiyama

Abstract: Policy gradient is a useful model-free reinforcement learning approach, but it tends to suffer from instability of gradient estimates. In this paper, we analyze and improve the stability of policy gradient methods. We first prove that the variance of gradient estimates in the PGPE (policy gradients with parameter-based exploration) method is smaller than that of the classical REINFORCE method under a mild assumption. We then derive the optimal baseline for PGPE, which contributes to further reducing the variance. We also theoretically show that PGPE with the optimal baseline is more preferable than REINFORCE with the optimal baseline in terms of the variance of gradient estimates. Finally, we demonstrate the usefulness of the improved PGPE method through experiments. 1

2 0.84080189 16 nips-2011-A reinterpretation of the policy oscillation phenomenon in approximate policy iteration

Author: Paul Wagner

Abstract: A majority of approximate dynamic programming approaches to the reinforcement learning problem can be categorized into greedy value function methods and value-based policy gradient methods. The former approach, although fast, is well known to be susceptible to the policy oscillation phenomenon. We take a fresh view to this phenomenon by casting a considerable subset of the former approach as a limiting special case of the latter. We explain the phenomenon in terms of this view and illustrate the underlying mechanism with artificial examples. We also use it to derive the constrained natural actor-critic algorithm that can interpolate between the aforementioned approaches. In addition, it has been suggested in the literature that the oscillation phenomenon might be subtly connected to the grossly suboptimal performance in the Tetris benchmark problem of all attempted approximate dynamic programming methods. We report empirical evidence against such a connection and in favor of an alternative explanation. Finally, we report scores in the Tetris problem that improve on existing dynamic programming based results. 1

3 0.8220343 215 nips-2011-Policy Gradient Coagent Networks

Author: Philip S. Thomas

Abstract: We present a novel class of actor-critic algorithms for actors consisting of sets of interacting modules. We present, analyze theoretically, and empirically evaluate an update rule for each module, which requires only local information: the module’s input, output, and the TD error broadcast by a critic. Such updates are necessary when computation of compatible features becomes prohibitively difficult and are also desirable to increase the biological plausibility of reinforcement learning methods. 1

4 0.76257354 50 nips-2011-Budgeted Optimization with Concurrent Stochastic-Duration Experiments

Author: Javad Azimi, Alan Fern, Xiaoli Z. Fern

Abstract: Budgeted optimization involves optimizing an unknown function that is costly to evaluate by requesting a limited number of function evaluations at intelligently selected inputs. Typical problem formulations assume that experiments are selected one at a time with a limited total number of experiments, which fail to capture important aspects of many real-world problems. This paper defines a novel problem formulation with the following important extensions: 1) allowing for concurrent experiments; 2) allowing for stochastic experiment durations; and 3) placing constraints on both the total number of experiments and the total experimental time. We develop both offline and online algorithms for selecting concurrent experiments in this new setting and provide experimental results on a number of optimization benchmarks. The results show that our algorithms produce highly effective schedules compared to natural baselines. 1

5 0.70538086 10 nips-2011-A Non-Parametric Approach to Dynamic Programming

Author: Oliver B. Kroemer, Jan R. Peters

Abstract: In this paper, we consider the problem of policy evaluation for continuous-state systems. We present a non-parametric approach to policy evaluation, which uses kernel density estimation to represent the system. The true form of the value function for this model can be determined, and can be computed using Galerkin’s method. Furthermore, we also present a unified view of several well-known policy evaluation methods. In particular, we show that the same Galerkin method can be used to derive Least-Squares Temporal Difference learning, Kernelized Temporal Difference learning, and a discrete-state Dynamic Programming solution, as well as our proposed method. In a numerical evaluation of these algorithms, the proposed approach performed better than the other methods. 1

6 0.6596145 174 nips-2011-Monte Carlo Value Iteration with Macro-Actions

7 0.63611233 212 nips-2011-Periodic Finite State Controllers for Efficient POMDP and DEC-POMDP Planning

8 0.61907494 300 nips-2011-Variance Reduction in Monte-Carlo Tree Search

9 0.6112029 18 nips-2011-Action-Gap Phenomenon in Reinforcement Learning

10 0.55498159 237 nips-2011-Reinforcement Learning using Kernel-Based Stochastic Factorization

11 0.52642512 65 nips-2011-Convergent Fitted Value Iteration with Linear Function Approximation

12 0.49001563 163 nips-2011-MAP Inference for Bayesian Inverse Reinforcement Learning

13 0.4233585 206 nips-2011-Optimal Reinforcement Learning for Gaussian Systems

14 0.40574712 283 nips-2011-The Fixed Points of Off-Policy TD

15 0.39061409 245 nips-2011-Selecting the State-Representation in Reinforcement Learning

16 0.38257828 256 nips-2011-Solving Decision Problems with Limited Information

17 0.35258314 155 nips-2011-Learning to Agglomerate Superpixel Hierarchies

18 0.30610996 52 nips-2011-Clustering via Dirichlet Process Mixture Models for Portable Skill Discovery

19 0.29528034 278 nips-2011-TD gamma: Re-evaluating Complex Backups in Temporal Difference Learning

20 0.29458165 225 nips-2011-Probabilistic amplitude and frequency demodulation


similar papers computed by lda model

lda for this paper:

topicId topicWeight

[(4, 0.02), (11, 0.32), (20, 0.014), (26, 0.02), (31, 0.085), (33, 0.018), (43, 0.056), (45, 0.127), (57, 0.048), (74, 0.025), (83, 0.022), (99, 0.084)]

similar papers list:

simIndex simValue paperId paperTitle

1 0.73722631 73 nips-2011-Divide-and-Conquer Matrix Factorization

Author: Lester W. Mackey, Michael I. Jordan, Ameet Talwalkar

Abstract: This work introduces Divide-Factor-Combine (DFC), a parallel divide-andconquer framework for noisy matrix factorization. DFC divides a large-scale matrix factorization task into smaller subproblems, solves each subproblem in parallel using an arbitrary base matrix factorization algorithm, and combines the subproblem solutions using techniques from randomized matrix approximation. Our experiments with collaborative filtering, video background modeling, and simulated data demonstrate the near-linear to super-linear speed-ups attainable with this approach. Moreover, our analysis shows that DFC enjoys high-probability recovery guarantees comparable to those of its base algorithm.

2 0.73001015 18 nips-2011-Action-Gap Phenomenon in Reinforcement Learning

Author: Amir-massoud Farahmand

Abstract: Many practitioners of reinforcement learning problems have observed that oftentimes the performance of the agent reaches very close to the optimal performance even though the estimated (action-)value function is still far from the optimal one. The goal of this paper is to explain and formalize this phenomenon by introducing the concept of the action-gap regularity. As a typical result, we prove that for an agent following the greedy policy π̂ with respect to an action-value function Q̂, the performance loss E[V∗(X) − V^π̂(X)] is upper bounded by O(‖Q̂ − Q∗‖_∞^{1+ζ}), in which ζ ≥ 0 is the parameter quantifying the action-gap regularity. For ζ > 0, our results indicate smaller performance loss compared to what previous analyses had suggested. Finally, we show how this regularity affects the performance of the family of approximate value iteration algorithms. 1

same-paper 3 0.72011262 36 nips-2011-Analysis and Improvement of Policy Gradient Estimation

Author: Tingting Zhao, Hirotaka Hachiya, Gang Niu, Masashi Sugiyama

Abstract: Policy gradient is a useful model-free reinforcement learning approach, but it tends to suffer from instability of gradient estimates. In this paper, we analyze and improve the stability of policy gradient methods. We first prove that the variance of gradient estimates in the PGPE (policy gradients with parameter-based exploration) method is smaller than that of the classical REINFORCE method under a mild assumption. We then derive the optimal baseline for PGPE, which contributes to further reducing the variance. We also theoretically show that PGPE with the optimal baseline is more preferable than REINFORCE with the optimal baseline in terms of the variance of gradient estimates. Finally, we demonstrate the usefulness of the improved PGPE method through experiments. 1

4 0.66871738 66 nips-2011-Crowdclustering

Author: Ryan G. Gomes, Peter Welinder, Andreas Krause, Pietro Perona

Abstract: Is it possible to crowdsource categorization? Amongst the challenges: (a) each worker has only a partial view of the data, (b) different workers may have different clustering criteria and may produce different numbers of categories, (c) the underlying category structure may be hierarchical. We propose a Bayesian model of how workers may approach clustering and show how one may infer clusters / categories, as well as worker parameters, using this model. Our experiments, carried out on large collections of images, suggest that Bayesian crowdclustering works well and may be superior to single-expert annotations. 1

5 0.65587831 150 nips-2011-Learning a Distance Metric from a Network

Author: Blake Shaw, Bert Huang, Tony Jebara

Abstract: Many real-world networks are described by both connectivity information and features for every node. To better model and understand these networks, we present structure preserving metric learning (SPML), an algorithm for learning a Mahalanobis distance metric from a network such that the learned distances are tied to the inherent connectivity structure of the network. Like the graph embedding algorithm structure preserving embedding, SPML learns a metric which is structure preserving, meaning a connectivity algorithm such as k-nearest neighbors will yield the correct connectivity when applied using the distances from the learned metric. We show a variety of synthetic and real-world experiments where SPML predicts link patterns from node features more accurately than standard techniques. We further demonstrate a method for optimizing SPML based on stochastic gradient descent which removes the running-time dependency on the size of the network and allows the method to easily scale to networks of thousands of nodes and millions of edges. 1

6 0.50012708 278 nips-2011-TD gamma: Re-evaluating Complex Backups in Temporal Difference Learning

7 0.49968955 16 nips-2011-A reinterpretation of the policy oscillation phenomenon in approximate policy iteration

8 0.49926811 37 nips-2011-Analytical Results for the Error in Filtering of Gaussian Processes

9 0.49734962 271 nips-2011-Statistical Tests for Optimization Efficiency

10 0.49658951 32 nips-2011-An Empirical Evaluation of Thompson Sampling

11 0.49616042 10 nips-2011-A Non-Parametric Approach to Dynamic Programming

12 0.49495515 163 nips-2011-MAP Inference for Bayesian Inverse Reinforcement Learning

13 0.49486196 159 nips-2011-Learning with the weighted trace-norm under arbitrary sampling distributions

14 0.49228477 186 nips-2011-Noise Thresholds for Spectral Clustering

15 0.49136919 263 nips-2011-Sparse Manifold Clustering and Embedding

16 0.49050704 291 nips-2011-Transfer from Multiple MDPs

17 0.48924989 283 nips-2011-The Fixed Points of Off-Policy TD

18 0.48918825 180 nips-2011-Multiple Instance Filtering

19 0.48891684 149 nips-2011-Learning Sparse Representations of High Dimensional Data on Large Scale Dictionaries

20 0.48688012 45 nips-2011-Beating SGD: Learning SVMs in Sublinear Time