nips nips2013 nips2013-28 knowledge-graph by maker-knowledge-mining

28 nips-2013-Adaptive Step-Size for Policy Gradient Methods


Source: pdf

Author: Matteo Pirotta, Marcello Restelli, Luca Bascetta

Abstract: In the last decade, policy gradient methods have significantly grown in popularity in the reinforcement–learning field. In particular, they have been largely employed in motor control and robotic applications, thanks to their ability to cope with continuous state and action domains and partially observable problems. Policy gradient research has mainly focused on the identification of effective gradient directions and the proposal of efficient estimation algorithms. Nonetheless, the performance of policy gradient methods is determined not only by the gradient direction, since convergence properties are strongly influenced by the choice of the step size: small values imply a slow convergence rate, while large values may lead to oscillations or even divergence of the policy parameters. The step–size value is usually chosen by hand tuning, and still little attention has been paid to its automatic selection. In this paper, we propose to determine the learning rate by maximizing a lower bound to the expected performance gain. Focusing on Gaussian policies, we derive a lower bound that is a second–order polynomial of the step size, and we show how a simplified version of such a lower bound can be maximized when the gradient is estimated from trajectory samples. The properties of the proposed approach are empirically evaluated on a linear–quadratic regulator problem.

Reference: text


Summary: the most important sentences generated by the tfidf model

sentIndex sentText sentNum sentScore

1 Abstract: In the last decade, policy gradient methods have significantly grown in popularity in the reinforcement–learning field. [sent-19, score-0.909]

2 Policy gradient research has mainly focused on the identification of effective gradient directions and the proposal of efficient estimation algorithms. [sent-21, score-0.466]

3 Focusing on Gaussian policies, we derive a lower bound that is a second–order polynomial of the step size, and we show how a simplified version of such a lower bound can be maximized when the gradient is estimated from trajectory samples. [sent-25, score-0.571]

4 1 Introduction: Policy gradient methods have established themselves as the most effective reinforcement–learning techniques in robotic applications. [sent-27, score-0.255]

5 Such methods perform a policy search to maximize the expected return of a policy in a parameterized policy class. [sent-28, score-2.067]

6 Compared to several traditional reinforcement–learning approaches, policy gradients scale well to high–dimensional continuous state and action problems, and no changes to the algorithms are needed to face uncertainty in the state due to limited and noisy sensors. [sent-30, score-0.869]

7 Furthermore, the policy representation can be properly designed for the given task, thus allowing domain knowledge to be incorporated into the algorithm, which is useful to speed up the learning process and to prevent the unexpected execution of dangerous policies that may harm the system. [sent-31, score-0.861]

8 Thanks to these advantages, from the 1990s policy gradient methods have been widely used to learn complex control tasks [1]. [sent-33, score-0.927]

9 The research in these years has focused on obtaining good model–free estimators of the policy gradient using data generated during the task execution. [sent-34, score-0.909]

10 The oldest policy gradient approaches are finite–difference methods [2], which estimate the gradient direction by solving a regression problem based on the performance evaluation of policies associated with different small perturbations of the current parameterization. [sent-35, score-1.353]

11 Finite–difference methods have some advantages: they are easy to implement, do not need assumptions on the differentiability of the policy w.r.t. [sent-36, score-0.676]

12 the policy parameters, and are efficient in deterministic settings. [sent-39, score-0.676]

13 Such drawbacks have been overcome by likelihood ratio methods [3, 4, 5], since they do not need to generate policy parameters variations and quickly converge even in highly stochastic systems. [sent-42, score-0.676]

14 To further improve the efficiency of policy gradient methods, natural gradient approaches (where the steepest ascent is computed w.r.t. the Fisher information metric) have been proposed. [sent-44, score-1.142]

15 Natural gradients still converge to locally optimal policies, are independent of the policy parameterization, need less data to attain a good gradient estimate, and are less affected by plateaus. [sent-48, score-0.948]

16 Once an accurate estimate of the gradient direction is obtained, the policy parameters are updated by $\theta_{t+1} = \theta_t + \alpha_t \nabla_\theta J|_{\theta=\theta_t}$, where $\alpha_t \in \mathbb{R}^+$ is the step size in the direction of the gradient. [sent-49, score-1.063]
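
To make the update rule above concrete, here is a minimal gradient-ascent sketch in Python; `grad_fn` and `step_size_fn` are hypothetical callables standing in for the gradient estimator and the step-size rule (a constant, or the bound-maximizing rule discussed later in the summary).

```python
import numpy as np

def gradient_ascent(theta0, grad_fn, step_size_fn, n_iters):
    """Generic policy-gradient ascent: theta_{t+1} = theta_t + alpha_t * grad_J(theta_t)."""
    theta = np.asarray(theta0, dtype=float)
    for _ in range(n_iters):
        grad = grad_fn(theta)          # (estimate of) the policy gradient at theta
        alpha = step_size_fn(grad)     # step size: fixed value or adaptive rule
        theta = theta + alpha * grad
    return theta
```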

17 Although, given an unbiased gradient estimate, convergence to a local optimum can be guaranteed under mild conditions over the learning–rate values [9], their choice may significantly affect the convergence speed or the behavior during the transient. [sent-50, score-0.313]

18 Updating the policy with large step sizes may lead to policy oscillations or even divergence [10], while trying to avoid such phenomena by using small learning rates leads to a growth in the number of iterations that is unbearable in most real–world applications. [sent-51, score-1.481]

19 In general unconstrained programming, the optimal step size for gradient ascent methods is determined through line–search algorithms [11], which require trying different values for the learning rate and evaluating the function value at the corresponding updated points. [sent-52, score-0.31]

20 Such an approach is infeasible for policy gradient methods, since it would require performing a large number of policy evaluations. [sent-53, score-1.585]

21 Despite these difficulties, up to now, little attention has been paid to the study of step–size computation for policy gradient algorithms. [sent-54, score-0.967]

22 Nonetheless, some policy search methods based on expectation–maximization have been recently proposed; such methods have properties similar to the ones of policy gradients, but the policy update does not require tuning the step size [12, 13]. [sent-55, score-2.123]

23 In this paper, we propose a new approach to compute the step size in policy gradient methods that guarantees an improvement at each step, thus avoiding oscillation and divergence issues. [sent-56, score-1.091]

24 Starting from a lower bound to the difference of performance between two policies, in Section 3 we derive a lower bound in the case where the new policy is obtained from the old one by changing its parameters along the gradient direction. [sent-57, score-1.124]

25 In Section 4, we show how the bound simplifies to a quadratic function of the step size when Gaussian policies are considered, and Section 5 studies how the bound needs to be changed in approximated settings (e. [sent-61, score-0.394]

26 , model–free case) where the policy gradient needs to be estimated directly from experience. [sent-63, score-0.93]

27 The policy of an agent is characterized by a density distribution π(·|s) that specifies for each state s the density distribution over the action space A. [sent-65, score-0.804]

28 To measure the distance between two policies we will use the norm $\|\pi' - \pi\|_\infty = \sup_{s \in S} \int_A |\pi'(a|s) - \pi(a|s)|\, da$, that is, the supremum over the state space of the total variation between the action distributions of policies $\pi'$ and $\pi$. [sent-66, score-0.962]

29 For each state s, we define the utility of following a stationary policy π as $V^\pi(s) = \mathbb{E}_{a_t \sim \pi,\, s_t \sim \mathcal{P}}\left[\sum_{t=0}^{\infty} \gamma^t R(s_t, a_t) \,\middle|\, s_0 = s\right]$. [sent-68, score-0.802]
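
A minimal Monte Carlo sketch of this value definition, truncated at a finite horizon; `sample_action`, `step` and `reward` are hypothetical callables standing in for the policy π, the transition model P and the reward R.

```python
import numpy as np

def mc_value_estimate(sample_action, step, reward, s0, gamma, horizon, n_episodes, seed=0):
    """Truncated Monte Carlo estimate of V^pi(s0) = E[sum_t gamma^t R(s_t, a_t) | s_0 = s0]."""
    rng = np.random.default_rng(seed)
    total = 0.0
    for _ in range(n_episodes):
        s, ret, disc = s0, 0.0, 1.0
        for _ in range(horizon):
            a = sample_action(s, rng)
            ret += disc * reward(s, a)
            s = step(s, a, rng)
            disc *= gamma
        total += ret
    return total / n_episodes
```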

30 Solving an MDP means finding a policy $\pi^*$ that maximizes the expected long-term reward: $\pi^* \in \arg\max_{\pi \in \Pi} J^\pi_D$. [sent-71, score-0.697]

31 For any MDP there exists at least one deterministic optimal policy that simultaneously maximizes V π (s), ∀s ∈ S. [sent-72, score-0.676]

32 , the value of taking action a in state s and following policy π thereafter: $Q^\pi(s,a) = R(s,a) + \gamma \int_S \mathcal{P}(s'|s,a) \int_A \pi(a'|s')\, Q^\pi(s',a')\, da'\, ds'$. [sent-75, score-0.823]

33 Furthermore, we define the advantage function $A^\pi(s,a) = Q^\pi(s,a) - V^\pi(s)$, which quantifies the advantage (or disadvantage) of taking action a in state s instead of following policy π. [sent-76, score-0.785]

34 In particular, for each state s, we define the advantage of a policy π' over policy π as $A^{\pi'}_{\pi}(s) = \int_A \pi'(a|s)\, A^\pi(s,a)\, da$ and, following [14], we define its expected value w.r.t. an initial state distribution µ, denoted $A^{\pi'}_{\pi,\mu}$. [sent-77, score-1.423]

35 We consider the problem of finding a policy that maximizes the expected discounted reward over a class of parameterized policies $\Pi_\theta = \{\pi_\theta : \theta \in \mathbb{R}^m\}$, where $\pi_\theta$ is a compact representation of $\pi(a|s,\theta)$. [sent-81, score-0.961]

36 The exact gradient of the expected discounted reward w.r.t. [sent-82, score-0.373]

37 the policy parameters [5] is: $\nabla_\theta J_\mu(\theta) = \frac{1}{1-\gamma} \int_S d^{\pi_\theta}_\mu(s) \int_A \nabla_\theta \pi(a|s,\theta)\, Q^{\pi_\theta}(s,a)\, da\, ds$. [sent-85, score-0.676]

38 The policy parameters can be updated by following the direction of the gradient of the expected discounted reward: $\theta' = \theta + \alpha \nabla_\theta J_\mu(\theta)$. [sent-86, score-1.015]

39 In the following, we will denote by $\|\nabla_\theta J_\mu(\theta)\|_1$ and $\|\nabla_\theta J_\mu(\theta)\|_2$ the L1– and L2–norm of the policy gradient vector, respectively. [sent-87, score-0.909]

40 3 Policy Gradient Formulation: In this section we provide a lower bound to the improvement obtained by updating the policy parameters along the gradient direction, as a function of the step size. [sent-88, score-1.117]

41 The idea is to start from the general lower bound on the performance difference between any pair of policies introduced in [15] and specialize it to the policy gradient framework. [sent-89, score-1.194]

42 Policy gradient approaches provide search directions characterized by increasing advantage values and, through the step–size value, make it possible to control the difference between the new policy and the target one. [sent-94, score-1.068]

43 Exploiting a lower bound to the first order Taylor’s expansion, we can bound the difference between the current policy and the new policy, whose parameters are adjusted along the gradient direction, as a function of the step size α. [sent-95, score-1.172]

44 Let the update of the policy parameters be $\theta' = \theta + \alpha \nabla_\theta J_\mu(\theta)$. [sent-98, score-0.676]

45 $\sum_{i,j=1}^{m} \frac{\partial^2 \pi(a|s,\theta)}{\partial\theta_i \partial\theta_j}\Big|_{\theta + c\Delta\theta} \frac{\Delta\theta_i \Delta\theta_j}{1 + \mathbb{I}(i=j)}$. By combining the two previous lemmas, it is possible to derive the policy performance improvement obtained by following the gradient direction. [sent-100, score-0.947]

46 4 The Gaussian Policy Model: In this section we consider the Gaussian policy model with fixed standard deviation σ, whose mean is a linear combination of the state feature vector φ(·) through a parameter vector θ of size m: $\pi(a|s,\theta) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left(-\frac{1}{2}\left(\frac{a - \theta^T\phi(s)}{\sigma}\right)^2\right)$. [sent-113, score-0.756]
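
A small sketch of this Gaussian policy class; `phi` is any user-supplied feature map, and `grad_log` returns the score $\nabla_\theta \log \pi(a|s,\theta)$ used by the likelihood-ratio estimators discussed below.

```python
import numpy as np

class GaussianLinearPolicy:
    """Gaussian policy pi(a|s, theta) = N(theta^T phi(s), sigma^2) with fixed sigma."""
    def __init__(self, theta, sigma, phi):
        self.theta, self.sigma, self.phi = np.asarray(theta, dtype=float), sigma, phi

    def sample(self, s, rng):
        return rng.normal(self.theta @ self.phi(s), self.sigma)

    def grad_log(self, s, a):
        # grad_theta log pi(a|s, theta) = (a - theta^T phi(s)) * phi(s) / sigma^2
        f = self.phi(s)
        return (a - self.theta @ f) * f / self.sigma**2
```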

47 In the case of Gaussian policies, each second–order derivative of policy πθ can be easily bounded. [sent-114, score-0.676]

48 For any Gaussian policy $\pi(a|s,\theta) \sim \mathcal{N}(\theta^T \phi(s), \sigma^2)$, the second-order derivative of the policy can be bounded as follows: $\left|\frac{\partial^2 \pi(a|s,\theta)}{\partial\theta_i \partial\theta_j}\right| \le \frac{|\phi_i(s)\phi_j(s)|}{\sqrt{2\pi}\,\sigma^3}$, $\forall \theta \in \mathbb{R}^m$, $\forall a \in A$. [sent-117, score-1.395]

49 For any pair of stationary policies $\pi_\theta$ and $\pi_{\theta'}$, such that $\theta' = \theta + \alpha \nabla_\theta J_\mu(\theta)$, the norm of their difference can be upper bounded as follows: $\|\pi_{\theta'} - \pi_\theta\|_\infty \le \frac{\alpha M_\phi}{\sigma}\, \|\nabla_\theta J_\mu(\theta)\|_1$. [sent-129, score-0.339]

50 By plugging the bound of the previous lemma into Equation (1), we can obtain a lower bound to the performance difference between a Gaussian policy πθ and another policy along the gradient direction that is quadratic in the step size α. [sent-132, score-1.835]
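
Since the bound is quadratic in α, say $B(\alpha) = c_1 \alpha - c_2 \alpha^2$ with $c_2 > 0$, its maximizer has the closed form $\alpha^* = c_1 / (2 c_2)$. The sketch below assumes the two coefficients have already been computed from the bound (in the paper they involve the gradient norms and the problem constants); `c1` and `c2` are illustrative names.

```python
def optimal_step_size(c1, c2):
    """Maximizer of a quadratic lower bound B(alpha) = c1*alpha - c2*alpha**2, c2 > 0."""
    assert c2 > 0, "the penalization coefficient must be positive"
    return c1 / (2.0 * c2)
```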

51 For any starting state distribution µ, and any pair of stationary Gaussian policies $\pi_\theta \sim \mathcal{N}(\theta^T \phi(s), \sigma^2)$ and $\pi_{\theta'} \sim \mathcal{N}(\theta'^T \phi(s), \sigma^2)$, such that $\theta' = \theta + \alpha \nabla_\theta J_\mu(\theta)$ and under Assumption 4. [sent-135, score-0.293]

52 , the policy gradient and the supremum value of the Q–function). [sent-147, score-0.953]

53 In this section, we study how the step size can be chosen, when the gradient is estimated through sample trajectories, so as to guarantee a performance improvement with high probability. [sent-151, score-0.455]

54 3, in order to obtain a bound where the only element that needs to be estimated is the policy gradient $\nabla_\theta J_\mu(\theta)$. [sent-153, score-0.995]

55 For any starting state distribution µ, and any pair of stationary Gaussian policies $\pi_\theta \sim \mathcal{N}(\theta^T \phi(s), \sigma^2)$ and $\pi_{\theta'} \sim \mathcal{N}(\theta'^T \phi(s), \sigma^2)$, such that $\theta' = \theta + \alpha \nabla_\theta J_\mu(\theta)$ and under Assumption 4. [sent-156, score-0.293]

56 Since we are assuming that the policy gradient $\nabla_\theta J_\mu(\theta)$ is estimated through trajectory samples, the lower bound in Corollary 5. [sent-158, score-1.046]

57 Given a set of trajectories obtained by following policy $\pi_\theta$, we can produce an estimate $\hat\nabla_\theta J_\mu(\theta)$ of the policy gradient, and we assume to be able to produce a vector $\epsilon = [\epsilon_1, \dots, \epsilon_m]^T$ of per–component approximation errors. [sent-160, score-1.69]

58 In particular, to preserve the inequality sign, the estimated approximation error must be used to decrease the L2–norm of the policy gradient in the first term (the one that provides the positive contribution to the performance improvement) and to increase the L1–norm in the penalization term. [sent-166, score-0.954]

59 Similarly, to upper bound the L1–norm of the policy gradient, we introduce the vector $\overline{\nabla}_\theta J_\mu(\theta) = |\hat\nabla_\theta J_\mu(\theta)| + \epsilon$. [sent-168, score-0.769]
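
A sketch of the pessimistic adjustment described in the last two sentences: the per-component error ε shrinks the estimate before taking the squared L2-norm (the positive term of the bound) and inflates it before taking the L1-norm (the penalization term). Function and variable names are illustrative.

```python
import numpy as np

def adjusted_gradient_norms(grad_hat, eps):
    """Return (||lower||_2^2, ||upper||_1) for a gradient estimate with error bound eps."""
    lower = np.maximum(np.abs(grad_hat) - eps, 0.0)  # underestimated magnitude per component
    upper = np.abs(grad_hat) + eps                   # overestimated magnitude per component
    return float(lower @ lower), float(np.sum(upper))
```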

60 In the following, we will discuss how the approximation error of the policy gradient can be bounded. [sent-173, score-0.933]

61 Among the several methods that have been proposed over the years, we focus on two well– understood policy–gradient estimation approaches: REINFORCE [3] and G(PO)MDP [4]/policy gradient theorem (PGT) [5]. [sent-174, score-0.251]

62 5.1 Approximation with the REINFORCE gradient estimator: The REINFORCE approach [3] is the main exponent of the likelihood–ratio family. [sent-176, score-0.254]
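
A minimal episodic REINFORCE sketch (no baseline) for trajectories stored as (s, a, r) triples; `policy.grad_log` is assumed to return $\nabla_\theta \log \pi(a|s,\theta)$, as in the Gaussian policy sketch above.

```python
import numpy as np

def reinforce_gradient(trajectories, policy, gamma):
    """Episodic REINFORCE (likelihood-ratio) estimate of grad_theta J_mu(theta)."""
    grads = []
    for traj in trajectories:                                 # traj = [(s_0, a_0, r_0), ...]
        score = sum(policy.grad_log(s, a) for (s, a, _) in traj)
        ret = sum(gamma**k * r for k, (_, _, r) in enumerate(traj))
        grads.append(score * ret)                             # full score sum times full return
    return np.mean(grads, axis=0)
```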

63 The goal is to determine the number of trajectories N in order to obtain the desired accuracy of the gradient estimate. [sent-179, score-0.319]

64 To achieve this, we exploit the upper bound to the variance of the episodic REINFORCE gradient estimator introduced in [17] for Gaussian policies. [sent-180, score-0.452]

65 Given a Gaussian policy $\pi(a|s,\theta) \sim \mathcal{N}(\theta^T \phi(s), \sigma^2)$, under the assumption of uniformly bounded rewards and basis functions (Assumption 4. [sent-183, score-0.828]

66 1), we have the following upper bound to the variance of the i–th component of the episodic REINFORCE gradient estimate $\hat\nabla^{RF}_{\theta_i} J_\mu(\theta)$: $\mathrm{Var}\big[\hat\nabla^{RF}_{\theta_i} J_\mu(\theta)\big] \le \frac{R^2 M_\phi^2 H\, (1-\gamma^H)^2}{N \sigma^2 (1-\gamma)^2}$. [sent-184, score-0.47]

67 The result of the previous lemma, combined with Chebyshev’s inequality, allows us to provide a high–probability upper bound to the gradient approximation error using the episodic REINFORCE gradient estimator. [sent-185, score-0.632]

68 Given a Gaussian policy $\pi(a|s,\theta) \sim \mathcal{N}(\theta^T \phi(s), \sigma^2)$, under the assumption of uniformly bounded rewards and basis functions (Assumption 4. [sent-188, score-0.828]

69 1), using the following number of H–step trajectories: $N = \frac{R^2 M_\phi^2 H\, (1-\gamma^H)^2}{\delta\, \epsilon_i^2\, \sigma^2 (1-\gamma)^2}$, the gradient estimate $\hat\nabla^{RF}_{\theta_i} J_\mu(\theta)$ generated by REINFORCE is such that, with probability $1-\delta$: $\big|\hat\nabla^{RF}_{\theta_i} J_\mu(\theta) - \nabla_{\theta_i} J_\mu(\theta)\big| \le \epsilon_i$. [sent-189, score-0.252]
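
The reconstructed sample-size formula translated into code; R (reward bound), M_phi (feature bound), the horizon H and the per-component accuracy eps_i follow the notation above. The exact constants come from the reconstruction of the garbled extraction and should be checked against the paper.

```python
def reinforce_num_trajectories(R, M_phi, H, gamma, sigma, eps_i, delta):
    """Trajectories so that |grad_hat_i - grad_i| <= eps_i with prob. 1 - delta (Chebyshev)."""
    return (R**2 * M_phi**2 * H * (1 - gamma**H)**2) / (
        delta * eps_i**2 * sigma**2 * (1 - gamma)**2)
```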

70 5.2 Approximation with the G(PO)MDP/PGT gradient estimator: Although the REINFORCE method is guaranteed to converge to the true gradient at the fastest possible pace, its large variance can be problematic in practice. [sent-191, score-0.561]

71 Advances in the likelihood ratio gradient estimators have produced new approaches that significantly reduce the variance of the estimate. [sent-192, score-0.289]

72 Focusing on the class of “vanilla” gradient estimators, two main approaches have been proposed: the policy gradient theorem (PGT) [5] and G(PO)MDP [4]. [sent-193, score-1.16]

73 Although they look different, their gradient estimates are equal, i.e., $\hat\nabla_\theta J^{PGT}_\mu(\theta) = \hat\nabla_\theta J^{G(PO)MDP}_\mu(\theta)$. [sent-195, score-0.252]

74 For this reason, we can limit our attention to the PGT formulation: $\hat\nabla_\theta J^{PGT}_\mu(\theta) = \frac{1}{N} \sum_{n=1}^{N} \sum_{k=1}^{H} \nabla_\theta \log \pi(a^n_k | s^n_k, \theta) \Big( \sum_{l=k}^{H} \gamma^{l-1} r^n_l - b^n_l \Big)$, where the $b^n_l \in \mathbb{R}$ have the objective of reducing the variance of the gradient estimate. [sent-197, score-0.348]
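
A sketch of the PGT/G(PO)MDP estimator above, with the baselines $b^n_l$ omitted for brevity; it reuses the hypothetical `policy.grad_log` interface from the earlier sketches. Each score term is weighted only by the discounted rewards collected from that step onwards, which is what reduces the variance with respect to REINFORCE.

```python
import numpy as np

def pgt_gradient(trajectories, policy, gamma):
    """PGT / G(PO)MDP estimate of grad_theta J_mu(theta), baselines omitted."""
    grads = []
    for traj in trajectories:                                  # traj = [(s_0, a_0, r_0), ...]
        disc_rewards = np.array([gamma**k * r for k, (_, _, r) in enumerate(traj)])
        g = 0.0
        for k, (s_k, a_k, _) in enumerate(traj):
            g = g + policy.grad_log(s_k, a_k) * disc_rewards[k:].sum()
        grads.append(g)
    return np.mean(grads, axis=0)
```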

75 Following the procedure used to bound the approximation error of REINFORCE, we need an upper bound to the variance of the gradient estimate of PGT, which is provided by the following lemma (whose proof is similar to the one used in [17] for the REINFORCE case). [sent-198, score-0.519]

76 Given a Gaussian policy $\pi(a|s,\theta) \sim \mathcal{N}(\theta^T \phi(s), \sigma^2)$, under the assumption of uniformly bounded rewards and basis functions (Assumption 4. [sent-201, score-0.828]

77 1), we have the following upper bound to the variance of the i–th component of the PGT gradient estimate $\hat\nabla^{PGT}_{\theta_i} J_\mu(\theta)$: $\mathrm{Var}\big[\hat\nabla^{PGT}_{\theta_i} J_\mu(\theta)\big] \le \frac{R^2 M_\phi^2}{N (1-\gamma)^2 \sigma^2} \Big[ \frac{1-\gamma^{2H}}{1-\gamma^2} + H \gamma^{2H} - 2 \gamma^H \frac{1-\gamma^H}{1-\gamma} \Big]$. [sent-202, score-0.421]

78 As expected, since the variance of the gradient estimate obtained with PGT is smaller than the one obtained with REINFORCE, the upper bound on the PGT variance is also smaller than the REINFORCE one. [sent-203, score-0.457]

79 Finally, we can derive the upper bound for the approximation error of the gradient estimate of PGT. [sent-205, score-0.371]

80 Given a Gaussian policy $\pi(a|s,\theta) \sim \mathcal{N}(\theta^T \phi(s), \sigma^2)$, under the assumption of uniformly bounded rewards and basis functions (Assumption 4. [sent-208, score-0.828]

81 1), using the following number of H–step trajectories: $N = \frac{R^2 M_\phi^2}{\delta\, \epsilon_i^2\, \sigma^2 (1-\gamma)^2} \Big[ \frac{1-\gamma^{2H}}{1-\gamma^2} + H \gamma^{2H} - 2 \gamma^H \frac{1-\gamma^H}{1-\gamma} \Big]$, the gradient estimate $\hat\nabla^{PGT}_{\theta_i} J_\mu(\theta)$ generated by PGT is such that, with probability $1-\delta$: $\big|\hat\nabla^{PGT}_{\theta_i} J_\mu(\theta) - \nabla_{\theta_i} J_\mu(\theta)\big| \le \epsilon_i$. [sent-209, score-0.252]
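
The corresponding sample-size rule for PGT under the reconstructed variance bound; as before, the exact constants are an assumption of this sketch and should be checked against the paper.

```python
def pgt_num_trajectories(R, M_phi, H, gamma, sigma, eps_i, delta):
    """Trajectories for |grad_hat_i - grad_i| <= eps_i with prob. 1 - delta under the PGT bound."""
    bracket = ((1 - gamma**(2 * H)) / (1 - gamma**2)
               + H * gamma**(2 * H)
               - 2 * gamma**H * (1 - gamma**H) / (1 - gamma))
    return (R**2 * M_phi**2 * bracket) / (delta * eps_i**2 * sigma**2 * (1 - gamma)**2)
```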

82 The table reports the number of iterations required by the exact gradient approach, starting from θ = 0, to learn the optimal policy parameter θ∗ = −0. [sent-221, score-0.963]

83 The table contains itmax when convergence does not occur within 30,000 iterations, and ⊥ when the algorithm diverges (θ < −1 or θ > 0). [sent-226, score-0.421]

84 The table reports, starting from θ = 0 and fixed σ = 1, the number of iterations performed before the proposed step size α becomes 0 and the last value of the policy parameter. [sent-236, score-0.807]

85 Results are shown for different numbers of trajectories (of 20 steps each) used in the gradient estimation by REINFORCE and PGT. [sent-237, score-0.319]

86 6 Numerical Simulations and Discussion: In this section we show results from numerical simulations of policy gradient in the linear–quadratic Gaussian regulation (LQG) problem as formulated in [6]. [sent-238, score-0.909]

87 The LQG problem is characterized by a transition model $s_{t+1} \sim \mathcal{N}(s_t + a_t, \sigma^2)$, a Gaussian policy $a_t \sim \mathcal{N}(\theta \cdot s_t, \sigma^2)$, and a quadratic reward rt = −0. [sent-239, score-0.832]
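
A rollout sketch for this one-dimensional LQG benchmark. The extracted sentence truncates the reward, so the 0.5 weights in the quadratic cost below, the initial-state range, and the noise on the transition are assumptions of this sketch; treat the coefficients as placeholders.

```python
import numpy as np

def lqg_trajectory(theta, sigma, horizon, rng, s0=None):
    """Roll out one trajectory of a 1-D LQG: a_t ~ N(theta * s_t, sigma^2), quadratic cost."""
    s = rng.uniform(-1.0, 1.0) if s0 is None else s0
    traj = []
    for _ in range(horizon):
        a = rng.normal(theta * s, sigma)
        r = -0.5 * (s**2 + a**2)              # assumed quadratic reward
        traj.append((s, a, r))
        s = s + a + rng.normal(0.0, sigma)    # assumed noisy linear transition
    return traj
```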

88 Table 1 shows how the number of iterations required to learn a near–optimal value of the policy parameter changes according to the standard deviation of the Gaussian policy and the step–size value. [sent-244, score-1.372]

89 In this example, the higher the policy variance, the lower the step–size value that allows divergence to be avoided, since, in the LQG, a higher policy variance implies larger policy gradient values. [sent-247, score-2.423]

90 Using the proposed step size, the policy gradient algorithm avoids divergence (since it guarantees an improvement at each iteration), and the speed of convergence is strongly affected by the variance of the Gaussian policy. [sent-249, score-1.108]

91 In general, when the policy is nearly deterministic (small variance in the Gaussian case), small changes in the parameters lead to large distances between the policies, thus negatively affecting the lower bound in Equation 1. [sent-250, score-0.826]

92 Considering policies with high variance (which might be a problem in real–world applications) allows larger step sizes to be taken safely, thus speeding up the learning process. [sent-252, score-0.248]

93 Nonetheless, increasing the variance over some threshold (making policies nearly random) produces very bad policies, so that changing the policy parameter has a small impact on the performance, and as a result slows down the learning process. [sent-253, score-0.877]

94 Table 2 provides numerical results in the approximated settings, showing the effect of varying the number of trajectories used to estimate the gradient by REINFORCE and PGT. [sent-255, score-0.357]

95 Increasing the number of trajectories reduces the uncertainty in the gradient estimates, thus allowing larger step sizes to be used and better performance to be reached. [sent-256, score-0.366]

96 Further developments include extending these results to other policy models (e. [sent-263, score-0.676]

97 , Gibbs policies) and to other policy gradient approaches (e. [sent-265, score-0.909]

98 Multivariate stochastic approximation using a simultaneous perturbation gradient approximation. [sent-273, score-0.257]

99 Policy gradient methods for reinforcement learning with function approximation. [sent-284, score-0.282]

100 A reinterpretation of the policy oscillation phenomenon in approximate policy iteration. [sent-300, score-1.379]


similar papers computed by tfidf model

tfidf for this paper:

wordName wordTfidf (topN-words)

[('policy', 0.676), ('itmax', 0.402), ('reinforce', 0.329), ('gradient', 0.233), ('pgt', 0.201), ('policies', 0.145), ('trajectories', 0.086), ('lqg', 0.074), ('dads', 0.073), ('bound', 0.065), ('rf', 0.063), ('reward', 0.063), ('mdp', 0.061), ('action', 0.059), ('da', 0.057), ('discounted', 0.056), ('variance', 0.056), ('politecnico', 0.055), ('po', 0.051), ('state', 0.05), ('episodic', 0.049), ('reinforcement', 0.049), ('gt', 0.048), ('step', 0.047), ('stationary', 0.045), ('supremum', 0.044), ('rewards', 0.044), ('bounded', 0.043), ('gaussian', 0.043), ('corollary', 0.042), ('milano', 0.042), ('divergence', 0.04), ('ds', 0.038), ('improvement', 0.038), ('jd', 0.037), ('pirotta', 0.037), ('stefan', 0.035), ('maximized', 0.035), ('starting', 0.034), ('norm', 0.032), ('marcello', 0.032), ('restelli', 0.032), ('horizon', 0.032), ('italy', 0.032), ('sup', 0.032), ('st', 0.031), ('peters', 0.03), ('size', 0.03), ('matteo', 0.03), ('lower', 0.029), ('direction', 0.029), ('lemma', 0.029), ('motor', 0.028), ('paid', 0.028), ('jan', 0.028), ('upper', 0.028), ('difference', 0.027), ('oscillation', 0.027), ('rm', 0.026), ('polynomial', 0.025), ('approximation', 0.024), ('speed', 0.024), ('nonetheless', 0.023), ('quadratic', 0.023), ('assumption', 0.022), ('oscillations', 0.022), ('robotic', 0.022), ('uniformly', 0.022), ('affected', 0.022), ('trajectory', 0.022), ('di', 0.021), ('basis', 0.021), ('estimated', 0.021), ('expected', 0.021), ('estimator', 0.021), ('scenario', 0.021), ('robots', 0.021), ('sutton', 0.021), ('transition', 0.02), ('bn', 0.02), ('iterations', 0.02), ('ar', 0.02), ('characterized', 0.019), ('estimate', 0.019), ('rl', 0.019), ('pair', 0.019), ('convergence', 0.019), ('approximated', 0.019), ('mcallester', 0.019), ('control', 0.018), ('search', 0.018), ('theorem', 0.018), ('perturbations', 0.018), ('guaranteed', 0.018), ('gradients', 0.017), ('discount', 0.017), ('continuous', 0.017), ('lemmas', 0.016), ('dangerous', 0.016), ('daniele', 0.016)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.99999988 28 nips-2013-Adaptive Step-Size for Policy Gradient Methods

Author: Matteo Pirotta, Marcello Restelli, Luca Bascetta

Abstract: In the last decade, policy gradient methods have significantly grown in popularity in the reinforcement–learning field. In particular, they have been largely employed in motor control and robotic applications, thanks to their ability to cope with continuous state and action domains and partially observable problems. Policy gradient research has mainly focused on the identification of effective gradient directions and the proposal of efficient estimation algorithms. Nonetheless, the performance of policy gradient methods is determined not only by the gradient direction, since convergence properties are strongly influenced by the choice of the step size: small values imply a slow convergence rate, while large values may lead to oscillations or even divergence of the policy parameters. The step–size value is usually chosen by hand tuning, and still little attention has been paid to its automatic selection. In this paper, we propose to determine the learning rate by maximizing a lower bound to the expected performance gain. Focusing on Gaussian policies, we derive a lower bound that is a second–order polynomial of the step size, and we show how a simplified version of such a lower bound can be maximized when the gradient is estimated from trajectory samples. The properties of the proposed approach are empirically evaluated on a linear–quadratic regulator problem.

2 0.58250296 239 nips-2013-Optimistic policy iteration and natural actor-critic: A unifying view and a non-optimality result

Author: Paul Wagner

Abstract: Approximate dynamic programming approaches to the reinforcement learning problem are often categorized into greedy value function methods and value-based policy gradient methods. As our first main result, we show that an important subset of the latter methodology is, in fact, a limiting special case of a general formulation of the former methodology; optimistic policy iteration encompasses not only most of the greedy value function methods but also natural actor-critic methods, and permits one to directly interpolate between them. The resulting continuum adjusts the strength of the Markov assumption in policy improvement and, as such, can be seen as dual in spirit to the continuum in TD(λ)-style algorithms in policy evaluation. As our second main result, we show for a substantial subset of softgreedy value function approaches that, while having the potential to avoid policy oscillation and policy chattering, this subset can never converge toward an optimal policy, except in a certain pathological case. Consequently, in the context of approximations (either in state estimation or in value function representation), the majority of greedy value function methods seem to be deemed to suffer either from the risk of oscillation/chattering or from the presence of systematic sub-optimality. 1

3 0.42253816 348 nips-2013-Variational Policy Search via Trajectory Optimization

Author: Sergey Levine, Vladlen Koltun

Abstract: In order to learn effective control policies for dynamical systems, policy search methods must be able to discover successful executions of the desired task. While random exploration can work well in simple domains, complex and highdimensional tasks present a serious challenge, particularly when combined with high-dimensional policies that make parameter-space exploration infeasible. We present a method that uses trajectory optimization as a powerful exploration strategy that guides the policy search. A variational decomposition of a maximum likelihood policy objective allows us to use standard trajectory optimization algorithms such as differential dynamic programming, interleaved with standard supervised learning for the policy itself. We demonstrate that the resulting algorithm can outperform prior methods on two challenging locomotion tasks. 1

4 0.33795196 241 nips-2013-Optimizing Instructional Policies

Author: Robert Lindsey, Michael Mozer, William J. Huggins, Harold Pashler

Abstract: Psychologists are interested in developing instructional policies that boost student learning. An instructional policy specifies the manner and content of instruction. For example, in the domain of concept learning, a policy might specify the nature of exemplars chosen over a training sequence. Traditional psychological studies compare several hand-selected policies, e.g., contrasting a policy that selects only difficult-to-classify exemplars with a policy that gradually progresses over the training sequence from easy exemplars to more difficult (known as fading). We propose an alternative to the traditional methodology in which we define a parameterized space of policies and search this space to identify the optimal policy. For example, in concept learning, policies might be described by a fading function that specifies exemplar difficulty over time. We propose an experimental technique for searching policy spaces using Gaussian process surrogate-based optimization and a generative model of student performance. Instead of evaluating a few experimental conditions each with many human subjects, as the traditional methodology does, our technique evaluates many experimental conditions each with a few subjects. Even though individual subjects provide only a noisy estimate of the population mean, the optimization method allows us to determine the shape of the policy space and to identify the global optimum, and is as efficient in its subject budget as a traditional A-B comparison. We evaluate the method via two behavioral studies, and suggest that the method has broad applicability to optimization problems involving humans outside the educational arena. 1

5 0.33086476 257 nips-2013-Projected Natural Actor-Critic

Author: Philip S. Thomas, William C. Dabney, Stephen Giguere, Sridhar Mahadevan

Abstract: Natural actor-critics form a popular class of policy search algorithms for finding locally optimal policies for Markov decision processes. In this paper we address a drawback of natural actor-critics that limits their real-world applicability—their lack of safety guarantees. We present a principled algorithm for performing natural gradient descent over a constrained domain. In the context of reinforcement learning, this allows for natural actor-critic algorithms that are guaranteed to remain within a known safe region of policy space. While deriving our class of constrained natural actor-critic algorithms, which we call Projected Natural ActorCritics (PNACs), we also elucidate the relationship between natural gradient descent and mirror descent. 1

6 0.30011025 24 nips-2013-Actor-Critic Algorithms for Risk-Sensitive MDPs

7 0.29305822 79 nips-2013-DESPOT: Online POMDP Planning with Regularization

8 0.29270989 23 nips-2013-Active Learning for Probabilistic Hypotheses Using the Maximum Gibbs Error Criterion

9 0.27749822 322 nips-2013-Symbolic Opportunistic Policy Iteration for Factored-Action MDPs

10 0.26846793 150 nips-2013-Learning Adaptive Value of Information for Structured Prediction

11 0.25317204 227 nips-2013-Online Learning in Markov Decision Processes with Adversarially Chosen Transition Probability Distributions

12 0.24513212 273 nips-2013-Reinforcement Learning in Robust Markov Decision Processes

13 0.22266586 140 nips-2013-Improved and Generalized Upper Bounds on the Complexity of Policy Iteration

14 0.19703491 165 nips-2013-Learning from Limited Demonstrations

15 0.19319545 50 nips-2013-Bayesian Mixture Modelling and Inference based Thompson Sampling in Monte-Carlo Tree Search

16 0.18949383 347 nips-2013-Variational Planning for Graph-based MDPs

17 0.18875444 29 nips-2013-Adaptive Submodular Maximization in Bandit Setting

18 0.18264101 55 nips-2013-Bellman Error Based Feature Generation using Random Projections on Sparse Spaces

19 0.18116532 38 nips-2013-Approximate Dynamic Programming Finally Performs Well in the Game of Tetris

20 0.16915143 250 nips-2013-Policy Shaping: Integrating Human Feedback with Reinforcement Learning


similar papers computed by lsi model

lsi for this paper:

topicId topicWeight

[(0, 0.257), (1, -0.514), (2, -0.224), (3, 0.268), (4, -0.034), (5, 0.047), (6, -0.167), (7, 0.136), (8, 0.037), (9, 0.054), (10, 0.03), (11, 0.005), (12, -0.032), (13, 0.006), (14, 0.097), (15, -0.072), (16, 0.011), (17, -0.05), (18, -0.078), (19, 0.058), (20, 0.04), (21, -0.049), (22, 0.015), (23, 0.008), (24, -0.03), (25, -0.025), (26, 0.057), (27, 0.046), (28, 0.01), (29, -0.02), (30, -0.029), (31, -0.05), (32, 0.044), (33, -0.055), (34, 0.02), (35, 0.043), (36, 0.046), (37, -0.024), (38, 0.017), (39, 0.003), (40, 0.019), (41, 0.022), (42, 0.025), (43, 0.006), (44, -0.016), (45, -0.01), (46, 0.016), (47, 0.028), (48, 0.007), (49, 0.019)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.97612524 28 nips-2013-Adaptive Step-Size for Policy Gradient Methods

Author: Matteo Pirotta, Marcello Restelli, Luca Bascetta

Abstract: In the last decade, policy gradient methods have significantly grown in popularity in the reinforcement–learning field. In particular, they have been largely employed in motor control and robotic applications, thanks to their ability to cope with continuous state and action domains and partially observable problems. Policy gradient research has mainly focused on the identification of effective gradient directions and the proposal of efficient estimation algorithms. Nonetheless, the performance of policy gradient methods is determined not only by the gradient direction, since convergence properties are strongly influenced by the choice of the step size: small values imply a slow convergence rate, while large values may lead to oscillations or even divergence of the policy parameters. The step–size value is usually chosen by hand tuning, and still little attention has been paid to its automatic selection. In this paper, we propose to determine the learning rate by maximizing a lower bound to the expected performance gain. Focusing on Gaussian policies, we derive a lower bound that is a second–order polynomial of the step size, and we show how a simplified version of such a lower bound can be maximized when the gradient is estimated from trajectory samples. The properties of the proposed approach are empirically evaluated on a linear–quadratic regulator problem.

2 0.9747442 239 nips-2013-Optimistic policy iteration and natural actor-critic: A unifying view and a non-optimality result

Author: Paul Wagner

Abstract: Approximate dynamic programming approaches to the reinforcement learning problem are often categorized into greedy value function methods and value-based policy gradient methods. As our first main result, we show that an important subset of the latter methodology is, in fact, a limiting special case of a general formulation of the former methodology; optimistic policy iteration encompasses not only most of the greedy value function methods but also natural actor-critic methods, and permits one to directly interpolate between them. The resulting continuum adjusts the strength of the Markov assumption in policy improvement and, as such, can be seen as dual in spirit to the continuum in TD(λ)-style algorithms in policy evaluation. As our second main result, we show for a substantial subset of softgreedy value function approaches that, while having the potential to avoid policy oscillation and policy chattering, this subset can never converge toward an optimal policy, except in a certain pathological case. Consequently, in the context of approximations (either in state estimation or in value function representation), the majority of greedy value function methods seem to be deemed to suffer either from the risk of oscillation/chattering or from the presence of systematic sub-optimality. 1

3 0.88185757 322 nips-2013-Symbolic Opportunistic Policy Iteration for Factored-Action MDPs

Author: Aswin Raghavan, Roni Khardon, Alan Fern, Prasad Tadepalli

Abstract: This paper addresses the scalability of symbolic planning under uncertainty with factored states and actions. Our first contribution is a symbolic implementation of Modified Policy Iteration (MPI) for factored actions that views policy evaluation as policy-constrained value iteration (VI). Unfortunately, a naïve approach to enforce policy constraints can lead to large memory requirements, sometimes making symbolic MPI worse than VI. We address this through our second and main contribution, symbolic Opportunistic Policy Iteration (OPI), which is a novel convergent algorithm lying between VI and MPI, that applies policy constraints if it does not increase the size of the value function representation, and otherwise performs VI backups. We also give a memory bounded version of this algorithm allowing a space-time tradeoff. Empirical results show significantly improved scalability over state-of-the-art symbolic planners. 1

4 0.8654983 38 nips-2013-Approximate Dynamic Programming Finally Performs Well in the Game of Tetris

Author: Victor Gabillon, Mohammad Ghavamzadeh, Bruno Scherrer

Abstract: Tetris is a video game that has been widely used as a benchmark for various optimization techniques including approximate dynamic programming (ADP) algorithms. A look at the literature of this game shows that while ADP algorithms that have been (almost) entirely based on approximating the value function (value function based) have performed poorly in Tetris, the methods that search directly in the space of policies by learning the policy parameters using an optimization black box, such as the cross entropy (CE) method, have achieved the best reported results. This makes us conjecture that Tetris is a game in which good policies are easier to represent, and thus, learn than their corresponding value functions. So, in order to obtain a good performance with ADP, we should use ADP algorithms that search in a policy space, instead of the more traditional ones that search in a value function space. In this paper, we put our conjecture to test by applying such an ADP algorithm, called classification-based modified policy iteration (CBMPI), to the game of Tetris. Our experimental results show that for the first time an ADP algorithm, namely CBMPI, obtains the best results reported in the literature for Tetris in both small 10 × 10 and large 10 × 20 boards. Although the CBMPI’s results are similar to those of the CE method in the large board, CBMPI uses considerably fewer (almost 1/6) samples (calls to the generative model) than CE. 1

5 0.86106944 257 nips-2013-Projected Natural Actor-Critic

Author: Philip S. Thomas, William C. Dabney, Stephen Giguere, Sridhar Mahadevan

Abstract: Natural actor-critics form a popular class of policy search algorithms for finding locally optimal policies for Markov decision processes. In this paper we address a drawback of natural actor-critics that limits their real-world applicability—their lack of safety guarantees. We present a principled algorithm for performing natural gradient descent over a constrained domain. In the context of reinforcement learning, this allows for natural actor-critic algorithms that are guaranteed to remain within a known safe region of policy space. While deriving our class of constrained natural actor-critic algorithms, which we call Projected Natural ActorCritics (PNACs), we also elucidate the relationship between natural gradient descent and mirror descent. 1

6 0.848656 241 nips-2013-Optimizing Instructional Policies

7 0.82957411 348 nips-2013-Variational Policy Search via Trajectory Optimization

8 0.76981831 165 nips-2013-Learning from Limited Demonstrations

9 0.76874828 79 nips-2013-DESPOT: Online POMDP Planning with Regularization

10 0.70664692 140 nips-2013-Improved and Generalized Upper Bounds on the Complexity of Policy Iteration

11 0.69533968 24 nips-2013-Actor-Critic Algorithms for Risk-Sensitive MDPs

12 0.64733672 23 nips-2013-Active Learning for Probabilistic Hypotheses Using the Maximum Gibbs Error Criterion

13 0.64606291 347 nips-2013-Variational Planning for Graph-based MDPs

14 0.59994036 273 nips-2013-Reinforcement Learning in Robust Markov Decision Processes

15 0.56775737 55 nips-2013-Bellman Error Based Feature Generation using Random Projections on Sparse Spaces

16 0.56664169 150 nips-2013-Learning Adaptive Value of Information for Structured Prediction

17 0.53882051 270 nips-2013-Regret based Robust Solutions for Uncertain Markov Decision Processes

18 0.53868937 29 nips-2013-Adaptive Submodular Maximization in Bandit Setting

19 0.50740057 250 nips-2013-Policy Shaping: Integrating Human Feedback with Reinforcement Learning

20 0.48552358 50 nips-2013-Bayesian Mixture Modelling and Inference based Thompson Sampling in Monte-Carlo Tree Search


similar papers computed by lda model

lda for this paper:

topicId topicWeight

[(1, 0.013), (16, 0.024), (33, 0.154), (34, 0.171), (36, 0.015), (41, 0.036), (44, 0.054), (49, 0.022), (56, 0.14), (70, 0.036), (85, 0.048), (89, 0.029), (93, 0.028), (95, 0.118)]

similar papers list:

simIndex simValue paperId paperTitle

1 0.96660471 268 nips-2013-Reflection methods for user-friendly submodular optimization

Author: Stefanie Jegelka, Francis Bach, Suvrit Sra

Abstract: Recently, it has become evident that submodularity naturally captures widely occurring concepts in machine learning, signal processing and computer vision. Consequently, there is need for efficient optimization procedures for submodular functions, especially for minimization problems. While general submodular minimization is challenging, we propose a new method that exploits existing decomposability of submodular functions. In contrast to previous approaches, our method is neither approximate, nor impractical, nor does it need any cumbersome parameter tuning. Moreover, it is easy to implement and parallelize. A key component of our method is a formulation of the discrete submodular minimization problem as a continuous best approximation problem that is solved through a sequence of reflections, and its solution can be easily thresholded to obtain an optimal discrete solution. This method solves both the continuous and discrete formulations of the problem, and therefore has applications in learning, inference, and reconstruction. In our experiments, we illustrate the benefits of our method on two image segmentation tasks. 1

2 0.95209312 220 nips-2013-On the Complexity and Approximation of Binary Evidence in Lifted Inference

Author: Guy van den Broeck, Adnan Darwiche

Abstract: Lifted inference algorithms exploit symmetries in probabilistic models to speed up inference. They show impressive performance when calculating unconditional probabilities in relational models, but often resort to non-lifted inference when computing conditional probabilities. The reason is that conditioning on evidence breaks many of the model’s symmetries, which can preempt standard lifting techniques. Recent theoretical results show, for example, that conditioning on evidence which corresponds to binary relations is #P-hard, suggesting that no lifting is to be expected in the worst case. In this paper, we balance this negative result by identifying the Boolean rank of the evidence as a key parameter for characterizing the complexity of conditioning in lifted inference. In particular, we show that conditioning on binary evidence with bounded Boolean rank is efficient. This opens up the possibility of approximating evidence by a low-rank Boolean matrix factorization, which we investigate both theoretically and empirically. 1

3 0.93602818 102 nips-2013-Efficient Algorithm for Privately Releasing Smooth Queries

Author: Ziteng Wang, Kai Fan, Jiaqi Zhang, Liwei Wang

Abstract: We study differentially private mechanisms for answering smooth queries on databases consisting of data points in Rd . A K-smooth query is specified by a function whose partial derivatives up to order K are all bounded. We develop an -differentially private mechanism which for the class of K-smooth queries has K accuracy O(n− 2d+K / ). The mechanism first outputs a summary of the database. To obtain an answer of a query, the user runs a public evaluation algorithm which contains no information of the database. Outputting the summary runs in time d O(n1+ 2d+K ), and the evaluation algorithm for answering a query runs in time d+2+ 2d K ˜ O(n 2d+K ). Our mechanism is based on L∞ -approximation of (transformed) smooth functions by low degree even trigonometric polynomials with small and efficiently computable coefficients. 1

4 0.93309563 271 nips-2013-Regularized M-estimators with nonconvexity: Statistical and algorithmic theory for local optima

Author: Po-Ling Loh, Martin J. Wainwright

Abstract: We establish theoretical results concerning local optima of regularized M estimators, where both loss and penalty functions are allowed to be nonconvex. Our results show that as long as the loss satisfies restricted strong convexity and the penalty satisfies suitable regularity conditions, any local optimum of the composite objective lies within statistical precision of the true parameter vector. Our theory covers a broad class of nonconvex objective functions, including corrected versions of the Lasso for errors-in-variables linear models and regression in generalized linear models using nonconvex regularizers such as SCAD and MCP. On the optimization side, we show that a simple adaptation of composite gradient descent may be used to compute a global optimum up to the statistical precision stat in log(1/ stat ) iterations, the fastest possible rate for any first-order method. We provide simulations to illustrate the sharpness of our theoretical predictions. 1

5 0.93038893 154 nips-2013-Learning Gaussian Graphical Models with Observed or Latent FVSs

Author: Ying Liu, Alan Willsky

Abstract: Gaussian Graphical Models (GGMs) or Gauss Markov random fields are widely used in many applications, and the trade-off between the modeling capacity and the efficiency of learning and inference has been an important research problem. In this paper, we study the family of GGMs with small feedback vertex sets (FVSs), where an FVS is a set of nodes whose removal breaks all the cycles. Exact inference such as computing the marginal distributions and the partition function has complexity O(k 2 n) using message-passing algorithms, where k is the size of the FVS, and n is the total number of nodes. We propose efficient structure learning algorithms for two cases: 1) All nodes are observed, which is useful in modeling social or flight networks where the FVS nodes often correspond to a small number of highly influential nodes, or hubs, while the rest of the networks is modeled by a tree. Regardless of the maximum degree, without knowing the full graph structure, we can exactly compute the maximum likelihood estimate with complexity O(kn2 + n2 log n) if the FVS is known or in polynomial time if the FVS is unknown but has bounded size. 2) The FVS nodes are latent variables, where structure learning is equivalent to decomposing an inverse covariance matrix (exactly or approximately) into the sum of a tree-structured matrix and a low-rank matrix. By incorporating efficient inference into the learning steps, we can obtain a learning algorithm using alternating low-rank corrections with complexity O(kn2 + n2 log n) per iteration. We perform experiments using both synthetic data as well as real data of flight delays to demonstrate the modeling capacity with FVSs of various sizes. 1

6 0.9206298 184 nips-2013-Marginals-to-Models Reducibility

7 0.91418916 97 nips-2013-Distributed Submodular Maximization: Identifying Representative Elements in Massive Data

8 0.91403055 239 nips-2013-Optimistic policy iteration and natural actor-critic: A unifying view and a non-optimality result

9 0.91383135 309 nips-2013-Statistical Active Learning Algorithms

same-paper 10 0.91128838 28 nips-2013-Adaptive Step-Size for Policy Gradient Methods

11 0.90623939 116 nips-2013-Fantope Projection and Selection: A near-optimal convex relaxation of sparse PCA

12 0.90574467 78 nips-2013-Curvature and Optimal Algorithms for Learning and Minimizing Submodular Functions

13 0.90335572 328 nips-2013-The Total Variation on Hypergraphs - Learning on Hypergraphs Revisited

14 0.9029364 333 nips-2013-Trading Computation for Communication: Distributed Stochastic Dual Coordinate Ascent

15 0.90268487 52 nips-2013-Bayesian inference as iterated random functions with applications to sequential inference in graphical models

16 0.90190065 152 nips-2013-Learning Efficient Random Maximum A-Posteriori Predictors with Non-Decomposable Loss Functions

17 0.90172237 101 nips-2013-EDML for Learning Parameters in Directed and Undirected Graphical Models

18 0.90122932 319 nips-2013-Submodular Optimization with Submodular Cover and Submodular Knapsack Constraints

19 0.89903396 122 nips-2013-First-order Decomposition Trees

20 0.89817649 249 nips-2013-Polar Operators for Structured Sparse Estimation