nips nips2013 nips2013-257 knowledge-graph by maker-knowledge-mining

257 nips-2013-Projected Natural Actor-Critic


Source: pdf

Author: Philip S. Thomas, William C. Dabney, Stephen Giguere, Sridhar Mahadevan

Abstract: Natural actor-critics form a popular class of policy search algorithms for finding locally optimal policies for Markov decision processes. In this paper we address a drawback of natural actor-critics that limits their real-world applicability—their lack of safety guarantees. We present a principled algorithm for performing natural gradient descent over a constrained domain. In the context of reinforcement learning, this allows for natural actor-critic algorithms that are guaranteed to remain within a known safe region of policy space. While deriving our class of constrained natural actor-critic algorithms, which we call Projected Natural Actor-Critics (PNACs), we also elucidate the relationship between natural gradient descent and mirror descent.

Reference: text


Summary: the most important sentences generated by tfidf model

sentIndex sentText sentNum sentScore

1 Abstract: Natural actor-critics form a popular class of policy search algorithms for finding locally optimal policies for Markov decision processes. [sent-4, score-0.471]

2 In this paper we address a drawback of natural actor-critics that limits their real-world applicability—their lack of safety guarantees. [sent-5, score-0.274]

3 We present a principled algorithm for performing natural gradient descent over a constrained domain. [sent-6, score-0.504]

4 In the context of reinforcement learning, this allows for natural actor-critic algorithms that are guaranteed to remain within a known safe region of policy space. [sent-7, score-0.822]

5 While deriving our class of constrained natural actor-critic algorithms, which we call Projected Natural Actor-Critics (PNACs), we also elucidate the relationship between natural gradient descent and mirror descent. [sent-8, score-0.801]

6 1 Introduction Natural actor-critics form a class of policy search algorithms for finding locally optimal policies for Markov decision processes (MDPs) by approximating and ascending the natural gradient [1] of an objective function. [sent-9, score-0.75]

7 A lack of safety guarantees is a common reason for avoiding the use of natural actor-critic algorithms, particularly for biomedical applications. [sent-11, score-0.308]

8 Since natural actor-critics are unconstrained optimization algorithms, there are no guarantees that they will avoid regions of policy space that are known to be dangerous. [sent-12, score-0.59]

9 A desirable property of a policy search algorithm in this context would be a guarantee that it will remain within the predicted region of stable gains during its search. [sent-16, score-0.669]

10 Consider a second example: functional electrical stimulation (FES) control of a human arm. [sent-17, score-0.179]

11 There has been a recent push to develop controllers that specify how much and when to stimulate each muscle in a human arm to move it from its current position to a desired position [5]. [sent-19, score-0.291]

12 This closed-loop control problem is particularly challenging because each person’s arm has different dynamics due to differences in, for example, length, mass, strength, clothing, and amounts of muscle atrophy, spasticity, and fatigue. [sent-20, score-0.23]

13 Hence, a proportional-derivative (PD) controller, tuned to a simulation of an ideal human arm, required manual tuning to obtain desirable performance on a human subject with biceps spasticity [6]. [sent-22, score-0.182]

14 Researchers have shown that policy search algorithms are a viable approach to creating controllers that can automatically adapt to an individual’s arm by training on a few hundred two-second reaching movements [7]. [sent-23, score-0.614]

15 However, safety concerns have been raised in regard to both this specific application and other biomedical applications of policy search algorithms. [sent-24, score-0.614]

16 Specifically, the existing state-of-the-art gradient-based algorithms, including the current natural actor-critic algorithms, are unconstrained and could potentially select dangerous policies. [sent-25, score-0.252]

17 Once again, a desirable property of a policy search algorithm would be a guarantee that it will remain within a specified region of policy space (known-safe policies). [sent-28, score-0.936]

18 In this paper we present a class of natural actor-critic algorithms that perform constrained optimization—given a known safe region of policy space, they search for a locally optimal policy while always remaining within the specified region. [sent-29, score-1.264]

19 We call our class of algorithms Projected Natural Actor-Critics (PNACs) since, whenever they generate a new policy, they project the policy back to the set of safe policies. [sent-30, score-0.566]

20 We show that natural gradient descent (ascent), which is an unconstrained optimization algorithm, is a special case of mirror descent (ascent), which is a constrained optimization algorithm. [sent-32, score-0.896]

21 In order to create a projected natural gradient algorithm, we add constraints to the mirror descent algorithm that is equivalent to natural gradient descent. [sent-33, score-1.026]

22 We apply this projected natural gradient algorithm to policy search to create the PNAC algorithms, which we validate empirically. [sent-34, score-0.836]

23 Bendrahim and Franklin [9] showed how a walking biped robot can switch to a stabilizing controller whenever the robot leaves a stable region of state space. [sent-36, score-0.378]

24 [13] developed a method for performing risk-sensitive policy search, which models the variance of the objective function for each policy and permits runtime adjustments of risk sensitivity. [sent-40, score-0.772]

25 However, their approach does not guarantee that an unsafe region of state space or policy space will be avoided. [sent-41, score-0.548]

26 [14] presented projected natural actor-critic algorithms for the average reward setting. [sent-43, score-0.292]

27 As in our projected natural actor-critic algorithms, they proposed computing the update to the policy parameters and then projecting back to the set of allowed policy parameters. [sent-44, score-1.101]

28 We show in Section 7 that the Euclidean projection can be arbitrarily bad, and argue that the projection that we propose is particularly compatible with natural actor-critics (natural gradient descent). [sent-46, score-0.539]

29 [15] presented mirror descent using the Mahalanobis norm for the proximal function, which is very similar to the proximal function that we show to cause mirror descent to be equivalent to natural gradient descent. [sent-48, score-1.114]

30 However, their proximal function is not identical to ours and they did not discuss any possible relationship between mirror descent and natural gradient descent. [sent-49, score-0.678]

31 The standard gradient descent approach is to select an initial x0 ∈ Rn , compute the direction of steepest descent, −∇f (x0 ), and then move some amount in that direction (scaled by a step size parameter, α0 ). [sent-51, score-0.339]

32 This process is then repeated indefinitely: xk+1 = xk − αk ∇f(xk), where {αk} is a step size schedule and k ∈ {1, 2, . . . }. [sent-52, score-0.269]

33 When computing the direction of steepest descent, gradient descent assumes that the vector xk resides in Euclidean space. [sent-58, score-0.704]

34 However, in several settings it is more appropriate to assume that xk resides in a Riemannian space with metric tensor G(xk ), which is an n × n positive definite matrix that may vary with xk [16]. [sent-59, score-0.664]

35 In this case, the direction of steepest descent is called the natural gradient and is given by −G(xk)^{-1} ∇f(xk) [1]. [sent-60, score-0.475]
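
As a concrete illustration of the difference, the following Python sketch contrasts an ordinary gradient step with a natural gradient step that premultiplies the gradient by G(xk)^{-1}; numpy and all helper names here are assumptions, not part of the paper.

```python
import numpy as np

def gradient_step(x, grad_f, alpha):
    """Ordinary gradient descent: x_{k+1} = x_k - alpha * grad_f(x_k)."""
    return x - alpha * grad_f(x)

def natural_gradient_step(x, grad_f, metric_G, alpha):
    """Natural gradient descent: x_{k+1} = x_k - alpha * G(x_k)^{-1} grad_f(x_k)."""
    G = metric_G(x)                            # n x n positive definite metric tensor at x
    direction = np.linalg.solve(G, grad_f(x))  # G^{-1} grad_f(x), without forming the inverse
    return x - alpha * direction

# On f(x) = 0.5 * x^T A x with an ill-conditioned A, taking A itself as the metric
# makes a unit-step natural gradient update jump straight to the minimizer.
A = np.diag([100.0, 1.0])
grad_f = lambda x: A @ x
print(gradient_step(np.array([1.0, 1.0]), grad_f, alpha=0.01))                       # barely moves along the shallow direction
print(natural_gradient_step(np.array([1.0, 1.0]), grad_f, lambda x: A, alpha=1.0))   # -> [0. 0.]
```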

36 In certain cases, (which include our policy search application), following the natural gradient is asymptotically Fisher-efficient [16]. [sent-61, score-0.721]

37 4 Mirror Descent Mirror descent algorithms form a class of highly scalable online gradient methods that are useful in constrained minimization of non-smooth functions [17, 18]. [sent-62, score-0.339]

38 The mirror descent update is xk+1 = ∇ψk^*(∇ψk(xk) − αk ∇f(xk)), (1) where ψk : Rn → R is a continuously differentiable and strongly convex function called the proximal function, and where the conjugate of ψk is ψk^*(y) = max_{x∈Rn} {x^T y − ψk(x)}, for any y ∈ Rn. [sent-64, score-0.381]

39 Different choices of ψk result in different mirror descent algorithms. [sent-65, score-0.353]

40 Intuitively, the distance metric for the space that xk resides in is not necessarily the same as that of the space that ∇f (xk ) resides in. [sent-67, score-0.461]

41 This suggests that it may not be appropriate to directly add xk and −αk ∇f (xk ) in the gradient descent update. [sent-68, score-0.575]

42 To correct this, mirror descent moves xk into the space of gradients (the dual space) with ∇ψk (xk ) before performing the gradient update. [sent-69, score-0.765]

43 It takes the result of this step in gradient space and returns it to the space of xk (the primal space) with ∇ψk^*. [sent-70, score-0.412]
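
A minimal Python sketch of this primal-dual bookkeeping follows; the function names and the identity-map example are illustrative assumptions, not the paper's notation.

```python
import numpy as np

def mirror_descent_step(x, grad_f, grad_psi, grad_psi_conj, alpha):
    """One mirror descent step: x_{k+1} = grad_psi_conj(grad_psi(x_k) - alpha * grad_f(x_k))."""
    dual_point = grad_psi(x)                    # map x_k into the dual (gradient) space
    dual_step = dual_point - alpha * grad_f(x)  # take the gradient step in the dual space
    return grad_psi_conj(dual_step)             # map the result back to the primal space

# With psi(x) = 0.5 * ||x||^2, both gradient maps are the identity and the step
# reduces to ordinary gradient descent.
identity = lambda y: y
x_next = mirror_descent_step(np.array([1.0, 2.0]), grad_f=lambda x: 2.0 * x,
                             grad_psi=identity, grad_psi_conj=identity, alpha=0.1)
print(x_next)   # -> [0.8 1.6]
```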

44 Different choices of ψk amount to different assumptions about the relationship between the primal and dual spaces at xk . [sent-71, score-0.269]

45 The natural gradient descent update at step k with metric tensor Gk, xk+1 = xk − αk Gk^{-1} ∇f(xk), (2) is equivalent to (1), the mirror descent update at step k, with ψk(x) = (1/2) x^T Gk x. [sent-74, score-1.15]

46 Inserting the definitions of ∇ψk(x) and ∇ψk^*(y) into (1), we find that the mirror descent update is xk+1 = Gk^{-1}(Gk xk − αk ∇f(xk)) = xk − αk Gk^{-1} ∇f(xk), which is identical to (2). [sent-82, score-0.919]
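
This equivalence is easy to verify numerically. The sketch below is an assumption-laden check (numpy, a randomly generated positive definite Gk, and a stand-in gradient vector): it plugs ∇ψk(x) = Gk x and ∇ψk^*(y) = Gk^{-1} y into the mirror descent update and compares the result with the natural gradient update.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 4
B = rng.normal(size=(n, n))
G = B @ B.T + n * np.eye(n)      # a positive definite metric tensor G_k
x = rng.normal(size=n)
grad = rng.normal(size=n)        # stands in for grad f(x_k)
alpha = 0.1

# Mirror descent with psi_k(x) = 0.5 * x^T G x:
#   grad_psi(x) = G x  and  grad_psi_conj(y) = G^{-1} y
mirror_step = np.linalg.solve(G, G @ x - alpha * grad)

# Natural gradient descent: x - alpha * G^{-1} grad
natural_step = x - alpha * np.linalg.solve(G, grad)

print(np.allclose(mirror_step, natural_step))   # -> True
```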

47 Although researchers often use ψk that are norms like the p-norm and Mahalanobis norm, notice that the ψk that results in natural gradient descent is not a norm. [sent-83, score-0.506]

48 6 Projected Natural Gradients When x is constrained to some set, X, ψk in mirror descent is augmented with the indicator function IX , where IX (x) = 0 if x ∈ X, and +∞ otherwise. [sent-85, score-0.386]

49 The ψk that was shown to generate an update equivalent to the natural gradient descent update, with the added constraint that x ∈ X, is ψk(x) = (1/2) x^T Gk x + IX(x). [sent-86, score-0.47]

50 Hence, mirror descent with the proximal function that produces natural gradient descent, augmented to include the constraint that x ∈ X, is: xk+1 = Π_X^{Gk}( Gk^{-1}[ (Gk + NX)(xk) − αk ∇f(xk) ] ) = Π_X^{Gk}( (Î + Gk^{-1} NX)(xk) − αk Gk^{-1} ∇f(xk) ), where Î denotes the identity relation. [sent-107, score-0.678]

51 Since xk ∈ X, we know that 0 ∈ NX(xk), and hence the update can be written as xk+1 = Π_X^{Gk}( xk − αk Gk^{-1} ∇f(xk) ), (6) which we call projected natural gradient (PNG). [sent-108, score-0.96]
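
A hedged sketch of one PNG step is given below. It assumes X is a simple box of parameter bounds and computes the Gk-weighted projection numerically with scipy; both the box constraint and the helper names are illustrative assumptions rather than the paper's construction.

```python
import numpy as np
from scipy.optimize import minimize

def project_G(y, G, bounds):
    """G-weighted projection onto a box: argmin_z (z - y)^T G (z - y) s.t. lo_i <= z_i <= hi_i."""
    lo = np.array([b[0] for b in bounds])
    hi = np.array([b[1] for b in bounds])
    objective = lambda z: (z - y) @ G @ (z - y)
    gradient = lambda z: 2.0 * G @ (z - y)       # gradient of the objective (G symmetric)
    z0 = np.clip(y, lo, hi)                      # a feasible starting point
    result = minimize(objective, z0, jac=gradient, method="L-BFGS-B", bounds=bounds)
    return result.x

def png_step(x, grad_f, metric_G, alpha, bounds):
    """Projected natural gradient, as in (6): natural gradient step, then project under G_k."""
    G = metric_G(x)
    y = x - alpha * np.linalg.solve(G, grad_f(x))   # unconstrained natural gradient step
    return project_G(y, G, bounds)
```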

52 7 Compatibility of Projection The standard projected subgradient (PSG) descent method follows the negative gradient (as opposed to the negative natural gradient) and projects back to X using the Euclidean norm. [sent-109, score-0.64]

53 This means that, when the algorithm is at x∗ and a small step is taken down the natural gradient to x′, Π_X^{Gk} will project x′ back to x∗. [sent-115, score-0.329]

54 We therefore say that Π_X^{Gk} is compatible with the natural gradient. [sent-116, score-0.208]

55 Solid arrows show the directions of the natural gradient and gradient at the optimal solution, x∗ . [sent-130, score-0.422]

56 PSG - projected subgradient descent using the Euclidean projection. [sent-133, score-0.311]

57 PNG - projected natural gradient descent using ΠGk . [sent-135, score-0.557]

58 PNG-Euclid - projected natural gradient descent using the Euclidean projection. [sent-137, score-0.557]

59 Also, notice that the natural gradient corrects for the curvature of the function and heads directly towards the global unconstrained minimum. [sent-143, score-0.379]

60 A parameterized policy, π, is a conditional probability density function—π(a|s, θ) is the probability density of action a in state s given a vector of policy parameters, θ ∈ Rn . [sent-147, score-0.415]
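
For concreteness, one simple parameterized policy of this kind is a linear-Gaussian controller; the sketch below is a hypothetical example (the FES policy used later is a PD controller with Gaussian exploration noise, but the exact parameterization here is an assumption).

```python
import numpy as np

def gaussian_policy_sample(features, theta, sigma, rng):
    """Sample an action from pi(a | s, theta) = N(theta^T phi(s), sigma^2)."""
    mean = theta @ features          # linear mean, parameterized by theta
    return rng.normal(loc=mean, scale=sigma)

rng = np.random.default_rng(1)
theta = np.array([2.0, -0.5])        # e.g. proportional and derivative gains
phi = np.array([0.3, -0.1])          # e.g. (angle error, angular velocity) for state s
action = gaussian_policy_sample(phi, theta, sigma=0.05, rng=rng)
```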

61 Given an MDP, M , and a parameterized policy, π, the goal is to find policy parameters that maximize one of these objectives. [sent-149, score-0.386]

62 When the action set is continuous, the search for globally optimal policy parameters becomes intractable, so policy search algorithms typically search for locally optimal policy parameters. [sent-150, score-1.384]

63 All of them form an estimate, typically denoted wk , of the natural gradient of J(θk ). [sent-153, score-0.309]

64 They then perform the policy parameter update, θk+1 = θk +αk wk . [sent-155, score-0.416]

65 9 Projected Natural Actor-Critics If we are given a closed convex set, Θ ⊆ Rn , of admissible policy parameters (e. [sent-156, score-0.386]

66 , the stable region of gains for a PID controller), we may wish to ensure that the policy parameters remain within Θ. [sent-158, score-0.613]

67 However, their policy parameter update equations, which are natural gradient ascent updates, can easily be modified to the projected natural gradient ascent update in (6) by projecting the parameters back onto Θ using Π_Θ^{G(θk)}: θk+1 = Π_Θ^{G(θk)}(θk + αk wk). [sent-160, score-1.195]
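
A minimal sketch of that actor update is shown below; it assumes the caller supplies the critic's natural gradient estimate wk and a Gk-weighted projection routine (for instance, something like the project_G helper sketched earlier), so every name here is hypothetical.

```python
def pnac_actor_update(theta, w_k, alpha, metric_G, project_onto_Theta):
    """One projected natural actor-critic update:
    theta_{k+1} = Pi_Theta^{G(theta_k)}(theta_k + alpha_k * w_k),
    where w_k estimates the natural gradient of J(theta_k) and project_onto_Theta(y, G)
    performs the G-weighted projection onto the admissible set Theta."""
    candidate = theta + alpha * w_k                    # unconstrained natural gradient ascent step
    return project_onto_Theta(candidate, metric_G(theta))
```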

68 Many of the existing natural policy gradient algorithms, including NAC-LSTD, eNAC, NAC-Sarsa, and INAC, follow biased estimates of the natural policy gradient [29]. [sent-161, score-1.33]

69 For our experiments, we must use an unbiased algorithm since the projection that we propose is compatible with the natural gradient, but not necessarily with biased estimates thereof. [sent-162, score-0.302]

70 In our case studies we use the projected natural actor-critic using Sarsa(λ) (PNAC-Sarsa), the projected version of the unbiased NAC-Sarsa algorithm. [sent-168, score-0.366]

71 Notice that we use t and k subscripts since many time steps of the MDP may pass between updates to the policy parameters. [sent-175, score-0.386]

72 10 Case Study: Functional Electrical Stimulation In this case study, we searched for proportional-derivative (PD) gains to control a simulated human arm undergoing FES. [sent-176, score-0.317]

73 We used the Dynamic Arm Simulator 1 (DAS1) [30], a detailed biomechanical simulation of a human arm undergoing functional electrical stimulation. [sent-177, score-0.27]

74 In a previous study, a controller created using DAS1 performed well on an actual human subject undergoing FES, although it required some additional tuning in order to cope with biceps spasticity [6]. [sent-178, score-0.298]

75 The reward function used was similar to that of Jagodnik and van den Bogert [6], which punishes joint angle error and high muscle stimulation. [sent-183, score-0.207]

76 We searched for locally optimal PD gains using PNAC-Sarsa where the policy was a PD controller with Gaussian noise added for exploration. [sent-184, score-0.612]

77 Although DAS1 does not model shoulder dislocation, we added safety constraints by limiting the l1-norm of certain pairs of gains. [sent-185, score-0.186]
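
As a sketch of how such a constraint might be enforced for a single pair of gains, the snippet below performs a G-weighted projection onto an l1 ball, written as four linear half-spaces so the solver only sees smooth constraints; this is a hypothetical simplification using scipy, not the paper's actual constraint sets.

```python
import numpy as np
from scipy.optimize import minimize

def project_pair_l1(y, G, c):
    """G-weighted projection of a pair of gains y onto {z : |z_0| + |z_1| <= c}."""
    objective = lambda z: (z - y) @ G @ (z - y)
    constraints = [{"type": "ineq",
                    "fun": lambda z, s0=s0, s1=s1: c - (s0 * z[0] + s1 * z[1])}
                   for s0 in (-1.0, 1.0) for s1 in (-1.0, 1.0)]
    result = minimize(objective, x0=np.zeros(2), constraints=constraints, method="SLSQP")
    return result.x

# A candidate pair that already satisfies |g_0| + |g_1| <= c is returned (approximately) unchanged.
print(project_pair_l1(np.array([0.3, 0.2]), np.eye(2), c=1.0))
```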

78 Antagonistic muscle pairs are as follows, listed as (flexor, extensor): monoarticular shoulder muscles (a: anterior deltoid, b: posterior deltoid); monoarticular elbow muscles (c: brachialis, d: triceps brachii (short head)); biarticular muscles (e: biceps brachii, f: triceps brachii (long head)). [sent-191, score-0.628]

79 The NAC bar is red to emphasize that the final policy found by NAC resides in the dangerous region of policy space. [sent-194, score-1.024]

80 Since we are not promoting the use of one natural actor-critic over another, we did not focus on finely tuning the natural actor-critic nor comparing the learning speeds of different natural actor-critics. [sent-199, score-0.408]

81 We suspect that PNAC converges to a near-optimal policy within the region of policy space that we have designated as safe. [sent-204, score-0.916]

82 PNAC-E converges to a policy that is worse than that found by PNAC because it uses an incompatible projection. [sent-205, score-0.428]

83 We present a second case study in which the optimal policy lies within the designated safe region of policy space, but where an unconstrained search algorithm may enter the unsafe region during its search of policy space (at which point large negative rewards return it to the safe region). [sent-207, score-1.904]

84 This results in the PD controller having only two gains (tunable policy parameters). [sent-212, score-0.583]

85 We use a crude simulation of the uBot-5 with random upper-body movements, and search for the PD gains that minimize a weighted combination of the energy used and the mean angle error (distance from vertical). [sent-213, score-0.184]

86 We constructed a set of conservative estimates of the region of stable gains, with which the uBot-5 should never fall, and used PNAC-Sarsa and NAC-Sarsa to search for the optimal gains. [sent-214, score-0.195]

87 The resulting large punishments cause NAC-Sarsa to quickly return to the safe region of policy space. [sent-224, score-0.661]

88 Both algorithms converge to gains that reside within the safe region of policy space. [sent-226, score-0.712]

89 We selected this example because it shows how, even if the optimal solution resides within the safe region of policy space (unlike in the previous case study), unconstrained RL algorithms may traverse unsafe regions of policy space during their search. [sent-227, score-1.228]

90 12 Conclusion We presented a class of algorithms, which we call projected natural actor-critics (PNACs). [sent-228, score-0.251]

91 PNACs are the simple modification of existing natural actor-critic algorithms to include a projection of newly computed policy parameters back onto an allowed set of policy parameters (e. [sent-229, score-1.052]

92 We argued that a principled projection is the one that results from viewing natural gradient descent, which is an unconstrained algorithm, as a special case of mirror descent, which is a constrained algorithm. [sent-232, score-0.693]

93 We showed that the resulting projection is compatible with the natural gradient and gave a simple empirical example that shows why a compatible projection is important. [sent-233, score-0.611]

94 This example also shows how an incompatible projection can result in natural gradient descent converging to a pessimal solution in situations where a compatible projection results in convergence to an optimal solution. [sent-234, score-0.78]

95 Finally, we applied PNAC to a simulated robot and showed its substantial benefits over unconstrained natural actor-critic algorithms. [sent-237, score-0.24]

96 A proportional derivative FES controller for planar arm movement. [sent-275, score-0.218]

97 Application of the actor-critic architecture to functional electrical stimulation control of a human arm. [sent-286, score-0.179]

98 Mirror descent and nonlinear projected subgradient methods for convex optimization. [sent-354, score-0.311]

99 Policy gradient methods for reinforcement learning with function approximation. [sent-386, score-0.205]

100 Utilizing the natural gradient in temporal difference reinforcement learning with eligibility traces. [sent-392, score-0.341]


similar papers computed by tfidf model

tfidf for this paper:

wordName wordTfidf (topN-words)

[('gk', 0.45), ('policy', 0.386), ('xk', 0.269), ('nx', 0.204), ('pnac', 0.199), ('mirror', 0.19), ('descent', 0.163), ('gradient', 0.143), ('safety', 0.138), ('natural', 0.136), ('safe', 0.13), ('png', 0.128), ('projected', 0.115), ('controller', 0.109), ('psg', 0.109), ('arm', 0.109), ('region', 0.108), ('pd', 0.102), ('resides', 0.096), ('nac', 0.096), ('projection', 0.094), ('gains', 0.088), ('muscle', 0.082), ('amherst', 0.08), ('fes', 0.072), ('pid', 0.072), ('pnacs', 0.072), ('compatible', 0.072), ('unconstrained', 0.068), ('muscles', 0.064), ('hy', 0.064), ('controllers', 0.063), ('reinforcement', 0.062), ('stimulation', 0.059), ('ix', 0.059), ('search', 0.056), ('biceps', 0.054), ('bogert', 0.054), ('brachii', 0.054), ('inac', 0.054), ('mahadevan', 0.054), ('spasticity', 0.054), ('unsafe', 0.054), ('back', 0.05), ('dangerous', 0.048), ('shoulder', 0.048), ('kuindersma', 0.048), ('proximal', 0.046), ('sarsa', 0.044), ('undergoing', 0.044), ('den', 0.044), ('electrical', 0.044), ('euclidean', 0.042), ('fell', 0.042), ('incompatible', 0.042), ('reward', 0.041), ('angle', 0.04), ('control', 0.039), ('massachusetts', 0.038), ('human', 0.037), ('cause', 0.037), ('designated', 0.036), ('mahalanobis', 0.036), ('bendrahim', 0.036), ('biomechanical', 0.036), ('blana', 0.036), ('deltoid', 0.036), ('dislocation', 0.036), ('enac', 0.036), ('giguere', 0.036), ('jagodnik', 0.036), ('monoarticular', 0.036), ('musculoskeletal', 0.036), ('nacsarsa', 0.036), ('ntd', 0.036), ('pessimal', 0.036), ('triceps', 0.036), ('robot', 0.036), ('episodes', 0.035), ('episode', 0.034), ('biomedical', 0.034), ('robotics', 0.034), ('subgradient', 0.033), ('constrained', 0.033), ('steepest', 0.033), ('notice', 0.032), ('mobility', 0.032), ('researchers', 0.032), ('stable', 0.031), ('rn', 0.03), ('wk', 0.03), ('tensor', 0.03), ('hereafter', 0.029), ('stabilizing', 0.029), ('biped', 0.029), ('principled', 0.029), ('action', 0.029), ('locally', 0.029), ('update', 0.028), ('bhatnagar', 0.028)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 1.0000005 257 nips-2013-Projected Natural Actor-Critic

Author: Philip S. Thomas, William C. Dabney, Stephen Giguere, Sridhar Mahadevan

Abstract: Natural actor-critics form a popular class of policy search algorithms for finding locally optimal policies for Markov decision processes. In this paper we address a drawback of natural actor-critics that limits their real-world applicability—their lack of safety guarantees. We present a principled algorithm for performing natural gradient descent over a constrained domain. In the context of reinforcement learning, this allows for natural actor-critic algorithms that are guaranteed to remain within a known safe region of policy space. While deriving our class of constrained natural actor-critic algorithms, which we call Projected Natural Actor-Critics (PNACs), we also elucidate the relationship between natural gradient descent and mirror descent.

2 0.3634313 239 nips-2013-Optimistic policy iteration and natural actor-critic: A unifying view and a non-optimality result

Author: Paul Wagner

Abstract: Approximate dynamic programming approaches to the reinforcement learning problem are often categorized into greedy value function methods and value-based policy gradient methods. As our first main result, we show that an important subset of the latter methodology is, in fact, a limiting special case of a general formulation of the former methodology; optimistic policy iteration encompasses not only most of the greedy value function methods but also natural actor-critic methods, and permits one to directly interpolate between them. The resulting continuum adjusts the strength of the Markov assumption in policy improvement and, as such, can be seen as dual in spirit to the continuum in TD(λ)-style algorithms in policy evaluation. As our second main result, we show for a substantial subset of softgreedy value function approaches that, while having the potential to avoid policy oscillation and policy chattering, this subset can never converge toward an optimal policy, except in a certain pathological case. Consequently, in the context of approximations (either in state estimation or in value function representation), the majority of greedy value function methods seem to be deemed to suffer either from the risk of oscillation/chattering or from the presence of systematic sub-optimality. 1

3 0.33086476 28 nips-2013-Adaptive Step-Size for Policy Gradient Methods

Author: Matteo Pirotta, Marcello Restelli, Luca Bascetta

Abstract: In the last decade, policy gradient methods have significantly grown in popularity in the reinforcement–learning field. In particular, they have been largely employed in motor control and robotic applications, thanks to their ability to cope with continuous state and action domains and partial observable problems. Policy gradient researches have been mainly focused on the identification of effective gradient directions and the proposal of efficient estimation algorithms. Nonetheless, the performance of policy gradient methods is determined not only by the gradient direction, since convergence properties are strongly influenced by the choice of the step size: small values imply slow convergence rate, while large values may lead to oscillations or even divergence of the policy parameters. Step–size value is usually chosen by hand tuning and still little attention has been paid to its automatic selection. In this paper, we propose to determine the learning rate by maximizing a lower bound to the expected performance gain. Focusing on Gaussian policies, we derive a lower bound that is second–order polynomial of the step size, and we show how a simplified version of such lower bound can be maximized when the gradient is estimated from trajectory samples. The properties of the proposed approach are empirically evaluated in a linear–quadratic regulator problem. 1

4 0.24694222 348 nips-2013-Variational Policy Search via Trajectory Optimization

Author: Sergey Levine, Vladlen Koltun

Abstract: In order to learn effective control policies for dynamical systems, policy search methods must be able to discover successful executions of the desired task. While random exploration can work well in simple domains, complex and highdimensional tasks present a serious challenge, particularly when combined with high-dimensional policies that make parameter-space exploration infeasible. We present a method that uses trajectory optimization as a powerful exploration strategy that guides the policy search. A variational decomposition of a maximum likelihood policy objective allows us to use standard trajectory optimization algorithms such as differential dynamic programming, interleaved with standard supervised learning for the policy itself. We demonstrate that the resulting algorithm can outperform prior methods on two challenging locomotion tasks. 1

5 0.20124066 29 nips-2013-Adaptive Submodular Maximization in Bandit Setting

Author: Victor Gabillon, Branislav Kveton, Zheng Wen, Brian Eriksson, S. Muthukrishnan

Abstract: Maximization of submodular functions has wide applications in machine learning and artificial intelligence. Adaptive submodular maximization has been traditionally studied under the assumption that the model of the world, the expected gain of choosing an item given previously selected items and their states, is known. In this paper, we study the setting where the expected gain is initially unknown, and it is learned by interacting repeatedly with the optimized function. We propose an efficient algorithm for solving our problem and prove that its expected cumulative regret increases logarithmically with time. Our regret bound captures the inherent property of submodular maximization, earlier mistakes are more costly than later ones. We refer to our approach as Optimistic Adaptive Submodular Maximization (OASM) because it trades off exploration and exploitation based on the optimism in the face of uncertainty principle. We evaluate our method on a preference elicitation problem and show that non-trivial K-step policies can be learned from just a few hundred interactions with the problem. 1

6 0.19161221 241 nips-2013-Optimizing Instructional Policies

7 0.18856917 23 nips-2013-Active Learning for Probabilistic Hypotheses Using the Maximum Gibbs Error Criterion

8 0.17686561 24 nips-2013-Actor-Critic Algorithms for Risk-Sensitive MDPs

9 0.16795233 79 nips-2013-DESPOT: Online POMDP Planning with Regularization

10 0.1639421 150 nips-2013-Learning Adaptive Value of Information for Structured Prediction

11 0.1610111 322 nips-2013-Symbolic Opportunistic Policy Iteration for Factored-Action MDPs

12 0.15304202 235 nips-2013-Online learning in episodic Markovian decision processes by relative entropy policy search

13 0.15025586 318 nips-2013-Structured Learning via Logistic Regression

14 0.15018088 273 nips-2013-Reinforcement Learning in Robust Markov Decision Processes

15 0.1382262 227 nips-2013-Online Learning in Markov Decision Processes with Adversarially Chosen Transition Probability Distributions

16 0.12547523 55 nips-2013-Bellman Error Based Feature Generation using Random Projections on Sparse Spaces

17 0.12528142 338 nips-2013-Two-Target Algorithms for Infinite-Armed Bandits with Bernoulli Rewards

18 0.12373927 140 nips-2013-Improved and Generalized Upper Bounds on the Complexity of Policy Iteration

19 0.123478 222 nips-2013-On the Linear Convergence of the Proximal Gradient Method for Trace Norm Regularization

20 0.12052383 165 nips-2013-Learning from Limited Demonstrations


similar papers computed by lsi model

lsi for this paper:

topicId topicWeight

[(0, 0.232), (1, -0.324), (2, -0.129), (3, 0.184), (4, -0.036), (5, 0.015), (6, -0.147), (7, 0.056), (8, 0.076), (9, 0.033), (10, 0.012), (11, 0.065), (12, -0.054), (13, -0.048), (14, 0.028), (15, -0.036), (16, -0.035), (17, -0.011), (18, -0.04), (19, 0.151), (20, 0.068), (21, -0.016), (22, 0.068), (23, 0.016), (24, -0.029), (25, -0.06), (26, 0.09), (27, 0.037), (28, 0.077), (29, 0.047), (30, -0.07), (31, 0.06), (32, 0.016), (33, -0.051), (34, -0.007), (35, 0.029), (36, -0.021), (37, 0.001), (38, 0.03), (39, 0.007), (40, -0.06), (41, 0.06), (42, -0.006), (43, -0.036), (44, 0.024), (45, -0.022), (46, -0.05), (47, -0.004), (48, 0.004), (49, 0.052)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.97514647 257 nips-2013-Projected Natural Actor-Critic

Author: Philip S. Thomas, William C. Dabney, Stephen Giguere, Sridhar Mahadevan

Abstract: Natural actor-critics form a popular class of policy search algorithms for finding locally optimal policies for Markov decision processes. In this paper we address a drawback of natural actor-critics that limits their real-world applicability—their lack of safety guarantees. We present a principled algorithm for performing natural gradient descent over a constrained domain. In the context of reinforcement learning, this allows for natural actor-critic algorithms that are guaranteed to remain within a known safe region of policy space. While deriving our class of constrained natural actor-critic algorithms, which we call Projected Natural Actor-Critics (PNACs), we also elucidate the relationship between natural gradient descent and mirror descent.

2 0.88540173 239 nips-2013-Optimistic policy iteration and natural actor-critic: A unifying view and a non-optimality result

Author: Paul Wagner

Abstract: Approximate dynamic programming approaches to the reinforcement learning problem are often categorized into greedy value function methods and value-based policy gradient methods. As our first main result, we show that an important subset of the latter methodology is, in fact, a limiting special case of a general formulation of the former methodology; optimistic policy iteration encompasses not only most of the greedy value function methods but also natural actor-critic methods, and permits one to directly interpolate between them. The resulting continuum adjusts the strength of the Markov assumption in policy improvement and, as such, can be seen as dual in spirit to the continuum in TD(λ)-style algorithms in policy evaluation. As our second main result, we show for a substantial subset of softgreedy value function approaches that, while having the potential to avoid policy oscillation and policy chattering, this subset can never converge toward an optimal policy, except in a certain pathological case. Consequently, in the context of approximations (either in state estimation or in value function representation), the majority of greedy value function methods seem to be deemed to suffer either from the risk of oscillation/chattering or from the presence of systematic sub-optimality. 1

3 0.87789416 28 nips-2013-Adaptive Step-Size for Policy Gradient Methods

Author: Matteo Pirotta, Marcello Restelli, Luca Bascetta

Abstract: In the last decade, policy gradient methods have significantly grown in popularity in the reinforcement–learning field. In particular, they have been largely employed in motor control and robotic applications, thanks to their ability to cope with continuous state and action domains and partial observable problems. Policy gradient researches have been mainly focused on the identification of effective gradient directions and the proposal of efficient estimation algorithms. Nonetheless, the performance of policy gradient methods is determined not only by the gradient direction, since convergence properties are strongly influenced by the choice of the step size: small values imply slow convergence rate, while large values may lead to oscillations or even divergence of the policy parameters. Step–size value is usually chosen by hand tuning and still little attention has been paid to its automatic selection. In this paper, we propose to determine the learning rate by maximizing a lower bound to the expected performance gain. Focusing on Gaussian policies, we derive a lower bound that is second–order polynomial of the step size, and we show how a simplified version of such lower bound can be maximized when the gradient is estimated from trajectory samples. The properties of the proposed approach are empirically evaluated in a linear–quadratic regulator problem. 1

4 0.75915176 38 nips-2013-Approximate Dynamic Programming Finally Performs Well in the Game of Tetris

Author: Victor Gabillon, Mohammad Ghavamzadeh, Bruno Scherrer

Abstract: Tetris is a video game that has been widely used as a benchmark for various optimization techniques including approximate dynamic programming (ADP) algorithms. A look at the literature of this game shows that while ADP algorithms that have been (almost) entirely based on approximating the value function (value function based) have performed poorly in Tetris, the methods that search directly in the space of policies by learning the policy parameters using an optimization black box, such as the cross entropy (CE) method, have achieved the best reported results. This makes us conjecture that Tetris is a game in which good policies are easier to represent, and thus, learn than their corresponding value functions. So, in order to obtain a good performance with ADP, we should use ADP algorithms that search in a policy space, instead of the more traditional ones that search in a value function space. In this paper, we put our conjecture to test by applying such an ADP algorithm, called classification-based modified policy iteration (CBMPI), to the game of Tetris. Our experimental results show that for the first time an ADP algorithm, namely CBMPI, obtains the best results reported in the literature for Tetris in both small 10 × 10 and large 10 × 20 boards. Although the CBMPI’s results are similar to those of the CE method in the large board, CBMPI uses considerably fewer (almost 1/6) samples (calls to the generative model) than CE. 1

5 0.75267982 348 nips-2013-Variational Policy Search via Trajectory Optimization

Author: Sergey Levine, Vladlen Koltun

Abstract: In order to learn effective control policies for dynamical systems, policy search methods must be able to discover successful executions of the desired task. While random exploration can work well in simple domains, complex and highdimensional tasks present a serious challenge, particularly when combined with high-dimensional policies that make parameter-space exploration infeasible. We present a method that uses trajectory optimization as a powerful exploration strategy that guides the policy search. A variational decomposition of a maximum likelihood policy objective allows us to use standard trajectory optimization algorithms such as differential dynamic programming, interleaved with standard supervised learning for the policy itself. We demonstrate that the resulting algorithm can outperform prior methods on two challenging locomotion tasks. 1

6 0.74588799 165 nips-2013-Learning from Limited Demonstrations

7 0.72840798 241 nips-2013-Optimizing Instructional Policies

8 0.72572476 322 nips-2013-Symbolic Opportunistic Policy Iteration for Factored-Action MDPs

9 0.63457042 79 nips-2013-DESPOT: Online POMDP Planning with Regularization

10 0.62143964 24 nips-2013-Actor-Critic Algorithms for Risk-Sensitive MDPs

11 0.61282218 140 nips-2013-Improved and Generalized Upper Bounds on the Complexity of Policy Iteration

12 0.57507384 23 nips-2013-Active Learning for Probabilistic Hypotheses Using the Maximum Gibbs Error Criterion

13 0.56909013 29 nips-2013-Adaptive Submodular Maximization in Bandit Setting

14 0.55621129 347 nips-2013-Variational Planning for Graph-based MDPs

15 0.54705775 150 nips-2013-Learning Adaptive Value of Information for Structured Prediction

16 0.5099445 55 nips-2013-Bellman Error Based Feature Generation using Random Projections on Sparse Spaces

17 0.50264847 273 nips-2013-Reinforcement Learning in Robust Markov Decision Processes

18 0.48294389 235 nips-2013-Online learning in episodic Markovian decision processes by relative entropy policy search

19 0.46556634 270 nips-2013-Regret based Robust Solutions for Uncertain Markov Decision Processes

20 0.43891782 124 nips-2013-Forgetful Bayes and myopic planning: Human learning and decision-making in a bandit setting


similar papers computed by lda model

lda for this paper:

topicId topicWeight

[(16, 0.023), (33, 0.098), (34, 0.085), (41, 0.487), (49, 0.021), (56, 0.089), (70, 0.031), (85, 0.023), (89, 0.018), (93, 0.027), (95, 0.016)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.87640512 257 nips-2013-Projected Natural Actor-Critic

Author: Philip S. Thomas, William C. Dabney, Stephen Giguere, Sridhar Mahadevan

Abstract: Natural actor-critics form a popular class of policy search algorithms for finding locally optimal policies for Markov decision processes. In this paper we address a drawback of natural actor-critics that limits their real-world applicability—their lack of safety guarantees. We present a principled algorithm for performing natural gradient descent over a constrained domain. In the context of reinforcement learning, this allows for natural actor-critic algorithms that are guaranteed to remain within a known safe region of policy space. While deriving our class of constrained natural actor-critic algorithms, which we call Projected Natural Actor-Critics (PNACs), we also elucidate the relationship between natural gradient descent and mirror descent.

2 0.81790555 189 nips-2013-Message Passing Inference with Chemical Reaction Networks

Author: Nils E. Napp, Ryan P. Adams

Abstract: Recent work on molecular programming has explored new possibilities for computational abstractions with biomolecules, including logic gates, neural networks, and linear systems. In the future such abstractions might enable nanoscale devices that can sense and control the world at a molecular scale. Just as in macroscale robotics, it is critical that such devices can learn about their environment and reason under uncertainty. At this small scale, systems are typically modeled as chemical reaction networks. In this work, we develop a procedure that can take arbitrary probabilistic graphical models, represented as factor graphs over discrete random variables, and compile them into chemical reaction networks that implement inference. In particular, we show that marginalization based on sum-product message passing can be implemented in terms of reactions between chemical species whose concentrations represent probabilities. We show algebraically that the steady state concentration of these species correspond to the marginal distributions of the random variables in the graph and validate the results in simulations. As with standard sum-product inference, this procedure yields exact results for tree-structured graphs, and approximate solutions for loopy graphs.

3 0.77818704 193 nips-2013-Mixed Optimization for Smooth Functions

Author: Mehrdad Mahdavi, Lijun Zhang, Rong Jin

Abstract: It is well known that the optimal convergence rate for stochastic optimization of √ smooth functions is O(1/ T ), which is same as stochastic optimization of Lipschitz continuous convex functions. This is in contrast to optimizing smooth functions using full gradients, which yields a convergence rate of O(1/T 2 ). In this work, we consider a new setup for optimizing smooth functions, termed as Mixed Optimization, which allows to access both a stochastic oracle and a full gradient oracle. Our goal is to significantly improve the convergence rate of stochastic optimization of smooth functions by having an additional small number of accesses to the full gradient oracle. We show that, with an O(ln T ) calls to the full gradient oracle and an O(T ) calls to the stochastic oracle, the proposed mixed optimization algorithm is able to achieve an optimization error of O(1/T ). 1

4 0.74144673 258 nips-2013-Projecting Ising Model Parameters for Fast Mixing

Author: Justin Domke, Xianghang Liu

Abstract: Inference in general Ising models is difficult, due to high treewidth making treebased algorithms intractable. Moreover, when interactions are strong, Gibbs sampling may take exponential time to converge to the stationary distribution. We present an algorithm to project Ising model parameters onto a parameter set that is guaranteed to be fast mixing, under several divergences. We find that Gibbs sampling using the projected parameters is more accurate than with the original parameters when interaction strengths are strong and when limited time is available for sampling.

5 0.70176953 62 nips-2013-Causal Inference on Time Series using Restricted Structural Equation Models

Author: Jonas Peters, Dominik Janzing, Bernhard Schölkopf

Abstract: Causal inference uses observational data to infer the causal structure of the data generating system. We study a class of restricted Structural Equation Models for time series that we call Time Series Models with Independent Noise (TiMINo). These models require independent residual time series, whereas traditional methods like Granger causality exploit the variance of residuals. This work contains two main contributions: (1) Theoretical: By restricting the model class (e.g. to additive noise) we provide general identifiability results. They cover lagged and instantaneous effects that can be nonlinear and unfaithful, and non-instantaneous feedbacks between the time series. (2) Practical: If there are no feedback loops between time series, we propose an algorithm based on non-linear independence tests of time series. We show empirically that when the data are causally insufficient or the model is misspecified, the method avoids incorrect answers. We extend the theoretical and the algorithmic part to situations in which the time series have been measured with different time delays. TiMINo is applied to artificial and real data and code is provided. 1

6 0.66739875 175 nips-2013-Linear Convergence with Condition Number Independent Access of Full Gradients

7 0.65963084 180 nips-2013-Low-rank matrix reconstruction and clustering via approximate message passing

8 0.61616898 77 nips-2013-Correlations strike back (again): the case of associative memory retrieval

9 0.59713858 11 nips-2013-A New Convex Relaxation for Tensor Completion

10 0.49601206 20 nips-2013-Accelerating Stochastic Gradient Descent using Predictive Variance Reduction

11 0.48813972 311 nips-2013-Stochastic Convex Optimization with Multiple Objectives

12 0.47889191 239 nips-2013-Optimistic policy iteration and natural actor-critic: A unifying view and a non-optimality result

13 0.46983105 184 nips-2013-Marginals-to-Models Reducibility

14 0.46548384 54 nips-2013-Bayesian optimization explains human active search

15 0.44612461 113 nips-2013-Exact and Stable Recovery of Pairwise Interaction Tensors

16 0.43818393 250 nips-2013-Policy Shaping: Integrating Human Feedback with Reinforcement Learning

17 0.43799189 318 nips-2013-Structured Learning via Logistic Regression

18 0.43701005 301 nips-2013-Sparse Additive Text Models with Low Rank Background

19 0.43507433 236 nips-2013-Optimal Neural Population Codes for High-dimensional Stimulus Variables

20 0.43467176 282 nips-2013-Robust Multimodal Graph Matching: Sparse Coding Meets Graph Matching