nips nips2013 nips2013-348 knowledge-graph by maker-knowledge-mining

348 nips-2013-Variational Policy Search via Trajectory Optimization


Source: pdf

Author: Sergey Levine, Vladlen Koltun

Abstract: In order to learn effective control policies for dynamical systems, policy search methods must be able to discover successful executions of the desired task. While random exploration can work well in simple domains, complex and high-dimensional tasks present a serious challenge, particularly when combined with high-dimensional policies that make parameter-space exploration infeasible. We present a method that uses trajectory optimization as a powerful exploration strategy that guides the policy search. A variational decomposition of a maximum likelihood policy objective allows us to use standard trajectory optimization algorithms such as differential dynamic programming, interleaved with standard supervised learning for the policy itself. We demonstrate that the resulting algorithm can outperform prior methods on two challenging locomotion tasks.

Reference: text


Summary: the most important sentences generated by the tfidf model

sentIndex sentText sentNum sentScore

1 In order to learn effective control policies for dynamical systems, policy search methods must be able to discover successful executions of the desired task. [sent-5, score-0.747]

2 We present a method that uses trajectory optimization as a powerful exploration strategy that guides the policy search. [sent-7, score-0.913]

3 A variational decomposition of a maximum likelihood policy objective allows us to use standard trajectory optimization algorithms such as differential dynamic programming, interleaved with standard supervised learning for the policy itself. [sent-8, score-1.581]

4 1 Introduction Direct policy search methods have the potential to scale gracefully to complex, high-dimensional control tasks [12]. [sent-10, score-0.674]

5 We propose to decouple policy optimization from exploration by using a variational decomposition of a maximum likelihood policy objective. [sent-13, score-1.389]

6 Since direct model-based trajectory optimization is usually much easier than policy search, this method can discover low cost regions much more easily. [sent-15, score-0.884]

7 Intuitively, the trajectory optimization “guides” the policy search toward regions of low cost. [sent-16, score-0.863]

8 The trajectory optimization can be performed by a variant of the differential dynamic programming algorithm [4], and the policy is optimized with respect to a standard maximum likelihood objective. [sent-17, score-0.952]

9 We show that this alternating optimization maximizes a well-defined policy objective, and demonstrate experimentally that it can learn complex tasks in high-dimensional domains that are infeasible for methods that rely on random exploration. [sent-18, score-0.64]

10 2 Preliminaries In standard policy search, we seek to find a distribution over actions ut in each state xt, denoted πθ(ut|xt), so as to minimize the sum of expected costs E[c(ζ)] = E[∑_{t=1}^T c(xt, ut)], where ζ is a sequence of states and actions. [sent-20, score-1.692]

11 The expectation is taken with respect to the system dynamics p(xt+1 |xt , ut ) and the policy πθ (ut |xt ), which is typically parameterized by a vector θ. [sent-21, score-1.046]

12 We follow prior work and define the probability of Ot as p(Ot = 1|xt, ut) ∝ exp(−c(xt, ut)) [19]. [sent-23, score-0.778]

13 Using the dynamics distribution p(xt+1 |xt , ut ) and the policy πθ (ut |xt ), we can define a dynamic Bayesian network that relates states, actions, and the optimality indicator. [sent-24, score-1.078]

14 By setting Ot = 1 at all time steps and learning the maximum likelihood values for θ, we can perform policy optimization [20]. [sent-25, score-0.641]

15 The corresponding optimization problem has the objective p(O|θ) = ∫ p(O|ζ) p(ζ|θ) dζ ∝ ∫ exp(−∑_{t=1}^T c(xt, ut)) p(x1) ∏_{t=1}^T πθ(ut|xt) p(xt+1|xt, ut) dζ. (1) [sent-26, score-0.461]

16 Although this objective differs from the classical minimum average cost objective, previous work showed that it is nonetheless useful for policy optimization and planning [20, 19]. [sent-27, score-1.092]
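
As a rough illustration of how this maximum likelihood objective can be evaluated, the sketch below computes the unnormalized log of Equation 1 along a single sampled trajectory. The callables cost_fn, policy_logpdf, and dynamics_logpdf are hypothetical stand-ins for c(xt, ut), log πθ(ut|xt), and log p(xt+1|xt, ut); they are not part of the paper.

```python
def log_objective_along_trajectory(traj, cost_fn, policy_logpdf, dynamics_logpdf):
    """Unnormalized log of the objective in Eq. (1) for one trajectory.

    traj is a list of (x_t, u_t, x_next) tuples.  cost_fn, policy_logpdf and
    dynamics_logpdf are placeholders for the task cost c(x, u), the policy
    log-density log pi_theta(u | x), and the dynamics log-density
    log p(x' | x, u).
    """
    total = 0.0
    for x_t, u_t, x_next in traj:
        total += -cost_fn(x_t, u_t)                 # exp(-c) term of Eq. (1)
        total += policy_logpdf(u_t, x_t)            # log pi_theta(u_t | x_t)
        total += dynamics_logpdf(x_next, x_t, u_t)  # log p(x_{t+1} | x_t, u_t)
    return total
```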

17 This is the standard formulation for expectation maximization [9], and has been applied to policy optimization in previous work [8, 21, 3, 11]. [sent-31, score-0.636]

18 However, prior policy optimization methods typically represent q(ζ) by sampling trajectories from the current policy πθ (ut |xt ) and reweighting them, for example by the exponential of their cost. [sent-32, score-1.257]

19 We propose instead to directly optimize q(ζ) to minimize both its expected cost and its divergence from the current policy πθ (ut |xt ) when a model of the dynamics is available. [sent-34, score-0.812]

20 We build off of a variant of DDP called iterative LQR, which linearizes the dynamics around the current trajectory, computes the optimal linear policy under linear-quadratic assumptions, executes this policy, and repeats the process around the new trajectory until convergence [17]. [sent-37, score-0.935]

21 Because of the linear-quadratic assumptions, the value function is always quadratic, and the dynamics are Gaussian with mean f(xt, ut) and additive noise. [sent-41, score-0.472]

22 The deterministic optimal policy is then given by g(xt) = ūt + kt + Kt(xt − x̄t). [sent-47, score-1.399]

23 By repeatedly computing the optimal policy around the current trajectory and updating x̄t and ūt based on the new policy, iterative LQR converges to a locally optimal solution [17]. [sent-48, score-1.518]
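
The backward pass that produces these gains can be sketched as follows. The inputs are per-step Jacobians of the linearized dynamics and first and second derivatives of the cost along the nominal trajectory; the array names and the omission of regularization and line search are illustrative assumptions, not details from the paper.

```python
import numpy as np

def ilqr_backward_pass(fx, fu, cx, cu, cxx, cuu, cux):
    """LQR backward pass around a nominal trajectory (a sketch).

    fx[t], fu[t] are dynamics Jacobians at (x_bar_t, u_bar_t); cx, cu, cxx,
    cuu, cux are cost derivatives.  Returns gains so that the locally optimal
    deterministic policy is g(x_t) = u_bar_t + k_t + K_t (x_t - x_bar_t).
    Terminal cost handling and Quu regularization are omitted for brevity.
    """
    T, n = len(fx), fx[0].shape[0]
    Vx, Vxx = np.zeros(n), np.zeros((n, n))
    ks, Ks = [None] * T, [None] * T
    for t in reversed(range(T)):
        Qx = cx[t] + fx[t].T @ Vx
        Qu = cu[t] + fu[t].T @ Vx
        Qxx = cxx[t] + fx[t].T @ Vxx @ fx[t]
        Quu = cuu[t] + fu[t].T @ Vxx @ fu[t]
        Qux = cux[t] + fu[t].T @ Vxx @ fx[t]
        Quu_inv = np.linalg.inv(Quu)
        ks[t] = -Quu_inv @ Qu
        Ks[t] = -Quu_inv @ Qux
        # Quadratic value function passed to the preceding time step.
        Vx = Qx + Ks[t].T @ Quu @ ks[t] + Ks[t].T @ Qu + Qux.T @ ks[t]
        Vxx = Qxx + Ks[t].T @ Quu @ Ks[t] + Ks[t].T @ Qux + Qux.T @ Ks[t]
    return ks, Ks
```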

24 In order to use this algorithm to minimize the KL divergence in Equation 2, we introduce a modified cost function c̄(xt, ut) = c(xt, ut) − log πθ(ut|xt). [sent-49, score-0.901]

25 The optimal trajectory for this cost function approximately¹ minimizes the KL divergence when q(ζ) is a Dirac delta function, since DKL(q(ζ) ‖ p(ζ|O, θ)) = ∫ q(ζ) ∑_{t=1}^T [c(xt, ut) − log πθ(ut|xt) − log p(xt+1|xt, ut)] dζ + const. [sent-50, score-1.117]
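
A minimal sketch of this modified cost, with cost_fn and policy_logpdf again as hypothetical stand-ins for c(xt, ut) and log πθ(ut|xt):

```python
def modified_cost(x_t, u_t, cost_fn, policy_logpdf):
    """c_bar(x, u) = c(x, u) - log pi_theta(u | x), the cost handed to DDP."""
    return cost_fn(x_t, u_t) - policy_logpdf(u_t, x_t)
```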

26 The optimal policy πG under this framework minimizes an augmented cost function, given by c̃(xt, ut) = c̄(xt, ut) − H(πG), where H(πG) is the entropy of a stochastic policy πG(ut|xt), and c̄(xt, ut) includes the log πθ(ut|xt) term as above. [sent-52, score-2.396]

27 Ziebart [23] showed that the optimal policy can be written as πG(ut|xt) = exp(−Qt(xt, ut) + Vt(xt)), where V is a “softened” value function given by Vt(xt) = log ∫ exp(Qt(xt, ut)) dut. [sent-53, score-1.398]
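
Under the linear-quadratic assumptions of the previous section, this softened log-integral over actions is a Gaussian integral, and the resulting maximum entropy policy is linear-Gaussian. The sketch below covers only that special case; the names Quu, Qu, and u_bar refer to a local quadratic expansion of Q and are assumptions for illustration.

```python
import numpy as np

def maxent_linear_gaussian_policy(Quu, Qu, u_bar):
    """Gaussian policy implied by a quadratic Q under the LQ assumptions.

    The mean coincides with the deterministic LQR action, and the entropy
    term turns the policy into a Gaussian with covariance inv(Quu).  This is
    a sketch of the special case, not the general definition of V_t above.
    """
    Quu_inv = np.linalg.inv(Quu)
    mean = u_bar - Quu_inv @ Qu   # same open-loop action as the deterministic policy
    cov = Quu_inv                 # entropy regularization adds Gaussian exploration noise
    return mean, cov
```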

28 ¹ The minimization is not exact if the dynamics p(xt+1|xt, ut) are not deterministic, but the result is very close if the dynamics have much lower entropy than the policy and exponentiated cost, which is often the case. [sent-59, score-1.129]

29 Algorithm 1 (Variational Guided Policy Search): 1: Initialize q(ζ) using DDP with cost c̄(xt, ut) = α0 c(xt, ut); 2: for iteration k = 1 to K do; 3: Compute marginals (µt, Σt), ... [sent-60, score-0.903]
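
A high-level skeleton of this alternating scheme might look as follows; ddp_optimize, compute_marginals, and fit_policy are hypothetical callables standing in for the DDP step on the modified cost, the Gaussian marginal computation, and the supervised policy fit described below.

```python
def variational_gps(init_traj, ddp_optimize, compute_marginals, fit_policy, K=20):
    """Skeleton of the alternating optimization in Algorithm 1 (a sketch)."""
    q = ddp_optimize(init_traj, policy=None)      # initialize q(zeta) with cost alpha_0 * c
    policy = None
    for k in range(K):
        marginals = compute_marginals(q)          # Gaussian (mu_t, Sigma_t) at each step
        policy = fit_policy(marginals, policy)    # maximize the bound L(q, theta) w.r.t. theta
        q = ddp_optimize(q, policy=policy)        # re-optimize q against c - log pi_theta
    return policy, q
```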

30 In practice, the approximation is quite good when the dynamics and the cost c(xt , ut ) are smooth. [sent-64, score-0.529]

31 Unfortunately, the policy term log πθ(ut|xt) in the modified cost c̄(xt, ut) can be quite jagged early on in the optimization, particularly for nonlinear policies. [sent-65, score-1.074]

32 To mitigate this issue, we compute the derivatives of the policy not only along the current trajectory, but also at samples drawn from the current marginals q(xt ), and average them together. [sent-66, score-0.742]

33 5 Variational Guided Policy Search The variational guided policy search (variational GPS) algorithm alternates between minimizing the KL divergence in Equation 2 with respect to q(ζ) as described in the previous section, and maximizing the bound L(q, θ) with respect to the policy parameters θ. [sent-69, score-1.386]

34 The gradient is given by ∇L(q, θ) = ∫ q(ζ) ∑_{t=1}^T ∇ log πθ(ut|xt) dζ ≈ (1/M) ∑_{i=1}^M ∑_{t=1}^T ∇ log πθ(u^i_t|x^i_t), (3) where the samples (x^i_t, u^i_t) are drawn from the marginals q(xt, ut). [sent-73, score-0.517]

35 When using SGD, new samples can be drawn at every iteration, since sampling from q(xt, ut) only requires the precomputed marginals from the preceding section. [sent-74, score-0.434]
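
A sketch of one such stochastic gradient step is given below. The marginals and policy objects, their attributes, and the step size are illustrative assumptions about the interface, not the paper's implementation.

```python
import numpy as np

def policy_sgd_step(marginals, policy, step_size=1e-3, samples_per_step=20):
    """One SGD step on L(q, theta) following Eq. (3) (a sketch).

    Each entry of marginals is assumed to be a Gaussian over (x_t, u_t) with
    fields mean, cov and state_dim; policy is assumed to expose params and
    grad_logpdf(u, x) = d/dtheta log pi_theta(u | x).
    """
    grad = np.zeros_like(policy.params)
    for g in marginals:
        xu = np.random.multivariate_normal(g.mean, g.cov, size=samples_per_step)
        for row in xu:
            x_t, u_t = row[:g.state_dim], row[g.state_dim:]
            grad += policy.grad_logpdf(u_t, x_t)
    grad /= samples_per_step                      # 1/M average over samples, summed over t
    policy.params += step_size * grad             # ascend the likelihood bound
    return policy
```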

36 When choosing the policy class, it is common to use deterministic policies with additive Gaussian noise. [sent-79, score-0.652]

37 In this case, we can optimize the policy more quickly and with many fewer samples by only sampling states and evaluating the integral over actions analytically. [sent-80, score-0.68]
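
For a Gaussian policy and a Gaussian conditional action marginal, the expectation of the policy log-density over actions is a standard Gaussian cross-entropy, so only states need to be sampled. A sketch of that closed form:

```python
import numpy as np

def expected_gaussian_loglik(mu_policy, Sigma_policy, mu_q, Sigma_q):
    """E_q[log pi_theta(u | x)] with pi_theta(u|x) = N(mu_policy, Sigma_policy)
    and q(u|x) = N(mu_q, Sigma_q); a sketch of the analytic integral over actions."""
    diff = mu_q - mu_policy
    Sigma_inv = np.linalg.inv(Sigma_policy)
    _, logdet = np.linalg.slogdet(2.0 * np.pi * Sigma_policy)
    return -0.5 * (diff @ Sigma_inv @ diff + np.trace(Sigma_inv @ Sigma_q) + logdet)
```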

38 Two additional details should be taken into account in order to obtain the best results. [sent-82, score-1.45]

39 First, although model-based trajectory optimization is more powerful than random exploration, complex tasks such as bipedal locomotion, which we address in the following section, are too difficult to solve entirely with trajectory optimization. [sent-83, score-0.475]

40 Since our method produces both a parameterized policy πθ (ut |xt ) and a DDP solution πG (ut |xt ), one might wonder why the DDP policy itself is not a suitable controller. [sent-93, score-1.168]

41 This has three major advantages: first, only the learned policy may be usable at runtime if the information available at runtime differs from the information during training, for example if the policy is trained in simulation and executed on a physical system with limited sensors. [sent-96, score-1.171]

42 Second, if the policy class is chosen carefully, we might hope that the learned policy would generalize better than the DDP solution, as shown in previous work [10]. [sent-97, score-1.171]

43 Third, multiple trajectories can be used to train a single policy from different initial states, creating a single controller that can succeed in a variety of situations. [sent-98, score-0.655]

44 For both tasks, the policy sets joint torques on a simulated robot consisting of rigid links. [sent-100, score-0.593]

45 Due to the high dimensionality and nonlinear dynamics, these tasks represent a significant challenge for direct policy learning. [sent-103, score-0.628]

46 The cost function for the walker was given by c(x, u) = wu ‖u‖^2 + (vx − vx*)^2 + (py − py*)^2, where vx and vx* are the current and desired horizontal velocities, py and py* are the current and desired heights of the hips, and the torque penalty was set to wu = 10^−4. [sent-104, score-0.446]
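
A direct transcription of this cost might look as follows; how vx and py are read off the simulator state is task-specific and omitted here.

```python
import numpy as np

def walker_cost(u, vx, py, vx_des, py_des, wu=1e-4):
    """c(x, u) = wu * ||u||^2 + (vx - vx*)^2 + (py - py*)^2 for the walker."""
    return wu * float(np.dot(u, u)) + (vx - vx_des) ** 2 + (py - py_des) ** 2
```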

47 Following previous work [10], the trajectory for the walker was initialized with a demonstration from a hand-crafted locomotion system [22]. [sent-107, score-0.348]

48 The policy was represented by a neural network with one hidden layer and a soft rectifying nonlinearity of the form a = log(1 + exp(z)), with Gaussian noise at the output. [sent-108, score-0.602]

49 Both the weights of the neural network and the diagonal covariance of the output noise were learned as part of the policy optimization. [sent-109, score-0.597]
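
A minimal sketch of such a policy class is shown below. Layer sizes, initialization scale, and the log-standard-deviation parameterization of the diagonal output noise are illustrative assumptions.

```python
import numpy as np

class SoftplusGaussianPolicy:
    """One-hidden-layer policy with soft rectifier units and learned output noise."""

    def __init__(self, state_dim, action_dim, hidden_units=5, seed=0):
        rng = np.random.default_rng(seed)
        self.W1 = 0.1 * rng.standard_normal((hidden_units, state_dim))
        self.b1 = np.zeros(hidden_units)
        self.W2 = 0.1 * rng.standard_normal((action_dim, hidden_units))
        self.b2 = np.zeros(action_dim)
        self.log_std = np.zeros(action_dim)  # learned diagonal Gaussian noise (log std)

    def mean(self, x):
        a = np.log1p(np.exp(self.W1 @ x + self.b1))  # soft rectifier a = log(1 + exp(z))
        return self.W2 @ a + self.b2

    def sample(self, x, rng=np.random):
        noise = rng.standard_normal(self.log_std.shape)
        return self.mean(x) + np.exp(self.log_std) * noise
```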

50 The number of policy parameters ranged from 63 for the 5-unit swimmer to 246 for the 10-unit walker. [sent-110, score-0.674]

51 Due to its complexity and nonlinearity, this policy class presents a challenge to traditional policy search algorithms, which often focus on compact, linear policies [8]. [sent-111, score-1.261]

52 Methods that sample from the current policy use 10 samples per iteration, unless noted otherwise. [sent-113, score-0.644]

53 The cost was evaluated for both the actual stochastic policy (solid line), and a deterministic policy obtained by setting the variance of the Gaussian noise to zero (dashed line). [sent-115, score-1.225]

54 The average cost of the stochastic policy is shown with a solid line, and the average cost of the deterministic policy without Gaussian noise is shown with a dashed line. [sent-121, score-1.282]

55 The bottom-right panel shows plots of the swimmer and walker, with the center of mass trajectory under the learned policy shown in blue, and the initial DDP solution shown in black. [sent-122, score-0.94]

56 The first method we compare to is guided policy search (GPS), which uses importance sampling to introduce samples from the DDP solution into a likelihood ratio policy search [10]. [sent-123, score-1.411]

57 Like our method, GPS uses DDP to explore regions of low cost, but the policy optimization is done using importance sampling, which can be susceptible to degenerate weights in high dimensions. [sent-125, score-0.636]

58 Since standard GPS only samples from the initial DDP solution, these samples are only useful if they can be reproduced by the policy class. [sent-126, score-0.675]

59 On the easier swimmer task, the GPS policy can reproduce the initial trajectory and succeeds immediately. [sent-128, score-0.897]

60 However, GPS is unable to find a successful walking policy with only 5 hidden units, which requires modifications to the initial trajectory. [sent-129, score-0.665]

61 In addition, although the deterministic GPS policy performs well on the walker with 10 hidden units, the stochastic policy fails more often. [sent-130, score-1.286]

62 However, samples from this adapted DDP solution are then included in the policy optimization with importance sampling, while our approach optimizes the variational bound L(q, θ). [sent-133, score-0.798]

63 Because of this, although the adaptive variant of GPS improves on the standard variant, it is still unable to learn a walking policy with 5 hidden units, while our method quickly discovers an effective policy. [sent-136, score-0.667]

64 DAGGER aims to learn a policy that imitates an oracle [14], which in our case is the DDP solution. [sent-138, score-0.596]

65 At each iteration, DAGGER adds samples from the current policy to a dataset, and then optimizes the policy to take the oracle action at each dataset state. [sent-139, score-1.24]

66 This variant succeeded on the swimming tasks and eventually found a good deterministic policy for the walker with 10 hidden units, though the learned stochastic policy performed very poorly. [sent-142, score-1.396]

67 We also implemented an adapted variant, where the DDP solution is reoptimized at each iteration to match the policy (in addition to weighting), but this variant performed worse. [sent-143, score-0.669]

68 Finally, we compare to an alternative variational policy search algorithm analogous to PoWER [8]. [sent-146, score-0.717]

69 Although PoWER requires a linear policy parameterization and a specific exploration strategy, we can construct an analogous non-linear algorithm by replacing the analytic M-step with nonlinear optimization, as in our method. [sent-147, score-0.71]

70 This algorithm is identical to ours, except that instead of using DDP to optimize q(ζ), the variational distribution is formed by taking samples from the current policy and reweighting them by the exponential of their cost. [sent-148, score-0.753]

71 The policy is still initialized with supervised training to resemble the initial DDP solution, but otherwise this method does not benefit from trajectory optimization and relies entirely on random exploration. [sent-150, score-0.839]

72 However, our work is the first to our knowledge to show how trajectory optimization can be used to guide policy learning within the control-as-inference framework. [sent-159, score-0.808]

73 Our variational decomposition follows prior work on policy search with variational inference [3, 11] and expectation maximization [8, 21]. [sent-160, score-0.825]

74 This can be an advantage in applications where running an unstable, incompletely optimized policy can be costly or dangerous. [sent-165, score-0.604]

75 Our use of DDP to guide the policy search parallels our previous Guided Policy Search (GPS) algorithm [10]. [sent-166, score-0.629]

76 These samples are therefore only useful when the policy class can reproduce them effectively. [sent-168, score-0.609]

77 As shown in the evaluation of the walker with 5 hidden units, GPS may be unable to discover a good policy when the policy class cannot reproduce the initial DDP solution. [sent-169, score-1.316]

78 Adaptive GPS addresses this issue by reoptimizing the trajectory to resemble the current policy, but the policy is still optimized with respect to an importance-sampled return estimate, which leaves it highly prone to local optima, and the theoretical justification for adaptation is unclear. [sent-170, score-0.831]

79 The proposed method justifies the reoptimization of the trajectory under a variational framework, and uses standard maximum likelihood in place of the complex importance-sampled objective. [sent-171, score-0.305]

80 We also compared our method to DAGGER [14], which uses a general-purpose supervised training algorithm to train the current policy to match an oracle, which in our case is the DDP solution. [sent-172, score-0.609]

81 DAGGER matches actions from the oracle policy at states visited by the current policy, under the assumption that the oracle can provide good actions in all states. [sent-173, score-0.728]

82 Unlike DAGGER, our approach is relatively insensitive to the instability of the learned policy, since the learned policy is not sampled. [sent-176, score-0.62]

83 Several prior methods also propose to improve policy search by using a distribution over high-value states, which might come from a DDP solution [6, 1]. [sent-177, score-0.649]

84 Such methods generally use this “restart” distribution as a new initial state distribution, and show that optimizing a policy from such a restart distribution also optimizes the expected return. [sent-178, score-0.63]

85 8 Discussion and Future Work We presented a policy search algorithm that employs a variational decomposition of a maximum likelihood objective to combine trajectory optimization with policy search. [sent-180, score-1.58]

86 The variational distribution is obtained using differential dynamic programming (DDP), and the policy can be optimized with a standard nonlinear optimization algorithm. [sent-181, score-0.82]

87 Model-based trajectory optimization effectively takes the place of random exploration, providing a much more effective means for finding low cost regions that the policy is then trained to visit. [sent-182, score-0.865]

88 Our evaluation shows that this algorithm outperforms prior variational methods and prior methods that use trajectory optimization to guide policy search. [sent-183, score-0.896]

89 First, the policy search does not need to sample the learned policy. [sent-185, score-0.652]

90 By sampling directly from the Gaussian marginals of the DDP-induced distribution over trajectories, our approach also avoids some of the issues associated with unstable dynamics, requiring only that the task permit effective trajectory optimization. [sent-188, score-0.286]

91 Obtaining good best-case performance is often the hardest part of policy search, since a policy that achieves good results occasionally is easier to improve with standard on-policy search methods than one that fails outright. [sent-190, score-1.203]

92 While we mitigate this by averaging the policy derivatives over multiple samples from the DDP marginals, this approach could still break down in the presence of highly nonsmooth dynamics or policies. [sent-193, score-0.71]

93 An interesting avenue for future work is to extend the trajectory optimization method to nonsmooth domains by using samples rather than linearization, perhaps analogously to the unscented Kalman filter [5, 18]. [sent-194, score-0.269]

94 This could also avoid the need to differentiate the policy with respect to the inputs, allowing for richer policy classes to be used. [sent-195, score-1.148]

95 Another interesting avenue for future work is to apply model-free trajectory optimization techniques [7], which would avoid the need for a model of the system dynamics, or to learn the dynamics from data, for example by using Gaussian processes [2]. [sent-196, score-0.317]

96 It would also be straightforward to use multiple trajectories optimized from different initial states to learn a single policy that is able to succeed under a variety of initial conditions. [sent-197, score-0.741]

97 Overall, we believe that trajectory optimization is a very useful tool for policy search. [sent-198, score-0.808]

98 By separating the policy optimization and exploration problems into two separate phases, we can employ simpler algorithms such as SGD and DDP that are better suited for each phase, and can achieve superior performance on complex tasks. [sent-199, score-0.702]

99 We believe that additional research into augmenting policy learning with trajectory optimization can further advance the performance of policy search techniques. [sent-200, score-1.437]

100 PILCO: a model-based and data-efficient approach to policy search. [sent-213, score-0.574]


similar papers computed by tfidf model

tfidf for this paper:

wordName wordTfidf (topN-words)

[('policy', 0.574), ('ddp', 0.425), ('ut', 0.389), ('xt', 0.29), ('trajectory', 0.192), ('gps', 0.141), ('dagger', 0.132), ('fut', 0.112), ('swimmer', 0.1), ('fxt', 0.099), ('walker', 0.09), ('variational', 0.088), ('exploration', 0.086), ('dynamics', 0.083), ('uut', 0.066), ('locomotion', 0.066), ('kt', 0.06), ('policies', 0.058), ('cost', 0.057), ('search', 0.055), ('guided', 0.053), ('vxxt', 0.05), ('marginals', 0.045), ('kl', 0.045), ('lqr', 0.044), ('divergence', 0.042), ('units', 0.042), ('optimization', 0.042), ('vx', 0.041), ('ot', 0.041), ('vxt', 0.037), ('samples', 0.035), ('current', 0.035), ('todorov', 0.033), ('variant', 0.033), ('walking', 0.032), ('trajectories', 0.032), ('dynamic', 0.032), ('initial', 0.031), ('swimming', 0.03), ('toussaint', 0.03), ('levine', 0.03), ('nonlinear', 0.03), ('optimized', 0.03), ('objective', 0.03), ('unstable', 0.029), ('py', 0.028), ('hidden', 0.028), ('qt', 0.026), ('sgd', 0.025), ('likelihood', 0.025), ('actions', 0.025), ('restart', 0.025), ('bipedal', 0.025), ('cuut', 0.025), ('cuxt', 0.025), ('cxxt', 0.025), ('nonquadratic', 0.025), ('quxt', 0.025), ('qxt', 0.025), ('qxxt', 0.025), ('uxt', 0.025), ('vladlen', 0.025), ('states', 0.025), ('differential', 0.024), ('tasks', 0.024), ('log', 0.024), ('reinforcement', 0.024), ('magnitude', 0.024), ('learned', 0.023), ('iteration', 0.023), ('dkl', 0.023), ('linearized', 0.022), ('oracle', 0.022), ('sergey', 0.022), ('torque', 0.022), ('dut', 0.022), ('qut', 0.022), ('optimize', 0.021), ('control', 0.021), ('solution', 0.02), ('deterministic', 0.02), ('executions', 0.02), ('importance', 0.02), ('task', 0.02), ('maximization', 0.02), ('parameterization', 0.02), ('adapted', 0.019), ('discover', 0.019), ('guides', 0.019), ('cxt', 0.019), ('gaussian', 0.019), ('robot', 0.019), ('equation', 0.018), ('mitigate', 0.018), ('succeed', 0.018), ('lbfgs', 0.018), ('imitation', 0.018), ('iterative', 0.018), ('robotics', 0.017)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 1.0000002 348 nips-2013-Variational Policy Search via Trajectory Optimization

Author: Sergey Levine, Vladlen Koltun

Abstract: In order to learn effective control policies for dynamical systems, policy search methods must be able to discover successful executions of the desired task. While random exploration can work well in simple domains, complex and high-dimensional tasks present a serious challenge, particularly when combined with high-dimensional policies that make parameter-space exploration infeasible. We present a method that uses trajectory optimization as a powerful exploration strategy that guides the policy search. A variational decomposition of a maximum likelihood policy objective allows us to use standard trajectory optimization algorithms such as differential dynamic programming, interleaved with standard supervised learning for the policy itself. We demonstrate that the resulting algorithm can outperform prior methods on two challenging locomotion tasks.

2 0.46420044 239 nips-2013-Optimistic policy iteration and natural actor-critic: A unifying view and a non-optimality result

Author: Paul Wagner

Abstract: Approximate dynamic programming approaches to the reinforcement learning problem are often categorized into greedy value function methods and value-based policy gradient methods. As our first main result, we show that an important subset of the latter methodology is, in fact, a limiting special case of a general formulation of the former methodology; optimistic policy iteration encompasses not only most of the greedy value function methods but also natural actor-critic methods, and permits one to directly interpolate between them. The resulting continuum adjusts the strength of the Markov assumption in policy improvement and, as such, can be seen as dual in spirit to the continuum in TD(λ)-style algorithms in policy evaluation. As our second main result, we show for a substantial subset of softgreedy value function approaches that, while having the potential to avoid policy oscillation and policy chattering, this subset can never converge toward an optimal policy, except in a certain pathological case. Consequently, in the context of approximations (either in state estimation or in value function representation), the majority of greedy value function methods seem to be deemed to suffer either from the risk of oscillation/chattering or from the presence of systematic sub-optimality. 1

3 0.42253816 28 nips-2013-Adaptive Step-Size for Policy Gradient Methods

Author: Matteo Pirotta, Marcello Restelli, Luca Bascetta

Abstract: In the last decade, policy gradient methods have significantly grown in popularity in the reinforcement–learning field. In particular, they have been largely employed in motor control and robotic applications, thanks to their ability to cope with continuous state and action domains and partial observable problems. Policy gradient researches have been mainly focused on the identification of effective gradient directions and the proposal of efficient estimation algorithms. Nonetheless, the performance of policy gradient methods is determined not only by the gradient direction, since convergence properties are strongly influenced by the choice of the step size: small values imply slow convergence rate, while large values may lead to oscillations or even divergence of the policy parameters. Step–size value is usually chosen by hand tuning and still little attention has been paid to its automatic selection. In this paper, we propose to determine the learning rate by maximizing a lower bound to the expected performance gain. Focusing on Gaussian policies, we derive a lower bound that is second–order polynomial of the step size, and we show how a simplified version of such lower bound can be maximized when the gradient is estimated from trajectory samples. The properties of the proposed approach are empirically evaluated in a linear–quadratic regulator problem. 1

4 0.31804726 24 nips-2013-Actor-Critic Algorithms for Risk-Sensitive MDPs

Author: Prashanth L.A., Mohammad Ghavamzadeh

Abstract: In many sequential decision-making problems we may want to manage risk by minimizing some measure of variability in rewards in addition to maximizing a standard criterion. Variance-related risk measures are among the most common risk-sensitive criteria in finance and operations research. However, optimizing many such criteria is known to be a hard problem. In this paper, we consider both discounted and average reward Markov decision processes. For each formulation, we first define a measure of variability for a policy, which in turn gives us a set of risk-sensitive criteria to optimize. For each of these criteria, we derive a formula for computing its gradient. We then devise actor-critic algorithms for estimating the gradient and updating the policy parameters in the ascent direction. We establish the convergence of our algorithms to locally risk-sensitive optimal policies. Finally, we demonstrate the usefulness of our algorithms in a traffic signal control application. 1

5 0.28625843 241 nips-2013-Optimizing Instructional Policies

Author: Robert Lindsey, Michael Mozer, William J. Huggins, Harold Pashler

Abstract: Psychologists are interested in developing instructional policies that boost student learning. An instructional policy specifies the manner and content of instruction. For example, in the domain of concept learning, a policy might specify the nature of exemplars chosen over a training sequence. Traditional psychological studies compare several hand-selected policies, e.g., contrasting a policy that selects only difficult-to-classify exemplars with a policy that gradually progresses over the training sequence from easy exemplars to more difficult (known as fading). We propose an alternative to the traditional methodology in which we define a parameterized space of policies and search this space to identify the optimal policy. For example, in concept learning, policies might be described by a fading function that specifies exemplar difficulty over time. We propose an experimental technique for searching policy spaces using Gaussian process surrogate-based optimization and a generative model of student performance. Instead of evaluating a few experimental conditions each with many human subjects, as the traditional methodology does, our technique evaluates many experimental conditions each with a few subjects. Even though individual subjects provide only a noisy estimate of the population mean, the optimization method allows us to determine the shape of the policy space and to identify the global optimum, and is as efficient in its subject budget as a traditional A-B comparison. We evaluate the method via two behavioral studies, and suggest that the method has broad applicability to optimization problems involving humans outside the educational arena. 1

6 0.28525385 280 nips-2013-Robust Data-Driven Dynamic Programming

7 0.25272939 23 nips-2013-Active Learning for Probabilistic Hypotheses Using the Maximum Gibbs Error Criterion

8 0.24694222 257 nips-2013-Projected Natural Actor-Critic

9 0.23839213 79 nips-2013-DESPOT: Online POMDP Planning with Regularization

10 0.22654878 322 nips-2013-Symbolic Opportunistic Policy Iteration for Factored-Action MDPs

11 0.22338587 150 nips-2013-Learning Adaptive Value of Information for Structured Prediction

12 0.21067835 227 nips-2013-Online Learning in Markov Decision Processes with Adversarially Chosen Transition Probability Distributions

13 0.18623094 165 nips-2013-Learning from Limited Demonstrations

14 0.17437656 273 nips-2013-Reinforcement Learning in Robust Markov Decision Processes

15 0.17268811 235 nips-2013-Online learning in episodic Markovian decision processes by relative entropy policy search

16 0.16417363 48 nips-2013-Bayesian Inference and Learning in Gaussian Process State-Space Models with Particle MCMC

17 0.15875049 140 nips-2013-Improved and Generalized Upper Bounds on the Complexity of Policy Iteration

18 0.15376025 29 nips-2013-Adaptive Submodular Maximization in Bandit Setting

19 0.15108097 347 nips-2013-Variational Planning for Graph-based MDPs

20 0.15090032 38 nips-2013-Approximate Dynamic Programming Finally Performs Well in the Game of Tetris


similar papers computed by lsi model

lsi for this paper:

topicId topicWeight

[(0, 0.266), (1, -0.429), (2, -0.18), (3, 0.187), (4, -0.085), (5, 0.075), (6, -0.107), (7, 0.222), (8, 0.061), (9, -0.083), (10, -0.051), (11, -0.181), (12, -0.087), (13, 0.087), (14, -0.013), (15, -0.069), (16, -0.02), (17, -0.004), (18, -0.031), (19, 0.103), (20, 0.074), (21, -0.067), (22, 0.063), (23, 0.011), (24, -0.046), (25, -0.053), (26, 0.04), (27, 0.043), (28, 0.018), (29, -0.035), (30, 0.058), (31, -0.02), (32, 0.098), (33, -0.018), (34, -0.023), (35, 0.001), (36, 0.002), (37, -0.048), (38, 0.01), (39, 0.005), (40, 0.002), (41, 0.025), (42, -0.014), (43, 0.018), (44, 0.014), (45, 0.047), (46, 0.012), (47, -0.014), (48, 0.073), (49, 0.034)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.97945184 348 nips-2013-Variational Policy Search via Trajectory Optimization

Author: Sergey Levine, Vladlen Koltun

Abstract: In order to learn effective control policies for dynamical systems, policy search methods must be able to discover successful executions of the desired task. While random exploration can work well in simple domains, complex and high-dimensional tasks present a serious challenge, particularly when combined with high-dimensional policies that make parameter-space exploration infeasible. We present a method that uses trajectory optimization as a powerful exploration strategy that guides the policy search. A variational decomposition of a maximum likelihood policy objective allows us to use standard trajectory optimization algorithms such as differential dynamic programming, interleaved with standard supervised learning for the policy itself. We demonstrate that the resulting algorithm can outperform prior methods on two challenging locomotion tasks.

2 0.84068322 28 nips-2013-Adaptive Step-Size for Policy Gradient Methods

Author: Matteo Pirotta, Marcello Restelli, Luca Bascetta

Abstract: In the last decade, policy gradient methods have significantly grown in popularity in the reinforcement–learning field. In particular, they have been largely employed in motor control and robotic applications, thanks to their ability to cope with continuous state and action domains and partial observable problems. Policy gradient researches have been mainly focused on the identification of effective gradient directions and the proposal of efficient estimation algorithms. Nonetheless, the performance of policy gradient methods is determined not only by the gradient direction, since convergence properties are strongly influenced by the choice of the step size: small values imply slow convergence rate, while large values may lead to oscillations or even divergence of the policy parameters. Step–size value is usually chosen by hand tuning and still little attention has been paid to its automatic selection. In this paper, we propose to determine the learning rate by maximizing a lower bound to the expected performance gain. Focusing on Gaussian policies, we derive a lower bound that is second–order polynomial of the step size, and we show how a simplified version of such lower bound can be maximized when the gradient is estimated from trajectory samples. The properties of the proposed approach are empirically evaluated in a linear–quadratic regulator problem. 1

3 0.82419717 239 nips-2013-Optimistic policy iteration and natural actor-critic: A unifying view and a non-optimality result

Author: Paul Wagner

Abstract: Approximate dynamic programming approaches to the reinforcement learning problem are often categorized into greedy value function methods and value-based policy gradient methods. As our first main result, we show that an important subset of the latter methodology is, in fact, a limiting special case of a general formulation of the former methodology; optimistic policy iteration encompasses not only most of the greedy value function methods but also natural actor-critic methods, and permits one to directly interpolate between them. The resulting continuum adjusts the strength of the Markov assumption in policy improvement and, as such, can be seen as dual in spirit to the continuum in TD(λ)-style algorithms in policy evaluation. As our second main result, we show for a substantial subset of softgreedy value function approaches that, while having the potential to avoid policy oscillation and policy chattering, this subset can never converge toward an optimal policy, except in a certain pathological case. Consequently, in the context of approximations (either in state estimation or in value function representation), the majority of greedy value function methods seem to be deemed to suffer either from the risk of oscillation/chattering or from the presence of systematic sub-optimality. 1

4 0.80984151 24 nips-2013-Actor-Critic Algorithms for Risk-Sensitive MDPs

Author: Prashanth L.A., Mohammad Ghavamzadeh

Abstract: In many sequential decision-making problems we may want to manage risk by minimizing some measure of variability in rewards in addition to maximizing a standard criterion. Variance-related risk measures are among the most common risk-sensitive criteria in finance and operations research. However, optimizing many such criteria is known to be a hard problem. In this paper, we consider both discounted and average reward Markov decision processes. For each formulation, we first define a measure of variability for a policy, which in turn gives us a set of risk-sensitive criteria to optimize. For each of these criteria, we derive a formula for computing its gradient. We then devise actor-critic algorithms for estimating the gradient and updating the policy parameters in the ascent direction. We establish the convergence of our algorithms to locally risk-sensitive optimal policies. Finally, we demonstrate the usefulness of our algorithms in a traffic signal control application. 1

5 0.76109415 241 nips-2013-Optimizing Instructional Policies

Author: Robert Lindsey, Michael Mozer, William J. Huggins, Harold Pashler

Abstract: Psychologists are interested in developing instructional policies that boost student learning. An instructional policy specifies the manner and content of instruction. For example, in the domain of concept learning, a policy might specify the nature of exemplars chosen over a training sequence. Traditional psychological studies compare several hand-selected policies, e.g., contrasting a policy that selects only difficult-to-classify exemplars with a policy that gradually progresses over the training sequence from easy exemplars to more difficult (known as fading). We propose an alternative to the traditional methodology in which we define a parameterized space of policies and search this space to identify the optimal policy. For example, in concept learning, policies might be described by a fading function that specifies exemplar difficulty over time. We propose an experimental technique for searching policy spaces using Gaussian process surrogate-based optimization and a generative model of student performance. Instead of evaluating a few experimental conditions each with many human subjects, as the traditional methodology does, our technique evaluates many experimental conditions each with a few subjects. Even though individual subjects provide only a noisy estimate of the population mean, the optimization method allows us to determine the shape of the policy space and to identify the global optimum, and is as efficient in its subject budget as a traditional A-B comparison. We evaluate the method via two behavioral studies, and suggest that the method has broad applicability to optimization problems involving humans outside the educational arena. 1

6 0.75749028 257 nips-2013-Projected Natural Actor-Critic

7 0.75109172 38 nips-2013-Approximate Dynamic Programming Finally Performs Well in the Game of Tetris

8 0.73014772 322 nips-2013-Symbolic Opportunistic Policy Iteration for Factored-Action MDPs

9 0.71978706 165 nips-2013-Learning from Limited Demonstrations

10 0.63210213 79 nips-2013-DESPOT: Online POMDP Planning with Regularization

11 0.57715005 23 nips-2013-Active Learning for Probabilistic Hypotheses Using the Maximum Gibbs Error Criterion

12 0.57433933 347 nips-2013-Variational Planning for Graph-based MDPs

13 0.56636208 140 nips-2013-Improved and Generalized Upper Bounds on the Complexity of Policy Iteration

14 0.56525952 55 nips-2013-Bellman Error Based Feature Generation using Random Projections on Sparse Spaces

15 0.55231375 280 nips-2013-Robust Data-Driven Dynamic Programming

16 0.54162854 150 nips-2013-Learning Adaptive Value of Information for Structured Prediction

17 0.50908852 273 nips-2013-Reinforcement Learning in Robust Markov Decision Processes

18 0.48971832 17 nips-2013-A multi-agent control framework for co-adaptation in brain-computer interfaces

19 0.45247033 250 nips-2013-Policy Shaping: Integrating Human Feedback with Reinforcement Learning

20 0.44479477 227 nips-2013-Online Learning in Markov Decision Processes with Adversarially Chosen Transition Probability Distributions


similar papers computed by lda model

lda for this paper:

topicId topicWeight

[(2, 0.014), (16, 0.025), (33, 0.117), (34, 0.221), (41, 0.039), (49, 0.029), (56, 0.127), (70, 0.028), (85, 0.023), (86, 0.197), (89, 0.023), (92, 0.01), (93, 0.042), (95, 0.013)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.85975158 348 nips-2013-Variational Policy Search via Trajectory Optimization

Author: Sergey Levine, Vladlen Koltun

Abstract: In order to learn effective control policies for dynamical systems, policy search methods must be able to discover successful executions of the desired task. While random exploration can work well in simple domains, complex and high-dimensional tasks present a serious challenge, particularly when combined with high-dimensional policies that make parameter-space exploration infeasible. We present a method that uses trajectory optimization as a powerful exploration strategy that guides the policy search. A variational decomposition of a maximum likelihood policy objective allows us to use standard trajectory optimization algorithms such as differential dynamic programming, interleaved with standard supervised learning for the policy itself. We demonstrate that the resulting algorithm can outperform prior methods on two challenging locomotion tasks.

2 0.85691279 4 nips-2013-A Comparative Framework for Preconditioned Lasso Algorithms

Author: Fabian L. Wauthier, Nebojsa Jojic, Michael Jordan

Abstract: The Lasso is a cornerstone of modern multivariate data analysis, yet its performance suffers in the common situation in which covariates are correlated. This limitation has led to a growing number of Preconditioned Lasso algorithms that pre-multiply X and y by matrices PX , Py prior to running the standard Lasso. A direct comparison of these and similar Lasso-style algorithms to the original Lasso is difficult because the performance of all of these methods depends critically on an auxiliary penalty parameter λ. In this paper we propose an agnostic framework for comparing Preconditioned Lasso algorithms to the Lasso without having to choose λ. We apply our framework to three Preconditioned Lasso instances and highlight cases when they will outperform the Lasso. Additionally, our theory reveals fragilities of these algorithms to which we provide partial solutions. 1

3 0.83652389 265 nips-2013-Reconciling "priors" & "priors" without prejudice?

Author: Remi Gribonval, Pierre Machart

Abstract: There are two major routes to address linear inverse problems. Whereas regularization-based approaches build estimators as solutions of penalized regression optimization problems, Bayesian estimators rely on the posterior distribution of the unknown, given some assumed family of priors. While these may seem radically different approaches, recent results have shown that, in the context of additive white Gaussian denoising, the Bayesian conditional mean estimator is always the solution of a penalized regression problem. The contribution of this paper is twofold. First, we extend the additive white Gaussian denoising results to general linear inverse problems with colored Gaussian noise. Second, we characterize conditions under which the penalty function associated to the conditional mean estimator can satisfy certain popular properties such as convexity, separability, and smoothness. This sheds light on some tradeoff between computational efficiency and estimation accuracy in sparse regularization, and draws some connections between Bayesian estimation and proximal optimization. 1

4 0.83612055 49 nips-2013-Bayesian Inference and Online Experimental Design for Mapping Neural Microcircuits

Author: Ben Shababo, Brooks Paige, Ari Pakman, Liam Paninski

Abstract: With the advent of modern stimulation techniques in neuroscience, the opportunity arises to map neuron to neuron connectivity. In this work, we develop a method for efficiently inferring posterior distributions over synaptic strengths in neural microcircuits. The input to our algorithm is data from experiments in which action potentials from putative presynaptic neurons can be evoked while a subthreshold recording is made from a single postsynaptic neuron. We present a realistic statistical model which accounts for the main sources of variability in this experiment and allows for significant prior information about the connectivity and neuronal cell types to be incorporated if available. Due to the technical challenges and sparsity of these systems, it is important to focus experimental time stimulating the neurons whose synaptic strength is most ambiguous, therefore we also develop an online optimal design algorithm for choosing which neurons to stimulate at each trial. 1

5 0.81024587 122 nips-2013-First-order Decomposition Trees

Author: Nima Taghipour, Jesse Davis, Hendrik Blockeel

Abstract: Lifting attempts to speedup probabilistic inference by exploiting symmetries in the model. Exact lifted inference methods, like their propositional counterparts, work by recursively decomposing the model and the problem. In the propositional case, there exist formal structures, such as decomposition trees (dtrees), that represent such a decomposition and allow us to determine the complexity of inference a priori. However, there is currently no equivalent structure nor analogous complexity results for lifted inference. In this paper, we introduce FO-dtrees, which upgrade propositional dtrees to the first-order level. We show how these trees can characterize a lifted inference solution for a probabilistic logical model (in terms of a sequence of lifted operations), and make a theoretical analysis of the complexity of lifted inference in terms of the novel notion of lifted width for the tree. 1

6 0.80945933 38 nips-2013-Approximate Dynamic Programming Finally Performs Well in the Game of Tetris

7 0.8078739 143 nips-2013-Integrated Non-Factorized Variational Inference

8 0.80597633 48 nips-2013-Bayesian Inference and Learning in Gaussian Process State-Space Models with Particle MCMC

9 0.80540764 210 nips-2013-Noise-Enhanced Associative Memories

10 0.80481708 202 nips-2013-Multiclass Total Variation Clustering

11 0.80206704 219 nips-2013-On model selection consistency of penalized M-estimators: a geometric theory

12 0.79645157 39 nips-2013-Approximate Gaussian process inference for the drift function in stochastic differential equations

13 0.79241562 23 nips-2013-Active Learning for Probabilistic Hypotheses Using the Maximum Gibbs Error Criterion

14 0.79201055 346 nips-2013-Variational Inference for Mahalanobis Distance Metrics in Gaussian Process Regression

15 0.79116482 278 nips-2013-Reward Mapping for Transfer in Long-Lived Agents

16 0.79103184 129 nips-2013-Generalized Random Utility Models with Multiple Types

17 0.78782058 347 nips-2013-Variational Planning for Graph-based MDPs

18 0.78545398 351 nips-2013-What Are the Invariant Occlusive Components of Image Patches? A Probabilistic Generative Approach

19 0.78536737 256 nips-2013-Probabilistic Principal Geodesic Analysis

20 0.78228313 239 nips-2013-Optimistic policy iteration and natural actor-critic: A unifying view and a non-optimality result