nips nips2013 nips2013-24 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Prashanth L.A., Mohammad Ghavamzadeh
Abstract: In many sequential decision-making problems we may want to manage risk by minimizing some measure of variability in rewards in addition to maximizing a standard criterion. Variance-related risk measures are among the most common risk-sensitive criteria in finance and operations research. However, optimizing many such criteria is known to be a hard problem. In this paper, we consider both discounted and average reward Markov decision processes. For each formulation, we first define a measure of variability for a policy, which in turn gives us a set of risk-sensitive criteria to optimize. For each of these criteria, we derive a formula for computing its gradient. We then devise actor-critic algorithms for estimating the gradient and updating the policy parameters in the ascent direction. We establish the convergence of our algorithms to locally risk-sensitive optimal policies. Finally, we demonstrate the usefulness of our algorithms in a traffic signal control application. 1
Reference: text
sentIndex sentText sentNum sentScore
1 Prashanth L.A. (INRIA Lille - Team SequeL) and Mohammad Ghavamzadeh (INRIA Lille - Team SequeL & Adobe Research). Abstract: In many sequential decision-making problems we may want to manage risk by minimizing some measure of variability in rewards in addition to maximizing a standard criterion. [sent-3, score-0.255]
2 Variance-related risk measures are among the most common risk-sensitive criteria in finance and operations research. [sent-4, score-0.188]
3 In this paper, we consider both discounted and average reward Markov decision processes. [sent-6, score-0.704]
4 For each formulation, we first define a measure of variability for a policy, which in turn gives us a set of risk-sensitive criteria to optimize. [sent-7, score-0.21]
5 We then devise actor-critic algorithms for estimating the gradient and updating the policy parameters in the ascent direction. [sent-9, score-0.465]
6 Finally, we demonstrate the usefulness of our algorithms in a traffic signal control application. [sent-11, score-0.126]
7 1 Introduction The usual optimization criteria for an infinite horizon Markov decision process (MDP) are the expected sum of discounted rewards and the average reward. [sent-12, score-0.581]
8 Many algorithms have been developed to maximize these criteria both when the model of the system is known (planning) and unknown (learning). [sent-13, score-0.149]
9 However, in many applications we may prefer to minimize some measure of risk in addition to maximizing a usual optimization criterion. [sent-15, score-0.117]
10 This variability can be due to two types of uncertainties: 1) uncertainties in the model parameters, which is the topic of robust MDPs. [sent-17, score-0.117]
11 [22] study stochastic shortest path problems, and in this context, propose a policy gradient algorithm for maximizing several risk-sensitive criteria that involve both the expectation and variance of the return random variable (defined as the sum of rewards received in an episode). [sent-29, score-0.662]
12 In this paper, we develop actor-critic algorithms for optimizing variance-related risk measures in both discounted and average reward MDPs. [sent-31, score-0.756]
13 Our contributions can be summarized as follows: • In the discounted reward setting we define the measure of variability as the variance of the return (similar to [22]). [sent-32, score-0.796]
14 We employ the Lagrangian relaxation procedure [1] and derive a formula for the gradient of the Lagrangian. [sent-34, score-0.127]
15 • In the average reward formulation, we first define the measure of variability as the long-run variance of a policy and, using a constrained optimization problem similar to the discounted case, derive an expression for the gradient of the Lagrangian. [sent-36, score-0.914]
16 We then develop an actor-critic algorithm with compatible features [21, 13] to estimate the gradient and to optimize the policy parameters. [sent-37, score-0.429]
17 Further, we demonstrate the usefulness of our algorithms in a traffic signal control problem. [sent-39, score-0.126]
18 In comparison to [22], which is the closest related work, we would like to remark that while the authors there develop policy gradient methods for stochastic shortest path problems, we devise actor-critic algorithms for both discounted and average reward settings. [sent-40, score-1.154]
19 Moreover, we note that the discounted formulation requires estimating the gradient of the value function at every state of the MDP, which motivated us to employ simultaneous perturbation techniques. [sent-41, score-0.702]
20 Here the state and action spaces are $X$ and $A = \{1, \ldots, m\}$; $R(x, a)$ is the reward random variable whose expectation is denoted by $r(x, a) = E[R(x, a)]$; $P(\cdot|x, a)$ is the transition probability distribution; and $P_0(\cdot)$ is the initial state distribution. [sent-49, score-0.381]
21 A stationary policy µ(·|x) is a probability distribution over actions, conditioned on the current state. [sent-51, score-0.3]
22 In policy gradient and actor-critic methods, we define a class of parameterized stochastic policies $\mu(\cdot|x; \theta)$, $x \in X$, $\theta \in \Theta \subseteq \mathbb{R}^{\kappa_1}$, estimate the gradient of a performance measure w.r.t. [sent-52, score-0.583]
23 the policy parameters θ from the observed system trajectories, and then improve the policy by adjusting its parameters in the direction of the gradient. [sent-55, score-0.623]
24 Since in this setting a policy µ is represented by its κ1 -dimensional parameter vector θ, policy dependent functions can be written as a function of θ in place of µ. [sent-56, score-0.6]
25 We denote by $d^\mu(x)$ and $\pi^\mu(x, a) = d^\mu(x)\,\mu(a|x)$ the stationary distribution of state $x$ and state-action pair $(x, a)$ under policy $\mu$, respectively. [sent-58, score-0.382]
26 In the discounted formulation, we also define the discounted visiting distribution of state $x$ and state-action pair $(x, a)$ under policy $\mu$ as $d^\mu_\gamma(x|x^0) = (1 - \gamma) \sum_{t=0}^{\infty} \gamma^t \Pr(x_t = x \mid x_0 = x^0; \mu)$ and $\pi^\mu_\gamma(x, a|x^0) = d^\mu_\gamma(x|x^0)\,\mu(a|x)$. [sent-59, score-1.029]
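For concreteness (this sketch is ours, not the paper's; the 3-state MDP, the policy, and all names are hypothetical placeholders), both distributions can be computed in closed form for a small finite MDP:

```python
import numpy as np

# Hypothetical 3-state, 2-action MDP with a fixed stochastic policy mu(a|x).
rng = np.random.default_rng(0)
n, m, gamma, x0 = 3, 2, 0.9, 0
P = rng.dirichlet(np.ones(n), size=(n, m))   # P[x, a, x'] = P(x' | x, a)
mu = rng.dirichlet(np.ones(m), size=n)       # mu[x, a]    = mu(a | x)

# State-to-state kernel under mu: P_mu[x, x'] = sum_a mu(a|x) P(x'|x, a)
P_mu = np.einsum('xa,xay->xy', mu, P)

# Discounted visiting distribution d^mu_gamma(.|x0) = (1 - gamma) sum_t gamma^t Pr(x_t = . | x0)
d_gamma = (1.0 - gamma) * np.linalg.solve(np.eye(n) - gamma * P_mu.T, np.eye(n)[x0])

# Stationary distribution d^mu: normalized left eigenvector of P_mu for eigenvalue 1
evals, evecs = np.linalg.eig(P_mu.T)
d_stat = np.real(evecs[:, np.argmin(np.abs(evals - 1.0))])
d_stat /= d_stat.sum()

# State-action versions: pi(x, a) = d(x) * mu(a|x)
pi_gamma, pi_stat = d_gamma[:, None] * mu, d_stat[:, None] * mu
print(d_gamma.sum(), d_stat.sum())   # both should be 1
```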
27 3 Discounted Reward Setting For a given policy $\mu$, we define the return of a state $x$ (state-action pair $(x, a)$) as the sum of discounted rewards encountered by the agent when it starts at state $x$ (state-action pair $(x, a)$) and then follows policy $\mu$, i.e., $D^\mu(x) = \sum_{t=0}^{\infty} \gamma^t R(x_t, a_t)$ with $x_0 = x$ (and additionally $a_0 = a$ for $D^\mu(x, a)$). [sent-60, score-1.143]
28 The expected values of these two random variables are the value and action-value functions of policy $\mu$, i.e., $V^\mu(x) = E[D^\mu(x)]$ and $Q^\mu(x, a) = E[D^\mu(x, a)]$. [sent-63, score-0.3]
29 The goal in the standard discounted reward formulation is to find an optimal policy µ∗ = arg maxµ V µ (x0 ), where x0 is the initial state of the system. [sent-66, score-0.98]
30 We note here that our algorithms can be easily extended to other variance-related risk criteria such as the Sharpe ratio, which is popular in financial decision-making [18] (see Appendix D of [17]). [sent-68, score-0.188]
31 The most common measure of the variability in the stream of rewards is the variance of the return, $\Lambda^\mu(x) = E[D^\mu(x)^2] - V^\mu(x)^2 = U^\mu(x) - V^\mu(x)^2$ (1), first introduced by Sobel [19]. [sent-69, score-0.241]
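As a small illustration of the quantities in (1) (our sketch, not from the paper; the MDP, policy, horizon, and episode count below are hypothetical placeholders), $V^\mu(x^0)$, $U^\mu(x^0)$, and $\Lambda^\mu(x^0)$ can be estimated by Monte Carlo simulation of truncated discounted returns:

```python
import numpy as np

rng = np.random.default_rng(0)
n, m, gamma, horizon, episodes = 4, 2, 0.95, 200, 5000

# Hypothetical MDP: transition kernel P(x'|x,a), mean rewards r(x,a), fixed policy mu(a|x).
P = rng.dirichlet(np.ones(n), size=(n, m))
r = rng.uniform(0.0, 1.0, size=(n, m))
mu = rng.dirichlet(np.ones(m), size=n)

def sample_return(x0):
    """One truncated discounted return D^mu(x0) along a simulated trajectory."""
    x, ret, disc = x0, 0.0, 1.0
    for _ in range(horizon):
        a = rng.choice(m, p=mu[x])
        ret += disc * r[x, a]
        disc *= gamma
        x = rng.choice(n, p=P[x, a])
    return ret

returns = np.array([sample_return(0) for _ in range(episodes)])
V_hat = returns.mean()              # V^mu(x0) = E[D^mu(x0)]
U_hat = (returns ** 2).mean()       # U^mu(x0) = E[D^mu(x0)^2], the square reward value function
Lambda_hat = U_hat - V_hat ** 2     # variance of the return, as in (1)
print(V_hat, U_hat, Lambda_hat)
```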
32 Note that $U^\mu(x) = E[D^\mu(x)^2]$ is the square reward value function of state $x$ under policy $\mu$. [sent-70, score-0.677]
33 Although Λµ of (1) satisfies a Bellman equation, unfortunately, it lacks the monotonicity property of dynamic programming (DP), and thus, it is not clear how the related risk measures can be optimized by standard DP algorithms [19]. [sent-71, score-0.098]
34 This is why policy gradient and actor-critic algorithms are good candidates to deal with this risk measure. [sent-72, score-0.496]
35 We consider the following risk-sensitive measure for discounted MDPs: for a given $\alpha > 0$, $\max_\theta V^\theta(x^0)$ subject to $\Lambda^\theta(x^0) \le \alpha$ (2). [sent-73, score-0.375]
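Spelling out the relaxation (our reading of the Lagrangian procedure cited earlier; the Lagrangian itself is stated with the convergence result later in the text), (2) is replaced by the saddle-point problem
\[
\max_{\lambda \ge 0}\;\min_{\theta}\; L(\theta, \lambda) \;=\; -V^{\theta}(x^0) + \lambda\bigl(\Lambda^{\theta}(x^0) - \alpha\bigr),
\]
which is solved by descending in $\theta$ along an estimate of $\nabla_\theta L$ and ascending in $\lambda$ along $\Lambda^{\theta}(x^0) - \alpha$.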
36 From the Bellman equation of $\Lambda^\mu(x)$, proposed by Sobel [19], it is straightforward to derive Bellman equations for $U^\mu(x)$ and the square reward action-value function $W^\mu(x, a) = E[D^\mu(x, a)^2]$ (see Appendix B of [17]). [sent-80, score-0.332]
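As a sketch of where such equations come from (our derivation, under the simplifying assumption of a deterministic reward $R(x, a) \equiv r(x, a)$; it need not match the exact form used in the appendix), squaring the recursion $D^\mu(x) = r(x, a) + \gamma D^\mu(x')$ with $a \sim \mu(\cdot|x)$ and $x' \sim P(\cdot|x, a)$, and taking expectations, gives
\[
U^\mu(x) = \sum_{a} \mu(a|x) \Bigl[ r(x, a)^2 + 2\gamma\, r(x, a) \sum_{x'} P(x'|x, a)\, V^\mu(x') + \gamma^2 \sum_{x'} P(x'|x, a)\, U^\mu(x') \Bigr],
\]
with $W^\mu(x, a)$ given by the bracketed term for a fixed action $a$.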
37 Using these definitions and notations, we are now ready to derive expressions for the gradients of $V^\theta(x^0)$ and $U^\theta(x^0)$, which are the main ingredients in calculating $\nabla_\theta L(\theta, \lambda)$. [sent-82, score-0.098]
38 It is challenging to devise an efficient method to estimate $\nabla_\theta L(\theta, \lambda)$ using the gradient formulas of Lemma 1. [sent-86, score-0.129]
39 This is mainly because 1) two different sampling distributions (two variants of the discounted visiting distribution, $\pi^\theta_\gamma$ and $\widetilde{\pi}^\theta_\gamma$) are used for $V^\theta(x^0)$ and $U^\theta(x^0)$, and 2) $\nabla V^\theta$ appears in the second sum of the $\nabla U^\theta(x^0)$ expression, which implies that we need to estimate the gradient of the value function $V^\theta$ at every state of the MDP. [sent-87, score-0.143]
40 These are the main motivations behind using simultaneous perturbation methods for estimating $\nabla_\theta L(\theta, \lambda)$ in Section 4. [sent-88, score-0.159]
41 4 Discounted Reward Algorithms In this section, we present actor-critic algorithms for optimizing the risk-sensitive measure (2) that are based on two simultaneous perturbation methods: simultaneous perturbation stochastic approximation (SPSA) and smoothed functional (SF) [3]. [sent-89, score-0.418]
42 In our actor-critic algorithms, the critic uses linear approximation for the value and square value functions, i.e., each is represented as a linear function of state features with weight vectors $v$ and $u$, respectively. [sent-94, score-0.383]
43 SPSA-based gradient estimates were first proposed in [20] and have been widely studied and found to be highly efficient in various settings, especially those involving high-dimensional parameters. [sent-97, score-0.13]
44 However, unlike the SPSA estimates in [20] that use two-sided balanced estimates (simulations with parameters θ −β∆ and θ +β∆), our gradient estimates are one-sided (simulations with parameters θ and θ+β∆) and resemble those in [6]. [sent-105, score-0.194]
45 Using a balanced gradient estimate would therefore come at the cost of an additional simulation (the resulting procedure would then require three simulations), which we avoid by using one-sided gradient estimates. [sent-107, score-0.196]
46 The SF-based method estimates not the gradient of a function $H(\theta)$ itself, but rather the gradient of its convolution with the Gaussian density $N(0, \beta^2 I)$, i.e., a smoothed version of the gradient. [sent-108, score-0.13]
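The following sketch (ours, not the paper's code; $H$ and all constants are hypothetical placeholders for a quantity that, in the actual algorithms, would itself be estimated from simulated trajectories) contrasts the two one-sided estimators:

```python
import numpy as np

rng = np.random.default_rng(1)

def H(theta):
    # Placeholder smooth objective standing in for V^theta(x0) or U^theta(x0).
    return -np.sum((theta - 1.0) ** 2)

def spsa_gradient(theta, beta=0.1):
    """One-sided SPSA estimate: only needs H at theta and at theta + beta * Delta."""
    delta = rng.choice([-1.0, 1.0], size=theta.shape)   # Rademacher perturbation
    return (H(theta + beta * delta) - H(theta)) / (beta * delta)

def sf_gradient(theta, beta=0.1):
    """One-sided smoothed-functional estimate based on a Gaussian perturbation."""
    delta = rng.standard_normal(theta.shape)
    return (delta / beta) * (H(theta + beta * delta) - H(theta))

theta = np.zeros(5)
g_spsa = np.mean([spsa_gradient(theta) for _ in range(2000)], axis=0)
g_sf = np.mean([sf_gradient(theta) for _ in range(2000)], axis=0)
print(g_spsa, g_sf)   # both should approach the true gradient 2 * (1 - theta) as beta -> 0
```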
47 The overall flow of our proposed actor-critic algorithms is illustrated in Figure 1 and involves the following main steps at each time step $t$: (1) Take action $a_t \sim \mu(\cdot|x_t; \theta_t)$, observe the reward $r(x_t, a_t)$ and next state $x_{t+1}$ in the first trajectory. [sent-117, score-0.372]
48 (2) Take action $a^+_t \sim \mu(\cdot|x^+_t; \theta^+_t)$, observe the reward $r(x^+_t, a^+_t)$ and next state $x^+_{t+1}$ in the second trajectory, which is generated with the perturbed policy parameter $\theta^+_t = \theta_t + \beta\Delta_t$. [sent-118, score-0.336]
49 (3) Critic Update (Equation (7)): this TD algorithm for learning the value and square value functions is a straightforward extension of the algorithm proposed in [23] to the discounted setting. [sent-120, score-0.41]
50 (4) Actor Update: Estimate the gradients $\nabla V^\theta(x^0)$ and $\nabla U^\theta(x^0)$ using SPSA (4) or SF (5), and update the policy parameter $\theta$ and the Lagrange multiplier $\lambda$ coordinate-wise, for $i = 1, \ldots, \kappa_1$, according to (8) and (9); a schematic sketch of steps (3) and (4) is given below. [sent-123, score-0.399]
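The sketch below is our illustration only: the feature map, step-sizes, projection bounds, and the square-value TD error (obtained by squaring the return recursion) are assumptions, not a transcription of equations (7)-(9).

```python
import numpy as np

rng = np.random.default_rng(2)

def critic_td_step(v, u, phi, x, reward, x_next, gamma=0.95, zeta=0.05):
    """TD update of the linear critics V_hat(x) = v . phi(x) and U_hat(x) = u . phi(x)."""
    f, f_next = phi(x), phi(x_next)
    delta = reward + gamma * (f_next @ v) - f @ v            # TD error for the value function
    # Squaring D(x) = R + gamma * D(x') gives the target R^2 + 2 gamma R V(x') + gamma^2 U(x').
    eps = reward**2 + 2*gamma*reward*(f_next @ v) + gamma**2*(f_next @ u) - f @ u
    return v + zeta * delta * f, u + zeta * eps * f

def actor_spsa_step(theta, lam, V0, U0, Vp, Up, delta_vec, beta=0.1,
                    step_theta=0.01, step_lam=0.001, alpha=10.0, proj=10.0):
    """Actor/multiplier update from the critics' estimates at x0:
    (V0, U0) from the unperturbed trajectory (parameter theta) and
    (Vp, Up) from the perturbed one (parameter theta + beta * delta_vec)."""
    grad_V = (Vp - V0) / (beta * delta_vec)                  # one-sided SPSA estimates
    grad_U = (Up - U0) / (beta * delta_vec)
    grad_Lambda = grad_U - 2.0 * V0 * grad_V                 # Lambda = U - V^2
    grad_L = -grad_V + lam * grad_Lambda                     # gradient of the Lagrangian in theta
    theta_new = np.clip(theta - step_theta * grad_L, -proj, proj)                 # projected descent
    lam_new = float(np.clip(lam + step_lam * ((U0 - V0**2) - alpha), 0.0, 1e6))   # projected ascent
    return theta_new, lam_new

# Toy usage with one-hot state features and dummy critic outputs.
phi = lambda s: np.eye(3)[s]
v, u = critic_td_step(np.zeros(3), np.zeros(3), phi, x=0, reward=1.0, x_next=2)
delta_vec = rng.choice([-1.0, 1.0], size=4)
theta, lam = actor_spsa_step(np.zeros(4), 0.5, V0=5.0, U0=30.0, Vp=5.2, Up=31.0, delta_vec=delta_vec)
print(v, u, theta, lam)
```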
51 A proof of convergence of the SPSA and SF algorithms to a (local) saddle point of the risk-sensitive objective function $L(\theta, \lambda) = -V^\theta(x^0) + \lambda(\Lambda^\theta(x^0) - \alpha)$ is given in Appendix B. [sent-129, score-0.104]
52 5 Average Reward Setting The average reward per step under policy $\mu$ is defined as $\rho(\mu) = \lim_{T\to\infty} \frac{1}{T} E\bigl[\sum_{t=0}^{T-1} R_t \mid \mu\bigr] = \sum_{x,a} \pi^\mu(x, a)\, r(x, a)$. [sent-131, score-0.616]
53 The goal in the standard (risk-neutral) average reward formulation is to find an average optimal policy, i.e., $\mu^* = \arg\max_\mu \rho(\mu)$. [sent-133, score-0.397]
54 Here a policy µ is assessed according to the expected differential reward associated with states or state-action pairs. [sent-136, score-0.613]
55 For all states $x \in X$ and actions $a \in A$, the differential action-value and value functions of policy $\mu$ are defined as $Q^\mu(x, a) = E\bigl[\sum_{t=0}^{\infty} \bigl(R_t - \rho(\mu)\bigr) \mid x_0 = x, a_0 = a, \mu\bigr]$ and $V^\mu(x) = \sum_a \mu(a|x)\, Q^\mu(x, a)$. [sent-137, score-0.377]
56 In the context of risk-sensitive MDPs, different criteria have been proposed to define a measure of variability, among which we consider the long-run variance of $\mu$ [8], defined as $\Lambda(\mu) = \sum_{x,a} \pi^\mu(x, a) \bigl(r(x, a) - \rho(\mu)\bigr)^2 = \lim_{T\to\infty} \frac{1}{T} E\bigl[\sum_{t=0}^{T-1} \bigl(R_t - \rho(\mu)\bigr)^2 \mid \mu\bigr]$ (11). [sent-138, score-0.161]
57 This notion of variability is based on the observation that it is the frequency of occurrence of state-action pairs that determines the variability in the average reward. [sent-139, score-0.263]
58 Equivalently, $\Lambda(\mu) = \eta(\mu) - \rho(\mu)^2$, where $\eta(\mu) = \sum_{x,a} \pi^\mu(x, a)\, r(x, a)^2$. We consider the following risk-sensitive measure for average reward MDPs in this paper: $\max_\theta \rho(\theta)$ subject to $\Lambda(\theta) \le \alpha$ (12), for a given $\alpha > 0$. [sent-141, score-0.349]
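For intuition (our illustrative sketch, not from the paper; the reward stream below is synthetic), these long-run quantities can be estimated directly from a single reward trajectory:

```python
import numpy as np

def long_run_statistics(rewards):
    """Empirical estimates of the average reward rho(mu), the average square reward
    eta(mu), and the long-run variance Lambda(mu) = eta(mu) - rho(mu)^2, computed
    from a single reward trajectory R_0, ..., R_{T-1}."""
    rewards = np.asarray(rewards, dtype=float)
    rho = rewards.mean()
    eta = (rewards ** 2).mean()
    return rho, eta, eta - rho ** 2

# Toy usage on a hypothetical reward stream.
rng = np.random.default_rng(3)
rho_hat, eta_hat, var_hat = long_run_statistics(rng.normal(1.0, 0.5, size=100_000))
print(rho_hat, eta_hat, var_hat)   # roughly 1.0, 1.25, 0.25
```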
59 As in the discounted setting, we employ the Lagrangian relaxation procedure to convert (12) to the unconstrained problem $\max_\lambda \min_\theta L(\theta, \lambda) = -\rho(\theta) + \lambda\bigl(\Lambda(\theta) - \alpha\bigr)$. [sent-142, score-0.371]
60 Similar to the discounted case, we descend in $\theta$ using $\nabla_\theta L(\theta, \lambda) = -\nabla_\theta \rho(\theta) + \lambda \nabla_\theta \Lambda(\theta)$ and ascend in $\lambda$ using $\nabla_\lambda L(\theta, \lambda) = \Lambda(\theta) - \alpha$, to find the saddle point of $L(\theta, \lambda)$. [sent-143, score-0.41]
61 Let U µ and W µ denote the differential value and action-value functions associated with the square reward under policy µ, respectively. [sent-145, score-0.681]
62 We define the temporal difference (TD) errors $\delta_t$ and $\epsilon_t$ for the differential value and square value functions as $\delta_t = R(x_t, a_t) - \rho_{t+1} + V(x_{t+1}) - V(x_t)$ and $\epsilon_t = R(x_t, a_t)^2 - \eta_{t+1} + U(x_{t+1}) - U(x_t)$. [sent-153, score-0.117]
63 These TD errors satisfy $E[\delta_t \mid x_t, a_t, \mu] = A^\mu(x_t, a_t)$ and $E[\epsilon_t \mid x_t, a_t, \mu] = B^\mu(x_t, a_t)$ (see Lemma 6 in Appendix C of [17]). [sent-156, score-0.514]
64 6 Average Reward Algorithm We now present our risk-sensitive actor-critic algorithm for average reward MDPs. [sent-161, score-0.316]
65 Algorithm 1 presents the complete structure of the algorithm along with update rules for the average reward and square reward estimates $\rho_t$, $\eta_t$; the TD errors $\delta_t$, $\epsilon_t$; the critic parameters $v_t$, $u_t$; and the actor parameters $\theta_t$, $\lambda_t$. [sent-162, score-0.975]
66 The projection operators $\Gamma$ and $\Gamma_\lambda$ are as defined in Section 4 and, as in the discounted setting, are necessary for the convergence proof of the algorithm. [sent-163, score-0.342]
67 As in the discounted setting, the critic uses linear approximation for the differential value and square value functions, i.e., each is represented as a linear function of state features with weight vectors $v$ and $u$, respectively. [sent-166, score-0.774]
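A schematic single-step update consistent with the quantities listed above is sketched below; it is our illustration only (step-sizes, feature maps, the compatible features $\psi_t = \nabla_\theta \log \mu(a_t|x_t; \theta)$, and the projection bounds are assumptions, not a transcription of Algorithm 1).

```python
import numpy as np

def rs_ac_step(x, a, reward, x_next, state, policy_grad_log, phi,
               zeta=0.01, c_critic=0.05, c_actor=0.001, c_lam=0.0001, alpha=1.0):
    """One schematic update of the average-reward risk-sensitive actor-critic.

    state: dict holding rho, eta (average estimates), v, u (critic weights),
           theta (policy parameters), and lam (Lagrange multiplier).
    policy_grad_log(x, a, theta): compatible features psi = grad_theta log mu(a|x; theta).
    phi(x): state features for the linear critic."""
    rho, eta = state["rho"], state["eta"]
    v, u, theta, lam = state["v"], state["u"], state["theta"], state["lam"]

    # Averages of the reward and the square reward.
    state["rho"] = rho + zeta * (reward - rho)
    state["eta"] = eta + zeta * (reward ** 2 - eta)

    # TD errors for the differential value and square value functions.
    delta = reward - rho + phi(x_next) @ v - phi(x) @ v
    eps = reward ** 2 - eta + phi(x_next) @ u - phi(x) @ u

    # Critic: linear TD updates for v (value) and u (square value).
    state["v"] = v + c_critic * delta * phi(x)
    state["u"] = u + c_critic * eps * phi(x)

    # Actor: compatible-feature gradient estimates and descent on the Lagrangian.
    psi = policy_grad_log(x, a, theta)
    grad_rho = delta * psi
    grad_eta = eps * psi
    grad_Lambda = grad_eta - 2.0 * rho * grad_rho            # Lambda = eta - rho^2
    state["theta"] = theta - c_actor * (-grad_rho + lam * grad_Lambda)

    # Lagrange multiplier: projected ascent on the constraint Lambda(theta) <= alpha.
    state["lam"] = float(np.clip(lam + c_lam * ((eta - rho ** 2) - alpha), 0.0, 1e6))
    return state

# Toy usage with one-hot state features and placeholder compatible features.
phi = lambda s: np.eye(3)[s]
psi = lambda x, a, th: np.eye(6)[3 * a + x]
state = {"rho": 0.0, "eta": 0.0, "v": np.zeros(3), "u": np.zeros(3),
         "theta": np.zeros(6), "lam": 0.1}
state = rs_ac_step(x=0, a=1, reward=1.0, x_next=2, state=state,
                   policy_grad_log=psi, phi=phi)
print(state["rho"], state["lam"])
```

In the actual algorithm the three groups of updates run on separate timescales (critic fastest, actor intermediate, multiplier slowest); the single shared loop above does not capture that.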
68 Although our estimates of $\rho(\theta)$ and $\eta(\theta)$ are unbiased, since we use biased estimates for $V^\theta$ and $U^\theta$ (linear approximations in the critic), our gradient estimates $\nabla\rho(\theta)$ and $\nabla\eta(\theta)$, and as a result $\nabla L(\theta, \lambda)$, are biased. [sent-169, score-0.194]
69 7 Experimental Results We evaluate our algorithms in the context of a traffic signal control application. [sent-174, score-0.126]
70 The objective in our formulation is to minimize the total number of vehicles in the system, which indirectly minimizes the delay experienced by the system. [sent-175, score-0.112]
71 The motivation behind using a risk-sensitive control strategy is to reduce the variations in the delay experienced by road users. [sent-176, score-0.161]
72 We consider both infinite horizon discounted as well as average settings for the traffic signal control MDP, formulated as in [15]. [sent-177, score-0.484]
73 Here qi and ti denote the queue length and elapsed time since the signal turned to red on lane i. [sent-188, score-0.205]
74 The single-stage cost function $h(x_t)$ is defined as $h(x_t) = r_1 \bigl[\sum_{i \in I_p} r_2\, q_i(t) + \sum_{i \notin I_p} s_2\, q_i(t)\bigr] + s_1 \bigl[\sum_{i \in I_p} r_2\, t_i(t) + \sum_{i \notin I_p} s_2\, t_i(t)\bigr]$ (19), where $r_i, s_i \ge 0$ such that $r_i + s_i = 1$ for $i = 1, 2$ and $r_2 > s_2$. [sent-190, score-0.104]
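A direct transcription of (19) (our sketch; the prioritized-lane set $I_p$, the weights, and the toy inputs are hypothetical) is:

```python
import numpy as np

def single_stage_cost(q, t_elapsed, prioritized, r1=0.6, r2=0.7):
    """Single-stage cost h(x_t) of Eq. (19): weighted queue lengths plus weighted
    elapsed times, with weight r2 on prioritized lanes (i in I_p) and s2 = 1 - r2
    on the rest; r1 and s1 = 1 - r1 trade off queues against elapsed times."""
    q = np.asarray(q, dtype=float)
    t_elapsed = np.asarray(t_elapsed, dtype=float)
    prioritized = np.asarray(prioritized, dtype=bool)   # indicator of i in I_p
    s1, s2 = 1.0 - r1, 1.0 - r2
    lane_w = np.where(prioritized, r2, s2)
    return r1 * np.sum(lane_w * q) + s1 * np.sum(lane_w * t_elapsed)

# Toy usage: 4 lanes, lanes 0 and 1 prioritized.
print(single_stage_cost([3, 5, 2, 0], [10, 4, 7, 1], [True, True, False, False]))
```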
75 Given the above traffic control setting, we aim to minimize both the long-run discounted and the average sum of the cost function $h(x_t)$. [sent-193, score-0.449]
76 The underlying policy for all the algorithms is a parameterized Boltzmann policy (see Appendix F of [17]). [sent-194, score-0.636]
77 Note that these are two-timescale algorithms with a TD critic on the faster timescale and the actor on the slower timescale. [sent-196, score-0.534]
78 (ii) Risk-sensitive SPSA and SF algorithms (RS-SPSA and RS-SF) of Section 4 that attempt to solve (2) and update the policy parameter according to (8) and (9), respectively. [sent-197, score-0.368]
79 In the average setting, we implement (i) the risk-neutral AC algorithm from [14] that incorporates an actor-critic scheme, and (ii) the risk-sensitive algorithm of Section 6 (RS-AC) that attempts to solve (12) and updates the policy parameter according to (17). [sent-198, score-0.39]
80 For instance, assuming only 20 vehicles per lane of a 2x2-grid network, the cardinality of the state space is approximately of the order of $10^{32}$, and the situation is aggravated as the size of the road network increases. [sent-200, score-0.158]
81 Figures 2(a) and 2(b) show the distribution of the discounted cumulative reward Dθ (x0 ) for the SPSA and SF algorithms, respectively. [sent-204, score-0.606]
82 Figure 3(a) shows the distribution of the average reward ρ for the algorithms in the average setting. [sent-205, score-0.404]
83 Figure 2: Performance comparison in the discounted setting using the distribution of $D^\theta(x^0)$. [sent-226, score-0.342]
84 that we propose result in a long-term (discounted or average) reward that is higher than their risk-neutral variants. [sent-239, score-0.264]
85 However, in terms of the empirical variance of the reward (both discounted and average), the risk-sensitive algorithms outperform their risk-neutral variants. [sent-240, score-0.68]
86 We use average junction waiting time (AJWT) to compare the algorithms from a traffic signal control application standpoint. [sent-241, score-0.232]
87 Figure 3(b) presents the AJWT plots for the algorithms in the average setting (see Appendix F of [17] for similar results for the SPSA and SF algorithms in the discounted setting). [sent-242, score-0.466]
88 8 Conclusions and Future Work We proposed novel actor-critic algorithms for control in risk-sensitive discounted and average reward MDPs. [sent-245, score-1.201]
89 All our algorithms involve a TD critic on the fast timescale, a policy gradient (actor) on the intermediate timescale, and dual ascent for Lagrange multipliers on the slowest timescale. [sent-246, score-0.795]
90 In the discounted setting, we pointed out the difficulty in estimating the gradient of the variance of the return and incorporated simultaneous perturbation-based SPSA and SF approaches for gradient estimation in our algorithms. [sent-247, score-0.767]
91 The average setting, on the other hand, allowed for an actor to employ compatible features to estimate the gradient of the variance. [sent-248, score-0.347]
92 Further, using a traffic signal control application, we observed that our algorithms resulted in lower variance empirically as compared to their risk-neutral counterparts. [sent-250, score-0.164]
93 In this paper, we established asymptotic limits for our discounted and average reward risk-sensitive actor-critic algorithms. [sent-251, score-0.658]
94 This is true even for the actor-critic algorithms that do not incorporate any risk criterion. [sent-253, score-0.098]
95 Percentile performance criteria for limiting average Markov decision processes. [sent-302, score-0.188]
96 Robust control of Markov decision processes with uncertain transition matrices. [sent-317, score-0.101]
97 Reinforcement learning with average cost for adaptive control of traffic lights at intersections. [sent-329, score-0.107]
98 Multivariate stochastic approximation using a simultaneous perturbation gradient approximation. [sent-361, score-0.288]
99 Policy gradient methods for reinforcement learning with function approximation. [sent-368, score-0.13]
100 Temporal difference methods for the variance of the reward to go. [sent-380, score-0.302]
wordName wordTfidf (topN-words)
[('spsa', 0.357), ('discounted', 0.342), ('critic', 0.315), ('policy', 0.3), ('sf', 0.273), ('reward', 0.264), ('vt', 0.262), ('xt', 0.257), ('actor', 0.137), ('ut', 0.126), ('traf', 0.116), ('prashanth', 0.105), ('gradient', 0.098), ('td', 0.092), ('criteria', 0.09), ('variability', 0.087), ('perturbation', 0.084), ('tamar', 0.084), ('simultaneous', 0.075), ('mdps', 0.073), ('appendix', 0.073), ('saddle', 0.068), ('square', 0.068), ('ajwt', 0.063), ('risk', 0.062), ('lagrange', 0.055), ('control', 0.055), ('road', 0.053), ('average', 0.052), ('rewards', 0.051), ('mdp', 0.05), ('differential', 0.049), ('ac', 0.048), ('ip', 0.048), ('schedules', 0.048), ('decision', 0.046), ('queue', 0.046), ('timescale', 0.046), ('slowest', 0.046), ('lagrangian', 0.045), ('state', 0.045), ('bellman', 0.044), ('percentile', 0.044), ('inria', 0.043), ('elapsed', 0.042), ('lille', 0.042), ('filar', 0.042), ('rhs', 0.041), ('ghavamzadeh', 0.039), ('dz', 0.038), ('variance', 0.038), ('updates', 0.038), ('stateaction', 0.037), ('team', 0.036), ('operations', 0.036), ('algorithms', 0.036), ('rt', 0.036), ('signal', 0.035), ('sobel', 0.034), ('multiplier', 0.034), ('rademacher', 0.034), ('measure', 0.033), ('gradients', 0.033), ('estimates', 0.032), ('update', 0.032), ('return', 0.032), ('adobe', 0.032), ('bhatnagar', 0.032), ('reinforcement', 0.032), ('markov', 0.032), ('stochastic', 0.031), ('compatible', 0.031), ('devise', 0.031), ('lane', 0.03), ('vehicles', 0.03), ('experienced', 0.03), ('uncertainties', 0.03), ('automatic', 0.03), ('formulation', 0.029), ('castro', 0.029), ('mohammad', 0.029), ('employ', 0.029), ('qi', 0.029), ('actions', 0.028), ('waiting', 0.028), ('agent', 0.028), ('calculate', 0.027), ('nance', 0.027), ('action', 0.027), ('junction', 0.026), ('unbiased', 0.026), ('transportation', 0.025), ('sutton', 0.024), ('ti', 0.023), ('system', 0.023), ('policies', 0.023), ('delay', 0.023), ('maximizing', 0.022), ('lemma', 0.022), ('rl', 0.022)]
simIndex simValue paperId paperTitle
same-paper 1 0.99999934 24 nips-2013-Actor-Critic Algorithms for Risk-Sensitive MDPs
Author: Prashanth L.A., Mohammad Ghavamzadeh
Abstract: In many sequential decision-making problems we may want to manage risk by minimizing some measure of variability in rewards in addition to maximizing a standard criterion. Variance-related risk measures are among the most common risk-sensitive criteria in finance and operations research. However, optimizing many such criteria is known to be a hard problem. In this paper, we consider both discounted and average reward Markov decision processes. For each formulation, we first define a measure of variability for a policy, which in turn gives us a set of risk-sensitive criteria to optimize. For each of these criteria, we derive a formula for computing its gradient. We then devise actor-critic algorithms for estimating the gradient and updating the policy parameters in the ascent direction. We establish the convergence of our algorithms to locally risk-sensitive optimal policies. Finally, we demonstrate the usefulness of our algorithms in a traffic signal control application. 1
2 0.31804726 348 nips-2013-Variational Policy Search via Trajectory Optimization
Author: Sergey Levine, Vladlen Koltun
Abstract: In order to learn effective control policies for dynamical systems, policy search methods must be able to discover successful executions of the desired task. While random exploration can work well in simple domains, complex and highdimensional tasks present a serious challenge, particularly when combined with high-dimensional policies that make parameter-space exploration infeasible. We present a method that uses trajectory optimization as a powerful exploration strategy that guides the policy search. A variational decomposition of a maximum likelihood policy objective allows us to use standard trajectory optimization algorithms such as differential dynamic programming, interleaved with standard supervised learning for the policy itself. We demonstrate that the resulting algorithm can outperform prior methods on two challenging locomotion tasks. 1
3 0.30011025 28 nips-2013-Adaptive Step-Size for Policy Gradient Methods
Author: Matteo Pirotta, Marcello Restelli, Luca Bascetta
Abstract: In the last decade, policy gradient methods have significantly grown in popularity in the reinforcement–learning field. In particular, they have been largely employed in motor control and robotic applications, thanks to their ability to cope with continuous state and action domains and partial observable problems. Policy gradient researches have been mainly focused on the identification of effective gradient directions and the proposal of efficient estimation algorithms. Nonetheless, the performance of policy gradient methods is determined not only by the gradient direction, since convergence properties are strongly influenced by the choice of the step size: small values imply slow convergence rate, while large values may lead to oscillations or even divergence of the policy parameters. Step–size value is usually chosen by hand tuning and still little attention has been paid to its automatic selection. In this paper, we propose to determine the learning rate by maximizing a lower bound to the expected performance gain. Focusing on Gaussian policies, we derive a lower bound that is second–order polynomial of the step size, and we show how a simplified version of such lower bound can be maximized when the gradient is estimated from trajectory samples. The properties of the proposed approach are empirically evaluated in a linear–quadratic regulator problem. 1
4 0.27122521 239 nips-2013-Optimistic policy iteration and natural actor-critic: A unifying view and a non-optimality result
Author: Paul Wagner
Abstract: Approximate dynamic programming approaches to the reinforcement learning problem are often categorized into greedy value function methods and value-based policy gradient methods. As our first main result, we show that an important subset of the latter methodology is, in fact, a limiting special case of a general formulation of the former methodology; optimistic policy iteration encompasses not only most of the greedy value function methods but also natural actor-critic methods, and permits one to directly interpolate between them. The resulting continuum adjusts the strength of the Markov assumption in policy improvement and, as such, can be seen as dual in spirit to the continuum in TD(λ)-style algorithms in policy evaluation. As our second main result, we show for a substantial subset of softgreedy value function approaches that, while having the potential to avoid policy oscillation and policy chattering, this subset can never converge toward an optimal policy, except in a certain pathological case. Consequently, in the context of approximations (either in state estimation or in value function representation), the majority of greedy value function methods seem to be deemed to suffer either from the risk of oscillation/chattering or from the presence of systematic sub-optimality. 1
5 0.24352929 278 nips-2013-Reward Mapping for Transfer in Long-Lived Agents
Author: Xiaoxiao Guo, Satinder Singh, Richard L. Lewis
Abstract: We consider how to transfer knowledge from previous tasks (MDPs) to a current task in long-lived and bounded agents that must solve a sequence of tasks over a finite lifetime. A novel aspect of our transfer approach is that we reuse reward functions. While this may seem counterintuitive, we build on the insight of recent work on the optimal rewards problem that guiding an agent’s behavior with reward functions other than the task-specifying reward function can help overcome computational bounds of the agent. Specifically, we use good guidance reward functions learned on previous tasks in the sequence to incrementally train a reward mapping function that maps task-specifying reward functions into good initial guidance reward functions for subsequent tasks. We demonstrate that our approach can substantially improve the agent’s performance relative to other approaches, including an approach that transfers policies. 1
6 0.21707577 273 nips-2013-Reinforcement Learning in Robust Markov Decision Processes
7 0.18533495 150 nips-2013-Learning Adaptive Value of Information for Structured Prediction
8 0.17686561 257 nips-2013-Projected Natural Actor-Critic
9 0.17622198 347 nips-2013-Variational Planning for Graph-based MDPs
10 0.1708028 79 nips-2013-DESPOT: Online POMDP Planning with Regularization
11 0.16440254 250 nips-2013-Policy Shaping: Integrating Human Feedback with Reinforcement Learning
12 0.16212027 227 nips-2013-Online Learning in Markov Decision Processes with Adversarially Chosen Transition Probability Distributions
13 0.15639868 55 nips-2013-Bellman Error Based Feature Generation using Random Projections on Sparse Spaces
14 0.15577057 322 nips-2013-Symbolic Opportunistic Policy Iteration for Factored-Action MDPs
15 0.15474983 241 nips-2013-Optimizing Instructional Policies
16 0.14712778 50 nips-2013-Bayesian Mixture Modelling and Inference based Thompson Sampling in Monte-Carlo Tree Search
17 0.14655691 280 nips-2013-Robust Data-Driven Dynamic Programming
18 0.14566089 140 nips-2013-Improved and Generalized Upper Bounds on the Complexity of Policy Iteration
19 0.1367517 23 nips-2013-Active Learning for Probabilistic Hypotheses Using the Maximum Gibbs Error Criterion
20 0.13282834 124 nips-2013-Forgetful Bayes and myopic planning: Human learning and decision-making in a bandit setting
topicId topicWeight
[(0, 0.251), (1, -0.367), (2, -0.089), (3, 0.122), (4, -0.062), (5, 0.022), (6, -0.046), (7, 0.135), (8, -0.002), (9, -0.048), (10, -0.021), (11, -0.119), (12, -0.024), (13, 0.016), (14, -0.069), (15, 0.051), (16, 0.061), (17, 0.051), (18, 0.01), (19, 0.05), (20, 0.073), (21, -0.003), (22, 0.058), (23, -0.024), (24, -0.039), (25, -0.026), (26, 0.021), (27, -0.036), (28, -0.023), (29, -0.026), (30, 0.044), (31, -0.029), (32, -0.063), (33, 0.088), (34, -0.028), (35, -0.039), (36, 0.015), (37, 0.043), (38, -0.033), (39, 0.006), (40, 0.058), (41, -0.002), (42, -0.013), (43, 0.108), (44, -0.088), (45, 0.076), (46, -0.013), (47, 0.083), (48, 0.117), (49, 0.026)]
simIndex simValue paperId paperTitle
same-paper 1 0.96688163 24 nips-2013-Actor-Critic Algorithms for Risk-Sensitive MDPs
Author: Prashanth L.A., Mohammad Ghavamzadeh
Abstract: In many sequential decision-making problems we may want to manage risk by minimizing some measure of variability in rewards in addition to maximizing a standard criterion. Variance-related risk measures are among the most common risk-sensitive criteria in finance and operations research. However, optimizing many such criteria is known to be a hard problem. In this paper, we consider both discounted and average reward Markov decision processes. For each formulation, we first define a measure of variability for a policy, which in turn gives us a set of risk-sensitive criteria to optimize. For each of these criteria, we derive a formula for computing its gradient. We then devise actor-critic algorithms for estimating the gradient and updating the policy parameters in the ascent direction. We establish the convergence of our algorithms to locally risk-sensitive optimal policies. Finally, we demonstrate the usefulness of our algorithms in a traffic signal control application. 1
2 0.83067459 348 nips-2013-Variational Policy Search via Trajectory Optimization
Author: Sergey Levine, Vladlen Koltun
Abstract: In order to learn effective control policies for dynamical systems, policy search methods must be able to discover successful executions of the desired task. While random exploration can work well in simple domains, complex and highdimensional tasks present a serious challenge, particularly when combined with high-dimensional policies that make parameter-space exploration infeasible. We present a method that uses trajectory optimization as a powerful exploration strategy that guides the policy search. A variational decomposition of a maximum likelihood policy objective allows us to use standard trajectory optimization algorithms such as differential dynamic programming, interleaved with standard supervised learning for the policy itself. We demonstrate that the resulting algorithm can outperform prior methods on two challenging locomotion tasks. 1
3 0.73312694 28 nips-2013-Adaptive Step-Size for Policy Gradient Methods
Author: Matteo Pirotta, Marcello Restelli, Luca Bascetta
Abstract: In the last decade, policy gradient methods have significantly grown in popularity in the reinforcement–learning field. In particular, they have been largely employed in motor control and robotic applications, thanks to their ability to cope with continuous state and action domains and partial observable problems. Policy gradient researches have been mainly focused on the identification of effective gradient directions and the proposal of efficient estimation algorithms. Nonetheless, the performance of policy gradient methods is determined not only by the gradient direction, since convergence properties are strongly influenced by the choice of the step size: small values imply slow convergence rate, while large values may lead to oscillations or even divergence of the policy parameters. Step–size value is usually chosen by hand tuning and still little attention has been paid to its automatic selection. In this paper, we propose to determine the learning rate by maximizing a lower bound to the expected performance gain. Focusing on Gaussian policies, we derive a lower bound that is second–order polynomial of the step size, and we show how a simplified version of such lower bound can be maximized when the gradient is estimated from trajectory samples. The properties of the proposed approach are empirically evaluated in a linear–quadratic regulator problem. 1
4 0.68845886 322 nips-2013-Symbolic Opportunistic Policy Iteration for Factored-Action MDPs
Author: Aswin Raghavan, Roni Khardon, Alan Fern, Prasad Tadepalli
Abstract: This paper addresses the scalability of symbolic planning under uncertainty with factored states and actions. Our first contribution is a symbolic implementation of Modified Policy Iteration (MPI) for factored actions that views policy evaluation as policy-constrained value iteration (VI). Unfortunately, a na¨ve approach ı to enforce policy constraints can lead to large memory requirements, sometimes making symbolic MPI worse than VI. We address this through our second and main contribution, symbolic Opportunistic Policy Iteration (OPI), which is a novel convergent algorithm lying between VI and MPI, that applies policy constraints if it does not increase the size of the value function representation, and otherwise performs VI backups. We also give a memory bounded version of this algorithm allowing a space-time tradeoff. Empirical results show significantly improved scalability over state-of-the-art symbolic planners. 1
Author: Paul Wagner
Abstract: Approximate dynamic programming approaches to the reinforcement learning problem are often categorized into greedy value function methods and value-based policy gradient methods. As our first main result, we show that an important subset of the latter methodology is, in fact, a limiting special case of a general formulation of the former methodology; optimistic policy iteration encompasses not only most of the greedy value function methods but also natural actor-critic methods, and permits one to directly interpolate between them. The resulting continuum adjusts the strength of the Markov assumption in policy improvement and, as such, can be seen as dual in spirit to the continuum in TD(λ)-style algorithms in policy evaluation. As our second main result, we show for a substantial subset of softgreedy value function approaches that, while having the potential to avoid policy oscillation and policy chattering, this subset can never converge toward an optimal policy, except in a certain pathological case. Consequently, in the context of approximations (either in state estimation or in value function representation), the majority of greedy value function methods seem to be deemed to suffer either from the risk of oscillation/chattering or from the presence of systematic sub-optimality. 1
6 0.67508823 347 nips-2013-Variational Planning for Graph-based MDPs
7 0.64307648 79 nips-2013-DESPOT: Online POMDP Planning with Regularization
8 0.63840586 273 nips-2013-Reinforcement Learning in Robust Markov Decision Processes
9 0.63530755 257 nips-2013-Projected Natural Actor-Critic
10 0.63066572 250 nips-2013-Policy Shaping: Integrating Human Feedback with Reinforcement Learning
11 0.62846619 280 nips-2013-Robust Data-Driven Dynamic Programming
12 0.60969573 38 nips-2013-Approximate Dynamic Programming Finally Performs Well in the Game of Tetris
13 0.60614485 140 nips-2013-Improved and Generalized Upper Bounds on the Complexity of Policy Iteration
14 0.60073286 241 nips-2013-Optimizing Instructional Policies
15 0.59243405 55 nips-2013-Bellman Error Based Feature Generation using Random Projections on Sparse Spaces
16 0.57005733 278 nips-2013-Reward Mapping for Transfer in Long-Lived Agents
17 0.56950176 165 nips-2013-Learning from Limited Demonstrations
18 0.56399047 50 nips-2013-Bayesian Mixture Modelling and Inference based Thompson Sampling in Monte-Carlo Tree Search
19 0.55068707 270 nips-2013-Regret based Robust Solutions for Uncertain Markov Decision Processes
20 0.53080583 17 nips-2013-A multi-agent control framework for co-adaptation in brain-computer interfaces
topicId topicWeight
[(16, 0.024), (33, 0.092), (34, 0.128), (36, 0.273), (41, 0.039), (49, 0.016), (56, 0.177), (70, 0.032), (85, 0.036), (89, 0.034), (93, 0.059)]
simIndex simValue paperId paperTitle
1 0.82209241 317 nips-2013-Streaming Variational Bayes
Author: Tamara Broderick, Nicholas Boyd, Andre Wibisono, Ashia C. Wilson, Michael Jordan
Abstract: We present SDA-Bayes, a framework for (S)treaming, (D)istributed, (A)synchronous computation of a Bayesian posterior. The framework makes streaming updates to the estimated posterior according to a user-specified approximation batch primitive. We demonstrate the usefulness of our framework, with variational Bayes (VB) as the primitive, by fitting the latent Dirichlet allocation model to two large-scale document collections. We demonstrate the advantages of our algorithm over stochastic variational inference (SVI) by comparing the two after a single pass through a known amount of data—a case where SVI may be applied—and in the streaming setting, where SVI does not apply. 1
same-paper 2 0.81802678 24 nips-2013-Actor-Critic Algorithms for Risk-Sensitive MDPs
Author: Prashanth L.A., Mohammad Ghavamzadeh
Abstract: In many sequential decision-making problems we may want to manage risk by minimizing some measure of variability in rewards in addition to maximizing a standard criterion. Variance-related risk measures are among the most common risk-sensitive criteria in finance and operations research. However, optimizing many such criteria is known to be a hard problem. In this paper, we consider both discounted and average reward Markov decision processes. For each formulation, we first define a measure of variability for a policy, which in turn gives us a set of risk-sensitive criteria to optimize. For each of these criteria, we derive a formula for computing its gradient. We then devise actor-critic algorithms for estimating the gradient and updating the policy parameters in the ascent direction. We establish the convergence of our algorithms to locally risk-sensitive optimal policies. Finally, we demonstrate the usefulness of our algorithms in a traffic signal control application. 1
3 0.7913087 188 nips-2013-Memory Limited, Streaming PCA
Author: Ioannis Mitliagkas, Constantine Caramanis, Prateek Jain
Abstract: We consider streaming, one-pass principal component analysis (PCA), in the highdimensional regime, with limited memory. Here, p-dimensional samples are presented sequentially, and the goal is to produce the k-dimensional subspace that best approximates these points. Standard algorithms require O(p2 ) memory; meanwhile no algorithm can do better than O(kp) memory, since this is what the output itself requires. Memory (or storage) complexity is most meaningful when understood in the context of computational and sample complexity. Sample complexity for high-dimensional PCA is typically studied in the setting of the spiked covariance model, where p-dimensional points are generated from a population covariance equal to the identity (white noise) plus a low-dimensional perturbation (the spike) which is the signal to be recovered. It is now well-understood that the spike can be recovered when the number of samples, n, scales proportionally with the dimension, p. Yet, all algorithms that provably achieve this, have memory complexity O(p2 ). Meanwhile, algorithms with memory-complexity O(kp) do not have provable bounds on sample complexity comparable to p. We present an algorithm that achieves both: it uses O(kp) memory (meaning storage of any kind) and is able to compute the k-dimensional spike with O(p log p) samplecomplexity – the first algorithm of its kind. While our theoretical analysis focuses on the spiked covariance model, our simulations show that our algorithm is successful on much more general models for the data. 1
4 0.76282156 308 nips-2013-Spike train entropy-rate estimation using hierarchical Dirichlet process priors
Author: Karin C. Knudson, Jonathan W. Pillow
Abstract: Entropy rate quantifies the amount of disorder in a stochastic process. For spiking neurons, the entropy rate places an upper bound on the rate at which the spike train can convey stimulus information, and a large literature has focused on the problem of estimating entropy rate from spike train data. Here we present Bayes least squares and empirical Bayesian entropy rate estimators for binary spike trains using hierarchical Dirichlet process (HDP) priors. Our estimator leverages the fact that the entropy rate of an ergodic Markov Chain with known transition probabilities can be calculated analytically, and many stochastic processes that are non-Markovian can still be well approximated by Markov processes of sufficient depth. Choosing an appropriate depth of Markov model presents challenges due to possibly long time dependencies and short data sequences: a deeper model can better account for long time dependencies, but is more difficult to infer from limited data. Our approach mitigates this difficulty by using a hierarchical prior to share statistical power across Markov chains of different depths. We present both a fully Bayesian and empirical Bayes entropy rate estimator based on this model, and demonstrate their performance on simulated and real neural spike train data. 1
5 0.73196167 44 nips-2013-B-test: A Non-parametric, Low Variance Kernel Two-sample Test
Author: Wojciech Zaremba, Arthur Gretton, Matthew Blaschko
Abstract: A family of maximum mean discrepancy (MMD) kernel two-sample tests is introduced. Members of the test family are called Block-tests or B-tests, since the test statistic is an average over MMDs computed on subsets of the samples. The choice of block size allows control over the tradeoff between test power and computation time. In this respect, the B-test family combines favorable properties of previously proposed MMD two-sample tests: B-tests are more powerful than a linear time test where blocks are just pairs of samples, yet they are more computationally efficient than a quadratic time test where a single large block incorporating all the samples is used to compute a U-statistic. A further important advantage of the B-tests is their asymptotically Normal null distribution: this is by contrast with the U-statistic, which is degenerate under the null hypothesis, and for which estimates of the null distribution are computationally demanding. Recent results on kernel selection for hypothesis testing transfer seamlessly to the B-tests, yielding a means to optimize test power via kernel choice. 1
6 0.68841124 23 nips-2013-Active Learning for Probabilistic Hypotheses Using the Maximum Gibbs Error Criterion
7 0.68322384 249 nips-2013-Polar Operators for Structured Sparse Estimation
8 0.68296766 314 nips-2013-Stochastic Optimization of PCA with Capped MSG
9 0.66560537 25 nips-2013-Adaptive Anonymity via $b$-Matching
10 0.65952307 233 nips-2013-Online Robust PCA via Stochastic Optimization
11 0.65843576 232 nips-2013-Online PCA for Contaminated Data
12 0.65823531 94 nips-2013-Distributed $k$-means and $k$-median Clustering on General Topologies
13 0.65715802 32 nips-2013-Aggregating Optimistic Planning Trees for Solving Markov Decision Processes
14 0.65640551 13 nips-2013-A Scalable Approach to Probabilistic Latent Space Inference of Large-Scale Networks
15 0.65596867 79 nips-2013-DESPOT: Online POMDP Planning with Regularization
16 0.6552704 150 nips-2013-Learning Adaptive Value of Information for Structured Prediction
17 0.65502352 278 nips-2013-Reward Mapping for Transfer in Long-Lived Agents
18 0.65455681 63 nips-2013-Cluster Trees on Manifolds
19 0.65321434 60 nips-2013-Buy-in-Bulk Active Learning
20 0.65233678 55 nips-2013-Bellman Error Based Feature Generation using Random Projections on Sparse Spaces