nips nips2011 nips2011-215 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Philip S. Thomas
Abstract: We present a novel class of actor-critic algorithms for actors consisting of sets of interacting modules. We present, analyze theoretically, and empirically evaluate an update rule for each module, which requires only local information: the module’s input, output, and the TD error broadcast by a critic. Such updates are necessary when computation of compatible features becomes prohibitively difficult and are also desirable to increase the biological plausibility of reinforcement learning methods. 1
Reference: text
sentIndex sentText sentNum sentScore
1 Systems with a critic that computes and broadcasts the TD error to another module called the actor, which stores the current decision rule, are called actor-critic architectures. [sent-12, score-0.551]
2 [7] present a compelling argument that the fly brain is an actor-critic by finding the neurons making up the critic and then artificially activating them to train the actor portions of the brain. [sent-14, score-0.499]
3 However, current actor-critic methods in the artificial intelligence community remain biologically implausible because each component of the actor can only be updated with detailed knowledge of the entire actor. [sent-15, score-0.354]
4 The actor in an actor-critic maintains a decision rule, π, called a policy, parameterized by a vector θ, that computes the probability of an action (decision), a, given an estimate of the current state of the world, st, and the current parameters, θt. [sent-20, score-0.607]
5 In some cases, an actor can be broken into multiple interacting modules, each of which computes an action given some input, x, which may contain elements of s as well as the outputs of other modules. [sent-21, score-0.417]
6 An example of such a modular actor is provided in Figure 1. [sent-22, score-0.449]
7 This actor consists of three modules, A1, A2, and A3, with parameters θ1, θ2, and θ3, respectively. [sent-23, score-0.29]
8 The ith module takes input xi , which is a subset of the state features and the outputs of other modules. [sent-24, score-0.374]
9 It then produces its action ai according to its policy, π i (xi , ai , θi ) = Pr(ai |xi , θi ). [sent-25, score-0.556]
10 The output, a, of the whole modular actor is one of the module outputs—in this case a = a3 . [sent-26, score-0.732]
11 Later we modify this to allow the action a to follow any distribution with the state and module outputs as parameters. [sent-27, score-0.418]
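The structure described above can be summarized in a short sketch. The sketch below is illustrative only, assuming each module exposes a policy that maps its local input to a distribution over its own action set; the class and method names (Module, ModularActor, act) and the exact wiring of Figure 1 are assumptions, not taken from the paper.

```python
import numpy as np

class Module:
    """One module A^i: samples a^i from pi^i(x^i, ., theta^i) given its local input x^i."""
    def __init__(self, policy):
        self.policy = policy  # callable x -> vector of probabilities over this module's actions

    def act(self, x):
        probs = self.policy(x)
        return int(np.random.choice(len(probs), p=probs))

class ModularActor:
    """A three-module actor in the style of Figure 1: the network's output is a = a^3."""
    def __init__(self, m1, m2, m3):
        self.m1, self.m2, self.m3 = m1, m2, m3

    def act(self, s):
        s = np.asarray(s, dtype=float)
        a1 = self.m1.act(s)                              # A^1 and A^2 read (elements of) the state
        a2 = self.m2.act(s)
        a3 = self.m3.act(np.concatenate([s, [a1, a2]]))  # A^3 reads the state and the other outputs
        return a3                                        # a = a^3
```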
12 This modular policy can also be written as a non-modular policy that is a function of θ = (θ1, θ2, θ3), i.e., as a single policy π(s, a, θ). [sent-28, score-0.995]
13 We assume that the modular policy is not recurrent. [sent-31, score-0.577]
14 Such modular policies appear frequently in models of the human brain, with modules corresponding to neurons or collections thereof [12, 16]. [sent-32, score-0.396]
15 Were another module to modify its policy such that ∂π/∂θi changes, a message would have to be sent to alert Ai of the exact changes so that it could update its estimate of ∂π/∂θi, which is biologically implausible. [sent-38, score-0.754]
16 [16] use gradient descent to update a modular actor, and are forced to assume that certain derivatives are always one in order to maintain realistic locality constraints. [sent-45, score-0.374]
17 This raises the question: could each module update given only local information that does not include explicit knowledge of ∂π/∂θi ? [sent-46, score-0.337]
18 We assume that a critic exists that broadcasts the TD error, so a module’s local information would consist of its input xi , which is not necessarily a Markov state representation, its output ai , and the TD error. [sent-47, score-0.55]
19 In this paper we present a class of algorithms, called policy gradient coagent networks (PGCNs), that do exactly this: they allow modules to update given only local information. [sent-49, score-0.976]
20 The agent observes the current state st, selects an action at based on st and θt, which is used to update the state according to P. [sent-56, score-0.404]
21 A policy is a mapping from states to probabilities of selecting each possible action. [sent-58, score-0.418]
22 A’s policy π may be parameterized by a vector, θ, such that π(s, a, θ) = Pr(at =a|st =s, θt =θ). [sent-59, score-0.418]
23 Let dθ(s) denote the stationary distribution over states of M under the policy induced by θ. [sent-61, score-0.418]
24 The average reward is JM(θ) = lim_{n→∞} (1/n) E[Σ_{t=0}^{n−1} rt | M, θ]. (1) The state-value function, which maps each state to the expected sum of differences between the reward received and the average reward when the agent follows the policy induced by θ starting in that state, is V^θ_M(s) = Σ_{t=0}^{∞} E[rt − JM(θ) | M, s0 = s, θ]. [sent-63, score-0.649]
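For readability, here is a cleaned-up rendering of the two quantities in the sentence above, in the standard average-reward form used by actor-critic methods such as [5]; the exact indexing in the original paper may differ slightly.

```latex
J_M(\theta) = \lim_{n \to \infty} \frac{1}{n}\, \mathbb{E}\!\left[\sum_{t=0}^{n-1} r_t \,\middle|\, M, \theta\right],
\qquad
V^{\theta}_{M}(s) = \sum_{t=0}^{\infty} \mathbb{E}\!\left[\, r_t - J_M(\theta) \,\middle|\, M,\ s_0 = s,\ \theta \right].
```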
25 1 Policy Gradient One approach to improving a policy for an MDP is to adjust the parameters θ to ascend the policy gradient, θ JM (θ). [sent-66, score-0.887]
26 For reviews of policy gradient methods, see [5, 15, 24]. [sent-67, score-0.573]
27 A common variable in policy gradient methods is the compatible features, ψsa = ∇θ log π(s, a, θ). [sent-68, score-0.597]
28 Some more advanced actor-critic methods ascend the natural policy gradient [1, 5, 15], G(θ)^{−1} ∇θ JM(θ), (8) where G(θ) = E_{s∼dθ(·), a∼π(s,·,θ)}[∇θ log π(s, a, θ) ∇θ log π(s, a, θ)^T] is the Fisher information matrix of the policy. [sent-76, score-0.663]
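As a rough illustration of Equation 8, the sketch below estimates the Fisher matrix from sampled score vectors and solves for the natural gradient direction. It is not the NAC algorithm of [5]; the function name, the ridge term, and the assumption that a vanilla gradient estimate is already available are all illustrative.

```python
import numpy as np

def natural_gradient(score_vectors, vanilla_grad, ridge=1e-6):
    """Estimate G(theta)^{-1} grad J from sampled scores psi = grad log pi(s, a, theta).

    score_vectors: array of shape (num_samples, num_params), one score per (s, a) sample
                   drawn with s ~ d^theta and a ~ pi(s, ., theta).
    vanilla_grad:  estimate of grad_theta J_M(theta), shape (num_params,).
    """
    psi = np.asarray(score_vectors)
    # Empirical Fisher information matrix: average of the outer products psi psi^T.
    G = psi.T @ psi / psi.shape[0]
    # A small ridge term keeps the solve well-posed when samples are few (implementation detail).
    G += ridge * np.eye(G.shape[0])
    return np.linalg.solve(G, vanilla_grad)
```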
29 To help differentiate between the two types of policy gradients, we refer to the non-natural policy gradient as the vanilla policy gradient hereafter. [sent-77, score-1.664]
30 One view of the natural gradient is that it corrects for the skewing of the vanilla gradient that is induced by a particular parameterization of the policy [2]. [sent-78, score-0.867]
31 One algorithm for ascending the natural policy gradient is the Natural-Gradient Actor-Critic with Advantage Parameters [5], which we abbreviate as NAC and use in our case study. [sent-80, score-0.612]
32 For example, if s = (s1 , s2 ) and f (s) = s1 so that the policy is a function of only s1 , then the update to θ requires knowledge of only s1 , a, and δt , and not s2 . [sent-82, score-0.451]
33 This is one crucial property that will allow the actor to update given only local information. [sent-83, score-0.344]
34 Hence, none of these methods allow for local updates to modular policies, which makes them undesirable from a biological standpoint, and impractical for policies for which this derivative is prohibitively difficult to compute. [sent-85, score-0.292]
35 By using the CoMDP framework and by taking advantage of Property 1, the updates to the actor can be modified to satisfy the locality constraint. [sent-87, score-0.376]
36 Though Thomas and Barto [25] present the CoMDP framework for the discounted reward setting with finite state, action, and reward spaces, the extension to the average reward and infinite setting used here is straightforward. [sent-90, score-0.297]
37 Consider a network of agents A1, …, An, where Ai has output ai ∈ Ai, where Ai is any space, though typically the reals or integers. [sent-94, score-0.289]
38 We focus on the case where the agents Ai = {Ai, Ci} are all actor-critics, i.e., each consists of an actor Ai and a critic Ci. [sent-96, score-0.29]
39 Each agent takes as input si, which contains the state of M and the outputs of an arbitrary number of other agents: si ∈ S × Πj Aj, where Πj Aj is the Cartesian product of the output sets of all the Aj whose output is used as input to Ai. [sent-108, score-0.27]
40 Thomas and Barto [25] call each Ai a coagent and the entire network a coagent network. [sent-113, score-0.363]
41 An agent Ai may treat the rest of the network and M as its environment, where it sees states s^i_t, takes actions a^i_t resulting in reward rt (the same for all Ai), and a transition to state s^i_{t+1}. [sent-114, score-0.732]
42 We write π^i(s^i, a^i, θ^i) to denote Ai's policy for M^i. [sent-116, score-0.66]
43 While [25] considers generic methods for handling this nonstationarity, we focus on the special case in which all Ai are policy gradient methods. [sent-119, score-0.573]
44 Theorem 3 of [25] states that the policy gradient of M can be decomposed into the policy gradients for all of the CoMDPs, M^i: ∇θ JM(θ^1, θ^2, …, θ^n) = ( ∂J_{M^1}/∂θ^1, ∂J_{M^2}/∂θ^2, …, ∂J_{M^n}/∂θ^n ). (9) [sent-120, score-1.061]
45 Thus, if each coagent computes and follows the policy gradient based on the local environment that it sees, the coagent network will follow its policy gradient on M. [sent-142, score-1.587]
46 Thomas and Barto [25] also show that the value functions for M and all the CoMDPs are the same for all st, if the additional state components of M^i are drawn according to the modular policy: V^θ_{M^1}(s^1_t) = V^θ_{M^2}(s^2_t) = … = V^θ_{M^n}(s^n_t) = V^θ_M(st). [sent-143, score-0.334]
47 Because all Ai share a global critic, C, all that remains of each module is the actor Ai . [sent-149, score-0.573]
48 However, if the actor's policy is a function of some xi = f(si) for any f, i.e., [sent-153, score-0.446]
49 the policy can be written as πi(xi, ai, θi), then, by Property 1, updates to the actor's policy require only the TD error, ai, and xi. [sent-155, score-1.407]
50 3 Methods: The CoMDP framework tells us that, if each module is an actor that computes the policy gradient for its local environment (CoMDP), then the entire modular actor will ascend its policy gradient. [sent-158, score-2.142]
51 Actor-critics satisfying Property 1 are able to perform their policy updates given only local information: the policy's input xt, the most recent action at, and the TD error δt. [sent-159, score-0.57]
52 Combining these two, each module Ai can compute its update given only its local input x^i_t, most recent action a^i_t, and the TD error δt. [sent-160, score-0.679]
53 We call any network of coagents, each using policy gradient methods, a policy gradient coagent network (PGCN). [sent-161, score-1.377]
54 One PGCN is the vanilla coagent network (VCN), which uses VAC for all modules (coagents), and maintains a global critic that computes and broadcasts δt . [sent-162, score-0.694]
55 Notice that δt ψ_{x^i_t a^i_t} is an unbiased estimate of the policy gradient for M^i [5], which is an unbiased estimate of part of the policy gradient for M by Equation 9. [sent-164, score-1.434]
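A minimal sketch of the resulting local update for one coagent, assuming we have a function that returns ∇_{θ^i} log π^i(x^i, a^i, θ^i); the names and the fixed step size are illustrative, not the paper's pseudocode.

```python
def coagent_update(theta_i, x_i, a_i, delta_t, grad_log_pi, alpha=0.01):
    """Vanilla policy gradient step for module A^i using only local information.

    theta_i:      parameters of module i (array-like).
    x_i, a_i:     the module's own input and the action it just took.
    delta_t:      TD error broadcast by the global critic.
    grad_log_pi:  function (x_i, a_i, theta_i) -> gradient of log pi^i w.r.t. theta_i.
    """
    psi = grad_log_pi(x_i, a_i, theta_i)    # compatible features psi_{x^i a^i}
    return theta_i + alpha * delta_t * psi  # unbiased step along module i's policy gradient
```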
56 The global critic observes (st, rt, st+1) tuples, updates its estimate Ĵ of the average reward, which it uses to compute the TD error δt, which is then broadcast to all of the modules, Ai. [sent-166, score-0.506]
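A minimal sketch of such a global critic, assuming a tabular average-reward TD critic with separate step sizes for the value table and the average-reward estimate Ĵ; the paper's VAC critic may differ in its details.

```python
class AverageRewardCritic:
    """Tabular average-reward critic: delta_t = r_t - J_hat + V(s_{t+1}) - V(s_t)."""
    def __init__(self, num_states, alpha_v=0.1, alpha_j=0.01):
        self.V = [0.0] * num_states    # differential state values
        self.J_hat = 0.0               # running estimate of the average reward
        self.alpha_v, self.alpha_j = alpha_v, alpha_j

    def td_error(self, s, r, s_next):
        delta = r - self.J_hat + self.V[s_next] - self.V[s]
        self.J_hat += self.alpha_j * delta   # update the average-reward estimate
        self.V[s] += self.alpha_v * delta    # TD(0) update of the differential value
        return delta                          # broadcast to every module
```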
57 Each module Ai draws its actions from π^i(x^i_t, ·, θ^i_t) and then computes updates to θ^i given its input x^i_t, action a^i_t, and the TD error δt, which was broadcast by the global critic. [sent-168, score-0.53]
58 To implement VCN, observe the current state st, compute the module outputs a^i_t, and then compute at = Γ(st, a^1_t, a^2_t, …). [sent-169, score-0.723]
59 This action will result in a transition to st+1 with reward rt . [sent-173, score-0.281]
60 Given st, rt, and st+1, the global critic can compute δt, which can then be used to train each module Ai. [sent-174, score-0.668]
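Putting the pieces together, one time step of VCN could look like the sketch below. The env, module, and critic interfaces (state, step, input, act, update, td_error) are assumptions made for illustration and follow the tabular critic sketched above; they are not taken from the paper.

```python
def vcn_step(env, modules, critic, gamma_fn):
    """One interaction step of a vanilla coagent network (VCN).

    modules:  list of coagents, each with .input(s, outputs_so_far), .act(x), .update(x, a, delta).
    critic:   global critic with .td_error(s, r, s_next), e.g. the AverageRewardCritic above.
    gamma_fn: the combining function Gamma(s, a^1, ..., a^n) producing the network's action.
    """
    s = env.state()
    inputs, actions = [], []
    for m in modules:                      # feed-forward (non-recurrent) evaluation of the modules
        x = m.input(s, actions)            # x^i may contain s and earlier modules' outputs
        a = m.act(x)                       # a^i ~ pi^i(x^i, ., theta^i)
        inputs.append(x)
        actions.append(a)

    a_t = gamma_fn(s, *actions)            # the action actually sent to the environment
    s_next, r = env.step(a_t)

    delta = critic.td_error(s, r, s_next)  # TD error, broadcast to every module
    for m, x, a in zip(modules, inputs, actions):
        m.update(x, a, delta)              # local update: needs only x^i, a^i, and delta
    return s_next
```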
61 4 The Decomposed Natural Policy Gradient: Another interesting PGCN, which we call a natural coagent network (NCN), would use coagents that ascend the natural policy gradient, e.g., NAC. [sent-177, score-0.821]
62 Here θ = (θ^1, …, θ^n), and the resulting estimate is an estimate of the natural policy gradient that we call the decomposed natural policy gradient, which has an implicit dependence on how θ is partitioned into n components. [sent-186, score-1.139]
63 The decomposed natural policy gradient is intuitively a trade-off between the natural policy gradient and the vanilla policy gradient depending on the granularity of modularization. [sent-188, score-1.967]
64 For example, if the policy is one module, A1, and Γ(s, a1) = a1, then the decomposed natural policy gradient is trivially the same as the natural policy gradient. [sent-189, score-1.557]
65 On the other hand, as the policy is broken into more and more modules, the gradient begins to differ more and more from the natural policy gradient, because the structure of the modular policy begins to influence the direction of the gradient. [sent-190, score-1.607]
66 If the modular actor contains one parameter per module and the module inputs are normalized, it is possible for G(θ)^{−1} = I, in which case the decomposed natural policy gradient will be equivalent to the vanilla policy gradient. [sent-192, score-2.215]
67 Hence, the coarser the modularization (fewer modules), the closer the decomposed natural policy gradient is to the natural policy gradient, while the finer the modularization (more modules), the closer the decomposed natural policy gradient may come to the vanilla policy gradient. [sent-193, score-2.389]
68 Each term of the decomposed natural policy gradient is within ninety degrees of the vanilla policy gradient, so a system will converge to a local optimum if it follows the decomposed natural policy gradient and the step size is decayed appropriately. [sent-194, score-1.903]
69 In this section we focus on VAC due to its simplicity, though the argument that stochasticity in the CoMDP is the root cause of the variance of gradient estimates carries over to PGCNs using other actor-critic methods as well. [sent-202, score-0.322]
70 [5] show that E[δt |st = s, at = a, M, θ] can be viewed as the advantage of taking action at in state st over following the policy induced by θ. [sent-206, score-0.665]
71 Thus, the gradient estimates are influenced by the stochasticity of the transition function P and reward function R. [sent-211, score-0.369]
72 Consider the modular actor from Figure 1 in the case where the transitions and rewards of M are deterministic. [sent-215, score-0.48]
73 We therefore expect policy gradient estimates for their parameters to have higher variance. [sent-219, score-0.606]
74 In summary, the stochasticity in the CoMDPs is responsible for VCN’s policy gradient estimates having higher variance than those of VAC. [sent-220, score-0.714]
75 We performed a simple study using the modular actor from Figure 1 on a 10 × 10 gridworld with deterministic actions {up, down, left, right}, a reward of −1 for all transitions, factored state (x̄, ȳ), and a terminal state at (10, 10). [sent-221, score-0.725]
76 For the modular actor, A1 = A2 = {0, 1}, A3 = {up, down, left, right}, A1 and A2 both received the full state (x̄, ȳ), and all modules used … [sent-222, score-0.39]
77 All modules also used softmax action selection: π^i(x^i, a, θ^i) = e^{τ θ^i_a · x^i} / Σ_{â ∈ A^i} e^{τ θ^i_â · x^i}, (15) where τ is a constant scaling the amount of exploration, and where the parameters θ^i for the ith module contain a weight vector θ^i_a for each action a ∈ A^i. [sent-245, score-0.593]
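A sketch of this softmax module policy, including the score vector a module needs for its local update; the class name and array shapes are illustrative. Note that with τ multiplying the logits, larger τ makes the policy greedier, which matches the later discussion of increasing τ3 to reduce A3's exploration.

```python
import numpy as np

class SoftmaxModule:
    """Module policy pi^i(x, a, theta) proportional to exp(tau * theta_a . x) (Equation 15)."""
    def __init__(self, num_actions, input_dim, tau=1.0):
        self.theta = np.zeros((num_actions, input_dim))  # one weight vector theta_a per action
        self.tau = tau                                   # larger tau -> greedier (less exploration)

    def probs(self, x):
        logits = self.tau * self.theta @ x
        logits -= logits.max()                           # numerical stability
        e = np.exp(logits)
        return e / e.sum()

    def act(self, x):
        return int(np.random.choice(self.theta.shape[0], p=self.probs(x)))

    def grad_log_pi(self, x, a):
        """Score d log pi / d theta: row a' gets tau * (1[a'=a] - pi(a'|x)) * x."""
        p = self.probs(x)
        grad = -self.tau * np.outer(p, x)
        grad[a] += self.tau * x
        return grad
```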
78 With all actor weights fixed and selected uniformly at random from (−1, 1), we first observed that the means of the updates δt ψ_{st, at, i} and δt ψ_{x^i_t, a^i_t} are approximately equal, as expected, and then computed the variance of both updates. [sent-247, score-0.4]
79 As predicted, the variance of the gradient estimates for each parameter of A1 and A2 is larger for VCN, though the variance of the gradient estimate for each parameter of A3 is similar for VCN and VAC. [sent-249, score-0.471]
80 6 Variance Mitigation: To mitigate the increase in the variance of gradient estimates, we observe that, in general, the additional variance due to the other modules can be completely removed for a module Ai if every other module is made deterministic. [sent-250, score-0.989]
81 This is not practical because every module must explore in order to learn. [sent-251, score-0.283]
82 However, we can approximate it by decreasing the exploration of each module, making its policy less stochastic and more greedy. [sent-252, score-0.497]
83 For example, every module could take a deterministic greedy action without performing any updates with probability 1 − ε for some ε ∈ [0, 1). [sent-253, score-0.414]
84 With probability ε the module would act using softmax action selection and update its parameters. [sent-254, score-0.388]
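A sketch of this ε-gated scheme for a single module, reusing the SoftmaxModule-style interface sketched earlier; the control flow, not the exact bookkeeping, is the point, and the deferred update shown in the trailing comment assumes the TD error arrives only after the environment step.

```python
import numpy as np

def gated_action(module, x, epsilon=0.1):
    """Pick one module's action under the epsilon-gated exploration scheme.

    With probability 1 - epsilon the module acts greedily and skips learning this step;
    with probability epsilon it samples from its softmax policy and should be updated
    once the global critic broadcasts the TD error. Returns (action, explore_flag).
    """
    if np.random.rand() >= epsilon:
        return int(np.argmax(module.probs(x))), False   # deterministic greedy, no parameter change
    return module.act(x), True                          # explore and learn this step

# Later, after the transition yields the broadcast TD error delta_t:
#   if explore_flag:
#       module.theta += alpha * delta_t * module.grad_log_pi(x, a)
```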
85 As ε → 0, the probability of two modules exploring simultaneously goes to zero, decreasing the variance in M i but also decreasing the percent of time steps during which each module trains. [sent-255, score-0.54]
86 When ε = 1, every module explores and updates on every step, so the algorithm is the original PGCN algorithm (VCN if using VAC for each module). [sent-256, score-0.342]
87 Thus, if the variance in gradient estimates precludes learning, we suggest making the policies of the modules more deterministic by decreasing exploration and increasing exploitation. [sent-260, score-0.514]
88 Lastly, is there a significant loss in performance when using the decomposed natural policy gradient as opposed to the true natural policy gradient? [sent-264, score-1.139]
89 Random policy parameters average less than −5000 reward per episode. [sent-297, score-0.517]
90 To perform a thorough analysis, we again use the modular actor depicted in Figure 1, as in Section 5. [sent-300, score-0.449]
91 We select the gridworld from Section 5, and again use a tabular critic in order to focus on the difference in policy improvements. [sent-302, score-0.639]
92 Though decreased exploration is beneficial in general, for this particular modular policy it is therefore particularly important that A3 ’s exploration be decreased by increasing τ3 . [sent-307, score-0.695]
93 The optimization does just this, balancing the trade-off between exploration and the variance of gradient estimates by selecting larger τ3 for VCN than VAC. [sent-308, score-0.298]
94 For NAC and NCN, the exploration parameters are identical, suggesting that the additional variance of gradient estimates was not significant. [sent-312, score-0.316]
95 This is likely due to the policy gradient estimates being filtered before being used. [sent-313, score-0.606]
96 The average rewards during a lifetime are similar, suggesting that, even though the variance of gradient estimates can be orders of magnitude larger for VCN with τ1,2 = τ3 = 1 (Figure 3(a)), exploration can be tuned such that learning speed is not significantly diminished. [sent-314, score-0.363]
97 8 Conclusion: We have devised a class of algorithms, policy gradient coagent networks (PGCNs), and two specific instantiations thereof, the natural coagent network (NCN) and vanilla coagent network (VCN), which allow modules within an actor to update given only local information. [sent-315, score-1.801]
98 We show that the NCN ascends the decomposed natural policy gradient, an approximation to the natural policy gradient, while VCN ascends the vanilla policy gradient. [sent-316, score-1.552]
99 We discussed the theoretical properties of both the decomposed natural policy gradient and the increase in the variance of gradient estimates when using PGCNs. [sent-317, score-0.921]
100 We point out how VCN’s similar performance is achieved by decreasing exploration in order to decrease the stochasticity of each module’s CoMDP, and thus the variance of the gradient estimates. [sent-320, score-0.342]
wordName wordTfidf (topN-words)
[('policy', 0.418), ('vcn', 0.368), ('actor', 0.29), ('module', 0.283), ('ai', 0.242), ('jm', 0.21), ('vac', 0.203), ('modules', 0.166), ('coagent', 0.165), ('critic', 0.165), ('modular', 0.159), ('gradient', 0.155), ('st', 0.135), ('comdp', 0.127), ('td', 0.121), ('ncn', 0.114), ('vm', 0.113), ('nac', 0.1), ('vanilla', 0.1), ('reward', 0.099), ('rt', 0.085), ('coagents', 0.076), ('pgcn', 0.076), ('action', 0.072), ('aj', 0.07), ('decomposed', 0.07), ('si', 0.066), ('pgcns', 0.063), ('reinforcement', 0.061), ('exploration', 0.059), ('updates', 0.059), ('stochasticity', 0.057), ('variance', 0.051), ('ascend', 0.051), ('comdps', 0.051), ('bhatnagar', 0.044), ('actions', 0.043), ('vt', 0.042), ('broadcast', 0.041), ('basal', 0.041), ('state', 0.04), ('natural', 0.039), ('ati', 0.038), ('xti', 0.038), ('decision', 0.038), ('sutton', 0.037), ('barto', 0.035), ('update', 0.033), ('broadcasts', 0.033), ('estimates', 0.033), ('agent', 0.033), ('network', 0.033), ('computes', 0.032), ('agents', 0.031), ('transitions', 0.031), ('ganglia', 0.031), ('policies', 0.03), ('gridworld', 0.029), ('lastly', 0.029), ('xi', 0.028), ('pr', 0.028), ('tabular', 0.027), ('locality', 0.027), ('though', 0.026), ('ascends', 0.025), ('critics', 0.025), ('lef', 0.025), ('modularization', 0.025), ('transition', 0.025), ('environment', 0.025), ('compatible', 0.024), ('prohibitively', 0.023), ('unbiased', 0.023), ('neurons', 0.023), ('outputs', 0.023), ('layer', 0.023), ('neurobiology', 0.023), ('dopamine', 0.022), ('actorcritic', 0.022), ('implausible', 0.022), ('varia', 0.022), ('ance', 0.022), ('thomas', 0.022), ('output', 0.021), ('brain', 0.021), ('local', 0.021), ('observes', 0.021), ('rivest', 0.021), ('lifetime', 0.021), ('decreasing', 0.02), ('biologically', 0.02), ('jt', 0.02), ('notice', 0.02), ('conjugate', 0.019), ('jth', 0.019), ('suggesting', 0.018), ('thereof', 0.018), ('ak', 0.018), ('mdp', 0.018), ('networks', 0.018)]
simIndex simValue paperId paperTitle
same-paper 1 1.0000001 215 nips-2011-Policy Gradient Coagent Networks
Author: Philip S. Thomas
Abstract: We present a novel class of actor-critic algorithms for actors consisting of sets of interacting modules. We present, analyze theoretically, and empirically evaluate an update rule for each module, which requires only local information: the module’s input, output, and the TD error broadcast by a critic. Such updates are necessary when computation of compatible features becomes prohibitively difficult and are also desirable to increase the biological plausibility of reinforcement learning methods. 1
2 0.36744022 16 nips-2011-A reinterpretation of the policy oscillation phenomenon in approximate policy iteration
Author: Paul Wagner
Abstract: A majority of approximate dynamic programming approaches to the reinforcement learning problem can be categorized into greedy value function methods and value-based policy gradient methods. The former approach, although fast, is well known to be susceptible to the policy oscillation phenomenon. We take a fresh view to this phenomenon by casting a considerable subset of the former approach as a limiting special case of the latter. We explain the phenomenon in terms of this view and illustrate the underlying mechanism with artificial examples. We also use it to derive the constrained natural actor-critic algorithm that can interpolate between the aforementioned approaches. In addition, it has been suggested in the literature that the oscillation phenomenon might be subtly connected to the grossly suboptimal performance in the Tetris benchmark problem of all attempted approximate dynamic programming methods. We report empirical evidence against such a connection and in favor of an alternative explanation. Finally, we report scores in the Tetris problem that improve on existing dynamic programming based results. 1
3 0.21635127 283 nips-2011-The Fixed Points of Off-Policy TD
Author: J. Z. Kolter
Abstract: Off-policy learning, the ability for an agent to learn about a policy other than the one it is following, is a key element of Reinforcement Learning, and in recent years there has been much work on developing Temporal Different (TD) algorithms that are guaranteed to converge under off-policy sampling. It has remained an open question, however, whether anything can be said a priori about the quality of the TD solution when off-policy sampling is employed with function approximation. In general the answer is no: for arbitrary off-policy sampling the error of the TD solution can be unboundedly large, even when the approximator can represent the true value function well. In this paper we propose a novel approach to address this problem: we show that by considering a certain convex subset of off-policy distributions we can indeed provide guarantees as to the solution quality similar to the on-policy case. Furthermore, we show that we can efficiently project on to this convex set using only samples generated from the system. The end result is a novel TD algorithm that has approximation guarantees even in the case of off-policy sampling and which empirically outperforms existing TD methods. 1
4 0.21537831 163 nips-2011-MAP Inference for Bayesian Inverse Reinforcement Learning
Author: Jaedeug Choi, Kee-eung Kim
Abstract: The difficulty in inverse reinforcement learning (IRL) arises in choosing the best reward function since there are typically an infinite number of reward functions that yield the given behaviour data as optimal. Using a Bayesian framework, we address this challenge by using the maximum a posteriori (MAP) estimation for the reward function, and show that most of the previous IRL algorithms can be modeled into our framework. We also present a gradient method for the MAP estimation based on the (sub)differentiability of the posterior distribution. We show the effectiveness of our approach by comparing the performance of the proposed method to those of the previous algorithms. 1
5 0.18916936 174 nips-2011-Monte Carlo Value Iteration with Macro-Actions
Author: Zhan Lim, Lee Sun, Daniel J. Hsu
Abstract: POMDP planning faces two major computational challenges: large state spaces and long planning horizons. The recently introduced Monte Carlo Value Iteration (MCVI) can tackle POMDPs with very large discrete state spaces or continuous state spaces, but its performance degrades when faced with long planning horizons. This paper presents Macro-MCVI, which extends MCVI by exploiting macro-actions for temporal abstraction. We provide sufficient conditions for Macro-MCVI to inherit the good theoretical properties of MCVI. Macro-MCVI does not require explicit construction of probabilistic models for macro-actions and is thus easy to apply in practice. Experiments show that Macro-MCVI substantially improves the performance of MCVI with suitable macro-actions. 1
6 0.18318164 18 nips-2011-Action-Gap Phenomenon in Reinforcement Learning
7 0.18279879 88 nips-2011-Environmental statistics and the trade-off between model-based and TD learning in humans
8 0.17596577 278 nips-2011-TD gamma: Re-evaluating Complex Backups in Temporal Difference Learning
9 0.159353 212 nips-2011-Periodic Finite State Controllers for Efficient POMDP and DEC-POMDP Planning
10 0.14915675 10 nips-2011-A Non-Parametric Approach to Dynamic Programming
11 0.13470536 245 nips-2011-Selecting the State-Representation in Reinforcement Learning
12 0.12138724 300 nips-2011-Variance Reduction in Monte-Carlo Tree Search
13 0.11948993 292 nips-2011-Two is better than one: distinct roles for familiarity and recollection in retrieving palimpsest memories
14 0.11706511 50 nips-2011-Budgeted Optimization with Concurrent Stochastic-Duration Experiments
15 0.11559536 39 nips-2011-Approximating Semidefinite Programs in Sublinear Time
16 0.11499006 210 nips-2011-PAC-Bayesian Analysis of Contextual Bandits
17 0.10913777 36 nips-2011-Analysis and Improvement of Policy Gradient Estimation
18 0.10532475 190 nips-2011-Nonlinear Inverse Reinforcement Learning with Gaussian Processes
19 0.079649597 206 nips-2011-Optimal Reinforcement Learning for Gaussian Systems
20 0.077092044 145 nips-2011-Learning Eigenvectors for Free
topicId topicWeight
[(0, 0.174), (1, -0.23), (2, 0.145), (3, 0.274), (4, -0.315), (5, 0.097), (6, -0.007), (7, 0.016), (8, -0.062), (9, -0.076), (10, 0.125), (11, -0.01), (12, 0.005), (13, -0.003), (14, -0.065), (15, 0.013), (16, 0.009), (17, 0.087), (18, 0.005), (19, 0.001), (20, 0.012), (21, -0.094), (22, 0.041), (23, 0.029), (24, -0.01), (25, -0.052), (26, 0.016), (27, 0.03), (28, 0.071), (29, 0.007), (30, 0.056), (31, 0.045), (32, 0.042), (33, -0.032), (34, -0.029), (35, 0.042), (36, -0.048), (37, 0.081), (38, 0.01), (39, 0.001), (40, -0.002), (41, -0.001), (42, -0.028), (43, -0.039), (44, -0.028), (45, 0.058), (46, -0.108), (47, 0.039), (48, -0.002), (49, 0.153)]
simIndex simValue paperId paperTitle
same-paper 1 0.9740988 215 nips-2011-Policy Gradient Coagent Networks
Author: Philip S. Thomas
Abstract: We present a novel class of actor-critic algorithms for actors consisting of sets of interacting modules. We present, analyze theoretically, and empirically evaluate an update rule for each module, which requires only local information: the module’s input, output, and the TD error broadcast by a critic. Such updates are necessary when computation of compatible features becomes prohibitively difficult and are also desirable to increase the biological plausibility of reinforcement learning methods. 1
2 0.88495618 16 nips-2011-A reinterpretation of the policy oscillation phenomenon in approximate policy iteration
Author: Paul Wagner
Abstract: A majority of approximate dynamic programming approaches to the reinforcement learning problem can be categorized into greedy value function methods and value-based policy gradient methods. The former approach, although fast, is well known to be susceptible to the policy oscillation phenomenon. We take a fresh view to this phenomenon by casting a considerable subset of the former approach as a limiting special case of the latter. We explain the phenomenon in terms of this view and illustrate the underlying mechanism with artificial examples. We also use it to derive the constrained natural actor-critic algorithm that can interpolate between the aforementioned approaches. In addition, it has been suggested in the literature that the oscillation phenomenon might be subtly connected to the grossly suboptimal performance in the Tetris benchmark problem of all attempted approximate dynamic programming methods. We report empirical evidence against such a connection and in favor of an alternative explanation. Finally, we report scores in the Tetris problem that improve on existing dynamic programming based results. 1
3 0.81744993 36 nips-2011-Analysis and Improvement of Policy Gradient Estimation
Author: Tingting Zhao, Hirotaka Hachiya, Gang Niu, Masashi Sugiyama
Abstract: Policy gradient is a useful model-free reinforcement learning approach, but it tends to suffer from instability of gradient estimates. In this paper, we analyze and improve the stability of policy gradient methods. We first prove that the variance of gradient estimates in the PGPE (policy gradients with parameter-based exploration) method is smaller than that of the classical REINFORCE method under a mild assumption. We then derive the optimal baseline for PGPE, which contributes to further reducing the variance. We also theoretically show that PGPE with the optimal baseline is more preferable than REINFORCE with the optimal baseline in terms of the variance of gradient estimates. Finally, we demonstrate the usefulness of the improved PGPE method through experiments. 1
4 0.76875228 10 nips-2011-A Non-Parametric Approach to Dynamic Programming
Author: Oliver B. Kroemer, Jan R. Peters
Abstract: In this paper, we consider the problem of policy evaluation for continuousstate systems. We present a non-parametric approach to policy evaluation, which uses kernel density estimation to represent the system. The true form of the value function for this model can be determined, and can be computed using Galerkin’s method. Furthermore, we also present a unified view of several well-known policy evaluation methods. In particular, we show that the same Galerkin method can be used to derive Least-Squares Temporal Difference learning, Kernelized Temporal Difference learning, and a discrete-state Dynamic Programming solution, as well as our proposed method. In a numerical evaluation of these algorithms, the proposed approach performed better than the other methods. 1
5 0.75166899 174 nips-2011-Monte Carlo Value Iteration with Macro-Actions
Author: Zhan Lim, Lee Sun, Daniel J. Hsu
Abstract: POMDP planning faces two major computational challenges: large state spaces and long planning horizons. The recently introduced Monte Carlo Value Iteration (MCVI) can tackle POMDPs with very large discrete state spaces or continuous state spaces, but its performance degrades when faced with long planning horizons. This paper presents Macro-MCVI, which extends MCVI by exploiting macro-actions for temporal abstraction. We provide sufficient conditions for Macro-MCVI to inherit the good theoretical properties of MCVI. Macro-MCVI does not require explicit construction of probabilistic models for macro-actions and is thus easy to apply in practice. Experiments show that Macro-MCVI substantially improves the performance of MCVI with suitable macro-actions. 1
6 0.74752623 50 nips-2011-Budgeted Optimization with Concurrent Stochastic-Duration Experiments
7 0.74124151 212 nips-2011-Periodic Finite State Controllers for Efficient POMDP and DEC-POMDP Planning
8 0.65266782 163 nips-2011-MAP Inference for Bayesian Inverse Reinforcement Learning
9 0.60167581 237 nips-2011-Reinforcement Learning using Kernel-Based Stochastic Factorization
10 0.57392931 300 nips-2011-Variance Reduction in Monte-Carlo Tree Search
11 0.56099069 18 nips-2011-Action-Gap Phenomenon in Reinforcement Learning
12 0.55977666 283 nips-2011-The Fixed Points of Off-Policy TD
13 0.50911438 88 nips-2011-Environmental statistics and the trade-off between model-based and TD learning in humans
14 0.48782176 65 nips-2011-Convergent Fitted Value Iteration with Linear Function Approximation
15 0.46652758 278 nips-2011-TD gamma: Re-evaluating Complex Backups in Temporal Difference Learning
16 0.43161541 190 nips-2011-Nonlinear Inverse Reinforcement Learning with Gaussian Processes
17 0.3985053 245 nips-2011-Selecting the State-Representation in Reinforcement Learning
18 0.38628125 256 nips-2011-Solving Decision Problems with Limited Information
19 0.34462646 11 nips-2011-A Reinforcement Learning Theory for Homeostatic Regulation
20 0.34282947 292 nips-2011-Two is better than one: distinct roles for familiarity and recollection in retrieving palimpsest memories
topicId topicWeight
[(4, 0.049), (11, 0.017), (20, 0.015), (25, 0.311), (26, 0.017), (31, 0.153), (33, 0.021), (43, 0.041), (45, 0.074), (57, 0.026), (74, 0.034), (83, 0.038), (89, 0.012), (99, 0.074)]
simIndex simValue paperId paperTitle
same-paper 1 0.78369159 215 nips-2011-Policy Gradient Coagent Networks
Author: Philip S. Thomas
Abstract: We present a novel class of actor-critic algorithms for actors consisting of sets of interacting modules. We present, analyze theoretically, and empirically evaluate an update rule for each module, which requires only local information: the module’s input, output, and the TD error broadcast by a critic. Such updates are necessary when computation of compatible features becomes prohibitively difficult and are also desirable to increase the biological plausibility of reinforcement learning methods. 1
2 0.64480215 269 nips-2011-Spike and Slab Variational Inference for Multi-Task and Multiple Kernel Learning
Author: Miguel Lázaro-gredilla, Michalis K. Titsias
Abstract: We introduce a variational Bayesian inference algorithm which can be widely applied to sparse linear models. The algorithm is based on the spike and slab prior which, from a Bayesian perspective, is the golden standard for sparse inference. We apply the method to a general multi-task and multiple kernel learning model in which a common set of Gaussian process functions is linearly combined with task-specific sparse weights, thus inducing relation between tasks. This model unifies several sparse linear models, such as generalized linear models, sparse factor analysis and matrix factorization with missing values, so that the variational algorithm can be applied to all these cases. We demonstrate our approach in multioutput Gaussian process regression, multi-class classification, image processing applications and collaborative filtering. 1
3 0.61453837 64 nips-2011-Convergent Bounds on the Euclidean Distance
Author: Yoonho Hwang, Hee-kap Ahn
Abstract: Given a set V of n vectors in d-dimensional space, we provide an efficient method for computing quality upper and lower bounds of the Euclidean distances between a pair of vectors in V . For this purpose, we define a distance measure, called the MS-distance, by using the mean and the standard deviation values of vectors in V . Once we compute the mean and the standard deviation values of vectors in V in O(dn) time, the MS-distance provides upper and lower bounds of Euclidean distance between any pair of vectors in V in constant time. Furthermore, these bounds can be refined further in such a way to converge monotonically to the exact Euclidean distance within d refinement steps. An analysis on a random sequence of refinement steps shows that the MS-distance provides very tight bounds in only a few refinement steps. The MS-distance can be used to various applications where the Euclidean distance is used to measure the proximity or similarity between objects. We provide experimental results on the nearest and the farthest neighbor searches. 1
4 0.52591532 249 nips-2011-Sequence learning with hidden units in spiking neural networks
Author: Johanni Brea, Walter Senn, Jean-pascal Pfister
Abstract: We consider a statistical framework in which recurrent networks of spiking neurons learn to generate spatio-temporal spike patterns. Given biologically realistic stochastic neuronal dynamics we derive a tractable learning rule for the synaptic weights towards hidden and visible neurons that leads to optimal recall of the training sequences. We show that learning synaptic weights towards hidden neurons significantly improves the storing capacity of the network. Furthermore, we derive an approximate online learning rule and show that our learning rule is consistent with Spike-Timing Dependent Plasticity in that if a presynaptic spike shortly precedes a postynaptic spike, potentiation is induced and otherwise depression is elicited.
5 0.52133107 75 nips-2011-Dynamical segmentation of single trials from population neural data
Author: Biljana Petreska, Byron M. Yu, John P. Cunningham, Gopal Santhanam, Stephen I. Ryu, Krishna V. Shenoy, Maneesh Sahani
Abstract: Simultaneous recordings of many neurons embedded within a recurrentlyconnected cortical network may provide concurrent views into the dynamical processes of that network, and thus its computational function. In principle, these dynamics might be identified by purely unsupervised, statistical means. Here, we show that a Hidden Switching Linear Dynamical Systems (HSLDS) model— in which multiple linear dynamical laws approximate a nonlinear and potentially non-stationary dynamical process—is able to distinguish different dynamical regimes within single-trial motor cortical activity associated with the preparation and initiation of hand movements. The regimes are identified without reference to behavioural or experimental epochs, but nonetheless transitions between them correlate strongly with external events whose timing may vary from trial to trial. The HSLDS model also performs better than recent comparable models in predicting the firing rate of an isolated neuron based on the firing rates of others, suggesting that it captures more of the “shared variance” of the data. Thus, the method is able to trace the dynamical processes underlying the coordinated evolution of network activity in a way that appears to reflect its computational role. 1
6 0.51872206 241 nips-2011-Scalable Training of Mixture Models via Coresets
7 0.51559812 137 nips-2011-Iterative Learning for Reliable Crowdsourcing Systems
8 0.51452875 229 nips-2011-Query-Aware MCMC
9 0.51309574 240 nips-2011-Robust Multi-Class Gaussian Process Classification
10 0.50795078 278 nips-2011-TD gamma: Re-evaluating Complex Backups in Temporal Difference Learning
11 0.50780159 23 nips-2011-Active dendrites: adaptation to spike-based communication
12 0.50638437 66 nips-2011-Crowdclustering
13 0.5031898 180 nips-2011-Multiple Instance Filtering
14 0.50192428 243 nips-2011-Select and Sample - A Model of Efficient Neural Inference and Learning
15 0.50165981 221 nips-2011-Priors over Recurrent Continuous Time Processes
16 0.50161004 37 nips-2011-Analytical Results for the Error in Filtering of Gaussian Processes
17 0.4984405 225 nips-2011-Probabilistic amplitude and frequency demodulation
18 0.49742356 88 nips-2011-Environmental statistics and the trade-off between model-based and TD learning in humans
19 0.49691761 300 nips-2011-Variance Reduction in Monte-Carlo Tree Search
20 0.4968681 206 nips-2011-Optimal Reinforcement Learning for Gaussian Systems