nips nips2011 nips2011-212 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Joni K. Pajarinen, Jaakko Peltonen
Abstract: Applications such as robot control and wireless communication require planning under uncertainty. Partially observable Markov decision processes (POMDPs) plan policies for single agents under uncertainty and their decentralized versions (DEC-POMDPs) find a policy for multiple agents. The policy in infinite-horizon POMDP and DEC-POMDP problems has been represented as finite state controllers (FSCs). We introduce a novel class of periodic FSCs, composed of layers connected only to the previous and next layer. Our periodic FSC method finds a deterministic finite-horizon policy and converts it to an initial periodic infinite-horizon policy. This policy is optimized by a new infinite-horizon algorithm to yield deterministic periodic policies, and by a new expectation maximization algorithm to yield stochastic periodic policies. Our method yields better results than earlier planning methods and can compute larger solutions than with regular FSCs.
Reference: text
sentIndex sentText sentNum sentScore
1 Partially observable Markov decision processes (POMDPs) plan policies for single agents under uncertainty and their decentralized versions (DEC-POMDPs) find a policy for multiple agents. [sent-10, score-0.585]
2 The policy in infinite-horizon POMDP and DEC-POMDP problems has been represented as finite state controllers (FSCs). [sent-11, score-0.474]
3 We introduce a novel class of periodic FSCs, composed of layers connected only to the previous and next layer. [sent-12, score-0.398]
4 Our periodic FSC method finds a deterministic finite-horizon policy and converts it to an initial periodic infinite-horizon policy. [sent-13, score-1.032]
5 This policy is optimized by a new infinite-horizon algorithm to yield deterministic periodic policies, and by a new expectation maximization algorithm to yield stochastic periodic policies. [sent-14, score-1.121]
6 The planning task can often be represented as a reinforcement learning problem, where an action policy controls the behavior of an agent, and the quality of the policy is optimized to maximize a reward function. [sent-18, score-0.663]
7 Single agent policies can be optimized with partially observable Markov decision processes (POMDPs) [1], when the world state is uncertain. [sent-19, score-0.284]
8 Decentralized POMDPs (DEC-POMDPs) [2] optimize policies for multiple agents that act without direct communication, with separate observations and beliefs of the world state, to maximize a joint reward function. [sent-20, score-0.383]
9 POMDP and DEC-POMDP methods use various representations for the policies, such as value functions [3], graphs [4, 5], or finite state controllers (FSCs) [6, 7, 8, 9, 10]. [sent-21, score-0.239]
10 We introduce a new policy representation: periodic finite state controllers, which can be seen as an intelligent restriction which speeds up optimization and can yield better solutions. [sent-24, score-0.659]
11 A periodic FSC is composed of several layers (subsets of states), and transitions are only allowed to states in the next layer, and from the final layer to the first. [sent-25, score-0.572]
12 Policies proceed through layers in a periodic fashion, and policy optimization determines the probabilities of state transitions and action choices to maximize reward. [sent-26, score-0.74]
13 Secondly, we give a method to transform the finite-horizon FSC into an initial infinite-horizon periodic FSC. [sent-29, score-0.368]
14 Fourthly, we introduce an expectation-maximization (EM) training algorithm for planning with periodic FSCs. [sent-31, score-0.432]
15 We show that the resulting method performs better than earlier DEC-POMDP methods and POMDP methods with a restricted-size policy, and that use of periodic FSCs enables computing larger solutions than with regular FSCs. [sent-32, score-0.621]
16 Online execution has complexity O(const) for deterministic FSCs and O(log(FSC layer width)) for stochastic FSCs. [sent-33, score-0.242]
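To make the execution-complexity claim concrete, here is a minimal sketch (the table layout and helper names are assumptions for illustration, not from the paper): a deterministic periodic FSC acts by constant-time table lookups indexed by t mod M, while a stochastic FSC samples from per-node distributions, e.g. by binary search over cumulative weights, which costs O(log width).

```python
import bisect
import random

def step_deterministic(t, node, obs, action_table, next_table, M):
    """O(1) per step: the layer is t mod M, and action/successor are table lookups."""
    m = t % M
    return action_table[m][node], next_table[m][node][obs]

def step_stochastic(t, node, obs, action_cdf, trans_cdf, M, rng=random):
    """O(log width) per step: draw the action and the successor node by binary
    search on precomputed cumulative distributions for layer t mod M."""
    m = t % M
    a = bisect.bisect(action_cdf[m][node], rng.random())
    q_next = bisect.bisect(trans_cdf[m][node][obs], rng.random())
    return a, q_next
```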
17 In Section 3 we introduce the novel concept of periodic FSCs. [sent-35, score-0.368]
18 We then describe the stages of our method: improving finite-horizon solutions, transforming them to periodic infinite-horizon solutions, and improving the periodic solutions by a novel EM algorithm for (DEC-)POMDPs (Section 3.2). [sent-36, score-0.754]
19 POMDPs optimize policies for a single agent with uncertainty of the environment state while DEC-POMDPs optimize policies for several agents with uncertainty of the environment state and each other’s states. [sent-40, score-0.568]
20 Given the actions of the agents the environment evolves according to a Markov model. [sent-41, score-0.187]
21 The agents’ policies are optimized to maximize the expected reward earned for actions into the future. [sent-42, score-0.214]
22 For infinite-horizon DEC-POMDP problems, state of the art methods [8, 12] store the policy as a stochastic finite state controller (FSC) for each agent which keeps the policy size bounded. [sent-45, score-0.746]
23 In a recent variant called Mealy NLP [16], the NLP-based approach to DEC-POMDPs is adapted to FSC policies represented by Mealy machines instead of traditional Moore machine representations. [sent-52, score-0.315]
24 In POMDPs, Mealy machine based controllers can achieve equal or better solutions than Moore controllers of the same size. [sent-53, score-0.422]
25 We introduce an approach where FSCs have a periodic layer structure, which turns out to yield good results. [sent-55, score-0.542]
26 1 Infinite-horizon DEC-POMDP: definition The tuple {αi }, S, {Ai }, P, {Ωi }, O, R, b0 , γ defines an infinite-horizon DEC-POMDP problem for N agents αi , where S is the set of environment states and Ai and Ωi are the sets of possible actions and observations for agent αi . [sent-57, score-0.292]
27 P(s′|s, a) is the probability to move from state s to s′, given the actions of all agents (jointly denoted a = a1, ..., aN). [sent-59, score-0.199]
28 O(o|a, s′) is the probability of the joint observation o = o1, ..., oN, where oi is the observation of agent i, when actions a were taken and the environment transitioned to state s′. [sent-66, score-0.399]
29 For brevity, we denote transition probabilities given the actions by P_{s′sa}, observation probabilities by P_{os′a}, reward functions by R_{sa}, and the set of all agents other than i by ī. At each time step, agents perform actions, the environment state changes, and agents receive observations. [sent-69, score-0.588]
30 The goal is to find a joint policy π for the agents that maximizes the expected discounted infinite-horizon reward E[ Σ_{t=0}^{∞} γ^t R_{s(t)a(t)} | π ], where γ is the discount factor, s(t) and a(t) are the state and action at time t, and E[·|π] denotes expected value under policy π. [sent-71, score-0.794]
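As an illustration of this objective, a rough Monte Carlo estimate of E[ Σ_t γ^t R_{s(t)a(t)} | π ] could be computed as below; the model callables (step, observe, reward) and the joint-policy interface are hypothetical placeholders, not part of the paper.

```python
import random

def estimate_return(b0, step, observe, reward, policy, gamma, episodes=1000, horizon=200):
    """Monte Carlo estimate of the expected discounted reward of a joint policy.
    b0: dict state -> probability; step(s, a) -> next state; observe(a, s_next) -> joint
    observation; reward(s, a) -> float; policy has reset(), act() -> joint action,
    and update(obs) for the controllers' node transitions."""
    states, probs = list(b0.keys()), list(b0.values())
    total = 0.0
    for _ in range(episodes):
        s = random.choices(states, weights=probs)[0]
        policy.reset()
        discount = 1.0
        for _ in range(horizon):          # truncate the infinite horizon
            a = policy.act()
            total += discount * reward(s, a)
            s_next = step(s, a)
            policy.update(observe(a, s_next))
            s, discount = s_next, discount * gamma
    return total / episodes
```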
31 Here, the policy is stored as a set of stochastic finite state controllers (FSCs), one for each agent. [sent-72, score-0.5]
32 Figure 1: Left: influence diagram for a DEC-POMDP with finite state controllers q, states s, joint observations o, joint actions a and reward r (given by a reward function R(s, a)). [sent-82, score-0.412]
33 Right: an example of the new periodic finite state controller, with three layers and three nodes in each layer, and possible transitions shown as arrows. [sent-84, score-0.478]
34 Which layer is active depends only on the current time; which node is active, and which action is chosen, depend on transition probabilities and action probabilities of the controller. [sent-86, score-0.362]
35 The policies are optimized by optimizing the parameters ν_{qi}, π_{ai qi}, and λ_{q′i qi oi}. [sent-92, score-0.859]
36 3 Periodic finite state controllers: State-of-the-art algorithms [6, 13, 8, 12, 16] for optimizing POMDP/DEC-POMDP policies with restricted-size FSCs find a local optimum. [sent-94, score-0.33]
37 We introduce periodic FSCs, which allow the use of much larger controllers with a small complexity increase, efficient FSC initialization, and new dynamic programming algorithms for FSCs. [sent-99, score-0.616]
38 A periodic FSC is composed of M layers of controller nodes. [sent-100, score-0.488]
39 Nodes in each layer are connected only to nodes in the next layer: the first layer is connected to the second, the second layer to the third and so on, and the last layer is connected to the first. [sent-101, score-0.663]
40 The width of a periodic FSC is the number of controller nodes in a layer. [sent-102, score-0.501]
41 A periodic FSC has different action and transition probabilities for each layer. [sent-105, score-0.438]
42 π^(m)_{ai qi} is the layer-m probability to take action ai when in node qi, and λ^(m)_{q′i qi oi} is the layer-m probability to move from node qi to q′i when observing oi. [sent-106, score-2.353]
43 Each layer connects only to the next one, so the policy cycles periodically through the layers: for t ≥ M we have π^(t)_{ai qi} = π^(t mod M)_{ai qi} and λ^(t)_{q′i qi oi} = λ^(t mod M)_{q′i qi oi}, where ‘mod’ denotes remainder. [sent-107, score-0.428]
44 Figure 1 (right) shows an example periodic FSC. [sent-108, score-1.854]
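A minimal sketch of one agent's periodic FSC under this parameterization (the array shapes and random initialization are assumptions for illustration): π^(m)_{ai qi} and λ^(m)_{q′i qi oi} are stored per layer, and time t indexes layer t mod M. A deterministic controller is the special case where every row of pi and lam is a one-hot vector.

```python
import numpy as np

class PeriodicFSC:
    """One agent's periodic FSC: M layers, `width` nodes per layer."""
    def __init__(self, M, width, n_actions, n_obs, seed=0):
        rng = np.random.default_rng(seed)
        # pi[m, q, a]      = P(action a | node q, layer m)
        self.pi = rng.dirichlet(np.ones(n_actions), size=(M, width))
        # lam[m, q, o, q'] = P(next node q' in layer (m+1) mod M | node q, obs o)
        self.lam = rng.dirichlet(np.ones(width), size=(M, width, n_obs))
        self.M = M
        self._rng = rng

    def act(self, t, q):
        m = t % self.M
        return self._rng.choice(self.pi.shape[2], p=self.pi[m, q])

    def next_node(self, t, q, obs):
        m = t % self.M
        return self._rng.choice(self.lam.shape[3], p=self.lam[m, q, obs])
```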
45 We now introduce our method for solving (DEC-)POMDPs with periodic FSC policies. [sent-109, score-0.368]
46 We show that the periodic FSC structure allows efficient computation of deterministic controllers, show how to optimize periodic stochastic FSCs, and show how a periodic deterministic controller can be used as initialization to a stochastic controller. [sent-110, score-1.424]
47 3.1 Deterministic periodic finite state controllers: In a deterministic FSC, actions and node transitions are deterministic functions of the current node and observation. [sent-113, score-0.905]
48 To optimize deterministic periodic FSCs we first compute a non-periodic finite-horizon policy. [sent-114, score-0.457]
49 The finite-horizon policy is transformed into a periodic infinite-horizon policy by connecting the last layer to the first layer, and the resulting deterministic policy can then be improved with a new algorithm (see Section 3.1.2). [sent-115, score-1.444]
50 A periodic deterministic policy can also be used as initialization for a stochastic FSC optimizer based on expectation maximization (see Section 3.2). [sent-118, score-0.718]
51 3.1.1 Deterministic finite-horizon controllers: We briefly discuss existing methods for deterministic finite-horizon controllers and introduce an improved finite-horizon method, which we use as the initial solution for infinite-horizon controllers. [sent-122, score-0.465]
52 State-of-the-art point based finite-horizon DEC-POMDP methods [4, 5] optimize a policy graph, with restricted width, for each agent. [sent-123, score-0.263]
53 They compute a policy for a single belief, instead of all possible beliefs. [sent-124, score-0.235]
54 At each time step a policy is computed for each policy graph node, by assuming that the nodes all agents are in are associated with the same belief. [sent-127, score-0.667]
55 In a POMDP, computing the deterministic policy for a policy graph node means finding the best action, and the best connection (best next node) for each observation; this can be done with a direct search. [sent-128, score-0.632]
56 A more efficient way is to go through all action combinations, for each action combination sample random policies for all agents, and then improve the policy of each agent in turn while holding the other agents’ policies fixed. [sent-130, score-0.599]
57 This is not guaranteed to find the best policy for a belief, but has yielded good results in the Point-Based Policy Generation (PBPG) algorithm [5]. [sent-131, score-0.235]
58 PBPG used linear programming to find policies for each agent and action-combination, but with a fixed joint action and fixed policies of other agents we can use fast and simple direct search as follows. [sent-133, score-0.462]
59 Construct an initial policy graph for each agent, starting from horizon t = T : (1) Project the initial belief along a random trajectory to horizon t to yield a sampled belief b(s) over world states. [sent-135, score-0.489]
60 (2) Add, to the graph of each agent, a node to layer t. [sent-136, score-0.256]
61 Find the best connections to the next layer as follows. [sent-137, score-0.187]
62 The best connections and action combination a become the policy for the current policy graph node. [sent-139, score-0.584]
63 (3) Run (1)-(2) until the graph layer has enough nodes. [sent-140, score-0.189]
64 At each layer, we optimize each agent separately: for each graph node qi of agent i, for each action ai of the agent, and for each observation oi we optimize the (deterministic) connection to the next layer. [sent-144, score-0.961]
65 Our optimization monotonically improves the value of a fixed size policy graph and converges to a local optimum. [sent-150, score-0.269]
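The per-node computation in this sweep can be made concrete with the following single-agent (POMDP) simplification; in the DEC-POMDP case the same direct search is run with a fixed joint action and the other agents' policies held fixed. The array layout is an assumption, not the authors' code.

```python
import numpy as np

def best_node_policy(b, P, O, R, V_next, gamma):
    """Direct search for one policy-graph node with sampled belief b over states.
    P[a, s, s'] transition, O[a, s', o] observation, R[s, a] reward,
    V_next[q', s'] value of next-layer nodes. Returns (value, action, next node per obs)."""
    n_actions = P.shape[0]
    best_val, best_a, best_conn = -np.inf, None, None
    for a in range(n_actions):
        b_next = b @ P[a]                                  # P(s' | b, a)
        # Q[o, q'] = sum over s' of P(s', o | b, a) * V_next(q', s')
        Q = np.einsum('so,qs->oq', O[a] * b_next[:, None], V_next)
        conn = Q.argmax(axis=1)                            # best next node per observation
        val = b @ R[:, a] + gamma * Q.max(axis=1).sum()
        if val > best_val:
            best_val, best_a, best_conn = val, a, conn
    return best_val, best_a, best_conn
```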
66 3.1.2 Deterministic infinite-horizon controllers: To initialize an infinite-horizon problem, we transform a deterministic finite-horizon policy graph (computed as in Section 3.1.1) [sent-158, score-0.061]
67 into an infinite-horizon periodic controller by connecting the last layer to the first. [sent-160, score-0.368]
68 Assuming controllers start from policy graph node 1, we compute policies for the other nodes in the first layer with beliefs sampled for time step M + 1, where M is the length of the controller period. [sent-161, score-0.98]
69 It remains to compute the (deterministic) connections from the last layer to the first: approximately optimal connections are found using the beliefs at the last layer and the value function projected from the last layer through the graph to the first layer. [sent-162, score-0.626]
70 This approach can yield efficient controllers on its own, but may not be suitable for problems with a long effective horizon. [sent-163, score-0.221]
71 To optimize controllers further, we give two changes to Algorithm 1 that enable optimization of infinite-horizon policies: (1) To compute beliefs b̂_u(s, q) over time steps u by projecting the initial belief, first determine an effective projection horizon Tproj. [sent-164, score-0.356]
72 Compute a QMDP policy [18] (an upper bound to the optimal DEC-POMDP policy) by dynamic programming. [sent-165, score-0.255]
73 Compute the belief b_t(s, q) for each FSC layer t (needed on line 2 of Algorithm 1) as a discounted sum of the projected beliefs: b_t(s, q) = (1/C) Σ_{u ∈ {t, t+M, t+2M, ...}} γ^u b̂_u(s, q), where C is a normalization constant. [sent-167, score-0.279]
74 (2) Compute the value function V_t(s, q) for a policy graph layer by backing up (using line 14 of Algorithm 1) M − 1 steps from the previous periodic FSC layer to the current FSC layer, one layer at a time. [sent-171, score-1.102]
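A sketch of change (1), the layer-belief computation just described (shapes assumed; b_hat would come from projecting the initial belief forward up to Tproj): beliefs whose time step maps to the same layer t are mixed with weights γ^u and normalized.

```python
import numpy as np

def layer_beliefs(b_hat, M, gamma):
    """b_hat[u, s, q]: projected joint belief over states and FSC nodes, u = 0..Tproj-1.
    Returns b[t, s, q], the normalized gamma-discounted sum over u in {t, t+M, t+2M, ...}
    (assumes Tproj >= M so every layer receives at least one projected belief)."""
    b = np.zeros((M,) + b_hat.shape[1:])
    for u in range(b_hat.shape[0]):
        b[u % M] += gamma ** u * b_hat[u]
    return b / b.sum(axis=(1, 2), keepdims=True)
```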
75 3.2 Expectation maximization for stochastic infinite-horizon controllers: A stochastic FSC provides a solution of equal or larger value [6] compared to a deterministic FSC with the same number of controller nodes. [sent-175, score-0.405]
76 Many algorithms that optimize stochastic FSCs could be adapted to use periodic FSCs; in this paper we adapt the expectation-maximization (EM) approach [7, 12] to periodic FSCs. [sent-176, score-0.814]
77 We now introduce an EM algorithm for (DEC-)POMDPs with periodic stochastic FSCs. [sent-179, score-0.394]
78 EM improves the policy parameters, i.e. the stochastic periodic finite state controllers, in each iteration. [sent-186, score-0.431]
79 In the E-step, alpha messages α̂^(m)(q, s) and beta messages β̂^(m)(q, s) are computed for each layer of the periodic FSC. [sent-188, score-0.633]
80 The alpha messages are computed by projecting an initial nodes-and-state distribution forward, while the beta messages are computed by projecting reward probabilities backward. [sent-190, score-0.226]
81 For a periodic FSC the forward projection of the joint distribution over world and FSC states from time step t to time step t + 1 is P_t(q′, s′|q, s) = Σ_{o,a} P_{s′sa} P_{os′a} Π_i [π^(t)_{ai qi} λ^(t)_{q′i qi oi}]. [sent-194, score-0.703]
82 Each α̂^(m)(q, s) can be computed by projecting a single trajectory forward starting from the initial belief and then adding only messages belonging to layer m to each α̂^(m)(q, s). [sent-195, score-0.258]
83 Denoting such projections by β_0(q, s) = Σ_a R_{sa} Π_i π_{ai qi} and β_t^(m)(q, s) = Σ_{s′,q′} β_{t−1}^(m)(q′, s′) P_t(q′, s′|q, s), the equations for the messages become α̂^(m)(q, s) = Σ_{tm=0}^{T_M−1} γ^(m+tm M) (1−γ) α^(m+tm M)(q, s) and β̂^(m)(q, s) = Σ_{t=0}^{T} γ^t (1−γ) β_t^(m)(q, s). (1) [sent-197, score-0.342]
84 This means that the complexity of the E-step for periodic FSCs is M times the complexity of the E-step for usual FSCs with a total number of nodes equal to the width of the periodic FSC. [sent-198, score-0.779]
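For intuition, a single-agent (POMDP-style) sketch of the forward half of this E-step follows; the DEC-POMDP case replaces π and λ by products over agents, and the backward (beta) pass is analogous. All array layouts are assumptions, not the authors' implementation.

```python
import numpy as np

def forward_messages(nu, pi, lam, P, O, b0, gamma, M, T):
    """nu[q] initial node distribution; pi[m, q, a], lam[m, q, o, q'] periodic FSC;
    P[a, s, s'], O[a, s', o] model; b0[s] initial belief.
    Accumulates alpha_hat[m, q, s] = sum over t with t mod M == m of gamma^t (1-gamma) alpha_t."""
    alpha = np.outer(nu, b0)                              # alpha_0(q, s)
    alpha_hat = np.zeros((M,) + alpha.shape)
    n_actions, n_obs = pi.shape[2], lam.shape[2]
    for t in range(T):
        m = t % M
        alpha_hat[m] += gamma ** t * (1.0 - gamma) * alpha
        new = np.zeros_like(alpha)                        # apply P_t(q', s' | q, s)
        for a in range(n_actions):
            ws = (alpha * pi[m, :, a][:, None]) @ P[a]    # (q, s'): choose a, move to s'
            for o in range(n_obs):
                new += lam[m, :, o, :].T @ (ws * O[a][:, o])   # (q', s')
        alpha = new
    return alpha_hat
```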
85 In the M-step we can update the parameters of each layer separately using the alpha and beta messages for that layer, as follows. [sent-200, score-0.225]
86 For periodic FSCs, P(r = 1, L, T|θ) = P(T) [R̂_{sa}]_{t=T} ( Π_{t=1}^{T} τ_{aq}^(t) P_{s′sa} P_{os′a} Λ_{q′qo}^(t) ) τ_{aq}^(0) b_0(s), (2) where we denoted τ_{aq}^(t) = Π_i π^(t)_{ai qi} for t = 1, ..., T, [sent-202, score-0.68]
87 τ_{aq}^(0) = Π_i π^(0)_{ai qi} ν_{qi}, and Λ_{q′qo}^(t) = Π_i λ^(t−1)_{q′i qi oi}. [sent-205, score-0.764]
88 i The log in the expected complete log-likelihood Q(θ, θ∗ ) transforms the product of probabilities into a sum; we can divide the sums into smaller sums, where each sum contains only parameters from ˆ the same periodic FSC layer. [sent-206, score-0.39]
89 [Equation (5): layer-wise M-step update, with normalization constant C_{qi oi}] Note about initialization. [sent-208, score-0.209]
90 The deterministic controller approach (Section 3.1.2) yields deterministic periodic controllers as initializations; a deterministic finite state controller is a stable point of the EM algorithm, since for such a controller the M-step of the EM approach does not change the probabilities. [sent-213, score-0.909]
91 To allow EM to escape the stable point and find even better optima, we add noise to the controllers in order to produce stochastic controllers that can be improved by EM. [sent-214, score-0.43]
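A sketch of this perturbation (the noise level eps is an assumed free parameter): mix each probability table with uniform noise and renormalize along its last axis, so that the M-step is no longer stuck at the deterministic controller.

```python
import numpy as np

def perturb(prob_table, eps=0.05, seed=0):
    """Mix a (possibly 0/1, i.e. deterministic) probability table with uniform noise
    along its last axis and renormalize, producing a stochastic controller."""
    rng = np.random.default_rng(seed)
    noisy = (1.0 - eps) * prob_table + eps * rng.random(prob_table.shape)
    return noisy / noisy.sum(axis=-1, keepdims=True)
```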
92 For both types of benchmarks we ran the proposed infinite-horizon method for deterministic controllers (denoted “Peri”) with nine improvement rounds as described in Section 3.1.2. [sent-216, score-0.263]
93 For DEC-POMDP benchmarks we also ran the proposed periodic expectation maximization approach in Section 3.2. [sent-219, score-0.368]
94 EM was run for all problems and Mealy NLP for the Hallway2, decentralized tiger, recycling robots, and wireless network problems. [sent-231, score-0.203]
95 Table 1 shows DEC-POMDP results for the decentralized tiger, recycling robots, meeting in a grid, wireless network [10], co-operative box pushing, and stochastic mars rover problems. [sent-233, score-0.25]
96 The proposed method “Peri” performed best in the DEC-POMDP problems and better than other restricted policy size methods in the POMDP problems. [sent-242, score-0.235]
97 5 Conclusions and discussion: We introduced a new class of finite state controllers, periodic finite state controllers (periodic FSCs), and presented methods for initialization and policy improvement. [sent-244, score-0.907]
98 In addition to the expectation-maximization presented here, other optimization algorithms for infinite-horizon problems could also be adapted to periodic FSCs: for example, the non-linear programming approach [8] could be adapted to periodic FSCs. [sent-247, score-0.81]
99 In brief, a separate value function and separate FSC parameters would be used for each time slice in the periodic FSCs, and the number of constraints would grow linearly with the number of time slices. [sent-248, score-0.368]
100 Sarsop: Efficient point-based pomdp planning by approximating optimally reachable belief spaces. [sent-348, score-0.233]
wordName wordTfidf (topN-words)
[('fsc', 0.441), ('periodic', 0.368), ('fscs', 0.326), ('qi', 0.267), ('policy', 0.235), ('nlp', 0.217), ('oi', 0.209), ('controllers', 0.202), ('mealy', 0.2), ('layer', 0.155), ('peri', 0.137), ('pomdp', 0.125), ('pomdps', 0.123), ('agents', 0.12), ('decentralized', 0.119), ('ai', 0.108), ('em', 0.103), ('periem', 0.095), ('policies', 0.091), ('controller', 0.09), ('agent', 0.086), ('node', 0.067), ('sarsop', 0.065), ('planning', 0.064), ('beliefs', 0.063), ('deterministic', 0.061), ('reward', 0.056), ('wireless', 0.056), ('pt', 0.054), ('action', 0.048), ('horizon', 0.044), ('belief', 0.044), ('nodes', 0.043), ('actions', 0.042), ('messages', 0.04), ('discount', 0.039), ('aalto', 0.037), ('perseus', 0.037), ('aq', 0.037), ('rsa', 0.037), ('state', 0.037), ('tm', 0.035), ('graph', 0.034), ('nitehorizon', 0.034), ('pos', 0.034), ('connections', 0.032), ('finland', 0.032), ('bpi', 0.032), ('foreach', 0.032), ('hpi', 0.032), ('ifaamas', 0.032), ('oam', 0.032), ('alpha', 0.03), ('layers', 0.03), ('qj', 0.029), ('initialization', 0.028), ('optimize', 0.028), ('recycling', 0.028), ('redirect', 0.028), ('amato', 0.028), ('rmin', 0.028), ('bt', 0.028), ('robots', 0.027), ('programming', 0.026), ('ps', 0.026), ('stochastic', 0.026), ('world', 0.025), ('optimized', 0.025), ('environment', 0.025), ('sa', 0.024), ('adapted', 0.024), ('oj', 0.024), ('discounted', 0.024), ('probabilities', 0.022), ('aamas', 0.022), ('vt', 0.022), ('aloha', 0.021), ('cqi', 0.021), ('jaakko', 0.021), ('joni', 0.021), ('mars', 0.021), ('pajarinen', 0.021), ('pbpg', 0.021), ('pbpi', 0.021), ('ptai', 0.021), ('qot', 0.021), ('seuken', 0.021), ('tproj', 0.021), ('duplicate', 0.021), ('dynamic', 0.02), ('observable', 0.02), ('mod', 0.019), ('projecting', 0.019), ('states', 0.019), ('bernstein', 0.019), ('sq', 0.019), ('yield', 0.019), ('solutions', 0.018), ('decpomdps', 0.018), ('zilberstein', 0.018)]
simIndex simValue paperId paperTitle
same-paper 1 0.9999997 212 nips-2011-Periodic Finite State Controllers for Efficient POMDP and DEC-POMDP Planning
Author: Joni K. Pajarinen, Jaakko Peltonen
Abstract: Applications such as robot control and wireless communication require planning under uncertainty. Partially observable Markov decision processes (POMDPs) plan policies for single agents under uncertainty and their decentralized versions (DEC-POMDPs) find a policy for multiple agents. The policy in infinite-horizon POMDP and DEC-POMDP problems has been represented as finite state controllers (FSCs). We introduce a novel class of periodic FSCs, composed of layers connected only to the previous and next layer. Our periodic FSC method finds a deterministic finite-horizon policy and converts it to an initial periodic infinite-horizon policy. This policy is optimized by a new infinite-horizon algorithm to yield deterministic periodic policies, and by a new expectation maximization algorithm to yield stochastic periodic policies. Our method yields better results than earlier planning methods and can compute larger solutions than with regular FSCs.
2 0.2243984 174 nips-2011-Monte Carlo Value Iteration with Macro-Actions
Author: Zhan Lim, Lee Sun, Daniel J. Hsu
Abstract: POMDP planning faces two major computational challenges: large state spaces and long planning horizons. The recently introduced Monte Carlo Value Iteration (MCVI) can tackle POMDPs with very large discrete state spaces or continuous state spaces, but its performance degrades when faced with long planning horizons. This paper presents Macro-MCVI, which extends MCVI by exploiting macro-actions for temporal abstraction. We provide sufficient conditions for Macro-MCVI to inherit the good theoretical properties of MCVI. Macro-MCVI does not require explicit construction of probabilistic models for macro-actions and is thus easy to apply in practice. Experiments show that Macro-MCVI substantially improves the performance of MCVI with suitable macro-actions. 1
3 0.20337938 16 nips-2011-A reinterpretation of the policy oscillation phenomenon in approximate policy iteration
Author: Paul Wagner
Abstract: A majority of approximate dynamic programming approaches to the reinforcement learning problem can be categorized into greedy value function methods and value-based policy gradient methods. The former approach, although fast, is well known to be susceptible to the policy oscillation phenomenon. We take a fresh view to this phenomenon by casting a considerable subset of the former approach as a limiting special case of the latter. We explain the phenomenon in terms of this view and illustrate the underlying mechanism with artificial examples. We also use it to derive the constrained natural actor-critic algorithm that can interpolate between the aforementioned approaches. In addition, it has been suggested in the literature that the oscillation phenomenon might be subtly connected to the grossly suboptimal performance in the Tetris benchmark problem of all attempted approximate dynamic programming methods. We report empirical evidence against such a connection and in favor of an alternative explanation. Finally, we report scores in the Tetris problem that improve on existing dynamic programming based results. 1
4 0.16626558 79 nips-2011-Efficient Offline Communication Policies for Factored Multiagent POMDPs
Author: João V. Messias, Matthijs Spaan, Pedro U. Lima
Abstract: Factored Decentralized Partially Observable Markov Decision Processes (DecPOMDPs) form a powerful framework for multiagent planning under uncertainty, but optimal solutions require a rigid history-based policy representation. In this paper we allow inter-agent communication which turns the problem in a centralized Multiagent POMDP (MPOMDP). We map belief distributions over state factors to an agent’s local actions by exploiting structure in the joint MPOMDP policy. The key point is that when sparse dependencies between the agents’ decisions exist, often the belief over its local state factors is sufficient for an agent to unequivocally identify the optimal action, and communication can be avoided. We formalize these notions by casting the problem into convex optimization form, and present experimental results illustrating the savings in communication that we can obtain.
5 0.159353 215 nips-2011-Policy Gradient Coagent Networks
Author: Philip S. Thomas
Abstract: We present a novel class of actor-critic algorithms for actors consisting of sets of interacting modules. We present, analyze theoretically, and empirically evaluate an update rule for each module, which requires only local information: the module’s input, output, and the TD error broadcast by a critic. Such updates are necessary when computation of compatible features becomes prohibitively difficult and are also desirable to increase the biological plausibility of reinforcement learning methods. 1
6 0.12296474 163 nips-2011-MAP Inference for Bayesian Inverse Reinforcement Learning
7 0.10941378 18 nips-2011-Action-Gap Phenomenon in Reinforcement Learning
8 0.084457949 253 nips-2011-Signal Estimation Under Random Time-Warpings and Nonlinear Signal Alignment
9 0.084259652 246 nips-2011-Selective Prediction of Financial Trends with Hidden Markov Models
10 0.082176939 50 nips-2011-Budgeted Optimization with Concurrent Stochastic-Duration Experiments
11 0.081895843 10 nips-2011-A Non-Parametric Approach to Dynamic Programming
12 0.073157221 158 nips-2011-Learning unbelievable probabilities
13 0.068869166 300 nips-2011-Variance Reduction in Monte-Carlo Tree Search
14 0.067541689 245 nips-2011-Selecting the State-Representation in Reinforcement Learning
15 0.066529483 88 nips-2011-Environmental statistics and the trade-off between model-based and TD learning in humans
16 0.065814175 283 nips-2011-The Fixed Points of Off-Policy TD
17 0.064201333 188 nips-2011-Non-conjugate Variational Message Passing for Multinomial and Binary Regression
18 0.06260585 39 nips-2011-Approximating Semidefinite Programs in Sublinear Time
19 0.061852153 190 nips-2011-Nonlinear Inverse Reinforcement Learning with Gaussian Processes
20 0.061165433 52 nips-2011-Clustering via Dirichlet Process Mixture Models for Portable Skill Discovery
topicId topicWeight
[(0, 0.129), (1, -0.138), (2, 0.071), (3, 0.187), (4, -0.189), (5, 0.012), (6, -0.009), (7, 0.012), (8, -0.026), (9, -0.117), (10, 0.01), (11, -0.025), (12, -0.081), (13, -0.072), (14, -0.137), (15, -0.094), (16, 0.118), (17, 0.129), (18, 0.072), (19, -0.023), (20, 0.069), (21, -0.084), (22, 0.088), (23, -0.015), (24, 0.101), (25, 0.018), (26, -0.054), (27, 0.069), (28, -0.0), (29, -0.007), (30, -0.003), (31, 0.033), (32, -0.03), (33, -0.003), (34, 0.032), (35, 0.073), (36, -0.016), (37, 0.015), (38, -0.018), (39, 0.014), (40, 0.033), (41, 0.141), (42, -0.024), (43, -0.113), (44, -0.047), (45, 0.053), (46, -0.017), (47, 0.029), (48, 0.074), (49, 0.033)]
simIndex simValue paperId paperTitle
same-paper 1 0.97077882 212 nips-2011-Periodic Finite State Controllers for Efficient POMDP and DEC-POMDP Planning
Author: Joni K. Pajarinen, Jaakko Peltonen
Abstract: Applications such as robot control and wireless communication require planning under uncertainty. Partially observable Markov decision processes (POMDPs) plan policies for single agents under uncertainty and their decentralized versions (DEC-POMDPs) find a policy for multiple agents. The policy in infinite-horizon POMDP and DEC-POMDP problems has been represented as finite state controllers (FSCs). We introduce a novel class of periodic FSCs, composed of layers connected only to the previous and next layer. Our periodic FSC method finds a deterministic finite-horizon policy and converts it to an initial periodic infinite-horizon policy. This policy is optimized by a new infinite-horizon algorithm to yield deterministic periodic policies, and by a new expectation maximization algorithm to yield stochastic periodic policies. Our method yields better results than earlier planning methods and can compute larger solutions than with regular FSCs.
2 0.82531101 174 nips-2011-Monte Carlo Value Iteration with Macro-Actions
Author: Zhan Lim, Lee Sun, Daniel J. Hsu
Abstract: POMDP planning faces two major computational challenges: large state spaces and long planning horizons. The recently introduced Monte Carlo Value Iteration (MCVI) can tackle POMDPs with very large discrete state spaces or continuous state spaces, but its performance degrades when faced with long planning horizons. This paper presents Macro-MCVI, which extends MCVI by exploiting macro-actions for temporal abstraction. We provide sufficient conditions for Macro-MCVI to inherit the good theoretical properties of MCVI. Macro-MCVI does not require explicit construction of probabilistic models for macro-actions and is thus easy to apply in practice. Experiments show that Macro-MCVI substantially improves the performance of MCVI with suitable macro-actions. 1
3 0.71315789 215 nips-2011-Policy Gradient Coagent Networks
Author: Philip S. Thomas
Abstract: We present a novel class of actor-critic algorithms for actors consisting of sets of interacting modules. We present, analyze theoretically, and empirically evaluate an update rule for each module, which requires only local information: the module’s input, output, and the TD error broadcast by a critic. Such updates are necessary when computation of compatible features becomes prohibitively difficult and are also desirable to increase the biological plausibility of reinforcement learning methods. 1
4 0.69201702 16 nips-2011-A reinterpretation of the policy oscillation phenomenon in approximate policy iteration
Author: Paul Wagner
Abstract: A majority of approximate dynamic programming approaches to the reinforcement learning problem can be categorized into greedy value function methods and value-based policy gradient methods. The former approach, although fast, is well known to be susceptible to the policy oscillation phenomenon. We take a fresh view to this phenomenon by casting a considerable subset of the former approach as a limiting special case of the latter. We explain the phenomenon in terms of this view and illustrate the underlying mechanism with artificial examples. We also use it to derive the constrained natural actor-critic algorithm that can interpolate between the aforementioned approaches. In addition, it has been suggested in the literature that the oscillation phenomenon might be subtly connected to the grossly suboptimal performance in the Tetris benchmark problem of all attempted approximate dynamic programming methods. We report empirical evidence against such a connection and in favor of an alternative explanation. Finally, we report scores in the Tetris problem that improve on existing dynamic programming based results. 1
5 0.65467983 50 nips-2011-Budgeted Optimization with Concurrent Stochastic-Duration Experiments
Author: Javad Azimi, Alan Fern, Xiaoli Z. Fern
Abstract: Budgeted optimization involves optimizing an unknown function that is costly to evaluate by requesting a limited number of function evaluations at intelligently selected inputs. Typical problem formulations assume that experiments are selected one at a time with a limited total number of experiments, which fail to capture important aspects of many real-world problems. This paper defines a novel problem formulation with the following important extensions: 1) allowing for concurrent experiments; 2) allowing for stochastic experiment durations; and 3) placing constraints on both the total number of experiments and the total experimental time. We develop both offline and online algorithms for selecting concurrent experiments in this new setting and provide experimental results on a number of optimization benchmarks. The results show that our algorithms produce highly effective schedules compared to natural baselines. 1
6 0.64104962 79 nips-2011-Efficient Offline Communication Policies for Factored Multiagent POMDPs
7 0.61207533 36 nips-2011-Analysis and Improvement of Policy Gradient Estimation
8 0.56462574 237 nips-2011-Reinforcement Learning using Kernel-Based Stochastic Factorization
9 0.56361717 10 nips-2011-A Non-Parametric Approach to Dynamic Programming
10 0.53216869 300 nips-2011-Variance Reduction in Monte-Carlo Tree Search
11 0.48281062 163 nips-2011-MAP Inference for Bayesian Inverse Reinforcement Learning
12 0.43433344 18 nips-2011-Action-Gap Phenomenon in Reinforcement Learning
13 0.42237568 158 nips-2011-Learning unbelievable probabilities
14 0.41284609 256 nips-2011-Solving Decision Problems with Limited Information
15 0.39885339 52 nips-2011-Clustering via Dirichlet Process Mixture Models for Portable Skill Discovery
16 0.35245261 31 nips-2011-An Application of Tree-Structured Expectation Propagation for Channel Decoding
17 0.34549025 206 nips-2011-Optimal Reinforcement Learning for Gaussian Systems
18 0.34046659 65 nips-2011-Convergent Fitted Value Iteration with Linear Function Approximation
19 0.33872026 170 nips-2011-Message-Passing for Approximate MAP Inference with Latent Variables
20 0.33513924 48 nips-2011-Blending Autonomous Exploration and Apprenticeship Learning
topicId topicWeight
[(0, 0.016), (4, 0.043), (20, 0.022), (26, 0.013), (31, 0.093), (33, 0.023), (34, 0.01), (37, 0.065), (43, 0.027), (45, 0.05), (57, 0.03), (70, 0.376), (74, 0.042), (83, 0.024), (99, 0.06)]
simIndex simValue paperId paperTitle
same-paper 1 0.73500073 212 nips-2011-Periodic Finite State Controllers for Efficient POMDP and DEC-POMDP Planning
Author: Joni K. Pajarinen, Jaakko Peltonen
Abstract: Applications such as robot control and wireless communication require planning under uncertainty. Partially observable Markov decision processes (POMDPs) plan policies for single agents under uncertainty and their decentralized versions (DEC-POMDPs) find a policy for multiple agents. The policy in infinite-horizon POMDP and DEC-POMDP problems has been represented as finite state controllers (FSCs). We introduce a novel class of periodic FSCs, composed of layers connected only to the previous and next layer. Our periodic FSC method finds a deterministic finite-horizon policy and converts it to an initial periodic infinite-horizon policy. This policy is optimized by a new infinite-horizon algorithm to yield deterministic periodic policies, and by a new expectation maximization algorithm to yield stochastic periodic policies. Our method yields better results than earlier planning methods and can compute larger solutions than with regular FSCs.
2 0.69385719 74 nips-2011-Dynamic Pooling and Unfolding Recursive Autoencoders for Paraphrase Detection
Author: Richard Socher, Eric H. Huang, Jeffrey Pennin, Christopher D. Manning, Andrew Y. Ng
Abstract: Paraphrase detection is the task of examining two sentences and determining whether they have the same meaning. In order to obtain high accuracy on this task, thorough syntactic and semantic analysis of the two statements is needed. We introduce a method for paraphrase detection based on recursive autoencoders (RAE). Our unsupervised RAEs are based on a novel unfolding objective and learn feature vectors for phrases in syntactic trees. These features are used to measure the word- and phrase-wise similarity between two sentences. Since sentences may be of arbitrary length, the resulting matrix of similarity measures is of variable size. We introduce a novel dynamic pooling layer which computes a fixed-sized representation from the variable-sized matrices. The pooled representation is then used as input to a classifier. Our method outperforms other state-of-the-art approaches on the challenging MSRP paraphrase corpus. 1
3 0.49360228 222 nips-2011-Prismatic Algorithm for Discrete D.C. Programming Problem
Author: Yoshinobu Kawahara, Takashi Washio
Abstract: In this paper, we propose the first exact algorithm for minimizing the difference of two submodular functions (D.S.), i.e., the discrete version of the D.C. programming problem. The developed algorithm is a branch-and-bound-based algorithm which responds to the structure of this problem through the relationship between submodularity and convexity. The D.S. programming problem covers a broad range of applications in machine learning. In fact, this generalizes any set-function optimization. We empirically investigate the performance of our algorithm, and illustrate the difference between exact and approximate solutions respectively obtained by the proposed and existing algorithms in feature selection and discriminative structure learning.
4 0.37451527 174 nips-2011-Monte Carlo Value Iteration with Macro-Actions
Author: Zhan Lim, Lee Sun, Daniel J. Hsu
Abstract: POMDP planning faces two major computational challenges: large state spaces and long planning horizons. The recently introduced Monte Carlo Value Iteration (MCVI) can tackle POMDPs with very large discrete state spaces or continuous state spaces, but its performance degrades when faced with long planning horizons. This paper presents Macro-MCVI, which extends MCVI by exploiting macro-actions for temporal abstraction. We provide sufficient conditions for Macro-MCVI to inherit the good theoretical properties of MCVI. Macro-MCVI does not require explicit construction of probabilistic models for macro-actions and is thus easy to apply in practice. Experiments show that Macro-MCVI substantially improves the performance of MCVI with suitable macro-actions. 1
5 0.3624562 48 nips-2011-Blending Autonomous Exploration and Apprenticeship Learning
Author: Thomas J. Walsh, Daniel K. Hewlett, Clayton T. Morrison
Abstract: We present theoretical and empirical results for a framework that combines the benefits of apprenticeship and autonomous reinforcement learning. Our approach modifies an existing apprenticeship learning framework that relies on teacher demonstrations and does not necessarily explore the environment. The first change is replacing previously used Mistake Bound model learners with a recently proposed framework that melds the KWIK and Mistake Bound supervised learning protocols. The second change is introducing a communication of expected utility from the student to the teacher. The resulting system only uses teacher traces when the agent needs to learn concepts it cannot efficiently learn on its own. 1
6 0.3355858 75 nips-2011-Dynamical segmentation of single trials from population neural data
7 0.33116114 66 nips-2011-Crowdclustering
8 0.33087575 229 nips-2011-Query-Aware MCMC
9 0.33072969 249 nips-2011-Sequence learning with hidden units in spiking neural networks
10 0.33026373 102 nips-2011-Generalised Coupled Tensor Factorisation
11 0.32936084 180 nips-2011-Multiple Instance Filtering
12 0.32870001 158 nips-2011-Learning unbelievable probabilities
13 0.32851717 241 nips-2011-Scalable Training of Mixture Models via Coresets
14 0.32766041 300 nips-2011-Variance Reduction in Monte-Carlo Tree Search
15 0.32678774 57 nips-2011-Comparative Analysis of Viterbi Training and Maximum Likelihood Estimation for HMMs
16 0.3262932 37 nips-2011-Analytical Results for the Error in Filtering of Gaussian Processes
17 0.32604063 278 nips-2011-TD gamma: Re-evaluating Complex Backups in Temporal Difference Learning
18 0.32591817 79 nips-2011-Efficient Offline Communication Policies for Factored Multiagent POMDPs
19 0.32396162 303 nips-2011-Video Annotation and Tracking with Active Learning
20 0.32261541 221 nips-2011-Priors over Recurrent Continuous Time Processes