nips nips2013 nips2013-347 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Qiang Cheng, Qiang Liu, Feng Chen, Alex Ihler
Abstract: Markov Decision Processes (MDPs) are extremely useful for modeling and solving sequential decision making problems. Graph-based MDPs provide a compact representation for MDPs with large numbers of random variables. However, the complexity of exactly solving a graph-based MDP usually grows exponentially in the number of variables, which limits their application. We present a new variational framework to describe and solve the planning problem of MDPs, and derive both exact and approximate planning algorithms. In particular, by exploiting the graph structure of graph-based MDPs, we propose a factored variational value iteration algorithm in which the value function is first approximated by the multiplication of local-scope value functions, then solved by minimizing a Kullback-Leibler (KL) divergence. The KL divergence is optimized using the belief propagation algorithm, with complexity exponential in only the cluster size of the graph. Experimental comparison on different models shows that our algorithm outperforms existing approximation algorithms at finding good policies. 1
Reference: text
sentIndex sentText sentNum sentScore
1 Abstract: Markov Decision Processes (MDPs) are extremely useful for modeling and solving sequential decision making problems. [sent-7, score-0.144]
2 We present a new variational framework to describe and solve the planning problem of MDPs, and derive both exact and approximate planning algorithms. [sent-10, score-0.4]
3 In particular, by exploiting the graph structure of graph-based MDPs, we propose a factored variational value iteration algorithm in which the value function is first approximated by the multiplication of local-scope value functions, then solved by minimizing a Kullback-Leibler (KL) divergence. [sent-11, score-0.426]
4 The KL divergence is optimized using the belief propagation algorithm, with complexity exponential in only the cluster size of the graph. [sent-12, score-0.129]
5 1 Introduction Markov Decision Processes (MDPs) have been widely used to model and solve sequential decision making problems under uncertainty, in fields including artificial intelligence, control, finance and management (Puterman, 2009, Barber, 2011). [sent-14, score-0.198]
6 In graph-based MDPs, the state is described by a collection of random variables, and the transition and reward functions are represented by a set of smaller (local-scope) functions. [sent-18, score-0.225]
7 Consequently, graph-based MDPs are often approximately solved by enforcing context-specific independence or function-specific independence constraints (Sigaud et al. [sent-22, score-0.115]
8 To take advantage of context-specific independence, a graph-based MDP can be represented using decision trees or algebraic decision diagrams (Bahar et al. [sent-24, score-0.395]
9 , 1993), and then solved by applying structured value iteration (Hoey et al. [sent-25, score-0.158]
10 Exploiting function-specific independence, a graph-based MDP can be solved using approximate linear programming (Guestrin et al. [sent-31, score-0.149]
11 , 2003, 2001, Forsell and Sabbadin, 2006), approximate policy itera1 tion (Sabbadin et al. [sent-32, score-0.278]
12 , 2012, Peyrard and Sabbadin, 2006) and approximate value iteration (Guestrin et al. [sent-33, score-0.15]
13 Among these, the approximate linear programming algorithm in Guestrin et al. [sent-35, score-0.121]
14 The approximate policy iteration algorithm in Sabbadin et al. [sent-37, score-0.35]
15 In this paper, we propose a variational framework for the MDP planning problem. [sent-39, score-0.244]
16 This framework provides a new perspective to describe and solve graph-based MDPs where both the state and decision spaces are structured. [sent-40, score-0.168]
17 We first derive a variational value iteration algorithm as an exact planning algorithm, which is equivalent to the classical value iteration algorithm. [sent-41, score-0.388]
18 We then design an approximate version of this algorithm by taking advantage of the factored representation of the reward and transition functions, and propose a factored variational value iteration algorithm. [sent-42, score-0.733]
19 At each step, this algorithm computes the value function by minimizing a Kullback-Leibler divergence, which can be done using a belief propagation algorithm for influence diagram problems (Liu and Ihler, 2012) . [sent-44, score-0.126]
20 In comparison with the approximate linear programming algorithm (Guestrin et al. [sent-45, score-0.121]
21 , 2003) and the approximate policy iteration algorithm (Sabbadin et al. [sent-46, score-0.35]
22 , 2012) on various graph-based MDPs, we show that our factored variational value iteration algorithm generates better policies. [sent-47, score-0.359]
23 Section 3 describes a variational view of planning for finite horizon MDPs, followed by a framework for infinite MDPs in Section 4. [sent-50, score-0.275]
24 In Section 5, we derive an approximate algorithm for solving infinite MDPs based on the variational perspective. [sent-51, score-0.182]
25 A policy of the system is a mapping from the states to the decisions, π(x) : X → D, so that π(x) gives the decision chosen by the system in state x. [sent-56, score-0.417]
26 We consider the case of an MDP with infinite horizon, in which the future rewards are discounted exponentially with a discount factor γ ∈ [0, 1]. [sent-58, score-0.131]
27 The task of the MDP is to choose the best stationary policy π ∗ (x) that maximizes the expected discounted reward on the infinite horizon. [sent-59, score-0.366]
28 The value function v∗(x) of the best policy π∗(x) then satisfies the following Bellman equation: v∗(x) = max_{π(x)} Σ_{y∈X} T(y | x, π(x)) (R(x, π(x)) + γ v∗(y)), (1) where v∗ denotes the same (stationary) function on both sides, i.e., v∗(x) = v∗(y) for all x = y. [sent-60, score-0.2]
29 The Bellman equation can be solved using stochastic dynamic programming algorithms such as value iteration and policy iteration, or linear programming algorithms (Puterman, 2009). [sent-61, score-0.386]
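As a point of reference, here is a minimal sketch of tabular value iteration for Eq. (1); the dense arrays T and R (and the small-MDP setting they imply) are illustrative assumptions, not from the paper.

```python
import numpy as np

def value_iteration(T, R, gamma=0.95, tol=1e-6):
    """Tabular value iteration for the Bellman equation (1) (illustrative sketch).

    T: array of shape (S, A, S) with T[x, d, y] = T(y | x, d).
    R: array of shape (S, A) with R[x, d] = immediate reward.
    Returns the value function v* and a greedy policy.
    """
    S, A, _ = T.shape
    v = np.zeros(S)
    while True:
        # Since T(.|x,d) sums to 1 over y, R + gamma * (T @ v) equals
        # sum_y T(y|x,d) * (R(x,d) + gamma * v(y)).
        q = R + gamma * (T @ v)              # shape (S, A)
        v_new = q.max(axis=1)
        if np.max(np.abs(v_new - v)) < tol:
            return v_new, q.argmax(axis=1)   # value function and greedy policy
        v = v_new
```

This explicit enumeration over X and D is exactly what becomes infeasible for graph-based MDPs, which motivates the factored approximations developed later in the paper.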
30 2 Graph-based MDPs We assume that the full state x can be represented as a collection of state variables xi , so that X is a Cartesian product of the domains of the xi : X = X1 × X2 × · · · × XN , and similarly for d: D = D1 × D2 × · · · × DN . [sent-63, score-0.208]
31 We consider the following particular factored form for MDPs: for each variable i, there exist neighborhood sets Γi (including i) such that the value of x_i^{t+1} depends only on the variable i's neighborhood, x^t[Γi], and the ith decision d_i^t. [sent-64, score-0.326]
32 We also assume that the reward function is the sum of N local-scope rewards: R(x, d) = Σ_{i=1}^N Ri(xi, di), (3) with local-scope functions Ri : Xi × Di → R, ∀i ∈ {1, 2, . . . , N}. [sent-70, score-0.33]
33 To summarize, a graph-based Markov decision process is characterized by the following parameters: ({Xi : 1 ≤ i ≤ N}; {Di : 1 ≤ i ≤ N}; {Ri : 1 ≤ i ≤ N}; {Γi : 1 ≤ i ≤ N}; {Ti : 1 ≤ i ≤ N}). [sent-74, score-0.144]
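For concreteness, the following sketch shows one way such a factored specification ({Xi}, {Di}, {Ri}, {Γi}, {Ti}) could be stored and queried; the container names and array shapes are our own assumptions, not the authors'.

```python
from dataclasses import dataclass
from typing import List, Tuple
import numpy as np

@dataclass
class GraphMDP:
    """Factored MDP specified by ({X_i}, {D_i}, {R_i}, {Gamma_i}, {T_i})."""
    state_sizes: List[int]                 # |X_i| for each variable i
    action_sizes: List[int]                # |D_i| for each decision i
    neighborhoods: List[Tuple[int, ...]]   # Gamma_i as variable indices (including i)
    local_rewards: List[np.ndarray]        # R_i, shape (|X_i|, |D_i|)
    local_transitions: List[np.ndarray]    # T_i, one axis per variable in Gamma_i, then d_i, then y_i

    def reward(self, x, d):
        # R(x, d) = sum_i R_i(x_i, d_i)  -- Eq. (3)
        return sum(R_i[x[i], d[i]] for i, R_i in enumerate(self.local_rewards))

    def step_prob(self, x, d, y):
        # T(y | x, d) = prod_i T_i(y_i | x[Gamma_i], d_i)  -- the factored transition (2)
        p = 1.0
        for i, T_i in enumerate(self.local_transitions):
            idx = tuple(x[j] for j in self.neighborhoods[i]) + (d[i], y[i])
            p *= T_i[idx]
        return p
```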
34 The optimal policy π(x) cannot be explicitly represented for large graph-based MDPs, since the number of states grows exponentially with the number of variables. [sent-77, score-0.256]
35 To reduce complexity, we consider a particular class of local policies: a policy π (x) : X → D is said to be local if decision di is made using only the neighborhood Γi , so that π (x) = (π1 (x [Γ1 ]) , π2 (x [Γ2 ]) , . [sent-78, score-0.638]
36 The main advantage of local policies is that they can be concisely expressed when the neighborhood sizes |Γi | are small. [sent-82, score-0.126]
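A sketch of one possible encoding of such a local policy, as one table per node indexed by the neighborhood configuration x[Γi], is given below; the function names and the random initialization are illustrative assumptions.

```python
import numpy as np

def random_local_policy(domains, action_sizes, neighborhoods, rng):
    """A local policy stored as one table per node i, indexed by x[Gamma_i].

    domains[i] = |X_i|, action_sizes[i] = |D_i|,
    neighborhoods[i] = Gamma_i as a tuple of variable indices (including i).
    """
    policy = []
    for i, gamma_i in enumerate(neighborhoods):
        shape = tuple(domains[j] for j in gamma_i)
        policy.append(rng.integers(action_sizes[i], size=shape))
    return policy

def act(policy, neighborhoods, x):
    """Evaluate pi(x) = (pi_1(x[Gamma_1]), ..., pi_N(x[Gamma_N]))."""
    return [policy[i][tuple(x[j] for j in gamma_i)]
            for i, gamma_i in enumerate(neighborhoods)]
```

Each table has only Π_{j∈Γi} |Xj| entries, which is the concise representation referred to above when the |Γi| are small.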
37 3 Variational Planning for Finite Horizon MDPs In this section, we introduce a variational planning viewpoint of finite MDPs. [sent-83, score-0.244]
38 A finite MDP can be viewed as an influence diagram; we can then directly relate planning to the variational decision-making framework of Liu and Ihler (2012). [sent-84, score-0.244]
39 Influence diagrams (Shachter, 2007) make use of Bayesian networks to represent structured decision problems under uncertainty. [sent-85, score-0.221]
40 The shaded part in Figure 1(a) shows a simple example influence diagram, with random variables {x, y}, decision variable d and reward functions {R (x, d) , v (y)}. [sent-86, score-0.274]
41 The goal is then to choose a policy that maximizes the expected reward. [sent-87, score-0.2]
42 The best policy π^t(x) for a finite MDP can be computed using backward induction (Barber, 2011): v^{t−1}(x) = max_{π(x)} Σ_{y∈X} T(y | x, π(x)) (R(x, π(x)) + γ v^t(y)). (4) Let p^t(x, y, d) = T(y | x, d) (R(x, d) + γ v^t(y)) be an augmented distribution (see, e. [sent-88, score-0.2]
43 Applying a variational framework for influence diagrams (Liu and Ihler, 2012, Theorem 3. [sent-91, score-0.185]
44 1), the optimal policy can be equivalently solved from the dual form of Eq. [sent-92, score-0.228]
45 (5); then from Liu and Ihler (2012), the optimal policy π t (x) is simply arg maxd τ t (d|x). [sent-97, score-0.247]
46 For finite MDPs with non-stationary policy, the best policy π t (x) and the value function v t−1 (x) can be obtained by solving Eq. [sent-102, score-0.2]
47 (a) The optimal policy can be obtained from τ t (x, y, d), as π t (x) = arg maxd τ t (d|x). [sent-106, score-0.247]
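A sketch of this decoding step is shown below, assuming τ^t is available as a dense nonnegative array over (x, y, d); this assumption only makes sense for small models and is for illustration only.

```python
import numpy as np

def decode_policy(tau):
    """Decode pi(x) = argmax_d tau(d | x) from a joint tau(x, y, d).

    tau: nonnegative array of shape (S, S, A) over (x, y, d).
    """
    tau_xd = tau.sum(axis=1)        # marginalize out y, giving tau(x, d)
    # argmax_d tau(d | x) = argmax_d tau(x, d), since normalizing over d
    # does not change the argmax for each fixed x.
    return tau_xd.argmax(axis=1)
```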
48 4 Variational Planning for Infinite Horizon MDPs Given the variational form of finite MDPs, we now construct a variational framework for infinite MDPs. [sent-117, score-0.27]
49 For an infinite MDP, we can simply consider a two-stage finite MDP with the variational form in Eq. [sent-123, score-0.135]
50 With τ ∗ being the optimal solution, we have (a) The optimal policy of the infinite MDP can be decoded as π ∗ (x) = arg maxd τ ∗ (d|x). [sent-128, score-0.247]
51 5 Approximate Variational Algorithms for Graph-based MDPs The framework in the previous section gives a new perspective on the MDP planning problem, but does not by itself simplify the problem or provide new solution methods. [sent-144, score-0.109]
52 For graph-based MDPs, the sizes of the full state and decision spaces are exponential in the number of variables. [sent-145, score-0.168]
53 (6), by exploiting the factorization structure of the transition function (2), the reward function (3) and the value function v (x). [sent-148, score-0.175]
54 Standard variational approximations take advantage of the multiplicative factorization of a distribution to define their approximations. [sent-149, score-0.135]
55 While our (unnormalized) distribution p (x, y, d) = exp[θ ∆ (x, y, d)] is structured, some of its important structure comes from additive factors, such as the local-scope reward functions Ri (xi , di ) in Eq. [sent-150, score-0.33]
56 Using this augmenting approach, the additive elements of the graph-based MDP are converted to multiplicative factors, that is Ri(xi, di) → R̃i(xi, di, a), and γv(x) → ṽγ(x, a). [sent-157, score-0.4]
57 In this way, the parameter θ∆ of a graph-based MDP can be represented as θ∆(x, y, d, a) = Σ_{i=1}^N log Ti(yi | x[Γi], di) + Σ_{i=1}^N log R̃i(xi, di, a) + log ṽγ(y, a). [sent-158, score-0.226]
58 Now, p(x, y, d, a) = exp[θ∆(x, y, d, a)] has a representation in terms of a product of factors. [sent-159, score-0.2]
59 Similarly, θ(x, y, d, a) = Σ_{i=1}^N log Ti(yi | x[Γi], di) + Σ_{i=1}^N log R̃i(xi, di, a). Before designing the algorithms, we first construct a cluster graph (G; C; S) for the distribution exp[θ(x, y, d, a)], where C denotes the set of clusters and S is the set of separators. [sent-161, score-0.322]
60 We assign each decision node di to one cluster that contains di and its parents pa(i); clusters so assigned are called decision clusters A, while other clusters are called normal clusters R, so that C = {R, A}. [sent-163, score-0.852]
61 Using the structure of the cluster graph, θ can be decomposed into θ(x, y, d, a) = Σ_{k∈C} θck(xck, yck, dck, a), (8) and the distribution τ is approximated as τ(x, y, d, a) = Π_{k∈C} τck(zck) / Π_{(kl)∈S} τskl(zskl), (9) where zck = {xck, yck, dck, a}. [sent-164, score-0.882]
62 We now construct a reduced cluster graph over x from the full cluster graph, to serve as the approximating structure of the marginal τ (x). [sent-167, score-0.151]
63 We assume a factored representation for τ(x): τ(x) = Π_{k∈C} τck(xck) / Π_{(kl)∈S} τskl(xskl), (10) where τck(xck) is the marginal distribution of τck(zck) on xck. [sent-168, score-0.388]
64 Eq. (10) also dictates a factored approximation of the value function v(x), because v(x) ≈ exp(Φ) τ(x). [sent-170, score-0.18]
65 Then, the constraint (7) reduces to a set of simpler constraints on the cliques of the cluster graph, θ∆ck(xck, yck, dck, a) = θck(xck, yck, dck, a) + log vck,x(yck, a), k ∈ C. (11) [sent-172, score-0.666]
66 Correspondingly, the constraint (6) can be approximated by Φ = Σ_{k∈C} ⟨θ∆ck, τck⟩ + Σ_{k∈R} Hck + Σ_{k∈A} Ĥck − Σ_{(kl)∈S} Hskl, (12) where Hck is the entropy of the variables in cluster ck, Hck = H(xck, yck, dck, a; τ), and for decision clusters Ĥck = H(xck, yck, dck, a; τ) − H(dck | xck; τ). [sent-173, score-1.128]
67 6 Experiments We perform experiments in two domains, disease management in crop fields and viral marketing, to evaluate the performance of our factored variational value iteration algorithm (FVI). [sent-182, score-0.736]
68 For comparison, we use the approximate policy iteration algorithm (API) (Sabbadin et al. [sent-183, score-0.35]
69 , 2012), (a mean-field based policy iteration approach), and the approximate linear programming algorithm (ALP) (Guestrin et al. [sent-184, score-0.393]
70 To evaluate each algorithm’s performance, we obtain its approximate local policy, then compute the expected value of the policy using either exact evaluation (if feasible) or a sample-based estimate (if not). [sent-186, score-0.279]
71 We then compare the expected reward U^alg = (1/|X|) Σ_{x∈X} v^alg(x) of each algorithm’s policy. [sent-187, score-0.18]
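A sketch of such a sample-based estimate follows, assuming a generic simulator step(x, d, rng) and averaging truncated discounted returns over rollouts; the function and parameter names are our own, not from the paper.

```python
import numpy as np

def mc_policy_value(step, policy, init_states, gamma=0.95,
                    horizon=200, n_runs=20, rng=None):
    """Estimate U = (1/|X0|) * sum_x v^pi(x) by truncated Monte-Carlo rollouts.

    step(x, d, rng) -> (next_state, reward) simulates the MDP;
    policy(x) -> d; init_states is the list of start states averaged over.
    """
    rng = rng or np.random.default_rng(0)
    total = 0.0
    for x0 in init_states:
        for _ in range(n_runs):
            x, ret, disc = x0, 0.0, 1.0
            for _ in range(horizon):
                d = policy(x)
                x, r = step(x, d, rng)
                ret += disc * r
                disc *= gamma
            total += ret
    return total / (len(init_states) * n_runs)
```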
72 1 Disease Management in Crop Fields A graph-based MDP for disease management in crop fields was introduced in (Sabbadin et al. [sent-189, score-0.271]
73 The problem is then to choose the optimal stationary policy to maximize the expected discounted yield. [sent-195, score-0.236]
74 The topology of the fields is represented by an undirected graph, where each node represents one crop field. [sent-196, score-0.159]
75 Each crop field can be in one of three states: xi = 1 if it is uninfected and xi = 2 to xi = 3 for increasing degrees of infection. [sent-198, score-0.344]
76 The probability that a field moves from state xi to state xi + 1 with di = 1 is set to be P = P(ε, p, ni) = ε + (1 − ε)(1 − (1 − p)^{ni}), where ε and p are parameters and ni is the number of the neighbors of i that are infected. [sent-199, score-0.382]
77 The reward function depends on each field’s state and local decision. [sent-201, score-0.186]
78 The maximal yield r > 1 is achieved by an uninfected, cultivated field; otherwise, the yield decreases linearly with the level of infection, from maximal reward r to minimal reward 1 + r/10. [sent-202, score-0.26]
79 Table 1: Local transition probabilities p(x′i | xN(i), ai) for the disease management problem. [sent-204, score-0.245]
80 (Rows: next state x′i; columns: current state xi, grouped by the local decision di.)
          di = 1                          di = 2
          xi = 1    xi = 2    xi = 3      xi = 1    xi = 2    xi = 3
x′i = 1   1 − P     0         0           1         q         q/2
x′i = 2   P         1 − P     0           0         1 − q     q/2
x′i = 3   0         P         1           0         0         1 − q
6. [sent-205, score-1.003]
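The sketch below encodes the infection probability from the text and the table above as a small transition array; reading di = 1 as normal cultivation and di = 2 as the treatment/fallow action is our interpretation and should be checked against the paper.

```python
import numpy as np

def infection_prob(eps, p, n_infected):
    # P(eps, p, n_i) = eps + (1 - eps) * (1 - (1 - p) ** n_i)
    return eps + (1.0 - eps) * (1.0 - (1.0 - p) ** n_infected)

def local_transition(eps, p, q, n_infected):
    """T[d-1, x-1, y-1] = p(y | x, n_infected, d) for states x, y in {1, 2, 3}."""
    P = infection_prob(eps, p, n_infected)
    T = np.zeros((2, 3, 3))
    # d = 1 (assumed: cultivate): infection can progress by one level
    T[0] = [[1 - P, P,     0.0],
            [0.0,   1 - P, P  ],
            [0.0,   0.0,   1.0]]
    # d = 2 (assumed: treat / leave fallow): recovery governed by q
    T[1] = [[1.0,     0.0,     0.0  ],
            [q,       1 - q,   0.0  ],
            [q / 2,   q / 2,   1 - q]]
    return T
```

Each row of T[d] sums to one, matching the columns of Table 1.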
81 Viral marketing has been previously framed as a one-shot influence diagram problem (Nath and Domingos, 2010). [sent-207, score-0.177]
82 Here, we frame the viral marketing task as an MDP planning problem, where we optimize the stationary policy to maximize long-term reward. [sent-208, score-0.57]
83 We assume there are three states for each person in the social network: xi = 1 if i is making positive comments, xi = 2 if not commenting, and xi = 3 for negative comments. [sent-210, score-0.235]
84 There is a binary decision corresponding to each person i: market to this person (di = 1) or not (di = 2). [sent-211, score-0.212]
85 We also define a local reward function: if a person gives good comments when di = 2, then the reward is r; otherwise, the reward is less, decreasing linearly to minimum value 1 + r/10. [sent-212, score-0.681]
86 For marketed individuals (di = 1), the reward is 1. [sent-213, score-0.13]
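Below is a sketch of one possible encoding of this local reward; the value for the intermediate state xi = 2 (the midpoint of the linear decrease) is our own assumption.

```python
def local_reward(x_i, d_i, r):
    """Local reward R_i(x_i, d_i) for the viral-marketing model (illustrative).

    x_i in {1, 2, 3}: positive comments / not commenting / negative comments;
    d_i = 1 means person i is marketed to, d_i = 2 means not.
    """
    if d_i == 1:                      # marketed individuals receive reward 1
        return 1.0
    top, bottom = r, 1.0 + r / 10.0
    # Linear decrease from r (x_i = 1) to 1 + r/10 (x_i = 3);
    # the x_i = 2 value is assumed to be the midpoint.
    return top - (x_i - 1) * (top - bottom) / 2.0
```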
87 The local transition p(x′i | xN(i), di) is set as in Table 1. [sent-214, score-0.344]
88 For models with 6 nodes, we also compute the expected reward under the optimal global policy π ∗ (x) for comparison. [sent-224, score-0.33]
89 Note that this value overestimates the best possible local policy {π∗i(Γi(x))} being sought by the algorithms; the best local policy is usually much more difficult to compute due to imperfect recall. [sent-225, score-0.464]
90 Since the complexity of the approximate linear programming (ALP) algorithm is exponential in the treewidth of the graph defined by the neighborhoods Γi, we were unable to compute results for models beyond treewidth 6. [sent-226, score-0.203]
91 The tables show that our factored variational value iteration (FVI) algorithm gives policies with higher expected rewards than those of approximate policy iteration (API) on the majority of models (156/196), over all sets of models and different p and q. [sent-227, score-0.807]
92 Compared to approximate linear programming, in addition to being far more scalable, our algorithm performed comparably, giving better policies on just over half of the models (53/96) that ALP could be run on. [sent-228, score-0.111]
93 The average results across all settings are shown in Table 5, along with the relative improvements of our factored variational value iteration algorithm over approximate policy iteration and approximate linear programming. [sent-231, score-0.725]
94 Table 5 shows that our FVI algorithm, compared to approximate policy iteration, gives the best policies on regular models across sizes, and gives better policies than approximate linear programming on chain-like models of small size (6 and 10 nodes). [sent-232, score-0.498]
95 Although on average the approximate linear programming algorithm may provide better policies for large “chain” models, its exponential number of constraints makes it infeasible for general large-scale graph-based MDPs. [sent-233, score-0.154]
96 7 Conclusions In this paper, we have proposed a variational planning framework for Markov decision processes. [sent-234, score-0.388]
97 We used this framework to develop a factored variational value iteration algorithm that exploits the structure of the graph-based MDP to give efficient and accurate approximations, scales easily to large systems, and produces better policies than existing approaches. [sent-235, score-0.423]
98 Approximate linear-programming algorithms for graph-based Markov decision processes. [sent-618, score-0.183]
99 Mean field approximation of the policy iteration algorithm for graph-based Markov decision processes. [sent-644, score-0.455]
100 Advances in decision analysis: from foundations to applications, pages 177–201, 2007. [sent-660, score-0.144]
wordName wordTfidf (topN-words)
[('mdps', 0.286), ('fvi', 0.284), ('zck', 0.252), ('ck', 0.249), ('xck', 0.236), ('alp', 0.232), ('api', 0.219), ('mdp', 0.204), ('di', 0.2), ('policy', 0.2), ('sabbadin', 0.197), ('dck', 0.161), ('factored', 0.152), ('decision', 0.144), ('viral', 0.137), ('variational', 0.135), ('ihler', 0.134), ('reward', 0.13), ('yck', 0.126), ('marketing', 0.124), ('planning', 0.109), ('crop', 0.107), ('guestrin', 0.089), ('disease', 0.079), ('iteration', 0.072), ('nath', 0.071), ('peyrard', 0.071), ('liu', 0.069), ('xi', 0.067), ('rewards', 0.065), ('policies', 0.064), ('skl', 0.063), ('ri', 0.059), ('hck', 0.058), ('qiang', 0.058), ('bellman', 0.057), ('cluster', 0.056), ('management', 0.054), ('fallow', 0.054), ('forsell', 0.054), ('diagram', 0.053), ('eld', 0.05), ('diagrams', 0.05), ('domingos', 0.05), ('decisions', 0.049), ('maxd', 0.047), ('approximate', 0.047), ('ti', 0.047), ('transition', 0.045), ('programming', 0.043), ('uence', 0.039), ('loop', 0.039), ('graph', 0.039), ('markov', 0.039), ('treewidth', 0.037), ('belief', 0.037), ('discounted', 0.036), ('propagation', 0.036), ('sigaud', 0.036), ('uninfected', 0.036), ('vck', 0.036), ('zskl', 0.036), ('daphne', 0.035), ('person', 0.034), ('regular', 0.033), ('wainwright', 0.033), ('koller', 0.033), ('local', 0.032), ('ronald', 0.032), ('gis', 0.032), ('parr', 0.032), ('infection', 0.032), ('bahar', 0.032), ('boutilier', 0.032), ('hoey', 0.032), ('horizon', 0.031), ('factors', 0.031), ('et', 0.031), ('exponentially', 0.03), ('neighborhood', 0.03), ('kl', 0.029), ('barber', 0.029), ('tsinghua', 0.029), ('elds', 0.029), ('exp', 0.028), ('independence', 0.028), ('solved', 0.028), ('arti', 0.028), ('clusters', 0.027), ('craig', 0.027), ('structured', 0.027), ('puterman', 0.026), ('intelligence', 0.026), ('topology', 0.026), ('represented', 0.026), ('nite', 0.025), ('alexander', 0.025), ('alg', 0.025), ('comments', 0.025), ('state', 0.024)]
simIndex simValue paperId paperTitle
same-paper 1 0.99999976 347 nips-2013-Variational Planning for Graph-based MDPs
Author: Qiang Cheng, Qiang Liu, Feng Chen, Alex Ihler
Abstract: Markov Decision Processes (MDPs) are extremely useful for modeling and solving sequential decision making problems. Graph-based MDPs provide a compact representation for MDPs with large numbers of random variables. However, the complexity of exactly solving a graph-based MDP usually grows exponentially in the number of variables, which limits their application. We present a new variational framework to describe and solve the planning problem of MDPs, and derive both exact and approximate planning algorithms. In particular, by exploiting the graph structure of graph-based MDPs, we propose a factored variational value iteration algorithm in which the value function is first approximated by the multiplication of local-scope value functions, then solved by minimizing a Kullback-Leibler (KL) divergence. The KL divergence is optimized using the belief propagation algorithm, with complexity exponential in only the cluster size of the graph. Experimental comparison on different models shows that our algorithm outperforms existing approximation algorithms at finding good policies. 1
2 0.20412378 273 nips-2013-Reinforcement Learning in Robust Markov Decision Processes
Author: Shiau Hong Lim, Huan Xu, Shie Mannor
Abstract: An important challenge in Markov decision processes is to ensure robustness with respect to unexpected or adversarial system behavior while taking advantage of well-behaving parts of the system. We consider a problem setting where some unknown parts of the state space can have arbitrary transitions while other parts are purely stochastic. We devise an algorithm that is adaptive to potentially adversarial behavior and show that it achieves similar regret bounds as the purely stochastic case. 1
3 0.18949383 28 nips-2013-Adaptive Step-Size for Policy Gradient Methods
Author: Matteo Pirotta, Marcello Restelli, Luca Bascetta
Abstract: In the last decade, policy gradient methods have significantly grown in popularity in the reinforcement–learning field. In particular, they have been largely employed in motor control and robotic applications, thanks to their ability to cope with continuous state and action domains and partial observable problems. Policy gradient researches have been mainly focused on the identification of effective gradient directions and the proposal of efficient estimation algorithms. Nonetheless, the performance of policy gradient methods is determined not only by the gradient direction, since convergence properties are strongly influenced by the choice of the step size: small values imply slow convergence rate, while large values may lead to oscillations or even divergence of the policy parameters. Step–size value is usually chosen by hand tuning and still little attention has been paid to its automatic selection. In this paper, we propose to determine the learning rate by maximizing a lower bound to the expected performance gain. Focusing on Gaussian policies, we derive a lower bound that is second–order polynomial of the step size, and we show how a simplified version of such lower bound can be maximized when the gradient is estimated from trajectory samples. The properties of the proposed approach are empirically evaluated in a linear–quadratic regulator problem. 1
4 0.18405393 239 nips-2013-Optimistic policy iteration and natural actor-critic: A unifying view and a non-optimality result
Author: Paul Wagner
Abstract: Approximate dynamic programming approaches to the reinforcement learning problem are often categorized into greedy value function methods and value-based policy gradient methods. As our first main result, we show that an important subset of the latter methodology is, in fact, a limiting special case of a general formulation of the former methodology; optimistic policy iteration encompasses not only most of the greedy value function methods but also natural actor-critic methods, and permits one to directly interpolate between them. The resulting continuum adjusts the strength of the Markov assumption in policy improvement and, as such, can be seen as dual in spirit to the continuum in TD(λ)-style algorithms in policy evaluation. As our second main result, we show for a substantial subset of softgreedy value function approaches that, while having the potential to avoid policy oscillation and policy chattering, this subset can never converge toward an optimal policy, except in a certain pathological case. Consequently, in the context of approximations (either in state estimation or in value function representation), the majority of greedy value function methods seem to be deemed to suffer either from the risk of oscillation/chattering or from the presence of systematic sub-optimality. 1
5 0.17622198 24 nips-2013-Actor-Critic Algorithms for Risk-Sensitive MDPs
Author: Prashanth L.A., Mohammad Ghavamzadeh
Abstract: In many sequential decision-making problems we may want to manage risk by minimizing some measure of variability in rewards in addition to maximizing a standard criterion. Variance-related risk measures are among the most common risk-sensitive criteria in finance and operations research. However, optimizing many such criteria is known to be a hard problem. In this paper, we consider both discounted and average reward Markov decision processes. For each formulation, we first define a measure of variability for a policy, which in turn gives us a set of risk-sensitive criteria to optimize. For each of these criteria, we derive a formula for computing its gradient. We then devise actor-critic algorithms for estimating the gradient and updating the policy parameters in the ascent direction. We establish the convergence of our algorithms to locally risk-sensitive optimal policies. Finally, we demonstrate the usefulness of our algorithms in a traffic signal control application. 1
6 0.17428015 32 nips-2013-Aggregating Optimistic Planning Trees for Solving Markov Decision Processes
7 0.17008732 322 nips-2013-Symbolic Opportunistic Policy Iteration for Factored-Action MDPs
9 0.15329464 50 nips-2013-Bayesian Mixture Modelling and Inference based Thompson Sampling in Monte-Carlo Tree Search
10 0.15273553 140 nips-2013-Improved and Generalized Upper Bounds on the Complexity of Policy Iteration
11 0.15108097 348 nips-2013-Variational Policy Search via Trajectory Optimization
12 0.14067467 241 nips-2013-Optimizing Instructional Policies
13 0.13834633 79 nips-2013-DESPOT: Online POMDP Planning with Regularization
14 0.13641123 278 nips-2013-Reward Mapping for Transfer in Long-Lived Agents
15 0.12922439 150 nips-2013-Learning Adaptive Value of Information for Structured Prediction
16 0.12804979 1 nips-2013-(More) Efficient Reinforcement Learning via Posterior Sampling
17 0.11757749 165 nips-2013-Learning from Limited Demonstrations
18 0.11205296 229 nips-2013-Online Learning of Nonparametric Mixture Models via Sequential Variational Approximation
19 0.10731881 235 nips-2013-Online learning in episodic Markovian decision processes by relative entropy policy search
20 0.10694963 55 nips-2013-Bellman Error Based Feature Generation using Random Projections on Sparse Spaces
topicId topicWeight
[(0, 0.205), (1, -0.253), (2, -0.144), (3, 0.123), (4, -0.004), (5, 0.069), (6, 0.044), (7, 0.022), (8, 0.023), (9, 0.013), (10, 0.054), (11, 0.005), (12, 0.071), (13, 0.032), (14, 0.033), (15, 0.004), (16, 0.072), (17, -0.001), (18, 0.009), (19, -0.058), (20, 0.057), (21, 0.001), (22, 0.014), (23, 0.028), (24, -0.021), (25, -0.007), (26, -0.023), (27, -0.035), (28, -0.017), (29, 0.023), (30, 0.017), (31, 0.016), (32, -0.027), (33, 0.101), (34, 0.002), (35, 0.013), (36, 0.018), (37, 0.062), (38, -0.056), (39, 0.049), (40, -0.048), (41, -0.044), (42, -0.084), (43, 0.045), (44, -0.054), (45, 0.021), (46, 0.048), (47, 0.041), (48, -0.064), (49, -0.041)]
simIndex simValue paperId paperTitle
same-paper 1 0.95165735 347 nips-2013-Variational Planning for Graph-based MDPs
Author: Qiang Cheng, Qiang Liu, Feng Chen, Alex Ihler
Abstract: Markov Decision Processes (MDPs) are extremely useful for modeling and solving sequential decision making problems. Graph-based MDPs provide a compact representation for MDPs with large numbers of random variables. However, the complexity of exactly solving a graph-based MDP usually grows exponentially in the number of variables, which limits their application. We present a new variational framework to describe and solve the planning problem of MDPs, and derive both exact and approximate planning algorithms. In particular, by exploiting the graph structure of graph-based MDPs, we propose a factored variational value iteration algorithm in which the value function is first approximated by the multiplication of local-scope value functions, then solved by minimizing a Kullback-Leibler (KL) divergence. The KL divergence is optimized using the belief propagation algorithm, with complexity exponential in only the cluster size of the graph. Experimental comparison on different models shows that our algorithm outperforms existing approximation algorithms at finding good policies. 1
2 0.80498421 322 nips-2013-Symbolic Opportunistic Policy Iteration for Factored-Action MDPs
Author: Aswin Raghavan, Roni Khardon, Alan Fern, Prasad Tadepalli
Abstract: This paper addresses the scalability of symbolic planning under uncertainty with factored states and actions. Our first contribution is a symbolic implementation of Modified Policy Iteration (MPI) for factored actions that views policy evaluation as policy-constrained value iteration (VI). Unfortunately, a naïve approach to enforce policy constraints can lead to large memory requirements, sometimes making symbolic MPI worse than VI. We address this through our second and main contribution, symbolic Opportunistic Policy Iteration (OPI), which is a novel convergent algorithm lying between VI and MPI, that applies policy constraints if it does not increase the size of the value function representation, and otherwise performs VI backups. We also give a memory bounded version of this algorithm allowing a space-time tradeoff. Empirical results show significantly improved scalability over state-of-the-art symbolic planners. 1
3 0.76781088 273 nips-2013-Reinforcement Learning in Robust Markov Decision Processes
Author: Shiau Hong Lim, Huan Xu, Shie Mannor
Abstract: An important challenge in Markov decision processes is to ensure robustness with respect to unexpected or adversarial system behavior while taking advantage of well-behaving parts of the system. We consider a problem setting where some unknown parts of the state space can have arbitrary transitions while other parts are purely stochastic. We devise an algorithm that is adaptive to potentially adversarial behavior and show that it achieves similar regret bounds as the purely stochastic case. 1
4 0.74821037 140 nips-2013-Improved and Generalized Upper Bounds on the Complexity of Policy Iteration
Author: Bruno Scherrer
Abstract: Given a Markov Decision Process (MDP) with n states and m actions per state, we study the number of iterations needed by Policy Iteration (PI) algorithms to converge to the optimal γ-discounted optimal policy. We consider two variations of PI: Howard's PI that changes the actions in all states with a positive advantage, and Simplex-PI that only changes the action in the state with maximal advantage. We show that Howard's PI terminates after at most n(m − 1)⌈(1/(1 − γ)) log(1/(1 − γ))⌉ = O((nm/(1 − γ)) log(1/(1 − γ))) iterations, improving by a factor O(log n) a result by [3], while Simplex-PI terminates after at most n²(m − 1)(1 + (2/(1 − γ)) log(1/(1 − γ))) = O((n²m/(1 − γ)) log(1/(1 − γ))) iterations, improving by a factor O(log n) a result by [11]. Under some structural assumptions of the MDP, we then consider bounds that are independent of the discount factor γ: given a measure of the maximal transient time τt and the maximal time τr to revisit states in recurrent classes under all policies, we show that Simplex-PI terminates after at most n²(m − 1)(⌈τr log(nτr)⌉ + ⌈τr log(nτt)⌉)[(m − 1)⌈nτt log(nτt)⌉ + ⌈nτt log(n²τt)⌉] = …
5 0.72210568 32 nips-2013-Aggregating Optimistic Planning Trees for Solving Markov Decision Processes
Author: Gunnar Kedenburg, Raphael Fonteneau, Remi Munos
Abstract: This paper addresses the problem of online planning in Markov decision processes using a randomized simulator, under a budget constraint. We propose a new algorithm which is based on the construction of a forest of planning trees, where each tree corresponds to a random realization of the stochastic environment. The trees are constructed using a “safe” optimistic planning strategy combining the optimistic principle (in order to explore the most promising part of the search space first) with a safety principle (which guarantees a certain amount of uniform exploration). In the decision-making step of the algorithm, the individual trees are aggregated and an immediate action is recommended. We provide a finite-sample analysis and discuss the trade-off between the principles of optimism and safety. We also report numerical results on a benchmark problem. Our algorithm performs as well as state-of-the-art optimistic planning algorithms, and better than a related algorithm which additionally assumes the knowledge of all transition distributions. 1
6 0.71531314 50 nips-2013-Bayesian Mixture Modelling and Inference based Thompson Sampling in Monte-Carlo Tree Search
7 0.70909077 79 nips-2013-DESPOT: Online POMDP Planning with Regularization
8 0.6904617 24 nips-2013-Actor-Critic Algorithms for Risk-Sensitive MDPs
9 0.68506449 28 nips-2013-Adaptive Step-Size for Policy Gradient Methods
10 0.68007392 270 nips-2013-Regret based Robust Solutions for Uncertain Markov Decision Processes
11 0.67761731 239 nips-2013-Optimistic policy iteration and natural actor-critic: A unifying view and a non-optimality result
12 0.664478 1 nips-2013-(More) Efficient Reinforcement Learning via Posterior Sampling
13 0.63617271 38 nips-2013-Approximate Dynamic Programming Finally Performs Well in the Game of Tetris
14 0.63222617 241 nips-2013-Optimizing Instructional Policies
15 0.60106695 348 nips-2013-Variational Policy Search via Trajectory Optimization
16 0.57172418 257 nips-2013-Projected Natural Actor-Critic
17 0.55358088 248 nips-2013-Point Based Value Iteration with Optimal Belief Compression for Dec-POMDPs
18 0.54448336 103 nips-2013-Efficient Exploration and Value Function Generalization in Deterministic Systems
19 0.539424 150 nips-2013-Learning Adaptive Value of Information for Structured Prediction
20 0.53042012 165 nips-2013-Learning from Limited Demonstrations
topicId topicWeight
[(2, 0.011), (16, 0.039), (33, 0.08), (34, 0.162), (36, 0.012), (41, 0.034), (49, 0.043), (56, 0.084), (70, 0.02), (75, 0.312), (85, 0.065), (89, 0.018), (93, 0.029), (95, 0.019)]
simIndex simValue paperId paperTitle
same-paper 1 0.74873793 347 nips-2013-Variational Planning for Graph-based MDPs
Author: Qiang Cheng, Qiang Liu, Feng Chen, Alex Ihler
Abstract: Markov Decision Processes (MDPs) are extremely useful for modeling and solving sequential decision making problems. Graph-based MDPs provide a compact representation for MDPs with large numbers of random variables. However, the complexity of exactly solving a graph-based MDP usually grows exponentially in the number of variables, which limits their application. We present a new variational framework to describe and solve the planning problem of MDPs, and derive both exact and approximate planning algorithms. In particular, by exploiting the graph structure of graph-based MDPs, we propose a factored variational value iteration algorithm in which the value function is first approximated by the multiplication of local-scope value functions, then solved by minimizing a Kullback-Leibler (KL) divergence. The KL divergence is optimized using the belief propagation algorithm, with complexity exponential in only the cluster size of the graph. Experimental comparison on different models shows that our algorithm outperforms existing approximation algorithms at finding good policies. 1
2 0.69086069 21 nips-2013-Action from Still Image Dataset and Inverse Optimal Control to Learn Task Specific Visual Scanpaths
Author: Stefan Mathe, Cristian Sminchisescu
Abstract: Human eye movements provide a rich source of information into the human visual information processing. The complex interplay between the task and the visual stimulus is believed to determine human eye movements, yet it is not fully understood, making it difficult to develop reliable eye movement prediction systems. Our work makes three contributions towards addressing this problem. First, we complement one of the largest and most challenging static computer vision datasets, VOC 2012 Actions, with human eye movement recordings collected under the primary task constraint of action recognition, as well as, separately, for context recognition, in order to analyze the impact of different tasks. Our dataset is unique among the eyetracking datasets of still images in terms of large scale (over 1 million fixations recorded in 9157 images) and different task controls. Second, we propose Markov models to automatically discover areas of interest (AOI) and introduce novel sequential consistency metrics based on them. Our methods can automatically determine the number, the spatial support and the transitions between AOIs, in addition to their locations. Based on such encodings, we quantitatively show that given unconstrained read-world stimuli, task instructions have significant influence on the human visual search patterns and are stable across subjects. Finally, we leverage powerful machine learning techniques and computer vision features in order to learn task-sensitive reward functions from eye movement data within models that allow to effectively predict the human visual search patterns based on inverse optimal control. The methodology achieves state of the art scanpath modeling results. 1
3 0.62129897 302 nips-2013-Sparse Inverse Covariance Estimation with Calibration
Author: Tuo Zhao, Han Liu
Abstract: We propose a semiparametric method for estimating sparse precision matrix of high dimensional elliptical distribution. The proposed method calibrates regularizations when estimating each column of the precision matrix. Thus it not only is asymptotically tuning free, but also achieves an improved finite sample performance. Theoretically, we prove that the proposed method achieves the parametric rates of convergence in both parameter estimation and model selection. We present numerical results on both simulated and real datasets to support our theory and illustrate the effectiveness of the proposed estimator. 1
4 0.61838174 260 nips-2013-RNADE: The real-valued neural autoregressive density-estimator
Author: Benigno Uria, Iain Murray, Hugo Larochelle
Abstract: We introduce RNADE, a new model for joint density estimation of real-valued vectors. Our model calculates the density of a datapoint as the product of onedimensional conditionals modeled using mixture density networks with shared parameters. RNADE learns a distributed representation of the data, while having a tractable expression for the calculation of densities. A tractable likelihood allows direct comparison with other methods and training by standard gradientbased optimizers. We compare the performance of RNADE on several datasets of heterogeneous and perceptual data, finding it outperforms mixture models in all but one case. 1
5 0.61670697 166 nips-2013-Learning invariant representations and applications to face verification
Author: Qianli Liao, Joel Z. Leibo, Tomaso Poggio
Abstract: One approach to computer object recognition and modeling the brain’s ventral stream involves unsupervised learning of representations that are invariant to common transformations. However, applications of these ideas have usually been limited to 2D affine transformations, e.g., translation and scaling, since they are easiest to solve via convolution. In accord with a recent theory of transformationinvariance [1], we propose a model that, while capturing other common convolutional networks as special cases, can also be used with arbitrary identitypreserving transformations. The model’s wiring can be learned from videos of transforming objects—or any other grouping of images into sets by their depicted object. Through a series of successively more complex empirical tests, we study the invariance/discriminability properties of this model with respect to different transformations. First, we empirically confirm theoretical predictions (from [1]) for the case of 2D affine transformations. Next, we apply the model to non-affine transformations; as expected, it performs well on face verification tasks requiring invariance to the relatively smooth transformations of 3D rotation-in-depth and changes in illumination direction. Surprisingly, it can also tolerate clutter “transformations” which map an image of a face on one background to an image of the same face on a different background. Motivated by these empirical findings, we tested the same model on face verification benchmark tasks from the computer vision literature: Labeled Faces in the Wild, PubFig [2, 3, 4] and a new dataset we gathered—achieving strong performance in these highly unconstrained cases as well. 1
6 0.55373251 122 nips-2013-First-order Decomposition Trees
7 0.55343378 322 nips-2013-Symbolic Opportunistic Policy Iteration for Factored-Action MDPs
8 0.55050832 210 nips-2013-Noise-Enhanced Associative Memories
9 0.54607558 202 nips-2013-Multiclass Total Variation Clustering
10 0.54271919 143 nips-2013-Integrated Non-Factorized Variational Inference
11 0.54233652 38 nips-2013-Approximate Dynamic Programming Finally Performs Well in the Game of Tetris
12 0.54184884 48 nips-2013-Bayesian Inference and Learning in Gaussian Process State-Space Models with Particle MCMC
13 0.54117 278 nips-2013-Reward Mapping for Transfer in Long-Lived Agents
14 0.54050219 39 nips-2013-Approximate Gaussian process inference for the drift function in stochastic differential equations
15 0.53816015 238 nips-2013-Optimistic Concurrency Control for Distributed Unsupervised Learning
16 0.53483599 346 nips-2013-Variational Inference for Mahalanobis Distance Metrics in Gaussian Process Regression
17 0.53469813 86 nips-2013-Demixing odors - fast inference in olfaction
18 0.53384966 219 nips-2013-On model selection consistency of penalized M-estimators: a geometric theory
19 0.53228897 129 nips-2013-Generalized Random Utility Models with Multiple Types
20 0.531551 104 nips-2013-Efficient Online Inference for Bayesian Nonparametric Relational Models