37 nips-2008-Biasing Approximate Dynamic Programming with a Lower Discount Factor

Source: pdf

Author: Marek Petrik, Bruno Scherrer

Abstract: Most algorithms for solving Markov decision processes rely on a discount factor, which ensures their convergence. It is generally assumed that using an artificially low discount factor will improve the convergence rate, while sacrificing the solution quality. We however demonstrate that using an artificially low discount factor may significantly improve the solution quality, when used in approximate dynamic programming. We propose two explanations of this phenomenon. The first justification follows directly from the standard approximation error bounds: using a lower discount factor may decrease the approximation error bounds. However, we also show that these bounds are loose, thus their decrease does not entirely justify the improved solution quality. We thus propose another justification: when the rewards are received only sporadically (as in the case of Tetris), we can derive tighter bounds, which support a significant improvement in the solution quality with a decreased discount factor.

Reference: text

Summary: the most important sentences generated by tfidf model

sentIndex sentText sentNum sentScore

1 Most algorithms for solving Markov decision processes rely on a discount factor, which ensures their convergence. [sent-7, score-0.829]

2 It is generally assumed that using an artiﬁcially low discount factor will improve the convergence rate, while sacriﬁcing the solution quality. [sent-8, score-0.97]

3 We however demonstrate that using an artiﬁcially low discount factor may signiﬁcantly improve the solution quality, when used in approximate dynamic programming. [sent-9, score-1.08]

4 The ﬁrst justiﬁcation follows directly from the standard approximation error bounds: using a lower discount factor may decrease the approximation error bounds. [sent-11, score-1.438]

5 We thus propose another justiﬁcation: when the rewards are received only sporadically (as in the case of Tetris), we can derive tighter bounds, which support a signiﬁcant improvement in the solution quality with a decreased discount factor. [sent-13, score-1.086]

6 One such important parameter of MDPs is the discount factor γ. [sent-16, score-0.97]

7 The motivation for the discount factor originally comes from economic models, but it often has no meaning in reinforcement learning problems. [sent-18, score-0.998]

8 In addition to regularizing the rewards, using an artiﬁcially low discount factor sometimes has a signiﬁcant effect on the performance of the approximate algorithms. [sent-22, score-1.025]

9 The natural discount factor in Tetris is 1, since the received rewards have the same importance, independently of when received. [sent-24, score-1.207]

10 Our results, depicted in Figure 1, with approximate value iteration and standard features [1] show that setting the discount factor to γ ∈ (0. [sent-26, score-1.128]

11 That is ﬁve times the performance with discount factor of γ = 1 (about 4000). [sent-29, score-0.97]

12 88) is surprising, since computing a policy for this discount factor dramatically improves the return calculated with γ = 1. [sent-32, score-1.143]

13 Figure 1: Performance of approximate value iteration on Tetris with different discount factors (horizontal axis: Iterations). [sent-39, score-0.922]

14 In this paper, we study why using a lower discount factor improves the quality of the solution with regard to a higher discount factor. [sent-41, score-1.835]

15 In Section 3 we analyze the inﬂuence of the discount factor on the standard approximation error bounds [2]. [sent-43, score-1.245]

16 Then in Section 4 we argue that, in the context of this paper, the existing approximation error bounds are loose. [sent-44, score-0.275]

17 Though these bounds may be tightened by a lower discount factor, they are not sufﬁcient to explain the improved performance. [sent-45, score-0.979]

18 In particular, the rewards in Tetris are received sparsely, unlike the approximation error, which makes the value function less sensitive to the discount factor than the approximation error. [sent-47, score-1.439]

19 2 Framework and Notations In this section we formalize the problem of adjusting the discount factor in approximate dynamic programming. [sent-48, score-1.08]

20 Tetris does not directly ﬁt in this class, since its natural discount factor is 1. [sent-50, score-0.97]

21 We therefore treat Tetris as a discounted problem with a discount factor γ ∗ < 1 near one. [sent-53, score-0.988]

22 For the sake of simplicity, we also assume that the rewards are non-negative; our analysis can be extended to arbitrary rewards in a straightforward way. [sent-59, score-0.388]

23 Given a Markov decision process (S, A, P, r) and some discount factor γ, the objective is to ﬁnd a policy, i. [sent-61, score-0.993]

24 We consider in this paper that the MDP is solved with 1) an approximate dynamic programming algorithm and 2) a different discount factor β < γ. [sent-70, score-1.12]

25 In particular, our analysis applies to approximate value and policy iteration with existing error bounds. [sent-71, score-0.375]

26 Then, $\pi_\beta$ is a policy greedy with regard to the approximate value function $\tilde v_\beta$. [sent-73, score-0.275]

27 As we have two different discount factors, we use a subscript to denote the discount factor used in calculating the value. [sent-74, score-1.76]

28 We use $v_\delta^\pi$ to represent the value of policy π calculated with the discount factor δ; when π is the optimal policy corresponding to the discount δ, we will simply denote its value $v_\delta$. [sent-76, score-2.116]

29 As mentioned above, our objective is to compare, for the discount factor γ, the value $v_\gamma$ of the optimal policy and the value $v_\gamma^{\pi_\beta}$. [sent-77, score-1.187]

30 From the optimality of $v_\gamma$, $v_\gamma \ge v_\gamma^{\pi_\beta}$, and from the non-negativity of the rewards, it is easy to show that the value function is monotonous with respect to the discount factor, and therefore: $v_\gamma^{\pi_\beta} \ge v_\beta^{\pi_\beta}$. [sent-82, score-0.836]

31 where $e_d(\beta) := \|v_\gamma - v_\beta\|_\infty$ denotes the discount error, and $e_a(\beta) := \|v_\beta - v_\beta^{\pi_\beta}\|_\infty$ the approximation error. [sent-84, score-1.031]

32 In other words, a bound of the loss due to using πβ instead of the optimal policy for discount factor γ is the sum of the error on the optimal value function due to the change of discount and the error due to the approximation for discount β. [sent-85, score-3.173]
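
The decomposition described above can be checked numerically on a small problem. The following is a minimal sketch, assuming a hypothetical two-state, two-action MDP with non-negative rewards; the MDP, the function names, and the values γ = 0.95 and β = 0.3 are illustrative assumptions, not taken from the paper. For simplicity the β-problem is solved exactly, so the approximation error is zero here and the whole loss is covered by the discount error.

```python
# Hypothetical toy MDP: P[a][s][t] = transition probability, r[s][a] = reward >= 0.
P = [
    [[1.0, 0.0], [0.0, 1.0]],  # action 0: stay in the current state
    [[0.0, 1.0], [0.0, 1.0]],  # action 1: go to (or stay in) state 1
]
r = [[1.0, 0.0], [2.0, 2.0]]


def q_values(v, disc):
    """One-step lookahead values q[s][a] for value function v."""
    n = len(v)
    return [[r[s][a] + disc * sum(P[a][s][t] * v[t] for t in range(n))
             for a in range(len(P))] for s in range(n)]


def optimal_value(disc, iters=2000):
    """Exact value iteration (no function approximation)."""
    v = [0.0] * len(r)
    for _ in range(iters):
        v = [max(q) for q in q_values(v, disc)]
    return v


def greedy_policy(v, disc):
    """Policy that is greedy with respect to v under discount disc."""
    return [max(range(len(P)), key=lambda a: q[a]) for q in q_values(v, disc)]


def policy_value(pi, disc, iters=2000):
    """Value of a fixed policy pi, computed by iterating its Bellman operator."""
    v = [0.0] * len(r)
    for _ in range(iters):
        q = q_values(v, disc)
        v = [q[s][pi[s]] for s in range(len(r))]
    return v


def sup_norm(a, b):
    return max(abs(x - y) for x, y in zip(a, b))


gamma, beta = 0.95, 0.3
v_gamma = optimal_value(gamma)
v_beta = optimal_value(beta)
pi_beta = greedy_policy(v_beta, beta)        # policy obtained with the lower discount
v_gamma_pi = policy_value(pi_beta, gamma)    # its value under the true discount
v_beta_pi = policy_value(pi_beta, beta)

loss = sup_norm(v_gamma, v_gamma_pi)
e_d = sup_norm(v_gamma, v_beta)              # discount error
e_a = sup_norm(v_beta, v_beta_pi)            # approximation error (zero here)
print(loss, e_d, e_a)
```

In this toy problem the lower discount factor makes the greedy policy stay in the low-reward state, so the loss is strictly positive, yet it remains below the sum of the two error terms.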

33 3 Error Bounds In this section, we develop a discount error bound and overview the existing approximation error bounds. [sent-87, score-1.208]

34 We also show how these bounds motivate decreasing the discount factor in the majority of MDPs. [sent-88, score-1.059]

35 The discount error due to using a discount factor β instead of γ is: $e_d(\beta) = \|v_\gamma - v_\beta\|_\infty \le \frac{\gamma-\beta}{(1-\beta)(1-\gamma)} \|r\|_\infty$. [sent-91, score-1.864]

36 Let Lγ and Lβ be the Bellman operators for the corresponding discount factors. [sent-93, score-0.79]

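37 We have now: $\|v_\gamma - v_\beta\|_\infty = \|L_\gamma v_\gamma - L_\beta v_\beta\|_\infty = \|L_\gamma v_\gamma - L_\beta v_\gamma + L_\beta v_\gamma - L_\beta v_\beta\|_\infty \le \|L_\gamma v_\gamma - L_\beta v_\gamma\|_\infty + \|L_\beta v_\gamma - L_\beta v_\beta\|_\infty \le \|L_\gamma v_\gamma - L_\beta v_\gamma\|_\infty + \beta \|v_\gamma - v_\beta\|_\infty$. Let $P_\gamma, r_\gamma$ and $P_\beta, r_\beta$ be the transition matrices and rewards of policies greedy with regard to $v_\gamma$ for γ and β respectively. [sent-94, score-0.299]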
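
The remaining algebra from this chain of inequalities to the stated discount error bound can be sketched as follows. This is one possible route, assuming only the standard facts that the maximum over actions is non-expansive and that $\|v_\gamma\|_\infty \le \|r\|_\infty/(1-\gamma)$. The paper's own proof proceeds through the transition matrices of the greedy policies and may differ in detail.

```latex
\begin{align*}
(1-\beta)\,\|v_\gamma - v_\beta\|_\infty
  &\le \|L_\gamma v_\gamma - L_\beta v_\gamma\|_\infty
   && \text{(move } \beta\|v_\gamma - v_\beta\|_\infty \text{ to the left)} \\
  &\le (\gamma-\beta)\,\|v_\gamma\|_\infty
   && \text{(the two operators differ only in the discount)} \\
  &\le \frac{\gamma-\beta}{1-\gamma}\,\|r\|_\infty , \\
\text{hence}\quad
\|v_\gamma - v_\beta\|_\infty
  &\le \frac{\gamma-\beta}{(1-\beta)(1-\gamma)}\,\|r\|_\infty .
\end{align*}
```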

38 This bound is trivially tight, that is there exists a problem for which the bound reduces to equality. [sent-98, score-0.258]
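
A quick way to see the tightness claim is a problem with a single state and a constant reward, where both value functions have a closed form. The sketch below assumes such a one-state problem with reward 1; the function names are hypothetical, and this is not necessarily the example the authors had in mind.

```python
def discount_error_bound(gamma, beta, r_max):
    """Upper bound (gamma - beta) / ((1 - beta)(1 - gamma)) * ||r||_inf."""
    return (gamma - beta) / ((1.0 - beta) * (1.0 - gamma)) * r_max


def constant_reward_value(disc, reward):
    """Value of a single-state MDP that pays `reward` in every step."""
    return reward / (1.0 - disc)


gamma, beta, reward = 0.99, 0.9, 1.0
true_error = constant_reward_value(gamma, reward) - constant_reward_value(beta, reward)
bound = discount_error_bound(gamma, beta, reward)
print(true_error, bound)  # the two values agree up to floating-point rounding
```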

39 Approximation Error Bound We now discuss the dependence of the approximation error ea (β) on the discount factor β. [sent-118, score-1.331]

40 Approximate dynamic programming algorithms like approximate value and policy iteration build a sequence of value functions $(\tilde v_\beta^k)_{k \ge 0}$ with $\pi_\beta^k$ being the policy greedy with respect to $\tilde v_\beta^k$. [sent-119, score-0.558]

41 These algorithms are approximate because at each iteration the value $\tilde v_\beta^k$ is an approximation of some target value $v_\beta^k$, which is hard to compute. [sent-120, score-0.263]

42 2 for policy iteration) bounds the loss of using the policies $\pi_\beta^k$ instead of the optimal policy: $\limsup_{k\to\infty} \|v_\beta - v_\beta^{\pi_\beta^k}\|_\infty \le \frac{2\beta}{(1-\beta)^2} \sup_k \|\tilde v_\beta^k - v_\beta^k\|_\infty$. [sent-125, score-0.287]

43 (1) depends on the discount factor, we need to bound the one-step approximation error $\tilde v_\beta^k - v_\beta^k$ in terms of β. [sent-127, score-1.104]

44 There exists ε ∈ (0, 1/2) such that, for all k, the single-step approximation error is bounded by: $\|\tilde v_\beta^k - v_\beta^k\|_\infty \le \frac{\epsilon}{1-\beta} \|r\|_\infty$. [sent-130, score-0.264]

45 Alternatively to Assumption 4, we could assume that the approximation error is constant in the discount factor β, i. [sent-133, score-1.175]

46 (1) that the approximation error $e_a$ is bounded as: $e_a(\beta) \le \frac{2\beta\epsilon}{(1-\beta)^3} \|r\|_\infty$. [sent-142, score-0.504]

47 2 Global Error Bound Using the results above, and considering that Assumption 4 holds, the cumulative error bound when using approximate dynamic programming with a discount factor β < γ is: $e(\beta) = e_a(\beta) + e_d(\beta) \le \frac{\gamma-\beta}{(1-\beta)(1-\gamma)} \|r\|_\infty + \frac{2\beta\epsilon}{(1-\beta)^3} \|r\|_\infty$. [sent-144, score-1.493]

48 An example of this error bound is shown in Figure 2: the bound is minimized for β 0. [sent-145, score-0.322]

49 This is because the approximation error decreases rapidly in comparison with the increasing discount error. [sent-147, score-1.012]

50 If the approximation factor ε introduced in Assumption 4 is sufficiently large, precisely if $\epsilon > (1-\gamma)^2 / (2(1+2\gamma))$, then the best error bound e(β) will be achieved for the discount factor $\beta = (2\epsilon+1) - \sqrt{(2\epsilon+1)^2 + (2\epsilon-1)} < \gamma$. [sent-150, score-1.464]
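
The closed-form minimizer can be cross-checked against a brute-force search over the global error bound. The sketch below writes the approximation factor of Assumption 4 as `eps` and uses the bound $e(\beta) \le \frac{\gamma-\beta}{(1-\beta)(1-\gamma)}\|r\|_\infty + \frac{2\beta\epsilon}{(1-\beta)^3}\|r\|_\infty$. The values γ = 0.99 and ε = 0.01, the function names, and the grid resolution are illustrative assumptions.

```python
import math


def global_error_bound(beta, gamma, eps, r_max=1.0):
    """Sum of the discount error bound and the approximation error bound."""
    discount_term = (gamma - beta) / ((1.0 - beta) * (1.0 - gamma)) * r_max
    approx_term = 2.0 * beta * eps / (1.0 - beta) ** 3 * r_max
    return discount_term + approx_term


def closed_form_beta(eps):
    """Minimizer (2 eps + 1) - sqrt((2 eps + 1)^2 + (2 eps - 1)) of the bound."""
    a = 2.0 * eps + 1.0
    return a - math.sqrt(a * a + (2.0 * eps - 1.0))


gamma, eps = 0.99, 0.01
# Brute-force search on a uniform grid over [0, gamma].
grid = [gamma * i / 10000 for i in range(10001)]
beta_grid = min(grid, key=lambda b: global_error_bound(b, gamma, eps))
beta_star = closed_form_beta(eps)
print(beta_star, beta_grid)
```

Note that the closed-form minimizer does not depend on γ. The condition on ε only decides whether the minimizer falls below γ.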

51 4 Bound Tightness We show in this section that the bounds on the approximation error ea (β) are very loose for β → 1 and thus the analysis above does not fully explain the improved performance. [sent-162, score-0.513]

52 In particular, there exists a naive bound on the approximation error that is dramatically tighter than the standard bounds when β is close to 1. [sent-163, score-0.461]

53 The functions may depend on the discount factor, but we omit that to simplify the notation. [sent-168, score-0.79]

54 Lemma 7 implies that for every MDP, there exists a discount factor β, such that Eq. [sent-171, score-1.01]

55 Consider even that the single-step approximation error is bounded by a constant, such that $\limsup_{k\to\infty} \|\tilde v_\beta^k - v_\beta^k\|_\infty \le \epsilon$. [sent-173, score-0.258]

56 Such a bound implies that: $e_a(\beta) \le 2\beta\epsilon/(1-\beta)^2$. [sent-175, score-0.249]

57 Thus we have that there exists β < 1 for which the standard approximation error bounds are loose, whenever ε > 0. [sent-177, score-0.315]

58 The looseness of the bound will be more apparent in problems with high discount factors. [sent-178, score-0.974]

59 For example in the MDP formulation of Blackjack [5] the discount factor γ = 0. [sent-179, score-0.97]

60 999, in which case the error bound may overestimate the true error by a factor up to 1/(1 − γ) = 1000. [sent-180, score-0.513]

61 The looseness of the approximation error bounds may seem to contradict Example 6. [sent-181, score-0.366]

62 [Figure residue; recoverable axis labels: Bellman error, Bellman error / true error, $\|a - b\|_\infty$.] [sent-185, score-0.312]

63 ﬁxed rewards and number of states, while the example in [2] assumes that the reward depends on the discount factor and the number of states is potentially inﬁnite. [sent-204, score-1.276]

64 4 shows that for any discount factor β there exists an MDP (which depends on β) for which the bound Eq. [sent-206, score-1.119]

65 We, on the other hand, show that there does not exist a ﬁxed MDP such that for all discount factor β the bound Eq. [sent-208, score-1.079]

66 Proposition 6 justiﬁes the improved performance with a lower discount factor by a more rapid decrease in ea with β than the increase in ed . [sent-210, score-1.21]

67 The naive bound from Lemma 7 however shows that ea may scale with 1/(1 − β), the same as ed . [sent-211, score-0.249]

68 As a result, while the approximation error will decrease, it may not be sufﬁcient to offset the increase in the discount error. [sent-212, score-1.034]

69 Some of the standard approximation error bound may be tightened by using a lower discount factor. [sent-213, score-1.185]

70 For example, consider the standard a-posteriori approximation error bound for the value function $\tilde v_\beta$ [7]: $\|v_\beta - v_\beta^{\pi_\beta}\|_\infty \le \frac{1}{1-\beta} \|L_\beta \tilde v_\beta - \tilde v_\beta\|_\infty$, where $\pi_\beta$ is greedy with respect to $\tilde v_\beta$. [sent-214, score-0.367]

71 The approximation error bound scales with $\frac{1}{(1-\gamma)^2}$, while the true error scales with $\frac{1}{1-\gamma}$. [sent-218, score-0.45]

72 999, the bound is 1000 times the true error value in this example. [sent-220, score-0.243]

73 The intuitive reason for the looseness of the bound is that the bound treats each state as recurrent, even when it is transient. [sent-221, score-0.308]

74 The global error bound may also be tightened by using a lower discount factor β as follows: $\|v_\gamma - v_\gamma^{\pi_\beta}\|_\infty \le \frac{1}{1-\beta} \|L_\beta \tilde v_\beta - \tilde v_\beta\|_\infty + \frac{\gamma-\beta}{(1-\beta)(1-\gamma)} \|r\|_\infty$. [sent-222, score-1.281]

75 Finding the discount factor β that minimizes this error is difﬁcult, because the function may not be convex or differentiable. [sent-223, score-1.074]

76 The global error bound for the MDP example above is depicted in Figure 5. [sent-225, score-0.256]

77 5 Sparse Rewards In this section, we propose an alternative explanation for the performance improvement in Tetris that does not rely on the loose approximation error bounds. [sent-226, score-0.283]

78 A speciﬁc property of Tetris is that the rewards are not received in every step, i. [sent-227, score-0.237]

79 As a result, the return should be less sensitive to the discount factor than the approximation error. [sent-231, score-1.087]

80 Decreasing the discount factor will thus reduce the approximation error more signiﬁcantly than it increases the discount error. [sent-232, score-1.965]
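
The intuition that sparsely received rewards are less sensitive to the discount factor can be illustrated with a deliberately simplified model. The sketch below assumes a reward of 1 received at steps 0, m, 2m, and so on, which is a hypothetical stand-in for Tetris-like rewards and not the paper's formal sparsity assumption. It compares the change in return when moving from γ to β for dense (m = 1) and sparse (m = 10) rewards.

```python
def periodic_return(disc, period, reward=1.0):
    """Discounted return when `reward` is received at steps 0, period, 2*period, ..."""
    return reward / (1.0 - disc ** period)


def discount_error(gamma, beta, period):
    """Drop in return caused by evaluating with beta instead of gamma."""
    return periodic_return(gamma, period) - periodic_return(beta, period)


gamma, beta = 0.99, 0.9
dense_error = discount_error(gamma, beta, period=1)    # reward (or error) in every step
sparse_error = discount_error(gamma, beta, period=10)  # reward only every 10th step
print(dense_error, sparse_error)
```

With the same maximal reward, the quantity received in every step changes far more than the sparsely received one, which matches the qualitative argument above.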

81 It is possible to prove that the discount error scales with a coefﬁcient that is lower than in Theorem 2: Theorem 10. [sent-253, score-0.935]

82 Let π be the optimal policy for the discount factor γ. [sent-257, score-1.127]

83 Intuitively, the proof is based on “moving” the rewards to earlier steps to obtain a regular rewards structure. [sent-260, score-0.388]

84 To make this model identical to the original formulation, we change the discount factor to $\gamma^{1/2}$. [sent-274, score-0.97]

85 5 ) + ρ, Notice that ρ is a constant; it is independent of the new discount factor β. [sent-279, score-0.97]

86 The sparse rewards property can now be used to motivate the performance increase, even if the approximation error is bounded by ε/(1 − β) instead of by ε/(1 − β)^3 (as Lemma 7 suggests). [sent-280, score-0.453]

87 The approximation error bound will not, in most cases, satisfy the sparsity assumption, as the errors are typically distributed almost uniformly over the state space and are therefore incurred in every step. [sent-281, score-0.387]

88 Therefore, for sparse rewards, the discount error increase will typically be offset by the larger decrease in the approximation error. [sent-282, score-1.083]

89 The cumulative error bounds derived above predict it is beneﬁcial to reduce the discount factor to β when: (γ 2. [sent-283, score-1.181]

90 5 ) 1−β 1−γ The effective discount factor γ ∗ in Tetris is not known, but consider for example that it is γ ∗ = 0. [sent-288, score-0.97]

91 We show the comparison of the effects of a lower discount factor of these two examples in Figure 6. [sent-298, score-0.995]

92 The dotted line represents the global error with sparse rewards, and the solid line represents the cumulative error with dense rewards. [sent-299, score-0.261]

93 Sparsity of rewards makes a decrease of the discount factor more interesting. [sent-300, score-1.197]

94 6 Conclusion and Future Work We show in this paper that some common approximation error bounds may be tightened with a lower discount factor. [sent-301, score-1.146]

95 We also identiﬁed a class of problems in which a lower discount factor is likely to increase the performance of approximate dynamic programming algorithms. [sent-302, score-1.166]

96 In particular, these are problems in which the rewards are received relatively sparsely. [sent-303, score-0.237]

97 We concentrated on a theoretical analysis of the inﬂuence of the discount factor, not on the speciﬁc methods which could be used to determine a discount factor. [sent-304, score-1.58]

98 The actual dependence of the performance on the discount factor may be non-trivial, and therefore hard to predict based on simple bounds. [sent-305, score-1.003]

99 Therefore, the most practical approach is to ﬁrst predict an improving discount factor based on the theoretical predictions, and then use line search to ﬁnd a discount factor that ensures good performance. [sent-306, score-1.957]

100 This is possible since the discount factor is a single-dimensional variable with a limited range. [sent-307, score-0.97]
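
The two-stage procedure suggested above (start from a theoretically predicted discount factor, then refine it by line search) could look roughly like the following. This is a minimal sketch of a simple hill-climbing line search, not the authors' procedure. `evaluate` is a hypothetical callback that would run approximate dynamic programming with the candidate discount factor and return the resulting policy's performance; here it is replaced by a toy function so the example runs on its own.

```python
def line_search_discount(evaluate, beta_init, lo=0.0, hi=1.0,
                         step=0.05, shrink=0.5, rounds=4):
    """Hill-climb over the discount factor, halving the step after each round."""
    best_beta, best_score = beta_init, evaluate(beta_init)
    for _ in range(rounds):
        improved = True
        while improved:
            improved = False
            # Try one step to the left and one to the right of the current best.
            for cand in (best_beta - step, best_beta + step):
                if lo <= cand < hi:
                    score = evaluate(cand)
                    if score > best_score:
                        best_beta, best_score, improved = cand, score, True
        step *= shrink
    return best_beta, best_score


# Toy stand-in for "performance of the policy computed with discount beta"
# (hypothetical; peaks at beta = 0.83).
def toy_performance(beta):
    return -(beta - 0.83) ** 2


best_beta, best_score = line_search_discount(toy_performance, beta_init=0.7)
print(best_beta, best_score)
```

Because the search space is a single bounded interval, even this crude search needs only a handful of evaluations. In practice each evaluation is expensive and noisy, so a coarser grid or repeated runs per candidate may be preferable.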

similar papers computed by tfidf model

tfidf for this paper:

wordName wordTfidf (topN-words)

[('discount', 0.79), ('tetris', 0.28), ('rewards', 0.194), ('factor', 0.18), ('ea', 0.14), ('policy', 0.139), ('bellman', 0.114), ('im', 0.113), ('bound', 0.109), ('jm', 0.106), ('error', 0.104), ('approximation', 0.101), ('mdp', 0.093), ('reward', 0.088), ('looseness', 0.075), ('bounds', 0.07), ('loose', 0.06), ('tightened', 0.056), ('dynamic', 0.055), ('approximate', 0.055), ('ri', 0.049), ('iz', 0.049), ('iteration', 0.047), ('proposition', 0.045), ('received', 0.043), ('programming', 0.04), ('exists', 0.04), ('injective', 0.04), ('dimitri', 0.037), ('petrik', 0.037), ('lim', 0.034), ('horizon', 0.034), ('decrease', 0.033), ('inductive', 0.032), ('cially', 0.032), ('value', 0.03), ('bertsekas', 0.03), ('undiscounted', 0.03), ('lemma', 0.029), ('remark', 0.029), ('transition', 0.028), ('amherst', 0.028), ('regard', 0.028), ('reinforcement', 0.028), ('assumption', 0.027), ('policies', 0.026), ('depicted', 0.026), ('lower', 0.025), ('action', 0.025), ('piece', 0.024), ('states', 0.024), ('decision', 0.023), ('greedy', 0.023), ('markov', 0.022), ('quality', 0.022), ('improved', 0.021), ('scienti', 0.021), ('tj', 0.021), ('log', 0.021), ('increase', 0.021), ('mdps', 0.021), ('games', 0.021), ('cumulative', 0.02), ('bounded', 0.019), ('motivate', 0.019), ('massachusetts', 0.019), ('justi', 0.019), ('tighter', 0.019), ('discounted', 0.018), ('optimal', 0.018), ('improvement', 0.018), ('maps', 0.018), ('offset', 0.018), ('dramatically', 0.018), ('decreases', 0.017), ('ti', 0.017), ('explain', 0.017), ('predict', 0.017), ('global', 0.017), ('holds', 0.016), ('supk', 0.016), ('contradict', 0.016), ('lar', 0.016), ('lihong', 0.016), ('monotonous', 0.016), ('overestimate', 0.016), ('sergey', 0.016), ('squaring', 0.016), ('tightens', 0.016), ('transiting', 0.016), ('warren', 0.016), ('sparse', 0.016), ('scales', 0.016), ('dependence', 0.016), ('uence', 0.016), ('processes', 0.016), ('return', 0.016), ('state', 0.015), ('randomized', 0.015), ('sparsity', 0.015)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 1.0000008 37 nips-2008-Biasing Approximate Dynamic Programming with a Lower Discount Factor

Author: Marek Petrik, Bruno Scherrer

2 0.16486971 195 nips-2008-Regularized Policy Iteration

Author: Amir M. Farahmand, Mohammad Ghavamzadeh, Shie Mannor, Csaba Szepesvári

Abstract: In this paper we consider approximate policy-iteration-based reinforcement learning algorithms. In order to implement a flexible function approximation scheme we propose the use of non-parametric methods with regularization, providing a convenient way to control the complexity of the function approximator. We propose two novel regularized policy iteration algorithms by adding L2-regularization to two widely-used policy evaluation methods: Bellman residual minimization (BRM) and least-squares temporal difference learning (LSTD). We derive efficient implementation for our algorithms when the approximate value-functions belong to a reproducing kernel Hilbert space. We also provide finite-sample performance bounds for our algorithms and show that they are able to achieve optimal rates of convergence under the studied conditions.

3 0.1622711 131 nips-2008-MDPs with Non-Deterministic Policies

Author: Mahdi M. Fard, Joelle Pineau

Abstract: Markov Decision Processes (MDPs) have been extensively studied and used in the context of planning and decision-making, and many methods exist to find the optimal policy for problems modelled as MDPs. Although finding the optimal policy is sufficient in many domains, in certain applications such as decision support systems where the policy is executed by a human (rather than a machine), finding all possible near-optimal policies might be useful as it provides more flexibility to the person executing the policy. In this paper we introduce the new concept of non-deterministic MDP policies, and address the question of finding near-optimal non-deterministic policies. We propose two solutions to this problem, one based on a Mixed Integer Program and the other one based on a search algorithm. We include experimental results obtained from applying this framework to optimize treatment choices in the context of a medical decision support system.

4 0.15640347 150 nips-2008-Near-optimal Regret Bounds for Reinforcement Learning

Author: Peter Auer, Thomas Jaksch, Ronald Ortner

Abstract: For undiscounted reinforcement learning in Markov decision processes (MDPs) we consider the total regret of a learning algorithm with respect to an optimal policy. In order to describe the transition structure of an MDP we propose a new parameter: An MDP has diameter D if for any pair of states s, s′ there is a policy which moves from s to s′ in at most D steps (on average). We present a reinforcement learning algorithm with total regret $\tilde O(DS\sqrt{AT})$ after T steps for any unknown MDP with S states, A actions per state, and diameter D. This bound holds with high probability. We also present a corresponding lower bound of $\Omega(\sqrt{DSAT})$ on the total regret of any learning algorithm.

5 0.1257761 222 nips-2008-Stress, noradrenaline, and realistic prediction of mouse behaviour using reinforcement learning

Author: Carmen Sandi, Wulfram Gerstner, Gediminas Lukšys

Abstract: Suppose we train an animal in a conditioning experiment. Can one predict how a given animal, under given experimental conditions, would perform the task? Since various factors such as stress, motivation, genetic background, and previous errors in task performance can influence animal behaviour, this appears to be a very challenging aim. Reinforcement learning (RL) models have been successful in modeling animal (and human) behaviour, but their success has been limited because of uncertainty as to how to set meta-parameters (such as learning rate, exploitation-exploration balance and future reward discount factor) that strongly influence model performance. We show that a simple RL model whose meta-parameters are controlled by an artificial neural network, fed with inputs such as stress, affective phenotype, previous task performance, and even neuromodulatory manipulations, can successfully predict mouse behaviour in the "hole-box", a simple conditioning task. Our results also provide important insights on how stress and anxiety affect animal learning, performance accuracy, and discounting of future rewards, and on how noradrenergic systems can interact with these processes.

6 0.10593462 1 nips-2008-A Convergent $O(n)$ Temporal-difference Algorithm for Off-policy Learning with Linear Function Approximation

7 0.10389144 181 nips-2008-Policy Search for Motor Primitives in Robotics

8 0.093636096 223 nips-2008-Structure Learning in Human Sequential Decision-Making

9 0.089704931 40 nips-2008-Bounds on marginal probability distributions

10 0.073837243 166 nips-2008-On the asymptotic equivalence between differential Hebbian and temporal difference learning using a local third factor

11 0.071585186 87 nips-2008-Fitted Q-iteration by Advantage Weighted Regression

12 0.066082142 39 nips-2008-Bounding Performance Loss in Approximate MDP Homomorphisms

13 0.065640293 104 nips-2008-Improved Moves for Truncated Convex Models

14 0.063803628 177 nips-2008-Particle Filter-based Policy Gradient in POMDPs

15 0.062986217 235 nips-2008-The Infinite Hierarchical Factor Regression Model

16 0.061967097 161 nips-2008-On the Complexity of Linear Prediction: Risk Bounds, Margin Bounds, and Regularization

17 0.059238762 230 nips-2008-Temporal Difference Based Actor Critic Learning - Convergence and Neural Implementation

18 0.057596084 173 nips-2008-Optimization on a Budget: A Reinforcement Learning Approach

19 0.057309709 144 nips-2008-Multi-resolution Exploration in Continuous Spaces

20 0.051653959 2 nips-2008-A Convex Upper Bound on the Log-Partition Function for Binary Distributions

similar papers computed by lsi model

lsi for this paper:

topicId topicWeight

[(0, -0.139), (1, 0.235), (2, -0.07), (3, -0.066), (4, 0.072), (5, 0.011), (6, 0.063), (7, -0.079), (8, -0.118), (9, -0.016), (10, 0.023), (11, 0.018), (12, 0.077), (13, -0.032), (14, -0.03), (15, 0.001), (16, 0.052), (17, -0.022), (18, 0.021), (19, -0.057), (20, -0.024), (21, 0.064), (22, -0.032), (23, -0.1), (24, -0.071), (25, -0.001), (26, -0.035), (27, -0.01), (28, -0.075), (29, 0.021), (30, -0.022), (31, -0.004), (32, 0.053), (33, -0.008), (34, -0.055), (35, 0.028), (36, -0.058), (37, -0.064), (38, 0.049), (39, -0.041), (40, -0.054), (41, 0.043), (42, -0.139), (43, 0.039), (44, 0.029), (45, -0.019), (46, -0.007), (47, 0.066), (48, 0.132), (49, 0.004)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.95203328 37 nips-2008-Biasing Approximate Dynamic Programming with a Lower Discount Factor

Author: Marek Petrik, Bruno Scherrer

2 0.77834606 150 nips-2008-Near-optimal Regret Bounds for Reinforcement Learning

Author: Peter Auer, Thomas Jaksch, Ronald Ortner

3 0.73808414 131 nips-2008-MDPs with Non-Deterministic Policies

Author: Mahdi M. Fard, Joelle Pineau

4 0.69060618 195 nips-2008-Regularized Policy Iteration

Author: Amir M. Farahmand, Mohammad Ghavamzadeh, Shie Mannor, Csaba Szepesvári

5 0.67414099 39 nips-2008-Bounding Performance Loss in Approximate MDP Homomorphisms

Author: Jonathan Taylor, Doina Precup, Prakash Panagaden

Abstract: We define a metric for measuring behavior similarity between states in a Markov decision process (MDP), which takes action similarity into account. We show that the kernel of our metric corresponds exactly to the classes of states defined by MDP homomorphisms (Ravindran & Barto, 2003). We prove that the difference in the optimal value function of different states can be upper-bounded by the value of this metric, and that the bound is tighter than previous bounds provided by bisimulation metrics (Ferns et al. 2004, 2005). Our results hold both for discrete and for continuous actions. We provide an algorithm for constructing approximate homomorphisms, by using this metric to identify states that can be grouped together, as well as actions that can be matched. Previous research on this topic is based mainly on heuristics.

6 0.66749352 1 nips-2008-A Convergent $O(n)$ Temporal-difference Algorithm for Off-policy Learning with Linear Function Approximation

7 0.63859606 94 nips-2008-Goal-directed decision making in prefrontal cortex: a computational framework

8 0.57494909 181 nips-2008-Policy Search for Motor Primitives in Robotics

9 0.56870359 87 nips-2008-Fitted Q-iteration by Advantage Weighted Regression

10 0.51158386 173 nips-2008-Optimization on a Budget: A Reinforcement Learning Approach

11 0.44871074 222 nips-2008-Stress, noradrenaline, and realistic prediction of mouse behaviour using reinforcement learning

12 0.42092541 144 nips-2008-Multi-resolution Exploration in Continuous Spaces

13 0.41558149 40 nips-2008-Bounds on marginal probability distributions

14 0.38914776 129 nips-2008-MAS: a multiplicative approximation scheme for probabilistic inference

15 0.38466269 235 nips-2008-The Infinite Hierarchical Factor Regression Model

16 0.36008644 230 nips-2008-Temporal Difference Based Actor Critic Learning - Convergence and Neural Implementation

17 0.34895968 177 nips-2008-Particle Filter-based Policy Gradient in POMDPs

18 0.34539205 223 nips-2008-Structure Learning in Human Sequential Decision-Making

19 0.34481859 166 nips-2008-On the asymptotic equivalence between differential Hebbian and temporal difference learning using a local third factor

20 0.33299875 22 nips-2008-An Online Algorithm for Maximizing Submodular Functions

similar papers computed by lda model

lda for this paper:

topicId topicWeight

[(4, 0.069), (6, 0.056), (7, 0.071), (12, 0.027), (28, 0.171), (57, 0.054), (59, 0.013), (63, 0.046), (71, 0.049), (77, 0.071), (79, 0.208), (83, 0.03), (84, 0.014)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.82960045 37 nips-2008-Biasing Approximate Dynamic Programming with a Lower Discount Factor

Author: Marek Petrik, Bruno Scherrer

2 0.72385788 150 nips-2008-Near-optimal Regret Bounds for Reinforcement Learning

Author: Peter Auer, Thomas Jaksch, Ronald Ortner

3 0.72243202 1 nips-2008-A Convergent $O(n)$ Temporal-difference Algorithm for Off-policy Learning with Linear Function Approximation

Author: Richard S. Sutton, Hamid R. Maei, Csaba Szepesvári

Abstract: We introduce the first temporal-difference learning algorithm that is stable with linear function approximation and off-policy training, for any finite Markov decision process, behavior policy, and target policy, and whose complexity scales linearly in the number of parameters. We consider an i.i.d. policy-evaluation setting in which the data need not come from on-policy experience. The gradient temporal-difference (GTD) algorithm estimates the expected update vector of the TD(0) algorithm and performs stochastic gradient descent on its L2 norm. We prove that this algorithm is stable and convergent under the usual stochastic approximation conditions to the same least-squares solution as found by the LSTD, but without LSTD’s quadratic computational complexity. GTD is online and incremental, and does not involve multiplying by products of likelihood ratios as in importance-sampling methods. 1 Off-policy learning methods Off-policy methods have an important role to play in the larger ambitions of modern reinforcement learning. In general, updates to a statistic of a dynamical process are said to be “off-policy” if their distribution does not match the dynamics of the process, particularly if the mismatch is due to the way actions are chosen. The prototypical example in reinforcement learning is the learning of the value function for one policy, the target policy, using data obtained while following another policy, the behavior policy. For example, the popular Q-learning algorithm (Watkins 1989) is an off-policy temporal-difference algorithm in which the target policy is greedy with respect to estimated action values, and the behavior policy is something more exploratory, such as a corresponding ε-greedy policy. Off-policy methods are also critical to reinforcement-learning-based efforts to model human-level world knowledge and state representations as predictions of option outcomes (e.g., Sutton, Precup & Singh 1999; Sutton, Rafols & Koop 2006).
Unfortunately, off-policy methods such as Q-learning are not sound when used with approximations that are linear in the learned parameters, the most popular form of function approximation in reinforcement learning. Counterexamples have been known for many years (e.g., Baird 1995) in which Q-learning’s parameters diverge to infinity for any positive step size. This is a severe problem in so far as function approximation is widely viewed as necessary for large-scale applications of reinforcement learning. The need is so great that practitioners have often simply ignored the problem and continued to use Q-learning with linear function approximation anyway. Although no instances of absolute divergence in applications have been reported in the literature, the potential for instability is disturbing and probably belies real but less obvious problems. (Footnote: Csaba Szepesvári is on leave from MTA SZTAKI.) The stability problem is not specific to reinforcement learning. Classical dynamic programming methods such as value and policy iteration are also off-policy methods and also diverge on some problems when used with linear function approximation. Reinforcement learning methods are actually an improvement over conventional dynamic programming methods in that at least they can be used stably with linear function approximation in their on-policy form. The stability problem is also not due to the interaction of control and prediction, or to stochastic approximation effects; the simplest counterexamples are for deterministic, expected-value-style, synchronous policy evaluation (see Baird 1995; Sutton & Barto 1998). Prior to the current work, the possibility of instability could not be avoided whenever four individually desirable algorithmic features were combined: 1) off-policy updates, 2) temporal-difference learning, 3) linear function approximation, and 4) linear complexity in memory and per-time-step computation.
If any one of these four is abandoned, then stable methods can be obtained relatively easily. But each feature brings value and practitioners are loath to give any of them up, as we discuss later in a penultimate related-work section. In this paper we present the first algorithm to achieve all four desirable features and be stable and convergent for all finite Markov decision processes, all target and behavior policies, and all feature representations for the linear approximator. Moreover, our algorithm does not use importance sampling and can be expected to be much better conditioned and of lower variance than importance sampling methods. Our algorithm can be viewed as performing stochastic gradient descent in a novel objective function whose optimum is the least-squares TD solution. Our algorithm is also incremental and suitable for online use just as are simple temporal-difference learning algorithms such as Q-learning and TD(λ) (Sutton 1988). Our algorithm can be broadly characterized as a gradient-descent version of TD(0), and accordingly we call it GTD(0).
2 Sub-sampling and i.i.d. formulations of temporal-difference learning
In this section we formulate the off-policy policy-evaluation problem for one-step temporal-difference learning such that the data consists of independent, identically-distributed (i.i.d.) samples. We start by considering the standard reinforcement learning framework, in which a learning agent interacts with an environment consisting of a finite Markov decision process (MDP). At each of a sequence of discrete time steps, $t = 1, 2, \ldots$, the environment is in a state $s_t \in S$, the agent chooses an action $a_t \in A$, and then the environment emits a reward $r_t \in \mathbb{R}$, and transitions to its next state $s_{t+1} \in S$. The state and action sets are finite. State transitions are stochastic and dependent on the immediately preceding state and action. Rewards are stochastic and dependent on the preceding state and action, and on the next state.
The agent process generating the actions is termed the behavior policy. To start, we assume a deterministic target policy $\pi : S \to A$. The objective is to learn an approximation to its state-value function:
$$V^{\pi}(s) = E_{\pi}\left[\sum_{t=1}^{\infty} \gamma^{t-1} r_t \,\middle|\, s_1 = s\right], \qquad (1)$$
where $\gamma \in [0, 1)$ is the discount rate. The learning is to be done without knowledge of the process dynamics and from observations of a single continuous trajectory with no resets. In many problems of interest the state set is too large for it to be practical to approximate the value of each state individually. Here we consider linear function approximation, in which states are mapped to feature vectors with fewer components than the number of states. That is, for each state $s \in S$ there is a corresponding feature vector $\phi(s) \in \mathbb{R}^n$, with $n \ll |S|$. The approximation to the value function is then required to be linear in the feature vectors and a corresponding parameter vector $\theta \in \mathbb{R}^n$:
$$V^{\pi}(s) \approx \theta^{\top}\phi(s). \qquad (2)$$
Further, we assume that the states $s_t$ are not visible to the learning agent in any way other than through the feature vectors. Thus this function approximation formulation can include partial-observability formulations such as POMDPs as a special case. The environment and the behavior policy together generate a stream of states, actions and rewards, $s_1, a_1, r_1, s_2, a_2, r_2, \ldots$, which we can break into causally related 4-tuples, $(s_1, a_1, r_1, s'_1), (s_2, a_2, r_2, s'_2), \ldots$, where $s'_t = s_{t+1}$. For some tuples, the action will match what the target policy would do in that state, and for others it will not. We can discard all of the latter as not relevant to the target policy. For the former, we can discard the action because it can be determined from the state via the target policy. With a slight abuse of notation, let $s_k$ denote the $k$th state in which an on-policy action was taken, and let $r_k$ and $s'_k$ denote the associated reward and next state.
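The sub-sampling step just described can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not code from the paper: the function names `to_four_tuples` and `subsample_on_policy`, the flat-list encoding of the stream, and the dictionary encoding of the deterministic target policy are all hypothetical choices.

```python
def to_four_tuples(stream):
    """Split a flat stream s1, a1, r1, s2, a2, r2, ... into 4-tuples
    (s_t, a_t, r_t, s'_t), where s'_t is the state that starts the next step."""
    tuples = []
    # A transition needs three entries plus the following state, so stop
    # early enough that stream[i + 3] exists.
    for i in range(0, len(stream) - 3, 3):
        s, a, r, s_next = stream[i], stream[i + 1], stream[i + 2], stream[i + 3]
        tuples.append((s, a, r, s_next))
    return tuples


def subsample_on_policy(four_tuples, target_policy):
    """Keep only transitions whose action matches the deterministic target
    policy (a dict state -> action) and drop the now-redundant action."""
    return [(s, r, s_next)
            for (s, a, r, s_next) in four_tuples
            if target_policy[s] == a]


if __name__ == "__main__":
    # Hypothetical two-state stream generated by some behavior policy.
    stream = ["x", "left", 0.0, "y", "right", 1.0, "x", "right", 0.5, "y"]
    policy = {"x": "right", "y": "right"}
    print(subsample_on_policy(to_four_tuples(stream), policy))
```

Note that the next state kept in each triple is the state that actually followed in the original stream, not the next state at which an on-policy action happened to be taken. A learner would see only the features of the two states in each triple, so a final step (omitted here) would map each state through the feature function.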
The $k$th on-policy transition, denoted $(s_k, r_k, s'_k)$, is a triple consisting of the starting state of the transition, the reward on the transition, and the ending state of the transition. The corresponding data available to the learning algorithm is the triple $(\phi(s_k), r_k, \phi(s'_k))$. The MDP under the behavior policy is assumed to be ergodic, so that it determines a stationary state-occupancy distribution $\mu(s) = \lim_{k\to\infty} \Pr\{s_k = s\}$. For any state $s$, the MDP and target policy together determine an $N \times N$ state-transition-probability matrix $P$, where $p_{ss'} = \Pr\{s'_k = s' \mid s_k = s\}$, and an $N \times 1$ expected-reward vector $R$, where $R_s = E[r_k \mid s_k = s]$. These two together completely characterize the statistics of on-policy transitions, and all the samples in the sequence of $(\phi(s_k), r_k, \phi(s'_k))$ respect these statistics. The problem still has a Markov structure in that there are temporal dependencies between the sample transitions. In our analysis we first consider a formulation without such dependencies, the i.i.d. case, and then prove that our results extend to the original case. In the i.i.d. formulation, the states $s_k$ are generated independently and identically distributed according to an arbitrary probability distribution $\mu$. From each $s_k$, a corresponding $s'_k$ is generated according to the on-policy state-transition matrix, $P$, and a corresponding $r_k$ is generated according to an arbitrary bounded distribution with expected value $R_{s_k}$. The final i.i.d. data sequence, from which an approximate value function is to be learned, is then the sequence $(\phi(s_k), r_k, \phi(s'_k))$, for $k = 1, 2, \ldots$ Further, because each sample is i.i.d., we can remove the indices and talk about a single tuple of random variables $(\phi, r, \phi')$ drawn from $\mu$. It remains to define the objective of learning. The TD error for the linear setting is
$$\delta = r + \gamma\theta^{\top}\phi' - \theta^{\top}\phi. \qquad (3)$$
Given this, we define the one-step linear TD solution as any value of $\theta$ at which
$$0 = E[\delta\phi] = -A\theta + b, \qquad (4)$$
where $A = E\left[\phi(\phi - \gamma\phi')^{\top}\right]$ and $b = E[r\phi]$.
This is the parameter value to which the linear TD(0) algorithm (Sutton 1988) converges under on-policy training, as well as the value found by LSTD(0) (Bradtke & Barto 1996) under both on-policy and off-policy training. The TD solution is always a fixed point of the linear TD(0) algorithm, but under off-policy training it may not be stable; if $\theta$ does not exactly satisfy (4), then the TD(0) algorithm may cause it to move away in expected value and eventually diverge to infinity.
3 The GTD(0) algorithm
We next present the idea and gradient-descent derivation leading to the GTD(0) algorithm. As discussed above, the vector $E[\delta\phi]$ can be viewed as an error in the current solution $\theta$. The vector should be zero, so its norm is a measure of how far we are away from the TD solution. A distinctive feature of our gradient-descent analysis of temporal-difference learning is that we use as our objective function the squared L2 norm of this vector:
$$J(\theta) = E[\delta\phi]^{\top} E[\delta\phi]. \qquad (5)$$
This objective function is quadratic and unimodal; its minimum value of 0 is achieved when $E[\delta\phi] = 0$, which can always be achieved. The gradient of this objective function is
$$\nabla_{\theta} J(\theta) = 2\left(\nabla_{\theta} E[\delta\phi]^{\top}\right) E[\delta\phi] = 2E\left[\phi(\nabla_{\theta}\delta)^{\top}\right]^{\top} E[\delta\phi] = -2E\left[\phi(\phi - \gamma\phi')^{\top}\right]^{\top} E[\delta\phi]. \qquad (6)$$
This last equation is key to our analysis. We would like to take a stochastic gradient-descent approach, in which a small change is made on each sample in such a way that the expected update is in the direction opposite to the gradient. This is straightforward if the gradient can be written as a single expected value, but here we have a product of two expected values. One cannot sample both of them because the sample product will be biased by their correlation. However, one could store a long-term, quasi-stationary estimate of either of the expectations and then sample the other. The question is, which expectation should be estimated and stored, and which should be sampled? Both ways seem to lead to interesting learning algorithms.
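The remark that the sample product is biased can be checked exactly on a small case. The sketch below is not from the paper: it assumes a hypothetical two-state problem with tabular features, γ = 1/2, and states drawn uniformly, and it compares the product of the two expectations in (6) (without the factor −2) against the expectation of the product formed from a single sample. The names `SAMPLES`, `product_of_expectations`, and `expectation_of_product` are illustrative only.

```python
from fractions import Fraction as F

GAMMA = F(1, 2)

# Hypothetical i.i.d. samples: (probability, phi, reward, phi_next).
# State 0 moves to state 1 with reward 1; state 1 moves to state 0 with reward 0.
SAMPLES = [
    (F(1, 2), (F(1), F(0)), F(1), (F(0), F(1))),
    (F(1, 2), (F(0), F(1)), F(0), (F(1), F(0))),
]


def dot(x, y):
    return sum(a * b for a, b in zip(x, y))


def td_error(theta, phi, r, phi_next):
    # Equation (3): delta = r + gamma * theta^T phi' - theta^T phi.
    return r + GAMMA * dot(theta, phi_next) - dot(theta, phi)


def product_of_expectations(theta):
    """E[(phi - gamma phi') phi^T] E[delta phi]: each factor averaged separately."""
    n = len(theta)
    mean_delta_phi = [F(0)] * n
    mean_outer = [[F(0)] * n for _ in range(n)]
    for p, phi, r, phi_next in SAMPLES:
        delta = td_error(theta, phi, r, phi_next)
        d = [a - GAMMA * b for a, b in zip(phi, phi_next)]
        for i in range(n):
            mean_delta_phi[i] += p * delta * phi[i]
            for j in range(n):
                mean_outer[i][j] += p * d[i] * phi[j]
    return [dot(row, mean_delta_phi) for row in mean_outer]


def expectation_of_product(theta):
    """E[(phi - gamma phi') phi^T delta phi]: both factors taken from one sample."""
    n = len(theta)
    out = [F(0)] * n
    for p, phi, r, phi_next in SAMPLES:
        delta = td_error(theta, phi, r, phi_next)
        d = [a - GAMMA * b for a, b in zip(phi, phi_next)]
        scale = dot(phi, phi) * delta  # the scalar phi^T (delta * phi)
        for i in range(n):
            out[i] += p * d[i] * scale
    return out


if __name__ == "__main__":
    theta0 = [F(0), F(0)]
    print(product_of_expectations(theta0))
    print(expectation_of_product(theta0))
```

At θ = 0 the first function returns (1/4, −1/8) and the second returns (1/2, −1/4), so in this toy case the single-sample product overstates the desired direction by a factor of two. This is why one of the two expectations has to be estimated separately.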
First let us consider the algorithm obtained by forming and storing a separate estimate of the first expectation, that is, of the matrix $A = E\left[\phi(\phi - \gamma\phi')^{\top}\right]$. This matrix is straightforward to estimate from experience as a simple arithmetic average of all previously observed sample outer products $\phi(\phi - \gamma\phi')^{\top}$. Note that $A$ is a stationary statistic in any fixed-policy policy-evaluation problem; it does not depend on $\theta$ and would not need to be re-estimated if $\theta$ were to change. Let $A_k$ be the estimate of $A$ after observing the first $k$ samples, $(\phi_1, r_1, \phi'_1), \ldots, (\phi_k, r_k, \phi'_k)$. Then this algorithm is defined by
$$A_k = \frac{1}{k}\sum_{i=1}^{k} \phi_i(\phi_i - \gamma\phi'_i)^{\top} \qquad (7)$$
along with the gradient descent rule:
$$\theta_{k+1} = \theta_k + \alpha_k A_k^{\top}\delta_k\phi_k, \quad k \geq 1, \qquad (8)$$
where $\theta_1$ is arbitrary, $\delta_k = r_k + \gamma\theta_k^{\top}\phi'_k - \theta_k^{\top}\phi_k$, and $\alpha_k > 0$ is a sequence of step-size parameters, possibly decreasing over time. We call this algorithm $A^{\top}$TD(0) because it is essentially conventional TD(0) prefixed by an estimate of the matrix $A^{\top}$. Although we find this algorithm interesting, we do not consider it further here because it requires $O(n^2)$ memory and computation per time step. The second path to a stochastic-approximation algorithm for estimating the gradient (6) is to form and store an estimate of the second expectation, the vector $E[\delta\phi]$, and to sample the first expectation, $E\left[\phi(\phi - \gamma\phi')^{\top}\right]$. Let $u_k$ denote the estimate of $E[\delta\phi]$ after observing the first $k - 1$ samples, with $u_1 = 0$. The GTD(0) algorithm is defined by
$$u_{k+1} = u_k + \beta_k(\delta_k\phi_k - u_k) \qquad (9)$$
and
$$\theta_{k+1} = \theta_k + \alpha_k(\phi_k - \gamma\phi'_k)\phi_k^{\top}u_k, \qquad (10)$$
where $\theta_1$ is arbitrary, $\delta_k$ is as in (3) using $\theta_k$, and $\alpha_k > 0$ and $\beta_k > 0$ are step-size parameters, possibly decreasing over time. Notice that if the product is formed right-to-left, then the entire computation is $O(n)$ per time step.
4 Convergence
The purpose of this section is to establish that GTD(0) converges with probability one to the TD solution in the i.i.d. problem formulation under standard assumptions.
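For concreteness, the two GTD(0) updates (9) and (10) whose convergence is analyzed in this section can be sketched as follows. This is a minimal illustration under stated assumptions rather than the authors' implementation: the function names, the list-based vectors, the constant step sizes (which do not satisfy the decreasing step-size conditions used in the analysis), and the two-state test problem are all hypothetical.

```python
import random


def dot(x, y):
    """Inner product of two equal-length vectors stored as lists."""
    return sum(a * b for a, b in zip(x, y))


def gtd0_step(theta, u, phi, r, phi_next, gamma, alpha, beta):
    """One GTD(0) update; returns (theta_{k+1}, u_{k+1}) in O(n) time."""
    # TD error (3), evaluated at the current theta_k.
    delta = r + gamma * dot(theta, phi_next) - dot(theta, phi)
    # Update (10): form the scalar phi_k^T u_k first (right-to-left), then
    # scale the vector (phi_k - gamma * phi'_k); no n-by-n matrix is built.
    scale = alpha * dot(phi, u)
    new_theta = [t + scale * (p - gamma * pn)
                 for t, p, pn in zip(theta, phi, phi_next)]
    # Update (9): move u_k toward the sampled vector delta_k * phi_k.
    new_u = [ui + beta * (delta * p - ui) for ui, p in zip(u, phi)]
    return new_theta, new_u


def run_two_state_demo(steps=50000, alpha=0.02, beta=0.02, gamma=0.5, seed=0):
    """Hypothetical i.i.d. problem with tabular features: state 0 moves to
    state 1 with reward 1, state 1 moves to state 0 with reward 0."""
    rng = random.Random(seed)
    features = [[1.0, 0.0], [0.0, 1.0]]
    samples = [(features[0], 1.0, features[1]),
               (features[1], 0.0, features[0])]
    theta, u = [0.0, 0.0], [0.0, 0.0]
    for _ in range(steps):
        phi, r, phi_next = rng.choice(samples)  # states drawn i.i.d., uniformly
        theta, u = gtd0_step(theta, u, phi, r, phi_next, gamma, alpha, beta)
    return theta


if __name__ == "__main__":
    print(run_two_state_demo())
```

For this toy problem, solving (4) by hand gives θ = (4/3, 2/3), so the returned vector should end up close to those values. Note that θ is updated with the old estimate $u_k$, matching the indices in (10).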
In particular, we have the following result: Theorem 4.1 (Convergence of GTD(0)). Consider the GTD(0) iteration (9,10) with step-size sequences $\alpha_k$ and $\beta_k$ satisfying $\beta_k = \eta\alpha_k$, $\eta > 0$, $\alpha_k, \beta_k \in (0, 1]$, $\sum_{k=0}^{\infty}\alpha_k = \infty$, $\sum_{k=0}^{\infty}\alpha_k^2 < \infty$. Further assume that $(\phi_k, r_k, \phi'_k)$ is an i.i.d. sequence with uniformly bounded second moments. Let $A = E\left[\phi_k(\phi_k - \gamma\phi'_k)^{\top}\right]$ and $b = E[r_k\phi_k]$ (note that $A$ and $b$ are well-defined because the distribution of $(\phi_k, r_k, \phi'_k)$ does not depend on the sequence index $k$). Assume that $A$ is non-singular. Then the parameter vector $\theta_k$ converges with probability one to the TD solution (4). Proof. We use the ordinary-differential-equation (ODE) approach (Borkar & Meyn 2000). First, we rewrite the algorithm’s two iterations as a single iteration in a combined parameter vector with $2n$ components, $\rho_k^{\top} = (v_k^{\top}, \theta_k^{\top})$, where $v_k = u_k/\sqrt{\eta}$, and a new reward-related vector with $2n$ components, $g_{k+1}^{\top} = (r_k\phi_k^{\top}, 0^{\top})$:
$$\rho_{k+1} = \rho_k + \alpha_k\sqrt{\eta}\,(G_{k+1}\rho_k + g_{k+1}),$$
where
$$G_{k+1} = \begin{pmatrix} -\sqrt{\eta}\, I & \phi_k(\gamma\phi'_k - \phi_k)^{\top} \\ (\phi_k - \gamma\phi'_k)\phi_k^{\top} & 0 \end{pmatrix}.$$
Let $G = E[G_k]$ and $g = E[g_k]$. Note that $G$ and $g$ are well-defined as by the assumption the process $\{\phi_k, r_k, \phi'_k\}_k$ is i.i.d. In particular,
$$G = \begin{pmatrix} -\sqrt{\eta}\, I & -A \\ A^{\top} & 0 \end{pmatrix}, \qquad g = \begin{pmatrix} b \\ 0 \end{pmatrix}.$$
Further, note that (4) follows from
$$G\rho + g = 0, \qquad (11)$$
where $\rho^{\top} = (v^{\top}, \theta^{\top})$. Now we apply Theorem 2.2 of Borkar & Meyn (2000). For this purpose we write $\rho_{k+1} = \rho_k + \alpha_k\sqrt{\eta}\,(G\rho_k + g + (G_{k+1} - G)\rho_k + (g_{k+1} - g)) = \rho_k + \alpha'_k(h(\rho_k) + M_{k+1})$, where $\alpha'_k = \alpha_k\sqrt{\eta}$, $h(\rho) = g + G\rho$ and $M_{k+1} = (G_{k+1} - G)\rho_k + g_{k+1} - g$. Let $\mathcal{F}_k = \sigma(\rho_1, M_1, \ldots, \rho_{k-1}, M_k)$.
Theorem 2.2 requires the verification of the following conditions: (i) The function $h$ is Lipschitz and $h_{\infty}(\rho) = \lim_{r\to\infty} h(r\rho)/r$ is well-defined for every $\rho \in \mathbb{R}^{2n}$; (ii-a) The sequence $(M_k, \mathcal{F}_k)$ is a martingale difference sequence, and (ii-b) for some $C_0 > 0$, $E\left[\|M_{k+1}\|^2 \mid \mathcal{F}_k\right] \leq C_0(1 + \|\rho_k\|^2)$ holds for any initial parameter vector $\rho_1$; (iii) The sequence $\alpha'_k$ satisfies $0 < \alpha'_k \leq 1$, $\sum_{k=1}^{\infty}\alpha'_k = \infty$, $\sum_{k=1}^{\infty}(\alpha'_k)^2 < +\infty$; and (iv) The ODE $\dot{\rho} = h(\rho)$ has a globally asymptotically stable equilibrium. Clearly, $h(\rho)$ is Lipschitz with coefficient $\|G\|$ and $h_{\infty}(\rho) = G\rho$. By construction, $(M_k, \mathcal{F}_k)$ satisfies $E[M_{k+1} \mid \mathcal{F}_k] = 0$ and $M_k \in \mathcal{F}_k$, i.e., it is a martingale difference sequence. Condition (ii-b) can be shown to hold by a simple application of the triangle inequality and the boundedness of the second moments of $(\phi_k, r_k, \phi'_k)$. Condition (iii) is satisfied by our conditions on the step-size sequences $\alpha_k$, $\beta_k$. Finally, the last condition (iv) will follow from the elementary theory of linear differential equations if we can show that the real parts of all the eigenvalues of $G$ are negative. First, let us show that $G$ is non-singular. Using the determinant rule for partitioned matrices (see footnote 1) we get $\det(G) = \det(A^{\top}A) \neq 0$. This indicates that all the eigenvalues of $G$ are non-zero. Now, let $\lambda \in \mathbb{C}$, $\lambda \neq 0$, be an eigenvalue of $G$ with corresponding normalized eigenvector $x \in \mathbb{C}^{2n}$; that is, $\|x\|^2 = x^{*}x = 1$, where $x^{*}$ is the conjugate transpose of $x$. Hence $x^{*}Gx = \lambda$. Let $x^{\top} = (x_1^{\top}, x_2^{\top})$, where $x_1, x_2 \in \mathbb{C}^n$. Using the definition of $G$, $\lambda = x^{*}Gx = -\sqrt{\eta}\,\|x_1\|^2 - x_1^{*}Ax_2 + x_2^{*}A^{\top}x_1$. Because $A$ is real, $A^{*} = A^{\top}$, and it follows that $(x_1^{*}Ax_2)^{*} = x_2^{*}A^{\top}x_1$, so the last two terms together are purely imaginary. Thus, $\mathrm{Re}(\lambda) = \mathrm{Re}(x^{*}Gx) = -\sqrt{\eta}\,\|x_1\|^2 \leq 0$. We are now done if we show that $x_1$ cannot be zero. If $x_1 = 0$, then from $\lambda = x^{*}Gx$ we get that $\lambda = 0$, which contradicts $\lambda \neq 0$. The next result concerns the convergence of GTD(0) when $(\phi_k, r_k, \phi'_k)$ is obtained by the off-policy sub-sampling process described originally in Section 2.
We make the following assumption: Assumption A1 The behavior policy $\pi_b$ (generator of the actions $a_t$) selects all actions of the target policy $\pi$ with positive probability in every state, and the target policy is deterministic. This assumption is needed to ensure that the sub-sampled process $s_k$ is well-defined and that the obtained sample is of “high quality”. Under this assumption it holds that $s_k$ is again a Markov chain by the strong Markov property of Markov processes (as the times selected when actions correspond to those of the behavior policy form Markov times with respect to the filtration defined by the original process $s_t$). The following theorem shows that the conclusion of the previous result continues to hold in this case: Theorem 4.2 (Convergence of GTD(0) with a sub-sampled process). Assume A1. Let the parameters $\theta_k$, $u_k$ be updated by (9,10). Further assume that $(\phi_k, r_k, \phi'_k)$ is such that $E\left[\|\phi_k\|^2 \mid s_{k-1}\right]$, $E\left[r_k^2 \mid s_{k-1}\right]$, $E\left[\|\phi'_k\|^2 \mid s_{k-1}\right]$ are uniformly bounded. Assume that the Markov chain $(s_k)$ is aperiodic and irreducible, so that $\lim_{k\to\infty} P(s_k = s' \mid s_0 = s) = \mu(s')$ exists and is unique. Let $s$ be a state randomly drawn from $\mu$, and let $s'$ be a state obtained by following $\pi$ for one time step in the MDP from $s$. Further, let $r(s, s')$ be the reward incurred. Let $A = E\left[\phi(s)(\phi(s) - \gamma\phi(s'))^{\top}\right]$ and $b = E[r(s, s')\phi(s)]$. Assume that $A$ is non-singular. Then the parameter vector $\theta_k$ converges with probability one to the TD solution (4), provided that $s_1 \sim \mu$. Proof. The proof of Theorem 4.1 goes through without any changes once we observe that $G = E[G_{k+1} \mid \mathcal{F}_k]$ and $g = E[g_{k+1} \mid \mathcal{F}_k]$. [Footnote 1: According to this rule, if $A \in \mathbb{R}^{n\times n}$, $B \in \mathbb{R}^{n\times m}$, $C \in \mathbb{R}^{m\times n}$, $D \in \mathbb{R}^{m\times m}$, then for $F = [A\ B;\ C\ D] \in \mathbb{R}^{(n+m)\times(n+m)}$, $\det(F) = \det(A)\det(D - CA^{-1}B)$.] The condition that $(s_k)$ is aperiodic and irreducible guarantees the existence of the steady-state distribution $\mu$. Further, the aperiodicity and irreducibility of $(s_k)$ follows from the same property of the original process $(s_t)$.
For further discussion of these conditions cf. Section 6.3 of Bertsekas and Tsitsiklis (1996). With considerably more work the result can be extended to the case when $s_1$ follows an arbitrary distribution. This requires an extension of Theorem 2.2 of Borkar and Meyn (2000) to processes of the form $\rho_{k+1} = \rho_k + \alpha'_k(h(\rho_k) + M_{k+1} + e_{k+1})$, where $e_{k+1}$ is a fast decaying perturbation (see, e.g., the proof of Proposition 4.8 of Bertsekas and Tsitsiklis (1996)).
5 Extensions to action values, stochastic target policies, and other sample weightings
The GTD algorithm extends immediately to the case of off-policy learning of action-value functions. For this assume that a behavior policy $\pi_b$ is followed that samples all actions in every state with positive probability. Let the target policy to be evaluated be $\pi$. In this case the basis functions are dependent on both the states and actions: $\phi : S \times A \to \mathbb{R}^n$. The learning equations are unchanged, except that $\phi_t$ and $\phi'_t$ are redefined as follows:
$$\phi_t = \phi(s_t, a_t), \qquad (12)$$
$$\phi'_t = \sum_{a} \pi(s_{t+1}, a)\,\phi(s_{t+1}, a). \qquad (13)$$
(We use time indices $t$ denoting physical time.) Here $\pi(s, a)$ is the probability of selecting action $a$ in state $s$ under the target policy $\pi$. Let us call the resulting algorithm “one-step gradient-based Q-evaluation,” or GQE(0). Theorem 5.1 (Convergence of GQE(0)). Assume that $s_t$ is a state sequence generated by following some stationary policy $\pi_b$ in a finite MDP. Let $r_t$ be the corresponding sequence of rewards and let $\phi_t$, $\phi'_t$ be given by the respective equations (12) and (13), and assume that $E\left[\|\phi_t\|^2 \mid s_{t-1}\right]$, $E\left[r_t^2 \mid s_{t-1}\right]$, $E\left[\|\phi'_t\|^2 \mid s_{t-1}\right]$ are uniformly bounded. Let the parameters $\theta_t$, $u_t$ be updated by Equations (9) and (10). Assume that the Markov chain $(s_t)$ is aperiodic and irreducible, so that $\lim_{t\to\infty} P(s_t = s' \mid s_0 = s) = \mu(s')$ exists and is unique. Let $s$ be a state randomly drawn from $\mu$, $a$ be an action chosen by $\pi_b$ in $s$, let $s'$ be the next state obtained and let $a' = \pi(s')$ be the action chosen by the target policy in state $s'$.
Further, let $r(s, a, s')$ be the reward incurred in this transition. Let $A = E\left[\phi(s, a)(\phi(s, a) - \gamma\phi(s', a'))^{\top}\right]$ and $b = E[r(s, a, s')\phi(s, a)]$. Assume that $A$ is non-singular. Then the parameter vector $\theta_t$ converges with probability one to a TD solution (4), provided that $s_1$ is selected from the steady-state distribution $\mu$. The proof is almost identical to that of Theorem 4.2, and hence it is omitted. Our main convergence results are also readily generalized to stochastic target policies by replacing the sub-sampling process described in Section 2 with a sample-weighting process. That is, instead of including or excluding transitions depending upon whether the action taken matches a deterministic policy, we include all transitions but give each a weight. For example, we might let the weight $w_t$ for time step $t$ be equal to the probability $\pi(s_t, a_t)$ of taking the action actually taken under the target policy. We can consider the i.i.d. samples now to have four components $(\phi_k, r_k, \phi'_k, w_k)$, with the update rules (9) and (10) replaced by
$$u_{k+1} = u_k + \beta_k(\delta_k\phi_k - u_k)\,w_k \qquad (14)$$
and
$$\theta_{k+1} = \theta_k + \alpha_k(\phi_k - \gamma\phi'_k)\phi_k^{\top}u_k\,w_k. \qquad (15)$$
Each sample is also weighted by $w_k$ in the expected values, such as that defining the TD solution (4). With these changes, Theorems 4.1 and 4.2 go through immediately for stochastic policies. The reweighting is, in effect, an adjustment to the i.i.d. sampling distribution, $\mu$, and thus our results hold because they hold for all $\mu$. The choice $w_t = \pi(s_t, a_t)$ is only one possibility, notable for its equivalence to our original case if the target policy is deterministic. Another natural weighting is $w_t = \pi(s_t, a_t)/\pi_b(s_t, a_t)$, where $\pi_b$ is the behavior policy. This weighting may result in the TD solution (4) better matching the target policy’s value function (1).
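The extensions of this section change very little in code. The sketch below is an illustration only, with hypothetical function names and data layouts: `expected_next_feature` forms the next-step feature vector of (13) from dictionaries keyed by action, and `weighted_gtd0_step` applies the weighted updates (14) and (15) for a weight `w` supplied by the caller (for instance the target-policy probability of the action taken, or its ratio to the behavior-policy probability).

```python
def dot(x, y):
    """Inner product of two equal-length vectors stored as lists."""
    return sum(a * b for a, b in zip(x, y))


def expected_next_feature(pi_next, feats_next):
    """Equation (13): sum over actions of pi(s', a) * phi(s', a).

    pi_next maps each action to its target-policy probability in s';
    feats_next maps each action to the feature vector phi(s', a)."""
    n = len(next(iter(feats_next.values())))
    out = [0.0] * n
    for action, prob in pi_next.items():
        for i, value in enumerate(feats_next[action]):
            out[i] += prob * value
    return out


def weighted_gtd0_step(theta, u, phi, r, phi_next, w, gamma, alpha, beta):
    """One step of the weighted updates (14) and (15); O(n) per call."""
    delta = r + gamma * dot(theta, phi_next) - dot(theta, phi)
    # (15): scalar phi^T u, scaled by the step size and the sample weight.
    scale = alpha * w * dot(phi, u)
    new_theta = [t + scale * (p - gamma * pn)
                 for t, p, pn in zip(theta, phi, phi_next)]
    # (14): the correction to u is scaled by the same weight.
    new_u = [ui + beta * w * (delta * p - ui) for ui, p in zip(u, phi)]
    return new_theta, new_u


if __name__ == "__main__":
    phi_next = expected_next_feature({"l": 0.25, "r": 0.75},
                                     {"l": [1.0, 0.0], "r": [0.0, 1.0]})
    print(weighted_gtd0_step([0.0, 0.0], [0.5, 0.0], [1.0, 0.0], 1.0,
                             phi_next, 0.5, 0.5, 0.1, 0.5))
```

A weight of zero leaves both vectors unchanged, which mirrors discarding an off-policy transition under the sub-sampling scheme of Section 2, while a weight of one recovers the unweighted updates (9) and (10).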
6 Related work
There have been several prior attempts to attain the four desirable algorithmic features mentioned at the beginning of this paper (off-policy stability, temporal-difference learning, linear function approximation, and $O(n)$ complexity) but none has been completely successful. One idea for retaining all four desirable features is to use importance sampling techniques to reweight off-policy updates so that they are in the same direction as on-policy updates in expected value (Precup, Sutton & Dasgupta 2001; Precup, Sutton & Singh 2000). Convergence can sometimes then be assured by existing results on the convergence of on-policy methods (Tsitsiklis & Van Roy 1997; Tadic 2001). However, the importance sampling weights are cumulative products of (possibly many) target-to-behavior-policy likelihood ratios, and consequently they and the corresponding updates may be of very high variance. The use of “recognizers” to construct the target policy directly from the behavior policy (Precup, Sutton, Paduraru, Koop & Singh 2006) is one strategy for limiting the variance; another is careful choice of the target policies (see Precup, Sutton & Dasgupta 2001). However, it remains the case that for all such methods to date there are always choices of problem, behavior policy, and target policy for which the variance is infinite, and thus for which there is no guarantee of convergence. Residual gradient algorithms (Baird 1995) have also been proposed as a way of obtaining all four desirable features. These methods can be viewed as gradient descent in the expected squared TD error, $E[\delta^2]$; thus they converge stably to the solution that minimizes this objective for arbitrary differentiable function approximators. However, this solution has always been found to be much inferior to the TD solution (exemplified by (4) for the one-step linear case).
In the literature (Baird 1995; Sutton & Barto 1998), it is often claimed that residual-gradient methods are guaranteed to find the TD solution in two special cases: 1) systems with deterministic transitions and 2) systems in which two samples can be drawn for each next state (e.g., for which a simulation model is available). Our own analysis indicates that even these two special requirements are insufficient to guarantee convergence to the TD solution (see footnote 2). Gordon (1995) and others have questioned the need for linear function approximation. He has proposed replacing linear function approximation with a more restricted class of approximators, known as averagers, that never extrapolate outside the range of the observed data and thus cannot diverge. Rightly or wrongly, averagers have been seen as being too constraining and have not been used on large applications involving online learning. Linear methods, on the other hand, have been widely used (e.g., Baxter, Tridgell & Weaver 1998; Sturtevant & White 2006; Schaeffer, Hlynka & Jussila 2001). The need for linear complexity has also been questioned. Second-order methods for linear approximators, such as LSTD (Bradtke & Barto 1996; Boyan 2002) and LSPI (Lagoudakis & Parr 2003; see also Peters, Vijayakumar & Schaal 2005), can be effective on moderately sized problems. If the number of features in the linear approximator is $n$, then these methods require memory and per-time-step computation that is $O(n^2)$. Newer incremental methods such as iLSTD (Geramifard, Bowling & Sutton 2006) have reduced the per-time-step complexity to $O(n)$, but are still $O(n^2)$ in memory. Sparsification methods may reduce the complexity further, but they do not help in the general case, and they may apply to $O(n)$ methods as well to further reduce their complexity. Linear function approximation is most powerful when very large numbers of features are used, perhaps millions of features (e.g., as in Silver, Sutton & Müller 2007).
In such cases, $O(n^2)$ methods are not feasible.
7 Conclusion
GTD(0) is the first off-policy TD algorithm to converge under general conditions with linear function approximation and linear complexity. As such, it breaks new ground in terms of important, absolute abilities not previously available in existing algorithms. [Footnote 2: For a counterexample, consider that given in Dayan’s (1992) Figure 2, except now consider that state A is actually two states, A and A’, which share the same feature vector. The two states occur with 50-50 probability, and when one occurs the transition is always deterministically to B followed by the outcome 1, whereas when the other occurs the transition is always deterministically to the outcome 0. In this case V(A) and V(B) will converge under the residual-gradient algorithm to the wrong answers, 1/3 and 2/3, even though the system is deterministic, and even if multiple samples are drawn from each state (they will all be the same).] We have conducted empirical studies with the GTD(0) algorithm and have confirmed that it converges reliably on standard off-policy counterexamples such as Baird’s (1995) “star” problem. On on-policy problems such as the n-state random walk (Sutton 1988; Sutton & Barto 1998), GTD(0) does not seem to learn as efficiently as classic TD(0), although we are still exploring different ways of setting the step-size parameters, and other variations on the algorithm. It is not clear that the GTD(0) algorithm in its current form will be a fully satisfactory solution to the off-policy learning problem, but it is clear that it breaks new ground and achieves important abilities that were previously unattainable.
Acknowledgments
The authors gratefully acknowledge insights and assistance they have received from David Silver, Eric Wiewiora, Mark Ring, Michael Bowling, and Alborz Geramifard. This research was supported by iCORE, NSERC and the Alberta Ingenuity Fund.
References
Baird, L. C. (1995).
Residual algorithms: Reinforcement learning with function approximation. In Proceedings of the Twelfth International Conference on Machine Learning, pp. 30–37. Morgan Kaufmann.
Baxter, J., Tridgell, A., Weaver, L. (1998). Experiments in parameter learning using temporal differences. International Computer Chess Association Journal, 21, 84–99.
Bertsekas, D. P., Tsitsiklis, J. (1996). Neuro-Dynamic Programming. Athena Scientific.
Borkar, V. S., Meyn, S. P. (2000). The ODE method for convergence of stochastic approximation and reinforcement learning. SIAM Journal on Control and Optimization, 38(2):447–469.
Boyan, J. (2002). Technical update: Least-squares temporal difference learning. Machine Learning, 49:233–246.
Bradtke, S., Barto, A. G. (1996). Linear least-squares algorithms for temporal difference learning. Machine Learning, 22:33–57.
Dayan, P. (1992). The convergence of TD(λ) for general λ. Machine Learning, 8:341–362.
Geramifard, A., Bowling, M., Sutton, R. S. (2006). Incremental least-square temporal difference learning. Proceedings of the National Conference on Artificial Intelligence, pp. 356–361.
Gordon, G. J. (1995). Stable function approximation in dynamic programming. Proceedings of the Twelfth International Conference on Machine Learning, pp. 261–268. Morgan Kaufmann, San Francisco.
Lagoudakis, M., Parr, R. (2003). Least squares policy iteration. Journal of Machine Learning Research, 4:1107–1149.
Peters, J., Vijayakumar, S., Schaal, S. (2005). Natural Actor-Critic. Proceedings of the 16th European Conference on Machine Learning, pp. 280–291.
Precup, D., Sutton, R. S., Dasgupta, S. (2001). Off-policy temporal-difference learning with function approximation. Proceedings of the 18th International Conference on Machine Learning, pp. 417–424.
Precup, D., Sutton, R. S., Paduraru, C., Koop, A., Singh, S. (2006). Off-policy Learning with Recognizers. Advances in Neural Information Processing Systems 18.
Precup, D., Sutton, R. S., Singh, S. (2000). Eligibility traces for off-policy policy evaluation. Proceedings of the 17th International Conference on Machine Learning, pp. 759–766. Morgan Kaufmann.
Schaeffer, J., Hlynka, M., Jussila, V. (2001). Temporal difference learning applied to a high-performance game-playing program. Proceedings of the International Joint Conference on Artificial Intelligence, pp. 529–534.
Silver, D., Sutton, R. S., Müller, M. (2007). Reinforcement learning of local shape in the game of Go. Proceedings of the 20th International Joint Conference on Artificial Intelligence, pp. 1053–1058.
Sturtevant, N. R., White, A. M. (2006). Feature construction for reinforcement learning in hearts. In Proceedings of the 5th International Conference on Computers and Games.
Sutton, R. S. (1988). Learning to predict by the method of temporal differences. Machine Learning, 3:9–44.
Sutton, R. S., Barto, A. G. (1998). Reinforcement Learning: An Introduction. MIT Press.
Sutton, R. S., Precup, D., Singh, S. (1999). Between MDPs and semi-MDPs: A framework for temporal abstraction in reinforcement learning. Artificial Intelligence, 112:181–211.
Sutton, R. S., Rafols, E. J., Koop, A. (2006). Temporal abstraction in temporal-difference networks. Advances in Neural Information Processing Systems 18.
Tadic, V. (2001). On the convergence of temporal-difference learning with linear function approximation. Machine Learning, 42:241–267.
Tsitsiklis, J. N., Van Roy, B. (1997). An analysis of temporal-difference learning with function approximation. IEEE Transactions on Automatic Control, 42:674–690.
Watkins, C. J. C. H. (1989). Learning from Delayed Rewards. Ph.D. thesis, Cambridge University.

4 0.71940327 29 nips-2008-Automatic online tuning for fast Gaussian summation

Author: Vlad I. Morariu, Balaji V. Srinivasan, Vikas C. Raykar, Ramani Duraiswami, Larry S. Davis

Abstract: Many machine learning algorithms require the summation of Gaussian kernel functions, an expensive operation if implemented straightforwardly. Several methods have been proposed to reduce the computational complexity of evaluating such sums, including tree and analysis based methods. These achieve varying speedups depending on the bandwidth, dimension, and prescribed error, making the choice between methods difﬁcult for machine learning tasks. We provide an algorithm that combines tree methods with the Improved Fast Gauss Transform (IFGT). As originally proposed the IFGT suffers from two problems: (1) the Taylor series expansion does not perform well for very low bandwidths, and (2) parameter selection is not trivial and can drastically affect performance and ease of use. We address the ﬁrst problem by employing a tree data structure, resulting in four evaluation methods whose performance varies based on the distribution of sources and targets and input parameters such as desired accuracy and bandwidth. To solve the second problem, we present an online tuning approach that results in a black box method that automatically chooses the evaluation method and its parameters to yield the best performance for the input data, desired accuracy, and bandwidth. In addition, the new IFGT parameter selection approach allows for tighter error bounds. Our approach chooses the fastest method at negligible additional cost, and has superior performance in comparisons with previous approaches. 1

5 0.71916455 94 nips-2008-Goal-directed decision making in prefrontal cortex: a computational framework

Author: Matthew Botvinick, James An

Abstract: Research in animal learning and behavioral neuroscience has distinguished between two forms of action control: a habit-based form, which relies on stored action values, and a goal-directed form, which forecasts and compares action outcomes based on a model of the environment. While habit-based control has been the subject of extensive computational research, the computational principles underlying goal-directed control in animals have so far received less attention. In the present paper, we advance a computational framework for goal-directed control in animals and humans. We take three empirically motivated points as founding premises: (1) Neurons in dorsolateral prefrontal cortex represent action policies, (2) Neurons in orbitofrontal cortex represent rewards, and (3) Neural computation, across domains, can be appropriately understood as performing structured probabilistic inference. On a purely computational level, the resulting account relates closely to previous work using Bayesian inference to solve Markov decision problems, but extends this work by introducing a new algorithm, which provably converges on optimal plans. On a cognitive and neuroscientific level, the theory provides a unifying framework for several different forms of goal-directed action selection, placing emphasis on a novel form, within which orbitofrontal reward representations directly drive policy selection.
1 Goal-directed action control
In the study of human and animal behavior, it is a long-standing idea that reward-based decision making may rely on two qualitatively different mechanisms. In habit-based decision making, stimuli elicit reflex-like responses, shaped by past reinforcement [1]. In goal-directed or purposive decision making, on the other hand, actions are selected based on a prospective consideration of possible outcomes and future lines of action [2].
Over the past twenty years or so, the attention of cognitive neuroscientists and computationally minded psychologists has tended to focus on habit-based control, due in large part to interest in potential links between dopaminergic function and temporal-difference algorithms for reinforcement learning. However, a resurgence of interest in purposive action selection is now being driven by innovations in animal behavior research, which have yielded powerful new behavioral assays [3], and revealed specific effects of focal neural damage on goal-directed behavior [4]. In discussing some of the relevant data, Daw, Niv and Dayan [5] recently pointed out the close relationship between purposive decision making, as understood in the behavioral sciences, and model-based methods for the solution of Markov decision problems (MDPs), where action policies are derived from a joint analysis of a transition function (a mapping from states and actions to outcomes) and a reward function (a mapping from states to rewards). Beyond this important insight, little work has yet been done to characterize the computations underlying goal-directed action selection (though see [6, 7]). As discussed below, a great deal of evidence indicates that purposive action selection depends critically on a particular region of the brain, the prefrontal cortex. However, it is currently a critical, and quite open, question what the relevant computations within this part of the brain might be. Of course, the basic computational problem of formulating an optimal policy given a model of an MDP has been extensively studied, and there is no shortage of algorithms one might consider as potentially relevant to prefrontal function (e.g., value iteration, policy iteration, backward induction, linear programming, and others). However, from a cognitive and neuroscientific perspective, there is one approach to solving MDPs that it seems particularly appealing to consider. 
In particular, several researchers have suggested methods for solving MDPs through probabilistic inference [8-12]. The interest of this idea, in the present context, derives from a recent movement toward framing human and animal information processing, as well as the underlying neural computations, in terms of structured probabilistic inference [13, 14]. Given this perspective, it is inviting to consider whether goal-directed action selection, and the neural mechanisms that underlie it, might be understood in those same terms. One challenge in investigating this possibility is that previous research furnishes no ‘off-the-shelf’ algorithm for solving MDPs through probabilistic inference that both provably yields optimal policies and aligns with what is known about action selection in the brain. We endeavor here to start filling in that gap. In the following section, we introduce an account of how goal-directed action selection can be performed based on probabilistic inference, within a network whose components map grossly onto specific brain structures. As part of this account, we introduce a new algorithm for solving MDPs through Bayesian inference, along with a convergence proof. We then present results from a set of simulations illustrating how the framework would account for a variety of behavioral phenomena that are thought to involve purposive action selection.
2 Computational model
As noted earlier, the prefrontal cortex (PFC) is believed to play a pivotal role in purposive behavior. This is indicated by a broad association between prefrontal lesions and impairments in goal-directed action in both humans (see [15]) and animals [4]. Single-unit recording and other data suggest that different sectors of PFC make distinct contributions. 
In particular, neurons in dorsolateral prefrontal cortex (DLPFC) appear to encode task-specific mappings from stimuli to responses (e.g., [16]): “task representations,” in the language of psychology, or “policies” in the language of dynamic programming. Although there is some understanding of how policy representations in DLPFC may guide action execution [15], little is yet known about how these representations are themselves selected. Our most basic proposal is that DLPFC policy representations are selected in a prospective, model-based fashion, leveraging information about action-outcome contingencies (i.e., the transition function) and about the incentive value associated with specific outcomes or states (the reward function). There is extensive evidence to suggest that state-reward associations are represented in another area of the PFC, the orbitofrontal cortex (OFC) [17, 18]. As for the transition function, although it is clear that the brain contains detailed representations of action-outcome associations [19], their anatomical localization is not yet entirely clear. However, some evidence suggests that the environmental effects of simple actions may be represented in inferior fronto-parietal cortex [20], and there is also evidence suggesting that medial temporal structures may be important in forecasting action outcomes [21]. As detailed in the next section, our model assumes that policy representations in DLPFC, reward representations in OFC, and representations of states and actions in other brain regions, are coordinated within a network structure that represents their causal or statistical interdependencies, and that policy selection occurs, within this network, through a process of probabilistic inference.
2.1 Architecture
The implementation takes the form of a directed graphical model [22], with the layout shown in Figure 1. Each node represents a discrete random variable. 
State variables (s), representing the set of m possible world states, serve the role played by parietal and medial temporal cortices in representing action outcomes. Action variables (a), representing the set of available actions, play the role of high-level cortical motor areas involved in the programming of action sequences. Policy variables ($\pi$), each representing the set of all deterministic policies associated with a specific state, capture the representational role of DLPFC. Local and global utility variables, described further below, capture the role of OFC in representing incentive value. A separate set of nodes is included for each discrete time-step up to the planning horizon.
Fig 1. Left: Single-step decision. Right: Sequential decision. Each time-slice includes a set of m policy nodes.
The conditional probabilities associated with each variable are represented in tabular form. State probabilities are based on the state and action variables in the preceding time-step, and thus encode the transition function. Action probabilities depend on the current state and its associated policy variable. Utilities depend only on the current state. Rather than representing reward magnitude as a continuous variable, we adopt an approach introduced by [23], representing reward through the posterior probability of a binary variable (u). States associated with large positive reward raise p(u) (i.e., p(u=1|s)) near to one; states associated with large negative rewards reduce p(u) to near zero. In the simulations reported below, we used a simple linear transformation to map from scalar reward values to p(u):
$$p(u \mid s_i) = \frac{1}{2}\left(\frac{R(s_i)}{r_{\max}} + 1\right), \quad r_{\max} \leftarrow \max_j \lvert R(s_j) \rvert \tag{1}$$
In situations involving sequential actions, expected returns from different time-steps must be integrated into a global representation of expected value. In order to accomplish this, we employ a technique proposed by [8], introducing a “global” utility variable ($u_G$). 
Like u, this is a binary random variable, but associated with a posterior probability determined as:
$$p(u_G) = \frac{1}{N} \sum_i p(u_i) \tag{2}$$
where N is the number of u nodes. (Note that temporal discounting can be incorporated into the framework through minimal modifications to Equation 2.) The network as a whole embodies a generative model for instrumental action. The basic idea is to use this model as a substrate for probabilistic inference, in order to arrive at optimal policies. There are three general methods for accomplishing this, which correspond to three forms of query. First, a desired outcome state can be identified, by treating one of the state variables (as well as the initial state variable) as observed (see [9] for an application of this approach). Second, the expected return for specific plans can be evaluated and compared by conditioning on specific sets of values over the policy nodes (see [5, 21]). However, our focus here is on a less obvious possibility, which is to condition directly on the utility variable $u_G$, as explained next.
2.2 Policy selection by probabilistic inference: an iterative algorithm
Cooper [23] introduced the idea of inferring optimal decisions in influence diagrams by converting utility nodes into binary random variables and then conditioning on these variables. Although this technique has been adopted in some more recent work [9, 12], we are aware of no application that guarantees optimal decisions, in the expected-reward sense, in multi-step tasks. We introduce here a simple algorithm that does furnish such a guarantee. The procedure is as follows: (1) Initialize the policy nodes with any set of non-deterministic priors. (2) Treating the initial state and $u_G$ as observed variables ($u_G = 1$; in the single-action situation, where there is only one u node, it is this variable that is treated as observed, u = 1), use standard belief 
propagation (or a comparable algorithm) to infer the posterior distributions over all policy nodes. (3) Set the prior distributions over the policy nodes to the values (posteriors) obtained in step 2. (4) Go to step 2. The next two sections present proofs of monotonicity and convergence for this algorithm.
2.2.1 Monotonicity
We show first that, at each policy node, the probability associated with the optimal policy will rise on every iteration. Define $\pi^*$ as follows:
$$p(u_G \mid \pi^*, \pi^+) > p(u_G \mid \pi', \pi^+), \quad \forall \pi' \neq \pi^* \tag{3}$$
where $\pi^+$ is the current set of probability distributions at all policy nodes on subsequent time-steps. (Note that we assume here, for simplicity, that there is a unique optimal policy.) The objective is to establish that:
$$p(\pi^*_t) > p(\pi^*_{t-1}) \tag{4}$$
where t indexes processing iterations. The dynamics of the network entail that
$$p(\pi_t) = p(\pi_{t-1} \mid u_G) \tag{5}$$
where $\pi$ represents any value (i.e., policy) of the decision node being considered. Substituting this into (4) gives
$$p(\pi^*_{t-1} \mid u_G) > p(\pi^*_{t-1}) \tag{6}$$
From this point on the focus is on a single iteration, which permits us to omit the relevant subscripts. Applying Bayes’ law to (6) yields
$$\frac{p(u_G \mid \pi^*)\, p(\pi^*)}{\sum_{\pi} p(u_G \mid \pi)\, p(\pi)} > p(\pi^*) \tag{7}$$
Canceling, and bringing the denominator up, this becomes
$$p(u_G \mid \pi^*) > \sum_{\pi} p(u_G \mid \pi)\, p(\pi) \tag{8}$$
Rewriting the left hand side, we obtain
$$\sum_{\pi} p(u_G \mid \pi^*)\, p(\pi) > \sum_{\pi} p(u_G \mid \pi)\, p(\pi) \tag{9}$$
Subtracting and further rearranging:
$$\sum_{\pi} \left[ p(u_G \mid \pi^*) - p(u_G \mid \pi) \right] p(\pi) > 0 \tag{10}$$
$$\left[ p(u_G \mid \pi^*) - p(u_G \mid \pi^*) \right] p(\pi^*) + \sum_{\pi' \neq \pi^*} \left[ p(u_G \mid \pi^*) - p(u_G \mid \pi') \right] p(\pi') > 0 \tag{11}$$
$$\sum_{\pi' \neq \pi^*} \left[ p(u_G \mid \pi^*) - p(u_G \mid \pi') \right] p(\pi') > 0 \tag{12}$$
Note that this last inequality (12) follows from the definition of $\pi^*$. Remark: Of course, the identity of $\pi^*$ depends on $\pi^+$. In particular, the policy $\pi^*$ will only be part of a globally optimal plan if the set of choices $\pi^+$ is optimal. Fortunately, this requirement is guaranteed to be met, as long as no upper bound is placed on the number of processing cycles. 
Recalling that we are considering only finite-horizon problems, note that for policies leading to states with no successors, $\pi^+$ is empty. Thus $\pi^*$ at the relevant policy nodes is fixed, and is guaranteed to be part of the optimal policy. The proof above shows that the probability of $\pi^*$ will continuously rise. Once it reaches a maximum, $\pi^*$ at immediately preceding decisions will perforce fit with the globally optimal policy. The process works backward, in the fashion of backward induction.
2.2.2 Convergence
Continuing with the same notation, we show now that
$$\lim_{t \to \infty} p_t(\pi^* \mid u_G) = 1 \tag{13}$$
Note that, if we apply Bayes’ law recursively,
$$p_t(\pi \mid u_G) = \frac{p(u_G \mid \pi)\, p_t(\pi)}{p_t(u_G)} = \frac{p(u_G \mid \pi)^2\, p_{t-1}(\pi)}{p_t(u_G)\, p_{t-1}(u_G)} = \frac{p(u_G \mid \pi)^3\, p_{t-2}(\pi)}{p_t(u_G)\, p_{t-1}(u_G)\, p_{t-2}(u_G)} = \dots \tag{14}$$
Thus,
$$p_1(\pi \mid u_G) = \frac{p(u_G \mid \pi)\, p_1(\pi)}{p_1(u_G)}, \quad p_2(\pi \mid u_G) = \frac{p(u_G \mid \pi)^2\, p_1(\pi)}{p_2(u_G)\, p_1(u_G)}, \quad p_3(\pi \mid u_G) = \frac{p(u_G \mid \pi)^3\, p_1(\pi)}{p_3(u_G)\, p_2(u_G)\, p_1(u_G)}, \tag{15}$$
and so forth. Thus, what we wish to prove is
$$\frac{p(u_G \mid \pi^*)^{\infty}\, p_1(\pi^*)}{\prod_{t=1}^{\infty} p_t(u_G)} = 1 \tag{16}$$
or, rearranging,
$$\prod_{t=1}^{\infty} \frac{p_t(u_G)}{p(u_G \mid \pi^*)} = p_1(\pi^*). \tag{17}$$
Note that, given the stipulated relationship between $p(\pi)$ on each processing iteration and $p(\pi \mid u_G)$ on the previous iteration,
$$p_t(u_G) = \sum_{\pi} p(u_G \mid \pi)\, p_t(\pi) = \sum_{\pi} p(u_G \mid \pi)\, p_{t-1}(\pi \mid u_G) = \sum_{\pi} \frac{p(u_G \mid \pi)^2\, p_{t-1}(\pi)}{p_{t-1}(u_G)} = \sum_{\pi} \frac{p(u_G \mid \pi)^3\, p_{t-2}(\pi)}{p_{t-1}(u_G)\, p_{t-2}(u_G)} = \sum_{\pi} \frac{p(u_G \mid \pi)^4\, p_{t-3}(\pi)}{p_{t-1}(u_G)\, p_{t-2}(u_G)\, p_{t-3}(u_G)} = \dots \tag{18}$$
With this in mind, we can rewrite the left hand side product in (17) as follows:
$$\frac{\sum_{\pi} p(u_G \mid \pi)\, p_1(\pi)}{p(u_G \mid \pi^*)} \cdot \frac{\sum_{\pi} p(u_G \mid \pi)^2\, p_1(\pi)}{p(u_G \mid \pi^*)\, p_1(u_G)} \cdot \frac{\sum_{\pi} p(u_G \mid \pi)^3\, p_1(\pi)}{p(u_G \mid \pi^*)\, p_1(u_G)\, p_2(u_G)} \cdot \frac{\sum_{\pi} p(u_G \mid \pi)^4\, p_1(\pi)}{p(u_G \mid \pi^*)\, p_1(u_G)\, p_2(u_G)\, p_3(u_G)} \cdots \tag{19}$$
Note that, given (18), the numerator in each factor of (19) cancels with the denominator in the subsequent factor, leaving only $p(u_G \mid \pi^*)$ in that denominator. The expression can thus be rewritten as
$$\frac{1}{p(u_G \mid \pi^*)} \cdot \frac{1}{p(u_G \mid \pi^*)} \cdot \frac{1}{p(u_G \mid \pi^*)} \cdot \frac{\sum_{\pi} p(u_G \mid \pi)^4\, p_1(\pi)}{p(u_G \mid \pi^*)} \cdots = \sum_{\pi} \left( \frac{p(u_G \mid \pi)}{p(u_G \mid \pi^*)} \right)^{\infty} p_1(\pi). \tag{20}$$
The objective is then to show that the above equals $p_1(\pi^*)$. 
It proceeds directly from the definition of $\pi^*$ that, for all $\pi$ other than $\pi^*$,
$$\frac{p(u_G \mid \pi)}{p(u_G \mid \pi^*)} < 1 \tag{21}$$
Thus, all but one of the terms in the sum above approach zero, and the remaining term equals $p_1(\pi^*)$. Thus,
$$\sum_{\pi} \left( \frac{p(u_G \mid \pi)}{p(u_G \mid \pi^*)} \right)^{\infty} p_1(\pi) = p_1(\pi^*) \tag{22}$$
3 Simulations
3.1 Binary choice
We begin with a simulation of a simple incentive choice situation. Here, an animal faces two levers. Pressing the left lever reliably yields a preferred food (r = 2), the right a less preferred food (r = 1). Representing these contingencies in a network structured as in Fig. 1 (left) and employing the iterative algorithm described in section 2.2 yields the results in Figure 2A. Shown here are the posterior probabilities for the policies press left and press right, along with the marginal value of p(u = 1) under these posteriors (labeled EV for expected value). The dashed horizontal line indicates the expected value for the optimal plan, to which the model obviously converges. A key empirical assay for purposive behavior involves outcome devaluation. Here, actions yielding a previously valued outcome are abandoned after the incentive value of the outcome is reduced, for example by pairing with an aversive event (e.g., [4]). To simulate this within the binary choice scenario just described, we reduced to zero the reward value of the food yielded by the left lever (fL), by making the appropriate change to p(u|fL). This yielded a reversal in lever choice (Fig. 2B). Another signature of purposive actions is that they are abandoned when their causal connection with rewarding outcomes is removed (contingency degradation, see [4]). We simulated this by starting with the model from Fig. 2A and changing conditional probabilities at s for t=2 to reflect a decoupling of the left action from the fL outcome. The resulting behavior is shown in Fig. 2C.
Fig 2. Simulation results, binary choice. 
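The binary-choice simulation and its devaluation variant can be reproduced in a few lines. The following is a minimal sketch, not the authors' implementation: it assumes that, because each lever deterministically yields one food, the network collapses to a single policy node feeding a single utility node, so that p(u = 1 | policy) is given directly by Equation 1. The function names, the uniform starting prior, and the choice of 50 iterations are all hypothetical.

```python
# Sketch of the iterative policy-inference loop for the two-lever task.
# Assumption: deterministic levers, so p(u = 1 | policy) = p(u = 1 | food).
# Function names, the uniform prior and n_iter = 50 are illustrative choices.

def reward_to_prob(rewards):
    """Equation 1: linearly map scalar rewards to p(u = 1 | s) in [0, 1]."""
    r_max = max(abs(r) for r in rewards)
    return [0.5 * (r / r_max + 1.0) for r in rewards]

def infer_policy(p_u_given_policy, prior, n_iter=50):
    """Condition on u = 1, then reuse the posterior as the next prior."""
    history = [list(prior)]
    for _ in range(n_iter):
        joint = [lik * p for lik, p in zip(p_u_given_policy, prior)]
        evidence = sum(joint)                  # p(u = 1) under current prior
        prior = [j / evidence for j in joint]  # posterior p(policy | u = 1)
        history.append(list(prior))
    return history

# Baseline: press-left yields r = 2, press-right yields r = 1.
likelihood = reward_to_prob([2.0, 1.0])
history = infer_policy(likelihood, prior=[0.5, 0.5])

# Outcome devaluation: the left food's reward is reduced to zero.
devalued = infer_policy(reward_to_prob([0.0, 1.0]), prior=[0.5, 0.5])

print("baseline  p(left), p(right):", history[-1])
print("devalued  p(left), p(right):", devalued[-1])
```

In this sketch the probability of the better policy rises on every iteration, as the monotonicity argument requires, and the preference reverses once the left outcome is devalued. Multi-step problems would need belief propagation over the full network rather than this closed-form update.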
3.2 Stochastic outcomes
A critical aspect of the present modeling paradigm is that it yields reward-maximizing choices in stochastic domains, a property that distinguishes it from some other recent approaches using graphical models to do planning (e.g., [9]). To illustrate, we used the architecture in Figure 1 (left) to simulate a choice between two fair coins. A ‘left’ coin yields $1 for heads, $0 for tails; a ‘right’ coin $2 for heads but for tails a $3 loss. As illustrated in Fig. 2D, the model maximizes expected value by opting for the left coin.
Fig 3. Simulation results, two-step sequential choice.
3.3 Sequential decision
Here, we adopt the two-step T-maze scenario used by [24] (Fig. 3A). Representing the task contingencies in a graphical model based on the template from Fig 1 (right), and using the reward values indicated in Fig. 3A, yields the choice behavior shown in Figure 3B. Following [24], a shift in motivational state from hunger to thirst can be represented in the graphical model by changing the reward function (R(cheese) = 2, R(X) = 0, R(water) = 4, R(carrots) = 1). Imposing this change at the level of the u variables yields the choice behavior shown in Fig. 3C. The model can also be used to simulate effort-based decision. Starting with the scenario in Fig. 2A, we simulated the insertion of an effort-demanding scalable barrier at S2 (R(S2) = -2) by making appropriate changes to p(u|s). The resulting behavior is shown in Fig. 3D. A famous empirical demonstration of purposive control involves detour behavior. Using a maze like the one shown in Fig. 4A, with a food reward placed at s5, Tolman [2] found that rats reacted to a barrier at location A by taking the upper route, but to a barrier at B by taking the longer lower route. We simulated this experiment by representing the corresponding transition and reward functions (see footnote 3) in a graphical model of the form shown in Fig. 
1 (right), representing the insertion of barriers by appropriate changes to the transition function. The resulting choice behavior at the critical juncture s2 is shown in Fig. 4.
Fig 4. Simulation results, detour behavior. B: No barrier. C: Barrier at A. D: Barrier at B.
Another classic empirical demonstration involves latent learning. Blodgett [25] allowed rats to explore the maze shown in Fig. 5. Later insertion of a food reward at s13 was followed immediately by dramatic reductions in the running time, reflecting a reduction in entries into blind alleys. We simulated this effect in a model based on the template in Fig. 1 (right), representing the maze layout via an appropriate transition function. In the absence of a reward at s12, random choices occurred at each intersection. However, setting R(s13) = 1 resulted in the set of choices indicated by the heavier arrows in Fig. 5.
Fig 5. Latent learning.
4 Relation to previous work
Initial proposals for how to solve decision problems through probabilistic inference in graphical models, including the idea of encoding reward as the posterior probability of a random utility variable, were put forth by Cooper [23]. Related ideas were presented by Shachter and Peot [12], including the use of nodes that integrate information from multiple utility nodes. More recently, Attias [11] and Verma and Rao [9] have used graphical models to solve shortest-path problems, leveraging probabilistic representations of rewards, though not in a way that guaranteed convergence on optimal (reward maximizing) plans. More closely related to the present research is work by Toussaint and Storkey [10], employing the EM algorithm. The iterative approach we have introduced here has a certain resemblance to the EM procedure, which becomes evident if one views the policy variables in our models as parameters on the mapping from states to actions. 
It seems possible that there may be a formal equivalence between the algorithm we have proposed and the one reported by [10]. As a cognitive and neuroscientific proposal, the present work bears a close relation to recent work by Hasselmo [6], addressing the prefrontal computations underlying goal-directed action selection (see also [7]). The present efforts are tied more closely to normative principles of decision-making, whereas the work in [6] is tied more closely to the details of neural circuitry. In this respect, the two approaches may prove complementary, and it will be interesting to further consider their interrelations.
Footnote 3: In this simulation and the next, the set of states associated with each state node was limited to the set of reachable states for the relevant time-step, assuming an initial state of s1.
Acknowledgments
Thanks to Andrew Ledvina, David Blei, Yael Niv, Nathaniel Daw, and Francisco Pereira for useful comments.
References
[1] Hull, C.L., Principles of Behavior. 1943, New York: Appleton-Century. [2] Tolman, E.C., Purposive Behavior in Animals and Men. 1932, New York: Century. [3] Dickinson, A., Actions and habits: the development of behavioral autonomy. Philosophical Transactions of the Royal Society (London), Series B, 1985. 308: p. 67-78. [4] Balleine, B.W. and A. Dickinson, Goal-directed instrumental action: contingency and incentive learning and their cortical substrates. Neuropharmacology, 1998. 37: p. 407-419. [5] Daw, N.D., Y. Niv, and P. Dayan, Uncertainty-based competition between prefrontal and striatal systems for behavioral control. Nature Neuroscience, 2005. 8: p. 1704-1711. [6] Hasselmo, M.E., A model of prefrontal cortical mechanisms for goal-directed behavior. Journal of Cognitive Neuroscience, 2005. 17: p. 1115-1129. [7] Schmajuk, N.A. and A.D. Thieme, Purposive behavior and cognitive mapping. A neural network model. Biological Cybernetics, 1992. 67: p. 165-174. [8] Tatman, J.A. and R.D. 
Shachter, Dynamic programming and influence diagrams. IEEE Transactions on Systems, Man and Cybernetics, 1990. 20: p. 365-379. [9] Verma, D. and R.P.N. Rao. Planning and acting in uncertain enviroments using probabilistic inference. in IEEE/RSJ International Conference on Intelligent Robots and Systems. 2006. [10] Toussaint, M. and A. Storkey. Probabilistic inference for solving discrete and continuous state markov decision processes. in Proceedings of the 23rd International Conference on Machine Learning. 2006. Pittsburgh, PA. [11] Attias, H. Planning by probabilistic inference. in Proceedings of the 9th Int. Workshop on Artificial Intelligence and Statistics. 2003. [12] Shachter, R.D. and M.A. Peot. Decision making using probabilistic inference methods. in Uncertainty in artificial intelligence: Proceedings of the Eighth Conference (1992). 1992. Stanford University: M. Kaufmann. [13] Chater, N., J.B. Tenenbaum, and A. Yuille, Probabilistic models of cognition: conceptual foundations. Trends in Cognitive Sciences, 2006. 10(7): p. 287-291. [14] Doya, K., et al., eds. The Bayesian Brain: Probabilistic Approaches to Neural Coding. 2006, MIT Press: Cambridge, MA. [15] Miller, E.K. and J.D. Cohen, An integrative theory of prefrontal cortex function. Annual Review of Neuroscience, 2001. 24: p. 167-202. [16] Asaad, W.F., G. Rainer, and E.K. Miller, Task-specific neural activity in the primate prefrontal cortex. Journal of Neurophysiology, 2000. 84: p. 451-459. [17] Rolls, E.T., The functions of the orbitofrontal cortex. Brain and Cognition, 2004. 55: p. 11-29. [18] Padoa-Schioppa, C. and J.A. Assad, Neurons in the orbitofrontal cortex encode economic value. Nature, 2006. 441: p. 223-226. [19] Gopnik, A., et al., A theory of causal learning in children: causal maps and Bayes nets. Psychological Review, 2004. 111: p. 1-31. [20] Hamilton, A.F.d.C. and S.T. Grafton, Action outcomes are represented in human inferior frontoparietal cortex. Cerebral Cortex, 2008. 18: p. 
1160-1168. [21] Johnson, A., M.A.A. van der Meer, and D.A. Redish, Integrating hippocampus and striatum in decision-making. Current Opinion in Neurobiology, 2008. 17: p. 692-697. [22] Jensen, F.V., Bayesian Networks and Decision Graphs. 2001, New York: Springer Verlag. [23] Cooper, G.F. A method for using belief networks as influence diagrams. in Fourth Workshop on Uncertainty in Artificial Intelligence. 1988. University of Minnesota, Minneapolis. [24] Niv, Y., D. Joel, and P. Dayan, A normative perspective on motivation. Trends in Cognitive Sciences, 2006. 10: p. 375-381. [25] Blodgett, H.C., The effect of the introduction of reward upon the maze performance of rats. University of California Publications in Psychology, 1929. 4: p. 113-134.

6 0.71605337 195 nips-2008-Regularized Policy Iteration

7 0.70363075 87 nips-2008-Fitted Q-iteration by Advantage Weighted Regression

8 0.70241469 96 nips-2008-Hebbian Learning of Bayes Optimal Decisions

9 0.69237447 181 nips-2008-Policy Search for Motor Primitives in Robotics

10 0.69061577 231 nips-2008-Temporal Dynamics of Cognitive Control

11 0.68999463 135 nips-2008-Model Selection in Gaussian Graphical Models: High-Dimensional Consistency of \boldmath$\ell 1$-regularized MLE

12 0.68964851 49 nips-2008-Clusters and Coarse Partitions in LP Relaxations

13 0.68767613 216 nips-2008-Sparse probabilistic projections

14 0.68759555 39 nips-2008-Bounding Performance Loss in Approximate MDP Homomorphisms

15 0.68718749 131 nips-2008-MDPs with Non-Deterministic Policies

16 0.68515188 240 nips-2008-Tracking Changing Stimuli in Continuous Attractor Neural Networks

17 0.68384683 62 nips-2008-Differentiable Sparse Coding

18 0.68378669 129 nips-2008-MAS: a multiplicative approximation scheme for probabilistic inference

19 0.68247443 175 nips-2008-PSDBoost: Matrix-Generation Linear Programming for Positive Semidefinite Matrices Learning

20 0.68238467 245 nips-2008-Unlabeled data: Now it helps, now it doesn't