nips nips2009 nips2009-134 knowledge-graph by maker-knowledge-mining

134 nips-2009-Learning to Explore and Exploit in POMDPs


Source: pdf

Author: Chenghui Cai, Xuejun Liao, Lawrence Carin

Abstract: A fundamental objective in reinforcement learning is the maintenance of a proper balance between exploration and exploitation. This problem becomes more challenging when the agent can only partially observe the states of its environment. In this paper we propose a dual-policy method for jointly learning the agent behavior and the balance between exploration exploitation, in partially observable environments. The method subsumes traditional exploration, in which the agent takes actions to gather information about the environment, and active learning, in which the agent queries an oracle for optimal actions (with an associated cost for employing the oracle). The form of the employed exploration is dictated by the specific problem. Theoretical guarantees are provided concerning the optimality of the balancing of exploration and exploitation. The effectiveness of the method is demonstrated by experimental results on benchmark problems.

Reference: text


Summary: the most important sentenses genereted by tfidf model

sentIndex sentText sentNum sentScore

1 This problem becomes more challenging when the agent can only partially observe the states of its environment. [sent-2, score-0.205]

2 In this paper we propose a dual-policy method for jointly learning the agent behavior and the balance between exploration exploitation, in partially observable environments. [sent-3, score-0.445]

3 The method subsumes traditional exploration, in which the agent takes actions to gather information about the environment, and active learning, in which the agent queries an oracle for optimal actions (with an associated cost for employing the oracle). [sent-4, score-0.574]

4 The form of the employed exploration is dictated by the specific problem. [sent-5, score-0.187]

5 Theoretical guarantees are provided concerning the optimality of the balancing of exploration and exploitation. [sent-6, score-0.248]

6 1 Introduction A fundamental challenge facing reinforcement learning (RL) algorithms is to maintain a proper balance between exploration and exploitation. [sent-8, score-0.269]

7 The policy designed based on previous experiences is by construction constrained, and may not be optimal as a result of inexperience. [sent-9, score-0.207]

8 Although these actions may not necessarily yield optimal near-term reward toward the ultimate goal, they could, over a long horizon, yield improved long-term reward. [sent-11, score-0.143]

9 For a Markov decision process (MDP), the problem of balancing exploration and exploitation has been addressed successfully by the E 3 [4, 5] and R-max [2] algorithms. [sent-13, score-0.324]

10 Reinforcement learning in POMDPs is challenging, particularly in the context of balancing exploration and exploitation. [sent-15, score-0.208]

11 exploitation problem is based on an augmented POMDP, with a product state space over the environment states and the unknown POMDP parameters [9]. [sent-17, score-0.153]

12 To mitigate this complexity, active learning methods have been proposed for POMDPs, which borrow similar ideas from supervised learning, and apply them to selectively query an oracle (domain expert) for the optimal action [3]. [sent-19, score-0.126]

13 In this paper we propose a dual-policy approach to balance exploration and exploitation in POMDPs, by simultaneously learning two policies with partially shared internal structure. [sent-21, score-0.372]

14 The first policy, termed the primary policy, defines actions based on previous experience; the second policy, termed 1 the auxiliary policy, is a meta-level policy maintaining a proper balance between exploration and exploitation. [sent-22, score-0.556]

15 We employ the regionalized policy representation (RPR) [6] to parameterize both policies, and perform Bayesian learning to update the policy posteriors. [sent-23, score-0.373]

16 The approach applies in either of two cases: (i) the agent explores by randomly taking the actions that have been insufficiently tried before (traditional exploration), or (ii) the agent explores by querying an oracle for the optimal action (active learning). [sent-24, score-0.572]

17 In the latter case, the agent is assessed a query cost from the oracle, in addition to the reward received from the environment. [sent-25, score-0.285]

18 Either (i) or (ii) is employed as an exploration vehicle, depending upon the application. [sent-26, score-0.201]

19 Another distinction is that our approach learns the agent policy directly from episodes, without estimating the POMDP model. [sent-29, score-0.354]

20 2 Regionalized Policy Representation We first provide a brief review of the regionalized policy representation, which is used to parameterize the primary policy and the auxiliary policy as discussed above. [sent-31, score-0.646]

21 1 A regionalized policy representation is a tuple (A, O, Z, W, µ, π). [sent-34, score-0.197]

22 Furthermore, the history-dependent distribution of action choices is obtained as follows: p(aτ |hτ , Θ) = p(a0:τ |o1:τ , Θ)[p(a0:τ −1 |o1:τ −1 , Θ)]−1 which gives a stochastic policy for choosing the action aτ . [sent-64, score-0.208]

23 , it breaks into subsequences called episodes [10], we represent the experiences by a set of episodes. [sent-70, score-0.176]

24 2 An episode is a sequence of agent-environment interactions terminated in an absorbing state that transits to itself with zero reward. [sent-72, score-0.117]

25 An episode is denoted by k k k (ak r0 ok ak r1 · · · ok k ak k rTk ), where the subscripts are discrete times, k indexes the episodes, and o, 0 1 1 T T a, and r are respectively observations, actions, and immediate rewards. [sent-73, score-0.353]

26 1 K V (D(K) ; Θ) = K k=1 Tk t=0 2 k γ t rt É É t k k τ =0 p(aτ |hτ ,Θ) t Π k k τ =0 p (aτ |hτ ) (2) where hk = ak ok ak · · · ok is the history of actions and observations up to time t in the k-th episode, t 0 1 1 t 0 < γ < 1 is the discount, and Θ denotes the RPR parameters. [sent-77, score-0.398]

27 It is proven in [6] that limK→∞ V (D(K) ; Θ) is the expected sum of discounted rewards by following the RPR policy parameterized by Θ for an infinite number of steps. [sent-79, score-0.229]

28 In the Bayesian setting discussed below, we use a noninformative prior for Θ, leading to a posterior of Θ peaked at the optimal RPR, therefore the agent is guaranteed to sample the optimal or a near-optimal policy with overwhelming probability. [sent-81, score-0.426]

29 The episodes D(K) may be obtained from the environment by following an arbitrary stochastic policy Π with pΠ (a|h) > 0, ∀ a, ∀ h. [sent-100, score-0.361]

30 A good choice of Π avoids episodes that do not bring new information to improve the RPR, and thus the agent does not have to see all possible episodes before the RPR becomes optimal. [sent-102, score-0.504]

31 In batch learning, all episodes are collected before the learning begins, and thus Π is pre-chosen and does not change during the learning [6]. [sent-103, score-0.158]

32 In online learning, however, the episodes are collected during the learning, and the RPR is updated upon completion of each episode. [sent-104, score-0.186]

33 The agent should recognize belief regions it is familiar with, and exploit the existing RPR policy there; in belief regions inferred as new, the agent should explore. [sent-106, score-0.638]

34 This balance between exploration and exploitation is performed with the goal of accumulating a large long-run reward. [sent-107, score-0.336]

35 We consider online learning of the RPR (as the primary policy) and choose Π as a mixture of two policies: one is the current RPR Θ (exploitation) and the other is an exploration policy Πe . [sent-108, score-0.414]

36 This gives the action-choosing probability pΠ (a|h) = p(y = 0|h)p(a|h, Θ, y = 0)+p(y = 1|h)p(a|h, Πe , y = 1), where y = 0 (y = 1) indicates exploitation (exploration). [sent-109, score-0.116]

37 The problem of choosing good Π then reduces to a proper balance between exploitation and exploration: the agent should exploit Θ when doing so is highly rewarding, while following Πe to enhance experience and improve Θ. [sent-110, score-0.382]

38 An auxiliary RPR is employed to represent the policy for balancing exploration and exploitation, i. [sent-111, score-0.42]

39 The auxiliary RPR shares the parameters {µ, W } with the primary RPR, but with π = {π(z, a) : a ∈ A, z ∈ Z} replaced by λ = {λ(z, y) : y = 0 or 1, z ∈ Z}, where λ(z, y) is the probability of choosing exploitation (y = 0) or exploration (y = 1) in belief region z. [sent-114, score-0.441]

40 (6) In order to encourage exploration when the agent has little experience, we choose u0 = 1 and u1 > 1 so that, at the beginning of learning, the auxiliary RPR always suggests exploration. [sent-116, score-0.421]

41 As the agent accumulates episodes of experience, it comes to know a certain part of the environment in which the episodes have been collected. [sent-117, score-0.541]

42 This knowledge is reflected in the auxiliary RPR, which, along with the primary RPR, is updated upon completion of each new episode. [sent-118, score-0.135]

43 However, the agent cannot directly access the belief states, because computation of belief states requires knowing the true POMDP model, which is not available. [sent-120, score-0.25]

44 Within the RPR formulation, H is represented internally as the set of distributions over belief regions z ∈ Z, which allows the agent to access H based on a subset of samples from H. [sent-122, score-0.229]

45 , the primary RPR is optimal in Hknown and thus the agent should begin to exploit upon entering Hknown . [sent-125, score-0.3]

46 We have mk = rmeta if the goal is reached at time t in t episode k, and mk = 0 otherwise, where rmeta > 0 is a constant. [sent-127, score-0.149]

47 Thus, the k probability of exploration is reduced each time the agent makes a query to the oracle (i. [sent-129, score-0.444]

48 After a certain number of queries, ui,1 becomes the small positive number η (it never becomes zero due to the max operator), at which point the agent stops querying in belief region z = i. [sent-132, score-0.246]

49 In (7) and (8), exploitation always receives a “credit”, while exploration never receives credit (exploration is actually discredited when Πe is an oracle). [sent-133, score-0.378]

50 This update makes sure that the chance of exploitation monotonically increases as the episodes accumulate. [sent-134, score-0.296]

51 Exploration receives no credit because it has been pre-assigned a credit (u1 ) in the prior, and the chance of exploration should monotonically decrease with the accumulation of episodes. [sent-135, score-0.27]

52 When c > 0, u1 is discredited by the cost and the agent needs a larger u1 (than when c = 0) to obtain the same amount of exploration. [sent-137, score-0.213]

53 The fact that the amount of exploration monotonically increases with u1 implies that, one can always find a large enough u1 to ensure that the primary RPR is optimal in Hknown = {h : p(y = 0|h, Θ, λ) ≈ 1}. [sent-138, score-0.283]

54 However, an unnecessarily large u1 makes the agent over-explore and leads to slow convergence. [sent-139, score-0.188]

55 The condition y0:t = 0, which is an an abbreviation for yτ = 0 ∀ τ = 0, 1, · · · , t, (K) indicates that the agent always follows the RPR (Θ) here. [sent-145, score-0.207]

56 When T = ∞ 1 , the two are identical (K) up to a difference in acquiring the episodes: ET is a simple enumeration of distinct episodes (K) while D may contain identical episodes. [sent-147, score-0.158]

57 The multiplicity of an episode in D(K) results from the sampling process (by following a policy to interact with the environment). [sent-148, score-0.249]

58 Let (K) Pexlpore (ET , Θ, λ, Πe ) be the probability of executing the exploration policy Πe at least once in (K) some episode in ET , under the auxiliary RPR (Θ, λ) and the exploration policy Πe . [sent-155, score-0.845]

59 Rmax 1 An episode almost always terminates in finite time steps in practice and the agent stays in the absorbing state with zero reward for the remaining infinite steps after an episode is terminated [10]. [sent-157, score-0.474]

60 The infinite horizon is only to ensure theoretically all episodes have the same horizon length. [sent-158, score-0.186]

61 Let the auxiliary RPR hyper-parameters (λ) be updated according to (7) and (8), with u1 ≥ umin . [sent-161, score-0.122]

62 Then either (a) V (E∞ ; Θ) ≥ 1 V (E∞ ; Θ∗ ) − ǫ, or (b) the probability that the auxiliary RPR suggests executing Πe in some episode (K) unseen in E∞ is at least ǫ(1−γ) . [sent-163, score-0.139]

63 By construction, Vf (E∞ ; Θ, λ, Πe ) is an optimistic value function, in which the agent receives Rmax at any time t unless if yτ = 0 at τ = 0, 1, · · · , t. [sent-170, score-0.227]

64 However, yτ = 0 at τ = 0, 1, · · · , t implies that {hτ : τ = 0, 1, · · · , t} ⊂ Hknown , By the premise, λ is updated according to (7) and (8) and u1 ≥ umin , therefore Θ is optimal in Hknown 1 (see the discussions following (7) and (8)), which implies Θ is optimal in {hτ : τ = 0, 1, · · · , t}. [sent-171, score-0.122]

65 2 shows that whenever the primary RPR achieves less accumulative reward than the optimal RPR by ǫ, the auxiliary RPR suggests exploration with a probability exceeding −1 ǫ(1 − γ)Rmax . [sent-177, score-0.405]

66 Conversely, whenever the auxiliary RPR suggests exploration with a probability −1 smaller than ǫ(1 − γ)Rmax , the primary RPR achieves ǫ-near optimality. [sent-178, score-0.294]

67 This ensures that the agent is either receiving sufficient rewards or it is performing sufficient exploration. [sent-179, score-0.24]

68 The primary policy is a RPR with |Z| = 10 and a prior in (4), with all hyper-parameters initially set to one (which makes the initial prior non-informative). [sent-181, score-0.247]

69 The auxiliary policy is a RPR sharing {µ, W } with the primary RPR and having a prior for λ as in (6). [sent-182, score-0.283]

70 The prior of λ is initially biased towards exploration by using u0 = 1 and u1 > 1. [sent-183, score-0.197]

71 The agent performs online learning: upon termination of each new episode, the primary and auxiliary RPR posteriors are updated by using the previous posteriors as the current priors. [sent-185, score-0.334]

72 In each run, the agent starts learning from a random position in the environment and stops learning when Ktotal episodes are completed. [sent-189, score-0.399]

73 When performing exploitation, the agent follows the current primary RPR (using the Θ that maximizes the posterior); when performing exploration, it follows an exploration policy Πe . [sent-191, score-0.602]

74 We consider two types of Πe : (i) taking random actions and (ii) following the policy obtained by solving the true POMDP using PBVI [8] with 2000 belief samples. [sent-192, score-0.249]

75 In case (ii), the PBVI policy is the oracle and incurs a query cost c. [sent-195, score-0.244]

76 We report: (i) the sum of discounted rewards accrued within each episode during learning; these rewards result from both exploitation and exploration. [sent-196, score-0.355]

77 (iii) the exploration rate Pexplore in each learning episode, which is the number of time steps at which exploration is performed divided by the total time steps in 6 a given episode. [sent-198, score-0.392]

78 In order to examine the optimality, the rewards in (i)-(ii) has the corresponding optimal rewards subtracted, where the optimal rewards are obtained by following the PBVI policy; the difference are reported, with zero difference indicating optimality and minus difference indicating sub-optimality. [sent-199, score-0.249]

79 3 −10 −12 0 500 1000 1500 2000 2500 3000 Number of episodes used in the learning phase 0. [sent-210, score-0.179]

80 8 Exploration rate −6 Accrued testing reward minus optimal reward Accrued learning reward minus optimal reward −2 −4 −12 1 0 −2 RPR, Pexplore = 1. [sent-211, score-0.386]

81 0 0 0 500 1000 1500 2000 2500 3000 Number of episodes used in the learning phase 500 1000 1500 2000 2500 3000 Number of episodes used in the learning phase Figure 1: Results on Shuttle with a random exploration policy, with Ktotal = 3000. [sent-212, score-0.545]

82 Left: accumulative discounted reward accrued within each learning episode, with the corresponding optimal reward subtracted. [sent-213, score-0.249]

83 Middle: accumulative discounted rewards averaged over 251 episodes of following the primary RPR obtained after each learning episode, again with the corresponding optimal reward subtracted. [sent-214, score-0.393]

84 Right: the rate of exploration in each learning episode. [sent-215, score-0.187]

85 4 0 0 20 40 60 80 100 Number of episodes used in the learning phase 20 40 60 80 100 Number of episodes used in the learning phase 1 0 0 Dual−RPR, u1=10 0. [sent-224, score-0.358]

86 657 −20 0 20 40 60 80 100 Number of episodes used in the learning phase = 0. [sent-226, score-0.179]

87 158 RPR, P Exploration rate −4 Accrued testing reward minus optimal reward Accrued learning reward minus optimal reward −2 Dual−RPR, u1=2 −6 −8 Dual−RPR, u =2 1 Dual−RPR, u1=10 Dual−RPR, u1=20 RPR, Pexplore = 0. [sent-229, score-0.386]

88 0 20 40 60 80 100 Number of episodes used in the learning phase 0. [sent-233, score-0.179]

89 0 20 40 60 80 100 Number of episodes used in the learning phase Exploration rate −4 Accrued testing reward minus optimal reward Accrued learning reward minus optimal reward −2 Dual−RPR, u1=10 Dual−RPR, u1=20 0. [sent-238, score-0.565]

90 2 0 0 20 40 60 80 100 Number of episodes used in the learning phase Figure 2: Results on Shuttle with an oracle exploration policy incurring cost c = 1 (top row) and c = 3 (bottom row), and Ktotal = 100. [sent-241, score-0.59]

91 It is seen from Figure 1 that, with random exploration and u1 = 2, the primary policy converges to optimality and, accordingly, Pexplore drops to zero, after about 1500 learning episodes. [sent-244, score-0.454]

92 However, there are two special observations from Figure 2: (i) Pexplore is affected by the query cost c: with a larger c, the agent performs less exploration. [sent-252, score-0.217]

93 (ii) the convergence rate of the primary policy is not significantly affected by the query cost. [sent-253, score-0.261]

94 The reason for (ii) is that the oracle always provides optimal actions, thus over-exploration does not harm the optimality; as long as the agent takes optimal actions, the primary policy continually improves if it is not yet optimal, or it remains optimal if it is already optimal. [sent-254, score-0.533]

95 6 Conclusions We have presented a dual-policy approach for jointly learning the agent behavior and the optimal balance between exploitation and exploration, assuming the unknown environment is a POMDP. [sent-255, score-0.397]

96 By identifying a known part of the environment in terms of histories (parameterized by the RPR), the approach adaptively switches between exploration and exploitation depending on whether the agent is in the known part. [sent-256, score-0.553]

97 We have provided theoretical guarantees for the agent to either explore efficiently or exploit efficiently. [sent-257, score-0.202]

98 Experimental results show good agreement with our theoretical analysis and that our approach finds the optimal policy efficiently. [sent-258, score-0.189]

99 In 1 the worst case, the agent can always choose to be optimistic, like in E3 and Rmax. [sent-262, score-0.188]

100 An optimistic agent uses a large u1 , which usually leads to over-exploration but ensures convergence to optimality. [sent-263, score-0.231]


similar papers computed by tfidf model

tfidf for this paper:

wordName wordTfidf (topN-words)

[('rpr', 0.825), ('pexplore', 0.241), ('agent', 0.188), ('exploration', 0.187), ('policy', 0.166), ('episodes', 0.158), ('rmax', 0.15), ('exploitation', 0.116), ('episode', 0.083), ('rt', 0.083), ('pomdp', 0.077), ('ok', 0.075), ('hknown', 0.07), ('dual', 0.07), ('reward', 0.068), ('umin', 0.062), ('primary', 0.061), ('actions', 0.052), ('ak', 0.05), ('accrued', 0.05), ('oracle', 0.049), ('vf', 0.048), ('auxiliary', 0.046), ('rewards', 0.043), ('optimality', 0.04), ('reinforcement', 0.038), ('environment', 0.037), ('minus', 0.034), ('balance', 0.033), ('qt', 0.033), ('belief', 0.031), ('pbvi', 0.031), ('regionalized', 0.031), ('et', 0.028), ('shuttle', 0.027), ('pomdps', 0.026), ('tk', 0.025), ('ktotal', 0.023), ('pexlpore', 0.023), ('rmeta', 0.023), ('optimal', 0.023), ('action', 0.021), ('credit', 0.021), ('balancing', 0.021), ('phase', 0.021), ('observable', 0.02), ('explores', 0.02), ('accumulative', 0.02), ('terminated', 0.02), ('optimistic', 0.02), ('experience', 0.02), ('query', 0.02), ('discounted', 0.02), ('policies', 0.019), ('abbreviation', 0.019), ('receives', 0.019), ('ii', 0.018), ('ht', 0.018), ('experiences', 0.018), ('partially', 0.017), ('posterior', 0.016), ('stops', 0.016), ('discredited', 0.016), ('rtk', 0.016), ('updated', 0.014), ('dir', 0.014), ('exploit', 0.014), ('convergence', 0.014), ('adaptively', 0.014), ('horizon', 0.014), ('absorbing', 0.014), ('lb', 0.014), ('upon', 0.014), ('active', 0.013), ('variational', 0.013), ('history', 0.013), ('mdp', 0.013), ('liao', 0.012), ('monotonically', 0.012), ('denoting', 0.012), ('sixteenth', 0.012), ('pineau', 0.012), ('proper', 0.011), ('querying', 0.011), ('termination', 0.011), ('histories', 0.011), ('subscripts', 0.011), ('mk', 0.01), ('prior', 0.01), ('parameterize', 0.01), ('enhancing', 0.01), ('regions', 0.01), ('chance', 0.01), ('executing', 0.01), ('cost', 0.009), ('ensures', 0.009), ('environments', 0.009), ('bayesian', 0.009), ('steps', 0.009), ('indexes', 0.009)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 1.0 134 nips-2009-Learning to Explore and Exploit in POMDPs

Author: Chenghui Cai, Xuejun Liao, Lawrence Carin

Abstract: A fundamental objective in reinforcement learning is the maintenance of a proper balance between exploration and exploitation. This problem becomes more challenging when the agent can only partially observe the states of its environment. In this paper we propose a dual-policy method for jointly learning the agent behavior and the balance between exploration exploitation, in partially observable environments. The method subsumes traditional exploration, in which the agent takes actions to gather information about the environment, and active learning, in which the agent queries an oracle for optimal actions (with an associated cost for employing the oracle). The form of the employed exploration is dictated by the specific problem. Theoretical guarantees are provided concerning the optimality of the balancing of exploration and exploitation. The effectiveness of the method is demonstrated by experimental results on benchmark problems.

2 0.20469134 242 nips-2009-The Infinite Partially Observable Markov Decision Process

Author: Finale Doshi-velez

Abstract: The Partially Observable Markov Decision Process (POMDP) framework has proven useful in planning domains where agents must balance actions that provide knowledge and actions that provide reward. Unfortunately, most POMDPs are complex structures with a large number of parameters. In many real-world problems, both the structure and the parameters are difficult to specify from domain knowledge alone. Recent work in Bayesian reinforcement learning has made headway in learning POMDP models; however, this work has largely focused on learning the parameters of the POMDP model. We define an infinite POMDP (iPOMDP) model that does not require knowledge of the size of the state space; instead, it assumes that the number of visited states will grow as the agent explores its world and only models visited states explicitly. We demonstrate the iPOMDP on several standard problems. 1

3 0.14517483 113 nips-2009-Improving Existing Fault Recovery Policies

Author: Guy Shani, Christopher Meek

Abstract: An automated recovery system is a key component in a large data center. Such a system typically employs a hand-made controller created by an expert. While such controllers capture many important aspects of the recovery process, they are often not systematically optimized to reduce costs such as server downtime. In this paper we describe a passive policy learning approach for improving existing recovery policies without exploration. We explain how to use data gathered from the interactions of the hand-made controller with the system, to create an improved controller. We suggest learning an indefinite horizon Partially Observable Markov Decision Process, a model for decision making under uncertainty, and solve it using a point-based algorithm. We describe the complete process, starting with data gathering, model learning, model checking procedures, and computing a policy. 1

4 0.13513424 53 nips-2009-Complexity of Decentralized Control: Special Cases

Author: Martin Allen, Shlomo Zilberstein

Abstract: The worst-case complexity of general decentralized POMDPs, which are equivalent to partially observable stochastic games (POSGs) is very high, both for the cooperative and competitive cases. Some reductions in complexity have been achieved by exploiting independence relations in some models. We show that these results are somewhat limited: when these independence assumptions are relaxed in very small ways, complexity returns to that of the general case. 1

5 0.093707323 107 nips-2009-Help or Hinder: Bayesian Models of Social Goal Inference

Author: Tomer Ullman, Chris Baker, Owen Macindoe, Owain Evans, Noah Goodman, Joshua B. Tenenbaum

Abstract: Everyday social interactions are heavily influenced by our snap judgments about others’ goals. Even young infants can infer the goals of intentional agents from observing how they interact with objects and other agents in their environment: e.g., that one agent is ‘helping’ or ‘hindering’ another’s attempt to get up a hill or open a box. We propose a model for how people can infer these social goals from actions, based on inverse planning in multiagent Markov decision problems (MDPs). The model infers the goal most likely to be driving an agent’s behavior by assuming the agent acts approximately rationally given environmental constraints and its model of other agents present. We also present behavioral evidence in support of this model over a simpler, perceptual cue-based alternative. 1

6 0.092450134 145 nips-2009-Manifold Embeddings for Model-Based Reinforcement Learning under Partial Observability

7 0.090296738 250 nips-2009-Training Factor Graphs with Reinforcement Learning for Efficient MAP Inference

8 0.088983938 218 nips-2009-Skill Discovery in Continuous Reinforcement Learning Domains using Skill Chaining

9 0.074935295 12 nips-2009-A Generalized Natural Actor-Critic Algorithm

10 0.060883027 209 nips-2009-Robust Value Function Approximation Using Bilinear Programming

11 0.056267921 177 nips-2009-On Learning Rotations

12 0.053785164 52 nips-2009-Code-specific policy gradient rules for spiking neurons

13 0.04857048 221 nips-2009-Solving Stochastic Games

14 0.043853935 159 nips-2009-Multi-Step Dyna Planning for Policy Evaluation and Control

15 0.042902522 60 nips-2009-Convergent Temporal-Difference Learning with Arbitrary Smooth Function Approximation

16 0.037443295 116 nips-2009-Information-theoretic lower bounds on the oracle complexity of convex optimization

17 0.035719644 14 nips-2009-A Parameter-free Hedging Algorithm

18 0.025906619 208 nips-2009-Robust Principal Component Analysis: Exact Recovery of Corrupted Low-Rank Matrices via Convex Optimization

19 0.024457447 216 nips-2009-Sequential effects reflect parallel learning of multiple environmental regularities

20 0.022628902 181 nips-2009-Online Learning of Assignments


similar papers computed by lsi model

lsi for this paper:

topicId topicWeight

[(0, -0.077), (1, 0.029), (2, 0.155), (3, -0.178), (4, -0.209), (5, 0.019), (6, 0.032), (7, -0.007), (8, -0.006), (9, 0.042), (10, 0.117), (11, 0.113), (12, 0.025), (13, 0.072), (14, 0.015), (15, -0.023), (16, -0.017), (17, 0.027), (18, 0.034), (19, 0.041), (20, 0.028), (21, -0.006), (22, -0.027), (23, -0.021), (24, 0.046), (25, -0.057), (26, -0.062), (27, -0.039), (28, 0.012), (29, -0.03), (30, 0.015), (31, -0.063), (32, 0.026), (33, 0.041), (34, -0.046), (35, 0.029), (36, -0.039), (37, 0.002), (38, 0.03), (39, 0.073), (40, 0.048), (41, -0.011), (42, 0.059), (43, -0.034), (44, 0.034), (45, 0.046), (46, -0.073), (47, 0.009), (48, 0.083), (49, 0.017)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.96539527 134 nips-2009-Learning to Explore and Exploit in POMDPs

Author: Chenghui Cai, Xuejun Liao, Lawrence Carin

Abstract: A fundamental objective in reinforcement learning is the maintenance of a proper balance between exploration and exploitation. This problem becomes more challenging when the agent can only partially observe the states of its environment. In this paper we propose a dual-policy method for jointly learning the agent behavior and the balance between exploration exploitation, in partially observable environments. The method subsumes traditional exploration, in which the agent takes actions to gather information about the environment, and active learning, in which the agent queries an oracle for optimal actions (with an associated cost for employing the oracle). The form of the employed exploration is dictated by the specific problem. Theoretical guarantees are provided concerning the optimality of the balancing of exploration and exploitation. The effectiveness of the method is demonstrated by experimental results on benchmark problems.

2 0.76025015 242 nips-2009-The Infinite Partially Observable Markov Decision Process

Author: Finale Doshi-velez

Abstract: The Partially Observable Markov Decision Process (POMDP) framework has proven useful in planning domains where agents must balance actions that provide knowledge and actions that provide reward. Unfortunately, most POMDPs are complex structures with a large number of parameters. In many real-world problems, both the structure and the parameters are difficult to specify from domain knowledge alone. Recent work in Bayesian reinforcement learning has made headway in learning POMDP models; however, this work has largely focused on learning the parameters of the POMDP model. We define an infinite POMDP (iPOMDP) model that does not require knowledge of the size of the state space; instead, it assumes that the number of visited states will grow as the agent explores its world and only models visited states explicitly. We demonstrate the iPOMDP on several standard problems. 1

3 0.70393008 218 nips-2009-Skill Discovery in Continuous Reinforcement Learning Domains using Skill Chaining

Author: George Konidaris, Andre S. Barreto

Abstract: We introduce a skill discovery method for reinforcement learning in continuous domains that constructs chains of skills leading to an end-of-task reward. We demonstrate experimentally that it creates appropriate skills and achieves performance benefits in a challenging continuous domain. 1

4 0.70214915 113 nips-2009-Improving Existing Fault Recovery Policies

Author: Guy Shani, Christopher Meek

Abstract: An automated recovery system is a key component in a large data center. Such a system typically employs a hand-made controller created by an expert. While such controllers capture many important aspects of the recovery process, they are often not systematically optimized to reduce costs such as server downtime. In this paper we describe a passive policy learning approach for improving existing recovery policies without exploration. We explain how to use data gathered from the interactions of the hand-made controller with the system, to create an improved controller. We suggest learning an indefinite horizon Partially Observable Markov Decision Process, a model for decision making under uncertainty, and solve it using a point-based algorithm. We describe the complete process, starting with data gathering, model learning, model checking procedures, and computing a policy. 1

5 0.66891378 53 nips-2009-Complexity of Decentralized Control: Special Cases

Author: Martin Allen, Shlomo Zilberstein

Abstract: The worst-case complexity of general decentralized POMDPs, which are equivalent to partially observable stochastic games (POSGs) is very high, both for the cooperative and competitive cases. Some reductions in complexity have been achieved by exploiting independence relations in some models. We show that these results are somewhat limited: when these independence assumptions are relaxed in very small ways, complexity returns to that of the general case. 1

6 0.6214242 107 nips-2009-Help or Hinder: Bayesian Models of Social Goal Inference

7 0.62029177 159 nips-2009-Multi-Step Dyna Planning for Policy Evaluation and Control

8 0.61833918 145 nips-2009-Manifold Embeddings for Model-Based Reinforcement Learning under Partial Observability

9 0.5957675 12 nips-2009-A Generalized Natural Actor-Critic Algorithm

10 0.48717642 250 nips-2009-Training Factor Graphs with Reinforcement Learning for Efficient MAP Inference

11 0.40432653 60 nips-2009-Convergent Temporal-Difference Learning with Arbitrary Smooth Function Approximation

12 0.37233207 14 nips-2009-A Parameter-free Hedging Algorithm

13 0.31728521 209 nips-2009-Robust Value Function Approximation Using Bilinear Programming

14 0.29071417 177 nips-2009-On Learning Rotations

15 0.27159724 54 nips-2009-Compositionality of optimal control laws

16 0.26705503 221 nips-2009-Solving Stochastic Games

17 0.26663011 216 nips-2009-Sequential effects reflect parallel learning of multiple environmental regularities

18 0.26439679 52 nips-2009-Code-specific policy gradient rules for spiking neurons

19 0.26097962 69 nips-2009-Discrete MDL Predicts in Total Variation

20 0.24031891 150 nips-2009-Maximum likelihood trajectories for continuous-time Markov chains


similar papers computed by lda model

lda for this paper:

topicId topicWeight

[(24, 0.03), (25, 0.036), (35, 0.067), (36, 0.075), (39, 0.025), (51, 0.342), (58, 0.063), (61, 0.114), (71, 0.056), (86, 0.049), (91, 0.014)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.77939939 134 nips-2009-Learning to Explore and Exploit in POMDPs

Author: Chenghui Cai, Xuejun Liao, Lawrence Carin

Abstract: A fundamental objective in reinforcement learning is the maintenance of a proper balance between exploration and exploitation. This problem becomes more challenging when the agent can only partially observe the states of its environment. In this paper we propose a dual-policy method for jointly learning the agent behavior and the balance between exploration exploitation, in partially observable environments. The method subsumes traditional exploration, in which the agent takes actions to gather information about the environment, and active learning, in which the agent queries an oracle for optimal actions (with an associated cost for employing the oracle). The form of the employed exploration is dictated by the specific problem. Theoretical guarantees are provided concerning the optimality of the balancing of exploration and exploitation. The effectiveness of the method is demonstrated by experimental results on benchmark problems.

2 0.58237278 73 nips-2009-Dual Averaging Method for Regularized Stochastic Learning and Online Optimization

Author: Lin Xiao

Abstract: We consider regularized stochastic learning and online optimization problems, where the objective function is the sum of two convex terms: one is the loss function of the learning task, and the other is a simple regularization term such as â„“1 -norm for promoting sparsity. We develop a new online algorithm, the regularized dual averaging (RDA) method, that can explicitly exploit the regularization structure in an online setting. In particular, at each iteration, the learning variables are adjusted by solving a simple optimization problem that involves the running average of all past subgradients of the loss functions and the whole regularization term, not just its subgradient. Computational experiments show that the RDA method can be very effective for sparse online learning with â„“1 -regularization. 1

3 0.56114161 196 nips-2009-Quantification and the language of thought

Author: Charles Kemp

Abstract: Many researchers have suggested that the psychological complexity of a concept is related to the length of its representation in a language of thought. As yet, however, there are few concrete proposals about the nature of this language. This paper makes one such proposal: the language of thought allows first order quantification (quantification over objects) more readily than second-order quantification (quantification over features). To support this proposal we present behavioral results from a concept learning study inspired by the work of Shepard, Hovland and Jenkins. Humans can learn and think about many kinds of concepts, including natural kinds such as elephant and water and nominal kinds such as grandmother and prime number. Understanding the mental representations that support these abilities is a central challenge for cognitive science. This paper proposes that quantification plays a role in conceptual representation—for example, an animal X qualifies as a predator if there is some animal Y such that X hunts Y . The concepts we consider are much simpler than real-world examples such as predator, but even simple laboratory studies can provide important clues about the nature of mental representation. Our approach to mental representation is based on the language of thought hypothesis [1]. As pursued here, the hypothesis proposes that mental representations are constructed in a compositional language of some kind, and that the psychological complexity of a concept is closely related to the length of its representation in this language [2, 3, 4]. Following previous researchers [2, 4], we operationalize the psychological complexity of a concept in terms of the ease with which it is learned and remembered. Given these working assumptions, the remaining challenge is to specify the representational resources provided by the language of thought. Some previous studies have relied on propositional logic as a representation language [2, 5], but we believe that the resources of predicate logic are needed to capture the structure of many human concepts. In particular, we suggest that the language of thought can accommodate relations, functions, and quantification, and focus here on the role of quantification. Our primary proposal is that quantification is supported by the language of thought, but that quantification over objects is psychologically more natural than quantification over features. To test this idea we compare concept learning in two domains which are very similar except for one critical difference: one domain allows quantification over objects, and the other allows quantification over features. We consider several logical languages that can be used to formulate concepts in both domains, and find that learning times are best predicted by a language that supports quantification over objects but not features. Our work illustrates how theories of mental representation can be informed by comparing concept learning across two or more domains. Existing studies work with a range of domains, and it is useful to consider a “conceptual universe” that includes these possibilities along with many others that have not yet been studied. Table 1 charts a small fragment of this universe, and the penultimate column shows example stimuli that will be familiar from previous studies of concept learning. Previous studies have made important contributions by choosing a single domain in Table 1 and explaining 1 why some concepts within this domain are easier to learn than others [2, 4, 6, 7, 8, 9]. Comparisons across domains can also provide important information about learning and mental representation, and we illustrate this claim by comparing learning times across Domains 3 and 4. The next section introduces the conceptual universe in Table 1 in more detail. We then present a formal approach to concept learning that relies on a logical language and compare three candidate languages. Language OQ (for object quantification) supports quantification over objects but not features, language F Q (for feature quantification) supports quantification over features but not objects, and language OQ + F Q supports quantification over both objects and features. We use these languages to predict learning times across Domains 3 and 4, and present an experiment which suggests that language OQ comes closest to the language of thought. 1 The conceptual universe Table 1 provides an organizing framework for thinking about the many domains in which learning can occur. The table includes 8 domains, each of which is defined by specifying some number of objects, features, and relations, and by specifying the range of each feature and each relation. We refer to the elements in each domain as items, and the penultimate column of Table 1 shows items from each domain. The first row shows a domain commonly used by studies of Boolean concept learning. Each item in this domain includes a single object a and specifies whether that object has value v1 (small) or v2 (large) on feature F (size), value v3 (white) or v4 (gray) on feature G (color), and value v5 (vertical) or v6 (horizontal) on feature H (texture). Domain 2 also includes three features, but now each item includes three objects and each feature applies to only one of the objects. For example, feature H (texture) applies to only the third object in the domain (i.e. the third square on each card). Domain 3 is similar to Domain 1, but now the three features can be aligned— for any given item each feature will be absent (value 0) or present. The example in Table 1 uses three features (boundary, dots, and slash) that can each be added to an unadorned gray square. Domain 4 is similar to Domain 2, but again the feature values can be aligned, and the feature for each object will be absent (value 0) or present. Domains 5 and 6 are similar to domains 2 and 4 respectively, but each one includes relations rather than features. In Domain 6, for example, the relation R assigns value 0 (absent) or value 1 (present) to each undirected pair of objects. The first six domains in Table 1 are all variants of Domain 1, which is the domain typically used by studies of Boolean concept learning. Focusing on six related domains helps to establish some of the dimensions along which domains can differ, but the final two domains in Table 1 show some of the many alternative possibilities. Domain 7 includes two categorical features, each of which takes three rather than two values. Domain 8 is similar to Domain 6, but now the number of objects is 6 rather than 3 and relation R is directed rather than undirected. To mention just a handful of possibilities which do not appear in Table 1, domains may also have categorical features that are ordered (e.g. a size feature that takes values small, medium, and large), continuous valued features or relations, relations with more than two places, and objects that contain sub-objects or parts. Several learning problems can be formulated within any given domain. The most basic is to learn a single item—for example, a single item from Domain 8 [4]. A second problem is to learn a class of items—for example, a class that includes four of the items in Domain 1 and excludes the remaining four [6]. Learning an item class can be formalized as learning a unary predicate defined over items, and a natural extension is to consider predicates with two or more arguments. For example, problems of the form A is to B as C is to ? can be formulated as problems where the task is to learn a binary relation analogous(·, ·) given the single example analogous(A, B). Here, however, we focus on the task of learning item classes or unary predicates. Since we focus on the role of quantification, we will work with domains where quantification is appropriate. Quantification over objects is natural in cases like Domain 4 where the feature values for all objects can be aligned. Note, for example, that the statement “every object has its feature” picks out the final example item in Domain 4 but that no such statement is possible in Domain 2. Quantification over features is natural in cases like Domain 3 where the ranges of each feature can be aligned. For example, “object a has all three features” picks out the final example item in Domain 3 but no such statement is possible in Domain 1. We therefore focus on Domains 3 and 4, and explore the problem of learning item classes in each domain. 2 3 {a} {a, b, c} {a} {a, b, c} {a, b, c} {a, b, c} {a} {a, b, c, d, e, f } 1 2 3 4 5 6 7 8 R : O × O → {0, 1} — F : O → {v1 , v2 , v3 } G : O → {v4 , v5 , v6 } — R : O × O → {0, 1} R : (a, b) → {v1 , v2 } S : (a, c) → {v3 , v4 } T : (b, c) → {v5 , v6 } — — — — Relations — — Domain specification Features F : O → {v1 , v2 } G : O → {v3 , v4 } H : O → {v5 , v6 } F : a → {v1 , v2 } G : b → {v3 , v4 } H : c → {v5 , v6 } F : O → {0, v1 } G : O → {0, v2 } H : O → {0, v3 } F : a → {0, v1 } G : b → {0, v2 } H : c → {0, v3 } , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , ... , ... , Example Items , , , , , , , , , , , , , ... , [4] [8, 9] [13] [6] [12] [6] [2, 6, 7, 10, 11] Ref. Table 1: The conceptual universe. Eight domains are shown, and each one is defined by a set of objects, a set of features, and a set of relations. We call the members of each domain items, and an item is created by specifying the extension of each feature and relation in the domain. The six domains above the double lines are closely related to the work of Shepard et al. [6]. Each one includes eight items which differ along three dimensions. These dimensions, however, emerge from different underlying representations in the six cases. Objects O # (a) (b) 1 (I) 2 (II) 3 (III) 4 (III) 5 (IV) 6 (IV) 7 (V) 8 (V) 9 (V) 10 (VI) 111 110 101 011 100 010 001 000 Figure 1: (a) A stimulus lattice for domains (e.g. Domains 3, 4, and 6) that can be encoded as a triple of binary values where 0 represents “absent” and 1 represents “present.” (b) If the order of the values in the triple is not significant, there are 10 distinct ways to partition the lattice into two classes of four items. The SHJ type for each partition is shown in parentheses. Domains 3 and 4 both include 8 items each and we will consider classes that include exactly four of these items. Each item in these domains can be represented as a triple of binary values, where 0 indicates that a feature is absent and value 1 indicates that a feature is present. Each triple represents the values of the three features (Domain 3) or the feature values for the three objects (Domain 4). By representing each domain in this way, we have effectively adopted domain specifications that are simplifications of those shown in Table 1. Domain 3 is represented using three features of the form F, G, H : O → {0, 1}, and Domain 4 is represented using a single feature of the form F : O → {0, 1}. Simplifications of this kind are possible because the features in each domain can be aligned—notice that no corresponding simplifications are possible for Domains 1 and 2. The eight binary triples in each domain can be organized into the lattice shown in Figure 1a. Here we consider all ways to partition the vertices of the lattice into two groups of four. If partitions that differ only up to a permutation of the features (Domain 3) or objects (Domain 4) are grouped into equivalence classes, there are ten of these classes, and a representative of each is shown in Figure 1b. Previous researchers [6] have pointed out that the stimuli in Domain 1 can be organized into a cube similar to Figure 1a, and that there are six ways to partition these stimuli into two groups of four up to permutations of the features and permutations of the range of each feature. We refer to these equivalence classes as the six Shepard-Hovland-Jenkins types (or SHJ types), and each partition in Figure 1b is labeled with its corresponding SHJ type label. Note, for example, that partitions 3 and 4 are both examples of SHJ type III. For us, partitions 3 and 4 are distinct since items 000 (all absent) and 111 (all present) are uniquely identifiable, and partition 3 assigns these items to different classes but partition 4 does not. Previous researchers have considered differences between some of the first six domains in Table 1. Shepard et al. [6] ran experiments using compact stimuli (Domain 1) and distributed stimuli (Domains 2 and 4), and observed the same difficulty ranking of the six SHJ types in all cases. Their work, however, does not acknowledge that Domain 4 leads to 10 distinct types rather than 6, and therefore fails to address issues such as the relative complexities of concepts 5 and 6 in Figure 1. Social psychologists [13, 14] have studied Domain 6 and found that learning patterns depart from the standard SHJ order—in particular, that SHJ type VI (Concept 10 in Figure 1) is simpler than types III, IV and V. This finding has been used to support the claim that social learning relies on a domain-specific principle of structural balance [14]. We will see, however, that the relative simplicity of type VI in domains like 4 and 6 is consistent with a domain-general account based on representational economy. 2 A representation length approach to concept learning The conceptual universe in Table 1 calls for an account of learning that can apply across many domains. One candidate is the representation length approach, which proposes that concepts are mentally represented in a language of thought, and that the subjective complexity of a concept is 4 determined by the length of its representation in this language [4]. We consider the case where a concept corresponds to a class of items, and explore the idea that these concepts are mentally represented in a logical language. More formally, a concept is represented as a logical sentence, and the concept includes all models of this sentence, or all items that make the sentence true. The predictions of this representation length approach depend critically on the language chosen. Here we consider three languages—an object quantification language OQ that supports quantification over objects, a feature quantification language F Q that supports quantification over features, and a language OQ + F Q that supports quantification over both objects and features. Language OQ is based on a standard logical language known as predicate logic with equality. The language includes symbols representing objects (e.g. a and b), and features (e.g. F and G) and these symbols can be combined to create literals that indicate that an object does (Fa ) or does not have a certain feature (Fa ′ ). Literals can be combined using two connectives: AND (Fa Ga ) and OR (Fa + Ga ). The language includes two quantifiers—for all (∀) and there exists (∃)—and allows quantification over objects (e.g. ∀x Fx , where x is a variable that ranges over all objects in the domain). Finally, language OQ includes equality and inequality relations (= and =) which can be used to compare objects and object variables (e.g. =xa or =xy ). Table 2 shows several sentences formulated in language OQ. Suppose that the OQ complexity of each sentence is defined as the number of basic propositions it contains, where a basic proposition can be a positive or negative literal (Fa or Fa ′ ) or an equality or inequality statement (=xa or =xy ). Equivalently, the complexity of a sentence is the total number of ANDs plus the total number of ORs plus one. This measure is equivalent by design to Feldman’s [2] notion of Boolean complexity when applied to a sentence without quantification. The complexity values in Table 2 show minimal complexity values for each concept in Domains 3 and 4. Table 2 also shows a single sentence that achieves each of these complexity values, although some concepts admit multiple sentences of minimal complexity. The complexity values in Table 2 were computed using an “enumerate then combine” approach. We began by enumerating a set of sentences according to criteria described in the next paragraph. Each sentence has an extension that specifies which items in the domain are consistent with the sentence. Given the extensions of all sentences generated during the enumeration phase, the combination phase considered all possible ways to combine these extensions using conjunctions or disjunctions. The procedure terminated once extensions corresponding to all of the concepts in the domain had been found. Although the number of possible sentences grows rapidly as the complexity of these sentences increases, the number of extensions is fixed and relatively small (28 for domains of size 8). The combination phase is tractable since sentences with the same extension can be grouped into a single equivalence class. The enumeration phase considered all formulae which had at most two quantifiers and which had a complexity value lower than four. For example, this phase did not include the formula ∃x ∃y ∃z =yz F′ Fy Fz (too many quantifiers) or the formula ∀x ∃y =xy Fy (Fx + Gx + Hx ) (complexity x too high). Despite these restrictions, we believe that the complexity values in Table 2 are identical to the values that would be obtained if we had considered all possible sentences. Language F Q is similar to OQ but allows quantification over features rather than objects. For example, F Q includes the statement ∀Q Qa , where Q is a variable that ranges over all features in the domain. Language F Q also allows features and feature variables to be compared for equality or inequality (e.g. =QF or =QR ). Since F Q and OQ are closely related, it follows that the F Q complexity values for Domains 3 and 4 are identical to the OQ complexity values for Domains 4 and 3. For example, F Q can express concept 5 in Domain 3 as ∀Q ∃R =QR Ra . We can combine OQ and F Q to create a language OQ + F Q that allows quantification over both objects and features. Allowing both kinds of quantification leads to identical complexity values for Domains 3 and 4. Language OQ + F Q can express each of the formulae for Domain 4 in Table 2, and these formulae can be converted into corresponding formulae for Domain 3 by translating each instance of object quantification into an instance of feature quantification. Logicians distinguish between first-order logic, which allows quantification over objects but not predicates, and second-order logic, which allows quantification over objects and predicates. The difference between languages OQ and OQ + F Q is superficially similar to the difference between first-order and second-order logic, but does not cut to the heart of this matter. Since language 5 # 1 Domain 3 Domain 4 C 1 Ga C 1 Fb 2 Fa Ha + Fa Ha 4 Fa Fc + Fa Fc 4 3 Fa ′ Ga + Fa Ha 4 Fa ′ Fb + Fa Fc 4 4 Fa ′ Ga ′ + Fa Ha 4 Fa ′ Fb ′ + Fa Fc 4 5 Ga (Fa + Ha ) + Fa Ha 2 6 7 8 ′ ′ ′ ′ 5 ∀x ∃y =xy Fy ′ 5 ′ ′ 6 Ga (Fa + Ha ) + Fa Ha Ga (Fa + Ha ) + Fa Ga Ha 3 (∀x Fx ) + Fb ∃y Fy ′ ′ ′ (∀x Fx ) + Fb (Fa + Fc ) 4 ′ ′ ′ 6 ′ ′ 6 (∀x Fx ) + Fa (Fb + Fc ) 4 10 (∀x Fx ) + ∃y ∀z Fy (=zy +Fz ′ ) 4 Ha (Fa + Ga ) + Fa Ga Ha 9 Fa (Ga + Ha ) + Fa Ga Ha 10 Ga ′ (Fa Ha ′ + Fa ′ Ha ) + Ga (Fa ′ Ha ′ + Fa Ha ) ′ ′ ′ Fc (Fa + Fb ) + Fa Fb Fc ′ ′ 6 Table 2: Complexity values C and corresponding formulae for language OQ. Boolean complexity predicts complexity values for both domains that are identical to the OQ complexity values shown here for Domain 3. Language F Q predicts complexity values for Domains 3 and 4 that are identical to the OQ values for Domains 4 and 3 respectively. Language OQ + F Q predicts complexity values for both domains that are identical to the OQ complexity values for Domain 4. OQ + F Q only supports quantification over a pre-specified set of features, it is equivalent to a typed first order logic that includes types for objects and features [15]. Future studies, however, can explore the cognitive relevance of higher-order logic as developed by logicians. 3 Experiment Now that we have introduced languages OQ, F Q and OQ + F Q our theoretical proposals can be sharply formulated. We suggest that quantification over objects plays an important role in mental representations, and predict that OQ complexity will account better for human learning than Boolean complexity. We also propose that quantification over objects is more natural than quantification over features, and predict that OQ complexity will account better for human learning than both F Q complexity and OQ + F Q complexity. We tested these predictions by designing an experiment where participants learned concepts from Domains 3 and 4. Method. 20 adults participated for course credit. Each participant was assigned to Domain 3 or Domain 4 and learned all ten concepts from that domain. The items used for each domain were the cards shown in Table 1. Note, for example, that each Domain 3 card showed one square, and that each Domain 4 card showed three squares. These items are based on stimuli developed by Sakamoto and Love [12]. The experiment was carried out using a custom built graphical interface. For each learning problem in each domain, all eight items were simultaneously presented on the screen, and participants were able to drag them around and organize them however they liked. Each problem had three phases. During the learning phase, the four items belonging to the current concept had red boundaries, and the remaining four items had blue boundaries. During the memory phase, these colored boundaries were removed, and participants were asked to sort the items into the red group and the blue group. If they made an error they returned to the learning phase, and could retake the test whenever they were ready. During the description phase, participants were asked to provide a written description of the two groups of cards. The color assignments (red or blue) were randomized across participants— in other words, the “red groups” learned by some participants were identical to the “blue groups” learned by others. The order in which participants learned the 10 concepts was also randomized. Model predictions. The OQ complexity values for the ten concepts in each domain are shown in Table 2 and plotted in Figure 2a. The complexity values in Figure 2a have been normalized so that they sum to one within each domain, and the differences of these normalized scores are shown in the final row of Figure 2a. The two largest bars in the difference plot indicate that Concepts 10 and 5 are predicted to be easier to learn in Domain 4 than in Domain 3. Language OQ can express 6 OQ complexity Domain 3 a) Learning time b) 0.2 0.2 0.1 0.1 0 0 1 2 3 4 5 6 7 8 9 10 Difference Domain 4 0.2 0.2 0.1 1 2 3 4 5 6 7 8 9 10 0.1 0 0 1 2 3 4 5 6 7 8 9 10 0.1 0.05 0 −0.05 1 2 3 4 5 6 7 8 9 10 0.1 0.05 0 −0.05 1 2 3 4 5 6 7 8 9 10 1 2 3 4 5 6 7 8 9 10 Figure 2: Normalized OQ complexity values and normalized learning times for the 10 concepts in Domains 3 and 4. statements like “either 1 or 3 objects have F ” (Concept 10 in Domain 4), or “2 or more objects have F ” (Concept 5 in Domain 4). Since quantification over features is not permitted, however, analogous statements (e.g. “object a has either 1 or 3 features”) cannot be formulated in Domain 3. Concept 10 corresponds to SHJ type VI, which often emerges as the most difficult concept in studies of Boolean concept learning. Our model therefore predicts that the standard ordering of the SHJ types will not apply in Domain 4. Our model also predicts that concepts assigned to the same SHJ type will have different complexities. In Domain 4 the model predicts that Concept 6 will be harder to learn than Concept 5 (both are examples of SHJ type IV), and that Concept 8 will be harder to learn than Concepts 7 or 9 (all three are examples of SHJ type V). Results. The computer interface recorded the amount of time participants spent on the learning phase for each concept. Domain 3 was a little more difficult than Domain 4 overall: on average, Domain 3 participants took 557 seconds and Domain 4 participants took 467 seconds to learn the 10 concepts. For all remaining analyses, we consider learning times that are normalized to sum to 1 for each participant. Figure 2b shows the mean values for these normalized times, and indicates the relative difficulties of the concepts within each condition. The difference plot in Figure 2b supports the two main predictions identified previously. Concepts 10 and 5 are the cases that differ most across the domains, and both concepts are easier to learn in Domain 3 than Domain 4. As predicted, Concept 5 is substantially easier than Concept 6 in Domain 4 even though both correspond to the same SHJ type. Concepts 7 through 9 also correspond to the same SHJ type, and the data for Domain 4 suggest that Concept 8 is the most difficult of the three, although the difference between Concepts 8 and 7 is not especially large. Four sets of complexity predictions are plotted against the human data in Figure 3. Boolean complexity and OQ complexity make identical predictions about Domain 3, and OQ complexity and OQ + F Q complexity make identical predictions about Domain 4. Only OQ complexity, however, accounts for the results observed in both domains. The concept descriptions generated by participants provide additional evidence that there are psychologically important differences between Domains 3 and 4. If the descriptions for concepts 5 and 10 are combined, 18 out of 20 responses in Domain 4 referred to quantification or counting. One representative description of Concept 5 stated that “red has multiple filled” and that “blue has one filled or none.” Only 3 of 20 responses in Domain 3 mentioned quantification. One representative description of Concept 5 stated that “red = multiple features” and that “blue = only one feature.” 7 r=0.84 0.2 r=0.84 0.2 r=0.26 0.2 r=0.26 0.2 Learning time (Domain 3) 0.1 0.1 0 (Domain 4) 0.2 r=0.27 0.2 Learning time 0.1 0.1 0 0.2 r=0.83 0.2 0.1 0.1 0 0.1 0.2 0 0.1 0.2 r=0.27 0.2 0.1 Boolean complexity 0.1 0.1 0.2 OQ complexity 0.1 0.2 r=0.83 0.2 0.1 0 0 0.1 0 0.1 0.2 F Q complexity 0 0.1 0.2 OQ + F Q complexity Figure 3: Normalized learning times for each domain plotted against normalized complexity values predicted by four languages: Boolean logic, OQ, F Q and OQ + F Q. These results suggest that people can count or quantify over features, but that it is psychologically more natural to quantify over objects rather than features. Although we have focused on three specific languages, the results in Figure 2b can be used to evaluate alternative proposals about the language of thought. One such alternative is an extension of Language OQ that allows feature values to be compared for equality. This extended language supports concise representations of Concept 2 in both Domain 3 (Fa = Ha ) and Domain 4 (Fa = Fc ), and predicts that Concept 2 will be easier to learn than all other concepts except Concept 1. Note, however, that this prediction is not compatible with the data in Figure 2b. Other languages might also be considered, but we know of no simple language that will account for our data better than OQ. 4 Conclusion Comparing concept learning across qualitatively different domains can provide valuable information about the nature of mental representation. We compared two domains that that are similar in many respects, but that differ according to whether they include a single object (Domain 3) or multiple objects (Domain 4). Quantification over objects is possible in Domain 4 but not Domain 3, and this difference helps to explain the different learning patterns we observed across the two domains. Our results suggest that concept representations can incorporate quantification, and that quantifying over objects is more natural than quantifying over features. The model predictions we reported are based on a language (OQ) that is a generic version of first order logic with equality. Our results therefore suggest that some of the languages commonly considered by logicians (e.g. first order logic with equality) may indeed capture some aspects of the “laws of thought” [16]. A simple language like OQ offers a convenient way to explore the role of quantification, but this language will need to be refined and extended in order to provide a more accurate account of mental representation. For example, a comprehensive account of the language of thought will need to support quantification over features in some cases, but might be formulated so that quantification over features is typically more costly than quantification over objects. Many possible representation languages can be imagined and a large amount of empirical data will be needed to identify the language that comes closest to the language of thought. Many relevant studies have already been conducted [2, 6, 8, 9, 13, 17], but there are vast regions of the conceptual universe (Table 1) that remain to be explored. Navigating this universe is likely to involve several challenges, but web-based experiments [18, 19] may allow it to be explored at a depth and scale that are currently unprecedented. Characterizing the language of thought is undoubtedly a long term project, but modern methods of data collection may support rapid progress towards this goal. Acknowledgments I thank Maureen Satyshur for running the experiment. This work was supported in part by NSF grant CDI-0835797. 8 References [1] J. A. Fodor. The language of thought. Harvard University Press, Cambridge, 1975. [2] J. Feldman. Minimization of Boolean complexity in human concept learning. Nature, 407: 630–633, 2000. [3] D. Fass and J. Feldman. Categorization under complexity: A unified MDL account of human learning of regular and irregular categories. In S. Thrun S. Becker and K. Obermayer, editors, Advances in Neural Information Processing Systems 15, pages 35–34. MIT Press, Cambridge, MA, 2003. [4] C. Kemp, N. D. Goodman, and J. B. Tenenbaum. Learning and using relational theories. In J.C. Platt, D. Koller, Y. Singer, and S. Roweis, editors, Advances in Neural Information Processing Systems 20, pages 753–760. MIT Press, Cambridge, MA, 2008. [5] N. D. Goodman, J. B. Tenenbaum, J. Feldman, and T. L. Griffiths. A rational analysis of rule-based concept learning. Cognitive Science, 32(1):108–154, 2008. [6] R. N. Shepard, C. I. Hovland, and H. M. Jenkins. Learning and memorization of classifications. Psychological Monographs, 75(13), 1961. Whole No. 517. [7] R. M. Nosofsky, M. Gluck, T. J. Palmeri, S. C. McKinley, and P. Glauthier. Comparing models of rule-based classification learning: A replication and extension of Shepard, Hovland, and Jenkins (1961). Memory and Cognition, 22:352–369, 1994. [8] M. D. Lee and D. J. Navarro. Extending the ALCOVE model of category learning to featural stimulus domains. Psychonomic Bulletin and Review, 9(1):43–58, 2002. [9] C. D. Aitkin and J. Feldman. Subjective complexity of categories defined over three-valued features. In R. Sun and N. Miyake, editors, Proceedings of the 28th Annual Conference of the Cognitive Science Society, pages 961–966. Psychology Press, New York, 2006. [10] F. Mathy and J. Bradmetz. A theory of the graceful complexification of concepts and their learnability. Current Psychology of Cognition, 22(1):41–82, 2004. [11] R. Vigo. A note on the complexity of Boolean concepts. Journal of Mathematical Psychology, 50:501–510, 2006. [12] Y. Sakamoto and B. C. Love. Schematic influences on category learning and recognition memory. Journal of Experimental Psychology: General, 133(4):534–553, 2004. [13] W. H. Crockett. Balance, agreement and positivity in the cognition of small social structures. In Advances in Experimental Social Psychology, Vol 15, pages 1–57. Academic Press, 1982. [14] N. B. Cottrell. Heider’s structural balance principle as a conceptual rule. Journal of Personality and Social Psychology, 31(4):713–720, 1975. [15] H. B. Enderton. A mathematical introduction to logic. Academic Press, New York, 1972. [16] G. Boole. An investigation of the laws of thought on which are founded the mathematical theories of logic and probabilities. 1854. [17] B. C. Love and A. B. Markman. The nonindependence of stimulus properties in human category learning. Memory and Cognition, 31(5):790–799, 2003. [18] L. von Ahn. Games with a purpose. Computer, 39(6):92–94, 2006. [19] R. Snow, B. O’Connor, D. Jurafsky, and A. Ng. Cheap and fast–but is it good? Evaluating non-expert annotations for natural language tasks. In Proceedings of the 2008 Conference on empirical methods in natural language processing, pages 254–263. Association for Computational Linguistics, 2008. 9

4 0.47091472 242 nips-2009-The Infinite Partially Observable Markov Decision Process

Author: Finale Doshi-velez

Abstract: The Partially Observable Markov Decision Process (POMDP) framework has proven useful in planning domains where agents must balance actions that provide knowledge and actions that provide reward. Unfortunately, most POMDPs are complex structures with a large number of parameters. In many real-world problems, both the structure and the parameters are difficult to specify from domain knowledge alone. Recent work in Bayesian reinforcement learning has made headway in learning POMDP models; however, this work has largely focused on learning the parameters of the POMDP model. We define an infinite POMDP (iPOMDP) model that does not require knowledge of the size of the state space; instead, it assumes that the number of visited states will grow as the agent explores its world and only models visited states explicitly. We demonstrate the iPOMDP on several standard problems. 1

5 0.46330214 60 nips-2009-Convergent Temporal-Difference Learning with Arbitrary Smooth Function Approximation

Author: Shalabh Bhatnagar, Doina Precup, David Silver, Richard S. Sutton, Hamid R. Maei, Csaba Szepesvári

Abstract: We introduce the first temporal-difference learning algorithms that converge with smooth value function approximators, such as neural networks. Conventional temporal-difference (TD) methods, such as TD(λ), Q-learning and Sarsa have been used successfully with function approximation in many applications. However, it is well known that off-policy sampling, as well as nonlinear function approximation, can cause these algorithms to become unstable (i.e., the parameters of the approximator may diverge). Sutton et al. (2009a, 2009b) solved the problem of off-policy learning with linear TD algorithms by introducing a new objective function, related to the Bellman error, and algorithms that perform stochastic gradient-descent on this function. These methods can be viewed as natural generalizations to previous TD methods, as they converge to the same limit points when used with linear function approximation methods. We generalize this work to nonlinear function approximation. We present a Bellman error objective function and two gradient-descent TD algorithms that optimize it. We prove the asymptotic almost-sure convergence of both algorithms, for any finite Markov decision process and any smooth value function approximator, to a locally optimal solution. The algorithms are incremental and the computational complexity per time step scales linearly with the number of parameters of the approximator. Empirical results obtained in the game of Go demonstrate the algorithms’ effectiveness. 1

6 0.46140313 66 nips-2009-Differential Use of Implicit Negative Evidence in Generative and Discriminative Language Learning

7 0.45208597 116 nips-2009-Information-theoretic lower bounds on the oracle complexity of convex optimization

8 0.45091635 64 nips-2009-Data-driven calibration of linear estimators with minimal penalties

9 0.43843934 33 nips-2009-Analysis of SVM with Indefinite Kernels

10 0.4328317 159 nips-2009-Multi-Step Dyna Planning for Policy Evaluation and Control

11 0.41890836 250 nips-2009-Training Factor Graphs with Reinforcement Learning for Efficient MAP Inference

12 0.41381347 133 nips-2009-Learning models of object structure

13 0.41239139 113 nips-2009-Improving Existing Fault Recovery Policies

14 0.40666541 218 nips-2009-Skill Discovery in Continuous Reinforcement Learning Domains using Skill Chaining

15 0.40542856 215 nips-2009-Sensitivity analysis in HMMs with application to likelihood maximization

16 0.3981024 145 nips-2009-Manifold Embeddings for Model-Based Reinforcement Learning under Partial Observability

17 0.39270207 18 nips-2009-A Stochastic approximation method for inference in probabilistic graphical models

18 0.3915391 107 nips-2009-Help or Hinder: Bayesian Models of Social Goal Inference

19 0.39070195 12 nips-2009-A Generalized Natural Actor-Critic Algorithm

20 0.38936859 30 nips-2009-An Integer Projected Fixed Point Method for Graph Matching and MAP Inference