nips nips2009 nips2009-242 knowledge-graph by maker-knowledge-mining

242 nips-2009-The Infinite Partially Observable Markov Decision Process


Source: pdf

Author: Finale Doshi-velez

Abstract: The Partially Observable Markov Decision Process (POMDP) framework has proven useful in planning domains where agents must balance actions that provide knowledge and actions that provide reward. Unfortunately, most POMDPs are complex structures with a large number of parameters. In many real-world problems, both the structure and the parameters are difficult to specify from domain knowledge alone. Recent work in Bayesian reinforcement learning has made headway in learning POMDP models; however, this work has largely focused on learning the parameters of the POMDP model. We define an infinite POMDP (iPOMDP) model that does not require knowledge of the size of the state space; instead, it assumes that the number of visited states will grow as the agent explores its world and only models visited states explicitly. We demonstrate the iPOMDP on several standard problems. 1

Reference: text


Summary: the most important sentences generated by the tfidf model

sentIndex sentText sentNum sentScore

1 The Partially Observable Markov Decision Process (POMDP) framework has proven useful in planning domains where agents must balance actions that provide knowledge and actions that provide reward. [sent-3, score-0.383]

2 We define an infinite POMDP (iPOMDP) model that does not require knowledge of the size of the state space; instead, it assumes that the number of visited states will grow as the agent explores its world and only models visited states explicitly. [sent-7, score-1.199]

3 Current methods in reinforcement learning (RL) focus on learning the parameters online, that is, while the agent is acting in its environment. [sent-11, score-0.569]

4 Bayesian RL [1, 2, 3] has recently received attention because it allows the agent to reason both about uncertainty in its model of the environment and uncertainty within the environment itself. [sent-12, score-0.659]

5 All of these approaches provide the agent with the size of the underlying state space and focus on learning the transition and observation dynamics for each state. [sent-15, score-0.728]

6 Even when the size of the state space is known, however, just making the agent reason about a large number of unknown parameters at the beginning of the learning process is fraught with difficulties. [sent-16, score-0.601]

7 The agent has insufficient experience to fit a large number of parameters, and therefore much of the model will be highly uncertain. [sent-17, score-0.529]

8 Trying to plan under vast model uncertainty often requires significant computational resources; moreover, the computations are often wasted effort when the agent has very little data. [sent-18, score-0.53]

9 We propose a nonparametric approach to modelling the structure of the underlying space—specifically, the number of states in the agent’s world—which allows the agent to start with a simple model and grow it with experience. [sent-21, score-0.695]

10 The agent is expected to stay in a local region; however, as time passes, it may explore states that it has not visited before. [sent-23, score-0.769]

11 Initially, the agent will infer simple, local models of the environment corresponding to its limited experience (also conducive to fast planning). [sent-24, score-0.612]

12 Finally, a data-driven approach to structure discovery allows the agent to agglomerate states with identical dynamics (see section 4 for a toy example). [sent-26, score-0.721]

13 The transition function T(s′|s, a) defines the distribution over next-states s′ to which the agent may transition after taking action a from state s. [sent-29, score-0.971]

14 The observation function Ω(o|s′, a) is a distribution over observations o that may occur in state s′ after taking action a. [sent-30, score-0.38]

15 The reward function R(s, a) specifies the immediate reward for each state-action pair (see figure 1 for a slice of the graphical model). [sent-31, score-0.537]

16 We focus on discrete state and observation spaces (generalising to continuous observations is straightforward) and finite action spaces. [sent-33, score-0.381]
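As a concrete illustration of the tuple just described, here is a minimal sketch of a discrete POMDP container. The class name, array layouts (T[s, a, s'], O[s', a, o], R[s, a]), and default discount are illustrative assumptions, not notation from the paper.

```python
import numpy as np

class DiscretePOMDP:
    """Minimal discrete POMDP container: transition, observation, and reward arrays."""

    def __init__(self, T, O, R, discount=0.95):
        self.T = np.asarray(T)   # T[s, a, s'] = T(s'|s, a)
        self.O = np.asarray(O)   # O[s', a, o] = Omega(o|s', a)
        self.R = np.asarray(R)   # R[s, a] = expected immediate reward
        self.discount = discount

    def step(self, s, a, rng=None):
        """Sample one environment step: next state, observation, and reward."""
        rng = rng or np.random.default_rng()
        n_states = self.T.shape[0]
        s_next = rng.choice(n_states, p=self.T[s, a])
        o = rng.choice(self.O.shape[2], p=self.O[s_next, a])
        return s_next, o, self.R[s, a]
```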

17 The Infinite Hidden Markov Model A standard hidden Markov model (HMM) consists of the n-tuple {S, O, T, Ω}, where the transition T(s′|s) and observation Ω(o|s′) distributions only depend on the hidden state. [sent-37, score-0.283]

18 When the number of hidden states is finite and discrete, Dirichlet distributions may be used as priors over the transition and observation distributions. [sent-38, score-0.428]
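A minimal sketch of this finite-state case: each row of the transition and observation matrices is drawn from a symmetric Dirichlet prior. The function name and the hyperparameters alpha and beta are illustrative choices.

```python
import numpy as np

def sample_finite_hmm(n_states, n_obs, alpha=1.0, beta=1.0, rng=None):
    """Draw a finite HMM {T, Omega} with symmetric Dirichlet priors on each row."""
    rng = rng or np.random.default_rng()
    T = rng.dirichlet(alpha * np.ones(n_states), size=n_states)   # T[s, s'] = T(s'|s)
    Omega = rng.dirichlet(beta * np.ones(n_obs), size=n_states)   # Omega[s', o] = Omega(o|s')
    return T, Omega
```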

19 The iHMM [9] uses a hierarchical Dirichlet Process (HDP) to define a prior over HMMs where the number of underlying states is unbounded. [sent-39, score-0.196]

20 The second step uses these overall state popularities to define individual state transition distributions. [sent-49, score-0.305]

21 While the number of states is unbounded, the mean transition probability T̄(s) decreases exponentially with s, meaning that “later” states are less popular. [sent-52, score-0.392]

22 The second step of the iHMM construction involves defining the transition distributions T(·|s) ∼ DP(α, T̄) for each state s, where α, the concentration parameter for the DP, determines how closely the sampled distribution T(·|s) matches the mean transition distribution T̄. [sent-55, score-0.36]

23 Because T̄ puts higher probabilities on states with smaller indices, T(s′|s) will also generally put more mass on earlier s′ (see lower rows of figure 2). [sent-56, score-0.196]

24 Thus, the generating process encodes a notion that the agent will spend most of its time in some local region. [sent-57, score-0.499]

25 However, the longer the agent acts in this infinite space, the more likely it is to transition to somewhere new. [sent-58, score-0.6]
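A truncated sketch of this two-step construction, assuming the prior mean T̄ follows a stick-breaking (GEM) construction and approximating each DP(α, T̄) row by a finite Dirichlet draw; the parameters gamma, alpha, and the truncation level are illustrative.

```python
import numpy as np

def sample_ihmm_transitions(gamma, alpha, truncation=50, rng=None):
    """Two-step iHMM prior, truncated: T_bar over states, then T(.|s) ~ DP(alpha, T_bar)."""
    rng = rng or np.random.default_rng()
    # Step 1: overall state popularities; later sticks get exponentially less mass.
    sticks = rng.beta(1.0, gamma, size=truncation)
    leftover = np.concatenate(([1.0], np.cumprod(1.0 - sticks)[:-1]))
    T_bar = sticks * leftover
    T_bar /= T_bar.sum()                                 # renormalise the truncated weights
    # Step 2: each state's outgoing distribution concentrates around T_bar,
    # so most mass stays on the "early", popular states.
    T = rng.dirichlet(alpha * T_bar, size=truncation)    # T[s, s'] = T(s'|s)
    return T_bar, T
```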

26 2 Infinite POMDPs To extend the iHMM framework to iPOMDPs, we must incorporate actions and rewards into the generative model. [sent-61, score-0.231]

27 To incorporate actions, we draw an observation distribution Ω(·|s, a) ∼ H for each action a and each state s. [sent-62, score-0.344]

28 For this work, we assume that the set of possible reward values is given, and we use a multinomial distribution to describe the probability R(r|s, a) of observing reward r after taking action a in state s. [sent-66, score-0.77]

29 As with the observations, the reward distributions R are drawn from a Dirichlet distribution H_R. [sent-67, score-0.277]

30 We use multinomial distributions for convenience; however, other reward distributions (such as Gaussians) are easily incorporated into this framework. [sent-68, score-0.304]
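A minimal sketch of this multinomial reward model over a given set of reward values, with each R(·|s, a) drawn from a Dirichlet prior H_R; the function names and the small pseudo-count h_r (which favours peaked reward distributions) are illustrative assumptions.

```python
import numpy as np

def sample_reward_model(reward_values, n_states, n_actions, h_r=0.1, rng=None):
    """Draw R[s, a, i] = probability of observing reward_values[i] after action a in state s."""
    rng = rng or np.random.default_rng()
    k = len(reward_values)
    return rng.dirichlet(h_r * np.ones(k), size=(n_states, n_actions))

def sample_reward(R, reward_values, s, a, rng=None):
    """Draw one reward observation from R(.|s, a)."""
    rng = rng or np.random.default_rng()
    return reward_values[rng.choice(len(reward_values), p=R[s, a])]
```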

31 Next, for each state s and action a, we sample T(·|s, a) ∼ DP(α, T̄), Ω(·|s, a) ∼ H, and R(·|s, a) ∼ H_R. [sent-85, score-0.245]

32 During a finite lifetime the agent can only visit a finite number of states, and thus the agent can only make inferences about a finite number of states. [sent-87, score-0.998]

33 The remaining (infinite) states are equivalent from the agent’s perspective, as, in expectation, these states will exhibit the mean dynamics of the prior. [sent-88, score-0.418]

34 Thus, the only parts of the infinite model that need to be initialised are those corresponding to the states the agent has visited as well as a catch-all state representing all other states. [sent-89, score-0.871]
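A sketch of this lazy instantiation: transition rows are sampled only for the visited states plus one catch-all state whose prior mass is the leftover tail of T̄. The function name and the way the tail is collapsed onto a single entry are illustrative assumptions.

```python
import numpy as np

def instantiate_visited_model(T_bar, alpha, n_visited, n_actions, rng=None):
    """Sample transition rows for n_visited states plus a single catch-all state."""
    rng = rng or np.random.default_rng()
    k = n_visited + 1                                        # visited states + catch-all
    # Collapse the prior mean onto the visited states; the tail mass becomes the catch-all.
    base = np.append(T_bar[:n_visited], T_bar[n_visited:].sum())
    return rng.dirichlet(alpha * base, size=(k, n_actions))  # T[s, a, s']
```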

35 In reality, of course, the agent does not know the states it has visited: we discuss joint inference over the unknown state history and the model in section 3. [sent-90, score-0.833]

36 3 Planning As in the standard Bayesian RL framework, we recast the problem of POMDP learning as planning in a larger ‘model-uncertainty’ POMDP in which both the true model and the true state are unknown. [sent-92, score-0.285]

37 We outline below our procedure for planning in this joint space of POMDP models and unknown states and then detail each step—belief monitoring and action-selection—in sections 3. [sent-93, score-0.389]

38 Because the true state is hidden, the agent must choose its actions based only on past actions and observations. [sent-96, score-0.871]

39 Normally the best action to take at time t depends on the entire history of actions and observations that the agent has taken so far. [sent-97, score-0.852]

40 However, it is intractable to express the joint belief b over models and states with a closed-form expression. [sent-102, score-0.413]

41 We approximate the belief b with a set of sampled models m = {T, Ω, R}, each with weight w(m). [sent-103, score-0.217]

42 Each model sample m maintains a belief over states bm (s). [sent-104, score-0.466]

43 The states are discrete, and thus the belief bm (s) can be updated using equation 1. [sent-105, score-0.466]
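Equation 1 itself is not reproduced in this extract; the sketch below implements the standard discrete belief update it refers to, assuming the array layouts used earlier.

```python
import numpy as np

def belief_update(b, a, o, T, O):
    """Within one sampled model m, update b_m after action a and observation o:
    b'(s') is proportional to Omega(o|s', a) * sum_s T(s'|s, a) b(s)."""
    predicted = b @ T[:, a, :]            # predictive distribution over next states
    updated = O[:, a, o] * predicted      # weight by the observation likelihood
    return updated / updated.sum()        # normalise (assumes the evidence is nonzero)
```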

44 Given the belief, the agent must choose which action to take next. [sent-108, score-0.642]

45 One approach is to solve the planning problem offline, that is, determine a good action for every possible belief. [sent-109, score-0.254]

46 While we might hope to solve equation 3 over the state space of a single model, it is intractable to solve over the joint space of states and infinite models—the model space is so large that standard point-based approximations will generally fail. [sent-112, score-0.298]

47 3.1 Belief Monitoring As outlined in section 3, we approximate the joint belief over states and models through a set of samples. [sent-119, score-0.413]

48 To simplify matters, we assume that given a model m, it is tractable to maintain a closed-form belief bm (s) over states using equation 1. [sent-122, score-0.466]

49 Suppose we have a set of models m that have been drawn from the belief at time t. [sent-124, score-0.217]

50 To get a set of models drawn from the belief at time t+1, we can either draw the models directly from the new belief or adjust the weights on the model set at time t so that they now provide an accurate representation of the belief at time t + 1. [sent-125, score-0.663]

51 The advantage of simply reweighting the samples is that the belief update is extremely fast. [sent-127, score-0.215]

52 We adapt this approach to allow for observations with different temporal shifts (since the reward r_t depends on the state s_t, whereas the observation o_t is conditioned on the state s_{t+1}) and for transitions indexed by both the current state and the most recent action. [sent-131, score-0.767]
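A minimal sketch of this reweighting step, assuming the array layouts used earlier and omitting the reward-likelihood term for brevity; the function name is illustrative.

```python
import numpy as np

def reweight_models(models, weights, beliefs, a, o):
    """After action a and observation o, rescale each model's weight by the likelihood
    Pr(o | m, b_m, a) and update its per-model belief; `models` holds (T, O) array pairs."""
    new_weights, new_beliefs = [], []
    for (T, O), w, b in zip(models, weights, beliefs):
        predicted = b @ T[:, a, :]                    # predictive distribution over s'
        likelihood = float(O[:, a, o] @ predicted)    # Pr(o | m, b_m, a)
        new_weights.append(w * likelihood)
        new_beliefs.append(O[:, a, o] * predicted / max(likelihood, 1e-12))
    total = sum(new_weights)
    return [w / total for w in new_weights], new_beliefs
```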

53 The inference alternates between three phases. (We use the words posterior and belief interchangeably; both refer to the probability distribution over the hidden state given some initial belief (or prior) and the history of actions and observations.) [sent-135, score-0.672]

54 Given a transition model T and a state trajectory {s_1, s_2, . . . [sent-137, score-0.203]

55 Given a trajectory over hidden states, transition, observation, and reward distributions are sampled for the visited states (it only makes sense to sample distributions for visited states, as we do not have information about unvisited states). [sent-147, score-0.699]
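For the first of these phases—resampling the hidden-state trajectory given a fixed model—here is a simplified forward-filtering backward-sampling (FFBS) sketch over a finite instantiated model; the beam-sampling machinery needed for the unbounded case is omitted, and the function name and array layouts are the illustrative ones used earlier.

```python
import numpy as np

def ffbs_trajectory(T, O, actions, observations, b0, rng=None):
    """Draw a state trajectory s_0, ..., s_n from its posterior given a fixed model
    (T[s, a, s'], O[s', a, o]) and the action/observation history."""
    rng = rng or np.random.default_rng()
    n, n_states = len(actions), len(b0)
    msgs = [np.asarray(b0, dtype=float)]                 # filtering distribution over s_t
    for t in range(n):                                   # forward pass
        pred = msgs[t] @ T[:, actions[t], :]
        msg = O[:, actions[t], observations[t]] * pred
        msgs.append(msg / msg.sum())
    traj = [rng.choice(n_states, p=msgs[n])]             # sample s_n from the final filter
    for t in range(n - 1, -1, -1):                       # backward pass: s_t given s_{t+1}
        back = msgs[t] * T[:, actions[t], traj[-1]]
        traj.append(rng.choice(n_states, p=back / back.sum()))
    return traj[::-1]
```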

56 In this finite setting, we can resample the transitions T(·|s, a) using standard Dirichlet posteriors: T(·|s, a) ∼ Dirichlet(T̄_1 + n_1^{sa}, T̄_2 + n_2^{sa}, . . . , T̄_k + n_k^{sa}, Σ_{i=k+1}^∞ T̄_i), (5) where k is the number of active or used states, T̄_i is the prior probability of transitioning to state i from state s after taking action a, and n_i^{sa} is the number of observed transitions to state i from s after a. [sent-148, score-0.364] [sent-151, score-0.634]

58 The observations and rewards are resampled in a similar manner: for example, if the observations are discrete with Dirichlet priors: Ω(·|s, a) ∼ Dirichlet(H_1 + n_{o_1}^{sa}, H_2 + n_{o_2}^{sa}, . . . [sent-152, score-0.39]
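A sketch of these conjugate resampling steps on a truncated prior; the function names and the way the remaining tail mass of T̄ is folded into a single catch-all entry are illustrative choices.

```python
import numpy as np

def resample_transition_row(T_bar, counts_sa, k, rng=None):
    """Equation (5): resample T(.|s, a) over the k used states plus one catch-all entry,
    adding the observed transition counts n_i^{sa} to the prior weights."""
    rng = rng or np.random.default_rng()
    counts = np.asarray(counts_sa, dtype=float)
    params = np.append(T_bar[:k] + counts[:k], T_bar[k:].sum())
    return rng.dirichlet(params)

def resample_observation_row(H, obs_counts_sa, rng=None):
    """The analogous observation update: Omega(.|s, a) ~ Dirichlet(H + n_o^{sa})."""
    rng = rng or np.random.default_rng()
    return rng.dirichlet(np.asarray(H, dtype=float) + np.asarray(obs_counts_sa, dtype=float))
```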

59 The representation of the belief is necessarily approximate due to our use of samples, but the samples are drawn from the true current belief—no other approximations have been made. [sent-158, score-0.251]

60 Starting from the agent’s current belief, the tree branches on each action the agent might take and each observation the agent might see. [sent-164, score-1.218]

61 At each action node, the agent computes its expected immediate reward R(a) = E_m[E_{s|m}[R(·|s, a)]]. [sent-165, score-0.892]

62 From equation 3, the value of taking action a in belief b is Q(a, b) = R(a, b) + γ Σ_o Ω(o|b, a) max_{a′} Q(a′, b^{ao}), (7) where b^{ao} is the agent’s belief after taking action a and seeing observation o from belief b. [sent-166, score-1.03]

63 Because action selection must be completed online, we use equation 4 to update the belief over models via the weights w(m). [sent-167, score-0.36]

64 For each model m in the leaves, we can compute the value Q(a, b_m, m) of the action a approximately. (For an introduction to slice sampling, refer to [17].) [sent-173, score-0.267]

65 We approximate the value of action a as Q(a, b) ≈ Σ_m w(m) Q(a, b_m, m). [sent-175, score-0.23]
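A sketch combining the recursion in equation (7) with this weighted average over sampled models; the search depth, discount gamma, function names, and the use of an expected-reward array R[s, a] (rather than the full multinomial reward distribution) are illustrative assumptions.

```python
import numpy as np

def q_forward_search(model, b, depth, gamma, n_obs, n_actions):
    """Per-model forward search: Q(a, b_m, m) for every action a under one sampled
    model m = (T, O, R), expanding the action/observation tree to the given depth."""
    T, O, R = model
    q = np.zeros(n_actions)
    for a in range(n_actions):
        q[a] = float(b @ R[:, a])                       # expected immediate reward under b_m
        if depth == 0:
            continue
        pred = b @ T[:, a, :]                           # predictive distribution over s'
        for o in range(n_obs):
            p_o = float(O[:, a, o] @ pred)              # Omega(o | b, a)
            if p_o < 1e-12:
                continue
            b_ao = O[:, a, o] * pred / p_o              # posterior belief b^{ao}
            q[a] += gamma * p_o * np.max(
                q_forward_search(model, b_ao, depth - 1, gamma, n_obs, n_actions))
    return q

def q_over_models(models, weights, beliefs, depth, gamma, n_obs, n_actions):
    """Approximate Q(a, b) by sum_m w(m) Q(a, b_m, m), as in the text."""
    return sum(w * q_forward_search(m, b, depth, gamma, n_obs, n_actions)
               for w, m, b in zip(weights, models, beliefs))
```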

66 The quality of the action selection largely follows from the bounds presented in [20] for planning through forward search. [sent-178, score-0.297]

67 The key difference is that now our belief representation is particle-based; during the forward search we approximate the expected rewards over all possible models with rewards from the particles in our set. [sent-179, score-0.488]

68 Thus, we can apply the central limit theorem to state that the estimated expected rewards will be distributed around the true expectation with approximately normal noise N(0, σ²/n), where n is the number of POMDP samples and σ² is a problem-specific variance. [sent-181, score-0.284]

69 0 Dirichlet counts per element), and rewards were given hyperparameters that encouraged peaked distributions (0. [sent-184, score-0.197]

70 The small counts on the reward hyperparameters encoded the prior belief that R(·|s, a) is highly peaked, that is, each state-action pair will likely have one associated reward value. [sent-186, score-0.708]

71 Figure 3: Various comparisons of the lineworld and loopworld models: (a) cartoon of models; (b) evolution of state size; (c) performance. [sent-194, score-0.346]

72 We designed a pair of simple environments to show how the iPOMDP infers states only as it can distinguish them. [sent-197, score-0.285]

73 The first, lineworld, was a length-six corridor in which the agent could either travel left or right. [sent-198, score-0.719]

74 Loopworld consisted of a corridor with a series of loops (see figure 3(a)); now the agent could travel through the upper or lower branches. [sent-199, score-0.597]

75 Observations were correct with probability 0.85 (that is, 15% of the time the agent saw an incorrect observation). [sent-206, score-0.499]

76 The agent started at the left end of the corridor and received a reward of -1 until it reached the opposite end (reward 10). [sent-207, score-0.819]

77 The agent eventually infers that the lineworld environment consists of six states—based on the number of steps it requires to reach the goal—although in the early stages of learning it infers distinct states only for the ends of the corridor and groups the middle region as one state. [sent-208, score-1.052]

78 The loopworld agent also shows a growth in the number of states over time (see figure 3(b)), but it never infers separate states for the identical upper and lower branches. [sent-209, score-1.071]

79 By inferring states as they were needed to explain its observations—instead of relying on a prespecified number of states—the agent avoided the need to consider irrelevant structure in the environment. [sent-210, score-0.695]

80 Figure 3(c) shows that the agent (unsurprisingly) learns optimal performance in both environments. [sent-211, score-0.499]

81 In the tiger-3 domain, a variant of the tiger problem [19], the agent had to choose one of three doors to open. [sent-214, score-0.579]

82 Two doors had tigers behind them (r = −100) and one door had a small reward (r = 10). [sent-215, score-0.389]

83 At each time step, the agent could either open a door or listen for the “quiet” door. [sent-216, score-0.558]

84 The reward was unlikely to be behind the third door (p = 0.2). [sent-219, score-0.337]

85 During the first 100 episodes, however, we artificially ensured that the reward was always behind doors 1 or 2. [sent-220, score-0.33]

86 The improving rewards in figure 4 show the agent steadily learning the dynamics of its world; it learned never to open door 3. [sent-221, score-0.698]

87 The dip in figure 4 following episode 100 occurs when we next allowed the reward to be behind all three doors, but the agent quickly adapts to the new possible state of its environment. [sent-222, score-0.923]

88 The iPOMDP enabled the agent to first adapt quickly to its simplified environment but add complexity when it was needed. [sent-223, score-0.548]

89 Tests had 200 episodes of learning, which interleaved acting and re-sampling models, and 100 episodes of testing with the model. (Figure 4: Evolution of reward from tiger-3.) [sent-226, score-0.39]

90 For situations where the number of states is not known, the last case is particularly interesting—we show that simply overestimating the number of states is not necessarily the most efficient solution. [sent-233, score-0.392]

91 We see that the iPOMDP often infers a smaller number of states than the true count, ignoring distinctions that the history does not support. [sent-235, score-0.378]

92 5 Discussion Recent work in learning POMDP models includes [23], which uses a set of Gaussian approximations to allow for analytic value function updates in the POMDP space, and [5], which jointly reasons over the space of Dirichlet parameters and states when planning in discrete POMDPs. [sent-241, score-0.367]

93 During training (left), we see that the agent makes fewer mistakes toward the end of the period. [sent-243, score-0.499]

94 The boxplots on the right show rewards for 100 trials after learning has stopped; we see the iPOMDP-agent’s reward distribution over these 100 trials is almost identical to that of an agent who had access to the correct model. [sent-244, score-0.863]

95 The iPOMDP agent (FFBS-Inf) often performs nearly as well as the agents who had knowledge of the true number of states (EMtrue, FFBS-true), learning the necessary number of states much faster than an agent for which we overestimate the number of states (FFBS-big). [sent-246, score-1.688]

96 All of these approaches assume that the number of underlying states is known; all but [7] focus on learning only the transition and observation models. [sent-269, score-0.35]

97 By giving the agent an unbounded state space—but strong locality priors—the iPOMDP provides one principled framework for learning POMDP structure. [sent-275, score-0.638]

98 The iPOMDP provides a principled framework for an agent to posit more complex models of its world as it gains more experience. [sent-278, score-0.557]

99 By linking the complexity of the model to the agent’s experience, the agent is not forced to consider large uncertainties—which can be computationally prohibitive—near the beginning of the planning process, but it can later come up with accurate models of the world when it requires them. [sent-279, score-0.668]

100 Singh, “Approximate planning for factored POMDPs using belief state simplification,” in UAI 15, 1999. [sent-388, score-0.396]


similar papers computed by tfidf model

tfidf for this paper:

wordName wordTfidf (topN-words)

[('agent', 0.499), ('ipomdp', 0.367), ('pomdp', 0.323), ('reward', 0.25), ('states', 0.196), ('belief', 0.183), ('pomdps', 0.178), ('action', 0.143), ('ihmm', 0.122), ('lineworld', 0.122), ('loopworld', 0.122), ('actions', 0.117), ('rewards', 0.114), ('planning', 0.111), ('state', 0.102), ('transition', 0.101), ('bm', 0.087), ('dirichlet', 0.084), ('pineau', 0.079), ('beam', 0.074), ('visited', 0.074), ('reinforcement', 0.07), ('corridor', 0.07), ('ffbs', 0.07), ('sa', 0.068), ('observable', 0.064), ('door', 0.059), ('infers', 0.058), ('observations', 0.057), ('nsa', 0.056), ('observation', 0.053), ('doors', 0.052), ('vlassis', 0.052), ('hidden', 0.051), ('dp', 0.049), ('environment', 0.049), ('transitions', 0.048), ('episodes', 0.047), ('sampling', 0.046), ('bao', 0.046), ('dialog', 0.046), ('shuttle', 0.046), ('hr', 0.046), ('draw', 0.046), ('nite', 0.045), ('sampler', 0.045), ('rl', 0.044), ('episode', 0.044), ('forward', 0.043), ('ross', 0.042), ('evolution', 0.039), ('partially', 0.038), ('agents', 0.038), ('unbounded', 0.037), ('slice', 0.037), ('true', 0.036), ('history', 0.036), ('tisa', 0.035), ('models', 0.034), ('markov', 0.034), ('samples', 0.032), ('stick', 0.032), ('uncertainty', 0.031), ('environments', 0.031), ('peaked', 0.031), ('count', 0.031), ('poupart', 0.031), ('experience', 0.03), ('gure', 0.029), ('concentration', 0.029), ('behind', 0.028), ('prespeci', 0.028), ('tiger', 0.028), ('overestimate', 0.028), ('doshi', 0.028), ('travel', 0.028), ('bt', 0.028), ('distributions', 0.027), ('st', 0.027), ('ot', 0.026), ('distinctions', 0.026), ('cartoon', 0.026), ('finale', 0.026), ('dynamics', 0.026), ('discrete', 0.026), ('ignoring', 0.026), ('bayesian', 0.025), ('taking', 0.025), ('expert', 0.025), ('ine', 0.025), ('counts', 0.025), ('tree', 0.024), ('world', 0.024), ('outline', 0.024), ('uai', 0.024), ('stochastically', 0.024), ('percentile', 0.024), ('monitoring', 0.024), ('gael', 0.024), ('ut', 0.023)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 1.0000001 242 nips-2009-The Infinite Partially Observable Markov Decision Process

Author: Finale Doshi-velez

Abstract: The Partially Observable Markov Decision Process (POMDP) framework has proven useful in planning domains where agents must balance actions that provide knowledge and actions that provide reward. Unfortunately, most POMDPs are complex structures with a large number of parameters. In many real-world problems, both the structure and the parameters are difficult to specify from domain knowledge alone. Recent work in Bayesian reinforcement learning has made headway in learning POMDP models; however, this work has largely focused on learning the parameters of the POMDP model. We define an infinite POMDP (iPOMDP) model that does not require knowledge of the size of the state space; instead, it assumes that the number of visited states will grow as the agent explores its world and only models visited states explicitly. We demonstrate the iPOMDP on several standard problems. 1

2 0.38516298 53 nips-2009-Complexity of Decentralized Control: Special Cases

Author: Martin Allen, Shlomo Zilberstein

Abstract: The worst-case complexity of general decentralized POMDPs, which are equivalent to partially observable stochastic games (POSGs) is very high, both for the cooperative and competitive cases. Some reductions in complexity have been achieved by exploiting independence relations in some models. We show that these results are somewhat limited: when these independence assumptions are relaxed in very small ways, complexity returns to that of the general case. 1

3 0.31432995 107 nips-2009-Help or Hinder: Bayesian Models of Social Goal Inference

Author: Tomer Ullman, Chris Baker, Owen Macindoe, Owain Evans, Noah Goodman, Joshua B. Tenenbaum

Abstract: Everyday social interactions are heavily influenced by our snap judgments about others’ goals. Even young infants can infer the goals of intentional agents from observing how they interact with objects and other agents in their environment: e.g., that one agent is ‘helping’ or ‘hindering’ another’s attempt to get up a hill or open a box. We propose a model for how people can infer these social goals from actions, based on inverse planning in multiagent Markov decision problems (MDPs). The model infers the goal most likely to be driving an agent’s behavior by assuming the agent acts approximately rationally given environmental constraints and its model of other agents present. We also present behavioral evidence in support of this model over a simpler, perceptual cue-based alternative. 1

4 0.24812877 113 nips-2009-Improving Existing Fault Recovery Policies

Author: Guy Shani, Christopher Meek

Abstract: An automated recovery system is a key component in a large data center. Such a system typically employs a hand-made controller created by an expert. While such controllers capture many important aspects of the recovery process, they are often not systematically optimized to reduce costs such as server downtime. In this paper we describe a passive policy learning approach for improving existing recovery policies without exploration. We explain how to use data gathered from the interactions of the hand-made controller with the system, to create an improved controller. We suggest learning an indefinite horizon Partially Observable Markov Decision Process, a model for decision making under uncertainty, and solve it using a point-based algorithm. We describe the complete process, starting with data gathering, model learning, model checking procedures, and computing a policy. 1

5 0.21018937 250 nips-2009-Training Factor Graphs with Reinforcement Learning for Efficient MAP Inference

Author: Khashayar Rohanimanesh, Sameer Singh, Andrew McCallum, Michael J. Black

Abstract: Large, relational factor graphs with structure defined by first-order logic or other languages give rise to notoriously difficult inference problems. Because unrolling the structure necessary to represent distributions over all hypotheses has exponential blow-up, solutions are often derived from MCMC. However, because of limitations in the design and parameterization of the jump function, these samplingbased methods suffer from local minima—the system must transition through lower-scoring configurations before arriving at a better MAP solution. This paper presents a new method of explicitly selecting fruitful downward jumps by leveraging reinforcement learning (RL). Rather than setting parameters to maximize the likelihood of the training data, parameters of the factor graph are treated as a log-linear function approximator and learned with methods of temporal difference (TD); MAP inference is performed by executing the resulting policy on held out test data. Our method allows efficient gradient updates since only factors in the neighborhood of variables affected by an action need to be computed—we bypass the need to compute marginals entirely. Our method yields dramatic empirical success, producing new state-of-the-art results on a complex joint model of ontology alignment, with a 48% reduction in error over state-of-the-art in that domain. 1

6 0.20469134 134 nips-2009-Learning to Explore and Exploit in POMDPs

7 0.2037524 218 nips-2009-Skill Discovery in Continuous Reinforcement Learning Domains using Skill Chaining

8 0.16134655 52 nips-2009-Code-specific policy gradient rules for spiking neurons

9 0.12877941 145 nips-2009-Manifold Embeddings for Model-Based Reinforcement Learning under Partial Observability

10 0.10628275 14 nips-2009-A Parameter-free Hedging Algorithm

11 0.086432591 221 nips-2009-Solving Stochastic Games

12 0.083737351 150 nips-2009-Maximum likelihood trajectories for continuous-time Markov chains

13 0.082801178 39 nips-2009-Bayesian Belief Polarization

14 0.078234687 12 nips-2009-A Generalized Natural Actor-Critic Algorithm

15 0.074339546 187 nips-2009-Particle-based Variational Inference for Continuous Systems

16 0.061722882 159 nips-2009-Multi-Step Dyna Planning for Policy Evaluation and Control

17 0.060956955 172 nips-2009-Nonparametric Bayesian Texture Learning and Synthesis

18 0.060519427 226 nips-2009-Spatial Normalized Gamma Processes

19 0.05980115 205 nips-2009-Rethinking LDA: Why Priors Matter

20 0.059181157 171 nips-2009-Nonparametric Bayesian Models for Unsupervised Event Coreference Resolution


similar papers computed by lsi model

lsi for this paper:

topicId topicWeight

[(0, -0.18), (1, -0.029), (2, 0.279), (3, -0.348), (4, -0.356), (5, -0.024), (6, 0.078), (7, 0.015), (8, -0.046), (9, 0.043), (10, 0.175), (11, 0.173), (12, 0.074), (13, 0.059), (14, 0.036), (15, 0.04), (16, -0.054), (17, 0.036), (18, 0.05), (19, 0.073), (20, -0.038), (21, 0.043), (22, -0.166), (23, 0.002), (24, 0.163), (25, -0.029), (26, 0.097), (27, -0.039), (28, -0.038), (29, -0.044), (30, -0.015), (31, -0.017), (32, 0.03), (33, 0.026), (34, 0.04), (35, 0.022), (36, -0.023), (37, -0.032), (38, 0.054), (39, 0.012), (40, -0.025), (41, -0.03), (42, 0.002), (43, -0.007), (44, 0.008), (45, 0.039), (46, -0.009), (47, -0.007), (48, 0.018), (49, -0.006)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.95958352 242 nips-2009-The Infinite Partially Observable Markov Decision Process

Author: Finale Doshi-velez

Abstract: The Partially Observable Markov Decision Process (POMDP) framework has proven useful in planning domains where agents must balance actions that provide knowledge and actions that provide reward. Unfortunately, most POMDPs are complex structures with a large number of parameters. In many real-world problems, both the structure and the parameters are difficult to specify from domain knowledge alone. Recent work in Bayesian reinforcement learning has made headway in learning POMDP models; however, this work has largely focused on learning the parameters of the POMDP model. We define an infinite POMDP (iPOMDP) model that does not require knowledge of the size of the state space; instead, it assumes that the number of visited states will grow as the agent explores its world and only models visited states explicitly. We demonstrate the iPOMDP on several standard problems. 1

2 0.92143089 53 nips-2009-Complexity of Decentralized Control: Special Cases

Author: Martin Allen, Shlomo Zilberstein

Abstract: The worst-case complexity of general decentralized POMDPs, which are equivalent to partially observable stochastic games (POSGs) is very high, both for the cooperative and competitive cases. Some reductions in complexity have been achieved by exploiting independence relations in some models. We show that these results are somewhat limited: when these independence assumptions are relaxed in very small ways, complexity returns to that of the general case. 1

3 0.89424384 218 nips-2009-Skill Discovery in Continuous Reinforcement Learning Domains using Skill Chaining

Author: George Konidaris, Andre S. Barreto

Abstract: We introduce a skill discovery method for reinforcement learning in continuous domains that constructs chains of skills leading to an end-of-task reward. We demonstrate experimentally that it creates appropriate skills and achieves performance benefits in a challenging continuous domain. 1

4 0.89405745 107 nips-2009-Help or Hinder: Bayesian Models of Social Goal Inference

Author: Tomer Ullman, Chris Baker, Owen Macindoe, Owain Evans, Noah Goodman, Joshua B. Tenenbaum

Abstract: Everyday social interactions are heavily influenced by our snap judgments about others’ goals. Even young infants can infer the goals of intentional agents from observing how they interact with objects and other agents in their environment: e.g., that one agent is ‘helping’ or ‘hindering’ another’s attempt to get up a hill or open a box. We propose a model for how people can infer these social goals from actions, based on inverse planning in multiagent Markov decision problems (MDPs). The model infers the goal most likely to be driving an agent’s behavior by assuming the agent acts approximately rationally given environmental constraints and its model of other agents present. We also present behavioral evidence in support of this model over a simpler, perceptual cue-based alternative. 1

5 0.84301353 134 nips-2009-Learning to Explore and Exploit in POMDPs

Author: Chenghui Cai, Xuejun Liao, Lawrence Carin

Abstract: A fundamental objective in reinforcement learning is the maintenance of a proper balance between exploration and exploitation. This problem becomes more challenging when the agent can only partially observe the states of its environment. In this paper we propose a dual-policy method for jointly learning the agent behavior and the balance between exploration exploitation, in partially observable environments. The method subsumes traditional exploration, in which the agent takes actions to gather information about the environment, and active learning, in which the agent queries an oracle for optimal actions (with an associated cost for employing the oracle). The form of the employed exploration is dictated by the specific problem. Theoretical guarantees are provided concerning the optimality of the balancing of exploration and exploitation. The effectiveness of the method is demonstrated by experimental results on benchmark problems.

6 0.59404695 113 nips-2009-Improving Existing Fault Recovery Policies

7 0.54784107 250 nips-2009-Training Factor Graphs with Reinforcement Learning for Efficient MAP Inference

8 0.49249581 145 nips-2009-Manifold Embeddings for Model-Based Reinforcement Learning under Partial Observability

9 0.46335748 159 nips-2009-Multi-Step Dyna Planning for Policy Evaluation and Control

10 0.44019368 12 nips-2009-A Generalized Natural Actor-Critic Algorithm

11 0.42287108 150 nips-2009-Maximum likelihood trajectories for continuous-time Markov chains

12 0.40609497 39 nips-2009-Bayesian Belief Polarization

13 0.35182512 14 nips-2009-A Parameter-free Hedging Algorithm

14 0.34355846 52 nips-2009-Code-specific policy gradient rules for spiking neurons

15 0.32122144 69 nips-2009-Discrete MDL Predicts in Total Variation

16 0.30796674 221 nips-2009-Solving Stochastic Games

17 0.26374435 187 nips-2009-Particle-based Variational Inference for Continuous Systems

18 0.25389221 132 nips-2009-Learning in Markov Random Fields using Tempered Transitions

19 0.25183126 60 nips-2009-Convergent Temporal-Difference Learning with Arbitrary Smooth Function Approximation

20 0.2478247 171 nips-2009-Nonparametric Bayesian Models for Unsupervised Event Coreference Resolution


similar papers computed by lda model

lda for this paper:

topicId topicWeight

[(24, 0.035), (25, 0.053), (35, 0.049), (36, 0.069), (39, 0.032), (58, 0.064), (61, 0.419), (71, 0.098), (81, 0.017), (86, 0.05), (91, 0.01)]

similar papers list:

simIndex simValue paperId paperTitle

1 0.86738843 66 nips-2009-Differential Use of Implicit Negative Evidence in Generative and Discriminative Language Learning

Author: Anne Hsu, Thomas L. Griffiths

Abstract: A classic debate in cognitive science revolves around understanding how children learn complex linguistic rules, such as those governing restrictions on verb alternations, without negative evidence. Traditionally, formal learnability arguments have been used to claim that such learning is impossible without the aid of innate language-specific knowledge. However, recently, researchers have shown that statistical models are capable of learning complex rules from only positive evidence. These two kinds of learnability analyses differ in their assumptions about the distribution from which linguistic input is generated. The former analyses assume that learners seek to identify grammatical sentences in a way that is robust to the distribution from which the sentences are generated, analogous to discriminative approaches in machine learning. The latter assume that learners are trying to estimate a generative model, with sentences being sampled from that model. We show that these two learning approaches differ in their use of implicit negative evidence – the absence of a sentence – when learning verb alternations, and demonstrate that human learners can produce results consistent with the predictions of both approaches, depending on how the learning problem is presented. 1

same-paper 2 0.85822672 242 nips-2009-The Infinite Partially Observable Markov Decision Process

Author: Finale Doshi-velez

Abstract: The Partially Observable Markov Decision Process (POMDP) framework has proven useful in planning domains where agents must balance actions that provide knowledge and actions that provide reward. Unfortunately, most POMDPs are complex structures with a large number of parameters. In many real-world problems, both the structure and the parameters are difficult to specify from domain knowledge alone. Recent work in Bayesian reinforcement learning has made headway in learning POMDP models; however, this work has largely focused on learning the parameters of the POMDP model. We define an infinite POMDP (iPOMDP) model that does not require knowledge of the size of the state space; instead, it assumes that the number of visited states will grow as the agent explores its world and only models visited states explicitly. We demonstrate the iPOMDP on several standard problems. 1

3 0.79398823 60 nips-2009-Convergent Temporal-Difference Learning with Arbitrary Smooth Function Approximation

Author: Shalabh Bhatnagar, Doina Precup, David Silver, Richard S. Sutton, Hamid R. Maei, Csaba Szepesvári

Abstract: We introduce the first temporal-difference learning algorithms that converge with smooth value function approximators, such as neural networks. Conventional temporal-difference (TD) methods, such as TD(λ), Q-learning and Sarsa have been used successfully with function approximation in many applications. However, it is well known that off-policy sampling, as well as nonlinear function approximation, can cause these algorithms to become unstable (i.e., the parameters of the approximator may diverge). Sutton et al. (2009a, 2009b) solved the problem of off-policy learning with linear TD algorithms by introducing a new objective function, related to the Bellman error, and algorithms that perform stochastic gradient-descent on this function. These methods can be viewed as natural generalizations to previous TD methods, as they converge to the same limit points when used with linear function approximation methods. We generalize this work to nonlinear function approximation. We present a Bellman error objective function and two gradient-descent TD algorithms that optimize it. We prove the asymptotic almost-sure convergence of both algorithms, for any finite Markov decision process and any smooth value function approximator, to a locally optimal solution. The algorithms are incremental and the computational complexity per time step scales linearly with the number of parameters of the approximator. Empirical results obtained in the game of Go demonstrate the algorithms’ effectiveness. 1

4 0.7274552 64 nips-2009-Data-driven calibration of linear estimators with minimal penalties

Author: Sylvain Arlot, Francis R. Bach

Abstract: This paper tackles the problem of selecting among several linear estimators in non-parametric regression; this includes model selection for linear regression, the choice of a regularization parameter in kernel ridge regression or spline smoothing, and the choice of a kernel in multiple kernel learning. We propose a new algorithm which first estimates consistently the variance of the noise, based upon the concept of minimal penalty which was previously introduced in the context of model selection. Then, plugging our variance estimate in Mallows’ CL penalty is proved to lead to an algorithm satisfying an oracle inequality. Simulation experiments with kernel ridge regression and multiple kernel learning show that the proposed algorithm often improves significantly existing calibration procedures such as 10-fold cross-validation or generalized cross-validation. 1

5 0.68843937 33 nips-2009-Analysis of SVM with Indefinite Kernels

Author: Yiming Ying, Colin Campbell, Mark Girolami

Abstract: The recent introduction of indefinite SVM by Luss and d’Aspremont [15] has effectively demonstrated SVM classification with a non-positive semi-definite kernel (indefinite kernel). This paper studies the properties of the objective function introduced there. In particular, we show that the objective function is continuously differentiable and its gradient can be explicitly computed. Indeed, we further show that its gradient is Lipschitz continuous. The main idea behind our analysis is that the objective function is smoothed by the penalty term, in its saddle (min-max) representation, measuring the distance between the indefinite kernel matrix and the proxy positive semi-definite one. Our elementary result greatly facilitates the application of gradient-based algorithms. Based on our analysis, we further develop Nesterov’s smooth optimization approach [17, 18] for indefinite SVM which has an optimal convergence rate for smooth problems. Experiments on various benchmark datasets validate our analysis and demonstrate the efficiency of our proposed algorithms.

6 0.60944021 159 nips-2009-Multi-Step Dyna Planning for Policy Evaluation and Control

7 0.5710392 134 nips-2009-Learning to Explore and Exploit in POMDPs

8 0.5690071 107 nips-2009-Help or Hinder: Bayesian Models of Social Goal Inference

9 0.55602103 218 nips-2009-Skill Discovery in Continuous Reinforcement Learning Domains using Skill Chaining

10 0.51624972 250 nips-2009-Training Factor Graphs with Reinforcement Learning for Efficient MAP Inference

11 0.50426131 113 nips-2009-Improving Existing Fault Recovery Policies

12 0.49876738 12 nips-2009-A Generalized Natural Actor-Critic Algorithm

13 0.49606645 215 nips-2009-Sensitivity analysis in HMMs with application to likelihood maximization

14 0.48010498 145 nips-2009-Manifold Embeddings for Model-Based Reinforcement Learning under Partial Observability

15 0.47450423 150 nips-2009-Maximum likelihood trajectories for continuous-time Markov chains

16 0.46738431 37 nips-2009-Asymptotically Optimal Regularization in Smooth Parametric Models

17 0.45149729 22 nips-2009-Accelerated Gradient Methods for Stochastic Optimization and Online Learning

18 0.4461205 48 nips-2009-Bootstrapping from Game Tree Search

19 0.44548649 52 nips-2009-Code-specific policy gradient rules for spiking neurons

20 0.4401589 36 nips-2009-Asymptotic Analysis of MAP Estimation via the Replica Method and Compressed Sensing