nips nips2012 nips2012-153 knowledge-graph by maker-knowledge-mining

153 nips-2012-How Prior Probability Influences Decision Making: A Unifying Probabilistic Model


Source: pdf

Author: Yanping Huang, Timothy Hanks, Mike Shadlen, Abram L. Friesen, Rajesh P. Rao

Abstract: How does the brain combine prior knowledge with sensory evidence when making decisions under uncertainty? Two competing descriptive models have been proposed based on experimental data. The first posits an additive offset to a decision variable, implying a static effect of the prior. However, this model is inconsistent with recent data from a motion discrimination task involving temporal integration of uncertain sensory evidence. To explain this data, a second model has been proposed which assumes a time-varying influence of the prior. Here we present a normative model of decision making that incorporates prior knowledge in a principled way. We show that the additive offset model and the time-varying prior model emerge naturally when decision making is viewed within the framework of partially observable Markov decision processes (POMDPs). Decision making in the model reduces to (1) computing beliefs given observations and prior information in a Bayesian manner, and (2) selecting actions based on these beliefs to maximize the expected sum of future rewards. We show that the model can explain both data previously explained using the additive offset model as well as more recent data on the time-varying influence of prior knowledge on decision making. 1

Reference: text


Summary: the most important sentences generated by the tfidf model

sentIndex sentText sentNum sentScore

1 How does the brain combine prior knowledge with sensory evidence when making decisions under uncertainty? [sent-14, score-0.276]

2 The first posits an additive offset to a decision variable, implying a static effect of the prior. [sent-16, score-0.331]

3 However, this model is inconsistent with recent data from a motion discrimination task involving temporal integration of uncertain sensory evidence. [sent-17, score-0.282]

4 Here we present a normative model of decision making that incorporates prior knowledge in a principled way. [sent-19, score-0.44]

5 We show that the additive offset model and the time-varying prior model emerge naturally when decision making is viewed within the framework of partially observable Markov decision processes (POMDPs). [sent-20, score-0.799]

6 We show that the model can explain both data previously explained using the additive offset model as well as more recent data on the time-varying influence of prior knowledge on decision making. [sent-22, score-0.536]

7 Daw and Dayan [5, 6] were among the first to study decision theoretic and reinforcement learning models with the goal of interpreting results from various neurobiological experiments. [sent-28, score-0.24]

8 Bogacz and colleagues proposed a model that combines a traditional decision making model with reinforcement learning [7] (see also [8, 9]). [sent-29, score-0.357]

9 However, the LATER model fails to explain data from the random dots motion discrimination task [14] in which the agent is presented with noisy, time-varying stimuli and must continually process this data in order to make a correct choice and receive reward. [sent-32, score-0.522]

10 The drift diffusion model (DDM), which uses random-walk accumulation instead of a linear rise to a boundary, has been successful in explaining behavioral and neurophysiological data in various perceptual discrimination tasks [14, 15, 16]. [sent-33, score-0.346]

11 However, in order to explain behavioral data from recent variants of random dots tasks in which the prior probability of motion direction is manipulated [13], DDMs require the additional assumption of dynamic reweighting of the influence of the prior over time. [sent-34, score-0.634]

12 Here, we present a normative framework for decision making that incorporates prior knowledge and noisy observations under a reward maximization hypothesis. [sent-35, score-0.603]

13 Our work is inspired by models which cast human and animal decision making in a rational, or optimal, framework. [sent-36, score-0.613]

14 [9] sought to explain the collapsing decision threshold by combining a traditional drift diffusion model with reward rate maximization; their model also requires knowledge of decision time in hindsight. [sent-40, score-0.804]

15 In this paper, we derive a novel POMDP model from which we compute the optimal behavior for sequential decision making tasks. [sent-41, score-0.326]

16 We demonstrate our model’s explanatory power on two such tasks: the random dots motion discrimination task [13] and Carpenter and Williams’ saccadic eye movement task [10]. [sent-42, score-0.727]

17 We show that the urgency signal, hypothesized in previous models, emerges naturally as a collapsing decision boundary with no assumption of a decision deadline. [sent-43, score-0.549]

18 Finally, the same model also accurately predicts the effect of prior probability changes on the distribution of reaction times in the Carpenter and Williams task, data that was previously interpreted in terms of the additive offset model. [sent-46, score-0.555]

19 1 Decision Making in a POMDP framework. Model Setup. We model a decision making task using a POMDP, which assumes that at any particular time step, t, the environment is in a particular hidden state, x ∈ X , that is not directly observable by the animal. [sent-48, score-0.452]

20 The animal can make sensory measurements in order to observe noisy samples of this hidden state. [sent-49, score-0.423]

21 At each time step, the animal receives an observation (stimulus), st , from the environment as determined by an emission distribution, Pr(st |x). [sent-50, score-0.494]

22 At each time step, the animal chooses an action, a ∈ A, and receives an observation and a reward, R(x, a), from the environment, depending on the current state and the action taken. [sent-52, score-0.5]

23 The animal uses Bayes rule to update its belief about the environment after each observation. [sent-53, score-0.45]

24 Through these interactions, the animal learns a policy, π(b) ∈ A for all b, which dictates the action to take for each belief state. [sent-54, score-0.52]

25 For example, in the random dots motion discrimination task, the hidden state, x, is composed of both the coherence of the random dots c ∈ [0, 1] and the direction d ∈ {−1, 1} (corresponding to leftward and rightward motion, respectively), neither of which is known to the animal. [sent-56, score-0.831]

26 The animal is shown a movie of randomly moving dots, a fraction of which are moving in the same direction (this fraction is the coherence). [sent-57, score-0.371]

27 Each frame, st , is a snapshot of the changes in dot positions, sampled from the emission distribution st ∼ Pr(st |kc, d), where k > 0 is a free parameter that determines the scale of st . [sent-59, score-0.332]

28 In order to discriminate the direction given the stimuli, the animal uses Bayes rule to compute the posterior probability of the static joint hidden state, Pr(x = kdc|s1:t ). [sent-60, score-0.444]

29 At each time step, the animal chooses one of three actions, a ∈ {AR , AL , AS }, denoting rightward eye movement, leftward eye movement, and sampling (i.e., requesting another observation). [sent-61, score-0.685]

30 When the correct action is chosen (i.e., a rightward eye movement a = AR when x > 0 or a leftward eye movement a = AL when x < 0), the animal receives a positive reward RP > 0. [sent-66, score-1.02]

31 The animal receives a negative reward (penalty) or no reward when an incorrect action is chosen, RN ≤ 0. [sent-67, score-0.74]

32 We assume that the animal is motivated by hunger or thirst to make a decision as quickly as possible and model this with a unit penalty RS = −1, representing the cost the agent needs to pay when choosing the sampling action AS . [sent-68, score-0.614]
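
To make the extracted setup concrete, here is a minimal Python sketch (not from the paper) of the task ingredients described in the sentences above. The Gaussian form of the emission distribution and every numeric value (k, σe, the reward magnitudes, the coherence levels) are illustrative assumptions, not the authors' fitted parameters.

```python
import numpy as np

# Illustrative values only; the paper fits its parameters to behavioral data.
K = 10.0                          # scale factor in the hidden state x = k*d*c
SIGMA_E = 1.0                     # std. dev. of the emission distribution Pr(s_t | x)
R_P, R_N, R_S = 20.0, 0.0, -1.0   # reward for a correct choice, an incorrect choice, and per sample

ACTIONS = ("A_R", "A_L", "A_S")   # rightward saccade, leftward saccade, keep sampling

def sample_hidden_state(prior_right=0.5,
                        coherences=(0.0, 0.032, 0.064, 0.128, 0.256, 0.512),
                        rng=np.random):
    """Draw one trial's hidden state x = k*d*c (unknown to the animal)."""
    d = 1 if rng.random() < prior_right else -1
    c = rng.choice(coherences)
    return K * d * c

def emit(x, rng=np.random):
    """One noisy observation s_t ~ N(x, sigma_e^2), standing in for Pr(s_t | kc, d)."""
    return rng.normal(x, SIGMA_E)
```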

33 2 Bayesian Inference of Hidden State from Prior Information and Noisy Observations. In a POMDP, decisions are made based on the belief state bt (x) = Pr(x|s1:t ), which is the posterior probability distribution over x given a sequence of observations s1:t . [sent-70, score-0.523]

34 The initial belief b0 (x) represents the animal’s prior knowledge about x. [sent-71, score-0.245]

35 Unlike the traditional case where a full prior distribution is given, this direction-only prior information provides only partial knowledge about the hidden state which also includes coherence. [sent-74, score-0.336]

36 In the decision making tasks that we model in this paper, the hidden state is fixed within a trial and thus there is no transition distribution to include in the belief update equation. [sent-83, score-0.648]

37 3 Finding the optimal policy by reward maximization. Within the POMDP framework, the animal's goal is to find an optimal policy π ∗ (bt ) that maximizes its expected reward, starting at bt . [sent-86, score-0.814]

38 This is encapsulated in the value function v π (bt ) = E[ Σk≥1 r(bt+k , π(bt+k )) | bt , π ] (5), where the expectation is taken with respect to all future belief states (bt+1 , bt+2 , . . . ) [sent-87, score-0.437]

39 given that the animal is using π to make decisions, and r(b, a) is the reward function over belief states or, equivalently, the expected reward over hidden states, r(b, a) = ∫x R(x, a) b(x) dx. [sent-93, score-0.841]

40 In this model, the belief b is parameterized by s̄t and t, so the animal only needs to keep track of these instead of encoding the entire posterior distribution bt (x) explicitly. [sent-95, score-0.845]
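
A short sketch of this sufficient-statistic view: with Gaussian observation noise, the running mean s̄t and the elapsed time t summarize s1:t. The flat-prior normal posterior below (mean µt = s̄t, standard deviation σt = σe/√t) is an illustrative simplification consistent with the µt and σt used in the text, not the paper's exact piecewise-normal belief.

```python
import numpy as np
from scipy.stats import norm

def update_running_mean(s_bar, t, s_new):
    """Fold a new observation into the running average of the first t samples."""
    return (t * s_bar + s_new) / (t + 1), t + 1

def belief_params(s_bar, t, sigma_e=1.0):
    """Approximate posterior over x given s_1:t under a flat prior: N(mu_t, sigma_t^2)."""
    return s_bar, sigma_e / np.sqrt(t)

def prob_right(s_bar, t, prior_right=0.5, sigma_e=1.0):
    """Posterior probability that x > 0, reweighting the two halves by the direction prior."""
    mu_t, sigma_t = belief_params(s_bar, t, sigma_e)
    p_neg = norm.cdf(0.0, loc=mu_t, scale=sigma_t)   # belief mass on x < 0
    num = prior_right * (1.0 - p_neg)
    return num / (num + (1.0 - prior_right) * p_neg)
```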

41 With probability Pr(dL ) · Φ(0 | µt , σt ), the hidden state x is less than 0, making AR an incorrect decision and resulting in a penalty RN if chosen. [sent-98, score-0.387]
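
Building on the quantity Pr(dL ) · Φ(0 | µt , σt ) in the sentence above, here is a hedged sketch of the expected immediate reward r(b, a) of committing to a saccade. The normalization of the two direction-weighted masses is an assumption of this sketch, as are the default reward values.

```python
from scipy.stats import norm

def expected_terminal_reward(action, mu_t, sigma_t, prior_right, R_P=20.0, R_N=0.0):
    """Expected immediate reward of A_R or A_L for a belief summarized by (mu_t, sigma_t)."""
    phi0 = norm.cdf(0.0, loc=mu_t, scale=sigma_t)
    p_left = (1.0 - prior_right) * phi0          # Pr(d_L) * Phi(0 | mu_t, sigma_t)
    p_right = prior_right * (1.0 - phi0)
    z = p_left + p_right                         # normalization (assumption of this sketch)
    p_left, p_right = p_left / z, p_right / z
    if action == "A_R":
        return R_P * p_right + R_N * p_left
    return R_P * p_left + R_N * p_right          # action == "A_L"
```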

42 When AS is selected, the animal simply receives an observation at a cost of RS . [sent-101, score-0.352]

43 The probability Pr(s | bt , AS ) can be treated as a normalization factor and is independent of the hidden state. [sent-105, score-0.373]

44 Thus, the transition probability function, T (bt+1 | bt , AS ), is solely a function of the belief bt and is a stationary distribution over the belief space. [sent-106, score-0.874]

45 When the selected action is AL or AR , the animal stops sampling and makes an eye movement to the left or the right, respectively. [sent-107, score-0.576]

46 Moreover, whenever the animal chooses AL or AR , the POMDP immediately transitions into Γ: T (Γ|b, a ∈ {AL , AR }) = 1, ∀b, indicating the end of a trial. [sent-111, score-0.344]

47 Given the transition probability between belief states T (bt+1 |bt , a) and the reward function, we can convert our POMDP model into a Markov Decision Process (MDP) over the belief state. [sent-112, score-0.46]
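
A hedged numerical sketch of one way such a belief MDP could be solved: discretize the (s̄t , t) parameterization, run backward induction from a truncation horizon, and read the collapsing boundaries ψR (t), ψL (t) off the resulting policy. The grid, the Monte Carlo transition, the flat-prior predictive distribution, the truncation horizon, and all constants are assumptions of this sketch; the paper instead derives the optimal policy from first principles.

```python
import numpy as np
from scipy.stats import norm

# Illustrative constants, not the paper's fitted values.
SIGMA_E, R_P, R_N, R_S = 1.0, 20.0, 0.0, -1.0
T_MAX = 200                                  # truncation horizon for backward induction
S_GRID = np.linspace(-4.0, 4.0, 401)         # grid over the running mean s_bar

def terminal_rewards(s_bar, t, prior_right):
    """Expected immediate rewards r(b, A_R), r(b, A_L) for the belief (s_bar, t)."""
    sigma_t = SIGMA_E / np.sqrt(t)
    phi0 = norm.cdf(0.0, s_bar, sigma_t)
    p_neg = (1.0 - prior_right) * phi0
    p_pos = prior_right * (1.0 - phi0)
    z = p_neg + p_pos
    p_neg, p_pos = p_neg / z, p_pos / z
    return R_P * p_pos + R_N * p_neg, R_P * p_neg + R_N * p_pos

def optimal_policy(prior_right=0.5, n_mc=200, seed=0):
    """Backward induction over the discretized (s_bar, t) belief MDP."""
    rng = np.random.default_rng(seed)
    V = np.zeros((T_MAX + 2, len(S_GRID)))
    policy = np.zeros((T_MAX + 1, len(S_GRID)), dtype=int)   # 0 = A_S, 1 = A_R, 2 = A_L
    for t in range(T_MAX, 0, -1):
        sigma_t = SIGMA_E / np.sqrt(t)
        for i, s_bar in enumerate(S_GRID):
            r_R, r_L = terminal_rewards(s_bar, t, prior_right)
            if t == T_MAX:
                q_S = -np.inf                # force a choice at the truncation horizon
            else:
                # Monte Carlo over the next observation (flat-prior predictive, an assumption).
                s_next = rng.normal(s_bar, np.sqrt(sigma_t**2 + SIGMA_E**2), n_mc)
                s_bar_next = (t * s_bar + s_next) / (t + 1)
                j = np.clip(np.searchsorted(S_GRID, s_bar_next), 0, len(S_GRID) - 1)
                q_S = R_S + V[t + 1, j].mean()
            q = (q_S, r_R, r_L)
            policy[t, i] = int(np.argmax(q))
            V[t, i] = q[policy[t, i]]
    return V, policy

def boundaries(policy):
    """Read the collapsing decision boundaries psi_R(t), psi_L(t) off the policy table."""
    psi_R, psi_L = [], []
    for t in range(1, policy.shape[0]):
        right = S_GRID[policy[t] == 1]
        left = S_GRID[policy[t] == 2]
        psi_R.append(right.min() if right.size else np.nan)
        psi_L.append(left.max() if left.size else np.nan)
    return np.array(psi_R), np.array(psi_L)
```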

48 Here, we derive the optimal policy from first principles and focus on comparisons between our model’s predictions and behavioral data. [sent-117, score-0.24]

49 Figure 1(a) shows the optimal policy π ∗ as a joint function of s̄ and t for the unbiased case where the prior probability Pr(dR ) = Pr(dL ) = 0.5. [sent-128, score-0.27]

50 π ∗ partitions the belief space into three regions: ΠR , ΠL , and ΠS , representing the set of belief states preferring actions AR , AL and AS , respectively. [sent-130, score-0.302]

51 This gradual decrease in the threshold for choosing one of the non-sampling actions AR or AL has been called a “collapsing bound” in the decision making literature [21, 17, 22]. [sent-136, score-0.295]

52 For this unbiased prior case, the expected reward function is symmetric, r(bt (x), AR ) = r(Pr(x|s̄t , t), AR ) = r(Pr(x|−s̄t , t), AL ), and thus the decision boundaries are also symmetric around 0: ψR (t) = −ψL (t). [sent-137, score-0.471]

53 The optimal policy π ∗ is entirely determined by the reward parameters {RP , RN , RS } and the prior probability (the standard deviation of the emission distribution σe only determines the temporal resolution of the POMDP). [sent-138, score-0.476]

54 It applies to both Carpenter and Williams’ task and the random dots task (these two tasks differ only in the interpretation of the hidden state x). [sent-139, score-0.41]

55 The optimal action at a specific belief state is determined by the relative, not the absolute, value of the expected future reward. [sent-140, score-0.286]

56 Moreover, if the unit of reward is specified by the sampling penalty, the optimal policy π ∗ is entirely determined by the ratio (RP − RN )/RS and the prior. [sent-142, score-0.321]

57 As the prior probability becomes biased, the optimal policy becomes asymmetric. [sent-143, score-0.27]

58 Later in a trial, even for some belief states with s̄ < 0, the optimal action is still AR , because the effect of the prior is stronger than that of the observed data. [sent-147, score-0.347]

59 Blue solid curves are model predictions in the neutral case while green dotted curves are model predictions from the biased case. [sent-153, score-0.332]

60 Recall that the belief b is parametrized by s̄t and t, so the animal only needs to know the elapsed time and compute a running average s̄t of the observations in order to maintain the posterior belief bt (x). [sent-166, score-1.113]

61 Given its current belief, the animal selects an action from the optimal policy π ∗ (bt ). [sent-167, score-0.545]

62 When bt ∈ ΠS , the animal chooses the sampling action and gets a new observation st+1 . [sent-168, score-0.714]

63 Otherwise the animal terminates the trial by making an eye movement to the right or to the left, for s̄t > ψR (t) or s̄t < ψL (t), respectively. [sent-169, score-0.826]

64 The performance on the task using the optimal policy can be measured in terms of both the accuracy of direction discrimination (the so-called psychometric function), and the reaction time required to reach a decision (the chronometric function). [sent-170, score-1.046]

65 The hidden variable x = kdc encapsulates the unknown direction and coherence, as well as the free parameter k that determines the scale of stimulus st . [sent-171, score-0.295]

66 Given an optimal policy, we compute both the psychometric and chronometric function by simulating a large number of trials (10000 trials per data point) and collecting the reaction time and chosen direction from each trial. [sent-173, score-0.569]
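
A sketch of this Monte Carlo step, assuming the decision boundaries ψR (t) and ψL (t) have already been extracted from the optimal policy (for instance with the backward-induction sketch above); the Gaussian observation model and all constants are illustrative.

```python
import numpy as np

def simulate_trials(psi_R, psi_L, coherence, direction=1, k=10.0, sigma_e=1.0,
                    n_trials=10000, seed=1):
    """Roll out trials under the boundary policy; return accuracy and mean RT in model steps."""
    rng = np.random.default_rng(seed)
    x = k * direction * coherence
    correct, rts = [], []
    for _ in range(n_trials):
        s_sum, t = 0.0, 0
        while True:
            t += 1
            s_sum += rng.normal(x, sigma_e)
            s_bar = s_sum / t
            if s_bar > psi_R[t - 1]:                 # rightward saccade A_R
                correct.append(direction == 1); rts.append(t); break
            if s_bar < psi_L[t - 1]:                 # leftward saccade A_L
                correct.append(direction == -1); rts.append(t); break
            if t == len(psi_R):                      # safeguard at the truncation horizon
                correct.append((s_bar > 0) == (direction == 1)); rts.append(t); break
    return float(np.mean(correct)), float(np.mean(rts))

# Example: one point of the psychometric/chronometric functions (illustrative boundaries).
# acc, mean_rt = simulate_trials(psi_R, psi_L, coherence=0.128)
```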

67 The upper panels of figure 2(a) and 2(b) (blue curves) show the performance accuracy as a function of coherence for both the model (blue solid curve) and the human subjects (blue dots) for the neutral prior Pr(dR ) = 0.5. [sent-174, score-0.375]

68 The lower panels of figure 2(a) and 2(b) (blue solid curves) show the predicted mean reaction time for correct choices as a function of coherence c for our model (blue solid curve, with the same model parameters) and the data (blue dots). [sent-177, score-0.482]

69 Note that our model’s predicted reaction times represent the expected number of POMDP time steps before making a rightward eye movement AR , which we can directly compare to the monkey’s experimental data in units of real time. [sent-178, score-0.597]

70 A linear regression is used to determine the duration τ of a single time step and the onset of decision time tnd . [sent-179, score-0.293]
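
The conversion from model time steps to real time described here can be sketched as an ordinary least-squares fit RT_ms ≈ τ · steps + tnd. The paired values below are invented purely to show the mechanics; they are not the paper's data.

```python
import numpy as np

# Hypothetical pairing of model reaction times (in POMDP steps) with measured mean
# reaction times (in ms) at matched coherences; the numbers are made up for illustration.
model_steps = np.array([14.0, 12.5, 10.8, 8.9, 7.1, 5.6])
data_rt_ms = np.array([800.0, 760.0, 690.0, 620.0, 540.0, 480.0])

# RT_ms ~= tau * steps + t_nd: tau is the duration of one POMDP time step and
# t_nd the portion of the reaction time not spent accumulating evidence.
tau, t_nd = np.polyfit(model_steps, data_rt_ms, 1)
print(f"step duration tau ~ {tau:.1f} ms, offset t_nd ~ {t_nd:.0f} ms")
```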

71 We applied the experimental mean reaction time reported in [13] with motion coherence c = 0. [sent-181, score-0.403]

72 In the POMDP model we propose, we predict both the accuracy and reaction times in the biased setting (green curves in figure 2) with the parameters learned in the neutral case, and achieve a good fit (with the coefficients of determination shown in fig. [sent-196, score-0.435]

73 Our model predictions for the biased cases are a direct result of the reward maximization component of our framework and require no additional parameter fitting. [sent-199, score-0.301]

74 50ms, and predict both the psychometric function and the reaction times in the biased case. [sent-203, score-0.445]

75 3 Reaction times in the Carpenter and Williams task. Figure 3: Model predictions of saccadic eye movement in Carpenter & Williams' experiments [10]. [sent-212, score-0.375]

76 Black dots are from experimental data and blue dots are model predictions. [sent-217, score-0.395]

77 In Carpenter and Williams' task, the animal needs to decide on which side d ∈ {−1, 1} (denoting left or right side) a target light appeared at a fixed distance from a central fixation light. [sent-223, score-0.313]

78 Under the POMDP framework, Carpenter and Williams’ task and the random dots task differ in the interpretation of hidden state x and stimulus s, but they follow the same optimal policy given the same reward parameters. [sent-230, score-0.737]

79 Without loss of generality, we set the hidden variable x > 0 and say that the animal makes a correct choice at a hitting time tH when the animal’s belief state reaches the right boundary. [sent-231, score-0.57]

80 Since, for small t, ψR (t) behaves like a simple reciprocal function of t, the reciprocal of the reaction time is approximately normally distributed, 1/tH ∼ N (1/tH | k, σe²). [sent-233, score-0.411]

81 In figure 3(a), we plot the distribution of reciprocal reaction time with different values of Pr(dR ) on a probit scale (similar to [10]). [sent-234, score-0.319]

82 If the reciprocal of the reaction time (with the same prior Pr(dR )) follows a normal distribution, each point on the graph will fall on a straight line whose y-intercept (set by k and σe ) is independent of Pr(dR ). [sent-236, score-0.497]
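
A sketch of how such a probit-scale ("reciprobit") plot of reciprocal reaction times can be produced; the Hazen plotting positions and the synthetic reaction times are illustrative choices, not the authors' procedure.

```python
import numpy as np
from scipy.stats import norm

def reciprobit_points(rts):
    """Empirical cumulative probabilities of 1/RT, mapped onto the probit (z) scale."""
    recip = np.sort(1.0 / np.asarray(rts, dtype=float))
    # Hazen plotting positions keep the probit transform finite at both ends.
    p = (np.arange(1, len(recip) + 1) - 0.5) / len(recip)
    return recip, norm.ppf(p)

# Synthetic reaction times (ms); with a common prior the points should lie near a straight
# line, and changing Pr(d_R) shifts the line while, per the text, the y-intercept stays fixed.
fake_rts = np.random.default_rng(2).normal(600.0, 80.0, 500)
x_vals, z_vals = reciprobit_points(fake_rts)
```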

83 Figure 3(b) demonstrates that the median of our model’s reaction times is a linear function of the log of the prior probability. [sent-241, score-0.367]

84 Increasing the prior probability lowers the decision boundary ψR (t), effectively decreasing the latency. [sent-242, score-0.362]

85 4 Summary and Conclusion Our results suggest that decision making in the primate brain may be governed by the dual principles of Bayesian inference and reward maximization as implemented within the framework of partially observable Markov decision processes (POMDPs). [sent-245, score-0.729]

86 The model provides a unified explanation for experimental data previously explained by two competing models, namely, the additive offset model and the dynamic weighting model for incorporating prior knowledge. [sent-246, score-0.316]

87 In particular, the model predicts psychometric and chronometric data for the random dots motion discrimination task [13] as well as Carpenter and Williams’ saccadic eye movement task [10]. [sent-247, score-1.008]

88 Our model relies on the principle of reward maximization to explain how an animal’s decisions are influenced by changes in prior probability. [sent-250, score-0.407]

89 Specifically, the model predicts that the optimal policy π ∗ is determined by the ratio (RP − RN )/RS and the prior probability Pr(dR ). [sent-252, score-0.331]

90 Since the reward parameters in our model represent internal reward, our model also provides a bridge to study the relationship between physical reward and subjective reward. [sent-254, score-0.372]

91 In our model of the random dots discrimination task, belief is expressed in terms of a piecewise normal distribution with the domain of the hidden variable x ∈ (−∞, ∞). [sent-255, score-0.582]

92 The belief in our model can be expressed by any distribution, even a non-parametric one, as long as the observation model provides a faithful representation of the stimuli and captures the essential relationship between the stimuli and the hidden world state. [sent-259, score-0.396]

93 The POMDP model provides a unifying framework for a variety of perceptual decision making tasks. [sent-260, score-0.36]

94 Our model is suitable for decision tasks with time-varying state and observations that are time-dependent within a trial (as long as they are conditionally independent given the time-varying hidden state sequence). [sent-264, score-0.495]

95 Integration of reinforcement learning and optimal decision making theories of the basal ganglia. [sent-312, score-0.335]

96 The cost of accumulating evidence in perceptual decision making. [sent-330, score-0.27]

97 The relative influences of priors and sensory evidence on an oculomotor decision variable during perceptual learning. [sent-357, score-0.307]

98 Elapsed decision time affects the weighting of prior probability in a perceptual decision task. [sent-369, score-0.582]

99 Response of neurons in the lateral intraparietal area during a combined visual discrimination reaction time task. [sent-376, score-0.356]

100 The diffusion decision model: Theory and data for two-choice decision tasks. [sent-391, score-0.448]


similar papers computed by tfidf model

tfidf for this paper:

wordName wordTfidf (topN-words)

[('pr', 0.328), ('animal', 0.313), ('bt', 0.3), ('dr', 0.268), ('reaction', 0.259), ('decision', 0.204), ('pomdp', 0.179), ('dots', 0.168), ('carpenter', 0.162), ('reward', 0.159), ('ar', 0.145), ('belief', 0.137), ('psychometric', 0.136), ('rnrs', 0.134), ('policy', 0.13), ('eye', 0.109), ('prior', 0.108), ('saccadic', 0.102), ('discrimination', 0.097), ('offset', 0.097), ('st', 0.095), ('dl', 0.089), ('tnd', 0.089), ('al', 0.084), ('movement', 0.084), ('chronometric', 0.084), ('rightward', 0.082), ('motion', 0.075), ('hidden', 0.073), ('neutral', 0.07), ('action', 0.07), ('coherence', 0.069), ('trial', 0.067), ('stimuli', 0.066), ('perceptual', 0.066), ('intercept', 0.064), ('making', 0.063), ('rs', 0.062), ('williams', 0.061), ('rp', 0.061), ('reciprocal', 0.06), ('direction', 0.058), ('collapsing', 0.058), ('latency', 0.056), ('monkey', 0.051), ('biased', 0.05), ('boundary', 0.05), ('piecewise', 0.048), ('state', 0.047), ('emission', 0.047), ('task', 0.046), ('behavioral', 0.044), ('bogacz', 0.044), ('explain', 0.043), ('drift', 0.042), ('monkeys', 0.041), ('lh', 0.041), ('leftward', 0.041), ('diffusion', 0.04), ('receives', 0.039), ('observable', 0.039), ('decisions', 0.039), ('normative', 0.038), ('hanks', 0.038), ('straight', 0.038), ('sensory', 0.037), ('elapsed', 0.036), ('stimulus', 0.036), ('panels', 0.036), ('reinforcement', 0.036), ('pomdps', 0.035), ('zt', 0.035), ('predicts', 0.034), ('rn', 0.034), ('predictions', 0.034), ('drugowitsch', 0.033), ('frazier', 0.033), ('kdc', 0.033), ('urgency', 0.033), ('gure', 0.033), ('human', 0.033), ('optimal', 0.032), ('blue', 0.032), ('normal', 0.032), ('solid', 0.032), ('lines', 0.032), ('dayan', 0.031), ('chooses', 0.031), ('maximization', 0.031), ('tasks', 0.03), ('additive', 0.03), ('daw', 0.03), ('mazurek', 0.03), ('curves', 0.029), ('brain', 0.029), ('actions', 0.028), ('neurosci', 0.028), ('uences', 0.028), ('neuroscience', 0.027), ('model', 0.027)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 1.0000008 153 nips-2012-How Prior Probability Influences Decision Making: A Unifying Probabilistic Model

Author: Yanping Huang, Timothy Hanks, Mike Shadlen, Abram L. Friesen, Rajesh P. Rao

Abstract: How does the brain combine prior knowledge with sensory evidence when making decisions under uncertainty? Two competing descriptive models have been proposed based on experimental data. The first posits an additive offset to a decision variable, implying a static effect of the prior. However, this model is inconsistent with recent data from a motion discrimination task involving temporal integration of uncertain sensory evidence. To explain this data, a second model has been proposed which assumes a time-varying influence of the prior. Here we present a normative model of decision making that incorporates prior knowledge in a principled way. We show that the additive offset model and the time-varying prior model emerge naturally when decision making is viewed within the framework of partially observable Markov decision processes (POMDPs). Decision making in the model reduces to (1) computing beliefs given observations and prior information in a Bayesian manner, and (2) selecting actions based on these beliefs to maximize the expected sum of future rewards. We show that the model can explain both data previously explained using the additive offset model as well as more recent data on the time-varying influence of prior knowledge on decision making. 1

2 0.14237027 162 nips-2012-Inverse Reinforcement Learning through Structured Classification

Author: Edouard Klein, Matthieu Geist, Bilal Piot, Olivier Pietquin

Abstract: This paper adresses the inverse reinforcement learning (IRL) problem, that is inferring a reward for which a demonstrated expert behavior is optimal. We introduce a new algorithm, SCIRL, whose principle is to use the so-called feature expectation of the expert as the parameterization of the score function of a multiclass classifier. This approach produces a reward function for which the expert policy is provably near-optimal. Contrary to most of existing IRL algorithms, SCIRL does not require solving the direct RL problem. Moreover, with an appropriate heuristic, it can succeed with only trajectories sampled according to the expert behavior. This is illustrated on a car driving simulator. 1

3 0.14219473 173 nips-2012-Learned Prioritization for Trading Off Accuracy and Speed

Author: Jiarong Jiang, Adam Teichert, Jason Eisner, Hal Daume

Abstract: Users want inference to be both fast and accurate, but quality often comes at the cost of speed. The field has experimented with approximate inference algorithms that make different speed-accuracy tradeoffs (for particular problems and datasets). We aim to explore this space automatically, focusing here on the case of agenda-based syntactic parsing [12]. Unfortunately, off-the-shelf reinforcement learning techniques fail to learn good policies: the state space is simply too large to explore naively. An attempt to counteract this by applying imitation learning algorithms also fails: the “teacher” follows a far better policy than anything in our learner’s policy space, free of the speed-accuracy tradeoff that arises when oracle information is unavailable, and thus largely insensitive to the known reward functfion. We propose a hybrid reinforcement/apprenticeship learning algorithm that learns to speed up an initial policy, trading off accuracy for speed according to various settings of a speed term in the loss function. 1

4 0.13669738 88 nips-2012-Cost-Sensitive Exploration in Bayesian Reinforcement Learning

Author: Dongho Kim, Kee-eung Kim, Pascal Poupart

Abstract: In this paper, we consider Bayesian reinforcement learning (BRL) where actions incur costs in addition to rewards, and thus exploration has to be constrained in terms of the expected total cost while learning to maximize the expected longterm total reward. In order to formalize cost-sensitive exploration, we use the constrained Markov decision process (CMDP) as the model of the environment, in which we can naturally encode exploration requirements using the cost function. We extend BEETLE, a model-based BRL method, for learning in the environment with cost constraints. We demonstrate the cost-sensitive exploration behaviour in a number of simulated problems. 1

5 0.12898174 245 nips-2012-Nonparametric Bayesian Inverse Reinforcement Learning for Multiple Reward Functions

Author: Jaedeug Choi, Kee-eung Kim

Abstract: We present a nonparametric Bayesian approach to inverse reinforcement learning (IRL) for multiple reward functions. Most previous IRL algorithms assume that the behaviour data is obtained from an agent who is optimizing a single reward function, but this assumption is hard to guarantee in practice. Our approach is based on integrating the Dirichlet process mixture model into Bayesian IRL. We provide an efficient Metropolis-Hastings sampling algorithm utilizing the gradient of the posterior to estimate the underlying reward functions, and demonstrate that our approach outperforms previous ones via experiments on a number of problem domains. 1

6 0.1228067 38 nips-2012-Algorithms for Learning Markov Field Policies

7 0.11668558 238 nips-2012-Neurally Plausible Reinforcement Learning of Working Memory Tasks

8 0.11641524 203 nips-2012-Locating Changes in Highly Dependent Data with Unknown Number of Change Points

9 0.11625334 97 nips-2012-Diffusion Decision Making for Adaptive k-Nearest Neighbor Classification

10 0.11545683 51 nips-2012-Bayesian Hierarchical Reinforcement Learning

11 0.11104196 151 nips-2012-High-Order Multi-Task Feature Learning to Identify Longitudinal Phenotypic Markers for Alzheimer's Disease Progression Prediction

12 0.11089411 292 nips-2012-Regularized Off-Policy TD-Learning

13 0.11036023 299 nips-2012-Scalable imputation of genetic data with a discrete fragmentation-coagulation process

14 0.10989828 122 nips-2012-Exploration in Model-based Reinforcement Learning by Empirically Estimating Learning Progress

15 0.10969132 344 nips-2012-Timely Object Recognition

16 0.10590564 348 nips-2012-Tractable Objectives for Robust Policy Optimization

17 0.10311259 114 nips-2012-Efficient coding provides a direct link between prior and likelihood in perceptual Bayesian inference

18 0.1005915 41 nips-2012-Ancestor Sampling for Particle Gibbs

19 0.099776633 108 nips-2012-Efficient Bayes-Adaptive Reinforcement Learning using Sample-Based Search

20 0.098179415 3 nips-2012-A Bayesian Approach for Policy Learning from Trajectory Preference Queries


similar papers computed by lsi model

lsi for this paper:

topicId topicWeight

[(0, 0.196), (1, -0.239), (2, -0.063), (3, 0.05), (4, -0.083), (5, 0.043), (6, 0.014), (7, 0.01), (8, -0.003), (9, 0.028), (10, -0.037), (11, -0.064), (12, 0.038), (13, 0.006), (14, -0.011), (15, -0.046), (16, -0.003), (17, 0.009), (18, -0.024), (19, -0.036), (20, 0.044), (21, 0.106), (22, -0.151), (23, -0.014), (24, 0.057), (25, -0.087), (26, 0.04), (27, 0.032), (28, 0.159), (29, -0.047), (30, 0.059), (31, 0.031), (32, 0.054), (33, -0.032), (34, 0.057), (35, 0.126), (36, 0.042), (37, -0.11), (38, 0.003), (39, 0.026), (40, -0.088), (41, -0.043), (42, 0.089), (43, -0.088), (44, 0.002), (45, 0.034), (46, -0.054), (47, -0.145), (48, -0.162), (49, 0.072)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.96211696 153 nips-2012-How Prior Probability Influences Decision Making: A Unifying Probabilistic Model

Author: Yanping Huang, Timothy Hanks, Mike Shadlen, Abram L. Friesen, Rajesh P. Rao

Abstract: How does the brain combine prior knowledge with sensory evidence when making decisions under uncertainty? Two competing descriptive models have been proposed based on experimental data. The first posits an additive offset to a decision variable, implying a static effect of the prior. However, this model is inconsistent with recent data from a motion discrimination task involving temporal integration of uncertain sensory evidence. To explain this data, a second model has been proposed which assumes a time-varying influence of the prior. Here we present a normative model of decision making that incorporates prior knowledge in a principled way. We show that the additive offset model and the time-varying prior model emerge naturally when decision making is viewed within the framework of partially observable Markov decision processes (POMDPs). Decision making in the model reduces to (1) computing beliefs given observations and prior information in a Bayesian manner, and (2) selecting actions based on these beliefs to maximize the expected sum of future rewards. We show that the model can explain both data previously explained using the additive offset model as well as more recent data on the time-varying influence of prior knowledge on decision making. 1

2 0.66348869 51 nips-2012-Bayesian Hierarchical Reinforcement Learning

Author: Feng Cao, Soumya Ray

Abstract: We describe an approach to incorporating Bayesian priors in the MAXQ framework for hierarchical reinforcement learning (HRL). We define priors on the primitive environment model and on task pseudo-rewards. Since models for composite tasks can be complex, we use a mixed model-based/model-free learning approach to find an optimal hierarchical policy. We show empirically that (i) our approach results in improved convergence over non-Bayesian baselines, (ii) using both task hierarchies and Bayesian priors is better than either alone, (iii) taking advantage of the task hierarchy reduces the computational cost of Bayesian reinforcement learning and (iv) in this framework, task pseudo-rewards can be learned instead of being manually specified, leading to hierarchically optimal rather than recursively optimal policies. 1

3 0.65581447 353 nips-2012-Transferring Expectations in Model-based Reinforcement Learning

Author: Trung Nguyen, Tomi Silander, Tze Y. Leong

Abstract: We study how to automatically select and adapt multiple abstractions or representations of the world to support model-based reinforcement learning. We address the challenges of transfer learning in heterogeneous environments with varying tasks. We present an efficient, online framework that, through a sequence of tasks, learns a set of relevant representations to be used in future tasks. Without predefined mapping strategies, we introduce a general approach to support transfer learning across different state spaces. We demonstrate the potential impact of our system through improved jumpstart and faster convergence to near optimum policy in two benchmark domains. 1

4 0.62038267 88 nips-2012-Cost-Sensitive Exploration in Bayesian Reinforcement Learning

Author: Dongho Kim, Kee-eung Kim, Pascal Poupart

Abstract: In this paper, we consider Bayesian reinforcement learning (BRL) where actions incur costs in addition to rewards, and thus exploration has to be constrained in terms of the expected total cost while learning to maximize the expected longterm total reward. In order to formalize cost-sensitive exploration, we use the constrained Markov decision process (CMDP) as the model of the environment, in which we can naturally encode exploration requirements using the cost function. We extend BEETLE, a model-based BRL method, for learning in the environment with cost constraints. We demonstrate the cost-sensitive exploration behaviour in a number of simulated problems. 1

5 0.59815061 122 nips-2012-Exploration in Model-based Reinforcement Learning by Empirically Estimating Learning Progress

Author: Manuel Lopes, Tobias Lang, Marc Toussaint, Pierre-yves Oudeyer

Abstract: Formal exploration approaches in model-based reinforcement learning estimate the accuracy of the currently learned model without consideration of the empirical prediction error. For example, PAC-MDP approaches such as R- MAX base their model certainty on the amount of collected data, while Bayesian approaches assume a prior over the transition dynamics. We propose extensions to such approaches which drive exploration solely based on empirical estimates of the learner’s accuracy and learning progress. We provide a “sanity check” theoretical analysis, discussing the behavior of our extensions in the standard stationary finite state-action case. We then provide experimental studies demonstrating the robustness of these exploration measures in cases of non-stationary environments or where original approaches are misled by wrong domain assumptions. 1

6 0.56106937 108 nips-2012-Efficient Bayes-Adaptive Reinforcement Learning using Sample-Based Search

7 0.54579711 299 nips-2012-Scalable imputation of genetic data with a discrete fragmentation-coagulation process

8 0.53622693 245 nips-2012-Nonparametric Bayesian Inverse Reinforcement Learning for Multiple Reward Functions

9 0.53371036 151 nips-2012-High-Order Multi-Task Feature Learning to Identify Longitudinal Phenotypic Markers for Alzheimer's Disease Progression Prediction

10 0.5188722 2 nips-2012-3D Social Saliency from Head-mounted Cameras

11 0.51723975 203 nips-2012-Locating Changes in Highly Dependent Data with Unknown Number of Change Points

12 0.51578814 173 nips-2012-Learned Prioritization for Trading Off Accuracy and Speed

13 0.49737519 183 nips-2012-Learning Partially Observable Models Using Temporally Abstract Decision Trees

14 0.49684817 31 nips-2012-Action-Model Based Multi-agent Plan Recognition

15 0.4936423 250 nips-2012-On-line Reinforcement Learning Using Incremental Kernel-Based Stochastic Factorization

16 0.47265574 110 nips-2012-Efficient Reinforcement Learning for High Dimensional Linear Quadratic Systems

17 0.46741831 350 nips-2012-Trajectory-Based Short-Sighted Probabilistic Planning

18 0.46633756 45 nips-2012-Approximating Equilibria in Sequential Auctions with Incomplete Information and Multi-Unit Demand

19 0.4602887 288 nips-2012-Rational inference of relative preferences

20 0.45047736 238 nips-2012-Neurally Plausible Reinforcement Learning of Working Memory Tasks


similar papers computed by lda model

lda for this paper:

topicId topicWeight

[(0, 0.028), (11, 0.014), (17, 0.05), (21, 0.033), (29, 0.198), (38, 0.106), (42, 0.039), (53, 0.012), (54, 0.088), (55, 0.028), (74, 0.043), (76, 0.146), (80, 0.076), (92, 0.046)]

similar papers list:

simIndex simValue paperId paperTitle

1 0.85016561 93 nips-2012-Deep Spatio-Temporal Architectures and Learning for Protein Structure Prediction

Author: Pietro D. Lena, Ken Nagata, Pierre F. Baldi

Abstract: Residue-residue contact prediction is a fundamental problem in protein structure prediction. Hower, despite considerable research efforts, contact prediction methods are still largely unreliable. Here we introduce a novel deep machine-learning architecture which consists of a multidimensional stack of learning modules. For contact prediction, the idea is implemented as a three-dimensional stack of Neural Networks NNk , where i and j index the spatial coordinates of the contact ij map and k indexes “time”. The temporal dimension is introduced to capture the fact that protein folding is not an instantaneous process, but rather a progressive refinement. Networks at level k in the stack can be trained in supervised fashion to refine the predictions produced by the previous level, hence addressing the problem of vanishing gradients, typical of deep architectures. Increased accuracy and generalization capabilities of this approach are established by rigorous comparison with other classical machine learning approaches for contact prediction. The deep approach leads to an accuracy for difficult long-range contacts of about 30%, roughly 10% above the state-of-the-art. Many variations in the architectures and the training algorithms are possible, leaving room for further improvements. Furthermore, the approach is applicable to other problems with strong underlying spatial and temporal components. 1

same-paper 2 0.83504361 153 nips-2012-How Prior Probability Influences Decision Making: A Unifying Probabilistic Model

Author: Yanping Huang, Timothy Hanks, Mike Shadlen, Abram L. Friesen, Rajesh P. Rao

Abstract: How does the brain combine prior knowledge with sensory evidence when making decisions under uncertainty? Two competing descriptive models have been proposed based on experimental data. The first posits an additive offset to a decision variable, implying a static effect of the prior. However, this model is inconsistent with recent data from a motion discrimination task involving temporal integration of uncertain sensory evidence. To explain this data, a second model has been proposed which assumes a time-varying influence of the prior. Here we present a normative model of decision making that incorporates prior knowledge in a principled way. We show that the additive offset model and the time-varying prior model emerge naturally when decision making is viewed within the framework of partially observable Markov decision processes (POMDPs). Decision making in the model reduces to (1) computing beliefs given observations and prior information in a Bayesian manner, and (2) selecting actions based on these beliefs to maximize the expected sum of future rewards. We show that the model can explain both data previously explained using the additive offset model as well as more recent data on the time-varying influence of prior knowledge on decision making. 1

3 0.73836315 287 nips-2012-Random function priors for exchangeable arrays with applications to graphs and relational data

Author: James Lloyd, Peter Orbanz, Zoubin Ghahramani, Daniel M. Roy

Abstract: A fundamental problem in the analysis of structured relational data like graphs, networks, databases, and matrices is to extract a summary of the common structure underlying relations between individual entities. Relational data are typically encoded in the form of arrays; invariance to the ordering of rows and columns corresponds to exchangeable arrays. Results in probability theory due to Aldous, Hoover and Kallenberg show that exchangeable arrays can be represented in terms of a random measurable function which constitutes the natural model parameter in a Bayesian model. We obtain a flexible yet simple Bayesian nonparametric model by placing a Gaussian process prior on the parameter function. Efficient inference utilises elliptical slice sampling combined with a random sparse approximation to the Gaussian process. We demonstrate applications of the model to network data and clarify its relation to models in the literature, several of which emerge as special cases. 1

4 0.73490673 177 nips-2012-Learning Invariant Representations of Molecules for Atomization Energy Prediction

Author: Grégoire Montavon, Katja Hansen, Siamac Fazli, Matthias Rupp, Franziska Biegler, Andreas Ziehe, Alexandre Tkatchenko, Anatole V. Lilienfeld, Klaus-Robert Müller

Abstract: The accurate prediction of molecular energetics in chemical compound space is a crucial ingredient for rational compound design. The inherently graph-like, non-vectorial nature of molecular data gives rise to a unique and difficult machine learning problem. In this paper, we adopt a learning-from-scratch approach where quantum-mechanical molecular energies are predicted directly from the raw molecular geometry. The study suggests a benefit from setting flexible priors and enforcing invariance stochastically rather than structurally. Our results improve the state-of-the-art by a factor of almost three, bringing statistical methods one step closer to chemical accuracy. 1

5 0.73370516 252 nips-2012-On Multilabel Classification and Ranking with Partial Feedback

Author: Claudio Gentile, Francesco Orabona

Abstract: We present a novel multilabel/ranking algorithm working in partial information settings. The algorithm is based on 2nd-order descent methods, and relies on upper-confidence bounds to trade-off exploration and exploitation. We analyze this algorithm in a partial adversarial setting, where covariates can be adversarial, but multilabel probabilities are ruled by (generalized) linear models. We show O(T 1/2 log T ) regret bounds, which improve in several ways on the existing results. We test the effectiveness of our upper-confidence scheme by contrasting against full-information baselines on real-world multilabel datasets, often obtaining comparable performance. 1

6 0.73295671 115 nips-2012-Efficient high dimensional maximum entropy modeling via symmetric partition functions

7 0.72882569 344 nips-2012-Timely Object Recognition

8 0.71961266 15 nips-2012-A Polylog Pivot Steps Simplex Algorithm for Classification

9 0.71787882 209 nips-2012-Max-Margin Structured Output Regression for Spatio-Temporal Action Localization

10 0.71608293 70 nips-2012-Clustering by Nonnegative Matrix Factorization Using Graph Random Walk

11 0.71421069 353 nips-2012-Transferring Expectations in Model-based Reinforcement Learning

12 0.71418303 38 nips-2012-Algorithms for Learning Markov Field Policies

13 0.71004164 259 nips-2012-Online Regret Bounds for Undiscounted Continuous Reinforcement Learning

14 0.70878643 173 nips-2012-Learned Prioritization for Trading Off Accuracy and Speed

15 0.70851529 162 nips-2012-Inverse Reinforcement Learning through Structured Classification

16 0.70711279 348 nips-2012-Tractable Objectives for Robust Policy Optimization

17 0.70658976 364 nips-2012-Weighted Likelihood Policy Search with Model Selection

18 0.70655096 108 nips-2012-Efficient Bayes-Adaptive Reinforcement Learning using Sample-Based Search

19 0.7063691 62 nips-2012-Burn-in, bias, and the rationality of anchoring

20 0.7063691 116 nips-2012-Emergence of Object-Selective Features in Unsupervised Feature Learning