nips nips2005 nips2005-87 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Deepak Verma, Rajesh P. Rao
Abstract: Humans are extremely adept at learning new skills by imitating the actions of others. A progression of imitative abilities has been observed in children, ranging from imitation of simple body movements to goal-based imitation based on inferring intent. In this paper, we show that the problem of goal-based imitation can be formulated as one of inferring goals and selecting actions using a learned probabilistic graphical model of the environment. We first describe algorithms for planning actions to achieve a goal state using probabilistic inference. We then describe how planning can be used to bootstrap the learning of goal-dependent policies by utilizing feedback from the environment. The resulting graphical model is then shown to be powerful enough to allow goal-based imitation. Using a simple maze navigation task, we illustrate how an agent can infer the goals of an observed teacher and imitate the teacher even when the goals are uncertain and the demonstration is incomplete.
Reference: text
sentIndex sentText sentNum sentScore
1 Abstract Humans are extremely adept at learning new skills by imitating the actions of others. [sent-9, score-0.255]
2 A progression of imitative abilities has been observed in children, ranging from imitation of simple body movements to goal-based imitation based on inferring intent. [sent-10, score-1.014]
3 In this paper, we show that the problem of goal-based imitation can be formulated as one of inferring goals and selecting actions using a learned probabilistic graphical model of the environment. [sent-11, score-1.027]
4 We first describe algorithms for planning actions to achieve a goal state using probabilistic inference. [sent-12, score-0.659]
5 We then describe how planning can be used to bootstrap the learning of goal-dependent policies by utilizing feedback from the environment. [sent-13, score-0.271]
6 Using a simple maze navigation task, we illustrate how an agent can infer the goals of an observed teacher and imitate the teacher even when the goals are uncertain and the demonstration is incomplete. [sent-15, score-1.69]
7 Research over the past decade has shown that even newborns can imitate simple body movements (such as facial actions) [1]. [sent-18, score-0.23]
8 While the neural mechanisms underlying imitation remain unclear, recent research has revealed the existence of “mirror neurons” in the primate brain which fire both when a monkey watches an action and when it performs the same action [2]. [sent-19, score-0.824]
9 The most sophisticated forms of imitation are those that require an ability to infer the underlying goals and intentions of a teacher. [sent-20, score-0.612]
10 In this case, the imitating agent attributes not only visible behaviors to others, but also utilizes the idea that others have internal mental states that underlie, predict, and generate these visible behaviors. [sent-21, score-0.314]
11 For example, infants that are about 18 months old can readily imitate actions on objects. [sent-22, score-0.508]
12 More interestingly, they can imitate this action even when the adult actor accidentally under- or overshot his target, or the hands slipped several times, leaving the goal-state unachieved (Fig. 1b). [sent-26, score-0.473]
13 They were thus presumably able to infer the actor’s goal, which remained unfulfilled, and imitate not the observed action but the intended one. [sent-28, score-0.459]
14 In this paper, we propose a model for intent inference and goal-based imitation that utilizes probabilistic inference over graphical models. [sent-29, score-0.932]
15 We first describe how the basic problems of planning an action sequence and learning policies (state to action mappings) can be solved through probabilistic inference. [sent-30, score-0.621]
16 We then illustrate the applicability of the learned graphical model to the problems of goal inference and imitation. [sent-31, score-0.499]
17 Imitation is achieved by using one’s learned policies to reach an inferred goal state. [sent-33, score-0.443]
18 Figure 1: Example of Goal-Based Imitation by Infants: (a) Infants as young as 14 months old can imitate actions on objects as seen on TV (from [4]). [sent-36, score-0.382]
19 Infants were subsequently able to correctly infer the intent of the actor and successfully complete the act (from [3]). [sent-38, score-0.259]
20 2 Graphical Models We first describe how graphical models can be used to plan action sequences and learn goal-based policies, which can subsequently be used for goal inference and imitation. [sent-39, score-0.652]
21 Let ΩS be the set of states in the environment, ΩA the set of all possible actions available to the agent, and ΩG the set of possible goals. [sent-40, score-0.237]
22 Each goal g represents a target state Goalg ∈ ΩS . [sent-42, score-0.292]
23 At time t the agent is in state st and executes action at . [sent-43, score-0.662]
24 gt represents the current goal that the agent is trying to reach at time t. [sent-44, score-0.591]
25 Executing the action at changes the agent’s state in a stochastic manner given by the transition probability P (st+1 | st , at ), which is assumed to be independent of t i. [sent-45, score-0.526]
26 Starting from an initial state s1 = s and a desired goal state g, planning involves computing a series of actions a1:T to reach the goal state, where T represents the maximum number of time steps allowed (the “episode length”). [sent-48, score-0.975]
27 Also, when obvious from the context, we use s for st = s, a for at = a and g for gt = g. [sent-51, score-0.264]
28 In the case where the state st is fully observed, we obtain the graphical model in Fig. 2a. [sent-52, score-0.402]
29 The agent needs to compute a stochastic policy π̂t(a | s, g) that maximizes the probability P(sT+1 = Goalg | st = s, gt = g). [sent-54, score-0.74]
30 For a large time horizon (T ≫ 1), the policy is independent of t, i.e., π̂t = π̂. [sent-55, score-0.252]
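For concreteness, a stationary goal-directed policy of this kind (and the sort of optimal deterministic reference policy used later for comparison, where it is computed via policy iteration) can be obtained by ordinary value iteration on a tabulated transition model. This is only a hedged sketch and not the planning-by-inference route the paper takes; the dictionary layout T[(s, a)] -> {s2: prob}, the discount factor, and the function name are assumptions.

```python
def value_iteration_policy(T, states, actions, goal, discount=0.95, tol=1e-8):
    """Greedy policy for a single goal: the goal state is treated as absorbing
    with value 1, and discounting makes shorter routes to it preferable."""
    V = {s: 0.0 for s in states}
    V[goal] = 1.0
    while True:
        delta = 0.0
        for s in states:
            if s == goal:
                continue                     # absorbing goal keeps value 1
            best = max(discount * sum(p * V[s2] for s2, p in T[(s, a)].items())
                       for a in actions)
            delta = max(delta, abs(best - V[s]))
            V[s] = best
        if delta < tol:
            break
    return {s: max(actions, key=lambda a: sum(p * V[s2] for s2, p in T[(s, a)].items()))
            for s in states}
```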
31 A more realistic scenario is where the state st is hidden but some aspects of it are visible. [sent-58, score-0.252]
32 We additionally include a goal variable gt and a “reached” variable rt, resulting in the graphical model in Fig. 2b. [sent-64, score-0.455]
33 The goal variable gt represents the current goal the agent is trying to reach, while the variable rt is a boolean variable that assumes the value 1 whenever the current state equals the current goal state and 0 otherwise. [sent-66, score-1.233]
34 We use rt to help infer the shortest path to the goal state (given an upper bound T on path length); this is done by constraining the actions that can be selected once the goal state is reached (see next section). [sent-67, score-1.168]
35 Note that rt can also be used to model the switching of goal states (once a goal is reached) and to implement hierarchical extensions of the present model. [sent-68, score-0.62]
36 The current action at now depends not only on the current state but also on the current goal gt , and whether we have reached the goal (as indicated by rt ). [sent-69, score-1.065]
37 Each action takes the agent into the intended cell with a high probability. [sent-74, score-0.418]
38 This probability is governed by the noise parameter η, which is the probability that the agent will end up in one of the adjoining (non-wall) squares or remain in the same square. [sent-75, score-0.252]
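As an illustration of how such a noisy maze transition model might be tabulated, here is a minimal sketch. The maze layout, the action set, the exact way the noise mass η is spread over adjoining free squares, and all names are assumptions rather than the authors' implementation.

```python
import random

MAZE = ["#######",
        "#.....#",
        "#.###.#",
        "#.....#",
        "#######"]                      # '#' = wall, '.' = free cell (hypothetical layout)
ACTIONS = {"N": (-1, 0), "S": (1, 0), "E": (0, 1), "W": (0, -1), "STAY": (0, 0)}

def free_cells(maze):
    return [(r, c) for r, row in enumerate(maze)
            for c, ch in enumerate(row) if ch == "."]

def transition_model(maze, eta=0.1):
    """Tabulate P(s' | s, a): the intended cell with probability 1 - eta, the
    remaining mass eta spread over adjoining free squares and the current square."""
    T = {}
    for s in free_cells(maze):
        slips = [(s[0] + dr, s[1] + dc) for dr, dc in ACTIONS.values()
                 if maze[s[0] + dr][s[1] + dc] == "."]
        for a, (dr, dc) in ACTIONS.items():
            intended = (s[0] + dr, s[1] + dc)
            if maze[intended[0]][intended[1]] == "#":
                intended = s             # bumping into a wall leaves the agent in place
            others = [c for c in slips if c != intended]
            dist = {intended: 1.0 - eta} if others else {intended: 1.0}
            for c in others:
                dist[c] = dist.get(c, 0.0) + eta / len(others)
            T[(s, a)] = dist
    return T

def sample_step(T, s, a, rng=random):
    cells, probs = zip(*T[(s, a)].items())
    return rng.choices(cells, weights=probs, k=1)[0]
```

With T = transition_model(MAZE), T[((1, 1), "E")] would give the successor distribution for moving east from the top-left free cell.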
39 The problem of planning can then be stated as follows: Given a goal state g, an initial state s, and number of time steps T, what is the sequence of actions a1:T that maximizes the probability of reaching the goal state? [sent-81, score-0.941]
40 We compute these actions using the most probable explanation (MPE) method, a standard routine in graphical model packages (see [7] for an alternate approach). [sent-82, score-0.33]
41 Using the graphical model in Fig. 2b, we obtain: ā1:T, s̄2:T+1, ḡ1:T, r̄1:T = argmax P(a1:T, s2:T, g1:T, r1:T | s1 = s, sT+1 = Goalg)  (1). When using the MPE method, the “reached” variable rt can be used to compute the shortest path to the goal. [sent-84, score-0.261]
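The paper obtains such plans from the MPE routine of a graphical-model package; as a self-contained stand-in, the brute-force max-product sketch below scores every length-T action sequence by its most probable trajectory ending at the goal and keeps the best. It ignores the action priors of the full model, reuses the T and ACTIONS structures assumed in the maze sketch above, and is only feasible for very small horizons.

```python
from itertools import product

def best_trajectory_scores(T, s1, action_seq):
    """Max-product over state trajectories for a fixed action sequence:
    for each possible final state, the highest trajectory probability."""
    best = {s1: 1.0}
    for a in action_seq:
        nxt = {}
        for s, p in best.items():
            for s2, q in T[(s, a)].items():
                if p * q > nxt.get(s2, 0.0):
                    nxt[s2] = p * q
        best = nxt
    return best

def mpe_plan(T, start, goal, horizon):
    """Exhaustive MPE-style planner: the action sequence whose most probable
    trajectory ends in the goal with the highest score (O(|A|^horizon))."""
    best_seq, best_score = None, 0.0
    for seq in product(ACTIONS, repeat=horizon):
        score = best_trajectory_scores(T, start, seq).get(goal, 0.0)
        if score > best_score:
            best_seq, best_score = list(seq), score
    return best_seq, best_score
```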
42 This breaks the isomorphism of the MPE action sequences with respect to the stayput action, i. [sent-86, score-0.242]
43 Thus, the stayput action is discouraged unless the agent has reached the goal. [sent-89, score-0.524]
44 This technique is quite general, in the sense that we can always augment ΩA with a no-op action and use this technique based on rt to push the no-op actions to the end of a T -length action sequence for a pre-chosen upper bound T . [sent-90, score-0.658]
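A sketch of how the reached variable rt could gate the action distribution so that no-op actions are pushed to the tail of an MPE plan; the near-deterministic gating and the uniform prior before the goal is reached are illustrative assumptions.

```python
def action_cpd(actions, noop="STAY", eps=1e-6):
    """P(a | s, g, r): once r = 1 essentially only the no-op is allowed, so any
    MPE plan of length T places its useful actions as early as possible."""
    n = len(actions)
    def prob(a, s, g, r):
        if r == 1:
            return 1.0 - eps * (n - 1) if a == noop else eps
        return 1.0 / n            # uniform over actions before the goal is reached
    return prob
```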
45 2 Policy Learning using Planning. Executing a plan in a noisy environment may not always result in the goal state being reached. [sent-99, score-0.462]
46 However, in the instances where a goal state is indeed reached, the executed action sequence can be used to bootstrap the learning of an optimal policy π̂(a | s, g), which represents the probability for action a in state s when the goal state to be reached is g. [sent-100, score-1.475]
47 We define optimality in terms of reaching the goal using the shortest path. [sent-101, score-0.252]
48 Note that the optimal policy may differ from the prior P (a|s, g) which counts all actions executed in state s for goal g, regardless of whether the plan was successful. [sent-102, score-0.836]
49 The agent selects a random start state and a goal state (according to P(g1)), infers the MPE plan ā1:T using the current τ, executes it, and updates the frequency counts for τs′sa based on the observed st and st+1 for each at. [sent-106, score-0.955]
50 The policy π̂(a | s, g) is only updated (by updating the action frequencies) if the goal g was reached. [sent-107, score-0.625]
51 The plan is executed to record observations o2:T+1, which are then used to compute the MPE estimate for the hidden states: s̄o1:T+1, ḡ1:T, r̄1:T+1 = argmax P(s1:T+1, g1:T, r1:T+1 | o1:T+1, a1:T, gT+1 = g). [sent-112, score-0.22]
52 Results: Figure 4a shows the error in the learned transition model and policy as a function of the number of iterations of the algorithm. [sent-114, score-0.423]
53 Error in τs′sa was defined as the sum of squared differences between the learned and true transition parameters. [sent-115, score-0.222]
54 Error in the learned policy was defined as the number of disagreements between the optimal deterministic policy for each goal, computed via policy iteration, and argmaxₐ π̂(a | s, g), summed over all goals. [sent-116, score-0.815]
55 Algorithm 1 (Policy learning in an unknown environment): 1: Initialize transition model τs′sa, policy π(a | s, g), α, and numTrials. 11: If the plan was successful, update policy π̂(a | s, g) using ā1:T and s̄o1:T+1. [sent-124, score-0.33]
56 12: α = α × γ. 13: end for. [sent-125, score-0.492]
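Read together with the surviving steps of Algorithm 1, the bootstrapping loop can be pictured roughly as follows. This is a hedged sketch rather than the authors' code: the fully observed case is assumed, plan stands in for the MPE planner of steps 2-10 (so its exact interplay with the exploration parameter α is an assumption), and the uniform draws of start and goal states and the count-based updates are illustrative choices.

```python
import random
from collections import defaultdict

def learn_policy(env_step, plan, states, goals, T=12, num_trials=500,
                 alpha=1.0, gamma=0.99, rng=None):
    """Bootstrap a goal-conditioned policy from successful MPE plans.
    env_step(s, a) -> s' samples the (unknown) true environment;
    plan(tau, s, g, T, alpha) -> list of T actions (planner plus exploration)."""
    rng = rng or random.Random(0)
    trans_counts = defaultdict(lambda: defaultdict(int))   # (s, a) -> {s': count}
    policy_counts = defaultdict(lambda: defaultdict(int))  # (s, g) -> {a: count}

    def tau():
        # Maximum-likelihood estimate of P(s' | s, a) from the counts so far.
        return {(s, a): {s2: n / sum(cnt.values()) for s2, n in cnt.items()}
                for (s, a), cnt in trans_counts.items()}

    for _ in range(num_trials):
        s, g = rng.choice(states), rng.choice(goals)   # start and goal (uniform here)
        actions = plan(tau(), s, g, T, alpha)
        trajectory = [s]
        for a in actions:
            s2 = env_step(trajectory[-1], a)
            trans_counts[(trajectory[-1], a)][s2] += 1  # the model is always updated
            trajectory.append(s2)
        if trajectory[-1] == g:                         # the policy only on success
            for s_t, a_t in zip(trajectory[:-1], actions):
                policy_counts[(s_t, g)][a_t] += 1
        alpha *= gamma                                  # decay exploration
    return tau(), policy_counts
```

A greedy policy π̂(a | s, g) can then be read off by normalizing policy_counts[(s, g)].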
57 The policy error decreases only after the transition model error becomes significantly small because without an accurate estimate of τ , the MPE plan is typically incorrect and the agent rarely reaches the goal state, resulting in little or no learning of the policy. [sent-127, score-0.806]
58 4b shows the maximum probability action argmaxₐ π̂(a | s, g) learned for each state (maze location) for one of the goals. [sent-129, score-0.442]
59 It is clear that the optimal action has been learned by the algorithm for all locations to reach the given goal state. [sent-130, score-0.517]
60 The policy error decreases but does not reach zero because of perceptual ambiguity at certain locations such as corners, where two (or more) actions may have roughly equal probability (see Fig. 4b). [sent-133, score-0.522]
61 Figure 4: (a) Error in the learned transition model and policy w.r.t. the true transition model and optimal policy for the maze MDP. [sent-138, score-0.483]
62 (b) The optimal policy learned for one of the 3 goals. [sent-139, score-0.334]
63 The long arrows represent the maximum probability action while the short arrows show all the high probability actions when there is no clear winner. [sent-141, score-0.418]
64 4 Inferring Intent and Goal-Based Imitation. Consider a task where the agent gets observations o1:t from observing a teacher and seeks to imitate the teacher. [sent-142, score-0.826]
65 Also, for P(a | s, g, rt = 0), we use the policy π̂(a | s, g) learned as in the previous section. [sent-145, score-0.448]
66 The goal of the agent is to infer the intention of the teacher given a (possibly incomplete) demonstration and to reach the intended goal using its policy (which could be different from the teacher’s optimal policy). [sent-146, score-1.414]
67 Using the graphical model formulation the problem of goal inference reduces to finding the marginal P (gT | o1:t ), which can be efficiently computed using standard techniques such as belief propagation. [sent-147, score-0.417]
68 Imitation is accomplished by choosing the goal with the highest probability and executing actions to reach that goal. [sent-148, score-0.508]
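As a rough illustration of this two-step recipe, the sketch below replaces exact belief propagation with forward filtering over (state, goal) pairs and then imitates by committing to the most probable goal. The observation model obs_model(o, s), the teacher policy used in the prediction step, the uniform priors, and the callback names are all assumptions; it reuses the transition dictionary T and ACTIONS from the maze sketch above.

```python
def goal_posterior(T, obs_model, teacher_policy, observations, states, goals):
    """Forward filtering of P(s_t, g | o_{1:t}); P(g | o_{1:t}) is the marginal
    obtained by summing out the state."""
    belief = {(s, g): 1.0 / (len(states) * len(goals))   # uniform prior (assumed)
              for s in states for g in goals}
    for t, o in enumerate(observations):
        # Correct: weight by the likelihood of the current observation.
        belief = {(s, g): p * obs_model(o, s) for (s, g), p in belief.items()}
        z = sum(belief.values()) or 1.0
        belief = {sg: p / z for sg, p in belief.items()}
        if t + 1 < len(observations):
            # Predict: push the belief through the teacher's (assumed) policy and dynamics.
            nxt = {(s2, g): 0.0 for s2 in states for g in goals}
            for (s, g), p in belief.items():
                for a in ACTIONS:
                    pa = teacher_policy(a, s, g)
                    for s2, q in T[(s, a)].items():
                        nxt[(s2, g)] += p * pa * q
            belief = nxt
    marg = {g: sum(p for (s, gg), p in belief.items() if gg == g) for g in goals}
    z = sum(marg.values()) or 1.0
    return {g: p / z for g, p in marg.items()}

def imitate(posterior, own_policy, env_step, start, steps):
    """Goal-based imitation: commit to the most probable goal and follow the
    imitator's own learned policy towards it."""
    g_star = max(posterior, key=posterior.get)
    s, path = start, [start]
    for _ in range(steps):
        s = env_step(s, own_policy(s, g_star))
        path.append(s)
    return g_star, path
```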
69 5a shows the results of goal inference for the set of noisy teacher observations in Fig. 5b. [sent-150, score-0.752]
70 Note that the inferred goal probabilities correctly reflect the putative goal(s) of the teacher at each point in the teacher trajectory. [sent-153, score-1.025]
71 In addition, even though the teacher demonstration is incomplete, the imitator can perform goal-based imitation by inferring the teacher’s most likely goal as shown in Fig. [sent-154, score-1.177]
72 This mimics the results reported by [3] on intent inference by infants. [sent-156, score-0.215]
73 Figure 5: Goal Inference and Goal-Based Imitation: (a) shows the goal probabilities inferred at each time step t from teacher observations; panels (b) and (c) show the Teacher Observations and Imitator States. [sent-161, score-0.627]
74 (b) shows the teacher observations, which are noisy and include a detour while en route to the red goal. [sent-162, score-0.491]
75 The teacher demonstration is incomplete and stops short of the red goal. [sent-163, score-0.471]
76 (c) The imitator infers the most likely goal using (a) and performs goal-based imitation while avoiding the detour (the numbers t in a cell in (b) and (c) represent ot and st respectively). [sent-164, score-0.998]
77 5 Online Imitation with Uncertain Goals. Now consider a task where the goal is to imitate a teacher online (i.e., while the teacher demonstration is still in progress). [sent-165, score-0.788]
78 The teacher observations are assumed to be corrupted by noise and may include significant periods of occlusion where no data is available. [sent-168, score-0.481]
79 The graphical model framework provides an elegant solution to the problem of planning and selecting actions when observations are missing and only a probability distribution over goals is available. [sent-169, score-0.679]
80 The best current action can be picked using the marginal P(at | o1:t), which can be computed efficiently for the graphical model in Fig. [sent-170, score-0.393]
81 This marginal amounts to the policy for each goal weighted by the likelihood of that goal given past teacher observations, which corresponds to our intuition of how actions should be picked when goals are uncertain. [sent-174, score-1.357]
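That goal-weighted mixture can be sketched directly: weight each goal-conditioned policy by the current goal posterior at the agent's own estimated state ŝt and act greedily. The policy representation own_policy_dist(a, s, g) and the use of a point estimate for the agent's state are assumptions.

```python
def online_action(goal_post, own_policy_dist, s_hat, actions):
    """P(a_t | o_{1:t}) is approximated by sum_g own_policy_dist(a | s_hat, g) * P(g | o_{1:t});
    pick the action with the largest mixed probability."""
    mixed = {a: sum(pg * own_policy_dist(a, s_hat, g) for g, pg in goal_post.items())
             for a in actions}
    return max(mixed, key=mixed.get)
```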
82 6a shows the inferred distribution over goal states as the teacher follows a trajectory given by the noisy observations in Fig. 6b. [sent-176, score-0.814]
83 Although the goal is uncertain and certain portions of the teacher trajectory are occluded, the agent is still able to make progress towards regions most likely to contain any probable goal states and is able to “catch-up” with the teacher when observations become available again (Fig. 6c). [sent-179, score-0.926]
84 (Footnote 1: We simulated occlusion using a special observation symbol which carried no information about the current state, i.e., P(occluded | s) = ε for all s, with ε ≪ 1.) [sent-181, score-0.705]
85 Missing numbers indicate times at which the teacher was occluded. [sent-190, score-0.398]
86 (c) The agent is able to follow the teacher trajectory even when the teacher is occluded based on the evolving goal distribution in (a). [sent-191, score-1.255]
87 6 Conclusions. We have proposed a new model for intent inference and goal-based imitation based on probabilistic inference in graphical models. [sent-192, score-0.906]
88 The model assumes an initial learning phase where the agent explores the environment (cf. [sent-193, score-0.285]
89 body babbling in infants [8]) and learns a graphical model capturing the sensory consequences of motor actions. [sent-194, score-0.423]
90 The learned model is then used for planning action sequences to goal states and for learning policies. [sent-195, score-0.686]
91 The resulting graphical model then serves as a platform for intent inference and goal-based imitation. [sent-196, score-0.397]
92 It extends the approach of [7] from planning in a traditional state-action Markov model to a full-fledged graphical model involving states, actions, and goals with edges for capturing conditional distributions denoting policies. [sent-198, score-0.471]
93 The indicator variable rt used in our approach is similar to the ones used in some hierarchical graphical models [9, 10, 11]. [sent-199, score-0.281]
94 Several models of imitation have previously been proposed [12, 13, 14, 15, 16, 17]; these models are typically not probabilistic and have focused on trajectory following rather than intent inference and goal-based imitation. [sent-201, score-0.723]
95 The Bayesian model requires both a learned environment model and a learned policy. [sent-203, score-0.278]
96 In the case of the maze example, these were learned using a relatively small number of trials due to the small size of the state space. [sent-204, score-0.325]
97 The probabilistic model we have proposed also opens up the possibility of applying Bayesian methodologies such as manipulation of prior probabilities of task alternatives to obtain a deeper understanding of goal inference and imitation in humans. [sent-207, score-0.757]
98 We believe that the application of Bayesian techniques to imitation could shed new light on the problem of how infants acquire internal models of the people and objects they encounter in the world. [sent-215, score-0.553]
99 A Bayesian model of imitation in infants and robots. [sent-269, score-0.578]
100 Fixation behavior in observation and imitation of human movement. [sent-302, score-0.427]
wordName wordTfidf (topN-words)
[('imitation', 0.427), ('teacher', 0.398), ('policy', 0.252), ('agent', 0.196), ('goal', 0.191), ('action', 0.182), ('actions', 0.18), ('mpe', 0.174), ('imitate', 0.173), ('st', 0.151), ('planning', 0.149), ('maze', 0.142), ('intent', 0.139), ('infants', 0.126), ('graphical', 0.125), ('rt', 0.114), ('goals', 0.113), ('gt', 0.113), ('state', 0.101), ('reached', 0.086), ('learned', 0.082), ('actor', 0.08), ('imitator', 0.08), ('plan', 0.078), ('inference', 0.076), ('sa', 0.076), ('policies', 0.07), ('ot', 0.069), ('mdp', 0.068), ('environment', 0.064), ('transition', 0.064), ('reach', 0.062), ('shortest', 0.061), ('goalg', 0.06), ('stayput', 0.06), ('observations', 0.059), ('pomdp', 0.059), ('states', 0.057), ('imitative', 0.052), ('argmax', 0.049), ('executing', 0.047), ('uncertain', 0.045), ('demonstration', 0.044), ('trajectory', 0.043), ('rao', 0.042), ('hierarchical', 0.042), ('plans', 0.042), ('infer', 0.04), ('intended', 0.04), ('cse', 0.04), ('dautenhahn', 0.04), ('deepak', 0.04), ('deptt', 0.04), ('detour', 0.04), ('goalbased', 0.04), ('numt', 0.04), ('skills', 0.04), ('robots', 0.04), ('inferred', 0.038), ('adult', 0.038), ('probabilistic', 0.038), ('inferring', 0.037), ('path', 0.037), ('humanoid', 0.035), ('imitating', 0.035), ('capturing', 0.034), ('executed', 0.034), ('brain', 0.033), ('platform', 0.032), ('intentions', 0.032), ('meltzoff', 0.032), ('executes', 0.032), ('picked', 0.032), ('body', 0.031), ('robot', 0.031), ('months', 0.029), ('observability', 0.029), ('occluded', 0.029), ('pomdps', 0.029), ('prob', 0.029), ('incomplete', 0.029), ('current', 0.029), ('help', 0.029), ('noisy', 0.028), ('humans', 0.028), ('robotic', 0.028), ('navigation', 0.028), ('seattle', 0.028), ('probability', 0.028), ('mirror', 0.026), ('utilizes', 0.026), ('bootstrap', 0.026), ('utilizing', 0.026), ('facial', 0.026), ('online', 0.026), ('route', 0.025), ('model', 0.025), ('occlusion', 0.024), ('animals', 0.024), ('presumably', 0.024)]
simIndex simValue paperId paperTitle
same-paper 1 1.0000002 87 nips-2005-Goal-Based Imitation as Probabilistic Inference over Graphical Models
Author: Deepak Verma, Rajesh P. Rao
Abstract: Humans are extremely adept at learning new skills by imitating the actions of others. A progression of imitative abilities has been observed in children, ranging from imitation of simple body movements to goalbased imitation based on inferring intent. In this paper, we show that the problem of goal-based imitation can be formulated as one of inferring goals and selecting actions using a learned probabilistic graphical model of the environment. We first describe algorithms for planning actions to achieve a goal state using probabilistic inference. We then describe how planning can be used to bootstrap the learning of goal-dependent policies by utilizing feedback from the environment. The resulting graphical model is then shown to be powerful enough to allow goal-based imitation. Using a simple maze navigation task, we illustrate how an agent can infer the goals of an observed teacher and imitate the teacher even when the goals are uncertain and the demonstration is incomplete.
2 0.30859083 36 nips-2005-Bayesian models of human action understanding
Author: Chris Baker, Rebecca Saxe, Joshua B. Tenenbaum
Abstract: We present a Bayesian framework for explaining how people reason about and predict the actions of an intentional agent, based on observing its behavior. Action-understanding is cast as a problem of inverting a probabilistic generative model, which assumes that agents tend to act rationally in order to achieve their goals given the constraints of their environment. Working in a simple sprite-world domain, we show how this model can be used to infer the goal of an agent and predict how the agent will act in novel situations or when environmental constraints change. The model provides a qualitative account of several kinds of inferences that preverbal infants have been shown to perform, and also fits quantitative predictions that adult observers make in a new experiment.
3 0.24038965 78 nips-2005-From Weighted Classification to Policy Search
Author: Doron Blatt, Alfred O. Hero
Abstract: This paper proposes an algorithm to convert a T-stage stochastic decision problem with a continuous state space to a sequence of supervised learning problems. The optimization problem associated with the trajectory tree and random trajectory methods of Kearns, Mansour, and Ng, 2000, is solved using the Gauss-Seidel method. The algorithm breaks a multistage reinforcement learning problem into a sequence of single-stage reinforcement learning subproblems, each of which is solved via an exact reduction to a weighted-classification problem that can be solved using off-the-shelf methods. Thus the algorithm converts a reinforcement learning problem into simpler supervised learning subproblems. It is shown that the method converges in a finite number of steps to a solution that cannot be further improved by componentwise optimization. The implication of the proposed algorithm is that a plethora of classification methods can be applied to find policies in the reinforcement learning problem. 1
4 0.23925412 153 nips-2005-Policy-Gradient Methods for Planning
Author: Douglas Aberdeen
Abstract: Probabilistic temporal planning attempts to find good policies for acting in domains with concurrent durative tasks, multiple uncertain outcomes, and limited resources. These domains are typically modelled as Markov decision problems and solved using dynamic programming methods. This paper demonstrates the application of reinforcement learning — in the form of a policy-gradient method — to these domains. Our emphasis is large domains that are infeasible for dynamic programming. Our approach is to construct simple policies, or agents, for each planning task. The result is a general probabilistic temporal planner, named the Factored Policy-Gradient Planner (FPG-Planner), which can handle hundreds of tasks, optimising for probability of success, duration, and resource use. 1
5 0.23121838 144 nips-2005-Off-policy Learning with Options and Recognizers
Author: Doina Precup, Cosmin Paduraru, Anna Koop, Richard S. Sutton, Satinder P. Singh
Abstract: We introduce a new algorithm for off-policy temporal-difference learning with function approximation that has lower variance and requires less knowledge of the behavior policy than prior methods. We develop the notion of a recognizer, a filter on actions that distorts the behavior policy to produce a related target policy with low-variance importance-sampling corrections. We also consider target policies that are deviations from the state distribution of the behavior policy, such as potential temporally abstract options, which further reduces variance. This paper introduces recognizers and their potential advantages, then develops a full algorithm for linear function approximation and proves that its updates are in the same direction as on-policy TD updates, which implies asymptotic convergence. Even though our algorithm is based on importance sampling, we prove that it requires absolutely no knowledge of the behavior policy for the case of state-aggregation function approximators. Off-policy learning is learning about one way of behaving while actually behaving in another way. For example, Q-learning is an off- policy learning method because it learns about the optimal policy while taking actions in a more exploratory fashion, e.g., according to an ε-greedy policy. Off-policy learning is of interest because only one way of selecting actions can be used at any time, but we would like to learn about many different ways of behaving from the single resultant stream of experience. For example, the options framework for temporal abstraction involves considering a variety of different ways of selecting actions. For each such option one would like to learn a model of its possible outcomes suitable for planning and other uses. Such option models have been proposed as fundamental building blocks of grounded world knowledge (Sutton, Precup & Singh, 1999; Sutton, Rafols & Koop, 2005). Using off-policy learning, one would be able to learn predictive models for many options at the same time from a single stream of experience. Unfortunately, off-policy learning using temporal-difference methods has proven problematic when used in conjunction with function approximation. Function approximation is essential in order to handle the large state spaces that are inherent in many problem do- mains. Q-learning, for example, has been proven to converge to an optimal policy in the tabular case, but is unsound and may diverge in the case of linear function approximation (Baird, 1996). Precup, Sutton, and Dasgupta (2001) introduced and proved convergence for the first off-policy learning algorithm with linear function approximation. They addressed the problem of learning the expected value of a target policy based on experience generated using a different behavior policy. They used importance sampling techniques to reduce the off-policy case to the on-policy case, where existing convergence theorems apply (Tsitsiklis & Van Roy, 1997; Tadic, 2001). There are two important difficulties with that approach. First, the behavior policy needs to be stationary and known, because it is needed to compute the importance sampling corrections. Second, the importance sampling weights are often ill-conditioned. In the worst case, the variance could be infinite and convergence would not occur. The conditions required to prevent this were somewhat awkward and, even when they applied and asymptotic convergence was assured, the variance could still be high and convergence could be slow. 
In this paper we address both of these problems in the context of off-policy learning for options. We introduce the notion of a recognizer. Rather than specifying an explicit target policy (for instance, the policy of an option), about which we want to make predictions, a recognizer specifies a condition on the actions that are selected. For example, a recognizer for the temporally extended action of picking up a cup would not specify which hand is to be used, or what the motion should be at all different positions of the cup. The recognizer would recognize a whole variety of directions of motion and poses as part of picking the cup. The advantage of this strategy is not that one might prefer a multitude of different behaviors, but that the behavior may be based on a variety of different strategies, all of which are relevant, and we would like to learn from any of them. In general, a recognizer is a function that recognizes or accepts a space of different ways of behaving and thus, can learn from a wider range of data. Recognizers have two advantages over direct specification of a target policy: 1) they are a natural and easy way to specify a target policy for which importance sampling will be well conditioned, and 2) they do not require the behavior policy to be known. The latter is important because in many cases we may have little knowledge of the behavior policy, or a stationary behavior policy may not even exist. We show that for the case of state aggregation, even if the behavior policy is unknown, convergence to a good model is achieved. 1 Non-sequential example The benefits of using recognizers in off-policy learning can be most easily seen in a nonsequential context with a single continuous action. Suppose you are given a sequence of sample actions ai ∈ [0, 1], selected i.i.d. according to probability density b : [0, 1] → ℜ+ (the behavior density). For example, suppose the behavior density is of the oscillatory form shown as a red line in Figure 1. For each each action, ai , we observe a corresponding outcome, zi ∈ ℜ, a random variable whose distribution depends only on ai . Thus the behavior density induces an outcome density. The on-policy problem is to estimate the mean mb of the outcome density. This problem can be solved simply by averaging the sample outcomes: mb = (1/n) ∑n zi . The off-policy problem is to use this same data to learn what ˆ i=1 the mean would be if actions were selected in some way other than b, for example, if the actions were restricted to a designated range, such as between 0.7 and 0.9. There are two natural ways to pose this off-policy problem. The most straightforward way is to be equally interested in all actions within the designated region. One professes to be interested in actions selected according to a target density π : [0, 1] → ℜ+ , which in the example would be 5.0 between 0.7 and 0.9, and zero elsewhere, as in the dashed line in 12 Probability density functions 1.5 Target policy with recognizer 1 Target policy w/o recognizer without recognizer .5 Behavior policy 0 0 Action 0.7 Empirical variances (average of 200 sample variances) 0.9 1 0 10 with recognizer 100 200 300 400 500 Number of sample actions Figure 1: The left panel shows the behavior policy and the target policies for the formulations of the problem with and without recognizers. The right panel shows empirical estimates of the variances for the two formulations as a function of the number sample actions. 
The lowest line is for the formulation using empirically-estimated recognition probabilities. Figure 1 (left). The importance- sampling estimate of the mean outcome is 1 n π(ai ) mπ = ∑ ˆ zi . n i=1 b(ai ) (1) This approach is problematic if there are parts of the region of interest where the behavior density is zero or very nearly so, such as near 0.72 and 0.85 in the example. Here the importance sampling ratios are exceedingly large and the estimate is poorly conditioned (large variance). The upper curve in Figure 1 (right) shows the empirical variance of this estimate as a function of the number of samples. The spikes and uncertain decline of the empirical variance indicate that the distribution is very skewed and that the estimates are very poorly conditioned. The second way to pose the problem uses recognizers. One professes to be interested in actions to the extent that they are both selected by b and within the designated region. This leads to the target policy shown in blue in the left panel of Figure 1 (it is taller because it still must sum to 1). For this problem, the variance of (1) is much smaller, as shown in the lower two lines of Figure 1 (right). To make this way of posing the problem clear, we introduce the notion of a recognizer function c : A → ℜ+ . The action space in the example is A = [0, 1] and the recognizer is c(a) = 1 for a between 0.7 and 0.9 and is zero elsewhere. The target policy is defined in general by c(a)b(a) c(a)b(a) = . (2) π(a) = µ ∑x c(x)b(x) where µ = ∑x c(x)b(x) is a constant, equal to the probability of recognizing an action from the behavior policy. Given π, mπ from (1) can be rewritten in terms of the recognizer as ˆ n π(ai ) 1 n c(ai )b(ai ) 1 1 n c(ai ) 1 mπ = ∑ zi ˆ = ∑ zi = ∑ zi (3) n i=1 b(ai ) n i=1 µ b(ai ) n i=1 µ Note that the target density does not appear at all in the last expression and that the behavior distribution appears only in µ, which is independent of the sample action. If this constant is known, then this estimator can be computed with no knowledge of π or b. The constant µ can easily be estimated as the fraction of recognized actions in the sample. The lowest line in Figure 1 (right) shows the variance of the estimator using this fraction in place of the recognition probability. Its variance is low, no worse than that of the exact algorithm, and apparently slightly lower. Because this algorithm does not use the behavior density, it can be applied when the behavior density is unknown or does not even exist. For example, suppose actions were selected in some deterministic, systematic way that in the long run produced an empirical distribution like b. This would be problematic for the other algorithms but would require no modification of the recognition-fraction algorithm. 2 Recognizers improve conditioning of off-policy learning The main use of recognizers is in formulating a target density π about which we can successfully learn predictions, based on the current behavior being followed. Here we formalize this intuition. Theorem 1 Let A = {a1 , . . . ak } ⊆ A be a subset of all the possible actions. Consider a fixed behavior policy b and let πA be the class of policies that only choose actions from A, i.e., if π(a) > 0 then a ∈ A. Then the policy induced by b and the binary recognizer cA is the policy with minimum-variance one-step importance sampling corrections, among those in πA : π(ai ) 2 π as given by (2) = arg min Eb (4) π∈πA b(ai ) Proof: Denote π(ai ) = πi , b(ai ) = bi . 
Then the expected variance of the one-step importance sampling corrections is: Eb πi bi πi bi 2 2 − Eb = ∑ bi i πi bi 2 −1 = ∑ i π2 i − 1, bi where the summation (here and everywhere below) is such that the action ai ∈ A. We want to find πi that minimizes this expression, subject to the constraint that ∑i πi = 1. This is a constrained optimization problem. To solve it, we write down the corresponding Lagrangian: π2 L(πi , β) = ∑ i − 1 + β(∑ πi − 1) i i bi We take the partial derivatives wrt πi and β and set them to 0: βbi ∂L 2 = πi + β = 0 ⇒ πi = − ∂πi bi 2 (5) ∂L = πi − 1 = 0 ∂β ∑ i (6) By taking (5) and plugging into (6), we get the following expression for β: − β 2 bi = 1 ⇒ β = − 2∑ ∑i bi i By substituting β into (5) we obtain: πi = bi ∑i b i This is exactly the policy induced by the recognizer defined by c(ai ) = 1 iff ai ∈ A. We also note that it is advantageous, from the point of view of minimizing the variance of the updates, to have recognizers that accept a broad range of actions: Theorem 2 Consider two binary recognizers c1 and c2 , such that µ1 > µ2 . Then the importance sampling corrections for c1 have lower variance than the importance sampling corrections for c2 . Proof: From the previous theorem, we have the variance of a recognizer cA : Var = ∑ i π2 bi i −1 = ∑ bi ∑ j∈A b j i 2 1 1 1 −1 = −1 = −1 bi µ ∑ j∈A b j 3 Formal framework for sequential problems We turn now to the full case of learning about sequential decision processes with function approximation. We use the standard framework in which an agent interacts with a stochastic environment. At each time step t, the agent receives a state st and chooses an action at . We assume for the moment that actions are selected according to a fixed behavior policy, b : S × A → [0, 1] where b(s, a) is the probability of selecting action a in state s. The behavior policy is used to generate a sequence of experience (observations, actions and rewards). The goal is to learn, from this data, predictions about different ways of behaving. In this paper we focus on learning predictions about expected returns, but other predictions can be tackled as well (for instance, predictions of transition models for options (Sutton, Precup & Singh, 1999), or predictions specified by a TD-network (Sutton & Tanner, 2005; Sutton, Rafols & Koop, 2006)). We assume that the state space is large or continuous, and function approximation must be used to compute any values of interest. In particular, we assume a space of feature vectors Φ and a mapping φ : S → Φ. We denote by φs the feature vector associated with s. An option is defined as a triple o = I, π, β where I ⊆ S is the set of states in which the option can be initiated, π is the internal policy of the option and β : S → [0, 1] is a stochastic termination condition. In the option work (Sutton, Precup & Singh, 1999), each of these elements has to be explicitly specified and fixed in order for an option to be well defined. Here, we will instead define options implicitly, using the notion of a recognizer. A recognizer is defined as a function c : S × A → [0, 1], where c(s, a) indicates to what extent the recognizer allows action a in state s. An important special case, which we treat in this paper, is that of binary recognizers. In this case, c is an indicator function, specifying a subset of actions that are allowed, or recognized, given a particular state. Note that recognizers do not specify policies; instead, they merely give restrictions on the policies that are allowed or recognized. 
A recognizer c together with a behavior policy b generates a target policy π, where: b(s, a)c(s, a) b(s, a)c(s, a) π(s, a) = (7) = µ(s) ∑x b(s, x)c(s, x) The denominator of this fraction, µ(s) = ∑x b(s, x)c(s, x), is the recognition probability at s, i.e., the probability that an action will be accepted at s when behavior is generated according to b. The policy π is only defined at states for which µ(s) > 0. The numerator gives the probability that action a is produced by the behavior and recognized in s. Note that if the recognizer accepts all state-action pairs, i.e. c(s, a) = 1, ∀s, a, then π is the same as b. Since a recognizer and a behavior policy can specify together a target policy, we can use recognizers as a way to specify policies for options, using (7). An option can only be initiated at a state for which at least one action is recognized, so µ(s) > 0, ∀s ∈ I. Similarly, the termination condition of such an option, β, is defined as β(s) = 1 if µ(s) = 0. In other words, the option must terminate if no actions are recognized at a given state. At all other states, β can be defined between 0 and 1 as desired. We will focus on computing the reward model of an option o, which represents the expected total return. The expected values of different features at the end of the option can be estimated similarly. The quantity that we want to compute is Eo {R(s)} = E{r1 + r2 + . . . + rT |s0 = s, π, β} where s ∈ I, experience is generated according to the policy of the option, π, and T denotes the random variable representing the time step at which the option terminates according to β. We assume that linear function approximation is used to represent these values, i.e. Eo {R(s)} ≈ θT φs where θ is a vector of parameters. 4 Off-policy learning algorithm In this section we present an adaptation of the off-policy learning algorithm of Precup, Sutton & Dasgupta (2001) to the case of learning about options. Suppose that an option’s policy π was used to generate behavior. In this case, learning the reward model of the option is a special case of temporal-difference learning of value functions. The forward ¯ (n) view of this algorithm is as follows. Let Rt denote the truncated n-step return starting at ¯ (0) time step t and let yt denote the 0-step truncated return, Rt . By the definition of the n-step truncated return, we have: ¯ (n) ¯ (n−1) Rt = rt+1 + (1 − βt+1 )Rt+1 . This is similar to the case of value functions, but it accounts for the possibility of terminating the option at time step t + 1. The λ-return is defined in the usual way: ∞ ¯ (n) ¯ Rtλ = (1 − λ) ∑ λn−1 Rt . n=1 The parameters of the linear function approximator are updated on every time step proportionally to: ¯ ¯ ∆θt = Rtλ − yt ∇θ yt (1 − β1 ) · · · (1 − βt ). In our case, however, trajectories are generated according to the behavior policy b. The main idea of the algorithm is to use importance sampling corrections in order to account for the difference in the state distribution of the two policies. Let ρt = (n) Rt , π(st ,at ) b(st ,at ) be the importance sampling ratio at time step t. The truncated n-step return, satisfies: (n) (n−1) Rt = ρt [rt+1 + (1 − βt+1 )Rt+1 ]. The update to the parameter vector is proportional to: ∆θt = Rtλ − yt ∇θ yt ρ0 (1 − β1 ) · · · ρt−1 (1 − βt ). The following result shows that the expected updates of the on-policy and off-policy algorithms are the same. Theorem 3 For every time step t ≥ 0 and any initial state s, ¯ Eb [∆θt |s] = Eπ [∆θt |s]. 
(n) (n) ¯ Proof: First we will show by induction that Eb {Rt |s} = Eπ {Rt |s}, ∀n (which implies ¯ that Eb {Rtλ |s} = Eπ (Rtλ |s}). For n = 0, the statement is trivial. Assuming that it is true for n − 1, we have (n) Eb Rt |s = a ∑b(s, a)∑Pss ρ(s, a) a = s ∑∑ a Pss b(s, a) a s = a ∑π(s, a)∑Pss a (n−1) a rss + (1 − β(s ))Eb Rt+1 |s π(s, a) a ¯ (n−1) r + (1 − β(s ))Eπ Rt+1 |s b(s, a) ss a ¯ (n−1) rss + (1 − β(s ))Eπ Rt+1 |s ¯ (n) = Eπ Rt |s . s Now we are ready to prove the theorem’s main statement. Defining Ωt to be the set of all trajectory components up to state st , we have: Eb {∆θt |s} = ∑ ω∈Ωt Pb (ω|s)Eb (Rtλ − yt )∇θ yt |ω t−1 ∏ ρi (1 − βi+1 ) i=0 πi (1 − βi+1 ) i=0 bi t−1 = t−1 ∑ ∏ bi Psaiisi+1 ω∈Ωt Eb Rtλ |st − yt ∇θ yt ∏ i=0 t−1 = ∑ ∏ πi Psaiisi+1 ω∈Ωt = ∑ ω∈Ωt ¯ Eπ Rtλ |st − yt ∇θ yt (1 − β1 )...(1 − βt ) i=0 ¯ ¯ Pπ (ω|s)Eπ (Rtλ − yt )∇θ yt |ω (1 − β1 )...(1 − βt ) = Eπ ∆θt |s . Note that we are able to use st and ω interchangeably because of the Markov property. ¯ Since we have shown that Eb [∆θt |s] = Eπ [∆θt |s] for any state s, it follows that the expected updates will also be equal for any distribution of the initial state s. When learning the model of options with data generated from the behavior policy b, the starting state distribution with respect to which the learning is performed, I0 is determined by the stationary distribution of the behavior policy, as well as the initiation set of the option I. We note also that the importance sampling corrections only have to be performed for the trajectory since the initiation of the updates for the option. No corrections are required for the experience prior to this point. This should generate updates that have significantly lower variance than in the case of learning values of policies (Precup, Sutton & Dasgupta, 2001). Because of the termination condition of the option, β, ∆θ can quickly decay to zero. To avoid this problem, we can use a restart function g : S → [0, 1], such that g(st ) specifies the extent to which the updating episode is considered to start at time t. Adding restarts generates a new forward update: t ∆θt = (Rtλ − yt )∇θ yt ∑ gi ρi ...ρt−1 (1 − βi+1 )...(1 − βt ), (8) i=0 where Rtλ is the same as above. With an adaptation of the proof in Precup, Sutton & Dasgupta (2001), we can show that we get the same expected value of updates by applying this algorithm from the original starting distribution as we would by applying the algorithm without restarts from a starting distribution defined by I0 and g. We can turn this forward algorithm into an incremental, backward view algorithm in the following way: • Initialize k0 = g0 , e0 = k0 ∇θ y0 • At every time step t: δt = θt+1 = kt+1 = et+1 = ρt (rt+1 + (1 − βt+1 )yt+1 ) − yt θt + αδt et ρt kt (1 − βt+1 ) + gt+1 λρt (1 − βt+1 )et + kt+1 ∇θ yt+1 Using a similar technique to that of Precup, Sutton & Dasgupta (2001) and Sutton & Barto (1998), we can prove that the forward and backward algorithm are equivalent (omitted due to lack of space). This algorithm is guaranteed to converge if the variance of the updates is finite (Precup, Sutton & Dasgupta, 2001). In the case of options, the termination condition β can be used to ensure that this is the case. 5 Learning when the behavior policy is unknown In this section, we consider the case in which the behavior policy is unknown. This case is generally problematic for importance sampling algorithms, but the use of recognizers will allow us to define importance sampling corrections, as well as a convergent algorithm. 
Recall that when using a recognizer, the target policy of the option is defined as: c(s, a)b(s, a) π(s, a) = µ(s) and the recognition probability becomes: π(s, a) c(s, a) = b(s, a) µ(s) Of course, µ(s) depends on b. If b is unknown, instead of µ(s), we will use a maximum likelihood estimate µ : S → [0, 1]. The structure used to compute µ will have to be compatible ˆ ˆ with the feature space used to represent the reward model. We will make this more precise below. Likewise, the recognizer c(s, a) will have to be defined in terms of the features used to represent the model. We will then define the importance sampling corrections as: c(s, a) ˆ ρ(s, a) = µ(s) ˆ ρ(s, a) = We consider the case in which the function approximator used to model the option is actually a state aggregator. In this case, we will define recognizers which behave consistently in each partition, i.e., c(s, a) = c(p, a), ∀s ∈ p. This means that an action is either recognized or not recognized in all states of the partition. The recognition probability µ will have one ˆ entry for every partition p of the state space. Its value will be: N(p, c = 1) µ(p) = ˆ N(p) where N(p) is the number of times partition p was visited, and N(p, c = 1) is the number of times the action taken in p was recognized. In the limit, w.p.1, µ converges to ˆ ∑s d b (s|p) ∑a c(p, a)b(s, a) where d b (s|p) is the probability of visiting state s from partiˆ ˆ tion p under the stationary distribution of b. At this limit, π(s, a) = ρ(s, a)b(s, a) will be a ˆ well-defined policy (i.e., ∑a π(s, a) = 1). Using Theorem 3, off-policy updates using imˆ portance sampling corrections ρ will have the same expected value as on-policy updates ˆ ˆ using π. Note though that the learning algorithm never uses π; the only quantities needed ˆ are ρ, which are learned incrementally from data. For the case of general linear function approximation, we conjecture that a similar idea can be used, where the recognition probability is learned using logistic regression. The development of this part is left for future work. Acknowledgements The authors gratefully acknowledge the ideas and encouragement they have received in this work from Eddie Rafols, Mark Ring, Lihong Li and other members of the rlai.net group. We thank Csaba Szepesvari and the reviewers of the paper for constructive comments. This research was supported in part by iCore, NSERC, Alberta Ingenuity, and CFI. References Baird, L. C. (1995). Residual algorithms: Reinforcement learning with function approximation. In Proceedings of ICML. Precup, D., Sutton, R. S. and Dasgupta, S. (2001). Off-policy temporal-difference learning with function approximation. In Proceedings of ICML. Sutton, R.S., Precup D. and Singh, S (1999). Between MDPs and semi-MDPs: A framework for temporal abstraction in reinforcement learning. Artificial Intelligence, vol . 112, pp. 181–211. Sutton,, R.S. and Tanner, B. (2005). Temporal-difference networks. In Proceedings of NIPS-17. Sutton R.S., Raffols E. and Koop, A. (2006). Temporal abstraction in temporal-difference networks”. In Proceedings of NIPS-18. Tadic, V. (2001). On the convergence of temporal-difference learning with linear function approximation. In Machine learning vol. 42, pp. 241-267. Tsitsiklis, J. N., and Van Roy, B. (1997). An analysis of temporal-difference learning with function approximation. IEEE Transactions on Automatic Control 42:674–690.
6 0.22998586 145 nips-2005-On Local Rewards and Scaling Distributed Reinforcement Learning
7 0.17486045 187 nips-2005-Temporal Abstraction in Temporal-difference Networks
8 0.16339253 115 nips-2005-Learning Shared Latent Structure for Image Synthesis and Robotic Imitation
9 0.14694414 186 nips-2005-TD(0) Leads to Better Policies than Approximate Value Iteration
10 0.13766021 72 nips-2005-Fast Online Policy Gradient Learning with SMD Gain Vector Adaptation
11 0.13117436 142 nips-2005-Oblivious Equilibrium: A Mean Field Approximation for Large-Scale Dynamic Games
12 0.10153265 111 nips-2005-Learning Influence among Interacting Markov Chains
13 0.094450347 119 nips-2005-Learning to Control an Octopus Arm with Gaussian Process Temporal Difference Methods
14 0.091541909 53 nips-2005-Cyclic Equilibria in Markov Games
15 0.087692909 199 nips-2005-Value Function Approximation with Diffusion Wavelets and Laplacian Eigenfunctions
16 0.081895486 91 nips-2005-How fast to work: Response vigor, motivation and tonic dopamine
17 0.079471976 148 nips-2005-Online Discovery and Learning of Predictive State Representations
18 0.072315261 130 nips-2005-Modeling Neuronal Interactivity using Dynamic Bayesian Networks
19 0.067271575 45 nips-2005-Conditional Visual Tracking in Kernel Space
20 0.063131668 11 nips-2005-A Hierarchical Compositional System for Rapid Object Detection
topicId topicWeight
[(0, 0.197), (1, -0.054), (2, 0.525), (3, -0.072), (4, -0.047), (5, -0.049), (6, 0.098), (7, 0.015), (8, 0.016), (9, -0.077), (10, -0.018), (11, 0.017), (12, -0.007), (13, 0.044), (14, 0.019), (15, -0.045), (16, 0.034), (17, 0.008), (18, -0.037), (19, 0.122), (20, 0.07), (21, -0.021), (22, 0.056), (23, 0.01), (24, 0.036), (25, -0.085), (26, 0.015), (27, -0.064), (28, 0.042), (29, -0.005), (30, -0.023), (31, 0.047), (32, -0.01), (33, 0.002), (34, -0.055), (35, 0.015), (36, -0.007), (37, 0.05), (38, -0.009), (39, 0.04), (40, -0.004), (41, -0.064), (42, 0.101), (43, 0.014), (44, 0.011), (45, -0.004), (46, 0.036), (47, -0.007), (48, -0.032), (49, 0.053)]
simIndex simValue paperId paperTitle
same-paper 1 0.97079688 87 nips-2005-Goal-Based Imitation as Probabilistic Inference over Graphical Models
Author: Deepak Verma, Rajesh P. Rao
Abstract: Humans are extremely adept at learning new skills by imitating the actions of others. A progression of imitative abilities has been observed in children, ranging from imitation of simple body movements to goalbased imitation based on inferring intent. In this paper, we show that the problem of goal-based imitation can be formulated as one of inferring goals and selecting actions using a learned probabilistic graphical model of the environment. We first describe algorithms for planning actions to achieve a goal state using probabilistic inference. We then describe how planning can be used to bootstrap the learning of goal-dependent policies by utilizing feedback from the environment. The resulting graphical model is then shown to be powerful enough to allow goal-based imitation. Using a simple maze navigation task, we illustrate how an agent can infer the goals of an observed teacher and imitate the teacher even when the goals are uncertain and the demonstration is incomplete.
2 0.83694243 36 nips-2005-Bayesian models of human action understanding
Author: Chris Baker, Rebecca Saxe, Joshua B. Tenenbaum
Abstract: We present a Bayesian framework for explaining how people reason about and predict the actions of an intentional agent, based on observing its behavior. Action-understanding is cast as a problem of inverting a probabilistic generative model, which assumes that agents tend to act rationally in order to achieve their goals given the constraints of their environment. Working in a simple sprite-world domain, we show how this model can be used to infer the goal of an agent and predict how the agent will act in novel situations or when environmental constraints change. The model provides a qualitative account of several kinds of inferences that preverbal infants have been shown to perform, and also fits quantitative predictions that adult observers make in a new experiment.
3 0.8060773 145 nips-2005-On Local Rewards and Scaling Distributed Reinforcement Learning
Author: Drew Bagnell, Andrew Y. Ng
Abstract: We consider the scaling of the number of examples necessary to achieve good performance in distributed, cooperative, multi-agent reinforcement learning, as a function of the the number of agents n. We prove a worstcase lower bound showing that algorithms that rely solely on a global reward signal to learn policies confront a fundamental limit: They require a number of real-world examples that scales roughly linearly in the number of agents. For settings of interest with a very large number of agents, this is impractical. We demonstrate, however, that there is a class of algorithms that, by taking advantage of local reward signals in large distributed Markov Decision Processes, are able to ensure good performance with a number of samples that scales as O(log n). This makes them applicable even in settings with a very large number of agents n. 1
4 0.77382261 153 nips-2005-Policy-Gradient Methods for Planning
Author: Douglas Aberdeen
Abstract: Probabilistic temporal planning attempts to find good policies for acting in domains with concurrent durative tasks, multiple uncertain outcomes, and limited resources. These domains are typically modelled as Markov decision problems and solved using dynamic programming methods. This paper demonstrates the application of reinforcement learning — in the form of a policy-gradient method — to these domains. Our emphasis is large domains that are infeasible for dynamic programming. Our approach is to construct simple policies, or agents, for each planning task. The result is a general probabilistic temporal planner, named the Factored Policy-Gradient Planner (FPG-Planner), which can handle hundreds of tasks, optimising for probability of success, duration, and resource use. 1
5 0.71554798 142 nips-2005-Oblivious Equilibrium: A Mean Field Approximation for Large-Scale Dynamic Games
Author: Gabriel Y. Weintraub, Lanier Benkard, Benjamin Van Roy
Abstract: We propose a mean-field approximation that dramatically reduces the computational complexity of solving stochastic dynamic games. We provide conditions that guarantee our method approximates an equilibrium as the number of agents grow. We then derive a performance bound to assess how well the approximation performs for any given number of agents. We apply our method to an important class of problems in applied microeconomics. We show with numerical experiments that we are able to greatly expand the set of economic problems that can be analyzed computationally. 1
6 0.70925736 187 nips-2005-Temporal Abstraction in Temporal-difference Networks
7 0.65116894 144 nips-2005-Off-policy Learning with Options and Recognizers
8 0.58000755 119 nips-2005-Learning to Control an Octopus Arm with Gaussian Process Temporal Difference Methods
9 0.5713836 78 nips-2005-From Weighted Classification to Policy Search
10 0.50981385 72 nips-2005-Fast Online Policy Gradient Learning with SMD Gain Vector Adaptation
11 0.45271936 186 nips-2005-TD(0) Leads to Better Policies than Approximate Value Iteration
12 0.39072335 53 nips-2005-Cyclic Equilibria in Markov Games
13 0.3789247 91 nips-2005-How fast to work: Response vigor, motivation and tonic dopamine
14 0.37087023 68 nips-2005-Factorial Switching Kalman Filters for Condition Monitoring in Neonatal Intensive Care
15 0.36109406 111 nips-2005-Learning Influence among Interacting Markov Chains
16 0.30435249 115 nips-2005-Learning Shared Latent Structure for Image Synthesis and Robotic Imitation
17 0.28818974 64 nips-2005-Efficient estimation of hidden state dynamics from spike trains
18 0.287328 199 nips-2005-Value Function Approximation with Diffusion Wavelets and Laplacian Eigenfunctions
19 0.27786487 130 nips-2005-Modeling Neuronal Interactivity using Dynamic Bayesian Networks
20 0.27775848 45 nips-2005-Conditional Visual Tracking in Kernel Space
topicId topicWeight
[(3, 0.033), (10, 0.03), (27, 0.475), (31, 0.088), (34, 0.049), (39, 0.016), (41, 0.01), (44, 0.013), (55, 0.022), (69, 0.052), (73, 0.017), (88, 0.054), (91, 0.026), (92, 0.023)]
simIndex simValue paperId paperTitle
same-paper 1 0.9278394 87 nips-2005-Goal-Based Imitation as Probabilistic Inference over Graphical Models
Author: Deepak Verma, Rajesh P. Rao
Abstract: Humans are extremely adept at learning new skills by imitating the actions of others. A progression of imitative abilities has been observed in children, ranging from imitation of simple body movements to goalbased imitation based on inferring intent. In this paper, we show that the problem of goal-based imitation can be formulated as one of inferring goals and selecting actions using a learned probabilistic graphical model of the environment. We first describe algorithms for planning actions to achieve a goal state using probabilistic inference. We then describe how planning can be used to bootstrap the learning of goal-dependent policies by utilizing feedback from the environment. The resulting graphical model is then shown to be powerful enough to allow goal-based imitation. Using a simple maze navigation task, we illustrate how an agent can infer the goals of an observed teacher and imitate the teacher even when the goals are uncertain and the demonstration is incomplete.
2 0.91679126 185 nips-2005-Subsequence Kernels for Relation Extraction
Author: Raymond J. Mooney, Razvan C. Bunescu
Abstract: We present a new kernel method for extracting semantic relations between entities in natural language text, based on a generalization of subsequence kernels. This kernel uses three types of subsequence patterns that are typically employed in natural language to assert relationships between two entities. Experiments on extracting protein interactions from biomedical corpora and top-level relations from newspaper corpora demonstrate the advantages of this approach.
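For orientation, the sketch below implements the classic gapped string subsequence kernel (in the style of Lodhi et al., 2002) via its standard recursion; the paper's kernel generalises subsequence kernels to word sequences with richer features, which this toy version does not attempt.

def subseq_kernel(s, t, n, lam=0.5):
    # K_n(s, t): sum over common subsequences of length n, each weighted by
    # lam raised to the total length of the spans it occupies in s and t.
    def k_prime(s, t, i):
        if i == 0:
            return 1.0
        if len(s) < i or len(t) < i:
            return 0.0
        x = s[-1]
        total = lam * k_prime(s[:-1], t, i)
        for j, c in enumerate(t):
            if c == x:
                total += k_prime(s[:-1], t[:j], i - 1) * lam ** (len(t) - j + 1)
        return total

    def k(s, t, n):
        if len(s) < n or len(t) < n:
            return 0.0
        x = s[-1]
        total = k(s[:-1], t, n)
        for j, c in enumerate(t):
            if c == x:
                total += k_prime(s[:-1], t[:j], n - 1) * lam ** 2
        return total

    return k(s, t, n)

print(subseq_kernel("cat", "cart", 2))   # weights "ca", "ct", "at" by their spans

This exponential-time recursion only exposes the definition; practical implementations use the equivalent dynamic-programming formulation.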
3 0.9145717 101 nips-2005-Is Early Vision Optimized for Extracting Higher-order Dependencies?
Author: Yan Karklin, Michael S. Lewicki
Abstract: Linear implementations of the efficient coding hypothesis, such as independent component analysis (ICA) and sparse coding models, have provided functional explanations for properties of simple cells in V1 [1, 2]. These models, however, ignore the non-linear behavior of neurons and fail to match individual and population properties of neural receptive fields in subtle but important ways. Hierarchical models, including Gaussian Scale Mixtures [3, 4] and other generative statistical models [5, 6], can capture higher-order regularities in natural images and explain nonlinear aspects of neural processing such as normalization and context effects [6, 7]. Previously, the lower-level representation had been assumed to be independent of the hierarchy and was held fixed when training these models. Here we examine the optimal lower-level representations derived in the context of a hierarchical model and find that the resulting representations are strikingly different from those based on linear models. Unlike the basis functions and filters learned by ICA or sparse coding, these functions individually more closely resemble simple cell receptive fields and collectively span a broad range of spatial scales. Our work unifies several related approaches and observations about natural image structure and suggests that hierarchical models might yield better representations of image structure throughout the hierarchy.
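For reference, the generic Gaussian scale mixture form that this family of hierarchical models builds on can be written as follows (a sketch of the general construction only, not the paper's specific hierarchy):

% A coefficient u is a Gaussian variable g modulated by a non-negative scale
% (multiplier) variable lambda, so marginally u is sparse and heavy-tailed.
\[
  u = \lambda\, g, \qquad g \sim \mathcal{N}(0, 1), \quad \lambda \ge 0,
\]
\[
  p(u) = \int_0^{\infty} \mathcal{N}\!\left(u;\, 0,\, \lambda^{2}\right) p(\lambda)\, d\lambda .
\]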
4 0.76867294 155 nips-2005-Predicting EMG Data from M1 Neurons with Variational Bayesian Least Squares
Author: Jo-anne Ting, Aaron D'souza, Kenji Yamamoto, Toshinori Yoshioka, Donna Hoffman, Shinji Kakei, Lauren Sergio, John Kalaska, Mitsuo Kawato
Abstract: An increasing number of projects in neuroscience require the statistical analysis of high-dimensional data sets, as, for instance, in predicting behavior from neural firing or in operating artificial devices from brain recordings in brain-machine interfaces. Linear analysis techniques remain prevalent in such cases, but classical linear regression approaches are often numerically too fragile in high dimensions. In this paper, we address the question of whether EMG data collected from arm movements of monkeys can be faithfully reconstructed with linear approaches from neural activity in primary motor cortex (M1). To achieve robust data analysis, we develop a full Bayesian approach to linear regression that automatically detects and excludes irrelevant features in the data, regularizing against overfitting. In comparison with ordinary least squares, stepwise regression, partial least squares, LASSO regression and a brute-force combinatorial search for the most predictive input features in the data, we demonstrate that the new Bayesian method offers a superior mixture of characteristics in terms of regularization against overfitting, computational efficiency and ease of use, demonstrating its potential as a drop-in replacement for other linear regression techniques. As neuroscientific results, our analyses demonstrate that EMG data can be well predicted from M1 neurons, further opening the path for possible real-time interfaces between brains and machines.
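As a stand-in for the flavour of sparse Bayesian regression being described, the sketch below runs automatic relevance determination (ARD) with the standard evidence-approximation updates on synthetic data; it is not the paper's variational Bayesian least squares algorithm, and the dimensions, noise level, and iteration count are invented.

import numpy as np

# ARD linear regression: each feature gets its own prior precision alpha_i;
# irrelevant features are driven to large alpha_i and effectively pruned.
rng = np.random.default_rng(2)
n, d = 200, 10
X = rng.normal(size=(n, d))
w_true = np.zeros(d)
w_true[:3] = [1.5, -2.0, 0.5]                    # only three relevant features
y = X @ w_true + 0.1 * rng.normal(size=n)

alpha = np.ones(d)                               # per-feature prior precisions
beta = 1.0                                       # noise precision
for _ in range(50):
    Sigma = np.linalg.inv(np.diag(alpha) + beta * X.T @ X)   # posterior covariance
    mu = beta * Sigma @ X.T @ y                              # posterior mean
    gamma = 1.0 - alpha * np.diag(Sigma)         # effective number of parameters
    alpha = gamma / (mu ** 2 + 1e-12)            # re-estimate feature precisions
    resid = y - X @ mu
    beta = (n - gamma.sum()) / (resid @ resid)   # re-estimate noise precision

print("posterior mean weights:", np.round(mu, 2))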
5 0.57550156 36 nips-2005-Bayesian models of human action understanding
Author: Chris Baker, Rebecca Saxe, Joshua B. Tenenbaum
Abstract: We present a Bayesian framework for explaining how people reason about and predict the actions of an intentional agent, based on observing its behavior. Action-understanding is cast as a problem of inverting a probabilistic generative model, which assumes that agents tend to act rationally in order to achieve their goals given the constraints of their environment. Working in a simple sprite-world domain, we show how this model can be used to infer the goal of an agent and predict how the agent will act in novel situations or when environmental constraints change. The model provides a qualitative account of several kinds of inferences that preverbal infants have been shown to perform, and also fits quantitative predictions that adult observers make in a new experiment.
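A minimal sketch of the underlying Bayesian logic: posit a generative model in which the agent acts (softmax-)rationally toward its goal, then invert it to obtain a posterior over goals from an observed partial trajectory. The 1-D track, rationality parameter, and trajectory below are invented for illustration and are much simpler than the paper's sprite-world domain.

import numpy as np

# An agent on a 10-cell track moves left, stays, or moves right, preferring
# moves that reduce its distance to a hidden goal cell. Given a partial path,
# we score every candidate goal by the likelihood of the observed moves.
goals = np.arange(10)
beta = 2.0                                       # rationality (inverse temperature)

def move_probs(pos, goal):
    moves = np.array([-1, 0, 1])
    nxt = np.clip(pos + moves, 0, 9)
    utility = -np.abs(nxt - goal)                # closer to the goal is better
    p = np.exp(beta * utility)
    return moves, p / p.sum()

trajectory = [2, 3, 4, 5]                        # observed partial path (moving right)
log_post = np.zeros(len(goals))                  # uniform prior over goals
for g_idx, g in enumerate(goals):
    for pos, nxt in zip(trajectory[:-1], trajectory[1:]):
        moves, p = move_probs(pos, g)
        log_post[g_idx] += np.log(p[list(moves).index(nxt - pos)])

post = np.exp(log_post - log_post.max())
post /= post.sum()
print("posterior over goals:", np.round(post, 2))

Goals at or beyond the end of the observed path share most of the posterior mass, while goals behind the agent receive little probability, which is the kind of goal inference the framework formalizes.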
6 0.57447165 109 nips-2005-Learning Cue-Invariant Visual Responses
7 0.52710688 72 nips-2005-Fast Online Policy Gradient Learning with SMD Gain Vector Adaptation
8 0.52494228 100 nips-2005-Interpolating between types and tokens by estimating power-law generators
9 0.50226218 119 nips-2005-Learning to Control an Octopus Arm with Gaussian Process Temporal Difference Methods
10 0.49349377 153 nips-2005-Policy-Gradient Methods for Planning
11 0.48570567 203 nips-2005-Visual Encoding with Jittering Eyes
12 0.4544704 158 nips-2005-Products of ``Edge-perts
13 0.45395631 48 nips-2005-Context as Filtering
14 0.45227522 199 nips-2005-Value Function Approximation with Diffusion Wavelets and Laplacian Eigenfunctions
15 0.45181406 170 nips-2005-Scaling Laws in Natural Scenes and the Inference of 3D Shape
16 0.44998744 152 nips-2005-Phase Synchrony Rate for the Recognition of Motor Imagery in Brain-Computer Interface
17 0.44702694 173 nips-2005-Sensory Adaptation within a Bayesian Framework for Perception
18 0.44390389 169 nips-2005-Saliency Based on Information Maximization
19 0.43210611 111 nips-2005-Learning Influence among Interacting Markov Chains
20 0.4188616 35 nips-2005-Bayesian model learning in human visual perception