nips nips2005 nips2005-36 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Chris Baker, Rebecca Saxe, Joshua B. Tenenbaum
Abstract: We present a Bayesian framework for explaining how people reason about and predict the actions of an intentional agent, based on observing its behavior. Action-understanding is cast as a problem of inverting a probabilistic generative model, which assumes that agents tend to act rationally in order to achieve their goals given the constraints of their environment. Working in a simple sprite-world domain, we show how this model can be used to infer the goal of an agent and predict how the agent will act in novel situations or when environmental constraints change. The model provides a qualitative account of several kinds of inferences that preverbal infants have been shown to perform, and also fits quantitative predictions that adult observers make in a new experiment.
Reference: text
sentIndex sentText sentNum sentScore
1 edu Department of Brain and Cognitive Sciences Massachusetts Institute of Technology Abstract We present a Bayesian framework for explaining how people reason about and predict the actions of an intentional agent, based on observing its behavior. [sent-5, score-0.53]
2 Action-understanding is cast as a problem of inverting a probabilistic generative model, which assumes that agents tend to act rationally in order to achieve their goals given the constraints of their environment. [sent-6, score-0.415]
3 Working in a simple sprite-world domain, we show how this model can be used to infer the goal of an agent and predict how the agent will act in novel situations or when environmental constraints change. [sent-7, score-1.228]
4 The model provides a qualitative account of several kinds of inferences that preverbal infants have been shown to perform, and also fits quantitative predictions that adult observers make in a new experiment. [sent-8, score-0.562]
5 These explanations for the woman’s behavior derive from taking the intentional stance: treating her as a rational agent whose behavior is governed by beliefs, desires or other mental states that refer to objects, events, or states of the world [5]. [sent-15, score-1.279]
6 Both adults and infants have been shown to make robust and rapid intentional inferences about agents’ behavior, even from highly impoverished stimuli. [sent-16, score-0.801]
7 , circles) move in ways that convey a strong sense of agency to adults, and that lead to the formation of expectations consistent with goal-directed reasoning in infants [9, 8, 14]. [sent-19, score-0.355]
8 The importance of the intentional stance in interpreting everyday situations, together with its robust engagement even in preverbal infants and with highly simplified perceptual stimuli, suggest that it is a core capacity of human cognition. [sent-20, score-0.836]
9 In this paper we describe a computational framework for modeling intentional reasoning in adults and infants. [sent-21, score-0.59]
10 Interpreting an agent’s behavior via the intentional stance poses a highly underconstrained inference problem: there are typically many configurations of beliefs and desires consistent with any sequence of behavior. [sent-22, score-0.665]
11 We then model intentional reasoning as a Bayesian inference about these hidden variables given observed behavior sequences. [sent-24, score-0.558]
12 By analogy, our analysis of intentional reasoning might be called “inverse planning”, where the observer infers an agent’s intentions, given observations of the agent’s behavior, by inverting a model of how intentions cause behavior. [sent-26, score-0.635]
13 The intentional stance assumes that an agent’s actions depend causally on mental states via the principle of rationality: rational agents tend to act to achieve their desires as optimally as possible, given their beliefs. [sent-27, score-0.976]
14 The standards of “optimal plan” may vary with agent or circumstance: possibilities include achieving goals “as quickly as possible”, “as cheaply . [sent-29, score-0.761]
15 We first review several theoretical accounts of intentional reasoning from the cognitive science and artificial intelligence literatures, along with some motivating empirical findings. [sent-38, score-0.517]
16 2 Empirical studies of intentional reasoning in infants and adults 2. [sent-41, score-0.822]
17 1 Inferring an invariant goal The ability to predict how an agent’s behavior will adapt when environmental circumstances change, such as when an obstacle is inserted or removed, is a critical aspect of intentional reasoning. [sent-42, score-0.845]
18 In the “obstacle” condition, infants were habituated to a sprite (a colored circle) moving (“jumping”) in a curved path over an obstacle to reach another object. [sent-46, score-0.942]
19 The size of the obstacle varied across trials, but the sprite always followed a near-shortest path over the obstacle to reach the other object. [sent-47, score-0.761]
20 In the “no obstacle” group, infants were habituated to the sprite following the same curved “jumping” trajectory to the other object, but without an obstacle blocking its path. [sent-48, score-0.759]
21 Both groups were then presented with the same test conditions, in which the obstacle was placed out of the sprite’s way, and the sprite followed either the old, curved path or a new direct path to the other object. [sent-49, score-0.909]
22 Infants from the “obstacle” group looked longer at the sprite following the unobstructed curved path, which (in the test condition) was now far from the most efficient route to the other object. [sent-50, score-0.372]
23 That is, infants in the “obstacle” condition appeared to interpret the sprite as moving in a rational goal-directed fashion, with the other object as its goal. [sent-52, score-0.539]
24 They expected the sprite to plan a path to the goal that was maximally efficient, subject to environmental constraints when present. [sent-53, score-0.503]
25 Infants in the “no obstacle” group appeared more uncertain about whether the sprite’s movement was actually goal-directed or about what its goal was: was it simply to reach the other object, or something more complex, such as reaching the object via a particular curved path? [sent-54, score-0.257]
26 2 Inferring goals of varying complexity: rational means-ends analysis Gergely et al. [sent-56, score-0.339]
27 [6], expanding on work by Meltzoff [11], showed that infants can infer goals of varying complexity, again by interpreting agents’ behaviors as rational responses to environmental constraints. [sent-57, score-0.701]
28 In two conditions, infants saw an adult demonstrate an unfamiliar complex action: illuminating a light-box by pressing its top with her forehead. [sent-58, score-0.377]
29 Most infants in the “hands free” condition spontaneously performed the head-press action when shown the light-box one week later, but only a few infants in the “hands occupied” condition did so; the others illuminated the light-box simply by pressing it with their hands. [sent-63, score-0.672]
30 3 Inductive inference in intentional reasoning Gergely and colleagues interpret their findings as if infants are reasoning about intentional action in an almost logical fashion, deducing the goal of an agent from its observed behavior, the rationality principle, and other implicit premises. [sent-66, score-2.014]
31 However, from a computational point of view, it is surely oversimplified to think that the intentional stance could be implemented in a deductive system. [sent-67, score-0.5]
32 In contrast, our model posits that intentional reasoning is probabilistic. [sent-69, score-0.517]
33 People’s inferences about an agent’s goal should be graded, reflecting a tradeoff between the prior probability of a candidate goal and its likelihood in light of the agent’s observed behavior. [sent-70, score-0.306]
34 To test whether human intentional reasoning is consistent with a probabilistic account, it is necessary to collect data in greater quantities and with greater precision than infant studies allow. [sent-72, score-0.653]
35 (b) Test paths 1 2 3 # stimuli 4 (c) 1 2 Cond: simple Cond: complex 7 rating (both groups) simple Training paths complex Training cond (a) 5 3 1 1 2 3 # stimuli 4 Figure 1: (a) Training stimuli in complex and simple goal conditions. [sent-75, score-0.897]
36 Observers were told that the mouse had learned to follow a more-or-less direct path to the cheese, regardless of its starting location. [sent-82, score-0.316]
37 In another condition (“complex goal”), observers saw movements suggestive of a more complex, path-dependent goal for the mouse: it first ran directly to a particular location in the middle of the box (the “via-point”), and only then ran to the cheese. [sent-85, score-0.308]
38 Hence both the simple goal (“get to the cheese”) and complex goal (“get to the cheese via point X”) were logically possible interpretations in both conditions. [sent-90, score-0.309]
39 They were asked to rate the probability of the mouse taking one or the other test path using a 1-7 scale: 1 = definitely path 1, 7 = definitely path 2, with intermediate values expressing intermediate degrees of confidence. [sent-93, score-0.698]
40 Observers in the simple-goal condition always leaned towards path 1, the direct route that was consistent with the given prior knowledge. [sent-94, score-0.332]
41 Observers in the complex-goal condition initially leaned just as much towards path 1, but after seeing additional trajectories they became increasingly confident that the mouse would follow path 2 (Fig. [sent-95, score-0.617]
42 3 Previous models of intentional reasoning The above phenomena highlight two capacities than any model of intentional reasoning should capture. [sent-98, score-1.057]
43 Second, inferences about agents’ goals should be probabilistic, and be sensitive to both prior knowledge about likely goals as well as statistical evidence for more complex or less likely goals that better account for observed actions. [sent-100, score-0.935]
44 These two components are clearly not sufficient for a complete account of human intentional reasoning, but most previous accounts do not include even these capacities. [sent-101, score-0.422]
45 Gergely, Csibra and colleagues [7] have proposed an informal (noncomputational) model in which agents are essentially treated as rational planners, but inferences about agents’ goals are purely deductive, without a role for probabilistic expectations or gradations of confidence. [sent-102, score-0.605]
46 A more statistically sophisticated computational framework for inferring goals from behavior has been proposed by [13], but this approach does not incorporate planning capacities. [sent-103, score-0.366]
47 Perhaps closest to how people reason with the intentional stance are methods for inverse reinforcement learning (IRL) [12], or methods for learning an agent’s utility function [2]. [sent-110, score-0.564]
48 Both approaches assume a rational agent who maximizes expected utility, and attempt to infer the agent’s utility function from observations of its behavior. [sent-111, score-0.7]
49 However, the utility functions that people attribute to intentional agents are typically much more structured and constrained than in conventional IRL. [sent-112, score-0.595]
50 In the next section we describe a Bayesian framework for modeling intentional reasoning that is similar in spirit to IRL, but more focused on the kinds of goal structures that are cognitively natural to human adults and infants. [sent-114, score-0.708]
51 We distinguish between environmental state, denoted W , and agent states, denoted S. [sent-120, score-0.567]
52 For simplicity, we will assume that there is exactly one intentional agent in the world, and that the agent’s actions can only affect its own state s ∈ S. [sent-121, score-1.032]
53 Typically, observations of multiple state sequences of the agent are available, and in general each may occur in a separate environment. [sent-123, score-0.599]
54 Let As be the set of actions available to the agent from state s, and let C(a) be the cost to the agent of action a ∈ As . [sent-125, score-1.205]
55 Let P (st+1 |at , st , w) be the distribution over the agent’s next state st+1 , given the current state st , an action at ∈ Ast , and the environmental state w. [sent-126, score-0.461]
56 A simple desire is an end goal: a world state or class of states that the agent will act to bring about. [sent-130, score-0.674]
57 We specify a particular goal space G of simple and complex goals for sprite-worlds in the next subsection. [sent-132, score-0.387]
58 The agent draws goals g ∈ G from a prior distribution P (g|w1:N ), which constrains goals to be feasible in the environments w1:N from which observations of the agent’s behavior are available. [sent-133, score-1.079]
59 The value function can be defined in various ways depending on the domain, task, and agent type. [sent-135, score-0.495]
60 The agent is assumed to choose actions according to a probabilistic policy, with a preference for actions with greater expected increases in value. [sent-137, score-0.733]
61 (1) The parameter β controls how likely the agent is to select the most valuable action. [sent-140, score-0.519]
62 (2) 0:T 0:T We assume that state transition probabilities and action probabilities are conditionally independent given the agent’s goal g, the agent’s current state st , and the environment w. [sent-147, score-0.388]
63 The likelihood of a state sequence s0:T given a goal g and an environment w is computed by marginalizing over possible actions generating state transitions: P (s0:T |g, w) = T −1 t=0 at ∈Ast P (st+1 |at , st , w)P (at |st , g, w). [sent-148, score-0.406]
64 The size of the grid, the location of obstacles, and likely goal points (such as the location of the cheese in our experimental stimuli) are represented by W , and assumed to be known to both the agent and the observer. [sent-165, score-0.721]
65 The agent can also choose to remain still with cost 1. [sent-169, score-0.495]
66 We assume P (st+1 |at , st , w) takes the agent to the desired adjacent grid point deterministically. [sent-170, score-0.598]
67 The set of possible goals G includes both simple and complex goals. [sent-171, score-0.297]
68 Simple goals will just be specific end states in S. [sent-172, score-0.289]
69 While many kinds of complex goals are possible, we assume here that a complex goal is just the combination of a desired end state with a desired means to achieving that end. [sent-173, score-0.612]
70 In our sprite-worlds, we identify “desired means” with a constraint that the agent must pass through an additional specified location enroute, such as the viapoint in the experiment from §2. [sent-174, score-0.519]
71 Because the number of complex goals defined in this way is much larger than the number of simple goals, the likelihood of each complex goal is small relative to the likelihood of individual simple goals. [sent-176, score-0.452]
72 For simplicity, we assume that the agent draws just a single invariant goal g ∈ G from P (g|w1:N ), and we assume that this prior distribution is known to the observer. [sent-179, score-0.634]
73 We define the value of a state Vg,w (s) as the expected total cost to the agent of achieving g while following the policy given in Eq. [sent-181, score-0.63]
74 We assume the desired end-state is absorbing and cost-free, which implies that the agent attempts the stochastic shortest path (with respect to its probabilistic policy) [1]. [sent-183, score-0.754]
75 If g is a complex goal, Vg,w (s) is based on the stochastic shortest path through the specified via-point. [sent-184, score-0.275]
76 3(c), capture the qualitative results of these experiments, showing a large contrast between the straight path and the curved path in the condition with an obstacle, and a relatively small contrast in the condition with no obstacle. [sent-197, score-0.695]
77 In the “no obstacle” condition, our model infers that the agent has a more complex goal, constrained by a via-point. [sent-198, score-0.585]
78 This significantly increases the probability of the curved test path, to the point where the difference between the probability of observing curved and straight paths is negligible. [sent-199, score-0.501]
79 (b) (c) Test paths obst −log(P(Test|Cond)) Training paths no obst Training cond (a) 1 2 1 2 Test: straight Test: curved 20 15 10 5 0 Cond: obst Cond: no obst Figure 3: Inferring an invariant goal. [sent-200, score-1.116]
80 (a) Training input in obstacle and no obstacle conditions. [sent-201, score-0.446]
81 In the obstacle condition, a large dissociation is seen between path 1 and path 2, with path 1 being much more likely. [sent-204, score-0.801]
82 In the no obstacle condition, there is not a large preference for either path 1 or path 2, qualitatively matching Gergely et al. [sent-205, score-0.622]
83 2 Inferring goals of varying complexity: rational means-ends analysis Our next example is inspired by the studies of Gergely et al. [sent-208, score-0.339]
84 In our sprite-world version of the experiment, we varied the amount of evidence for a simple versus a complex goal, by inputting the same three trajectories with and without an obstacle present (Fig. [sent-211, score-0.338]
85 In the “obstacle” condition, the trajectories were all approximately shortest paths to the goal, because the agent was forced to take indirect paths around the obstacle. [sent-213, score-0.812]
86 Thus a more complex goal is inferred, with a path constrained to pass through a via-point. [sent-215, score-0.338]
87 4(b), the model shows a doubledissociation between the probability of the direct path and the curved path through the putative via-point, given each training condition (Fig. [sent-217, score-0.577]
88 (c) obst Test paths 1 2 3 1 2 −log(P(Test|Cond)) (b) Training paths no obst Training cond (a) Test: straight Test: curved 20 15 10 5 0 Cond: obst Cond: no obst Figure 4: Inferring goals of varying complexity. [sent-219, score-1.323]
89 (a) Training input in obstacle and no obstacle conditions. [sent-220, score-0.446]
90 This reflects a preference for the straight path in the first condition, where there is an obstacle to explain the agent’s deflections in the training input, and a preference for the curved path in the second condition, where a complex goal is inferred. [sent-223, score-1.003]
91 3 Inductive inference in intentional reasoning Lastly, we present the results of our model on our own behavioral experiment, first described in §2. [sent-225, score-0.517]
92 These data demonstrated the statistical nature of people’s intentional inferences. [sent-228, score-0.394]
93 5 compares people’s judgments of the probability that the agent takes a particular test path with our model’s predictions. [sent-230, score-0.758]
94 To place model predictions and human judgments on a comparable scale, we fit a sigmoidal psychometric transformation to the computed log posterior odds for the curved test path versus the straight path. [sent-231, score-0.542]
95 The Bayesian model captures the graded shift in people’s expectations in the “complex goal” condition, as evidence accumulates that the agent always seeks to pass through an arbitrary via-point enroute to the end state. [sent-232, score-0.572]
96 Our model captured basic qualitative inferences that even preverbal infants have been shown to perform, as well as more subtle quantitative inferences that adult observers made in a novel experiment. [sent-239, score-0.629]
97 Two future challenges for our computational framework are: representing and learning multiple agent types (e. [sent-240, score-0.495]
98 These extensions will allow us to further test the power of our computational framework, and will support its application to the wide range of intentional inferences that people constantly make in their everyday lives. [sent-244, score-0.587]
99 Taking the intentional stance at 12 months of a o age. [sent-295, score-0.475]
100 A Bayesian model of imitation in infants and robots. [sent-332, score-0.286]
wordName wordTfidf (topN-words)
[('agent', 0.495), ('intentional', 0.394), ('goals', 0.232), ('infants', 0.232), ('obstacle', 0.223), ('cond', 0.19), ('path', 0.183), ('gergely', 0.161), ('curved', 0.143), ('sprite', 0.132), ('reasoning', 0.123), ('paths', 0.12), ('obst', 0.117), ('agents', 0.112), ('rational', 0.107), ('mouse', 0.104), ('inferences', 0.102), ('actions', 0.09), ('goal', 0.09), ('observers', 0.083), ('stance', 0.081), ('st', 0.079), ('adults', 0.073), ('desires', 0.073), ('preverbal', 0.073), ('environmental', 0.072), ('action', 0.072), ('condition', 0.068), ('complex', 0.065), ('cheese', 0.064), ('rationality', 0.064), ('inferring', 0.062), ('csibra', 0.058), ('imitation', 0.054), ('beliefs', 0.053), ('state', 0.053), ('stimuli', 0.053), ('trajectories', 0.05), ('straight', 0.05), ('hands', 0.048), ('policy', 0.048), ('people', 0.046), ('act', 0.046), ('test', 0.045), ('observer', 0.045), ('saw', 0.043), ('utility', 0.043), ('environment', 0.041), ('behavior', 0.041), ('mental', 0.041), ('infant', 0.038), ('adult', 0.037), ('predictions', 0.035), ('judgments', 0.035), ('ratings', 0.035), ('achieving', 0.034), ('preference', 0.033), ('states', 0.032), ('planning', 0.031), ('environments', 0.03), ('infer', 0.03), ('bir', 0.029), ('dissociation', 0.029), ('enroute', 0.029), ('habituated', 0.029), ('irl', 0.029), ('jumping', 0.029), ('leaned', 0.029), ('subgoals', 0.029), ('told', 0.029), ('woman', 0.029), ('human', 0.028), ('interpreting', 0.028), ('displays', 0.028), ('route', 0.028), ('bayesian', 0.027), ('shortest', 0.027), ('colleagues', 0.027), ('sequences', 0.026), ('plan', 0.026), ('demonstrator', 0.025), ('deductive', 0.025), ('probabilistic', 0.025), ('invariant', 0.025), ('end', 0.025), ('observations', 0.025), ('infers', 0.025), ('likely', 0.024), ('location', 0.024), ('desired', 0.024), ('prior', 0.024), ('group', 0.024), ('ast', 0.023), ('rating', 0.023), ('capacities', 0.023), ('graded', 0.023), ('intentions', 0.023), ('odds', 0.023), ('underconstrained', 0.023), ('world', 0.023)]
simIndex simValue paperId paperTitle
same-paper 1 0.99999976 36 nips-2005-Bayesian models of human action understanding
Author: Chris Baker, Rebecca Saxe, Joshua B. Tenenbaum
Abstract: We present a Bayesian framework for explaining how people reason about and predict the actions of an intentional agent, based on observing its behavior. Action-understanding is cast as a problem of inverting a probabilistic generative model, which assumes that agents tend to act rationally in order to achieve their goals given the constraints of their environment. Working in a simple sprite-world domain, we show how this model can be used to infer the goal of an agent and predict how the agent will act in novel situations or when environmental constraints change. The model provides a qualitative account of several kinds of inferences that preverbal infants have been shown to perform, and also fits quantitative predictions that adult observers make in a new experiment.
2 0.30859083 87 nips-2005-Goal-Based Imitation as Probabilistic Inference over Graphical Models
Author: Deepak Verma, Rajesh P. Rao
Abstract: Humans are extremely adept at learning new skills by imitating the actions of others. A progression of imitative abilities has been observed in children, ranging from imitation of simple body movements to goalbased imitation based on inferring intent. In this paper, we show that the problem of goal-based imitation can be formulated as one of inferring goals and selecting actions using a learned probabilistic graphical model of the environment. We first describe algorithms for planning actions to achieve a goal state using probabilistic inference. We then describe how planning can be used to bootstrap the learning of goal-dependent policies by utilizing feedback from the environment. The resulting graphical model is then shown to be powerful enough to allow goal-based imitation. Using a simple maze navigation task, we illustrate how an agent can infer the goals of an observed teacher and imitate the teacher even when the goals are uncertain and the demonstration is incomplete.
3 0.2465772 187 nips-2005-Temporal Abstraction in Temporal-difference Networks
Author: Eddie Rafols, Anna Koop, Richard S. Sutton
Abstract: We present a generalization of temporal-difference networks to include temporally abstract options on the links of the question network. Temporal-difference (TD) networks have been proposed as a way of representing and learning a wide variety of predictions about the interaction between an agent and its environment. These predictions are compositional in that their targets are defined in terms of other predictions, and subjunctive in that that they are about what would happen if an action or sequence of actions were taken. In conventional TD networks, the inter-related predictions are at successive time steps and contingent on a single action; here we generalize them to accommodate extended time intervals and contingency on whole ways of behaving. Our generalization is based on the options framework for temporal abstraction. The primary contribution of this paper is to introduce a new algorithm for intra-option learning in TD networks with function approximation and eligibility traces. We present empirical examples of our algorithm’s effectiveness and of the greater representational expressiveness of temporallyabstract TD networks. The primary distinguishing feature of temporal-difference (TD) networks (Sutton & Tanner, 2005) is that they permit a general compositional specification of the goals of learning. The goals of learning are thought of as predictive questions being asked by the agent in the learning problem, such as “What will I see if I step forward and look right?” or “If I open the fridge, will I see a bottle of beer?” Seeing a bottle of beer is of course a complicated perceptual act. It might be thought of as obtaining a set of predictions about what would happen if certain reaching and grasping actions were taken, about what would happen if the bottle were opened and turned upside down, and of what the bottle would look like if viewed from various angles. To predict seeing a bottle of beer is thus to make a prediction about a set of other predictions. The target for the overall prediction is a composition in the mathematical sense of the first prediction with each of the other predictions. TD networks are the first framework for representing the goals of predictive learning in a compositional, machine-accessible form. Each node of a TD network represents an individual question—something to be predicted—and has associated with it a value representing an answer to the question—a prediction of that something. The questions are represented by a set of directed links between nodes. If node 1 is linked to node 2, then node 1 rep- resents a question incorporating node 2’s question; its value is a prediction about node 2’s prediction. Higher-level predictions can be composed in several ways from lower ones, producing a powerful, structured representation language for the targets of learning. The compositional structure is not just in a human designer’s head; it is expressed in the links and thus is accessible to the agent and its learning algorithm. The network of these links is referred to as the question network. An entirely separate set of directed links between the nodes is used to compute the values (predictions, answers) associated with each node. These links collectively are referred to as the answer network. The computation in the answer network is compositional in a conventional way—node values are computed from other node values. The essential insight of TD networks is that the notion of compositionality should apply to questions as well as to answers. A secondary distinguishing feature of TD networks is that the predictions (node values) at each moment in time can be used as a representation of the state of the world at that time. In this way they are an instance of the idea of predictive state representations (PSRs) introduced by Littman, Sutton and Singh (2002), Jaeger (2000), and Rivest and Schapire (1987). Representing a state by its predictions is a potentially powerful strategy for state abstraction (Rafols et al., 2005). We note that the questions used in all previous work with PSRs are defined in terms of concrete actions and observations, not other predictions. They are not compositional in the sense that TD-network questions are. The questions we have discussed so far are subjunctive, meaning that they are conditional on a certain way of behaving. We predict what we would see if we were to step forward and look right, or if we were to open the fridge. The questions in conventional TD networks are subjunctive, but they are conditional only on primitive actions or open-loop sequences of primitive actions (as are conventional PSRs). It is natural to generalize this, as we have in the informal examples above, to questions that are conditional on closed-loop temporally extended ways of behaving. For example, opening the fridge is a complex, high-level action. The arm must be lifted to the door, the hand shaped for grasping the handle, etc. To ask questions like “if I were to go to the coffee room, would I see John?” would require substantial temporal abstraction in addition to state abstraction. The options framework (Sutton, Precup & Singh, 1999) is a straightforward way of talking about temporally extended ways of behaving and about predictions of their outcomes. In this paper we extend the options framework so that it can be applied to TD networks. Significant extensions of the original options framework are needed. Novel features of our option-extended TD networks are that they 1) predict components of option outcomes rather than full outcome probability distributions, 2) learn according to the first intra-option method to use eligibility traces (see Sutton & Barto, 1998), and 3) include the possibility of options whose ‘policies’ are indifferent to which of several actions are selected. 1 The options framework In this section we present the essential elements of the options framework (Sutton, Precup & Singh, 1999) that we will need for our extension of TD networks. In this framework, an agent and an environment interact at discrete time steps t = 1, 2, 3.... In each state st ∈ S, the agent selects an action at ∈ A, determining the next state st+1 .1 An action is a way of behaving for one time step; the options framework lets us talk about temporally extended ways of behaving. An individual option consists of three parts. The first is the initiation set, I ⊂ S, the subset of states in which the option can be started. The second component of an option is its policy, π : S × A ⇒ [0, 1], specifying how the agent behaves when 1 Although the options framework includes rewards, we omit them here because we are concerned only with prediction, not control. following the option. Finally, a termination function, β : S × A ⇒ [0, 1], specifies how the option ends: β(s) denotes the probability of terminating when in state s. The option is thus completely and formally defined by the 3-tuple (I, π, β). 2 Conventional TD networks In this section we briefly present the details of the structure and the learning algorithm comprising TD networks as introduced by Sutton and Tanner (2005). TD networks address a prediction problem in which the agent may not have direct access to the state of the environment. Instead, at each time step the agent receives an observation ot ∈ O dependent on the state. The experience stream thus consists of a sequence of alternating actions and observations, o1 , a1 , o2 , a2 , o3 · · ·. The TD network consists of a set of nodes, each representing a single scalar prediction, interlinked by the question and answer networks as suggested previously. For a network 1 n of n nodes, the vector of all predictions at time step t is denoted yt = (yt , . . . , yt )T . The predictions are estimates of the expected value of some scalar quantity, typically of a bit, in which case they can be interpreted as estimates of probabilities. The predictions are updated at each time step according to a vector-valued function u with modifiable parameter W, which is often taken to be of a linear form: yt = u(yt−1 , at−1 , ot , Wt ) = σ(Wt xt ), (1) where xt ∈ m is an m-vector of features created from (yt−1 , at−1 , ot ), Wt is an n × m matrix (whose elements are sometimes referred to as weights), and σ is the n-vector 1 form of either the identity function or the S-shaped logistic function σ(s) = 1+e−s . The feature vector is an arbitrary vector-valued function of yt−1 , at−1 , and ot . For example, in the simplest case the feature vector is a unit basis vector with the location of the one communicating the current state. In a partially observable environment, the feature vector may be a combination of the agent’s action, observations, and predictions from the previous time step. The overall update u defines the answer network. The question network consists of a set of target functions, z i : O × n → , and condition i y functions, ci : A× n → [0, 1]n . We define zt = z i (ot+1 , ˜t+1 ) as the target for prediction i 2 i i yt . Similarly, we define ct = c (at , yt ) as the condition at time t. The learning algorithm ij for each component wt of Wt can then be written ij ij i i wt+1 = wt + α zt − yt ci t i ∂yt , (2) ij ∂wt where α is a positive step-size parameter. Note that the targets here are functions of the observation and predictions exactly one time step later, and that the conditions are functions of a single primitive action. This is what makes this algorithm suitable only for learning about one-step TD relationships. By chaining together multiple nodes, Sutton and Tanner (2005) used it to predict k steps ahead, for various particular values of k, and to predict the outcome of specific action sequences (as in PSRs, e.g., Littman et al., 2002; Singh et al., 2004). Now we consider the extension to temporally abstract actions. 3 Option-extended TD networks In this section we present our intra-option learning algorithm for TD networks with options and eligibility traces. As suggested earlier, each node’s outgoing link in the question 2 The quantity ˜ is almost the same as y, and we encourage the reader to think of them as identical y here. The difference is that ˜ is calculated by weights that are one step out of date as compared to y, y i.e., ˜t = u(yt−1 , at−1 , ot , Wt−1 ) (cf. equation 1). y network will now correspond to an option applying over possibly many steps. The policy of the ith node’s option corresponds to the condition function ci , which we think of as a recognizer for the option. It inspects each action taken to assess whether the option is being followed: ci = 1 if the agent is acting consistently with the option policy and ci = 0 othert t wise (intermediate values are also possible). When an agent ceases to act consistently with the option policy, we say that the option has diverged. The possibility of recognizing more than one action as consistent with the option is a significant generalization of the original idea of options. If no actions are recognized as acceptable in a state, then the option cannot be followed and thus cannot be initiated. Here we take the set of states with at least one recognized action to be the initiation set of the option. The option-termination function β generalizes naturally to TD networks. Each node i is i given a corresponding termination function, β i : O× n → [0, 1], where βt = β i (ot+1 , yt ) i is the probability of terminating at time t.3 βt = 1 indicates that the option has terminated i at time t; βt = 0 indicates that it has not, and intermediate values of β correspond to soft i or stochastic termination conditions. If an option terminates, then zt acts as the target, but if the option is ongoing without termination, then the node’s own next value, yt+1 , should ˜i be the target. The termination function specifies which of the two targets (or mixture of the two targets) is used to produce a form of TD error for each node i: i i i i i i δt = βt zt + (1 − βt )˜t+1 − yt . y (3) Our option-extended algorithm incorporates eligibility traces (see Sutton & Barto, 1998) as short-term memory variables organized in an n × m matrix E, paralleling the weight matrix. The traces are a record of the effect that each weight could have had on each node’s prediction during the time the agent has been acting consistently with the node’s option. The components eij of the eligibility matrix are updated by i eij = ci λeij (1 − βt ) + t t t−1 i ∂yt ij ∂wt , (4) where 0 ≤ λ ≤ 1 is the trace-decay parameter familiar from the TD(λ) learning algorithm. Because of the ci factor, all of a node’s traces will be immediately reset to zero whenever t the agent deviates from the node’s option’s policy. If the agent follows the policy and the option does not terminate, then the trace decays by λ and increments by the gradient in the way typical of eligibility traces. If the policy is followed and the option does terminate, then the trace will be reset to zero on the immediately following time step, and a new trace will start building. Finally, our algorithm updates the weights on each time step by ij ij i wt+1 = wt + α δt eij . t 4 (5) Fully observable experiment This experiment was designed to test the correctness of the algorithm in a simple gridworld where the environmental state is observable. We applied an options-extended TD network to the problem of learning to predict observations from interaction with the gridworld environment shown on the left in Figure 1. Empty squares indicate spaces where the agent can move freely, and colored squares (shown shaded in the figure) indicate walls. The agent is egocentric. At each time step the agent receives from the environment six bits representing the color it is facing (red, green, blue, orange, yellow, or white). In this first experiment we also provided 6 × 6 × 4 = 144 other bits directly indicating the complete state of the environment (square and orientation). 3 The fact that the option depends only on the current predictions, action, and observation means that we are considering only Markov options. Figure 1: The test world (left) and the question network (right) used in the experiments. The triangle in the world indicates the location and orientation of the agent. The walls are labeled R, O, Y, G, and B representing the colors red, orange, yellow, green and blue. Note that the left wall is mostly blue but partly green. The right diagram shows in full the portion of the question network corresponding to the red bit. This structure is repeated, but not shown, for the other four (non-white) colors. L, R, and F are primitive actions, and Forward and Wander are options. There are three possible actions: A ={F, R, L}. Actions were selected according to a fixed stochastic policy independent of the state. The probability of the F, L, and R actions were 0.5, 0.25, and 0.25 respectively. L and R cause the agent to rotate 90 degrees to the left or right. F causes the agent to move ahead one square with probability 1 − p and to stay in the same square with probability p. The probability p is called the slipping probability. If the forward movement would cause the agent to move into a wall, then the agent does not move. In this experiment, we used p = 0, p = 0.1, and p = 0.5. In addition to these primitive actions, we provided two temporally abstract options, Forward and Wander. The Forward option takes the action F in every state and terminates when the agent senses a wall (color) in front of it. The policy of the Wander option is the same as that actually followed by the agent. Wander terminates with probability 1 when a wall is sensed, and spontaneously with probability 0.5 otherwise. We used the question network shown on the right in Figure 1. The predictions of nodes 1, 2, and 3 are estimates of the probability that the red bit would be observed if the corresponding primitive action were taken. Node 4 is a prediction of whether the agent will see the red bit upon termination of the Wander option if it were taken. Node 5 predicts the probability of observing the red bit given that the Forward option is followed until termination. Nodes 6 and 7 represent predictions of the outcome of a primitive action followed by the Forward option. Nodes 8 and 9 take this one step further: they represent predictions of the red bit if the Forward option were followed to termination, then a primitive action were taken, and then the Forward option were followed again to termination. We applied our algorithm to learn the parameter W of the answer network for this question network. The step-size parameter α was 1.0, and the trace-decay parameter λ was 0.9. The initial W0 , E0 , and y0 were all 0. Each run began with the agent in the state indicated in Figure 1 (left). In this experiment σ(·) was the identity function. For each value of p, we ran 50 runs of 20,000 time steps. On each time step, the root-meansquared (RMS) error in each node’s prediction was computed and then averaged over all the nodes. The nodes corresponding to the Wander option were not included in the average because of the difficulty of calculating their correct predictions. This average was then 0.4 Fully Observable 0.4 RMS Error RMS Error p=0 0 0 Partially Observable p = 0.1 5000 p = 0.5 10000 15000 20000 Steps 0 0 100000 200000 Steps 300000 Figure 2: Learning curves in the fully-observable experiment for each slippage probability (left) and in the partially-observable experiment (right). itself averaged over the 50 runs and bins of 1,000 time steps to produce the learning curves shown on the left in Figure 2. For all slippage probabilities, the error in all predictions fell almost to zero. After approximately 12,000 trials, the agent made almost perfect predictions in all cases. Not surprisingly, learning was slower at the higher slippage probabilities. These results show that our augmented TD network is able to make a complete temporally-abstract model of this world. 5 Partially observable experiment In our second experiment, only the six color observation bits were available to the agent. This experiment provides a more challenging test of our algorithm. To model the environment well, the TD network must construct a representation of state from very sparse information. In fact, completely accurate prediction is not possible in this problem with our question network. In this experiment the input vector consisted of three groups of 46 components each, 138 in total. If the action was R, the first 46 components were set to the 40 node values and the six observation bits, and the other components were 0. If the action was L, the next group of 46 components was filled in in the same way, and the first and third groups were zero. If the action was F, the third group was filled. This technique enables the answer network as function approximator to represent a wider class of functions in a linear form than would otherwise be possible. In this experiment, σ(·) was the S-shaped logistic function. The slippage probability was p = 0.1. As our performance measure we used the RMS error, as in the first experiment, except that the predictions for the primitive actions (nodes 1-3) were not included. These predictions can never become completely accurate because the agent can’t tell in detail where it is located in the open space. As before, we averaged RMS error over 50 runs and 1,000 time step bins, to produce the learning curve shown on the right in Figure 2. As before, the RMS error approached zero. Node 5 in Figure 1 holds the prediction of red if the agent were to march forward to the wall ahead of it. Corresponding nodes in the other subnetworks hold the predictions of the other colors upon Forward. To make these predictions accurately, the agent must keep track of which wall it is facing, even if it is many steps away from it. It has to learn a sort of compass that it can keep updated as it turns in the middle of the space. Figure 3 is a demonstration of the compass learned after a representative run of 200,000 time steps. At the end of the run, the agent was driven manually to the state shown in the first row (relative time index t = 1). On steps 1-25 the agent was spun clockwise in place. The third column shows the prediction for node 5 in each portion of the question network. That is, the predictions shown are for each color-observation bit at termination of the Forward option. At t = 1, the agent is facing the orange wall and it predicts that the Forward option would result in seeing the orange bit and none other. Over steps 2-5 we see that the predictions are maintained accurately as the agent spins despite the fact that its observation bits remain the same. Even after spinning for 25 steps the agent knows exactly which way it is facing. While spinning, the agent correctly never predicts seeing the green bit (after Forward), but if it is driven up and turned, as in the last row of the figure, the green bit is accurately predicted. The fourth column shows the prediction for node 8 in each portion of the question network. Recall that these nodes correspond to the sequence Forward, L, Forward. At time t = 1, the agent accurately predicts that Forward will bring it to orange (third column) and also predicts that Forward, L, Forward will bring it to green. The predictions made for node 8 at each subsequent step of the sequence are also correct. These results show that the agent is able to accurately maintain its long term predictions without directly encountering sensory verification. How much larger would the TD network have to be to handle a 100x100 gridworld? The answer is not at all. The same question network applies to any size problem. If the layout of the colored walls remain the same, then even the answer network transfers across worlds of widely varying sizes. In other experiments, training on successively larger problems, we have shown that the same TD network as used here can learn to make all the long-term predictions correctly on a 100x100 version of the 6x6 gridworld used here. t y5 t st y8 t 1 1 O Y R BG O Y R BG O Y R BG O Y R BG O Y R BG O Y R BG O Y R BG O Y R BG O Y R BG O Y R BG O Y R BG O Y R BG O Y R BG O Y R BG 2 3 4 5 25 29 Figure 3: An illustration of part of what the agent learns in the partially observable environment. The second column is a sequence of states with (relative) time index as given by the first column. The sequence was generated by controlling the agent manually. On steps 1-25 the agent was spun clockwise in place, and the trajectory after that is shown by the line in the last state diagram. The third and fourth columns show the values of the nodes corresponding to 5 and 8 in Figure 1, one for each color-observation bit. 6 Conclusion Our experiments show that option-extended TD networks can learn effectively. They can learn facts about their environments that are not representable in conventional TD networks or in any other method for learning models of the world. One concern is that our intra-option learning algorithm is an off-policy learning method incorporating function approximation and bootstrapping (learning from predictions). The combination of these three is known to produce convergence problems for some methods (see Sutton & Barto, 1998), and they may arise here. A sound solution may require modifications to incorporate importance sampling (see Precup, Sutton & Dasgupta, 2001). In this paper we have considered only intra-option eligibility traces—traces extending over the time span within an option but not persisting across options. Tanner and Sutton (2005) have proposed a method for inter-option traces that could perhaps be combined with our intra-option traces. The primary contribution of this paper is the introduction of a new learning algorithm for TD networks that incorporates options and eligibility traces. Our experiments are small and do little more than exercise the learning algorithm, showing that it does not break immediately. More significant is the greater representational power of option-extended TD networks. Options are a general framework for temporal abstraction, predictive state representations are a promising strategy for state abstraction, and TD networks are able to represent compositional questions. The combination of these three is potentially very powerful and worthy of further study. Acknowledgments The authors gratefully acknowledge the ideas and encouragement they have received in this work from Mark Ring, Brian Tanner, Satinder Singh, Doina Precup, and all the members of the rlai.net group. References Jaeger, H. (2000). Observable operator models for discrete stochastic time series. Neural Computation, 12(6):1371-1398. MIT Press. Littman, M., Sutton, R. S., & Singh, S. (2002). Predictive representations of state. In T. G. Dietterich, S. Becker and Z. Ghahramani (eds.), Advances In Neural Information Processing Systems 14, pp. 1555-1561. MIT Press. Precup, D., Sutton, R. S., & Dasgupta, S. (2001). Off-policy temporal-difference learning with function approximation. In C. E. Brodley, A. P. Danyluk (eds.), Proceedings of the Eighteenth International Conference on Machine Learning, pp. 417-424. San Francisco, CA: Morgan Kaufmann. Rafols, E. J., Ring, M., Sutton, R.S., & Tanner, B. (2005). Using predictive representations to improve generalization in reinforcement learning. To appear in Proceedings of the Nineteenth International Joint Conference on Artificial Intelligence. Rivest, R. L., & Schapire, R. E. (1987). Diversity-based inference of finite automata. In Proceedings of the Twenty Eighth Annual Symposium on Foundations of Computer Science, (pp. 78–87). IEEE Computer Society. Singh, S., James, M. R., & Rudary, M. R. (2004). Predictive state representations: A new theory for modeling dynamical systems. In Uncertainty in Artificial Intelligence: Proceedings of the Twentieth Conference in Uncertainty in Artificial Intelligence, (pp. 512–519). AUAI Press. Sutton, R. S., & Barto, A. G. (1998). Reinforcement learning: An introduction. Cambridge, MA: MIT Press. Sutton, R. S., Precup, D., Singh, S. (1999). Between MDPs and semi-MDPs: A framework for temporal abstraction in reinforcement learning. Artificial Intelligence, 112, pp. 181-211. Sutton, R. S., & Tanner, B. (2005). Conference 17. Temporal-difference networks. To appear in Neural Information Processing Systems Tanner, B., Sutton, R. S. (2005) Temporal-difference networks with history. To appear in Proceedings of the Nineteenth International Joint Conference on Artificial Intelligence.
4 0.20240502 142 nips-2005-Oblivious Equilibrium: A Mean Field Approximation for Large-Scale Dynamic Games
Author: Gabriel Y. Weintraub, Lanier Benkard, Benjamin Van Roy
Abstract: We propose a mean-field approximation that dramatically reduces the computational complexity of solving stochastic dynamic games. We provide conditions that guarantee our method approximates an equilibrium as the number of agents grow. We then derive a performance bound to assess how well the approximation performs for any given number of agents. We apply our method to an important class of problems in applied microeconomics. We show with numerical experiments that we are able to greatly expand the set of economic problems that can be analyzed computationally. 1
5 0.19710892 145 nips-2005-On Local Rewards and Scaling Distributed Reinforcement Learning
Author: Drew Bagnell, Andrew Y. Ng
Abstract: We consider the scaling of the number of examples necessary to achieve good performance in distributed, cooperative, multi-agent reinforcement learning, as a function of the the number of agents n. We prove a worstcase lower bound showing that algorithms that rely solely on a global reward signal to learn policies confront a fundamental limit: They require a number of real-world examples that scales roughly linearly in the number of agents. For settings of interest with a very large number of agents, this is impractical. We demonstrate, however, that there is a class of algorithms that, by taking advantage of local reward signals in large distributed Markov Decision Processes, are able to ensure good performance with a number of samples that scales as O(log n). This makes them applicable even in settings with a very large number of agents n. 1
6 0.16754018 153 nips-2005-Policy-Gradient Methods for Planning
7 0.084155068 78 nips-2005-From Weighted Classification to Policy Search
8 0.077324942 144 nips-2005-Off-policy Learning with Options and Recognizers
9 0.074576698 91 nips-2005-How fast to work: Response vigor, motivation and tonic dopamine
10 0.072346836 148 nips-2005-Online Discovery and Learning of Predictive State Representations
11 0.066214554 143 nips-2005-Off-Road Obstacle Avoidance through End-to-End Learning
12 0.063317202 194 nips-2005-Top-Down Control of Visual Attention: A Rational Account
13 0.06316521 115 nips-2005-Learning Shared Latent Structure for Image Synthesis and Robotic Imitation
14 0.056719311 156 nips-2005-Prediction and Change Detection
15 0.054931082 111 nips-2005-Learning Influence among Interacting Markov Chains
16 0.05469666 136 nips-2005-Noise and the two-thirds power Law
17 0.051226042 53 nips-2005-Cyclic Equilibria in Markov Games
18 0.050169867 5 nips-2005-A Computational Model of Eye Movements during Object Class Detection
19 0.049822282 119 nips-2005-Learning to Control an Octopus Arm with Gaussian Process Temporal Difference Methods
20 0.049745064 34 nips-2005-Bayesian Surprise Attracts Human Attention
topicId topicWeight
[(0, 0.16), (1, -0.044), (2, 0.397), (3, -0.004), (4, -0.031), (5, -0.007), (6, 0.073), (7, 0.002), (8, -0.008), (9, -0.033), (10, -0.017), (11, -0.017), (12, 0.003), (13, 0.105), (14, 0.015), (15, -0.052), (16, 0.022), (17, 0.17), (18, -0.063), (19, 0.252), (20, 0.051), (21, -0.083), (22, 0.127), (23, 0.037), (24, -0.021), (25, -0.127), (26, 0.104), (27, -0.008), (28, 0.145), (29, -0.019), (30, 0.001), (31, 0.008), (32, -0.061), (33, -0.032), (34, -0.196), (35, 0.161), (36, 0.014), (37, 0.044), (38, -0.08), (39, 0.106), (40, 0.059), (41, -0.106), (42, 0.139), (43, 0.017), (44, -0.064), (45, 0.072), (46, 0.021), (47, -0.04), (48, -0.085), (49, 0.093)]
simIndex simValue paperId paperTitle
same-paper 1 0.97377479 36 nips-2005-Bayesian models of human action understanding
Author: Chris Baker, Rebecca Saxe, Joshua B. Tenenbaum
Abstract: We present a Bayesian framework for explaining how people reason about and predict the actions of an intentional agent, based on observing its behavior. Action-understanding is cast as a problem of inverting a probabilistic generative model, which assumes that agents tend to act rationally in order to achieve their goals given the constraints of their environment. Working in a simple sprite-world domain, we show how this model can be used to infer the goal of an agent and predict how the agent will act in novel situations or when environmental constraints change. The model provides a qualitative account of several kinds of inferences that preverbal infants have been shown to perform, and also fits quantitative predictions that adult observers make in a new experiment.
2 0.76090389 142 nips-2005-Oblivious Equilibrium: A Mean Field Approximation for Large-Scale Dynamic Games
Author: Gabriel Y. Weintraub, Lanier Benkard, Benjamin Van Roy
Abstract: We propose a mean-field approximation that dramatically reduces the computational complexity of solving stochastic dynamic games. We provide conditions that guarantee our method approximates an equilibrium as the number of agents grow. We then derive a performance bound to assess how well the approximation performs for any given number of agents. We apply our method to an important class of problems in applied microeconomics. We show with numerical experiments that we are able to greatly expand the set of economic problems that can be analyzed computationally. 1
3 0.7560668 87 nips-2005-Goal-Based Imitation as Probabilistic Inference over Graphical Models
Author: Deepak Verma, Rajesh P. Rao
Abstract: Humans are extremely adept at learning new skills by imitating the actions of others. A progression of imitative abilities has been observed in children, ranging from imitation of simple body movements to goalbased imitation based on inferring intent. In this paper, we show that the problem of goal-based imitation can be formulated as one of inferring goals and selecting actions using a learned probabilistic graphical model of the environment. We first describe algorithms for planning actions to achieve a goal state using probabilistic inference. We then describe how planning can be used to bootstrap the learning of goal-dependent policies by utilizing feedback from the environment. The resulting graphical model is then shown to be powerful enough to allow goal-based imitation. Using a simple maze navigation task, we illustrate how an agent can infer the goals of an observed teacher and imitate the teacher even when the goals are uncertain and the demonstration is incomplete.
4 0.72431833 187 nips-2005-Temporal Abstraction in Temporal-difference Networks
Author: Eddie Rafols, Anna Koop, Richard S. Sutton
Abstract: We present a generalization of temporal-difference networks to include temporally abstract options on the links of the question network. Temporal-difference (TD) networks have been proposed as a way of representing and learning a wide variety of predictions about the interaction between an agent and its environment. These predictions are compositional in that their targets are defined in terms of other predictions, and subjunctive in that that they are about what would happen if an action or sequence of actions were taken. In conventional TD networks, the inter-related predictions are at successive time steps and contingent on a single action; here we generalize them to accommodate extended time intervals and contingency on whole ways of behaving. Our generalization is based on the options framework for temporal abstraction. The primary contribution of this paper is to introduce a new algorithm for intra-option learning in TD networks with function approximation and eligibility traces. We present empirical examples of our algorithm’s effectiveness and of the greater representational expressiveness of temporallyabstract TD networks. The primary distinguishing feature of temporal-difference (TD) networks (Sutton & Tanner, 2005) is that they permit a general compositional specification of the goals of learning. The goals of learning are thought of as predictive questions being asked by the agent in the learning problem, such as “What will I see if I step forward and look right?” or “If I open the fridge, will I see a bottle of beer?” Seeing a bottle of beer is of course a complicated perceptual act. It might be thought of as obtaining a set of predictions about what would happen if certain reaching and grasping actions were taken, about what would happen if the bottle were opened and turned upside down, and of what the bottle would look like if viewed from various angles. To predict seeing a bottle of beer is thus to make a prediction about a set of other predictions. The target for the overall prediction is a composition in the mathematical sense of the first prediction with each of the other predictions. TD networks are the first framework for representing the goals of predictive learning in a compositional, machine-accessible form. Each node of a TD network represents an individual question—something to be predicted—and has associated with it a value representing an answer to the question—a prediction of that something. The questions are represented by a set of directed links between nodes. If node 1 is linked to node 2, then node 1 rep- resents a question incorporating node 2’s question; its value is a prediction about node 2’s prediction. Higher-level predictions can be composed in several ways from lower ones, producing a powerful, structured representation language for the targets of learning. The compositional structure is not just in a human designer’s head; it is expressed in the links and thus is accessible to the agent and its learning algorithm. The network of these links is referred to as the question network. An entirely separate set of directed links between the nodes is used to compute the values (predictions, answers) associated with each node. These links collectively are referred to as the answer network. The computation in the answer network is compositional in a conventional way—node values are computed from other node values. The essential insight of TD networks is that the notion of compositionality should apply to questions as well as to answers. A secondary distinguishing feature of TD networks is that the predictions (node values) at each moment in time can be used as a representation of the state of the world at that time. In this way they are an instance of the idea of predictive state representations (PSRs) introduced by Littman, Sutton and Singh (2002), Jaeger (2000), and Rivest and Schapire (1987). Representing a state by its predictions is a potentially powerful strategy for state abstraction (Rafols et al., 2005). We note that the questions used in all previous work with PSRs are defined in terms of concrete actions and observations, not other predictions. They are not compositional in the sense that TD-network questions are. The questions we have discussed so far are subjunctive, meaning that they are conditional on a certain way of behaving. We predict what we would see if we were to step forward and look right, or if we were to open the fridge. The questions in conventional TD networks are subjunctive, but they are conditional only on primitive actions or open-loop sequences of primitive actions (as are conventional PSRs). It is natural to generalize this, as we have in the informal examples above, to questions that are conditional on closed-loop temporally extended ways of behaving. For example, opening the fridge is a complex, high-level action. The arm must be lifted to the door, the hand shaped for grasping the handle, etc. To ask questions like “if I were to go to the coffee room, would I see John?” would require substantial temporal abstraction in addition to state abstraction. The options framework (Sutton, Precup & Singh, 1999) is a straightforward way of talking about temporally extended ways of behaving and about predictions of their outcomes. In this paper we extend the options framework so that it can be applied to TD networks. Significant extensions of the original options framework are needed. Novel features of our option-extended TD networks are that they 1) predict components of option outcomes rather than full outcome probability distributions, 2) learn according to the first intra-option method to use eligibility traces (see Sutton & Barto, 1998), and 3) include the possibility of options whose ‘policies’ are indifferent to which of several actions are selected. 1 The options framework In this section we present the essential elements of the options framework (Sutton, Precup & Singh, 1999) that we will need for our extension of TD networks. In this framework, an agent and an environment interact at discrete time steps t = 1, 2, 3.... In each state st ∈ S, the agent selects an action at ∈ A, determining the next state st+1 .1 An action is a way of behaving for one time step; the options framework lets us talk about temporally extended ways of behaving. An individual option consists of three parts. The first is the initiation set, I ⊂ S, the subset of states in which the option can be started. The second component of an option is its policy, π : S × A ⇒ [0, 1], specifying how the agent behaves when 1 Although the options framework includes rewards, we omit them here because we are concerned only with prediction, not control. following the option. Finally, a termination function, β : S × A ⇒ [0, 1], specifies how the option ends: β(s) denotes the probability of terminating when in state s. The option is thus completely and formally defined by the 3-tuple (I, π, β). 2 Conventional TD networks In this section we briefly present the details of the structure and the learning algorithm comprising TD networks as introduced by Sutton and Tanner (2005). TD networks address a prediction problem in which the agent may not have direct access to the state of the environment. Instead, at each time step the agent receives an observation ot ∈ O dependent on the state. The experience stream thus consists of a sequence of alternating actions and observations, o1 , a1 , o2 , a2 , o3 · · ·. The TD network consists of a set of nodes, each representing a single scalar prediction, interlinked by the question and answer networks as suggested previously. For a network 1 n of n nodes, the vector of all predictions at time step t is denoted yt = (yt , . . . , yt )T . The predictions are estimates of the expected value of some scalar quantity, typically of a bit, in which case they can be interpreted as estimates of probabilities. The predictions are updated at each time step according to a vector-valued function u with modifiable parameter W, which is often taken to be of a linear form: yt = u(yt−1 , at−1 , ot , Wt ) = σ(Wt xt ), (1) where xt ∈ m is an m-vector of features created from (yt−1 , at−1 , ot ), Wt is an n × m matrix (whose elements are sometimes referred to as weights), and σ is the n-vector 1 form of either the identity function or the S-shaped logistic function σ(s) = 1+e−s . The feature vector is an arbitrary vector-valued function of yt−1 , at−1 , and ot . For example, in the simplest case the feature vector is a unit basis vector with the location of the one communicating the current state. In a partially observable environment, the feature vector may be a combination of the agent’s action, observations, and predictions from the previous time step. The overall update u defines the answer network. The question network consists of a set of target functions, z i : O × n → , and condition i y functions, ci : A× n → [0, 1]n . We define zt = z i (ot+1 , ˜t+1 ) as the target for prediction i 2 i i yt . Similarly, we define ct = c (at , yt ) as the condition at time t. The learning algorithm ij for each component wt of Wt can then be written ij ij i i wt+1 = wt + α zt − yt ci t i ∂yt , (2) ij ∂wt where α is a positive step-size parameter. Note that the targets here are functions of the observation and predictions exactly one time step later, and that the conditions are functions of a single primitive action. This is what makes this algorithm suitable only for learning about one-step TD relationships. By chaining together multiple nodes, Sutton and Tanner (2005) used it to predict k steps ahead, for various particular values of k, and to predict the outcome of specific action sequences (as in PSRs, e.g., Littman et al., 2002; Singh et al., 2004). Now we consider the extension to temporally abstract actions. 3 Option-extended TD networks In this section we present our intra-option learning algorithm for TD networks with options and eligibility traces. As suggested earlier, each node’s outgoing link in the question 2 The quantity ˜ is almost the same as y, and we encourage the reader to think of them as identical y here. The difference is that ˜ is calculated by weights that are one step out of date as compared to y, y i.e., ˜t = u(yt−1 , at−1 , ot , Wt−1 ) (cf. equation 1). y network will now correspond to an option applying over possibly many steps. The policy of the ith node’s option corresponds to the condition function ci , which we think of as a recognizer for the option. It inspects each action taken to assess whether the option is being followed: ci = 1 if the agent is acting consistently with the option policy and ci = 0 othert t wise (intermediate values are also possible). When an agent ceases to act consistently with the option policy, we say that the option has diverged. The possibility of recognizing more than one action as consistent with the option is a significant generalization of the original idea of options. If no actions are recognized as acceptable in a state, then the option cannot be followed and thus cannot be initiated. Here we take the set of states with at least one recognized action to be the initiation set of the option. The option-termination function β generalizes naturally to TD networks. Each node i is i given a corresponding termination function, β i : O× n → [0, 1], where βt = β i (ot+1 , yt ) i is the probability of terminating at time t.3 βt = 1 indicates that the option has terminated i at time t; βt = 0 indicates that it has not, and intermediate values of β correspond to soft i or stochastic termination conditions. If an option terminates, then zt acts as the target, but if the option is ongoing without termination, then the node’s own next value, yt+1 , should ˜i be the target. The termination function specifies which of the two targets (or mixture of the two targets) is used to produce a form of TD error for each node i: i i i i i i δt = βt zt + (1 − βt )˜t+1 − yt . y (3) Our option-extended algorithm incorporates eligibility traces (see Sutton & Barto, 1998) as short-term memory variables organized in an n × m matrix E, paralleling the weight matrix. The traces are a record of the effect that each weight could have had on each node’s prediction during the time the agent has been acting consistently with the node’s option. The components eij of the eligibility matrix are updated by i eij = ci λeij (1 − βt ) + t t t−1 i ∂yt ij ∂wt , (4) where 0 ≤ λ ≤ 1 is the trace-decay parameter familiar from the TD(λ) learning algorithm. Because of the ci factor, all of a node’s traces will be immediately reset to zero whenever t the agent deviates from the node’s option’s policy. If the agent follows the policy and the option does not terminate, then the trace decays by λ and increments by the gradient in the way typical of eligibility traces. If the policy is followed and the option does terminate, then the trace will be reset to zero on the immediately following time step, and a new trace will start building. Finally, our algorithm updates the weights on each time step by ij ij i wt+1 = wt + α δt eij . t 4 (5) Fully observable experiment This experiment was designed to test the correctness of the algorithm in a simple gridworld where the environmental state is observable. We applied an options-extended TD network to the problem of learning to predict observations from interaction with the gridworld environment shown on the left in Figure 1. Empty squares indicate spaces where the agent can move freely, and colored squares (shown shaded in the figure) indicate walls. The agent is egocentric. At each time step the agent receives from the environment six bits representing the color it is facing (red, green, blue, orange, yellow, or white). In this first experiment we also provided 6 × 6 × 4 = 144 other bits directly indicating the complete state of the environment (square and orientation). 3 The fact that the option depends only on the current predictions, action, and observation means that we are considering only Markov options. Figure 1: The test world (left) and the question network (right) used in the experiments. The triangle in the world indicates the location and orientation of the agent. The walls are labeled R, O, Y, G, and B representing the colors red, orange, yellow, green and blue. Note that the left wall is mostly blue but partly green. The right diagram shows in full the portion of the question network corresponding to the red bit. This structure is repeated, but not shown, for the other four (non-white) colors. L, R, and F are primitive actions, and Forward and Wander are options. There are three possible actions: A ={F, R, L}. Actions were selected according to a fixed stochastic policy independent of the state. The probability of the F, L, and R actions were 0.5, 0.25, and 0.25 respectively. L and R cause the agent to rotate 90 degrees to the left or right. F causes the agent to move ahead one square with probability 1 − p and to stay in the same square with probability p. The probability p is called the slipping probability. If the forward movement would cause the agent to move into a wall, then the agent does not move. In this experiment, we used p = 0, p = 0.1, and p = 0.5. In addition to these primitive actions, we provided two temporally abstract options, Forward and Wander. The Forward option takes the action F in every state and terminates when the agent senses a wall (color) in front of it. The policy of the Wander option is the same as that actually followed by the agent. Wander terminates with probability 1 when a wall is sensed, and spontaneously with probability 0.5 otherwise. We used the question network shown on the right in Figure 1. The predictions of nodes 1, 2, and 3 are estimates of the probability that the red bit would be observed if the corresponding primitive action were taken. Node 4 is a prediction of whether the agent will see the red bit upon termination of the Wander option if it were taken. Node 5 predicts the probability of observing the red bit given that the Forward option is followed until termination. Nodes 6 and 7 represent predictions of the outcome of a primitive action followed by the Forward option. Nodes 8 and 9 take this one step further: they represent predictions of the red bit if the Forward option were followed to termination, then a primitive action were taken, and then the Forward option were followed again to termination. We applied our algorithm to learn the parameter W of the answer network for this question network. The step-size parameter α was 1.0, and the trace-decay parameter λ was 0.9. The initial W0 , E0 , and y0 were all 0. Each run began with the agent in the state indicated in Figure 1 (left). In this experiment σ(·) was the identity function. For each value of p, we ran 50 runs of 20,000 time steps. On each time step, the root-meansquared (RMS) error in each node’s prediction was computed and then averaged over all the nodes. The nodes corresponding to the Wander option were not included in the average because of the difficulty of calculating their correct predictions. This average was then 0.4 Fully Observable 0.4 RMS Error RMS Error p=0 0 0 Partially Observable p = 0.1 5000 p = 0.5 10000 15000 20000 Steps 0 0 100000 200000 Steps 300000 Figure 2: Learning curves in the fully-observable experiment for each slippage probability (left) and in the partially-observable experiment (right). itself averaged over the 50 runs and bins of 1,000 time steps to produce the learning curves shown on the left in Figure 2. For all slippage probabilities, the error in all predictions fell almost to zero. After approximately 12,000 trials, the agent made almost perfect predictions in all cases. Not surprisingly, learning was slower at the higher slippage probabilities. These results show that our augmented TD network is able to make a complete temporally-abstract model of this world. 5 Partially observable experiment In our second experiment, only the six color observation bits were available to the agent. This experiment provides a more challenging test of our algorithm. To model the environment well, the TD network must construct a representation of state from very sparse information. In fact, completely accurate prediction is not possible in this problem with our question network. In this experiment the input vector consisted of three groups of 46 components each, 138 in total. If the action was R, the first 46 components were set to the 40 node values and the six observation bits, and the other components were 0. If the action was L, the next group of 46 components was filled in in the same way, and the first and third groups were zero. If the action was F, the third group was filled. This technique enables the answer network as function approximator to represent a wider class of functions in a linear form than would otherwise be possible. In this experiment, σ(·) was the S-shaped logistic function. The slippage probability was p = 0.1. As our performance measure we used the RMS error, as in the first experiment, except that the predictions for the primitive actions (nodes 1-3) were not included. These predictions can never become completely accurate because the agent can’t tell in detail where it is located in the open space. As before, we averaged RMS error over 50 runs and 1,000 time step bins, to produce the learning curve shown on the right in Figure 2. As before, the RMS error approached zero. Node 5 in Figure 1 holds the prediction of red if the agent were to march forward to the wall ahead of it. Corresponding nodes in the other subnetworks hold the predictions of the other colors upon Forward. To make these predictions accurately, the agent must keep track of which wall it is facing, even if it is many steps away from it. It has to learn a sort of compass that it can keep updated as it turns in the middle of the space. Figure 3 is a demonstration of the compass learned after a representative run of 200,000 time steps. At the end of the run, the agent was driven manually to the state shown in the first row (relative time index t = 1). On steps 1-25 the agent was spun clockwise in place. The third column shows the prediction for node 5 in each portion of the question network. That is, the predictions shown are for each color-observation bit at termination of the Forward option. At t = 1, the agent is facing the orange wall and it predicts that the Forward option would result in seeing the orange bit and none other. Over steps 2-5 we see that the predictions are maintained accurately as the agent spins despite the fact that its observation bits remain the same. Even after spinning for 25 steps the agent knows exactly which way it is facing. While spinning, the agent correctly never predicts seeing the green bit (after Forward), but if it is driven up and turned, as in the last row of the figure, the green bit is accurately predicted. The fourth column shows the prediction for node 8 in each portion of the question network. Recall that these nodes correspond to the sequence Forward, L, Forward. At time t = 1, the agent accurately predicts that Forward will bring it to orange (third column) and also predicts that Forward, L, Forward will bring it to green. The predictions made for node 8 at each subsequent step of the sequence are also correct. These results show that the agent is able to accurately maintain its long term predictions without directly encountering sensory verification. How much larger would the TD network have to be to handle a 100x100 gridworld? The answer is not at all. The same question network applies to any size problem. If the layout of the colored walls remain the same, then even the answer network transfers across worlds of widely varying sizes. In other experiments, training on successively larger problems, we have shown that the same TD network as used here can learn to make all the long-term predictions correctly on a 100x100 version of the 6x6 gridworld used here. t y5 t st y8 t 1 1 O Y R BG O Y R BG O Y R BG O Y R BG O Y R BG O Y R BG O Y R BG O Y R BG O Y R BG O Y R BG O Y R BG O Y R BG O Y R BG O Y R BG 2 3 4 5 25 29 Figure 3: An illustration of part of what the agent learns in the partially observable environment. The second column is a sequence of states with (relative) time index as given by the first column. The sequence was generated by controlling the agent manually. On steps 1-25 the agent was spun clockwise in place, and the trajectory after that is shown by the line in the last state diagram. The third and fourth columns show the values of the nodes corresponding to 5 and 8 in Figure 1, one for each color-observation bit. 6 Conclusion Our experiments show that option-extended TD networks can learn effectively. They can learn facts about their environments that are not representable in conventional TD networks or in any other method for learning models of the world. One concern is that our intra-option learning algorithm is an off-policy learning method incorporating function approximation and bootstrapping (learning from predictions). The combination of these three is known to produce convergence problems for some methods (see Sutton & Barto, 1998), and they may arise here. A sound solution may require modifications to incorporate importance sampling (see Precup, Sutton & Dasgupta, 2001). In this paper we have considered only intra-option eligibility traces—traces extending over the time span within an option but not persisting across options. Tanner and Sutton (2005) have proposed a method for inter-option traces that could perhaps be combined with our intra-option traces. The primary contribution of this paper is the introduction of a new learning algorithm for TD networks that incorporates options and eligibility traces. Our experiments are small and do little more than exercise the learning algorithm, showing that it does not break immediately. More significant is the greater representational power of option-extended TD networks. Options are a general framework for temporal abstraction, predictive state representations are a promising strategy for state abstraction, and TD networks are able to represent compositional questions. The combination of these three is potentially very powerful and worthy of further study. Acknowledgments The authors gratefully acknowledge the ideas and encouragement they have received in this work from Mark Ring, Brian Tanner, Satinder Singh, Doina Precup, and all the members of the rlai.net group. References Jaeger, H. (2000). Observable operator models for discrete stochastic time series. Neural Computation, 12(6):1371-1398. MIT Press. Littman, M., Sutton, R. S., & Singh, S. (2002). Predictive representations of state. In T. G. Dietterich, S. Becker and Z. Ghahramani (eds.), Advances In Neural Information Processing Systems 14, pp. 1555-1561. MIT Press. Precup, D., Sutton, R. S., & Dasgupta, S. (2001). Off-policy temporal-difference learning with function approximation. In C. E. Brodley, A. P. Danyluk (eds.), Proceedings of the Eighteenth International Conference on Machine Learning, pp. 417-424. San Francisco, CA: Morgan Kaufmann. Rafols, E. J., Ring, M., Sutton, R.S., & Tanner, B. (2005). Using predictive representations to improve generalization in reinforcement learning. To appear in Proceedings of the Nineteenth International Joint Conference on Artificial Intelligence. Rivest, R. L., & Schapire, R. E. (1987). Diversity-based inference of finite automata. In Proceedings of the Twenty Eighth Annual Symposium on Foundations of Computer Science, (pp. 78–87). IEEE Computer Society. Singh, S., James, M. R., & Rudary, M. R. (2004). Predictive state representations: A new theory for modeling dynamical systems. In Uncertainty in Artificial Intelligence: Proceedings of the Twentieth Conference in Uncertainty in Artificial Intelligence, (pp. 512–519). AUAI Press. Sutton, R. S., & Barto, A. G. (1998). Reinforcement learning: An introduction. Cambridge, MA: MIT Press. Sutton, R. S., Precup, D., Singh, S. (1999). Between MDPs and semi-MDPs: A framework for temporal abstraction in reinforcement learning. Artificial Intelligence, 112, pp. 181-211. Sutton, R. S., & Tanner, B. (2005). Conference 17. Temporal-difference networks. To appear in Neural Information Processing Systems Tanner, B., Sutton, R. S. (2005) Temporal-difference networks with history. To appear in Proceedings of the Nineteenth International Joint Conference on Artificial Intelligence.
5 0.51381409 153 nips-2005-Policy-Gradient Methods for Planning
Author: Douglas Aberdeen
Abstract: Probabilistic temporal planning attempts to find good policies for acting in domains with concurrent durative tasks, multiple uncertain outcomes, and limited resources. These domains are typically modelled as Markov decision problems and solved using dynamic programming methods. This paper demonstrates the application of reinforcement learning — in the form of a policy-gradient method — to these domains. Our emphasis is large domains that are infeasible for dynamic programming. Our approach is to construct simple policies, or agents, for each planning task. The result is a general probabilistic temporal planner, named the Factored Policy-Gradient Planner (FPG-Planner), which can handle hundreds of tasks, optimising for probability of success, duration, and resource use. 1
6 0.47089487 145 nips-2005-On Local Rewards and Scaling Distributed Reinforcement Learning
7 0.31642583 6 nips-2005-A Connectionist Model for Constructive Modal Reasoning
8 0.3151679 144 nips-2005-Off-policy Learning with Options and Recognizers
9 0.30547732 119 nips-2005-Learning to Control an Octopus Arm with Gaussian Process Temporal Difference Methods
10 0.27647513 148 nips-2005-Online Discovery and Learning of Predictive State Representations
11 0.26537192 68 nips-2005-Factorial Switching Kalman Filters for Condition Monitoring in Neonatal Intensive Care
12 0.2643342 156 nips-2005-Prediction and Change Detection
13 0.24733672 122 nips-2005-Logic and MRF Circuitry for Labeling Occluding and Thinline Visual Contours
14 0.23958395 143 nips-2005-Off-Road Obstacle Avoidance through End-to-End Learning
15 0.23908953 35 nips-2005-Bayesian model learning in human visual perception
16 0.23295856 91 nips-2005-How fast to work: Response vigor, motivation and tonic dopamine
17 0.22876428 130 nips-2005-Modeling Neuronal Interactivity using Dynamic Bayesian Networks
18 0.22026473 128 nips-2005-Modeling Memory Transfer and Saving in Cerebellar Motor Learning
19 0.21815398 115 nips-2005-Learning Shared Latent Structure for Image Synthesis and Robotic Imitation
20 0.20673852 194 nips-2005-Top-Down Control of Visual Attention: A Rational Account
topicId topicWeight
[(3, 0.029), (10, 0.05), (11, 0.019), (27, 0.083), (31, 0.092), (34, 0.069), (39, 0.021), (55, 0.028), (69, 0.063), (73, 0.029), (88, 0.077), (91, 0.025), (92, 0.298)]
simIndex simValue paperId paperTitle
same-paper 1 0.82208323 36 nips-2005-Bayesian models of human action understanding
Author: Chris Baker, Rebecca Saxe, Joshua B. Tenenbaum
Abstract: We present a Bayesian framework for explaining how people reason about and predict the actions of an intentional agent, based on observing its behavior. Action-understanding is cast as a problem of inverting a probabilistic generative model, which assumes that agents tend to act rationally in order to achieve their goals given the constraints of their environment. Working in a simple sprite-world domain, we show how this model can be used to infer the goal of an agent and predict how the agent will act in novel situations or when environmental constraints change. The model provides a qualitative account of several kinds of inferences that preverbal infants have been shown to perform, and also fits quantitative predictions that adult observers make in a new experiment.
2 0.77373099 150 nips-2005-Optimizing spatio-temporal filters for improving Brain-Computer Interfacing
Author: Guido Dornhege, Benjamin Blankertz, Matthias Krauledat, Florian Losch, Gabriel Curio, Klaus-Robert Müller
Abstract: Brain-Computer Interface (BCI) systems create a novel communication channel from the brain to an output device by bypassing conventional motor output pathways of nerves and muscles. Therefore they could provide a new communication and control option for paralyzed patients. Modern BCI technology is essentially based on techniques for the classification of single-trial brain signals. Here we present a novel technique that allows the simultaneous optimization of a spatial and a spectral filter enhancing discriminability of multi-channel EEG single-trials. The evaluation of 60 experiments involving 22 different subjects demonstrates the superiority of the proposed algorithm. Apart from the enhanced classification, the spatial and/or the spectral filter that are determined by the algorithm can also be used for further analysis of the data, e.g., for source localization of the respective brain rhythms.
3 0.70664138 133 nips-2005-Nested sampling for Potts models
Author: Iain Murray, David MacKay, Zoubin Ghahramani, John Skilling
Abstract: Nested sampling is a new Monte Carlo method by Skilling [1] intended for general Bayesian computation. Nested sampling provides a robust alternative to annealing-based methods for computing normalizing constants. It can also generate estimates of other quantities such as posterior expectations. The key technical requirement is an ability to draw samples uniformly from the prior subject to a constraint on the likelihood. We provide a demonstration with the Potts model, an undirected graphical model. 1
4 0.66080004 94 nips-2005-Identifying Distributed Object Representations in Human Extrastriate Visual Cortex
Author: Rory Sayres, David Ress, Kalanit Grill-spector
Abstract: The category of visual stimuli has been reliably decoded from patterns of neural activity in extrastriate visual cortex [1]. It has yet to be seen whether object identity can be inferred from this activity. We present fMRI data measuring responses in human extrastriate cortex to a set of 12 distinct object images. We use a simple winner-take-all classifier, using half the data from each recording session as a training set, to evaluate encoding of object identity across fMRI voxels. Since this approach is sensitive to the inclusion of noisy voxels, we describe two methods for identifying subsets of voxels in the data which optimally distinguish object identity. One method characterizes the reliability of each voxel within subsets of the data, while another estimates the mutual information of each voxel with the stimulus set. We find that both metrics can identify subsets of the data which reliably encode object identity, even when noisy measurements are artificially added to the data. The mutual information metric is less efficient at this task, likely due to constraints in fMRI data. 1
5 0.57457197 152 nips-2005-Phase Synchrony Rate for the Recognition of Motor Imagery in Brain-Computer Interface
Author: Le Song, Evian Gordon, Elly Gysels
Abstract: Motor imagery attenuates EEG µ and β rhythms over sensorimotor cortices. These amplitude changes are most successfully captured by the method of Common Spatial Patterns (CSP) and widely used in braincomputer interfaces (BCI). BCI methods based on amplitude information, however, have not incoporated the rich phase dynamics in the EEG rhythm. This study reports on a BCI method based on phase synchrony rate (SR). SR, computed from binarized phase locking value, describes the number of discrete synchronization events within a window. Statistical nonparametric tests show that SRs contain significant differences between 2 types of motor imageries. Classifiers trained on SRs consistently demonstrate satisfactory results for all 5 subjects. It is further observed that, for 3 subjects, phase is more discriminative than amplitude in the first 1.5-2.0 s, which suggests that phase has the potential to boost the information transfer rate in BCIs. 1
6 0.49140415 153 nips-2005-Policy-Gradient Methods for Planning
7 0.48883212 72 nips-2005-Fast Online Policy Gradient Learning with SMD Gain Vector Adaptation
8 0.47834784 78 nips-2005-From Weighted Classification to Policy Search
9 0.47230309 144 nips-2005-Off-policy Learning with Options and Recognizers
10 0.46850023 45 nips-2005-Conditional Visual Tracking in Kernel Space
11 0.46371615 200 nips-2005-Variable KD-Tree Algorithms for Spatial Pattern Search and Discovery
12 0.45956656 87 nips-2005-Goal-Based Imitation as Probabilistic Inference over Graphical Models
13 0.4595502 155 nips-2005-Predicting EMG Data from M1 Neurons with Variational Bayesian Least Squares
14 0.45894653 109 nips-2005-Learning Cue-Invariant Visual Responses
15 0.45871085 154 nips-2005-Preconditioner Approximations for Probabilistic Graphical Models
16 0.45862198 119 nips-2005-Learning to Control an Octopus Arm with Gaussian Process Temporal Difference Methods
17 0.45668662 100 nips-2005-Interpolating between types and tokens by estimating power-law generators
18 0.45664349 142 nips-2005-Oblivious Equilibrium: A Mean Field Approximation for Large-Scale Dynamic Games
19 0.45479318 67 nips-2005-Extracting Dynamical Structure Embedded in Neural Activity
20 0.45427817 48 nips-2005-Context as Filtering